Data Preprocessing
🧹 Text Cleaning & Preprocessing
To prepare the news article content for machine learning, we applied a series of preprocessing steps to the `text` column so that the model learns from clean, consistent, and meaningful text.
🪛 Steps Applied:
- Lowercasing: converts all characters to lowercase to avoid treating “News” and “news” as different words.
- Removing Punctuation & Special Characters: strips out characters like `!`, `.`, `?`, and other non-alphabetic symbols to reduce noise.
- Tokenization: splits sentences into individual words (tokens) using `nltk.word_tokenize`.
- Stopword Removal: eliminates common, uninformative words such as “the”, “is”, “and”, using NLTK’s stopword list.
- Lemmatization: reduces words to their base form (e.g., “running” → “run”) using `WordNetLemmatizer` to normalize similar words.
- Whitespace Cleanup: joins cleaned tokens back into a single string with extra spaces removed.
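The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `preprocess` and `nltk_components` are hypothetical, the regular expression is one possible reading of "non-alphabetic symbols", and the NLTK resource names to download are an assumption that varies between NLTK versions. The tokenizer, stopword list, and lemmatizer are passed in as arguments so that the core function also runs without NLTK installed.

```python
import re
from typing import Callable, Iterable, List


def preprocess(
    text: str,
    tokenize: Callable[[str], List[str]] = str.split,
    stop_words: Iterable[str] = (),
    lemmatize: Callable[[str], str] = lambda token: token,
) -> str:
    """Apply the six cleaning steps to one document (hypothetical helper)."""
    stop_set = set(stop_words)
    text = text.lower()                                  # 1. lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. drop non-alphabetic characters
    tokens = tokenize(text)                              # 3. tokenization
    tokens = [t for t in tokens if t not in stop_set]    # 4. stopword removal
    tokens = [lemmatize(t) for t in tokens]              # 5. lemmatization
    return " ".join(tokens)                              # 6. single-space join


def nltk_components():
    """Build (tokenize, stop_words, lemmatize) from NLTK.

    Assumption: which resources must be downloaded depends on the
    installed NLTK version; unknown names are simply skipped.
    """
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
        nltk.download(resource, quiet=True)

    lemmatizer = WordNetLemmatizer()
    return word_tokenize, stopwords.words("english"), lemmatizer.lemmatize
```

With NLTK available, the pieces would be combined as `preprocess(article, *nltk_components())`. One caveat: `WordNetLemmatizer.lemmatize` treats words as nouns by default, so mapping “running” to “run” requires calling it with `pos="v"`; whether the original pipeline did this is not stated.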
📄 Output
- A new column called `clean_text` was created to store the processed text.
- The final dataset (`cleaned_news.csv`) was used for model training and evaluation.
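As a rough sketch of how the `clean_text` column could be produced, the example below reads rows from a CSV, cleans the `text` field, and writes the result with the extra column. It uses only Python's standard `csv` module, which is an assumption made for portability; the project may well have used a different tool. The function name `write_cleaned_csv` and the input file name `news.csv` in the usage note are hypothetical.

```python
import csv
from typing import Callable, TextIO


def write_cleaned_csv(source: TextIO, target: TextIO, clean: Callable[[str], str]) -> int:
    """Copy rows from source to target, adding a clean_text column.

    Returns the number of data rows written. Assumes the input has a
    column named "text".
    """
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames or [])
    if "clean_text" not in fieldnames:
        fieldnames.append("clean_text")

    writer = csv.DictWriter(target, fieldnames=fieldnames)
    writer.writeheader()

    count = 0
    for row in reader:
        row["clean_text"] = clean(row["text"])  # store the processed text
        writer.writerow(row)
        count += 1
    return count
```

A typical call would open the raw file (here assumed to be `news.csv`) and `cleaned_news.csv` with `newline=""` and pass whatever cleaning function implements the steps listed earlier as `clean`.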
Together, these steps remove noise and standardize the input so that the model can focus on the most informative words in each article.