Data Preprocessing

🧹 Text Cleaning & Preprocessing

To prepare the news article content for machine learning, we performed a series of text preprocessing steps on the text column. This ensures that the model learns from clean, consistent, and meaningful text.

🪛 Steps Applied:

Lowercasing
Converts all characters to lowercase to avoid treating “News” and “news” as different words.
Removing Punctuation & Special Characters
Strips out punctuation such as !, ., and ?, along with digits and any other non-alphabetic characters, to reduce noise.
Tokenization
Splits sentences into individual words (tokens) using nltk.word_tokenize.
Stopword Removal
Eliminates common, uninformative words such as “the”, “is”, “and”, using NLTK’s stopword list.
Lemmatization
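Reduces words to their base form using WordNetLemmatizer so that inflected variants are counted as one word (e.g., “articles” → “article”). Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed, so “running” → “run” only happens when the token is lemmatized as a verb (pos="v").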
Whitespace Cleanup
Joins cleaned tokens back into a single string with extra spaces removed.
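
The six steps above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the names clean_text, FALLBACK_STOPWORDS, and nltk_components are hypothetical, the regular expression is an assumed reading of "non-alphabetic symbols", and the defaults are small stand-ins so the sketch runs without NLTK data installed.

```python
import re
from typing import Callable, Iterable, List

# Tiny stand-in stopword set (an assumption for illustration only);
# the steps above use NLTK's full English stopword list instead.
FALLBACK_STOPWORDS = frozenset({"the", "is", "and", "a", "an", "of", "to", "in"})


def clean_text(
    text: str,
    tokenize: Callable[[str], List[str]] = str.split,
    stop_words: Iterable[str] = FALLBACK_STOPWORDS,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> str:
    """Apply the six preprocessing steps in order and return one string."""
    text = text.lower()                                   # 1. lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                 # 2. drop non-alphabetic characters
    tokens = tokenize(text)                               # 3. tokenization
    stop_set = set(stop_words)
    tokens = [t for t in tokens if t not in stop_set]     # 4. stopword removal
    tokens = [lemmatize(t) for t in tokens]               # 5. lemmatization
    return " ".join(tokens)                               # 6. whitespace cleanup


def nltk_components():
    """Return (tokenize, stop_words, lemmatize) backed by NLTK.

    Requires nltk plus its tokenizer, stopwords, and wordnet data packages.
    The import is deferred so clean_text stays usable without NLTK.
    """
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return word_tokenize, stopwords.words("english"), lemmatizer.lemmatize


print(clean_text("The News is GREAT, and the markets rallied!"))
# prints: news great markets rallied
```

To use the NLTK pieces named in the steps, unpack them and pass them in: `tokenize, stop_words, lemmatize = nltk_components()` followed by `clean_text(article, tokenize, stop_words, lemmatize)`. Passing the components as arguments keeps each step easy to swap or test on its own.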

📄 Output

A new column called clean_text was created to store the processed text.
The final dataset (cleaned_news.csv) was used for model training and evaluation.
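
As a rough sketch of how the clean_text column could be added and the result written out, the snippet below uses Python's built-in csv module. The source does not say which tool produced cleaned_news.csv, so this is one possible approach under stated assumptions: add_clean_column is a hypothetical helper, the input file name news.csv in the usage note is a placeholder, and `clean` is any function that maps raw text to cleaned text.

```python
import csv
import io
from typing import Callable, TextIO


def add_clean_column(
    src: TextIO,
    dst: TextIO,
    clean: Callable[[str], str],
    text_field: str = "text",
    clean_field: str = "clean_text",
) -> int:
    """Copy CSV rows from src to dst, adding a cleaned copy of text_field.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [clean_field]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        row[clean_field] = clean(row[text_field])  # original text column is kept
        writer.writerow(row)
        count += 1
    return count


# In-memory demonstration; str.lower stands in for the real cleaning function.
source = io.StringIO("title,text\nA,Hello World\n")
target = io.StringIO()
rows_written = add_clean_column(source, target, str.lower)
print(rows_written, target.getvalue().splitlines())
```

With real files, the same helper would be called with `open("news.csv", newline="", encoding="utf-8")` as the source and `open("cleaned_news.csv", "w", newline="", encoding="utf-8")` as the destination, where news.csv is a placeholder for the raw dataset.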

These preprocessing steps remove noise and standardize the input so that the model can focus on the content-bearing words in each article, which is intended to improve model performance.