🔠 Text Vectorization
After cleaning and preprocessing the news articles, the next step was to convert the text data into a numerical format that machine learning models can understand.
We used TF-IDF vectorization (Term Frequency–Inverse Document Frequency) to represent each article as a vector of weighted word features.
📊 Why TF-IDF?
TF-IDF helps:
- Emphasize words that are important within a document but less frequent across all documents
- Reduce the impact of common words that appear in many articles
- Create a sparse, high-dimensional feature space suitable for models like Logistic Regression and Naive Bayes
🛠️ How It Works:
- TF (Term Frequency): measures how often a word appears in a single document.
- IDF (Inverse Document Frequency): measures how rare a word is across all documents in the dataset.
- TF × IDF = TF-IDF: the final score reflects both the importance and the uniqueness of each word in a document.
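For reference, a common textbook formulation (scikit-learn's `TfidfVectorizer` applies a smoothed, normalized variant, but the intuition is the same) scores a term $t$ in a document $d$ as:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),
\qquad
\text{idf}(t) = \log\frac{N}{\text{df}(t)}
$$

where $N$ is the total number of documents and $\text{df}(t)$ is the number of documents containing $t$.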
⚙️ Implementation
- We used `TfidfVectorizer` from scikit-learn
- Parameters:
  - `stop_words='english'`: ignores common English stopwords
  - `max_df=0.7`: ignores words that appear in more than 70% of documents
- The output is a sparse matrix used as the input feature matrix `X` for model training (see the sketch below).
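A minimal sketch of this step, assuming scikit-learn is installed; the sample texts and the `articles` variable are illustrative placeholders, not the project's actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder for the cleaned article texts produced by the
# preprocessing step (illustrative data only).
articles = [
    "breaking news about the national economy",
    "local team wins the championship game",
    "economy shows early signs of recovery",
]

vectorizer = TfidfVectorizer(
    stop_words='english',  # ignore common English stopwords
    max_df=0.7,            # drop words appearing in >70% of documents
)

# X is a sparse matrix of shape (n_documents, n_features),
# used as the feature input for model training.
X = vectorizer.fit_transform(articles)

print(X.shape)
print(vectorizer.get_feature_names_out())
```

Note that `fit_transform` should only be called on the training data; unseen articles are converted with the fitted vectorizer's `transform` method so the vocabulary stays consistent.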
This vectorized data formed the foundation for training our machine learning models.