Vectorization

🔠 Text Vectorization

After cleaning and preprocessing the news articles, the next step was to convert the text data into a numerical format that machine learning models can understand.

We used TF-IDF (Term Frequency–Inverse Document Frequency) vectorization to represent each article as a vector of weighted word features.


📊 Why TF-IDF?

TF-IDF helps:

  • Emphasize important words in a document that are less frequent across all documents
  • Reduce the impact of common words that appear in many articles
  • Create a sparse, high-dimensional feature space suitable for models like Logistic Regression and Naive Bayes


🛠️ How It Works

  1. TF (Term Frequency)
    Measures how often a word appears in a single document.

  2. IDF (Inverse Document Frequency)
    Measures how rare a word is across all documents in the dataset.

  3. TF × IDF = TF-IDF
    The final score reflects both the importance of a word within a document and its rarity across the corpus (see the worked sketch below).
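To make the arithmetic concrete, here is a minimal sketch of the classic (textbook) TF-IDF computation on a toy two-document corpus. The corpus and helper functions are illustrative only; scikit-learn's implementation differs slightly (it smooths the IDF term and L2-normalizes each row by default), but the intuition is the same.

```python
import math

# Toy corpus: two tokenized "articles" (illustrative only).
docs = [
    "the economy grew this quarter".split(),
    "the team won the match".split(),
]

def tf(term, doc):
    # Term frequency: fraction of the document's tokens equal to `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms found in fewer documents score higher.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" occurs in both documents, so idf = log(2/2) = 0 and its TF-IDF is 0;
# "economy" occurs in only one document, so it receives a positive weight.
print(tfidf("the", docs[0], docs))      # 0.0
print(tfidf("economy", docs[0], docs))  # ~0.139  (0.2 * log 2)
```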


⚙️ Implementation

  • We used TfidfVectorizer from scikit-learn
  • Parameters:
    • stop_words='english' — ignores common English stopwords
    • max_df=0.7 — ignores words that appear in more than 70% of documents
  • The output is a sparse matrix used as the input feature matrix X for model training (a minimal sketch follows below).
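A minimal sketch of this step, assuming a list of already-cleaned article strings (the sample texts below are placeholders, not the project's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; in the project this is the cleaned article text.
articles = [
    "stock markets rallied after the policy announcement",
    "the government announced new economic measures",
    "local team celebrates championship victory",
]

vectorizer = TfidfVectorizer(
    stop_words='english',  # ignore common English stopwords
    max_df=0.7,            # ignore terms in more than 70% of documents
)

# Learn the vocabulary and IDF weights, then encode the corpus.
X = vectorizer.fit_transform(articles)

print(X.shape)  # (n_documents, n_features), stored as a sparse matrix
```

Note that fit_transform is used only on the training corpus; new articles at prediction time should be encoded with transform on the already-fitted vectorizer so the vocabulary and IDF weights stay fixed.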

This vectorized data formed the foundation for training our machine learning models.
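Continuing the sketch above, the sparse matrix X can be passed directly to the models mentioned earlier; the labels y here are hypothetical placeholders:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labels aligned with `articles` above (e.g., 0 = real, 1 = fake).
y = [0, 1, 0]

# Both models accept the sparse TF-IDF matrix without densifying it.
lr = LogisticRegression(max_iter=1000).fit(X, y)
nb = MultinomialNB().fit(X, y)

# Encode unseen text with the already-fitted vectorizer before predicting.
X_new = vectorizer.transform(["markets fall amid new government measures"])
print(lr.predict(X_new), nb.predict(X_new))
```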