This is our project for our Applied Machine Learning (CIS 5190) class. In this project we investigate two central questions:
1) “Does ensembling of transformers improve performance in news source classification?”
2) “Does increasing model size (parameter count) correlate with better classification performance?”
Special thanks to my teammates, Adam and Charlie, for their collaboration on this project.
Data Collection & Cleaning
We collected 3,805 headlines from Fox News and NBC, cleaned the data by removing publisher identifiers and faulty URLs, and ended up with 3,713 usable headlines. We performed a stratified 60%/20%/20% split into train, validation, and test sets to preserve class balance across splits.
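A minimal sketch of the split, assuming the cleaned data lives in a pandas DataFrame with hypothetical `headline` and `source` columns (the file name below is also illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file: one row per headline, with a "headline" text column
# and a "source" label (Fox or NBC).
df = pd.read_csv("headlines_clean.csv")

# Carve off the 20% test set first, stratifying on the source label
# so Fox/NBC proportions are preserved in every split.
train_val_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["source"], random_state=42
)

# Split the remaining 80% into 60% train / 20% validation
# (0.25 of the remainder equals 20% of the full dataset).
train_df, val_df = train_test_split(
    train_val_df, test_size=0.25, stratify=train_val_df["source"], random_state=42
)
```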
Exploratory Data Analysis
We began by examining headline lengths, finding that Fox headlines average ~14 words, compared to ~12 for NBC (Figure 1). The normalized histogram suggests Fox uses longer, more descriptive titles, indicating a stylistic difference. VADER sentiment analysis (Figure 2) showed both sources cluster around neutral sentiment, though Fox headlines skew marginally negative, potentially reflecting political framing. Finally, token frequency analysis (Figure 3) revealed that “trump” is common to both, but Fox emphasized political figures (“harris,” “biden”) while NBC featured lifestyle topics (“best,” “shop”), highlighting content differences we aimed to exploit in classification; a sketch of this analysis follows the figure captions.
Figure 1: Headline length by source: (left) raw frequency; (right) percent within each source.
Figure 2: VADER sentiment ternary plot of headlines by source.
Figure 3: Top 12 words by percentage frequency in Fox (top) and NBC (bottom) headlines.
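The length, sentiment, and token-frequency analyses can be reproduced with a sketch like the following; the label values ("fox", "nbc") and column names are assumptions carried over from the split sketch above, and the exact tokenization and stop-word list we used may differ:

```python
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def avg_length(headlines):
    # Mean headline length in words (Figure 1).
    return sum(len(h.split()) for h in headlines) / len(headlines)

def sentiment(headlines):
    # VADER returns pos/neu/neg/compound scores per headline (Figure 2).
    return [analyzer.polarity_scores(h) for h in headlines]

def top_words(headlines, k=12):
    # Percentage frequency of the most common non-stop-word tokens (Figure 3).
    tokens = [w for h in headlines for w in h.lower().split()
              if w.isalpha() and w not in ENGLISH_STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(w, 100 * c / total) for w, c in counts.most_common(k)]

fox = train_df.loc[train_df["source"] == "fox", "headline"]
nbc = train_df.loc[train_df["source"] == "nbc", "headline"]
print(avg_length(fox), avg_length(nbc))
print(top_words(fox), top_words(nbc))
```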
Model Design & Iteration
We built a baseline with TF-IDF and Logistic Regression, then fine-tuned five transformer models (BERT, RoBERTa, DistilBERT, ELECTRA, and BERT-xlarge) using standard hyperparameters and early stopping. Input sequences were truncated to 64 tokens for efficiency, and training times ranged from under 2 minutes (DistilBERT) to about 10 minutes (BERT-xlarge).
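A sketch of the fine-tuning setup using the Hugging Face Trainer; the checkpoint name, batch size, and epoch count below are illustrative rather than our exact settings, and `train_ds` / `val_ds` are assumed to be `datasets.Dataset` objects with `text` and `label` columns:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

MODEL_NAME = "google/electra-base-discriminator"  # swapped per architecture

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    # Truncate every headline to 64 tokens for efficiency.
    return tokenizer(batch["text"], truncation=True, max_length=64,
                     padding="max_length")

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    eval_strategy="epoch",        # "evaluation_strategy" in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```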
Evaluation Protocol and Results
All models were evaluated on the held-out 20% test set using accuracy, precision, recall, and macro-F1 score. Transformer models outperformed the TF-IDF baseline by 12–14 percentage points: ELECTRA-base led with 84.52% accuracy (macro-F1 = 0.85), while BERT-xlarge reached 82.37%.
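The metrics were computed with scikit-learn; a minimal version, where `y_true` and `y_pred` stand for the test labels and a model's predictions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    # Macro averaging weights both sources equally, regardless of class counts.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro"
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "macro_f1": f1,
    }
```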
Central Questions
1) Does ensembling of transformers improve performance?
We explored whether ensembling transformers could outperform individual models by testing three methods: simple probability averaging, a logistic regression meta-classifier, and a random forest meta-classifier. Probability averaging and the logistic regression meta-classifier tied for the highest accuracy (85.6%), improving on the best single model (ELECTRA at 84.5%). Analysis showed that different transformers contributed complementary strengths: RoBERTa handled political headlines better, while BERT-xlarge captured subtler neutral styles, and ensembles reduced errors on ambiguous cases where the models disagreed. Since all ensemble methods performed similarly, simple linear weighting proved sufficient, with little benefit from more complex approaches. Overall, ensembling improved accuracy by about 1.08 percentage points over the best single model (ELECTRA), at the cost of higher inference time, making it worthwhile when maximizing classification performance is the priority.
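A sketch of the two strongest ensembles; `val_probs` / `test_probs` (one `(n_examples, 2)` probability array per transformer) and `y_val` are assumed variable names, not our actual code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1) Simple probability averaging (soft voting): mean the models'
# class probabilities and take the argmax.
avg_probs = np.mean(test_probs, axis=0)
avg_pred = avg_probs.argmax(axis=1)

# 2) Logistic regression meta-classifier (stacking): concatenate each
# model's probabilities into one feature vector per headline and fit
# the meta-classifier on the validation labels.
X_val = np.hstack(val_probs)
X_test = np.hstack(test_probs)
meta = LogisticRegression(max_iter=1000).fit(X_val, y_val)
stack_pred = meta.predict(X_test)
```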
2) Does increasing model size (parameter count) correlate with better classification performance?
We tested whether larger models improve performance by comparing BERT-base (110M parameters) and BERT-xlarge (334M). Despite having roughly three times the parameters, BERT-xlarge performed slightly worse (82.4% vs. 83.6% accuracy) while requiring over three times the training time. Error patterns were nearly identical, suggesting that headline classification does not benefit from the extra capacity, likely because of the short inputs, the simplicity of the task, and strong transfer from pretraining. The results show that bigger is not always better: smaller, diverse models are more efficient and effective for this task, a finding that directly informed our ensembling approach.
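For reference, the parameter counts can be checked directly from the loaded checkpoints (the checkpoint names below are illustrative; `bert-large-uncased` stands in for the model we refer to as BERT-xlarge):

```python
from transformers import AutoModelForSequenceClassification

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```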
Conclusion
Our research demonstrates that transformer ensemble methods significantly enhance news source classification performance. The logistic regression ensemble achieved 85.60% validation accuracy, outperforming our best single model by 1.08 percentage points. The complementary strengths of diverse transformer architectures—rather than increased model size—proved most valuable for this task.
This research has implications beyond academic interest. With increasing concerns about media polarization, tools that automatically identify news sources based on linguistic patterns could help readers diversify their information diet and recognize framing effects. Our ensemble approach provides a robust foundation for such applications.