Executive Technical Report
NLP Sentiment Intelligence System
End-to-end ML pipeline for automated text sentiment analysis with confidence-based routing, SHAP explainability, and production monitoring.
System Active
47,240 Reviews Analyzed
24 Experiments Tracked
April 2026
Executive Summary
Key Performance Indicators
Champion model selected on calibrated probability quality for production routing system.
Champion F1 Score
0.8948
95% CI: [0.8906, 0.8988]
Brier Score
0.0789
Calibrated probability quality
Inference Speed
2.9ms
Per review prediction
Models Compared
8
3 families + tuning + calibration
Test Samples
23,483
One-time locked evaluation
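The champion was selected on calibrated probability quality (Brier score) rather than raw F1. A minimal sketch of that comparison, assuming a scikit-learn pipeline with sigmoid (Platt) calibration; the synthetic data and `method="sigmoid"` choice are illustrative, not the report's actual configuration:

```python
# Sketch: measuring Brier score before and after probability calibration.
# The dataset and calibration method are illustrative assumptions, not
# the report's actual pipeline.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=5
).fit(X_tr, y_tr)

# Lower Brier score = better-calibrated probabilities (0 is perfect).
raw = brier_score_loss(y_te, base.predict_proba(X_te)[:, 1])
cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
print(f"raw={raw:.4f} calibrated={cal:.4f}")
```

Calibration matters here because the routing thresholds later in this report only make sense if the predicted probabilities are trustworthy.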
Model Performance
8 Models. 1 Champion. Here is the evidence.
All models evaluated on the same locked test set with bootstrap confidence intervals and McNemar statistical significance tests.
| Model | Test F1 | 95% CI | ROC-AUC | Precision | Recall | ms/pred | Status |
|---|---|---|---|---|---|---|---|
| LR Tuned Calibrated | 0.8948 | [0.8906, 0.8988] | 0.9605 | 0.8906 | 0.8991 | 2.89 | CHAMPION |
| LR Default | 0.8954 | [0.8908, 0.8992] | 0.9599 | 0.8902 | 0.9007 | 0.34 | p=0.54 |
| LR Tuned | 0.8948 | [0.8907, 0.8989] | 0.9605 | 0.8914 | 0.8983 | 0.31 | |
| LR Default Calibrated | 0.8934 | [0.8891, 0.8973] | 0.9592 | 0.8890 | 0.8979 | 3.02 | |
| LGBM Calibrated | 0.8828 | [0.8782, 0.8870] | 0.9542 | 0.8750 | 0.8907 | 33.65 | p<0.001 |
| DistilBERT + LoRA | 0.8796 | [0.8753, 0.8841] | 0.9523 | 0.8724 | 0.8870 | 24.93 | p<0.001 |
| LGBM Default | 0.8792 | [0.8745, 0.8834] | 0.9517 | 0.8716 | 0.8869 | 8.98 | |
| LGBM Tuned | 0.8763 | [0.8715, 0.8807] | 0.9483 | 0.8697 | 0.8830 | 8.17 | |
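The 95% confidence intervals above come from bootstrapping the locked test set. A minimal percentile-bootstrap sketch; `y_true` and `y_pred` below are synthetic stand-ins for the real test labels and champion predictions:

```python
# Sketch: percentile bootstrap 95% CI for test F1.
# y_true / y_pred are synthetic stand-ins for the locked test set.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n = 23_483
y_true = rng.integers(0, 2, size=n)
# Simulate a classifier that is right ~89% of the time.
flip = rng.random(n) < 0.11
y_pred = np.where(flip, 1 - y_true, y_true)

scores = []
for _ in range(1000):                    # bootstrap resamples
    idx = rng.integers(0, n, size=n)     # sample rows with replacement
    scores.append(f1_score(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"F1 95% CI: [{lo:.4f}, {hi:.4f}]")
```

Because every model is resampled over the same test set, overlapping intervals (as between the LR variants) signal that rank differences may not be meaningful, which is exactly what the McNemar tests below confirm.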
F1 Score Comparison
All models ranked by test F1 with champion highlighted
Inference Latency (log scale)
Milliseconds per prediction on test hardware
Linear Separability Confirmed
LR with TF-IDF bigrams (F1=0.8948) outperforms DistilBERT (F1=0.8796), strong evidence that sentiment in this dataset is effectively linearly separable in TF-IDF space.
Tuning Diminishing Returns
Optuna tuning improved LR by only +0.0012 F1, while LGBM tuning hurt performance (8 trials are insufficient to search a 7-dimensional hyperparameter space).
McNemar Significance
Champion vs LR Default: p=0.54 (not significant; the two models' predictions are statistically indistinguishable). Champion vs DistilBERT: p<0.001 (significant difference).
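McNemar's test compares two classifiers on the samples where they disagree. A self-contained sketch of the exact (binomial) form, using only the standard library; the discordant-pair counts passed in at the bottom are illustrative, not the report's actual counts:

```python
# Sketch: exact McNemar test from two models' per-sample correctness.
# The discordant-pair counts used below are illustrative.
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = samples model A got right and model B got wrong,
    c = the reverse. Under H0, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Nearly balanced disagreements -> large p (no real difference).
print(mcnemar_exact(105, 112))
# Lopsided disagreements -> tiny p (a genuine difference).
print(mcnemar_exact(300, 180))
```

The `statsmodels` package provides the same test as `statsmodels.stats.contingency_tables.mcnemar` if a library implementation is preferred.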
Explainability
Every prediction is auditable. No black boxes.
SHAP analysis on 23,483 test samples reveals which words drive sentiment predictions.
Top 20 SHAP Features (Global)
Mean |SHAP value| across all test predictions
Word Highlighting Demo
Real-time SHAP visualization on example predictions
STRONG POSITIVE | pred=1.000 | actual=Positive
This is one of Bruce's most underrated films in my opinion, its an awesome heartwarming film, with a neat story and an amazing performance from Bruce Willis! All the characters are great, and I thought it was very well written.
STRONG NEGATIVE | pred=0.000 | actual=Negative
This was a very disappointing movie. I would definitely call this the worst movie of all time. The acting and writing were poor. And the jokes were not funny. A stupid piece of crap.
BORDERLINE | pred=0.500 | actual=Negative
The film has excellent cinematography and great performances but the plot was bad and the ending was terrible. (Mixed signals: the model correctly assigns maximum uncertainty.)
Production System
Confidence-Based Routing
Calibrated probabilities enable intelligent triage: auto-classify easy cases, escalate ambiguous ones to human reviewers.
prob > 0.85 or < 0.15
AUTO-CLASSIFY
Model is highly confident. No human review needed. Covers the majority of reviews.
prob 0.60-0.85 or 0.15-0.40
HUMAN REVIEW
Mixed signals detected. Routed to human reviewer for verification.
prob 0.40-0.60
ESCALATE
Model genuinely uncertain. Escalated to senior analyst for decision.
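The three-tier triage above reduces to a small pure function over the calibrated positive-class probability. A sketch using the thresholds stated in this section (the tier labels mirror the headings):

```python
# Sketch of the three-tier routing rule described above.
# Thresholds come from the report; the function itself is illustrative.
def route(prob: float) -> str:
    """Route a review by its calibrated positive-class probability."""
    if prob > 0.85 or prob < 0.15:
        return "AUTO-CLASSIFY"       # model highly confident
    if 0.40 <= prob <= 0.60:
        return "ESCALATE"            # genuine uncertainty -> senior analyst
    return "HUMAN REVIEW"            # mixed signals -> reviewer verification

print(route(0.97), route(0.50), route(0.72))
```

Note that this routing is only safe because the champion's probabilities are calibrated; with an uncalibrated model, a "0.97" need not mean 97% confidence.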
Prediction Confidence Distribution
How the champion model distributes confidence across the test set
Business Case
Cost-Performance Tradeoff
Estimated monthly costs for processing 1M reviews on AWS on-demand instances, US East, April 2026.
LR Champion (CPU)
$50/mo
F1 = 0.8948 | 2.9ms/pred
vs
DistilBERT (GPU)
$500/mo
F1 = 0.8796 | 24.9ms/pred
10x Cost Savings
LR achieves higher F1 at 1/10th the cost. No GPU required. CPU-only deployment on t3.medium covers 1M reviews/month with capacity to spare.
9x Faster Inference
LR processes reviews in 2.9ms vs BERT's 24.9ms. At scale, this means 0.8 compute hours vs 6.9 hours for 1M reviews.
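The compute-hour figures above follow directly from per-prediction latency at 1M reviews/month; the arithmetic can be checked in a few lines (model names and latencies are taken from the report):

```python
# Sketch: compute hours per 1M reviews, derived from latency per prediction.
reviews = 1_000_000
for name, ms in [("LR champion", 2.9), ("DistilBERT", 24.9)]:
    hours = reviews * ms / 1000 / 3600   # ms -> s -> hours
    print(f"{name}: {hours:.1f} compute hours")
```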
Data Foundation
Data Quality Assessment
47,240 clean text reviews from a curated dataset spanning decades of content.
Retention Rate
99.8%
91 duplicates removed
Class Balance
49.9/50.1
No SMOTE needed
Unique Movies
6,648
Median 4 reviews each
Validation Checks
5/5
Great Expectations
Review Length Distribution
Character count distribution by sentiment class
Temporal Sentiment Trends
Positive review ratio by decade with review counts
Production Monitoring
Is the model still reliable?
Evidently AI drift detection comparing reference and current prediction distributions.
✓
NO DRIFT DETECTED
Prediction distributions are stable. Drift score 0.025 is well below the 0.05 threshold.
Drift Score
0.025
Threshold: 0.05
Mean Prob Diff
0.025
0.487 vs 0.512
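Evidently supports several statistical tests for distribution drift; a common choice for 1-D prediction distributions is the Wasserstein (earth-mover's) distance. The sketch below is illustrative and is not necessarily the exact metric behind the 0.025 figure; the reference/current samples are synthetic:

```python
# Sketch: a simple drift score between reference and current prediction
# distributions. The Wasserstein distance shown here is one common
# choice; it is NOT necessarily the exact metric behind the 0.025 figure.
import numpy as np

def wasserstein_1d(ref: np.ndarray, cur: np.ndarray) -> float:
    """Earth-mover's distance for equal-sized 1-D samples."""
    return float(np.mean(np.abs(np.sort(ref) - np.sort(cur))))

rng = np.random.default_rng(0)
ref = rng.beta(2, 2, size=10_000)                      # stand-in reference probs
cur = np.clip(ref + rng.normal(0, 0.02, size=10_000), 0, 1)  # mild shift

score = wasserstein_1d(ref, cur)
print(f"drift score: {score:.3f}  (retrain trigger if > 0.05)")
```

In production, `ref` would be the prediction distribution captured at deployment and `cur` a rolling window of recent predictions, recomputed on a schedule.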
Retrain Trigger: Data Drift
If drift score exceeds 0.05, new review vocabulary or sentiment patterns have emerged.
Retrain Trigger: F1 Drop
If monitored F1 on labeled production data drops below 0.85 threshold.
Retrain Trigger: Business Change
If routing thresholds need adjustment or multi-class sentiment (1-5 stars) is required.