Executive Technical Report

NLP Sentiment Intelligence System

End-to-end ML pipeline for automated text sentiment analysis with confidence-based routing, SHAP explainability, and production monitoring.

System Active | 47,240 Reviews Analyzed | 24 Experiments Tracked | April 2026
Key Performance Indicators
Champion model selected on calibrated probability quality for the production routing system.
Champion F1 Score
0.8948
95% CI: [0.8906, 0.8988]
Brier Score
0.0789
Calibrated probability quality
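The Brier score above is the mean squared error between the predicted probability of the positive class and the 0/1 outcome (lower is better). A minimal sketch with scikit-learn, on illustrative probabilities rather than the report's data:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy example (illustrative probabilities, not the report's data).
# Brier score = mean squared error between P(positive) and the
# 0/1 outcome; a perfectly calibrated, confident model scores ~0.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_prob = np.array([0.95, 0.80, 0.10, 0.30, 0.65, 0.05])

brier = brier_score_loss(y_true, y_prob)
print(f"Brier score: {brier:.4f}")
```

Because it rewards both accuracy and honest probabilities, it is a natural selection metric when downstream routing depends on calibrated confidence.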
Inference Speed
2.9ms
Per review prediction
Models Compared
8
3 families + tuning + calibration
Test Samples
23,483
One-time locked evaluation
8 Models. 1 Champion. Here is the evidence.
All models evaluated on the same locked test set with bootstrap confidence intervals and McNemar statistical significance tests.
Model | Test F1 | 95% CI | ROC-AUC | Precision | Recall | ms/pred | Status
LR Tuned Calibrated | 0.8948 | [0.8906, 0.8988] | 0.9605 | 0.8906 | 0.8991 | 2.89 | CHAMPION
LR Default | 0.8954 | [0.8908, 0.8992] | 0.9599 | 0.8902 | 0.9007 | 0.34 | p=0.54
LR Tuned | 0.8948 | [0.8907, 0.8989] | 0.9605 | 0.8914 | 0.8983 | 0.31 |
LR Default Calibrated | 0.8934 | [0.8891, 0.8973] | 0.9592 | 0.8890 | 0.8979 | 3.02 |
LGBM Calibrated | 0.8828 | [0.8782, 0.8870] | 0.9542 | 0.8750 | 0.8907 | 33.65 | p<0.001
DistilBERT + LoRA | 0.8796 | [0.8753, 0.8841] | 0.9523 | 0.8724 | 0.8870 | 24.93 | p<0.001
LGBM Default | 0.8792 | [0.8745, 0.8834] | 0.9517 | 0.8716 | 0.8869 | 8.98 |
LGBM Tuned | 0.8763 | [0.8715, 0.8807] | 0.9483 | 0.8697 | 0.8830 | 8.17 |
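The bootstrap confidence intervals in the table come from resampling the locked test set. A minimal percentile-bootstrap sketch (synthetic labels and predictions, not the report's data):

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1: resample (label, prediction)
    pairs with replacement and take the empirical quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic demo: a classifier that flips ~10% of true labels
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)
flip = rng.random(2000) < 0.10
y_pred = np.where(flip, 1 - y_true, y_true)

lo, hi = bootstrap_f1_ci(y_true, y_pred, n_boot=200)
print(f"F1 95% CI: [{lo:.4f}, {hi:.4f}]")
```

Because every model is scored on the same resamples of the same locked test set, overlapping intervals (as with the four LR variants) signal that observed F1 gaps may be noise.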

F1 Score Comparison

All models ranked by test F1 with champion highlighted

Inference Latency (log scale)

Milliseconds per prediction on test hardware

Linear Separability Confirmed

LR with TF-IDF bigrams (F1=0.8948) outperforms DistilBERT (F1=0.8796), confirming that sentiment in this dataset is effectively linearly separable in TF-IDF feature space.
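The champion architecture described above pairs TF-IDF unigrams + bigrams with logistic regression. A minimal scikit-learn sketch on a toy corpus (hyperparameters here are placeholders, not the tuned values):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Minimal stand-in for the champion architecture: TF-IDF with
# unigrams + bigrams feeding a linear classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Tiny illustrative corpus (1 = positive, 0 = negative)
texts = [
    "an awesome heartwarming film with amazing performances",
    "well written with a great story",
    "a very disappointing movie with poor acting",
    "the worst movie of all time, not funny at all",
]
labels = [1, 1, 0, 0]
clf.fit(texts, labels)

probs = clf.predict_proba(["great story and amazing acting"])[0]
print(f"P(positive) = {probs[1]:.3f}")
```

The bigram features are what let a linear model capture short negations and collocations ("not funny", "well written") that unigrams alone would miss.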

Tuning Diminishing Returns

Optuna tuning improved LR by only +0.0012 F1, and LGBM tuning actually degraded performance (8 trials are too few for a 7-dimensional search space).

McNemar Significance

Champion vs LR Default: p=0.54 (not significant; the two models make near-identical predictions). Champion vs BERT: p<0.001 (significant difference).
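McNemar's test compares two classifiers on paired predictions, using only the cases where they disagree. A hand-rolled sketch of the continuity-corrected chi-square form (the disagreement counts below are illustrative, not the report's):

```python
from scipy.stats import chi2

def mcnemar_p(b, c):
    """McNemar test with continuity correction.
    b = cases only model A got right, c = cases only model B got right.
    Under H0 (equal error rates), b and c should be similar."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)

# Illustrative disagreement counts (not the report's actual data):
p_similar = mcnemar_p(b=110, c=100)    # symmetric disagreements -> high p
p_different = mcnemar_p(b=250, c=120)  # one model wins most disagreements
print(f"similar models:   p = {p_similar:.3f}")
print(f"different models: p = {p_different:.2e}")
```

Because the test conditions on disagreements only, it is far more sensitive for paired model comparisons than comparing two unpaired accuracy figures.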

Every prediction is auditable. No black boxes.
SHAP analysis on 23,483 test samples reveals which words drive sentiment predictions.

Top 20 SHAP Features (Global)

Mean |SHAP value| across all test predictions

Word Highlighting Demo

Real-time SHAP visualization on example predictions
STRONG POSITIVE | pred=1.000 | actual=Positive
This is one of Bruce's most underrated films in my opinion, its an awesome heartwarming film, with a neat story and an amazing performance from Bruce Willis! All the characters are great, and I thought it was very well written.
STRONG NEGATIVE | pred=0.000 | actual=Negative
This was a very disappointing movie. I would definitely call this the worst movie of all time. The acting and writing were poor. And the jokes were not funny. A stupid piece of crap.
BORDERLINE | pred=0.500 | actual=Negative
The film has excellent cinematography and great performances but the plot was bad and the ending was terrible.
Mixed positive and negative signals; the model correctly assigns maximum uncertainty.
Confidence-Based Routing
Calibrated probabilities enable intelligent triage: auto-classify easy cases, escalate ambiguous ones to human reviewers.
prob > 0.85 or < 0.15
AUTO-CLASSIFY
Model is highly confident. No human review needed. Covers the majority of reviews.
prob 0.60-0.85 or 0.15-0.40
HUMAN REVIEW
Mixed signals detected. Routed to human reviewer for verification.
prob 0.40-0.60
ESCALATE
Model genuinely uncertain. Escalated to senior analyst for decision.
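The three bands above can be encoded as a small routing function. A sketch (boundary handling at exactly 0.85/0.15 is an assumption, since the report lists the bands without specifying inclusivity):

```python
def route(prob: float) -> str:
    """Map a calibrated P(positive) to a routing decision using the
    report's three confidence bands."""
    if prob > 0.85 or prob < 0.15:
        return "AUTO_CLASSIFY"   # high confidence either way
    if 0.40 <= prob <= 0.60:
        return "ESCALATE"        # genuinely uncertain
    return "HUMAN_REVIEW"        # 0.15-0.40 or 0.60-0.85

for p in (0.97, 0.72, 0.50, 0.22, 0.03):
    print(f"{p:.2f} -> {route(p)}")
```

This kind of triage only works because the champion's probabilities are calibrated; with an uncalibrated model, a "0.95" need not mean 95% confidence and the band boundaries would be meaningless.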

Prediction Confidence Distribution

How the champion model distributes confidence across the test set
Cost-Performance Tradeoff
Estimated monthly costs for processing 1M reviews on AWS on-demand instances, US East, April 2026.
LR Champion (CPU)
$50/mo
F1 = 0.8948 | 2.9ms/pred
vs
DistilBERT (GPU)
$500/mo
F1 = 0.8796 | 24.9ms/pred

10x Cost Savings

LR achieves higher F1 at 1/10th the cost. No GPU required. CPU-only deployment on t3.medium covers 1M reviews/month with capacity to spare.

9x Faster Inference

LR processes reviews in 2.9ms vs BERT's 24.9ms. At scale, this means 0.8 compute hours vs 6.9 hours for 1M reviews.
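The compute-hour figures follow directly from the measured per-prediction latencies; a quick arithmetic check:

```python
# Compute hours to score 1M reviews at each model's measured latency.
N = 1_000_000
lr_ms, bert_ms = 2.9, 24.9

lr_hours = N * lr_ms / 1000 / 3600    # ms -> s -> hours
bert_hours = N * bert_ms / 1000 / 3600

print(f"LR:   {lr_hours:.1f} h")      # ~0.8 h
print(f"BERT: {bert_hours:.1f} h")    # ~6.9 h
print(f"speedup: {bert_ms / lr_ms:.1f}x")
```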

Data Quality Assessment
47,240 clean text reviews from a curated dataset spanning decades of content.
Retention Rate
99.8%
91 duplicates removed
Class Balance
49.9/50.1
No SMOTE needed
Unique Movies
6,648
Median 4 reviews each
Validation Checks
5/5
Great Expectations

Review Length Distribution

Character count distribution by sentiment class

Temporal Sentiment Trends

Positive review ratio by decade with review counts
Is the model still reliable?
Evidently AI drift detection comparing reference and current prediction distributions.

NO DRIFT DETECTED

Prediction distributions are stable. Drift score 0.025 is well below the 0.05 threshold.

Reference Samples
11,741
Current Samples
11,742
Drift Score
0.025
Threshold: 0.05
Mean Prob Diff
0.025
0.487 vs 0.512
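A prediction-drift check in the spirit of the Evidently setup above can be sketched with a two-sample Kolmogorov-Smirnov test (one of the statistical tests commonly applied to numeric columns), treating the KS statistic as the drift score. Data here is synthetic, not the report's stored probabilities:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for reference vs current predicted probabilities.
# Both are drawn from the same distribution, so no drift is expected.
rng = np.random.default_rng(0)
reference = rng.beta(2, 2, size=11_741)
current = rng.beta(2, 2, size=11_742)

stat, p_value = ks_2samp(reference, current)
THRESHOLD = 0.05                 # the report's drift-score threshold
drifted = stat > THRESHOLD
print(f"drift score {stat:.3f} -> {'DRIFT' if drifted else 'NO DRIFT'}")
```

In production the reference window would be frozen at deployment time and the current window would slide over recent predictions, so a rising score flags vocabulary or sentiment shift before labeled data is available.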

Retrain Trigger: Data Drift

If drift score exceeds 0.05, new review vocabulary or sentiment patterns have emerged.

Retrain Trigger: F1 Drop

If monitored F1 on labeled production data drops below 0.85 threshold.

Retrain Trigger: Business Change

If routing thresholds need adjustment or multi-class sentiment (1-5 stars) is required.
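The three triggers above can be combined into one monitoring check. A sketch with illustrative signal names (the function and its inputs are hypothetical, not part of the deployed system):

```python
from typing import List, Optional

def should_retrain(drift_score: float,
                   monitored_f1: Optional[float],
                   business_change: bool = False) -> List[str]:
    """Collect fired retrain triggers (names are illustrative).
    monitored_f1 is None when no labeled production data is available."""
    reasons = []
    if drift_score > 0.05:                                # data-drift trigger
        reasons.append("data_drift")
    if monitored_f1 is not None and monitored_f1 < 0.85:  # F1-drop trigger
        reasons.append("f1_drop")
    if business_change:                                   # e.g. new thresholds
        reasons.append("business_change")
    return reasons

print(should_retrain(drift_score=0.025, monitored_f1=0.89))  # []
print(should_retrain(drift_score=0.07, monitored_f1=0.82))
```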