Executive Technical Report

NLP Sentiment Intelligence System

End-to-end ML pipeline for automated text sentiment analysis with confidence-based routing, SHAP explainability, and production monitoring.

System Active | 47,240 Reviews Analyzed | 24 Experiments Tracked | April 2026
Key Performance Indicators
Champion model selected on calibrated probability quality for the production routing system.
Champion F1 Score
0.8948
95% CI: [0.8906, 0.8988]
Brier Score
0.0789
Calibrated probability quality
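The Brier score above is the mean squared error between the predicted probability of the positive class and the 0/1 outcome (lower is better). A minimal sketch with scikit-learn, on illustrative probabilities rather than the report's data:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy example (illustrative probabilities, not the report's data).
# Brier score = mean squared error between P(positive) and the
# 0/1 outcome; a perfectly calibrated, confident model scores ~0.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_prob = np.array([0.95, 0.80, 0.10, 0.30, 0.65, 0.05])

brier = brier_score_loss(y_true, y_prob)
print(f"Brier score: {brier:.4f}")
```

Because it rewards both accuracy and honest probabilities, it is a natural selection metric when downstream routing depends on calibrated confidence.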
Inference Speed
2.9ms
Per review prediction
Models Compared
8
3 families + tuning + calibration
Test Samples
23,483
One-time locked evaluation
8 Models. 1 Champion. Here is the evidence.
All models evaluated on the same locked test set with bootstrap confidence intervals and McNemar statistical significance tests.
Model | Test F1 | 95% CI | ROC-AUC | Precision | Recall | ms/pred | Status
LR Tuned Calibrated | 0.8948 | [0.8906, 0.8988] | 0.9605 | 0.8906 | 0.8991 | 2.89 | CHAMPION
LR Default | 0.8954 | [0.8908, 0.8992] | 0.9599 | 0.8902 | 0.9007 | 0.34 | p=0.54
LR Tuned | 0.8948 | [0.8907, 0.8989] | 0.9605 | 0.8914 | 0.8983 | 0.31 |
LR Default Calibrated | 0.8934 | [0.8891, 0.8973] | 0.9592 | 0.8890 | 0.8979 | 3.02 |
LGBM Calibrated | 0.8828 | [0.8782, 0.8870] | 0.9542 | 0.8750 | 0.8907 | 33.65 | p<0.001
DistilBERT + LoRA | 0.8796 | [0.8753, 0.8841] | 0.9523 | 0.8724 | 0.8870 | 24.93 | p<0.001
LGBM Default | 0.8792 | [0.8745, 0.8834] | 0.9517 | 0.8716 | 0.8869 | 8.98 |
LGBM Tuned | 0.8763 | [0.8715, 0.8807] | 0.9483 | 0.8697 | 0.8830 | 8.17 |
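The bootstrap confidence intervals in the table come from resampling the locked test set. A minimal percentile-bootstrap sketch (synthetic labels and predictions, not the report's data):

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1: resample (label, prediction)
    pairs with replacement and take the empirical quantiles."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic demo: a classifier that flips ~10% of true labels
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)
flip = rng.random(2000) < 0.10
y_pred = np.where(flip, 1 - y_true, y_true)

lo, hi = bootstrap_f1_ci(y_true, y_pred, n_boot=200)
print(f"F1 95% CI: [{lo:.4f}, {hi:.4f}]")
```

Because every model is scored on the same resamples of the same locked test set, overlapping intervals (as with the four LR variants) signal that observed F1 gaps may be noise.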

F1 Score Comparison

All models ranked by test F1 with champion highlighted

Inference Latency (log scale)

Milliseconds per prediction on test hardware

Linear Separability Confirmed

LR with TF-IDF bigrams (F1=0.8948) outperforms DistilBERT (F1=0.8796), confirming that sentiment in this dataset is effectively linearly separable in TF-IDF feature space.
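The champion architecture described above pairs TF-IDF unigrams + bigrams with logistic regression. A minimal scikit-learn sketch on a toy corpus (hyperparameters here are placeholders, not the tuned values):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Minimal stand-in for the champion architecture: TF-IDF with
# unigrams + bigrams feeding a linear classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Tiny illustrative corpus (1 = positive, 0 = negative)
texts = [
    "an awesome heartwarming film with amazing performances",
    "well written with a great story",
    "a very disappointing movie with poor acting",
    "the worst movie of all time, not funny at all",
]
labels = [1, 1, 0, 0]
clf.fit(texts, labels)

probs = clf.predict_proba(["great story and amazing acting"])[0]
print(f"P(positive) = {probs[1]:.3f}")
```

The bigram features are what let a linear model capture short negations and collocations ("not funny", "well written") that unigrams alone would miss.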

Tuning Diminishing Returns

Optuna tuning improved LR by only +0.0012 F1, and LGBM tuning actually degraded performance (8 trials are too few for a 7-dimensional search space).

McNemar Significance

Champion vs LR Default: p=0.54 (not significant; the two models make near-identical predictions). Champion vs BERT: p<0.001 (significant difference).
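McNemar's test compares two classifiers on paired predictions, using only the cases where they disagree. A hand-rolled sketch of the continuity-corrected chi-square form (the disagreement counts below are illustrative, not the report's):

```python
from scipy.stats import chi2

def mcnemar_p(b, c):
    """McNemar test with continuity correction.
    b = cases only model A got right, c = cases only model B got right.
    Under H0 (equal error rates), b and c should be similar."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)

# Illustrative disagreement counts (not the report's actual data):
p_similar = mcnemar_p(b=110, c=100)    # symmetric disagreements -> high p
p_different = mcnemar_p(b=250, c=120)  # one model wins most disagreements
print(f"similar models:   p = {p_similar:.3f}")
print(f"different models: p = {p_different:.2e}")
```

Because the test conditions on disagreements only, it is far more sensitive for paired model comparisons than comparing two unpaired accuracy figures.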

Every prediction is auditable. No black boxes.
SHAP analysis on 23,483 test samples reveals which words drive sentiment predictions.

Top 20 SHAP Features (Global)

Mean |SHAP value| across all test predictions

Word Highlighting Demo

Real-time SHAP visualization on example predictions
STRONG POSITIVE | pred=1.000 | actual=Positive
This is one of Bruce's most underrated films in my opinion, its an awesome heartwarming film, with a neat story and an amazing performance from Bruce Willis! All the characters are great, and I thought it was very well written.
STRONG NEGATIVE | pred=0.000 | actual=Negative
This was a very disappointing movie. I would definitely call this the worst movie of all time. The acting and writing were poor. And the jokes were not funny. A stupid piece of crap.
BORDERLINE | pred=0.500 | actual=Negative
The film has excellent cinematography and great performances but the plot was bad and the ending was terrible.
Mixed positive and negative signals; the model correctly assigns maximum uncertainty.
Confidence-Based Routing
Calibrated probabilities enable intelligent triage: auto-classify easy cases, escalate ambiguous ones to human reviewers.
prob > 0.85 or < 0.15
AUTO-CLASSIFY
Model is highly confident. No human review needed. Covers the majority of reviews.
prob 0.60-0.85 or 0.15-0.40
HUMAN REVIEW
Mixed signals detected. Routed to human reviewer for verification.
prob 0.40-0.60
ESCALATE
Model genuinely uncertain. Escalated to senior analyst for decision.
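The three bands above can be encoded as a small routing function. A sketch (boundary handling at exactly 0.85/0.15 is an assumption, since the report lists the bands without specifying inclusivity):

```python
def route(prob: float) -> str:
    """Map a calibrated P(positive) to a routing decision using the
    report's three confidence bands."""
    if prob > 0.85 or prob < 0.15:
        return "AUTO_CLASSIFY"   # high confidence either way
    if 0.40 <= prob <= 0.60:
        return "ESCALATE"        # genuinely uncertain
    return "HUMAN_REVIEW"        # 0.15-0.40 or 0.60-0.85

for p in (0.97, 0.72, 0.50, 0.22, 0.03):
    print(f"{p:.2f} -> {route(p)}")
```

This kind of triage only works because the champion's probabilities are calibrated; with an uncalibrated model, a "0.95" need not mean 95% confidence and the band boundaries would be meaningless.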

Prediction Confidence Distribution

How the champion model distributes confidence across the test set
Cost-Performance Tradeoff
Estimated monthly costs for processing 1M reviews on AWS on-demand instances, US East, April 2026.
LR Champion (CPU)
$50/mo
F1 = 0.8948 | 2.9ms/pred
vs
DistilBERT (GPU)
$500/mo
F1 = 0.8796 | 24.9ms/pred

10x Cost Savings

LR achieves higher F1 at 1/10th the cost. No GPU required. CPU-only deployment on t3.medium covers 1M reviews/month with capacity to spare.

9x Faster Inference

LR processes reviews in 2.9ms vs BERT's 24.9ms. At scale, this means 0.8 compute hours vs 6.9 hours for 1M reviews.
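The compute-hour figures follow directly from the measured per-prediction latencies; a quick arithmetic check:

```python
# Compute hours to score 1M reviews at each model's measured latency.
N = 1_000_000
lr_ms, bert_ms = 2.9, 24.9

lr_hours = N * lr_ms / 1000 / 3600    # ms -> s -> hours
bert_hours = N * bert_ms / 1000 / 3600

print(f"LR:   {lr_hours:.1f} h")      # ~0.8 h
print(f"BERT: {bert_hours:.1f} h")    # ~6.9 h
print(f"speedup: {bert_ms / lr_ms:.1f}x")
```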

Data Quality Assessment
47,240 clean text reviews from a curated dataset spanning decades of content.
Retention Rate
99.8%
91 duplicates removed
Class Balance
49.9/50.1
No SMOTE needed
Unique Movies
6,648
Median 4 reviews each
Validation Checks
5/5
Great Expectations

Review Length Distribution

Character count distribution by sentiment class

Temporal Sentiment Trends

Positive review ratio by decade with review counts
Is the model still reliable?
Evidently AI drift detection comparing reference and current prediction distributions.

NO DRIFT DETECTED

Prediction distributions are stable. Drift score 0.025 is well below the 0.05 threshold.

Reference Samples
11,741
Current Samples
11,742
Drift Score
0.025
Threshold: 0.05
Mean Prob Diff
0.025
0.487 vs 0.512
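A prediction-drift check in the spirit of the Evidently setup above can be sketched with a two-sample Kolmogorov-Smirnov test (one of the statistical tests commonly applied to numeric columns), treating the KS statistic as the drift score. Data here is synthetic, not the report's stored probabilities:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for reference vs current predicted probabilities.
# Both are drawn from the same distribution, so no drift is expected.
rng = np.random.default_rng(0)
reference = rng.beta(2, 2, size=11_741)
current = rng.beta(2, 2, size=11_742)

stat, p_value = ks_2samp(reference, current)
THRESHOLD = 0.05                 # the report's drift-score threshold
drifted = stat > THRESHOLD
print(f"drift score {stat:.3f} -> {'DRIFT' if drifted else 'NO DRIFT'}")
```

In production the reference window would be frozen at deployment time and the current window would slide over recent predictions, so a rising score flags vocabulary or sentiment shift before labeled data is available.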

Retrain Trigger: Data Drift

If drift score exceeds 0.05, new review vocabulary or sentiment patterns have emerged.

Retrain Trigger: F1 Drop

If monitored F1 on labeled production data drops below 0.85 threshold.

Retrain Trigger: Business Change

If routing thresholds need adjustment or multi-class sentiment (1-5 stars) is required.
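The three triggers above can be combined into one monitoring check. A sketch with illustrative signal names (the function and its inputs are hypothetical, not part of the deployed system):

```python
from typing import List, Optional

def should_retrain(drift_score: float,
                   monitored_f1: Optional[float],
                   business_change: bool = False) -> List[str]:
    """Collect fired retrain triggers (names are illustrative).
    monitored_f1 is None when no labeled production data is available."""
    reasons = []
    if drift_score > 0.05:                                # data-drift trigger
        reasons.append("data_drift")
    if monitored_f1 is not None and monitored_f1 < 0.85:  # F1-drop trigger
        reasons.append("f1_drop")
    if business_change:                                   # e.g. new thresholds
        reasons.append("business_change")
    return reasons

print(should_retrain(drift_score=0.025, monitored_f1=0.89))  # []
print(should_retrain(drift_score=0.07, monitored_f1=0.82))
```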