Key insights
After testing seven models across 275 real security questionnaire tasks, the data reveals clear patterns in performance, cost, and practical deployment trade-offs.
| Finding | Impact | Details |
|---|---|---|
| GPT-5-mini achieves 99.3% of GPT-5's quality | Near-identical performance | Runs 1.8x faster at 4.2x lower cost |
| Latency variance requires careful timeout planning | 6.7x spread between P50 and P95 | 31s median becomes 209s at the tail |
| Smaller models deliver superior economics | 129x value difference | GPT-4.1-mini vs GPT-5 on quality per dollar-second |
| Task specialization yields performance gains | Model selection matters | o4-mini surpasses GPT-5 on questionnaires (0.975 vs 0.950) |
Testing methodology
We ran 275 production security questionnaire tasks through seven models: the GPT-5 family (GPT-5, GPT-5-mini, GPT-5-nano), GPT-4.1 variants (GPT-4.1, GPT-4.1-mini), and reasoning models (o3, o4-mini).
Each model processed identical prompts across six task categories: compliance verification, evidence analysis, hallucination detection, questionnaire responses, reasoning problems, and RFP generation. We tested multiple configurations for each model, varying reasoning depth and verbosity settings to find optimal performance. We tracked quality scores (0-1 scale), response latency at multiple percentiles, token consumption, and total costs across all parameter combinations.
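For readers who want to reproduce this kind of aggregation, the sketch below shows one minimal way to roll per-request results up into the mean quality and P50/P95 latency figures used throughout this post. The record fields and helper names are illustrative assumptions, not the actual harness we ran.

```python
from dataclasses import dataclass
from statistics import mean, quantiles

@dataclass
class RunResult:
    model: str          # e.g. "gpt-5-mini" (assumed label)
    task_type: str      # e.g. "questionnaire"
    quality: float      # graded score on a 0-1 scale
    latency_s: float    # wall-clock seconds for the request
    cost_usd: float     # total token cost for the request

def summarize(results: list[RunResult]) -> dict:
    """Aggregate one model's runs into mean quality, P50/P95 latency, and mean cost."""
    latencies = [r.latency_s for r in results]
    # quantiles(n=20) returns 19 cut points; index 9 ~ P50, index 18 ~ P95.
    cuts = quantiles(latencies, n=20)
    return {
        "mean_quality": mean(r.quality for r in results),
        "p50_latency_s": cuts[9],
        "p95_latency_s": cuts[18],
        "mean_cost_usd": mean(r.cost_usd for r in results),
    }
```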
Performance results
Quality scores
Quality differences between models were surprisingly narrow. The gap between the best and worst performer was just 10.2 percentage points.
GPT-5 scored 0.850, while GPT-5-mini achieved 0.844, a difference of less than 1%. This pattern held across task categories: the premium you pay for flagship models buys minimal quality improvements.
Even GPT-4.1-mini, at 0.764, handled most tasks competently. The real differentiators turned out to be latency and cost, not raw quality scores.
Latency characteristics
Response times tell a different story than quality scores. GPT-4.1-mini returns answers in 16 seconds on average, while GPT-5 takes over two minutes.
The variance is what causes problems in production. GPT-5-mini's P95 latency hit 362 seconds, just over six minutes for a single request. With a P50-to-P95 spread of 6.7x, any timeout you set becomes a trade-off between reliability and completeness.
For context: if you set a 60-second timeout to maintain user experience, you'd lose 5% of GPT-5 responses and nearly 50% of GPT-5-mini responses during peak latency periods.
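One practical pattern for that trade-off is a per-request timeout with a fallback to a faster model. The sketch below is a minimal illustration using the OpenAI Python SDK; the specific model names, timeout values, and `ask()` helper are assumptions chosen to mirror the numbers above, not a recommended production configuration.

```python
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    """Try the higher-quality model first; fall back to a faster one on timeout."""
    try:
        return client.chat.completions.create(
            model="gpt-5-mini",  # assumed primary model
            messages=[{"role": "user", "content": prompt}],
            timeout=60,          # seconds; beyond this, user experience degrades
        ).choices[0].message.content
    except openai.APITimeoutError:
        # Tail-latency hit: retry on the low-latency model instead of failing outright.
        return client.chat.completions.create(
            model="gpt-4.1-mini",  # assumed fallback model
            messages=[{"role": "user", "content": prompt}],
            timeout=30,
        ).choices[0].message.content
```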
Cost analysis
Per-request pricing
The cost differential between models is significant. GPT-5 costs $0.0535 per request, 19x the price of GPT-4.1-mini at $0.0028.
Processing 100,000 monthly requests would cost $280 with GPT-4.1-mini versus $5,350 with GPT-5. That 19x cost increase buys an 11% improvement in quality scores while responses arrive roughly 7x slower.
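The monthly figures are simply per-request price times volume; a quick back-of-the-envelope check using the prices quoted above:

```python
# Monthly cost at 100,000 requests, using the per-request prices from this section.
PER_REQUEST_USD = {"gpt-5": 0.0535, "gpt-4.1-mini": 0.0028}
MONTHLY_REQUESTS = 100_000

for model, price in PER_REQUEST_USD.items():
    print(f"{model}: ${price * MONTHLY_REQUESTS:,.0f}/month")
# gpt-5: $5,350/month
# gpt-4.1-mini: $280/month
```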
Value scoring
When you combine quality, latency, and cost into a single metric (quality per dollar-second), the economics become clear.
GPT-4.1-mini delivers 16.77 units of quality per dollar-second. GPT-5 delivers 0.13. That's a 129x difference in value. To justify GPT-5's cost and latency profile, it would need to perform significantly better than the benchmarks show.
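Assuming "quality per dollar-second" means quality divided by (cost per request times latency in seconds), the reported figures reproduce closely with the approximate mean latencies quoted earlier:

```python
def value_score(quality: float, cost_usd: float, latency_s: float) -> float:
    """Quality per dollar-second: higher is better.

    Assumes the metric is quality / (cost * latency); the latency inputs below
    are approximate mean latencies taken from this post.
    """
    return quality / (cost_usd * latency_s)

print(value_score(0.764, 0.0028, 16))   # GPT-4.1-mini -> ~17 (reported: 16.77)
print(value_score(0.850, 0.0535, 120))  # GPT-5 -> ~0.13 (reported: 0.13)
```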
Task-specific performance
The most interesting finding: no single model dominated across all task types.
| Task Type | Best Performer | Score | Average Latency |
|---|---|---|---|
| Questionnaire | o4-mini | 0.975 | 15.0s |
| RFP Response | GPT-5-mini | 0.974 | 35.6s |
| Compliance | GPT-5 | 0.920 | 150.0s |
| Reasoning | GPT-5-mini | 0.909 | 40.6s |
| Evidence | GPT-5 | 0.818 | 173.6s |
| Hallucination | GPT-5 | 0.694 | 83.0s |
o4-mini beat GPT-5 on questionnaire tasks (0.975 versus 0.950) while running 8x faster. This suggests the optimal strategy involves routing tasks to different models based on their characteristics.
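A routing layer built on this table can be as simple as a lookup keyed on task type. The sketch below maps each task to the best performer above; the task-type labels and the default fallback are assumptions, and a real router would also weigh per-task latency and cost budgets.

```python
# Route each task type to the model that scored best on it in this benchmark.
BEST_MODEL_BY_TASK = {
    "questionnaire": "o4-mini",    # 0.975, 15.0s
    "rfp_response": "gpt-5-mini",  # 0.974, 35.6s
    "compliance": "gpt-5",         # 0.920, 150.0s
    "reasoning": "gpt-5-mini",     # 0.909, 40.6s
    "evidence": "gpt-5",           # 0.818, 173.6s
    "hallucination": "gpt-5",      # 0.694, 83.0s
}

def pick_model(task_type: str) -> str:
    # Fall back to the cheap, fast model for anything unclassified (assumption).
    return BEST_MODEL_BY_TASK.get(task_type, "gpt-4.1-mini")
```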
The bottom line
The data tells a clear story: diminishing returns hit hard at the premium tier. GPT-5-mini captures nearly all of GPT-5's capability while being practical for production use. GPT-4.1-mini delivers exceptional value for high-volume processing.
The 0.6-point quality gap between GPT-5 and GPT-5-mini doesn't justify a roughly 2x latency penalty and 4x cost increase. For most production systems, the primary constraints are response time, cost efficiency, and reliability at scale rather than raw model capability.
Choose based on your actual constraints, not marketing benchmarks. The best model is the one that meets your quality bar while fitting within your latency and cost budgets.
