Case Study - Cutting AI application cost ~70% while holding output quality
A document-generation product ran every long, multi-section generation on a premium frontier model, which was expensive and slow at ~39 seconds per document. We treated both how we call the model and which model we call as engineering choices, not defaults. The result: roughly 70% lower model cost (premium → a right-sized mid-tier default) and ~64% faster generation (39.1 s → 14.2 s, driven by parallel section generation, which is provider-agnostic, plus a faster right-sized model), with output quality held at ≥0.85 semantic parity (up to 1.0) on the production configuration, verified by an independent judge, not by the team that built it.
- Client: Document-generation AI (regulated life-sciences workflow)
- Year
- Service: AI inference cost & latency engineering

Anonymized engineering case study. Client identity, product name, and proprietary model/system internals are withheld. All figures below are drawn from our own controlled benchmark runs across a six-workflow, multi-scope test set.
The problem
The application generates long, structured technical documents — multi-section proposals with scope classification, costing, schedules, and rendered artifacts — from short free-text briefs. The first production version ran every generation on a premium frontier model. It worked, but it had two scaling problems.
Cost. At premium token rates ($5 per 1M input tokens, $25 per 1M output tokens), each long multi-section document was expensive, and cost scaled linearly with usage. At any real customer volume the unit economics did not hold.
Latency. End-to-end generation averaged ~39 seconds per document — slow enough to hurt the interactive experience and to bottleneck batch/throughput use.
The naive assumption — “the frontier model is required for acceptable quality” — had never actually been tested. That assumption was the thing costing the most money.
The approach
We treated both how we call the model and which model we call as engineering optimizations, not defaults. Four levers, applied together.
1. Parallel section generation. A long document is many independent sections. The first version generated them serially — one model call after another — so total latency was the sum of every section. We restructured the pipeline to generate sections concurrently, bounded by a configurable concurrency limit (an async semaphore), with partial-failure salvage so one slow or failed section never sinks the whole document. Wall-clock time dropped to roughly the slowest section instead of the sum of all of them — the single biggest latency win, and it is provider-agnostic.
2. Model right-sizing (tiering). We benchmarked cheaper models on the actual task instead of assuming the premium model was mandatory: we established the premium model’s output as the quality baseline, measured how close cheaper models landed, then made the right-sized tier the default. This is the lever that drives the ~70% cost reduction.
3. Quality-gated regeneration. The right-sized model is the default; a deterministic quality check runs on its output, and only outputs that fail the check are regenerated. Spend follows difficulty — the hard cases get extra compute, the rest clear on the first pass — rather than a flat premium rate on every run.
4. Compute-once upstream stages. Deterministic upstream work (scope classification, scaffolding, costing) is computed once and passed forward as context rather than re-derived per section, keeping repeated work out of the model’s billing path.
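The first lever can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production pipeline: `generate_section` is a hypothetical stand-in for a real model call, and the section names, concurrency limit, and timeout are placeholders.

```python
import asyncio

async def generate_section(name: str) -> str:
    """Hypothetical stand-in for one model call; swap in a real client."""
    await asyncio.sleep(0.01)  # simulated network/model latency
    if name == "schedule":
        raise RuntimeError("simulated provider error")
    return f"[{name}] generated text"

async def generate_document(
    names: list[str], max_concurrency: int = 4, timeout_s: float = 5.0
) -> tuple[dict[str, str], dict[str, str]]:
    """Generate sections concurrently; return (succeeded, failed) by name."""
    semaphore = asyncio.Semaphore(max_concurrency)  # bounds in-flight calls

    async def guarded(name: str) -> str:
        async with semaphore:
            # A per-section timeout keeps one slow call from stalling the batch.
            return await asyncio.wait_for(generate_section(name), timeout_s)

    # return_exceptions=True lets the other sections finish when one fails.
    results = await asyncio.gather(
        *(guarded(n) for n in names), return_exceptions=True
    )

    succeeded: dict[str, str] = {}
    failed: dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed[name] = repr(result)  # candidates for retry or a placeholder
        else:
            succeeded[name] = result
    return succeeded, failed
```

Calling `asyncio.run(generate_document(["scope", "costing", "schedule", "summary"]))` returns three generated sections plus one recorded failure, so the caller can retry or flag only the failed section instead of discarding the document.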
Critically, we held the bar on quality with two independent checks: a deterministic evaluator (structure, required sections, scope coverage), and a semantic parity judge (an independent LLM scoring candidate output against the premium baseline, 0–1). A cost or latency win only counted if it survived both gates.
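One way to express the two-gate rule in code is shown here. It is a sketch under assumptions: the required section names are hypothetical, and `toy_judge` is a placeholder for the independent LLM judge, whose prompt and scoring are not described in this case study. Only the 0.85 threshold comes from the reported results.

```python
from typing import Callable

REQUIRED_SECTIONS = ("scope", "costing", "schedule")  # hypothetical names
PARITY_THRESHOLD = 0.85  # parity floor reported for the production configuration

Document = dict[str, str]

def deterministic_gate(document: Document) -> bool:
    """Structural check: every required section exists and is non-empty."""
    return all(document.get(name, "").strip() for name in REQUIRED_SECTIONS)

def accept(candidate: Document, baseline: Document,
           judge: Callable[[Document, Document], float]) -> bool:
    """A saving counts only if the candidate clears both gates."""
    if not deterministic_gate(candidate):
        return False  # skip the judge call when structure already fails
    return judge(candidate, baseline) >= PARITY_THRESHOLD

def toy_judge(candidate: Document, baseline: Document) -> float:
    """Placeholder for an LLM judge: fraction of sections identical to baseline."""
    same = sum(candidate.get(key) == value for key, value in baseline.items())
    return same / len(baseline)
```

Running the cheap structural check first means the judge is only invoked on outputs that are already well-formed.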
The results
All numbers below are from our six-workflow, multi-scope benchmark.
Cost per generation (token pricing)
| Tier | Input ($/1M tokens) | Output ($/1M tokens) | Output cost vs premium |
|---|---|---|---|
| Premium frontier model (baseline) | $5.00 | $25.00 | — |
| Mid-tier model | $1.50 | $7.50 | −70% |
| Small model | $0.10 | $0.30 | −98.8% |
For an output-heavy long document, moving the default off the premium model to the right-sized mid-tier cut model spend by roughly 70%; workloads that safely run on the small model drop by nearly 99%. The cost reduction comes from the default-tier change; the quality-gated regeneration only adds compute on the minority of outputs that fail the deterministic check.
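The arithmetic behind these percentages can be checked with a short script. The prices are the ones in the table above; the token counts (8,000 input, 20,000 output) are a hypothetical output-heavy document, not a measured figure.

```python
# $ per 1M tokens, taken from the pricing table: (input, output)
PRICES = {
    "premium": (5.00, 25.00),
    "mid": (1.50, 7.50),
    "small": (0.10, 0.30),
}

def cost_per_document(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one generation at the given tier's token prices."""
    in_price, out_price = PRICES[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def saving_vs_premium(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Fractional saving relative to running the same tokens on premium."""
    premium = cost_per_document("premium", input_tokens, output_tokens)
    return 1 - cost_per_document(tier, input_tokens, output_tokens) / premium

# Hypothetical token mix for one long document.
IN_TOKENS, OUT_TOKENS = 8_000, 20_000
print(f"premium: ${cost_per_document('premium', IN_TOKENS, OUT_TOKENS):.3f}")
print(f"mid:     ${cost_per_document('mid', IN_TOKENS, OUT_TOKENS):.3f}")
print(f"mid saving:   {saving_vs_premium('mid', IN_TOKENS, OUT_TOKENS):.1%}")
print(f"small saving: {saving_vs_premium('small', IN_TOKENS, OUT_TOKENS):.1%}")
```

Because the mid-tier input and output prices are both 30% of premium, the mid-tier saving is 70% for any token mix. The small-model saving depends on the mix and falls between 98% (input-dominated) and 98.8% (output-dominated).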
Latency
| Tier | Avg end-to-end latency |
|---|---|
| Premium frontier model (baseline) | 39.1 s |
| Small model | 17.4 s |
| Mid-tier model | 14.2 s |
End-to-end generation dropped ~64% (39.1 s → 14.2 s, mean of the six-workflow benchmark), turning a sluggish interactive flow into a responsive one. Two effects stack here: the right-sized model returns faster per call, and — the durable, model-independent win — parallel section generation collapses many sequential calls into one bounded concurrent batch, so the document waits on its slowest section instead of the sum of all of them. The parallelism win holds regardless of which model is used; it is the engineering, not the model choice.
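The sum-versus-slowest reasoning can be made concrete with a small model. This is an idealized sketch: the per-section latencies are hypothetical, and the scheduler assumes each queued section starts as soon as a concurrency slot frees up, ignoring rate limits and retries.

```python
import heapq

def percent_reduction(before_s: float, after_s: float) -> float:
    """Relative latency reduction, in percent."""
    return (before_s - after_s) / before_s * 100

def serial_latency(section_latencies: list[float]) -> float:
    """Serial pipeline: total time is the sum of every call."""
    return sum(section_latencies)

def parallel_latency(section_latencies: list[float], max_concurrency: int) -> float:
    """Bounded concurrency: each section takes the slot that frees up first."""
    slots = [0.0] * max_concurrency  # time at which each slot becomes free
    heapq.heapify(slots)
    for latency in section_latencies:
        start = heapq.heappop(slots)
        heapq.heappush(slots, start + latency)
    return max(slots)

sections = [4.0, 6.0, 3.0, 5.0, 7.0, 2.0]  # hypothetical seconds per section
print(serial_latency(sections))       # sum of all calls
print(parallel_latency(sections, 6))  # equals the slowest call
print(parallel_latency(sections, 2))  # between the two when the limit is tight
print(round(percent_reduction(39.1, 14.2)))  # reported benchmark means
```

With a limit at least as large as the section count, the document waits only for its slowest section. A tighter limit trades some of that gain for lower peak load on the provider.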
Quality held
- Deterministic checks: the cheaper tiers matched the premium model’s structural quality on the benchmark (identical pass profile on the scored set).
- Semantic parity vs the premium baseline (independent judge, 0–1 scale): ≥0.85 on the production configuration, ranging up to 1.0 on multiple cases. In plain terms: the right-sized model’s documents were judged substantively equivalent to the premium model’s the large majority of the time — and where an output diverged, the deterministic quality check caught it and triggered regeneration.
Why it worked (and why teams miss it)
- Serial generation was the hidden latency tax. Generating independent sections one after another made total time the sum of every call. Concurrency made it the slowest call. Same model, same output — dramatically less waiting, just by changing how the pipeline orchestrates calls.
- The default model was the expensive mistake. The single biggest cost lever was an untested assumption (“we need the frontier model”). Measuring it was cheaper than paying for it.
- Difficulty-proportional spend beats flat-rate. Quality-gated regeneration adds compute only on outputs that fail a deterministic check (typically a minority) instead of paying a premium rate on every run.
- Compute deterministic work once. Upstream stages that don’t need the model (scope classification, scaffolding, costing) should be derived once and passed forward as context, not recomputed per section.
- Trust requires independent verification. Cost and latency wins are only real if an independent quality gate confirms output didn’t degrade. We gated on both a deterministic evaluator and a separate semantic judge, so the savings are defensible to a customer, not just to ourselves.
Headline numbers
- ~70%: lower model cost (premium → right-sized mid-tier)
- ~64%: faster end-to-end generation (39.1 s → 14.2 s, n=6 mean), from parallelism plus the faster right-sized model
- ≥0.85: semantic parity vs premium (up to 1.0)
- 2 gates: deterministic evaluator + independent semantic judge
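The difficulty-proportional spend argument reduces to a one-line expected-cost formula. In this sketch, costs are in relative units with premium at 1.00 and mid-tier at 0.30 (the price ratio from the pricing table). The 20% failure rate is a hypothetical placeholder, and the choice of which tier handles the regeneration is an assumption.

```python
def expected_cost(default_cost: float, regen_cost: float, fail_rate: float) -> float:
    """Expected spend per document with one gated regeneration pass.

    Every document pays default_cost; only the failing fraction pays regen_cost.
    """
    if not 0.0 <= fail_rate <= 1.0:
        raise ValueError("fail_rate must be between 0 and 1")
    return default_cost + fail_rate * regen_cost

def break_even_fail_rate(default_cost: float, regen_cost: float,
                         flat_cost: float) -> float:
    """Fail rate at which gated spend equals a flat rate on every run."""
    return (flat_cost - default_cost) / regen_cost

PREMIUM, MID = 1.00, 0.30  # relative cost per document
FAIL_RATE = 0.20           # hypothetical share of outputs failing the check

print(expected_cost(MID, MID, FAIL_RATE))           # regenerate on the mid tier
print(expected_cost(MID, PREMIUM, FAIL_RATE))       # escalate failures to premium
print(break_even_fail_rate(MID, PREMIUM, PREMIUM))  # fail rate where gating stops paying
```

Under these assumptions, even escalating every failure to the premium model stays cheaper than a flat premium rate until roughly 70% of outputs fail the check.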
Transferable playbook
For any LLM-powered application that generates multi-part output and defaults to a premium model:
- Parallelize independent calls. If your output has independent sections/units, generate them concurrently (bounded by a semaphore) with partial-failure salvage. Latency goes from sum-of-calls to slowest-call — the biggest, most provider-agnostic win, and it survives any future model change.
- Baseline the premium model’s output on your real task and freeze it as the quality reference.
- Benchmark cheaper tiers against that baseline on cost, latency, and an independent quality score — not vibes — then make the right-sized tier the default.
- Add quality-gated regeneration: run the default tier, and regenerate only the outputs that fail a deterministic check.
- Compute deterministic / repeated upstream stages once and pass them as context, out of the model’s billing path.
- Gate every saving behind an independent evaluator so quality parity is provable.
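Two of the playbook steps, compute-once upstream context and quality-gated regeneration, fit together as in the sketch below. All names here are hypothetical: `build_context` uses a toy rule in place of real scope classification, and `fake_generate` stands in for a model call that omits a section on its first attempt.

```python
from typing import Callable

def build_context(brief: str) -> dict:
    """Deterministic upstream work, done once per document (toy rule)."""
    scope = "multi-site" if "sites" in brief else "single-site"
    return {"brief": brief, "scope": scope}

def generate_with_gate(generate: Callable[[dict, int], str],
                       passes_check: Callable[[str], bool],
                       context: dict,
                       max_attempts: int = 3) -> tuple[str, int, bool]:
    """Run the default tier; regenerate only while the deterministic check fails.

    Returns (output, attempts_used, passed).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = generate(context, attempt)  # context is reused, not re-derived
        if passes_check(output):
            return output, attempt, True
    return output, max_attempts, False  # report the failure rather than hide it

def fake_generate(context: dict, attempt: int) -> str:
    """Stand-in for a model call: omits a section on the first attempt."""
    body = f"Scope: {context['scope']}"
    return body if attempt == 1 else body + "\nCosting: ..."

def has_required_sections(text: str) -> bool:
    """Deterministic check: both section labels must be present."""
    return all(label in text for label in ("Scope:", "Costing:"))

context = build_context("Validate the process across three sites")
output, attempts, passed = generate_with_gate(
    fake_generate, has_required_sections, context
)
print(attempts, passed)
```

Returning the attempt count and pass flag alongside the output keeps regeneration spend measurable, which is what lets the saving be reported rather than assumed.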
The result is the same product, materially cheaper and faster, with quality you can demonstrate rather than assert.
What we used
- LLM inference
- Model tiering
- Concurrency / async semaphore
- Quality-gated regeneration
- Compute-once upstream stages
- LLM-as-judge
- Benchmarking
- Cost optimization
Status
Implemented and benchmarked in production-representative testing. Figures are from controlled internal benchmarks on a six-workflow, multi-scope test set; absolute results vary by workload, prompt design, and model availability. Client, product, and proprietary system details are intentionally omitted.
The most expensive line in the app was an untested assumption — that the frontier model was required. Measuring it was cheaper than paying for it. Parallelize the calls, right-size the model, gate every saving behind an independent judge, and the same product gets ~70% cheaper and ~64% faster with quality you can prove.
