Case Study - Reaching frontier-model quality with a local model
Can a self-hosted open model replace a frontier cloud model for biosimilar and monoclonal-antibody CMC proposals — at parity quality, with full data privacy and fixed cost? Through retrieval engineering, deterministic validation, and prompt/enrichment techniques, with zero fine-tuning, a local OLMo-3.1-32B model reached 7:6 parity against Claude Opus 4.8 on a strict, multi-judge head-to-head.
- Client: Synthesis API, biologics CDMO proposals
- Year
- Service: Local model engineering & evaluation

The problem
The proposal platform generates client-facing “Gene-to-GMP” CMC proposals across monoclonal antibodies (mAb), bispecific antibodies (bsAb), and biosimilars. Generation ran on Claude Opus 4.8 in the cloud. Two pressures pushed us toward a local model.
- Privacy — Proposals carry sensitive CDMO pricing and scope. Multi-tenant customers (Aragen, Syngene, and others) want on-prem isolation, not “send our pricing to a cloud LLM.”
- Cost at scale — Per-token cloud cost scales with volume; a self-hosted model is a fixed GPU cost.
The risk was the widely held assumption that open models are materially worse than frontier models. The job was to find out how much worse — and whether the gap is closable.
The hypothesis
The task is constrained, not open-ended: follow a fixed proposal structure, cite specific regulations, pull activities from a catalogue, and let a deterministic engine handle pricing. We hypothesised the frontier-model edge would be small for this shape of work, and that RAG plus validators could close most of it — because the hard part (knowledge and citations) is retrievable, not innate.
What we built
We built a decomposed, grounded generation pipeline — deliberately not a single giant LLM call.
- Regulatory RAG corpus: 49 official guideline documents (ICH Q-series, 21 CFR, FDA/EMA/WHO biosimilar guidance, EU GMP annexes, CTD M4Q), each with provenance and trust tier. Embedded with OpenAI text-embedding-3-large; retrieval is filterable by jurisdiction and status, so a draft guideline can never be the basis for marking an item “Required.”
- Section-level generation: Each of the 13 workflow work-packages is generated independently and concurrently, not as one 9,000-token blob.
- Deterministic validators: Code (not the model) enforces structure (at least 2 client inputs and at least 2 deliverables), citation traceability (every cited regulation ID must appear in the retrieved context), trap-phrase avoidance, and a repair loop that re-prompts on failure.
The design principle that proved decisive: move rules from prompt into code, and decompose the task into bounded pieces. A weaker model is made reliable by not asking it to hold everything at once.
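The validator-plus-repair idea can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the `Section` shape, the `TRAP_PHRASES` list, the citation ID format, and the `generate` callable are all hypothetical stand-ins, and only the rules themselves (at least 2 client inputs, at least 2 deliverables, citations traceable to retrieved context, trap-phrase avoidance, re-prompt on failure) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical trap phrases; the real list is not given in the case study.
TRAP_PHRASES = ("guaranteed approval", "fully compliant")


@dataclass
class Section:
    """Assumed shape of one generated work-package section."""
    client_inputs: list[str]
    deliverables: list[str]
    body: str
    cited_ids: list[str] = field(default_factory=list)


def validate(section: Section, retrieved_ids: set[str]) -> list[str]:
    """Return human-readable failures; an empty list means the section passes."""
    failures = []
    if len(section.client_inputs) < 2:
        failures.append("need at least 2 client inputs")
    if len(section.deliverables) < 2:
        failures.append("need at least 2 deliverables")
    for cid in section.cited_ids:
        # Citation traceability: every cited ID must come from retrieved context.
        if cid not in retrieved_ids:
            failures.append(f"citation {cid!r} not in retrieved context")
    lowered = section.body.lower()
    for phrase in TRAP_PHRASES:
        if phrase in lowered:
            failures.append(f"trap phrase {phrase!r} present")
    return failures


def generate_with_repair(
    generate: Callable[[list[str]], Section],
    retrieved_ids: set[str],
    max_attempts: int = 3,
) -> tuple[Section, list[str]]:
    """Call the model, validate in code, and re-prompt with the failures."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        section = generate(feedback)  # feedback is empty on the first attempt
        feedback = validate(section, retrieved_ids)
        if not feedback:
            break
    return section, feedback
```

The model only ever sees the failure list as extra prompt context; whether a section passes is decided by code, so a weaker model cannot talk its way past a rule.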
The experiments
Experiment A: Closed-book recall (the wrong test). OLMo scored 0.40 on a 365-probe closed-book CMC knowledge eval (judged by GPT-5.5). It looked bad. The lesson: closed-book recall is the wrong metric for a RAG system. The model doesn’t need to memorise ICH Q5B; it needs to reason over the retrieved text.
Experiment B: Open-book, with RAG (the right test). The same model, given retrieved context, produced grounded, correctly cited proposal sections. Across 11 molecule families × 13 workflows = 143 sections, the pipeline passed 143/143 with the repair loop. The lesson: for this task, RAG closed the knowledge gap.
Experiment C: Head-to-head vs Claude, identical pipeline. The real A/B: same RAG, same prompts, same validators, same repair loop; only the model was swapped. We generated all 13 sections with each model and had an independent panel judge each pair.
| Round | Configuration | Result (OLMo : Claude) |
|---|---|---|
| v1 | Baseline RAG | 3 : 10 — Claude clearly better |
| v2 | + specificity prompt + few-shot + CMC-specificity reference doc in RAG + enrichment pass | 7 : 6 — OLMo ahead |
| v3 | Over-tightened citations | 3 : 10 — regressed (scope-creep) |
| v4 | Scope-locked enrichment (deepen, don’t broaden) | 7 : 6 — confirmed, more stable |
Robust verification: the final tally was confirmed by a three-judge panel per section, with each pair judged in both presentation orders and decided by majority vote (13 sections × 3 judges × 2 orders = 78 GPT-5 judgments). Both v2 and v4 held at 7:6, with v4 showing cleaner consensus (more unanimous votes). The single-pass v3 swing showed how far one judge's verdict can move a result, which is why panels are essential.
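A position-swapped majority vote can be tallied as below. This is a sketch under a stated assumption: the case study does not say how a judge's two swapped verdicts were combined, so here a judge whose preference flips with presentation order is treated as position-biased and abstains. The function names and section labels are illustrative.

```python
from collections import Counter


def judge_vote(pref_ab: str, pref_ba: str) -> str:
    """Combine one judge's two position-swapped verdicts into a single vote.

    pref_ab and pref_ba are the model names the judge preferred when the
    outputs were shown in A-then-B and B-then-A order. A judge that flips
    with the order abstains (an assumption, not the documented protocol).
    """
    return pref_ab if pref_ab == pref_ba else "abstain"


def section_winner(judgments: list[tuple[str, str]]) -> str:
    """Majority vote across the judges for one section; 'tie' if no majority."""
    votes = Counter(judge_vote(ab, ba) for ab, ba in judgments)
    votes.pop("abstain", None)
    if not votes:
        return "tie"
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def tally(panel: dict[str, list[tuple[str, str]]]) -> Counter:
    """Count section wins per model across the whole proposal."""
    return Counter(section_winner(j) for j in panel.values())


if __name__ == "__main__":
    # Three illustrative sections, three judges each, two orders per judge.
    panel = {
        "analytical": [("olmo", "olmo"), ("olmo", "olmo"), ("claude", "claude")],
        "cld": [("claude", "claude"), ("olmo", "claude"), ("claude", "claude")],
        "stability": [("olmo", "claude"), ("claude", "olmo"), ("olmo", "claude")],
    }
    print(tally(panel))
```

Counting unanimous sections separately from 2:1 splits is a cheap way to report how stable a headline score such as 7:6 really is.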
Closing the gap
A deep gap-analysis of every Claude win revealed the pattern: OLMo knew the rules but stayed generic where Claude got specific. Claude named exact attributes (deamidation, C-terminal lysine, afucosylation), named techniques (icIEF, CE-SDS, peptide mapping), mapped deliverables to CTD subsections, and reasoned about reference-product comparison. OLMo said “PTMs” and “orthogonal methods.”
The fix was four cheap, training-free levers.
| Lever | What it did | Effect |
|---|---|---|
| 1. Specificity-demanding prompt | Required named CQAs, named techniques, CTD mapping, biosimilar RP-logic | Forced elaboration |
| 2. Few-shot exemplar | Showed the depth bar in-context | Model mimicked it |
| 3. CMC Specificity Reference in RAG | A structured doc listing named attributes / techniques / CTD per workflow, making specifics retrievable | Turned an innate-capability gap into a retrieval problem |
| 4. Scope-locked enrichment pass | Second model call: deepen existing points, do not broaden the topic | Added depth without scope-creep |
Net effect: 3:10 → 7:6. The four levers turned a clear loss into a one-section lead, which we treat as parity rather than a decisive win.
Two precision fixes added rigor. First, validator over-strictness: the citation checker wrongly flagged CTD coordinates (3.2.S.x / 3.2.P.x) as “unsupported.” A per-citation analysis of the 6 “failing” sections showed that 5 of the 6 were validator false positives: legitimate dossier structure, not fabricated citations. Whitelisting CTD coordinates lifted validation from 7/13 to 12/13.
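The whitelist fix can be illustrated with a small pattern check. This is a hedged sketch: the regular expression covers only the 3.2.S.x and 3.2.P.x coordinates named above, and the citation ID strings and function name are hypothetical, since the real checker's ID format is not shown.

```python
import re

# CTD Module 3 coordinates such as 3.2.S.4.1 or 3.2.P.8. These describe
# dossier structure, so they are not expected to appear in retrieved context.
CTD_COORDINATE = re.compile(r"3\.2\.[SP](?:\.\d+)*")


def unsupported_citations(cited: list[str], retrieved_ids: set[str]) -> list[str]:
    """Return citations that are neither retrieved nor CTD coordinates."""
    return [
        c for c in cited
        if c not in retrieved_ids and not CTD_COORDINATE.fullmatch(c)
    ]


if __name__ == "__main__":
    cited = ["3.2.S.4.1", "3.2.P.8", "ICH-Q5E", "EU-GMP-Annex-16"]
    # Only the citation that is neither retrieved nor a CTD coordinate is flagged.
    print(unsupported_citations(cited, retrieved_ids={"ICH-Q5E"}))
```

Using `fullmatch` keeps the whitelist narrow: a string that merely contains a CTD-like fragment is still checked against the retrieved context.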
Second, one genuine model error: OLMo cited “EU GMP Annex 2” for Qualified-Person batch certification, which is actually governed by Annex 16. The root cause was a retrieval gap (Annex 16 was missing from the corpus, and the generic query didn’t surface it), not a hallucination. Adding Annex 16 plus a QP-aware retrieval query fixed it: the section now cites Annex 16 correctly. Final validation: 13/13.
Performance on real hardware
We served OLMo-3.1-32B (BF16, about 64 GB of weights, or roughly 60 GiB) on a single NVIDIA RTX PRO 6000 Blackwell (96 GB) via vLLM.
| Metric | Result |
|---|---|
| Single-request decode | ~20 tok/s (BF16, memory-bandwidth bound — matches published 32B benchmarks) |
| Concurrent throughput (8 parallel) | ~172 tok/s total, same wall-time as 1 request |
| Concurrent throughput (24 parallel) | ~481 tok/s total, near-linear scaling |
| Full 13-section proposal (concurrent) | ~60 seconds |
The key insight: single-request speed is the wrong metric for a multi-tenant SaaS. vLLM batches concurrent requests on the GPU, so 8 parallel requests finish in the same wall-time as 1. Decomposing each proposal into 13 concurrent sections means a full proposal lands in ~60s, and many tenants’ proposals can run in parallel. Published benchmarks show this card sustaining ~930 tok/s at 50 concurrent users.
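The fan-out pattern behind the 13 concurrent sections might look like the following. It is a sketch only: `generate_section` is a stand-in that sleeps instead of calling the inference server, and the section names and the `max_in_flight` parameter are placeholders rather than details from the actual pipeline.

```python
import asyncio
import time

# 13 placeholder work-package names; the real names are not listed here.
SECTION_NAMES = [f"work_package_{i}" for i in range(1, 14)]


async def generate_section(name: str) -> str:
    """Stand-in for one model call; a real version would call the server."""
    await asyncio.sleep(0.05)  # simulated decode time
    return f"[draft of {name}]"


async def generate_proposal(
    names: list[str], max_in_flight: int = 13
) -> dict[str, str]:
    """Generate all sections concurrently, capped at max_in_flight requests."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(name: str) -> str:
        async with limit:
            return await generate_section(name)

    # gather preserves input order, so drafts line up with names.
    drafts = await asyncio.gather(*(bounded(n) for n in names))
    return dict(zip(names, drafts))


if __name__ == "__main__":
    start = time.perf_counter()
    proposal = asyncio.run(generate_proposal(SECTION_NAMES))
    elapsed = time.perf_counter() - start
    print(f"{len(proposal)} sections in {elapsed:.2f}s")
```

Because the requests overlap, wall-time tracks the slowest section rather than the sum of all 13, which is the same effect the server-side batching gives across tenants.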
Hardware lessons learned:
- BF16 32B (~64 GB) does not fit a 48 GB L40S — it needs quantization or a 96 GB card. The Blackwell fits it natively.
- Blackwell (sm_120) needed CUDA 12.9+ and FlashInfer disabled (an arch-detection bug); FlashAttention-2 worked. Bleeding-edge hardware has driver/library rough edges.
- Use a Deep Learning AMI (drivers pre-installed), not base Ubuntu.
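The memory-fit lesson follows from simple arithmetic: BF16 stores each parameter in 2 bytes. The sketch below treats the model as exactly 32 billion parameters (an approximation) and counts weights only, ignoring KV cache and activation memory, which need additional headroom.

```python
def weight_footprint(params_billion: float, bytes_per_param: float) -> tuple[float, float]:
    """Return the weight size in decimal GB and binary GiB."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 1e9, total_bytes / 2**30


def weights_fit(params_billion: float, bytes_per_param: float, card_gb: float) -> bool:
    """True if the weights alone fit; real serving also needs KV-cache headroom."""
    size_gb, _ = weight_footprint(params_billion, bytes_per_param)
    return size_gb < card_gb


if __name__ == "__main__":
    gb, gib = weight_footprint(32, 2)  # BF16 = 2 bytes per parameter
    print(f"weights: {gb:.1f} GB ({gib:.1f} GiB)")
    print("fits 48 GB card:", weights_fit(32, 2, 48))
    print("fits 96 GB card:", weights_fit(32, 2, 96))
```

The same function shows why quantization changes the picture: at 1 byte per parameter the weights drop to about 32 GB, which is under the 48 GB limit.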
Cost
| State | Cost |
|---|---|
| Blackwell instance running | ~$2.5–3.5 / hr |
| Stopped (model + weights persist) | ~$16 / month (EBS only) |
| Cloud frontier model | per-token, scales with volume |
For a high-volume, multi-tenant proposal product, the fixed-cost local model wins economically at scale — and is the only option that satisfies on-prem data privacy.
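To compare against a per-token cloud price, the hourly figure can be converted into an effective cost per million generated tokens. The sketch below uses the hourly range and throughput numbers reported above and assumes the GPU is kept fully busy, which real traffic rarely achieves; it counts output tokens only and includes no cloud price, since none is given.

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Effective cost per 1M generated tokens at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hourly range and throughputs taken from the tables above.
    for label, tok_s in [("1 request", 20), ("8 parallel", 172), ("24 parallel", 481)]:
        low = cost_per_million_tokens(2.5, tok_s)
        high = cost_per_million_tokens(3.5, tok_s)
        print(f"{label}: ${low:.2f} to ${high:.2f} per 1M tokens")
```

The spread between the single-request and 24-parallel rows is the economic argument in miniature: the instance costs the same per hour either way, so utilisation decides whether the fixed cost beats per-token pricing.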
- Parity vs Claude Opus 4.8: 7:6
- Sections passed with RAG + repair loop: 143/143
- Full 13-section proposal (concurrent): ~60s
- Fixed on-prem cost when parked: ~$16/mo
The verdict
OLMo-3.1-32B, with retrieval engineering and prompt/enrichment techniques, is at parity with Claude Opus 4.8 for biosimilar CMC proposal generation — confirmed 7:6 by a robust multi-judge panel, with zero fine-tuning.
- Quality — 7:6 vs Claude (panel-confirmed). OLMo wins the science-depth sections (analytical, DSP, stability, USP, comparability, tech-transfer); Claude edges the regulatory-judgment/scope sections (CLD, monoclonality, RegCMC).
- Citations — 13/13 sound, zero fabrications.
- Speed — Full proposal ~60s; ~480 tok/s under concurrency.
- Privacy & cost — Fully on-prem, ~$16/mo parked.
The remaining gap, and how to close it: the 6 sections Claude still wins are regulatory judgment and scope — knowing what’s appropriate to include, not just what’s specific. This is the one thing RAG and prompting can’t fully teach. The path from parity to dominance is fine-tuning (LoRA) on Aragen’s real past proposals — teaching house voice, scope judgement, and the elaboration patterns Aragen uses. Data: Aragen’s annotated proposals plus Claude-distilled synthetic pairs (filtered by the validators). That is the next investment, not a prerequisite — parity is already achieved.
What we used
- RAG
- vLLM
- NVIDIA Blackwell
- OLMo-3.1-32B
- Claude Opus 4.8
- CMC proposals
- Biosimilars
- LLM-as-judge
- Self-hosted LLM
Lessons that generalise
- For RAG systems, never judge a model on closed-book recall. Test open-book — it’s a different (and far better) result.
- Decomposition plus concurrency beats single-shot generation for both speed and reliability on a local model. A full proposal as 13 concurrent sections finishes faster than one giant structured call, and the GPU batches it for free.
- Move rules into code. Deterministic validators make a weaker model safe; the model writes prose, code enforces structure and citations.
- The quality gap is mostly elaboration depth, not knowledge — and that’s the most fixable kind, addressable with prompting, few-shot, and a structured “specificity reference” in RAG.
- Use a judge panel, not one judge. A single LLM judge is noisy enough to flip a verdict (we saw 7:6 → 3:10 → 7:6). Majority-vote across position-swapped judges gives a defensible answer.
- Single-request latency is the wrong hardware metric for multi-tenant SaaS — concurrent throughput is what matters, and modern GPUs batch beautifully.
