The Inference Log - We have spent years going deep on AI.

An engineering house for pharma, life-sciences, and the industries that cannot afford to be wrong. Around 130 AI/ML projects. Six product families. Every headline claim checked by an independent judge, not the team that built it. No marketing gloss, just the engineering.

Most AI work is a demo. A demo is a thing that works once, in good light, with the person who built it standing nearby.

We do not build demos. We build systems that hold up when an auditor reads them, when the data cannot leave the building, and when the answer has to be right the first time. That takes hours. Years of them. You learn where models lie. You learn that the most expensive line in an application is usually an untested assumption. You learn to put a separate judge model between your confidence and the truth, because the team that built a thing is the worst team to grade it.

This page is the record of that time. Read it as evidence. Every project here taught us something we now know cold — about grounding, about evals, about running real models privately on hardware you control. We are Productive, Effective, Progressive. PEP, in that order, earned the slow way.

Who builds this - The AI/ML depth house, built for regulated work.

Prag-Matic is an AI and ML engineering house based in Bangalore, working worldwide. Our founders carry roughly two decades each across global tech and Bay Area enterprise SaaS. We go deep where the stakes are highest — pharma, life-sciences, fintech, healthcare, telecom — building grounded, verifiable systems that survive contact with a regulator. We are radically honest about what models can and cannot do, because in regulated work, the gap between the two is where people get hurt.

On-prem agents

The full agent loop, on hardware you own

For regulated work, the question is rarely whether a model is good enough. It is whether your data is allowed to leave the building. So we learned to build the whole agent — reasoning, tool-calling, memory, RAG, even voice — running entirely on local, open models on the customer's own machines. No data egress. No per-token bill. The model serves itself behind your firewall, and the audit trail stays in the room.

  • PydanticAI and local model-serving APIs driving the loop.
  • MCP tools calling against local servers, not the public internet.
  • Self-hosted ~32B model parked at roughly $16 a month.
  • Private agentic AI with zero data egress.
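
One way to make "no data egress" checkable rather than promised is to refuse, at startup, any model or tool endpoint that is not a loopback or private address. The sketch below is illustrative only: the helper names and example URLs are hypothetical and not taken from any system described here. It deliberately accepts only `localhost` and IP literals, so a DNS name cannot quietly point outside the network.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback/private IP literal.

    Hypothetical helper: hostnames other than 'localhost' are rejected rather
    than resolved, which is stricter than most deployments would need.
    """
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal: treat as untrusted
    return addr.is_loopback or addr.is_private


def assert_local_endpoints(urls: list[str]) -> None:
    """Raise before the agent starts if any configured endpoint could leave the network."""
    offenders = [u for u in urls if not is_local_endpoint(u)]
    if offenders:
        raise ValueError(f"non-local endpoints configured: {offenders}")
```

A check like this would sit in front of the agent's configuration, covering the model-serving URL and every tool server alike. It is a guard on configuration, not a substitute for network-level controls.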

Regulated parity

A local model matched a frontier model where it mattered

On the hardest regulatory-judgment panel of a pharma CMC workload, we brought a self-hosted open ~32B model level with a leading frontier model. Not on average — on the judgment that an inspector actually reads. We got there by decomposing the problem into a grounded pipeline and fine-tuning the local model on the part that was hard, then proving the result against an independent judge.

  • Local OLMo-3.1-32B reached frontier level on the hardest panel.
  • QLoRA fine-tune on the specific hard judgment.
  • Every cited source checked out — no fabricated citations.
  • On-prem, customer-controlled, ~$16/month parked.
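
The "no fabricated citations" check lends itself to a deterministic validator. The following is a minimal sketch, assuming citations appear in a draft as bracketed IDs such as `[DOC-007]`. That marker format, the function name, and the sample corpus are all hypothetical. It confirms only that each cited ID exists in the corpus, not that the cited passage supports the claim.

```python
import re

CITATION_PATTERN = re.compile(r"\[(DOC-\d+)\]")  # hypothetical marker format


def find_unknown_citations(draft: str, corpus_ids: set[str]) -> list[str]:
    """Return cited IDs that are absent from the corpus, in order of first appearance."""
    unknown: list[str] = []
    for cited in CITATION_PATTERN.findall(draft):
        if cited not in corpus_ids and cited not in unknown:
            unknown.append(cited)
    return unknown


corpus = {"DOC-001", "DOC-002", "DOC-003"}
draft = "Stability limits follow [DOC-002]; the hold time is set in [DOC-007]."
print(find_unknown_citations(draft, corpus))  # ['DOC-007']
```

An empty result is the passing condition. Anything else names the exact citation to fix.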

Evals & honesty

An independent judge gates every claim

We do not let the team that built a system grade it. Every headline claim is verified by a separate judge model with its own prompt and its own view of the truth. We built an in-context-learning eval framework for hallucination detection and model scorecards across providers, so that 'it works' is a measured statement, not a feeling. This is the discipline regulated work demands, and the one we are most stubborn about.

  • ICL eval framework: hallucination detection and scorecards.
  • Scored across Claude, GPT-4o, Mistral, and SAP.
  • pydantic-evals, DeepEval, and Inspect in the eval stack.
  • Verified by an independent judge, not the builders.
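
A scorecard of this kind can be pictured as a small aggregation over judge verdicts. The sketch below is an assumption about shape, not the framework itself: the `Verdict` fields and model labels are hypothetical. It turns per-answer judgments into a hallucination rate per model and refuses any verdict where the judge is the model under test.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    provider: str        # model under test (hypothetical label)
    judged_by: str       # identity of the judge model
    hallucinated: bool   # the judge's call on one answer


def scorecard(verdicts: list[Verdict]) -> dict[str, float]:
    """Hallucination rate per provider, rejecting any self-graded verdict."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for v in verdicts:
        if v.judged_by == v.provider:
            raise ValueError(f"{v.provider} may not grade its own output")
        totals[v.provider] += 1
        flagged[v.provider] += int(v.hallucinated)
    return {p: flagged[p] / totals[p] for p in totals}
```

Making the independence rule a hard error rather than a convention means a self-graded result fails loudly before it can reach a scorecard.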

Document intelligence

Visual RAG that reads documents the way people do

Regulated knowledge lives in PDFs — diagrams, tables, stamps, signatures, layout that carries meaning. Flattening that to text loses the case. We built late-interaction visual retrieval that searches documents as images, preserving the page, and structured-extraction pipelines that pull clean fields out of the mess. Years on this stack taught us where it breaks and how to ground it.

  • Harvestor: visual document search on ColQwen2.5 + LanceDB.
  • Knova app, API, and the Knova-Forge optimization platform on DSPy + DeepEval.
  • HireMagic: Mistral structured extraction of resumes and invoices.
  • ColPali / ColQwen / ColNomic late-interaction stack.
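
Late-interaction retrieval scores a page by letting every query token find its best-matching page patch and summing those best matches (the MaxSim operator used by ColBERT-style and ColPali-style models). Here is a minimal pure-Python sketch with toy two-dimensional vectors. Real embeddings come from the model and are far larger, and the function name is illustrative.

```python
def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query token, take the highest dot product
    against any page patch, then sum over query tokens. Assumes page_vecs is non-empty."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # has a strong match for both tokens
page_b = [[1.0, 0.0], [0.6, 0.0]]               # matches only the first token

print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
```

Because each token is matched against individual patches, a table cell or stamp can win a match on its own, which is what a single pooled page vector tends to wash out.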

On-prem vision

Safety and compliance vision that never leaves the floor

On a factory floor or a clean room, the footage is the sensitive asset. We build vision systems that run on-prem with zero frames leaving the building — SOP and PPE compliance flagging, multi-camera search, and cross-camera person re-identification. Re-ID, to be precise, not facial recognition. The distinction matters, and we hold it.

  • Spectra: SOP / PPE / safety-violation flagging on MLX-VLM + Gemma + SAM 3.
  • Lumina: multi-camera search and cross-camera re-ID on SigLIP 2 + OSNet + SAM 3.
  • 0 frames leaving the building.
  • Web console and Terraform-managed GPU infrastructure.
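
The re-ID distinction shows up in the data model: matching runs between appearance embeddings of anonymous tracks, with no lookup against a database of named identities. Below is a minimal sketch under that assumption. The track IDs, the toy two-dimensional embeddings, and the 0.8 threshold are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query: list[float], gallery: dict[str, list[float]], threshold: float = 0.8):
    """Return the anonymous track ID most similar to the query, or None below threshold."""
    if not gallery:
        return None
    track_id, score = max(
        ((tid, cosine(query, emb)) for tid, emb in gallery.items()),
        key=lambda pair: pair[1],
    )
    return track_id if score >= threshold else None


gallery = {"cam2-track7": [1.0, 0.0], "cam3-track2": [0.0, 1.0]}
print(best_match([0.9, 0.1], gallery))  # cam2-track7
print(best_match([1.0, 1.0], gallery))  # None
```

The output is a link between two tracks, such as "the person in camera 2, track 7", and never a name.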

Regulated multi-agent

Agents that collaborate across hard domains

One model is a tool. Several models that reason together, hand off work, and check each other are a system. We built a multi-agent collaborator that works across ESG, mathematics, patent, regulatory, and research domains, plus long-term memory and autonomous agents — on PydanticAI, LangGraph, and Google ADK. The orchestration is where the years show.

  • Nexus: multi-agent across ESG / math / patent / regulatory / research.
  • Memory Agent for long-term recall.
  • 13 concurrent work-packages coordinated on the Synthesis pipeline.
  • Frameworks: PydanticAI, LangGraph, Google ADK.

Local-inference research

We study the inference stack down to bare metal

To run models cheaply and privately on a customer's hardware, you have to understand how they actually run: the serving layer, the quantization, the metal underneath. So we study it hands-on, with MLX, vLLM on Metal, and llama.cpp, along with low-bit quantization and model right-sizing. We have experimented with running frontier-class mixture-of-experts LLMs locally at low bit-depth, and with on-device real-time speech-to-text models. This research is what earns us the right to say 'on-prem' and mean it.

  • Hands-on MLX, vLLM-on-Metal, and llama.cpp.
  • Experiments with frontier-class MoE LLMs at low bit-depth.
  • Experiments with on-device real-time speech-to-text.
  • Low-bit quantization and model right-sizing — studied, not guessed.
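
The arithmetic behind right-sizing is simple enough to show: weight storage is roughly parameter count times bits per weight, divided by eight to get bytes. The sketch below applies that to a 32B-parameter model. It counts weights only, ignoring the KV cache, activations, and quantization metadata, so real footprints run somewhat higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"32B model at {bits}-bit: ~{weight_memory_gb(32, bits):.0f} GB")
# 32B model at 16-bit: ~64 GB
# 32B model at 8-bit: ~32 GB
# 32B model at 4-bit: ~16 GB
```

Halving the bit-depth halves the weight footprint, which is often the difference between a model fitting on hardware a customer already owns and one that does not.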

Flagships - A few we’d put in the room with you.

Not a portfolio dump. The handful that show what the work does when the answer has to be right and the data has to stay home.

Pharma · Gene-to-GMP CMC

Synthesis

A platform that drafts the chemistry, manufacturing, and controls proposal that takes a therapy from gene to GMP. The data could not leave the customer's control, and the cost could not float, so we pushed the whole thing onto a self-hosted open model. We decomposed it into a grounded pipeline — a 49-document regulatory RAG corpus, 13 concurrent work-packages, deterministic validators — and a QLoRA fine-tune brought a local OLMo-3.1-32B level with a leading frontier model on the hardest regulatory-judgment panel. Every cited source checked out. On-prem, customer-controlled, roughly $16 a month to keep the model parked.

  • Self-hosted OLMo-3.1-32B
  • 49-doc regulatory RAG
  • QLoRA fine-tune
  • Frontier-level on the hard panel
  • ~$16/mo parked

Evals · Hallucination detection

The ICL eval framework

Before we ship a claim, we measure it — with a judge that did not write the thing it is grading. This is our in-context-learning evaluation framework: hallucination detection and model scorecards run across Claude, GPT-4o, Mistral, and SAP, so a model's honesty is a number, not an opinion. It is the backbone under every other project on this page, and the reason we can be blunt about what works.

  • Independent judge model
  • Hallucination detection
  • Cross-provider scorecards
  • pydantic-evals / DeepEval / Inspect

On-prem vision · Industrial compliance

Spectra & Lumina

Two on-prem vision systems for the regulated floor. Spectra flags SOP, PPE, and safety violations on an MLX-VLM, Gemma, and SAM 3 stack. Lumina does multi-camera search and cross-camera person re-identification — re-ID, not facial recognition — on SigLIP 2, OSNet, and SAM 3, with a web console and Terraform-managed GPU infrastructure. Both share one non-negotiable: zero frames leave the building.

  • MLX-VLM + Gemma + SAM 3
  • SigLIP 2 + OSNet
  • Cross-camera re-ID, not FR
  • 0 frames egress
  • Terraform GPU infra

Document intelligence · Visual RAG

Harvestor & Knova

Regulated answers live in documents that lose their meaning when flattened to text. Harvestor searches them as images, on a ColQwen2.5 late-interaction model over LanceDB, keeping the page intact. Knova adds an app, an API, and the Knova-Forge platform for tuning the GenAI pipeline itself on DSPy and DeepEval. Years on late-interaction retrieval are why the grounding holds.

  • ColQwen2.5 late-interaction
  • LanceDB
  • DSPy + DeepEval
  • Visual RAG over regulated docs

Regulated multi-agent

Nexus

A multi-agent collaborator built to reason across the domains regulated work actually spans — ESG, mathematics, patent, regulatory, and research. Agents specialize, hand off, and check one another, with long-term memory underneath. The hard part was never a single agent. It was making several of them agree on a defensible answer.

  • Multi-domain agents
  • Long-term memory
  • PydanticAI / LangGraph / ADK
  • Defensible-answer orchestration

Voice · Safety-critical

The voice safety suite

When AI speaks to people, the failure modes get human. We built a safety suite that routes crisis cases to real help and refuses to play along with a delusion. It routed 4 of 4 crisis categories to human help and never validated a delusion — the kind of result that only matters because we tested for it before anyone shipped it.

  • 4/4 crisis categories routed safely
  • Never validates delusions
  • Human escalation by design
  • Tested before shipped

By the numbers - Years of building, counted.

Defensible counts from the work — not a single fabricated metric.

~130

AI/ML projects shipped and studied

6

product families built end to end

10+

model providers run in production and evals

~32B

self-hosted model at frontier-level on a regulated panel

~$16/mo

to keep that model parked on-prem

0

frames leaving the building, on-prem vision

49

documents in the Synthesis regulatory RAG corpus

4/4

crisis categories routed safely by the voice safety suite

The toolbox - Breadth is the evidence, not the pitch.

The breadth is the point — not as a menu, but as proof of the hours. We run Anthropic Claude, OpenAI, Google Gemini, Mistral, and Cohere alongside open models: Qwen and ColQwen2.5, Llama, SigLIP 2, SAM 3, Florence-2, Gemma, XTTS v2, and Kokoro. We work the late-interaction retrieval stack (ColPali, ColQwen, ColNomic) and the eval stack (pydantic-evals, DeepEval, Inspect) as deeply as the models themselves. And we study the inference layer down to bare metal — MLX, vLLM on Metal, llama.cpp, low-bit quantization, model right-sizing — so that running a real model privately on a customer’s own hardware is something we have actually done, not something we hope will work.

Where it runs - Built for the rooms where it has to hold.

  • Pharma & life-sciences. Gene-to-GMP CMC drafting, grounded and on-prem, where the citation has to check out and the data cannot leave.
  • Healthcare. Clinical and operational AI built to the standard of evidence the work demands, with every claim graded by a separate judge.
  • Fintech. Regulated voice and document systems that authenticate, verify, and escalate where a regulator requires a human.
  • Telecom. High-volume multilingual automation and retrieval over dense operational knowledge.
  • Industrial & compliance. On-prem vision for SOP, PPE, and safety enforcement with zero footage egress.

How we work - Ground it. Decompose it. Judge it. Run it on-prem.

Ground the problem

We start with the constraint that matters most — usually 'this data cannot leave' — and design the system around it instead of bolting privacy on at the end.

Decompose & retrieve

Hard problems become grounded pipelines: a real document corpus, concurrent work-packages, deterministic validators, and retrieval that keeps the source intact.
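
The shape of such a pipeline can be sketched with `asyncio`: run the work-packages concurrently, then pass every result through a deterministic validator before anything is accepted. Everything named here is hypothetical (the package names, the result fields, the rule that each draft must carry at least one citation), and `run_package` is a stand-in for real retrieval and drafting.

```python
import asyncio


async def run_package(name: str) -> dict:
    """Stand-in for one work-package; a real one would retrieve sources and draft text."""
    await asyncio.sleep(0)  # yield control, simulating I/O
    return {"package": name, "draft": f"draft for {name}", "citations": ["DOC-001"]}


def validate(result: dict) -> bool:
    """Deterministic check: the draft is non-empty and carries at least one citation."""
    return bool(result["draft"]) and len(result["citations"]) > 0


async def run_pipeline(names: list[str]) -> list[dict]:
    """Run all packages concurrently and fail the whole run if any result is invalid."""
    results = await asyncio.gather(*(run_package(n) for n in names))
    failed = [r["package"] for r in results if not validate(r)]
    if failed:
        raise ValueError(f"validation failed for: {failed}")
    return list(results)


packages = [f"wp-{i:02d}" for i in range(1, 14)]  # 13 hypothetical work-packages
results = asyncio.run(run_pipeline(packages))
print(len(results))  # 13
```

Keeping the validator free of model calls means the same input always produces the same pass or fail, which is what makes the result auditable.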

Right-size the model

We fit the model to the workload — fine-tuning a local model on the genuinely hard part rather than paying frontier rates for the whole job.

Judge independently

A separate judge model grades every headline claim, checks for hallucination, and confirms every citation. The builders do not grade their own work.

Run it on-prem

We deploy where the data lives — on the customer’s hardware, behind their firewall, with no egress and no per-token bill — and prove it holds at cost.

If your data can’t leave the building, talk to us. We do the regulated, grounded, on-prem work that most teams avoid, and we have years of logged work behind us to prove we know it cold.

The Inference Log. No marketing gloss, just the engineering.

Tell us about your project

Our office

  • Bangalore
    Nubewired Software Technologies Pvt. Ltd.
    #213, Rainmakers Workspace, 2nd Floor
    Ramanashree Arcade 18, MG Road
    Bangalore - 560001, Karnataka, India
    CIN: U62013KA2024PTC186730