Why a substrate
Frontier models can write scientific code that runs and science that reads well, yet both can quietly violate dimensions, limits, and conservation laws. Verification, not generation, is the bottleneck for AI-assisted scientific work.
The gap
On the SPOT benchmark (83 peer-reviewed papers with 91 known errors), even the strongest current model reaches only 21.1% recall at 6.1% precision. The other models score near zero.
The pattern
Generate a handful of candidates, score each against a structured, auditable reference, and return the best. This verifier-guided sampling is where the substrate pays off. On HumanEval-Sci (73 tasks across 7 domains), best-of-5 reranking with Lemma’s verification engine lifts the overall score of Llama 3.1 8B by +0.055 (0.630 → 0.685), recovering ~88% of the functional-oracle ceiling at 97.3% agreement with the oracle’s pick. The same +0.055 lift replicates on Mistral Nemo 12B (0.586 → 0.641).
The win is at sampling time, not retrieval time. Giving the model the cards as a tool before it answers did not help on this benchmark; the deployable result is using Lemma to rank candidates after generation.
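The generate, score, pick loop described above can be sketched in a few lines. This is a minimal illustration, not Lemma’s implementation: `best_of_n`, `make_cycling_generator`, and `toy_score` are hypothetical names, the generator stands in for a model call, and the scorer stands in for a card-based check.

```python
import itertools
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 5,
) -> Tuple[str, List[float]]:
    """Sample n candidates, score each one, and return the top scorer plus all scores."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, candidate) for candidate in candidates]
    # max() keeps the first candidate on ties, so the pick is deterministic.
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores


def make_cycling_generator(samples: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays fixed samples in order."""
    iterator = itertools.cycle(samples)
    return lambda prompt: next(iterator)


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical stand-in for a verifier: rewards the c**2 form only."""
    return 1.0 if candidate.endswith("c**2") else 0.0


generate = make_cycling_generator(["E = m * c", "E = m * c**2", "E = m**2 * c"])
best, scores = best_of_n("rest energy of a mass m", generate, toy_score, n=5)
print(best)    # E = m * c**2
print(scores)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```

Because the scorer is only called after generation, the model never sees the reference, which matches the sampling-time (rather than retrieval-time) use described above.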
The reference
The reference is schema-validated cards, not a trained reward model. Each card is a reusable, vendor-neutral specification of a scoring metric over a domain principle. The check is:
- interpretable: per-axis severity, not one opaque scalar;
- auditable: the corpus is open and every verdict cites the card it used;
- portable: model-agnostic, and no labelled correct/incorrect dataset is needed because nothing is trained. The corpus itself is the signal.
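To make the three properties concrete, here is one way a card and its verdict could be shaped. This is a sketch under stated assumptions, not Lemma’s actual schema: `Card`, `Verdict`, `check`, the field names, and the kinetic-energy example are all hypothetical, and severities are assumed to lie in [0, 1] with 0 meaning no violation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict

# A candidate here is a function f(mass, velocity) -> energy (illustrative only).
Candidate = Callable[[float, float], float]


@dataclass
class Card:
    """Hypothetical card: one domain principle, checked along named axes."""
    card_id: str
    principle: str
    axes: Dict[str, Callable[[Candidate], float]]  # axis name -> severity in [0, 1]

    def __post_init__(self) -> None:
        # Minimal stand-in for schema validation.
        if not self.card_id or not self.axes:
            raise ValueError("a card needs an id and at least one axis")


@dataclass
class Verdict:
    """Per-axis severities, tagged with the card that produced them."""
    card_id: str
    severities: Dict[str, float]

    @property
    def worst_axis(self) -> str:
        return max(self.severities, key=self.severities.get)


def check(card: Card, candidate: Candidate) -> Verdict:
    """Run every axis of the card against the candidate."""
    return Verdict(card.card_id, {name: axis(candidate) for name, axis in card.axes.items()})


KINETIC_ENERGY_CARD = Card(
    card_id="kinetic-energy",
    principle="Kinetic energy vanishes at rest and scales with velocity squared.",
    axes={
        "zero_velocity_limit": lambda f: 0.0 if f(1.0, 0.0) == 0.0 else 1.0,
        "quadratic_scaling": lambda f: 0.0 if math.isclose(f(1.0, 2.0), 4 * f(1.0, 1.0)) else 1.0,
    },
)

verdict = check(KINETIC_ENERGY_CARD, lambda m, v: 0.5 * m * v)  # wrong: linear in v
print(verdict.card_id, verdict.severities)
```

The verdict keeps one severity per axis and carries the card id, so a failing candidate can be traced to the specific principle and axis it violated, and nothing in the check depends on which model produced the candidate.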
Below the agent layer
Lemma is infrastructure, not another agent. Other tools call it; it does not try to be the agent. The five-year test: does anyone publish a paper without a verification step? If the answer becomes no, the substrate has won.