Putting it together — make an AI more correct

The other pages built the pieces. This one shows the payoff: using Lemma’s checks to make a real AI model measurably more correct — and how to try it yourself, even on your own laptop.

The idea: ask more than once, then check

AI models are not perfectly reliable — ask the same question twice and you can get a better answer and a worse one. So instead of trusting a single answer, do this:

Ask the model the same question several times to get several answers.
Run each answer through Lemma.
Keep the one Lemma scores highest.

It’s like asking five people the same question and then letting a fact-checker pick the best reply. Simple, and it works.
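
The three steps above can be sketched in a few lines of Python. This is a minimal illustration of best-of-N selection, not Lemma's actual API: `ask_model` and `score_answer` are hypothetical stand-ins for "call the AI" and "score the answer with Lemma", and the canned answers in the demo are made up.

```python
from typing import Callable, List, Tuple


def best_of_n(
    question: str,
    ask_model: Callable[[str], str],       # hypothetical: returns one answer per call
    score_answer: Callable[[str], float],  # hypothetical: higher means "checks out better"
    n: int = 5,
) -> Tuple[str, float]:
    """Ask the same question n times and return the highest-scoring answer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: ask several times.
    answers: List[str] = [ask_model(question) for _ in range(n)]
    # Step 2: score each answer.
    scored = [(score_answer(a), i, a) for i, a in enumerate(answers)]
    # Step 3: keep the best one; on a tie, the earliest answer wins.
    best_score, _, best_answer = max(scored, key=lambda t: (t[0], -t[1]))
    return best_answer, best_score


# Demo with made-up stand-ins (no real model or checker involved).
canned = iter(["E = m * v", "E = 0.5 * m * v**2", "E = m / v"])
best, score = best_of_n(
    "What is the formula for kinetic energy?",
    ask_model=lambda q: next(canned),
    score_answer=lambda a: 1.0 if "v**2" in a else 0.0,
    n=3,
)
print(best, score)
```

The selection step is deliberately dumb: all of the intelligence lives in the scoring function, which is why the quality of the checker decides how much this approach helps.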

flowchart LR
  P[a question] --> G[ask the AI a few times]
  G --> C1[answer 1]
  G --> C2[answer 2]
  G --> C3[answer 3]
  C1 --> S[score each one with Lemma]
  C2 --> S
  C3 --> S
  S --> B[keep the best-scoring answer]
  classDef c fill:#1e3a5f,stroke:#4a90d9,color:#fff;
  class S c;

Does it actually help? Yes — here are the numbers

The approach was tested on HumanEval-Sci, a benchmark of 73 scientific-coding tasks across 7 subjects. Using Lemma to pick the best of 5 answers:

A free model (Llama 3.1 8B) improved from a score of 0.630 to 0.685.
That gain is about 88% of the most you could possibly gain: Lemma picked nearly as well as a perfect answer key choosing among the same 5 answers would.
It agreed with that perfect answer key 97% of the time.
The same trick worked on a different model family too (Mistral), so it’s not a fluke of one model.
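
To make terms like "share of the possible gain" and "agreement with a perfect answer key" concrete, here is one way such numbers could be computed. The definitions below are an assumption about what the metrics mean, not the benchmark's documented method, and the toy data is invented for illustration. It is not taken from the benchmark results.

```python
from typing import Dict, Sequence


def selection_metrics(
    correct: Sequence[Sequence[bool]],  # correct[t][k]: is candidate k of task t right?
    picked: Sequence[int],              # picked[t]: index the selector chose for task t
) -> Dict[str, float]:
    """Compare a selector against a single sample and a perfect answer key."""
    n_tasks = len(correct)
    # Baseline: expected score if you just took one answer at random.
    baseline = sum(sum(row) / len(row) for row in correct) / n_tasks
    # Selected: score of the answers the selector actually kept.
    selected = sum(row[i] for row, i in zip(correct, picked)) / n_tasks
    # Oracle: a perfect answer key succeeds whenever any candidate is right.
    oracle = sum(any(row) for row in correct) / n_tasks
    headroom = oracle - baseline
    recovered = (selected - baseline) / headroom if headroom > 0 else float("nan")
    # Agreement: tasks where the selector did as well as the perfect key.
    agreement = sum(row[i] == any(row) for row, i in zip(correct, picked)) / n_tasks
    return {
        "baseline": baseline,
        "selected": selected,
        "oracle": oracle,
        "recovered": recovered,
        "agreement": agreement,
    }


# Toy data: 4 tasks with 3 candidates each (illustrative only).
toy_correct = [
    [True, False, True],
    [False, False, True],
    [False, True, False],
    [False, False, False],
]
toy_picked = [0, 2, 0, 1]
metrics = selection_metrics(toy_correct, toy_picked)
print(metrics)
```

Note that the perfect answer key is a ceiling, not a perfect score: when none of the sampled answers is right, no selector can fix that.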

Try the simplest version yourself

The file examples/verify_llm_output.py is the smallest hands-on version: it asks an AI to propose a formula, then has Lemma check it.

sequenceDiagram
  participant U as you
  participant M as an AI model (on your computer or online)
  participant L as Lemma
  U->>M: propose the formula for kinetic energy
  M-->>U: a formula, written out
  L->>L: work out and check the units
  L-->>U: verdict (looks correct / serious problem)

To run it, you need an AI model to talk to. The easiest option is a local model served through Ollama:

pip install -e ./sdk-py          # install Lemma's Python library
ollama pull qwen2.5-coder:7b     # download a capable free model
LEMMA_LLM_MODEL=qwen2.5-coder:7b python examples/verify_llm_output.py

Real result — the model proposed the correct kinetic-energy formula, and Lemma worked out the units from it and confirmed they’re right:

the model's proposal:
  looks correct  (1 of 1 checks passed)
  the units match

And if the model returns nonsense (text that isn’t a real formula, or a formula whose units don’t add up), Lemma rejects it rather than pretending it’s fine. The demo handles each of these failure cases. (You can point it at an online AI instead by setting LEMMA_LLM_BASE_URL, LEMMA_LLM_API_KEY, and LEMMA_LLM_MODEL.)
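
For a feel of what "work out and check the units" involves, here is a small self-contained sketch of a dimensional check for the kinetic-energy formula. It is an illustration of the general technique only: the names and data layout are hypothetical and this is not how Lemma itself is implemented.

```python
from typing import Tuple

# A dimension is a tuple of exponents for (mass, length, time).
Dim = Tuple[int, ...]

MASS: Dim = (1, 0, 0)       # kg
VELOCITY: Dim = (0, 1, -1)  # m / s
ENERGY: Dim = (1, 2, -2)    # kg * m^2 / s^2, i.e. the joule


def mul(a: Dim, b: Dim) -> Dim:
    """Multiplying two quantities adds their exponents."""
    return tuple(x + y for x, y in zip(a, b))


def power(a: Dim, n: int) -> Dim:
    """Raising a quantity to a power multiplies its exponents."""
    return tuple(x * n for x in a)


def check_energy_units(proposed: Dim) -> str:
    """Return a verdict on whether a formula's units are those of energy."""
    return "looks correct" if proposed == ENERGY else "serious problem"


# 0.5 * m * v^2: the factor 0.5 is a pure number and carries no units.
good = mul(MASS, power(VELOCITY, 2))
# m * v is momentum, not energy.
bad = mul(MASS, VELOCITY)

print(check_energy_units(good))
print(check_energy_units(bad))
```

A units check like this catches formulas that cannot possibly be right, but it cannot tell `0.5 * m * v**2` from `2 * m * v**2`, since pure numbers have no units.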

The full, rigorous test (A/B)

To measure the benefit properly, Lemma includes a test harness that runs an A/B test: the model alone (the “control”) versus the model with Lemma’s help (the “treatment”), across many questions, then reports the difference.

flowchart TB
  Q[the 73 benchmark questions] --> Ctrl[control: the model alone]
  Q --> Treat[treatment: the model + Lemma]
  Ctrl --> Score[score every answer]
  Treat --> Score
  Score --> Delta[/the difference, with confidence/]

cd eval/humaneval-sci && pnpm install
ollama pull llama3.1:8b
HUMANEVAL_SCI_PROMPTS_DIR=/path/to/prompts \
  pnpm smoke-ab --ollama --model llama3.1:8b --max-prompts 5

It comes with built-in connectors for several AI providers (Ollama for local models, plus Google’s Gemini and Anthropic’s Claude). Adding another model means writing a small connector. The benchmark questions themselves live in the separate humaneval-sci project.
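
The idea behind "the difference, with confidence" in the A/B test can be sketched with a paired bootstrap. This is a generic statistical technique shown under assumptions: the harness may compute its interval differently, and the per-question scores below are invented, not benchmark output.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(
    control: Sequence[float],    # per-question scores, model alone
    treatment: Sequence[float],  # per-question scores, model + checker
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for treatment - control."""
    if len(control) != len(treatment) or not control:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Pair the scores question by question, since both arms saw the same questions.
    diffs = [t - c for c, t in zip(control, treatment)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Resample questions with replacement and recompute the mean difference.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Invented pass/fail scores for 8 questions (1.0 = correct, 0.0 = wrong).
control_scores = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
treatment_scores = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
delta, low, high = paired_bootstrap(control_scores, treatment_scores)
print(f"difference = {delta:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Pairing matters here: because both arms answer the same questions, comparing them question by question removes the noise that comes from some questions simply being harder than others.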

The whole story, in one line

flowchart LR
  Cards[(cards: the knowledge)] --> Engine[the checker]
  Engine --> Tools[MCP server / Python]
  Tools --> Rerank[ask a few times, keep the best]
  Rerank --> Better[/a measurably more correct answer/]
  classDef c fill:#1e3a5f,stroke:#4a90d9,color:#fff;
  class Rerank c;

Open cards hold the knowledge; a generic checker turns them into clear verdicts; the MCP server and Python library hand those verdicts to any tool; and using the verdicts to pick the best answer makes real AI models measurably better at science — with a traceable receipt at every step.

Where to go next

Re-read Lemma, end to end — it should now feel like a summary of everything you know.
The engine for the simple scoring rule behind “keep the best.”
The examples folder, where every command on this site runs a real file you can execute yourself.