Putting it together — make an AI more correct
The other pages built the pieces. This one shows the payoff: using Lemma’s checks to make a real AI model measurably more correct — and how to try it yourself, even on your own laptop.
The idea: ask more than once, then check
AI models are not perfectly reliable — ask the same question twice and you can get a better answer and a worse one. So instead of trusting a single answer, do this:
- Ask the model the same question several times to get several answers.
- Run each answer through Lemma.
- Keep the one Lemma scores highest.
It’s like asking five people the same question and then letting a fact-checker pick the best reply. Simple, and it works.
flowchart LR
    P[a question] --> G[ask the AI a few times]
    G --> C1[answer 1]
    G --> C2[answer 2]
    G --> C3[answer 3]
    C1 --> S[score each one with Lemma]
    C2 --> S
    C3 --> S
    S --> B[keep the best-scoring answer]
    classDef c fill:#1e3a5f,stroke:#4a90d9,color:#fff;
    class S c;
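The ask-several-times-then-pick loop can be sketched in a few lines of Python. This is a minimal illustration, not Lemma's own code: `generate` and `score` are hypothetical callables standing in for "ask the model once" and "get a Lemma score for one answer", and the default of 5 samples simply mirrors the best-of-5 setup described on this page.

```python
from typing import Callable, List, Tuple


def best_of_n(
    question: str,
    generate: Callable[[str], str],  # hypothetical: asks the model once
    score: Callable[[str], float],   # hypothetical: stands in for a Lemma score
    n: int = 5,
) -> Tuple[str, float]:
    """Ask n times, score every answer, and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(question) for _ in range(n)]
    scored = [(score(answer), answer) for answer in candidates]
    # max() returns the first maximal item, so ties go to the earliest answer.
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score
```

Keeping the generator and the scorer as plain callables means the same loop works whether the answers come from a local model or an online one.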
Does it actually help? Yes — here are the numbers
This was tested on a benchmark of 73 scientific-coding tasks across 7 subjects (the HumanEval-Sci benchmark). Using Lemma to pick the best of 5 answers:
- A free model (Llama 3.1 8B) improved from a score of 0.630 to 0.685.
- That recovered about 88% of the most you could possibly gain (i.e. it picked nearly as well as a perfect answer key would).
- It agreed with that perfect answer key 97% of the time.
- The same trick worked on a different model family too (Mistral), so it’s not a fluke of one model.
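To make the "88% of the most you could possibly gain" figure concrete, here is one plausible reading of it as arithmetic. The assumption is that the percentage is the gain Lemma achieved divided by the gain a perfect answer key would achieve when picking among the same answers. The perfect-key score itself is not reported on this page, so the value below is only back-solved from the rounded figures and should be treated as approximate.

```python
def headroom_recovered(baseline: float, selected: float, oracle: float) -> float:
    """Fraction of the best possible gain that the selector actually captured."""
    if oracle == baseline:
        raise ValueError("no headroom: oracle equals baseline")
    return (selected - baseline) / (oracle - baseline)


baseline, selected = 0.630, 0.685  # scores reported above
gain = selected - baseline         # 0.055

# Back-solved from "about 88%"; an implied value, NOT a reported number.
implied_oracle = baseline + gain / 0.88  # roughly 0.6925

print(f"gain from picking with Lemma: {gain:.3f}")
print(f"implied perfect-key score:    {implied_oracle:.4f}")
print(f"headroom recovered:           {headroom_recovered(baseline, selected, implied_oracle):.0%}")
```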
Try the simplest version yourself
The file examples/verify_llm_output.py is the smallest hands-on version: it
asks an AI to propose a formula, then has Lemma check it.
sequenceDiagram
    participant U as you
    participant M as an AI model (on your computer or online)
    participant L as Lemma
    U->>M: propose the formula for kinetic energy
    M-->>U: a formula, written out
    L->>L: work out and check the units
    L-->>U: verdict (looks correct / serious problem)
To run it, you need an AI model to talk to. The easiest is a local one via Ollama:

    pip install -e ./sdk-py            # install Lemma's Python library
    ollama pull qwen2.5-coder:7b       # download a capable free model
    LEMMA_LLM_MODEL=qwen2.5-coder:7b python examples/verify_llm_output.py

Real result: the model proposed the correct kinetic-energy formula, and Lemma worked out the units from it and confirmed they’re right:

    the model's proposal: looks correct (1 of 1 checks passed)
    the units match

And if the model returns nonsense (text that isn’t a real formula, or one whose units don’t add up), Lemma rejects it rather than pretend it’s fine. Every way it can fail is part of the demo. (You can point it at an online AI instead by setting LEMMA_LLM_BASE_URL, LEMMA_LLM_API_KEY, and LEMMA_LLM_MODEL.)
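For intuition about what "working out the units" involves, here is a small self-contained sketch of a dimension check for kinetic energy. It is not Lemma's implementation or API: the `Dim` representation and the helper names are made up for illustration, and only the two verdict strings are borrowed from the diagram above.

```python
from typing import Dict

# Exponent of each SI base unit, e.g. a joule is {"kg": 1, "m": 2, "s": -2}.
Dim = Dict[str, int]


def mul(a: Dim, b: Dim) -> Dim:
    """Multiply two quantities: add the exponents unit by unit."""
    out = dict(a)
    for unit, exp in b.items():
        out[unit] = out.get(unit, 0) + exp
    return {u: e for u, e in out.items() if e != 0}


def power(a: Dim, n: int) -> Dim:
    """Raise a quantity to an integer power: scale every exponent."""
    return {u: e * n for u, e in a.items() if e * n != 0}


MASS: Dim = {"kg": 1}
VELOCITY: Dim = {"m": 1, "s": -1}
JOULE: Dim = {"kg": 1, "m": 2, "s": -2}


def verdict(proposed: Dim, expected: Dim) -> str:
    """Illustrative verdict: the units either match exactly or they do not."""
    return "looks correct" if proposed == expected else "serious problem"


# E = 0.5 * m * v**2: the factor 0.5 is dimensionless, so only m * v**2 matters.
kinetic = mul(MASS, power(VELOCITY, 2))
print(verdict(kinetic, JOULE))               # correct formula
print(verdict(mul(MASS, VELOCITY), JOULE))   # m * v is momentum, not energy
```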
The full, rigorous test (A/B)
To measure the benefit properly, Lemma includes a test harness that runs an A/B test: the model alone (the “control”) versus the model with Lemma’s help (the “treatment”), across many questions, then reports the difference.
flowchart TB
    Q[the 73 benchmark questions] --> Ctrl[control: the model alone]
    Q --> Treat[treatment: the model + Lemma]
    Ctrl --> Score[score every answer]
    Treat --> Score
    Score --> Delta[/the difference, with confidence/]

    cd eval/humaneval-sci && pnpm install
    ollama pull llama3.1:8b
    HUMANEVAL_SCI_PROMPTS_DIR=/path/to/prompts \
      pnpm smoke-ab --ollama --model llama3.1:8b --max-prompts 5

It comes with built-in connectors for several AI providers (Ollama for local models, plus Google’s Gemini and Anthropic’s Claude). Adding another model means writing a small connector. The benchmark questions themselves live in the separate humaneval-sci project.
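How the harness computes "the difference, with confidence" is not spelled out on this page. One common way to do it for a paired A/B test, shown here purely as an illustrative sketch, is to take the per-question difference between treatment and control and bootstrap a confidence interval around its mean. The function name and defaults below are assumptions, not the harness's interface.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_delta_ci(
    control: List[float],
    treatment: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for paired per-question scores."""
    if len(control) != len(treatment) or not control:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing by question removes question difficulty from the comparison.
    diffs = [t - c for c, t in zip(control, treatment)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boots = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    # Percentile bootstrap: read the interval off the sorted resample means.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return mean(diffs), boots[lo_idx], boots[hi_idx]
```

If the whole interval sits above zero, the treatment's advantage is unlikely to be a sampling accident. With only a handful of questions (for example `--max-prompts 5`) the interval will usually be wide.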
The whole story, in one line
flowchart LR
    Cards[(cards: the knowledge)] --> Engine[the checker]
    Engine --> Tools[MCP server / Python]
    Tools --> Rerank[ask a few times, keep the best]
    Rerank --> Better[/a measurably more correct answer/]
    classDef c fill:#1e3a5f,stroke:#4a90d9,color:#fff;
    class Rerank c;
Open cards hold the knowledge; a generic checker turns them into clear verdicts; the MCP server and Python library hand those verdicts to any tool; and using the verdicts to pick the best answer makes real AI models measurably better at science — with a traceable receipt at every step.
Where to go next
- Re-read Lemma, end to end — it should now feel like a summary of everything you know.
- The engine, for the simple scoring rule behind “keep the best.”
- The examples folder — every command on this site is a real, runnable file.