The engine, completely

The cards hold the knowledge; the engine is the program that uses it to check work. This page explains, in plain language, exactly what the checker does.

There are actually two checkers, for two different jobs:

flowchart LR
  O[a finished result] --> USCE[checker 1: USCE<br/>checks finished work]
  Hp[a proposed new idea] --> CC[checker 2: cross-check<br/>checks proposals]
  Cards[(cards)] --> USCE
  Cards --> CC
  USCE --> V1[/verdict with reasons/]
  CC --> V2[/verdict with reasons/]
  classDef c fill:#1e3a5f,stroke:#4a90d9,color:#fff;
  class USCE,CC c;

Checker 1: USCE — checking a finished result

USCE (it stands for “Universal Sanity Check Engine,” but the name doesn’t matter) takes a finished piece of work — say, the output of a simulation — and the card it’s supposed to follow, and runs some sanity checks. The basic ones, in plain terms:

Sensible ranges (working today) — every number the output reports is range-checked against the believable bounds the card lists (its “envelopes”): inside the range → pass, outside → a red flag. This is USCE v1.
Settling down (planned) — quantities that should fade away actually do, instead of blowing up.
Cause before effect (planned) — nothing reacts before the thing that caused it.

The “sensible ranges” check ships now (run_usce_checks, and the usce_check tool); the other two need the output as a time-series plus the card declaring what to watch, so they’re on the roadmap. Either way the bounds come from the card, not the program — so the same checker works for any subject.

Checker 2: cross-check — checking a proposed idea

When an AI (or a person) proposes a new formula, it must pass the cross-check engine before anyone trusts it. The engine runs four checks and then takes the worst result:

flowchart TD
  H[a proposed formula] --> D[1. do the units match?]
  H --> R[2. do its references exist?]
  H --> L[3. are its claims well-formed?]
  H --> F[4. does its 'based on' card exist?]
  D --> W{worst result wins}
  R --> W
  L --> W
  F --> W
  W --> O[/overall verdict/]
  classDef c fill:#1e3a5f,stroke:#4a90d9,color:#fff;
  class W c;

Units check — do the two sides of the formula have the same units? (The big one — explained next.) Failing this is an automatic red.
References check — every other formula it claims to agree with really exists in the library.
Claims check — statements like “as time goes on, this should settle to X” are recorded and checked for being well-formed, but not yet fully proven (more on this honesty below).
“Based on” check — if it says it builds on another card, that card exists.

The units check, explained slowly

This is the check that does the most real work today. Remember the golden rule from the Cards page: two things can only be equal if they have the same units. The engine checks exactly that for a proposed formula.

It can do it two ways:

The simple way: the author writes down the units of each side, and the engine compares them.
The smart way: the author gives the engine the actual formula plus the units of each letter, and the engine works out the units itself — then checks them. This is stronger, because it doesn’t just trust what the author claimed; it checks what the formula actually says.

flowchart LR
  expr["the formula: (1/2)*m*v**2"] --> P[work out the units<br/>step by step]
  sym["the letters: m is mass, v is speed"] --> P
  P --> der["result: mass x length-squared / time-squared"]
  der --> Cmp{same as the<br/>left-hand side?}
  Cmp -->|yes| Pass[green: looks correct]
  Cmp -->|no| Fail[red: serious problem]

The engine can handle the usual maths: multiply, divide, add, subtract, and whole-number powers. And it’s careful: if a formula has something it can’t work out for sure (an unusual function, a fractional power), it does not guess — it quietly falls back to the simple “compare what the author wrote” way and says so. So it never makes up a confident answer it can’t back up.

Why the smart way matters. Imagine a card that claims both sides are energy, but the formula it actually wrote is “mass × speed”:

serious problem  (0 of 1 checks passed)
  the formula works out to "mass x length / time",
  but the left side is energy: "mass x length-squared / time-squared"

The simple way would have trusted the author and passed it. The smart way works out the real units and catches the lie. A correct formula passes cleanly:

looks correct  (1 of 1 checks passed)
  worked out from the formula: it really is energy, matching the left side

The traffic light, and the score

Every check returns one of four levels — a traffic light:

Level	Light	Meaning	”Penalty”
`NONE`	🟢	all good	0
`LOW`	🟡	small concern	0.25
`MEDIUM`	🟠	real problem	0.5
`HIGH`	🔴	hard violation	1.0

The overall result is always the worst light, never an average — one red makes the whole thing red.

When Lemma turns this into a number (a “score”), it uses one simple formula:

score = (fraction of normal tests passed) × (1 − penalty)

Here’s the part that surprises people. Suppose a program passes every normal test (fraction = 1.0) but trips one red physics check (penalty = 1.0):

score = 1.0 × (1 − 1.0) = 0

The score is zero, even though all the tests passed — because the physics is broken. That’s deliberate: in science, being physically correct comes first. Passing the tests means nothing if the result is impossible. (This exact idea is what makes Lemma useful for picking the best AI answer — see Putting it together.)

What the engine honestly can’t do yet

Being honest about limits is part of the design. Today the engine fully checks units and references. But claims about limits and conservation (“as time goes on this settles to X”, “energy is conserved”) are only checked for being well-stated — not yet proven true. So those come back as a yellow “noted, not yet proven,” never a false green. A future version will add a maths engine that actually proves them. Until then, Lemma never pretends a claim is verified when it isn’t.

Every verdict shows its work

Whenever the engine gives a verdict, it tells you which card it used — and you can fetch that card to see exactly what it checked against. No black box: the reasoning is always traceable. (See Verdicts & severity for a reference.)

Putting it together — using verdicts to pick the best AI answer.
The MCP server — how to call the checker as a tool.