The Dashboard the Verification Needed

A research program that wants to be honest about what it has proven needs a way to publish those figures without a human typing them. On 2026-06-14 we shipped that pipeline for Monogate. It caught us over-counting on its first CI run.

This post is the arc of that day: what we built, how it broke us, what the breaking taught us, and what now lives in the open.

The starting point — a Khovanskii framework

The headline going into the day was a constructive zero-count bound for polynomial-in-(x, eˣ) expressions, shipped in MachLib under{” ”} MachLib.SingleExpKhovanskii. Three resolution paths (expPoly_khovanskii_bound, expPoly_auto_bound_with_propagation_aux, expPoly_ode_no_zeros), an AxiomAudit.lean reproducing the axiom dependency listing, and four worked Forge-kernel applications:

Each “strengthens” the Forge-emitted @verify(lean, …) obligation on the corresponding .eml source from a True := by trivial placeholder (or a sorry) to a real proof in Lean. Proven modulo MachLib’s axiomatized analytic base — MachLib.Real.HasDerivAt_*, exp_pos, the Rolle zero-counting corollary, etc. — which is the honest framing. We are not claiming Mathlib-grounded foundations; we are saying the proofs check against a documented set of axioms, and grounding that set in Mathlib is open work.

The propagation gap nobody could see

Four strengthened contracts is a real artifact. But Monogate ships hundreds of @verify obligations across forge/ and eml-stdlib/. Knowing what fraction were strengthened, what fraction were still vacuous placeholders, what fraction were sorry stubs — that was a manual cross-walk every time. “Which obligations still need an Applications/ companion?” was a thirty-minute spelunk.

So the first real piece of infrastructure that day was{” ”} tools/forge_verify_audit/forge_verify_audit.py. It walks three sources and joins them:

For each obligation it computes a status:

statusmeaning
strengthenedAn Applications/ proof discharges it (matched by name or by an explicit -- @strengthens X marker).
proven_in_placeThe Discovered/ stub already carries a concrete proof — no sorry, no vacuous True.
placeholderThe Discovered/ stub body is True := by trivial.
openThe Discovered/ stub contains sorry, or no stub exists at all.

First local run: 1102 obligations total. 4 strengthened.{” ”} 210 proven-in-place. 225 placeholders. 663 open. The framing that wrote itself was “19% discharged, 81% open.”

Where the post-mortem starts

The over-count was already baked in. I didn’t know.

I went on to refresh the monogate.net/evidence-status dashboard with what I thought were the live numbers — hand-typed, because that surface hadn’t yet been wired to CI. I quoted “MachLib discovered sorries: 269 (up from 222 due to Forge emission cadence)” in the changelog. The operator caught the framing problem in the audit’s headline — “210 / 1102 = 19% is the favorable fifth, not the story; the honest top-line is the 81% gap” — and that critique was right. But the critique was also based on numbers that came from me. Which were wrong.

The story would not have ended honestly if the next thing built hadn’t been the CI pipeline.

The CI pipeline

The operator’s specification was four-part:

  1. CI-only writes. The data lives on a surface only the GitHub Actions runner touches. Force-push to an orphan branch called status-data whose contents are exactly one file: status.json. Humans don’t write there; the only history is the sequence of CI snapshots.
  2. Reproducibility, not just write-prevention. The JSON carries the exact commands to recompute it at the recorded SHA. A stranger with the same commits checked out can re-derive every figure.
  3. Loud staleness. The page must show the data’s SHA and timestamp, never the page-load time. If the fetch fails or required fields are missing, render “Verification status unavailable” — never silently fall back to a cached or hard-coded value.
  4. The right top-line. Lead with the gap, not the win. Split proven into proven_from_mathlib (structurally 0 for MachLib, by design) and proven_mod_machlib_axioms (everything else), so a renderer cannot mis-report a single “proven” total.

We shipped exactly that. The status workflow lives at .github/workflows/status.yml in machlib. It runs on every push to master, runs lake build and lake env lean AxiomAudit.lean, runs the forge_verify_audit against fresh clones of forge and eml-stdlib, generates status.json via tools/status/generate_status.py, and force-pushes the single file to the orphan status-data branch.

The renderer is at{” ”} monogate.net/evidence-status. It’s plain client-side JavaScript that fetches from{” ”} raw.githubusercontent.com/agent-maestro/machlib/status-data/status.json{” ”} on every page load, validates the contract fields, and refuses to render a hard-coded fallback if anything is missing.

What the first CI run said

sha=71e63b0
sorries          : 198
@verify total    : 1088
strengthened     : 4
proven_mod_axioms: 36
placeholder      : 225
open             : 823
gap_pct          : 96.3%
discharged_pct   : 3.7%

Compare to the local figures I had been quoting all day:

fieldlocal (my reports)CI (public truth)
sorries221198
@verify total11021088
proven_in_place21036
open663823
gap_pct80.6%96.3%
discharged_pct19.4%3.7%

The first CI run reported numbers that disagreed sharply with everything I had been quoting. Not by a unit. By a lot.

Sixty-two Discovered/ stubs existed on my local working tree that weren’t tracked in the public repo. foundations/MachLib/Discovered/ carried a .gitignore line of just * — auto-generated stubs from the local auto_prove.py workflow were intentionally held out of public master pending review. 174 of those stubs had concrete proofs that the audit was counting as “proven-in-place” — for me, running it locally. A fresh clone, with the gitignored stubs absent, saw 36.

Plus 32 forge .eml files in a build/ artifact directory I had sitting around that the audit was double-counting.

The CI did exactly what the operator’s specification said it would do. It made the public numbers public-truth on the very first run. The 19% framing was almost entirely the gitignored private state. 3.7% is what a stranger sees.

What we shipped after the finding

The right response was not to massage the framing back. The right response was to surface the finding, fix the audit so a local run matches a CI run, push the 60 reviewable stubs to master so the public/private gap closes honestly, and write a calibration note into machlib’s CHANGELOG so the next reader of that document understands why the numbers moved.

By end of day:

The public dashboard now reads, post-cleanup: gap_pct 81.1% / discharged_pct 18.9% / 221 sorries / 4 strengthened / 202 proven_mod_axioms. The 4 strengthened contracts are real and publicly tracked. The 202 proven-in-place is now real (was 36 on the first CI run, because the gitignored stubs really did carry concrete proofs; pushing them moved that figure honestly). The remaining ~0.5 percentage-point difference from my old local view is the forge build/ artifact duplicates, which the gitignore filter now correctly excludes.

What we built on top

By late afternoon, the verification dashboard had been live for two hours and the temptation was to keep building on it. We did, on the 1op side, with three audiovisual exhibits that use the verified work:

Each exhibit grounds itself in 1op’s in-tree EML parser (src/lib/eml/parser.ts) by surfacing the kernel’s positive / general / naive cost alongside the proven theorem name. The math is the experience.

Why this matters more than the figures

The story isn’t 4 strengthened contracts or 81% gap or any specific number. The story is that a research program with this kind of staffing — one human operator and a small fleet of AI models — needs verification infrastructure as a prerequisite, not as a finishing touch.

Without it: the operator is on the hook for claims they can’t personally verify. The framing drifts toward whatever sounds best. Wrong counts get quoted because no one notices they’re wrong. The critique that comes back (“the headline framing is wrong”) is on solid ground because the underlying data was solid, but neither the operator nor the model has a way to immediately check it.

With it: the published numbers come out of lake build on a fresh clone. A stranger can recompute every figure at the recorded SHA. A drift between local and public is caught the moment CI runs. The operator’s framing critique still holds — it’s a framing critique — but it lands on a substrate that can be re-derived from first principles, and the model can no longer give the operator wrong numbers without the dashboard immediately calling it out.

The dashboard was the audience this work didn’t have until today. Now it does.

What’s not done

The point of writing this

We built a thing today that I am personally satisfied with — the verification pipeline — not because the figures it published are impressive (they aren’t, yet), but because the figures are now real. The next time someone asks “how much of Monogate is actually verified?”, the answer is a URL with a SHA and a timestamp and a reproduction command, not a confidence claim from a human or a model.

That’s the artifact worth shipping. The four Forge contracts are real but they’re a corollary; the infrastructure is the headline.


Monogate Research (2026). “The Dashboard the Verification Needed.” monogate research blog. https://monogate.org/blog/the-dashboard-the-verification-needed

React