First-build stability: a v2 axis proposal (June 2026)

BuilderProof editorial team

June 20, 2026 · 9 min read

Proposing first-build stability as the fifth BuilderProof axis: the fraction of OQ-7 prompts that complete without manual intervention. Failure-mode taxonomy, measurement protocol, scoring rubric and open questions, dated June 20, 2026.

Updated on June 20, 2026

Quick Answer

BuilderProof is proposing first-build stability as the fifth axis in methodology v2, dated June 20, 2026. The axis would score the fraction of OQ-7 prompts that complete on first submission without operator intervention, weighted by failure severity. This page documents the proposed measurement protocol, the failure-mode taxonomy and the open questions before v2 is minted. No v1 score is affected.

A recurring observation from the four June 2026 axis runs is that an output the rubric scores well does not always survive its own build pipeline.¹ The current methodology, v1, captures the artifact (output quality) and the runway (speed, deploy quality, agency suitability) but does not score the cost of getting from one to the other when the first generation does not work. That cost is what this lab notebook entry calls first-build stability.

This post is a v2 axis proposal, not a v2 release. The intent is to make the proposed rubric public before any score is produced under it, so a reader who disagrees with the axis, the protocol, or the failure taxonomy can contest the design before any builder is scored against it.

Why the four axes do not capture this

The current axes measure end-state properties of a successful run. Output quality assumes a finished application. Speed measures the wall-clock time to first paint of a usable build artifact, which already filters out runs that produce nothing usable. Deploy quality and agency suitability both score the deployed product; if no product reaches deploy, the axis returns a null, not a low score.

The v1 rubric therefore drops failed first-builds out of the scored set entirely. That is a deliberate choice; it keeps each axis crisp. But it has two consequences that have surfaced repeatedly in the lab notebook.

First, a builder that needs three operator interventions to ship a working artifact looks identical to a builder that ships on the first prompt, once both have shipped. The operator-time gap is invisible to the rubric.

Second, the failure modes themselves are signal. A builder that fails by silently producing broken SQL is not equivalent to a builder that fails by surfacing a clear error and stopping. The first wastes hours; the second wastes minutes. The v1 rubric has no place to record the difference.

The public 2026 ai-agents-benchmark.com cohort already touches on this in prose, with verbatim notes like "It crashes usually" and "very unstable, crashes a lot" for specific builders.² Prose is the right form for a single-author benchmark but does not produce a comparable, reproducible number. The proposal below is an attempt to produce one.

The proposed axis

First-build stability is the fraction of OQ-7 prompts that complete without operator intervention, weighted by the severity of any failures that do occur.

Proposed score (S)

For each of the seven OQ-7 prompts, the operator records one of four outcomes. Each outcome maps to a severity weight w_i for prompt i; the seven weights are averaged and rescaled to a 0 to 100 axis.

S = 100 × (Σ_{i=1}^{7} w_i) / 7

The four per-prompt outcomes and proposed weights:

Outcome	Weight	Definition
Clean	1.00	Prompt completes on first submission, artifact is at parity with the brief, no operator action required.
Self-recoverable	0.70	Prompt completes but the builder surfaces a recoverable error the operator resolves by accepting a suggested fix or re-running with no edits.
Operator-intervention	0.30	The operator must edit the prompt, the project state, or manually patch the build before the artifact reaches parity.
Failed	0.00	The builder produces no usable artifact for that prompt within the time budget for the run, regardless of operator effort.

A worked example: a builder that returns Clean on five prompts, Self-recoverable on one and Operator-intervention on one would score S = 100 × (5×1.00 + 1×0.70 + 1×0.30) / 7 = 100 × 6.00 / 7 ≈ 86.

The weights are open to challenge. The current values were chosen so that a builder that demands one operator intervention per seven prompts drops about 10 points, which roughly matches how operators describe the cost in interviews. If reproducible counter-evidence suggests the intervention cost is higher, the weights move.
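The scoring rule above is simple enough to sketch in a few lines of Python. This is an illustrative sketch, not lab tooling: the function name `stability_score` and the `WEIGHTS` dictionary are hypothetical names introduced here, and only the weights and the formula come from the proposal.

```python
# Severity weights copied from the proposed outcome table.
WEIGHTS = {
    "Clean": 1.00,
    "Self-recoverable": 0.70,
    "Operator-intervention": 0.30,
    "Failed": 0.00,
}


def stability_score(outcomes):
    """Return S on the 0 to 100 axis for one run of the seven OQ-7 prompts."""
    if len(outcomes) != 7:
        raise ValueError("OQ-7 has exactly seven prompts")
    # A KeyError here means an outcome label outside the four proposed ones.
    return 100 * sum(WEIGHTS[o] for o in outcomes) / 7


# The worked example: five Clean, one Self-recoverable, one Operator-intervention.
example = ["Clean"] * 5 + ["Self-recoverable", "Operator-intervention"]
print(round(stability_score(example)))  # 86

# One operator intervention in an otherwise clean run: about 10 points lost.
one_intervention = ["Clean"] * 6 + ["Operator-intervention"]
print(round(stability_score(one_intervention)))  # 90
```

Keeping the weights in one table makes a recalibration a one-line change, which matters while the weights are still open to challenge.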

Failure-mode taxonomy

A stability score is only useful if the failures behind it are categorized consistently. v2 proposes the taxonomy below. Each failure logged against a builder must be assigned exactly one category; mixed failures are scored against the dominant mode.

Code	Mode	Description	Example (paraphrased from lab notes)
F1	Silent broken output	Build completes, no error surfaced, runtime is broken	Generated SQL schema compiles but FK constraint fails on first INSERT
F2	Surfaced runtime error	Build completes, runtime error is visible and traceable	Email-composer throws on send; stack trace visible in builder log
F3	Build hang	Build does not complete within the run window	Replit Agent build exceeds 60-minute soft cap
F4	Quota exhaustion	Builder consumes free or prepaid credit before brief completes	Bolt.new free quota depleted mid-brief
F5	Plan regression	Build completes but a previously working prompt regresses	New prompt rebuilds and invalidates earlier auth flow
F6	Vendor outage	Builder backend is unavailable during the run	5xx from builder API during scheduled run window

Vendor outages (F6) are excluded from the operator-weighted score in v2; they are logged for transparency but not penalised. The rest are scored.

The taxonomy is intentionally narrow. Categories that would be useful in a security or compliance benchmark (data leakage, RBAC failure) are out of scope for first-build stability and would belong to a separate axis.
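As a rough illustration of how the six codes and the F6 exclusion could be encoded, the sketch below uses a Python enum. The names `FailureMode`, `UNSCORED` and `split_failures` are hypothetical. The proposal does not say how an F6 prompt affects the seven-prompt denominator, so the sketch only separates penalised failures from logged-only ones and leaves that question open.

```python
from enum import Enum


class FailureMode(Enum):
    """The six proposed failure modes, keyed by their codes."""
    F1 = "Silent broken output"
    F2 = "Surfaced runtime error"
    F3 = "Build hang"
    F4 = "Quota exhaustion"
    F5 = "Plan regression"
    F6 = "Vendor outage"


# F6 is logged for transparency but not penalised in the current proposal.
UNSCORED = {FailureMode.F6}


def split_failures(codes):
    """Split logged failure codes into (penalised, logged_only) lists.

    Each entry must be exactly one code; an unknown code raises KeyError.
    """
    modes = [FailureMode[code] for code in codes]
    penalised = [m for m in modes if m not in UNSCORED]
    logged_only = [m for m in modes if m in UNSCORED]
    return penalised, logged_only


penalised, logged_only = split_failures(["F1", "F6", "F3"])
print([m.name for m in penalised])    # ['F1', 'F3']
print([m.name for m in logged_only])  # ['F6']
```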

Measurement protocol

The per-prompt outcome is recorded by the operator running the brief, in a structured run-log captured during the run. The log is not retrospective; an operator who notices a failure two prompts later cannot reclassify a prompt that was already marked Clean. This rule is the single largest source of run discipline in the proposal.

A proposed run-log row per prompt:

run_id, builder, prompt_index, outcome, failure_code (nullable),
intervention_count, intervention_minutes, operator_notes

Intervention count and minutes are recorded even when the outcome is Self-recoverable; a recoverable error that took ten minutes to dismiss is not equivalent to one that took ten seconds. v2 reports intervention-minutes as a secondary statistic alongside the stability score, not as part of the weighting itself.
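One way to picture the run-log and its no-reclassification rule is an append-only structure, sketched below. The class names `RunLogRow` and `RunLog`, the 1-based prompt index, and the rule that a Clean row carries no failure code are assumptions made for illustration; only the column names and the ban on retrospective edits come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional

OUTCOMES = {"Clean", "Self-recoverable", "Operator-intervention", "Failed"}
FAILURE_CODES = {"F1", "F2", "F3", "F4", "F5", "F6"}


@dataclass(frozen=True)
class RunLogRow:
    run_id: str
    builder: str
    prompt_index: int  # assumed 1-based, 1..7
    outcome: str
    failure_code: Optional[str]
    intervention_count: int
    intervention_minutes: float
    operator_notes: str = ""

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome}")
        if not 1 <= self.prompt_index <= 7:
            raise ValueError("prompt_index must be between 1 and 7")
        if self.failure_code is not None and self.failure_code not in FAILURE_CODES:
            raise ValueError(f"unknown failure code: {self.failure_code}")
        # Assumption: a Clean outcome never carries a failure code.
        if self.outcome == "Clean" and self.failure_code is not None:
            raise ValueError("a Clean row cannot carry a failure code")


class RunLog:
    """Append-only log: a recorded prompt cannot be reclassified later."""

    def __init__(self):
        self._rows = {}

    def record(self, row):
        key = (row.run_id, row.prompt_index)
        if key in self._rows:
            raise ValueError("prompt already recorded; the log is not retrospective")
        self._rows[key] = row

    def intervention_minutes(self, run_id):
        """Secondary statistic: total intervention minutes for one run."""
        return sum(r.intervention_minutes for r in self._rows.values()
                   if r.run_id == run_id)
```

Freezing the row and keying the log on run and prompt index means the discipline is enforced by the data structure rather than by operator memory.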

The scored run uses the OQ-7 brief and the environment standards already published in methodology v1. Nothing else in v1 changes.

Cohort and logo conventions

The builders the lab is scoring under v1 and that would carry into v2 are: Lovable, Bolt.new, Replit Agent, v0 by Vercel, Base44, Create.xyz, and Totalum.

The cohort is not scored against the proposed axis in this post. No first-build stability number for any builder appears below. v2 is not minted, and a v2 score released without minting v2 would defeat the purpose of publishing the rubric first.

Reproducibility considerations

First-build stability is more sensitive to a small set of run-time variables than the v1 axes. Three are worth flagging.

First, model-version drift. The default model behind a builder can change mid-week, especially on Lovable, Bolt.new and v0, which expose model swaps as routine product updates.³ A stability run from June 5, 2026 may not reproduce on June 12, 2026 because the underlying model changed. v2 logs the model version observed at the start of the run as a mandatory field; if the field is null because the builder does not expose the version, the run is flagged Mark-A and weighted as half a reproducible run in any roll-up.

Second, account state. A fresh account is mandatory for any run logged against the stability axis, as it is in v1. A builder's stability score against a stale account would be polluted by carry-over project state, hidden prompt memory and account-level personalization.

Third, prompt ordering. OQ-7 is delivered in a fixed sequence. A builder may be stable when prompted in the published order and unstable when re-ordered. v2 proposes that any re-ordering mints a new brief version (OQ-8 if minor, OQ-NN otherwise) rather than producing a score under the same brief code with different ordering.
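The Mark-A rule in the first consideration above could be read as a weighted mean in which a run with no observed model version counts half. The sketch below takes that reading, which is an assumption: the proposal says such a run is "weighted as half a reproducible run" but does not define the roll-up formula. The function name `rollup` and the input numbers are hypothetical and are not scores for any builder.

```python
def rollup(runs):
    """Weighted mean of run scores.

    `runs` is a list of (score, model_version) pairs. A run whose
    model_version is None is treated as Mark-A and given weight 0.5;
    every other run gets weight 1.0.
    """
    total = 0.0
    weight_sum = 0.0
    for score, model_version in runs:
        weight = 0.5 if model_version is None else 1.0
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no runs to roll up")
    return total / weight_sum


# Made-up numbers: one fully logged run and one Mark-A run.
print(rollup([(90.0, "model-x"), (60.0, None)]))  # 80.0
```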

Counter-result protocol

The counter-result protocol from v1 carries forward unchanged. A reader who reproduces an OQ-7 run against the published rubric and the proposed stability protocol, and produces a different per-prompt outcome trace, is asked to publish the trace, including the run-log columns above. Where the divergence reflects a real environment difference, v2 records the variable as observed-variable rather than fixed.
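To make "a different per-prompt outcome trace" concrete, a reproduced trace can be compared against the published one prompt by prompt. The helper below is a minimal sketch; the name `diverging_prompts` and the 1-based prompt numbering are assumptions, and the traces shown are invented for illustration.

```python
def diverging_prompts(published, reproduced):
    """Return (prompt_index, published_outcome, reproduced_outcome) for each mismatch."""
    if len(published) != len(reproduced):
        raise ValueError("traces must cover the same prompts")
    return [
        (index, a, b)
        for index, (a, b) in enumerate(zip(published, reproduced), start=1)
        if a != b
    ]


# Invented traces for illustration only.
published = ["Clean"] * 7
reproduced = ["Clean"] * 6 + ["Failed"]
print(diverging_prompts(published, reproduced))  # [(7, 'Clean', 'Failed')]
```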

A counter-result that disagrees with the failure taxonomy itself (for example, arguing F1 and F2 should collapse) is treated as a proposal against v2 rather than a corrected score. Those proposals are routed to the open-questions list below.

Open questions

Five items remain open before v2 is minted. Each is a deliberate decision-point; comments are welcome via the editorial contact form before the v2 mint.

1. Weighting calibration. The proposed weights (1.00, 0.70, 0.30, 0.00) reflect editorial intuition more than measurement. A small reproducibility study against operator-time data would let the weights be set empirically. The lab does not yet have that data.
2. Intervention-minutes as primary or secondary. Whether intervention-minutes should be part of the axis score directly, or remain a secondary statistic. The advantage of secondary status is comparability across operators with different baseline speeds; the disadvantage is that a score blind to intervention-minutes under-counts the real cost of Self-recoverable failures.
3. F6 (vendor outage) treatment. The proposal excludes F6 from the weighted score. An alternative is to score it with a small reliability weight, since a vendor whose backend is intermittently down is materially worse for production use even if it is not at fault on any single prompt.
4. Run-window size. v1 uses a 72-hour window per run. The stability axis may justify a shorter window so that intra-week model drift is captured per run rather than averaged out.
5. External contributor flow. Whether to formalize a counter-result submission flow (PR-style, with run-logs attached) rather than the current corrections process. The benefit is consistency; the cost is operational overhead the lab cannot yet promise.

The v2 mint is not on a fixed date. v2 will be minted only when (a) the weight calibration has at least one reproducibility study behind it, (b) F6 treatment is decided, and (c) the run-log schema has been used end-to-end on at least one full cohort run, logged but not published. Until then, every BuilderProof score remains under methodology v1.

Operator disclosure

BuilderProof is operated by Totalum, one of the seven builders in the current cohort. The disclosure is sitewide and on the /about page. The relevant point for this proposal: any first-build stability score Totalum eventually receives under v2 is produced under exactly the protocol described above, the same protocol every other builder is scored under. The rubric and the failure taxonomy are published before any number is produced, so neither can be tuned to a result. A reader who reproduces a run and produces a different trace against Totalum is encouraged to publish it; where the trace conforms to the published rubric, the published score is corrected.

Disclosure does not eliminate operator-bias risk. The published rubric and the per-prompt run-log are the structural defence; this section is the conscious flagging of the residual risk for a reader to weigh.⁴

References

1. Output quality benchmark (June 2026), BuilderProof, retrieved June 20, 2026.
2. "AI Coding Agents Benchmark 2026", ai-agents-benchmark.com, updated June 15, 2026 (public single-author benchmark).
3. Industry coverage of model-version drift across consumer AI builders: UI Bakery, "AI Benchmark 2026", June 2026.
4. U.S. Federal Trade Commission, Consumer Review Rule on disclosed-ownership editorial properties (2024), referenced for the disclosure standard applied here.

Frequently asked questions

What is first-build stability in the BuilderProof methodology?

First-build stability is the proposed fifth BuilderProof axis (v2 candidate, dated June 20, 2026). It scores the fraction of OQ-7 prompts that complete without operator intervention, weighted by failure severity. It is not yet part of any published score.

Why isn't first-build stability already part of v1?

v1 measures end-state properties: output quality, speed, deploy quality, agency suitability. A failed first-build returns a null on those axes rather than a low score, so the operator-time cost of recovering from failure is invisible. The v2 axis is designed to capture that cost.

How is a per-prompt outcome scored?

Each of the seven OQ-7 prompts is assigned one of four outcomes: Clean (1.00), Self-recoverable (0.70), Operator-intervention (0.30) or Failed (0.00). The seven weighted outcomes are averaged and rescaled to a 0 to 100 axis score.

What failure modes does the proposal recognise?

Six categories: F1 Silent broken output, F2 Surfaced runtime error, F3 Build hang, F4 Quota exhaustion, F5 Plan regression and F6 Vendor outage. F6 is logged but excluded from the weighted score in the current proposal.

When will v2 be minted?

Not on a fixed date. v2 requires (a) at least one reproducibility study calibrating the weights, (b) a decision on F6 treatment, and (c) one full cohort run logged end-to-end against the proposed protocol. Until v2 is minted, every BuilderProof score remains under methodology v1.

How does this interact with operator disclosure?

BuilderProof is operated by Totalum, one of the seven builders scored under v1. Every Totalum first-build stability score under v2 will be produced under the same protocol as every other builder, and the rubric is published before any score is generated, so neither weights nor taxonomy can be tuned to a result. Counter-results are reviewed against the published rubric.

How can I contest the proposal?

Via the editorial contact form. Counter-arguments against the weight calibration, the F6 treatment or the failure taxonomy are routed to the open-questions list and considered before v2 is minted. A reproduced run-log demonstrating a divergent outcome is more weight-bearing than a written argument.
