Sunsetting binary 'first-build success' from BuilderProof's H2 2026 rankings (and what replaces it)

BuilderProof editorial team

June 29, 2026, 8 min read


Effective with the H2 2026 ranking, BuilderProof retires binary 'first-build success' as a scored axis. The signal saturated across the seven-builder cohort. It becomes a precondition (must pass to be ranked) and the rank weight moves to time-to-first-functional-build, measured as p50 and p90 over six trials on the v1 prompt set.

Updated on June 29, 2026



Quick answer

Effective with the H2 2026 ranking, BuilderProof retires binary "first-build success" as a scored axis. Across the seven builders in our June 2026 cohort, the signal saturated, with five of seven landing the same success label on the v1 prompt set. That is a useful precondition, not a useful ranker. From H2 2026 onward, first-build success becomes a yes/no gate (a builder must pass to be ranked), and the points formerly weighted to this axis move to time-to-first-functional-build (TTFB), measured as p50 and p90 latency over six trials per builder on the v1 prompt set. Day-2's first-build stability v2 axis proposal is the foundation for this change. This note documents the decision, the evidence, the new axis specification, and what the H2 2026 ranking will look like as a result.

Background: why first-build success was a v1 axis

In our v1 methodology, first-build success was a binary axis carrying 0.15 of the overall score. A builder either produced a working app on the first generation (a clean F0 in the failure taxonomy) or it did not (F1 through F5, or F6 vendor-incident-only as proposed in Day-2). The signal was load-bearing in early June 2026, when the spread across builders was wide and the failure modes were diverse.

By late June, that was no longer the case.

The signal saturated

We re-ran the v1 prompt set against the seven-builder cohort between June 24 and June 27, 2026. Five of the seven builders cleared first-build success on every prompt. One cleared it on every prompt except the multi-tenant prompt. One failed first-build success on three of the eight prompts. The standard deviation of the first-build-success metric across the cohort collapsed from 0.31 in early June to 0.07 in late June, with a mean of 0.94.
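As a rough illustration of the saturation check described above, the sketch below computes the cohort mean and population standard deviation of per-builder pass rates. The pass rates, the `SATURATION_SPREAD` cutoff, and the function names are hypothetical placeholders for illustration; they are not the measured cohort data, and the exact estimator behind the published figures is not specified here.

```python
from statistics import mean, pstdev

# ASSUMPTION: hypothetical cutoff below which an axis is treated as saturated.
SATURATION_SPREAD = 0.10

def cohort_spread(pass_rates):
    """Return (mean, population standard deviation) of per-builder pass rates."""
    return mean(pass_rates), pstdev(pass_rates)

def is_saturated(pass_rates, cutoff=SATURATION_SPREAD):
    """True when the cohort spread falls below the assumed cutoff."""
    _, spread = cohort_spread(pass_rates)
    return spread < cutoff

# Hypothetical pass rates (fraction of prompts cleared), one value per builder.
wide_cohort = [1.0, 0.5, 0.25, 0.9, 0.6, 0.1, 0.75]
tight_cohort = [1.0, 1.0, 0.9, 0.95, 1.0, 0.85, 1.0]

for label, rates in (("wide", wide_cohort), ("tight", tight_cohort)):
    avg, spread = cohort_spread(rates)
    print(f"{label}: mean={avg:.2f} sd={spread:.2f} saturated={is_saturated(rates)}")
```

A tight cohort like the second one leaves almost no room for the axis to separate builders, which is the pattern the re-run surfaced.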

Changelog entries from two vendors corroborate the trend:

Lovable's official changelog, June 1, 2026 entry, announced that "Lovable can now automatically fix eligible critical findings from Basic scan during regular agent work." Auto-fix during the agent loop pushes binary first-build success higher even on prompts where the first generation would have failed cleanly.
Lovable's May 27, 2026 changelog entry added subagents that "help Lovable investigate complex tasks faster by splitting research, code exploration, and review into focused parallel work," which compresses the path between prompt and rendered output and pulls more borderline prompts into the success column.
Bolt.new's release notes (October 2025 entry) added automatic security checks on publish, with subsequent maintenance updates through May and June 2026 tightening the publish-time error surface. The publish step is downstream of build, but the same investment is visible upstream.

A signal that compresses to a 0.07 standard deviation across the cohort can still discriminate on a few hard prompts, but it no longer ranks seven builders meaningfully in a 0-to-1 scoring frame. With a cohort mean of 0.94, keeping the axis at weight 0.15 would hand out roughly 14 of its 15 weighted points almost uniformly, making the overall ranking less informative, not more.

This is the methodology trap UI Bakery flagged in their May 28, 2026 benchmark walkthrough: "Most AI benchmarks focus on models" and inherit their saturation curves. Builder benchmarks have to retire metrics faster than model benchmarks, because builders ship multiple times per month and absorb model improvements as a byproduct.

Decision

Effective with the H2 2026 ranking (publication target: July 15, 2026), first-build success is no longer a scored axis. It is reclassified as a binary precondition. A builder that fails first-build success on more than 25% of the v1 prompt set is excluded from the ranking that quarter and listed in the "did not qualify" appendix with the per-prompt failure codes.

This treatment matches how peer benchmarks handle saturating preconditions. It also matches the structure proposed in Day-2's first-build stability v2 axis proposal, which separated F1 through F5 (real first-build failures) from F6 (vendor-incident-only) and recommended that F6 trials be discarded rather than counted. The 25% threshold absorbs the F6 carve-out without needing a separate vendor-incident exclusion rule on this axis. F6 still matters elsewhere (TTFB, agency-suitability) where the audit window is longer.
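The gate can be sketched in a few lines. This is a minimal illustration, not the production rule: it assumes one F-code per prompt, and it assumes F6 results are dropped from both numerator and denominator before the 25% threshold is applied. The function name `passes_gate` and the prompt identifiers are hypothetical.

```python
FAILURE_CODES = {"F1", "F2", "F3", "F4", "F5"}  # real first-build failures
GATE_THRESHOLD = 0.25  # excluded when failures exceed 25% of counted prompts

def passes_gate(codes_by_prompt):
    """codes_by_prompt maps a prompt id to its F-code ("F0" through "F6").

    ASSUMPTION: F6 (vendor-incident-only) prompts are not counted at all.
    """
    counted = [code for code in codes_by_prompt.values() if code != "F6"]
    if not counted:
        return False  # ASSUMPTION: nothing left to judge means not ranked
    failures = sum(code in FAILURE_CODES for code in counted)
    return failures / len(counted) <= GATE_THRESHOLD

# Hypothetical eight-prompt runs.
one_failure = {f"p{i}": "F0" for i in range(1, 9)} | {"p8": "F2"}
three_failures = {f"p{i}": "F0" for i in range(1, 9)} | {"p6": "F1", "p7": "F3", "p8": "F2"}

print(passes_gate(one_failure))     # 1 of 8 failed (12.5%)
print(passes_gate(three_failures))  # 3 of 8 failed (37.5%)
```

Under these assumptions, a builder failing exactly two of eight prompts sits at 25% and still qualifies, because the rule excludes only builders above the threshold.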

The 0.15 weight previously assigned to first-build success moves to time-to-first-functional-build.

What replaces it: TTFB

Time-to-first-functional-build (TTFB) is the elapsed wall-clock time, in seconds, from prompt submission to the first deployed URL whose rendered output satisfies the prompt's functional acceptance criteria. The functional acceptance criteria are the same per-prompt rubrics already used for the F-code classification, scored by the panel.

Specification:

Six trials per builder per prompt. Trials with F1-F5 failure codes record TTFB as +infinity for that trial. Trials with F6 (vendor-incident-only) are discarded from the per-builder TTFB distribution under the audit-window overlap exclusion rule in the Day-3 reproducibility lab note.
Per-builder p50 and p90 are computed across the surviving trials, weighted equally per prompt before aggregation.
Composite TTFB score normalizes to [0, 1] across the cohort using min-max scaling on log-seconds, weighted 0.6 on p50 and 0.4 on p90. The 0.4 weight on p90 preserves discrimination on tail latency, which is what most agency procurement actually cares about.
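The specification above can be sketched as follows. Several details are assumptions made only so the example runs, and none of them is confirmed by the specification: the nearest-rank percentile convention, averaging per-prompt percentiles to weight prompts equally, inverting the min-max scale so faster builders score higher, and replacing +infinity with a hypothetical `CAP_SECONDS` ceiling so the logarithm stays finite. All names and timings are illustrative.

```python
import math
from statistics import mean

P50_WEIGHT, P90_WEIGHT = 0.6, 0.4  # weights from the specification
CAP_SECONDS = 3600.0  # ASSUMPTION: hypothetical ceiling standing in for +infinity

def nearest_rank(values, q):
    """Nearest-rank percentile (an assumed convention), q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def builder_percentiles(trials_by_prompt):
    """Return (p50, p90) in seconds, averaged across prompts.

    Each trial is seconds, math.inf (F1-F5 failure), or None (discarded F6).
    """
    p50s, p90s = [], []
    for trials in trials_by_prompt.values():
        surviving = [min(t, CAP_SECONDS) for t in trials if t is not None]
        if not surviving:
            continue  # every trial on this prompt was an F6 discard
        p50s.append(nearest_rank(surviving, 0.50))
        p90s.append(nearest_rank(surviving, 0.90))
    return mean(p50s), mean(p90s)

def min_max_inverted(log_values):
    """Scale to [0, 1] so the smallest (fastest) log-time maps to 1.0."""
    lo, hi = min(log_values.values()), max(log_values.values())
    if hi == lo:
        return {name: 1.0 for name in log_values}
    return {name: (hi - v) / (hi - lo) for name, v in log_values.items()}

def composite_scores(cohort):
    """cohort maps builder name -> {prompt id -> list of trials}."""
    pcts = {name: builder_percentiles(t) for name, t in cohort.items()}
    s50 = min_max_inverted({n: math.log(p[0]) for n, p in pcts.items()})
    s90 = min_max_inverted({n: math.log(p[1]) for n, p in pcts.items()})
    return {n: P50_WEIGHT * s50[n] + P90_WEIGHT * s90[n] for n in cohort}

# Hypothetical single-prompt cohort; timings are made up for illustration.
cohort = {
    "builder_a": {"p1": [40, 42, 45, 50, 55, 60]},
    "builder_b": {"p1": [80, 85, 90, 95, 100, math.inf]},  # one F1-F5 failure
    "builder_c": {"p1": [60, 62, None, 70, 75, 90]},       # one F6 discard
}
scores = composite_scores(cohort)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Note that with six trials and the nearest-rank convention, p90 is the slowest surviving trial, so a single F1-F5 failure on a prompt drives that prompt's p90 to the ceiling. How the real methodology treats that case is not stated here.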

The replacement is not a binary anymore. It carries information across the entire builder cohort, including the saturated ones, because builders that pass first-build success at 100% do so with different latency distributions.

What this means for the H2 2026 ranking

The total weighted score still sums to 1.0. The composition shifts:

Output quality: 0.30 (unchanged)
Agency suitability: 0.20 (unchanged)
Deploy quality: 0.20 (unchanged)
TTFB: 0.15 (new, absorbing the old first-build-success weight)
Speed-to-first-paint: 0.10 (unchanged)
Methodology compliance: 0.05 (unchanged)
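As a quick check on the composition above, the sketch below confirms the six weights sum to 1.0 and shows how a weighted overall score could be assembled from per-axis scores in [0, 1]. The dictionary keys, the `overall_score` helper, and the example axis scores are hypothetical.

```python
import math

# Weights as listed for the H2 2026 ranking.
WEIGHTS = {
    "output_quality": 0.30,
    "agency_suitability": 0.20,
    "deploy_quality": 0.20,
    "ttfb": 0.15,
    "speed_to_first_paint": 0.10,
    "methodology_compliance": 0.05,
}
assert math.isclose(sum(WEIGHTS.values()), 1.0)

def overall_score(axis_scores):
    """Weighted sum of per-axis scores, each expected to lie in [0, 1]."""
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

# Hypothetical builder: perfect everywhere except the slowest TTFB in the cohort.
example = {axis: 1.0 for axis in WEIGHTS} | {"ttfb": 0.0}
print(round(overall_score(example), 2))
```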

The H2 2026 ranking publishes July 15, 2026. Builders that fail the new precondition appear in the "did not qualify" appendix. The v1 first-build-success scores from June 2026 are preserved in the historical methodology page and remain citable, with a clear "retired axis" annotation.

Limitations and open questions

TTFB depends on the panel's per-prompt functional acceptance criteria. We document those criteria in the v2 methodology release notes and version-tag them per quarter. A change in criteria invalidates cross-quarter TTFB comparisons.
With only six trials, p50 and p90 are sparse estimators and their confidence intervals are wide. We compensate with the precondition gate (excluding obviously unfit builders) rather than by inflating the trial count, because trial cost is the binding constraint.
The F6 (vendor-incident-only) discard rule is sensitive to status-page coverage. Builders without public incident status pages are at a measurement disadvantage. We list the public status page URL per builder in the methodology appendix.
We have not yet decided whether builders that re-enter the ranking after a "did not qualify" quarter get a grace period on TTFB or are scored from their first qualifying trial. Open for community review through July 8, 2026.
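The sparse-estimator limitation above can be made concrete with a percentile bootstrap over six trials. This is a sketch under stated assumptions: the nearest-rank percentile convention, the resample count, the 90% interval level, and the timings are all hypothetical choices, and nothing here is part of the published methodology.

```python
import math
import random

def nearest_rank(values, q):
    """Nearest-rank percentile (an assumed convention), q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def bootstrap_interval(samples, q, n_resamples=2000, level=0.90, seed=0):
    """Percentile-bootstrap interval for the q-th percentile of samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = sorted(
        nearest_rank(rng.choices(samples, k=len(samples)), q)
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_resamples - 1))]
    upper = estimates[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper

# Hypothetical TTFB timings in seconds for one builder on one prompt.
trials = [41.0, 44.0, 47.0, 52.0, 58.0, 95.0]

for q in (0.50, 0.90):
    point = nearest_rank(trials, q)
    lower, upper = bootstrap_interval(trials, q)
    print(f"p{int(q * 100)}: point={point} interval=({lower}, {upper})")
```

With six samples the nearest-rank p90 is simply the slowest trial, so one outlier moves the tail estimate directly. That sensitivity is the reason to read small differences in p90 between builders with caution.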

FAQ

Why retire an axis instead of reweighting it?

A reweighted saturated axis still pushes its weighted points into a near-uniform distribution. The ranking becomes less informative as a result, not more. Removing the axis frees the weight for a metric (TTFB) that discriminates across the full cohort, including the saturated subset.

Does first-build success still appear anywhere?

Yes. It remains a precondition (binary gate), and the per-builder per-prompt F-codes still appear in the "did not qualify" appendix when a builder fails the gate. The methodology change is about the scored ranking, not about the underlying data collection.

Why TTFB and not speed-to-first-paint?

Speed-to-first-paint is a deploy-time metric. TTFB is a generation-time metric. They measure different things. Speed-to-first-paint remains its own axis at weight 0.10. TTFB inherits the 0.15 weight from first-build success because they sit in the same "did the builder actually produce a working app, and how fast" cluster.

How does this interact with the vendor-incident exclusion rule?

F6 (vendor-incident-only) trials are discarded from the TTFB distribution per the Day-3 reproducibility lab note's audit-window overlap rule. The precondition gate uses a 25% threshold on non-F6 trials, so a vendor that suffered a single incident during the audit window is not penalized for the F6 trials but still has to clear the gate on the rest.

When does this take effect?

H2 2026 ranking, publication target July 15, 2026. The v1 methodology stays canonical for any benchmark cited before that date.

Will the v1 first-build-success scores remain published?

Yes. The v1 methodology page carries the historical scores with a "retired axis, see H2 2026 methodology note" annotation. No deletion. Methodology versioning is the point.

Is the community invited to challenge this decision?

Yes. The community-edit window on this lab note is open through July 8, 2026. Counter-proposals (different replacement axis, different weights, different precondition threshold) are welcome via the contribution flow at /contribute. Submissions are evaluated on the published counter-result protocol, the same way benchmark cells are evaluated.

References

BuilderProof, How we benchmark AI app builders, v1 methodology, June 19, 2026.
BuilderProof, First-build stability v2 axis proposal, June 20, 2026.
BuilderProof, Deploy-quality reproducibility lab note, June 21, 2026.
Lovable, Changelog and product updates, accessed June 29, 2026.
Bolt.new, Release notes, accessed June 29, 2026.
UI Bakery, AI Benchmark 2026: Comprehensive Walkthrough, May 28, 2026.


