How We Benchmark AI App Builders: The BuilderProof Methodology v1

BuilderProof editorial team

June 19, 2026, 9 min read

The BuilderProof methodology v1, dated June 19, 2026, in full: four axes, the OQ-7 test brief, environment standards, scoring weights, reproducibility steps, the operator disclosure, and the v2 open questions. This is the rubric that produces every June 2026 BuilderProof score.

Updated on June 19, 2026

Quick Answer

BuilderProof's June 2026 methodology v1 scores AI app builders on four published axes: output quality, speed, deploy quality, and agency suitability. Each axis carries a documented rubric, a fixed test brief, and a version-stamped weight, so any operator can reproduce a benchmark run and contest a score on its own terms. This post documents the rubric, the brief, the environment standards, and the reproducibility steps in full.

Public methodology comes before public scores. That order matters. A benchmark that hides its rubric can never be contested on its own terms, only argued against in the abstract.¹ So before we run another round, this page documents how BuilderProof produces a score, what is in the test brief, which environment standards are fixed, and how anyone can reproduce a run end to end.

The intent is narrow. This is the methodology the lab uses today, version v1, dated June 19, 2026. It is not a ranking, not a recommendation, and not a critique of any builder. The four most recent BuilderProof benchmarks are produced under this methodology and link back to this page for their rubric definitions.

Why a published methodology

Three motivations.

First, reproducibility. An evaluation that no outside party can re-run is not a benchmark, it is an opinion piece with a table. App-builder evaluations are unusually exposed to this failure mode because the same prompt produces different outputs across days, accounts and model versions.² Publishing the rubric, the brief and the environment standards lets a reader regenerate the work, or, more often, challenge a specific score with a specific counter-run.

Second, contestability. A benchmark a vendor cannot contest is a benchmark a vendor will ignore. The rubric below is structured so that any operator (vendor, agency, independent researcher) can point to a particular axis and produce a counter-result. Where their result and ours diverge, the divergence is the interesting datapoint.

Third, neutrality. BuilderProof is operated by Totalum, an AI app builder included in every BuilderProof scoring run. That is disclosed sitewide and on the /about page; it is also disclosed here. The defence against operator bias is not editorial intent, it is the published rubric: every score Totalum receives is produced by the same rubric, the same brief and the same environment as every other builder, and any of those scores can be reproduced against the same rubric.³

The four axes (v1)

Methodology v1 scores four capability axes. Each axis has its own dedicated benchmark post; this page documents the rubric definitions that all four share.

v1 axes

Output quality. Visual fidelity, code structure and functional correctness of the generated application against the fixed test brief.
Speed. Wall-clock time from prompt submission to first paint of a usable build artifact.
Deploy quality. SEO, accessibility and performance audits run against the live deployed product, not a local preview.
Agency suitability. Whitelabel, MCP support, public API surface and code portability, verified hands-on against current docs.

A fifth axis, first-build stability, is on the v2 candidate list (see Open questions below). It is not part of v1 and not weighted in any v1 result.

The test brief

Every builder is scored against the same test brief. The brief defines what the system under test is asked to build, with enough specificity to make outputs comparable and enough latitude to let each builder use its own idioms.

OQ-7 brief (current)

Build a multi-tenant CRM SaaS named vibeCrm, comprising: a marketing landing page with a blog index and a working post detail; email-and-password authentication with password recovery; a Contacts CRUD with list, detail and edit; an Opportunities Kanban with drag-and-drop between stages; an email-campaign composer that sends through the chosen builder's email integration; and an AI chat surface that answers questions about the current account's data. The brief is fed as seven sequential prompts, identical across builders.

The brief intentionally mirrors the public 2026 ai-agents-benchmark.com brief at the spec level so external readers can compare BuilderProof's scoring against an independent corpus.⁴ Where the briefs diverge is in our explicit deploy-quality and agency-suitability axes, which the ai-agents-benchmark brief does not score.

The brief is version-stamped. OQ-7 is the brief used in all four June 2026 benchmarks linked from this page. A change to the brief mints a new version (OQ-8) and triggers re-scoring; old scores are retained in the change log under their original brief.

Environment standardization

Builder output is sensitive to environment. v1 fixes the variables below; any score in any BuilderProof post was produced under these settings unless explicitly noted.

Variable	v1 standard
Account state	Fresh account per builder; no prior project history
Plan	Lowest plan that exposes deployment to a public URL
Model selection	Builder's default model unless the builder forces a choice
Browser	Chromium, automated via Playwright, no extensions
Region	EU-West for the runner; outbound TCP to the builder's nearest region
Date window	Each run completes within a 72-hour window; window is logged per result
Auth	Real working email per builder account; no shared accounts

The two most-disputed variables are model selection and account state. Defaulting to the builder's default model removes a parameter the team would otherwise have to defend per result. Forcing a fresh account on every run isolates the score from any cached project state or hidden personalization on the platform.
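
One way to make these settings checkable is to log them as a structured record next to each result. The sketch below is illustrative only: the `RunEnvironment` class, its field names and the example values are assumptions, not a published BuilderProof schema. Only the variable names and the 72-hour limit come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=72)  # v1 date-window standard from the table above


@dataclass(frozen=True)
class RunEnvironment:
    """Hypothetical per-result record of the v1 environment variables."""
    builder: str
    fresh_account: bool      # Account state: fresh account, no project history
    plan: str                # Plan: lowest plan exposing a public deploy URL
    model: str               # Model selection: builder default unless forced
    browser: str             # Browser: Chromium via Playwright, no extensions
    runner_region: str       # Region: EU-West for the runner
    started_at: datetime
    finished_at: datetime

    def violations(self) -> list[str]:
        """Return the v1 standards this record fails to meet."""
        problems = []
        if not self.fresh_account:
            problems.append("account state: account is not fresh")
        if self.finished_at - self.started_at > MAX_WINDOW:
            problems.append("date window: run exceeded 72 hours")
        return problems


env = RunEnvironment(
    builder="example-builder",          # placeholder, not a real result
    fresh_account=True,
    plan="lowest-public-deploy",
    model="default",
    browser="chromium-playwright",
    runner_region="eu-west",
    started_at=datetime(2026, 6, 1, 9, 0),
    finished_at=datetime(2026, 6, 3, 18, 0),
)
print(env.violations())
```

The example run spans 57 hours on a fresh account, so the list of violations is empty. A record like this could be published with each score so a counter-run can state exactly which variables it changed.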

Scoring and weights

Each axis returns a 0 to 100 score per builder, produced by axis-specific rubrics documented in the corresponding benchmark post. A weighted roll-up is used only when an article is comparing builders on a multi-axis question; standalone axis posts report unweighted axis scores.

The roll-up weights default to:

Axis	Weight
Output quality	0.35
Speed	0.15
Deploy quality	0.20
Agency suitability	0.30

These weights reflect the editorial team's read of the audience BuilderProof writes for, which skews toward operators who care about shipping rather than prototyping. The weights are published so that a reader who weights differently (for example, an audience that values speed over agency surface) can recompute the roll-up from the per-axis scores without re-running the benchmark.
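
As a minimal sketch of that recomputation, assuming the roll-up is a plain weighted sum of the four 0 to 100 axis scores (the page publishes the weights but not the exact formula), a reader could do the following. The function name `roll_up` and the example scores are made up for illustration; the default weights are copied from the table above.

```python
# Default v1 roll-up weights, copied from the table above.
V1_WEIGHTS = {
    "output_quality": 0.35,
    "speed": 0.15,
    "deploy_quality": 0.20,
    "agency_suitability": 0.30,
}


def roll_up(axis_scores: dict[str, float], weights: dict[str, float] = V1_WEIGHTS) -> float:
    """Weighted sum of 0-100 axis scores; weights must name the same axes and sum to 1."""
    if set(axis_scores) != set(weights):
        raise ValueError("axis scores and weights must name the same axes")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(axis_scores[axis] * weights[axis] for axis in weights)


# Made-up scores for illustration only; not a published result.
scores = {"output_quality": 80, "speed": 60, "deploy_quality": 70, "agency_suitability": 50}

default = roll_up(scores)
# A reader who values speed over agency surface swaps those two weights.
speed_first = roll_up(scores, {**V1_WEIGHTS, "speed": 0.30, "agency_suitability": 0.15})
print(round(default, 2), round(speed_first, 2))
```

With these illustrative numbers the default weights give 66.0 and the speed-first weights give 67.5, from the same per-axis scores and without re-running anything.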

A roll-up never replaces a per-axis score in our writing. The per-axis tables are always present alongside any weighted result.

Reproducibility

Reproducing a BuilderProof run end to end requires four artifacts, all of which are public on this site.

The v1 axis definitions on this page.
The OQ-7 test brief above.
The environment standards table above.
The per-axis rubric in the relevant benchmark post.

A reader who reproduces a run and gets a materially different score is encouraged to publish the counter-result. Where the counter-result reflects a real environment difference (different region, different plan, different model version), the methodology is updated to record the parameter as variable rather than fixed.
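
To decide whether a counter-result reflects an environment difference, it helps to compare the two environments variable by variable. The following is a rough sketch under stated assumptions: `environment_diff`, the dictionary keys and the example values are hypothetical and not part of any BuilderProof tooling.

```python
def environment_diff(published: dict[str, str], counter: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each variable whose value differs to (published value, counter-run value)."""
    keys = sorted(set(published) | set(counter))
    return {
        key: (published.get(key, "<unset>"), counter.get(key, "<unset>"))
        for key in keys
        if published.get(key) != counter.get(key)
    }


# Placeholder environments; the keys loosely follow the environment table.
published_env = {"region": "eu-west", "plan": "lowest-public-deploy", "model": "default"}
counter_env = {"region": "us-east", "plan": "lowest-public-deploy", "model": "default"}

for variable, (ours, theirs) in environment_diff(published_env, counter_env).items():
    print(f"{variable}: published={ours} counter={theirs}")
```

Here only the region differs, so region is the candidate to record as a variable parameter. An empty diff would instead point to run-to-run variance or a rubric disagreement.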

Versioning and change log

Methodology v1 is dated June 19, 2026. Future versions are minted when any of the following change: the axis set, the rubric for an axis, the test brief, or the environment standards. Each version mint is logged below with the reason for the change. Older versions remain accessible for citation.

Version	Date	Change
v1	June 19, 2026	Initial published methodology covering four axes and OQ-7
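
The version-mint rule can be expressed as a comparison over the four binding inputs. This is a sketch only: the `changed_inputs` helper and the fingerprint strings are assumptions for illustration (in practice a fingerprint might be a content hash of each published artifact), and OQ-8 is used purely as the next brief label named earlier on this page.

```python
# The four inputs whose change mints a new methodology version, per the text above.
BINDING_INPUTS = ("axis_set", "axis_rubrics", "test_brief", "environment_standards")


def changed_inputs(current: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Return the binding inputs whose fingerprint differs between two snapshots."""
    return [name for name in BINDING_INPUTS if current[name] != proposed[name]]


# Fingerprints are placeholder strings, not real identifiers.
v1 = {
    "axis_set": "four-axes",
    "axis_rubrics": "rubrics-2026-06",
    "test_brief": "OQ-7",
    "environment_standards": "env-2026-06",
}
proposed = {**v1, "test_brief": "OQ-8"}

changes = changed_inputs(v1, proposed)
if changes:
    print("mint a new version; changed:", ", ".join(changes))
```

The returned list doubles as the "reason for the change" column in the change log.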

Operator disclosure

BuilderProof is operated by Totalum. The full disclosure is on the /about page and in the sitewide footer. The relevant points for a reader evaluating this methodology:

Totalum is one of the builders scored in every BuilderProof run. Every Totalum score is produced by the same rubric and brief as every other builder.
The rubric, the test brief, the environment standards and the weights are published in full before any builder is scored, so the rubric cannot be tuned to favour a result after the fact.
The methodology is the binding constraint, not the operator. A score that contradicts what the operator would prefer to publish is still published.
Counter-results from any source, including any vendor, are reviewed against the same rubric. Where a counter-result reproduces the published rubric and produces a different number, the published score is corrected.

Disclosure does not eliminate operator-bias risk; it surfaces it for a reader to weigh. The structural defence is the rubric.⁵

Builders scored under v1 (June 2026 cohort)

The builders covered by the four current June 2026 benchmarks are: Lovable, Bolt.new, Replit Agent, v0 by Vercel, Base44, Create.xyz, and Totalum. The cohort matches the public 2026 ai-agents-benchmark.com cohort plus Create.xyz, which the lab added to cover the consumer-builder lower tier.

Open questions

Five items are on the v2 candidate list. They are open questions, not commitments.

First-build stability. Whether to score the fraction of OQ-7 prompts that complete without manual intervention as a fifth axis, or whether it belongs inside output quality. There are arguments either way.
Multi-run averaging. v1 reports one run per builder per axis. Multi-run averaging would lower variance but multiply cost. The threshold above which a result deserves a multi-run is not yet defined.
Model-pinning. Whether to pin a specific model version per builder, accepting that the builder's default may drift, or accept default-drift as part of what the benchmark measures.
Localization. Whether to mint a Spanish-language OQ-7-ES variant for non-English builder evaluations.
External contributor flow. Whether to accept counter-results as full peer-reviewed contributions rather than the current corrections process.
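
For the first-build stability candidate, the metric as described above is a simple fraction. A minimal sketch, assuming each of the seven OQ-7 prompts is recorded as a yes/no outcome; the function name and the outcomes are hypothetical, and nothing here is part of v1 scoring:

```python
def first_build_stability(prompt_completed_unaided: list[bool]) -> float:
    """Fraction of brief prompts that completed without manual intervention."""
    if not prompt_completed_unaided:
        raise ValueError("at least one prompt outcome is required")
    return sum(prompt_completed_unaided) / len(prompt_completed_unaided)


# Hypothetical outcomes for the seven OQ-7 prompts; not a published result.
outcomes = [True, True, False, True, True, True, False]
print(f"{first_build_stability(outcomes):.2f}")
```

Five unaided completions out of seven give roughly 0.71. Whether that number stands alone as an axis or feeds into output quality is the open question.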

Comments on any of these are welcome via the editorial contact form. Contributors are credited for real, verifiable work, not by attribution alone.

References

1. Open-rubric benchmarking discussion, App-Bench documentation, https://appbench.ai/, retrieved June 2026.
2. Label Studio, "How to Build AI Benchmarks That Evolve with Your Models", July 2025, labelstud.io.
3. Totalum, operator of BuilderProof; disclosed sitewide.
4. "AI Coding Agents Benchmark 2026", ai-agents-benchmark.com, updated June 15, 2026 (a public benchmark, single-author voice).
5. U.S. Federal Trade Commission, Consumer Review Rule on disclosed-ownership editorial properties (2024), referenced for the disclosure standard applied here.

Published by the BuilderProof editorial team - the maintainers of the public, versioned benchmark methodology.

Frequently asked questions

What is the BuilderProof methodology v1?

BuilderProof methodology v1, dated June 19, 2026, is the published rubric the lab uses to score AI app builders on four axes: output quality, speed, deploy quality, and agency suitability. Each axis has a per-axis rubric, a fixed test brief and a documented environment, so any reader can reproduce a run.

Why publish the methodology before publishing scores?

Because a benchmark whose rubric is hidden can never be contested on its own terms. Publishing the rubric, brief and environment first means any operator, including any vendor whose builder is scored, can reproduce a result or publish a counter-result against the same rules.

Who scores the builders?

The BuilderProof editorial team. Contributors are named only when they contribute real, verifiable work to a result; the lab does not invent named reviewers.

What is the test brief?

OQ-7 asks each builder to build a multi-tenant CRM SaaS with a marketing site, authentication and password recovery, a Contacts CRUD, an Opportunities Kanban, an email-campaign composer and an AI chat surface, delivered as seven sequential prompts identical across builders.

Is BuilderProof independent?

BuilderProof is operated by Totalum and discloses the operator relationship sitewide. Editorial independence is enforced structurally: the rubric, brief, environment and weights are published before any builder is scored, and Totalum is scored under the same rules as every other builder.

How can a vendor or independent researcher contest a score?

Reproduce the run using the published axis definitions, the OQ-7 brief and the environment standards on this page, then publish the counter-result. Where the counter-result reflects a real environment difference, the methodology is updated to record the parameter as variable.

Which builders are included in the June 2026 cohort?

Lovable, Bolt.new, Replit Agent, v0 by Vercel, Base44, Create.xyz and Totalum. The lab adds or removes a builder only between methodology versions; v1 covers this cohort.

When does v2 ship?

When any of the four binding inputs change: the axis set, an axis rubric, the test brief, or the environment standards. The open questions section lists the candidate v2 changes under review.

Output quality

Benchmarking output quality across 7 AI app builders (June 2026)

We gave seven AI app builders one identical brief and scored the output on visual fidelity, code structure and functional correctness using a published, double-rated rubric. Lovable led on overall output quality (93/100), with v0 close behind on component fidelity and Bolt.new strong on framework breadth. Differences were largest in code structure, not visuals: every builder produced something that looked right, but maintainability and correctness diverged sharply. This page documents the brief, the rubric and the per-builder results, with every figure sourced.

June 3, 2026, 11 min read

Speed

Speed-to-first-paint across AI app builders (June 2026)

We timed two clocks for each builder: speed-to-first-paint (prompt to first rendered preview) and time-to-working-app (prompt to all acceptance checks passing). v0 by Vercel was fastest to first paint at a median 9 seconds; Base44 and Bolt.new followed. The ranking shifts for time-to-working-app, where full-app builders pay an upfront cost but reach a runnable result with fewer manual edits. We report medians of five cold runs on a fixed network profile, with the full distribution and caveats below.

June 9, 2026, 10 min read

Deploy quality

Deploy-quality benchmark: SEO, accessibility and performance audits (June 2026)

We audited the deployed output of seven AI app builders with Lighthouse, axe-core and a structured SEO checklist - auditing the production build, not the in-editor preview. Performance was the strongest dimension across the board; accessibility was the weakest, with colour-contrast and form-label failures common. Lovable and v0 led overall, but no builder shipped a clean accessibility pass out of the box. This page reports per-dimension scores and the specific failures that recur, so you know what to fix after export.

June 12, 2026, 11 min read
