Benchmarking output quality across 7 AI app builders (June 2026)
We gave seven AI app builders one identical brief and scored the output on visual fidelity, code structure and functional correctness using a published, double-rated rubric. Lovable led on overall output quality (93/100), with v0 close behind on component fidelity and Bolt.new strong on framework breadth. The widest gaps were in code structure and functional correctness, not visuals: every builder produced something that looked right, but maintainability and correctness diverged sharply. This page documents the brief, the rubric and the per-builder results, with every figure sourced.
Updated on June 18, 2026
Output quality is the first thing anyone notices about an AI app builder and the hardest thing to measure fairly. A demo can be cherry-picked; a benchmark cannot. For this round we gave seven builders one identical brief and scored what came back against a fixed, published rubric [1].
Background
The premise of BuilderProof is that builder comparisons should be reproducible. "It generated a beautiful landing page" is an anecdote. "It scored 93/100 on visual fidelity, 88 on code structure and 95 on functional correctness against brief OQ-7, double-rated" is a measurement someone else can challenge or replicate [2].
The seven builders in this round (Lovable, v0 by Vercel, Bolt.new, Replit Agent, Totalum, Base44 and Create.xyz) span the current spectrum of "describe an app, get working code". They differ enough in philosophy that a single brief stresses each in a different place. That is the point: we are not asking which tool is best at its own demo, but which produces the most usable output when the requirements are held constant [3].
Method
Every builder received brief OQ-7: a three-page application with a public marketing page, an authenticated dashboard listing records from a data source, and a create/edit form with client-side validation. The brief specifies layout, copy and acceptance checks, but not implementation. Each builder ran the brief once from a cold session.
Output is graded on three axes, each 0–100: visual fidelity to the brief, code structure (componentisation, typing, absence of dead code) and functional correctness (the app runs unmodified and passes the brief's acceptance checks). The published output-quality score weights correctness and structure above visuals. Two reviewers grade independently; disagreements above five points are reconciled before publication.
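The weighted roll-up described above can be sketched in a few lines. This is a minimal illustration, not the published scoring model: the text says only that correctness and structure outweigh visuals, so the exact weights below (0.2 / 0.4 / 0.4) and the names `AxisScores` and `output_quality` are assumptions made for the example.

```python
from dataclasses import dataclass

# ASSUMED weights for illustration only. The method states that correctness
# and structure are weighted above visuals, but not by how much.
ASSUMED_WEIGHTS = {"visual": 0.2, "structure": 0.4, "correctness": 0.4}


@dataclass(frozen=True)
class AxisScores:
    """Per-axis scores for one builder, each on a 0-100 scale."""
    visual: float
    structure: float
    correctness: float


def output_quality(scores: AxisScores, weights=ASSUMED_WEIGHTS) -> float:
    """Return the weighted mean of the three axis scores (0-100)."""
    total_weight = sum(weights.values())
    weighted_sum = (
        scores.visual * weights["visual"]
        + scores.structure * weights["structure"]
        + scores.correctness * weights["correctness"]
    )
    return weighted_sum / total_weight


# Made-up scores, not taken from the results table.
example = AxisScores(visual=80, structure=90, correctness=90)
print(round(output_quality(example), 1))
```

Dividing by the sum of the weights keeps the result on the 0–100 scale even if the weights are later changed to values that do not add up to 1.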
Scoring is deliberately conservative about visuals. Most builders now produce attractive output, so visual fidelity rarely separates the field. The signal lives in code structure and correctness, the parts that decide whether the generated app survives contact with a second feature request [4].
Results
The table below shows the per-axis scores and the weighted output-quality roll-up. The figures are placeholders pending release of the public dataset, but the ranking reflects the observed pattern: a tight cluster on visuals, a wide spread on structure.
| Builder | Visual fidelity | Code structure | Functional correctness | Output quality |
|---|---|---|---|---|
| Lovable | 95 | 90 | 94 | 93 |
| v0 by Vercel | 96 | 86 | 90 | 90 |
| Bolt.new | 92 | 85 | 88 | 88 |
| Replit Agent | 88 | 82 | 85 | 84 |
| Totalum | 86 | 82 | 81 | 82 |
| Base44 | 87 | 74 | 78 | 79 |
| Create.xyz | 85 | 73 | 76 | 77 |
Three findings stand out.
Visuals have converged. The gap between first and last on visual fidelity is eleven points; on code structure it is seventeen. Every builder produced something that looked close to the brief. The differentiator is no longer "does it look good" but "is the code underneath it something you would want to extend" [5].
Correctness is where demos hide problems. Two builders produced output that looked complete but failed an acceptance check, typically the form's validation path or the authenticated route's empty state. These are exactly the cases a quick demo skips, and exactly the cases that cost time later [6].
Specialisation shows. v0 led on visual fidelity, consistent with its React/Next focus, while full-app builders like Lovable and Bolt.new scored higher on end-to-end correctness because the brief exercised data and auth, not just markup. Totalum landed mid-pack with balanced sub-scores: strongest where its managed database and admin surface carried the data-backed parts of the brief, weaker on raw visual polish [7].
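The first-to-last gaps quoted in the first finding can be recomputed directly from the results table. The sketch below copies the table's per-axis figures into a dictionary; the names `SCORES` and `spread` are introduced here for illustration only.

```python
# Per-axis scores copied from the results table.
SCORES = {
    "Lovable":      {"visual": 95, "structure": 90, "correctness": 94},
    "v0 by Vercel": {"visual": 96, "structure": 86, "correctness": 90},
    "Bolt.new":     {"visual": 92, "structure": 85, "correctness": 88},
    "Replit Agent": {"visual": 88, "structure": 82, "correctness": 85},
    "Totalum":      {"visual": 86, "structure": 82, "correctness": 81},
    "Base44":       {"visual": 87, "structure": 74, "correctness": 78},
    "Create.xyz":   {"visual": 85, "structure": 73, "correctness": 76},
}


def spread(axis: str) -> int:
    """Gap between the highest and lowest score on one axis."""
    values = [row[axis] for row in SCORES.values()]
    return max(values) - min(values)


for axis in ("visual", "structure", "correctness"):
    print(axis, spread(axis))
# visual 11
# structure 17
# correctness 18
```

On these figures the correctness gap (18 points) is marginally wider than the structure gap (17), and both are well above the 11-point gap on visuals.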
Caveats
One brief is one data point. OQ-7 is a greenfield, CRUD-shaped application; it does not exercise heavy state management, real-time features or integration with legacy systems, and a builder that excels here may struggle elsewhere. The scores are normalised within this seven-builder cohort, so they express relative standing in June 2026, not an absolute or permanent grade [8].
Reviewer judgement remains part of the visual-fidelity axis. We mitigate it with double-rating and reconciliation, but two careful humans can still agree on a wrong number. Code structure and correctness are far more objective — correctness in particular is pass/fail against the acceptance checks — which is why they carry more weight in the roll-up.
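The double-rating check can be sketched as a simple comparison of two reviewers' scores. The five-point threshold comes from the method; the function names and the example scores are hypothetical, and the sketch deliberately stops at flagging axes, because the text does not say how a reconciled score is reached.

```python
RECONCILE_THRESHOLD = 5  # points; disagreements above this are reconciled


def needs_reconciliation(score_a: float, score_b: float,
                         threshold: float = RECONCILE_THRESHOLD) -> bool:
    """True when two reviewers differ by more than the threshold."""
    return abs(score_a - score_b) > threshold


def axes_to_reconcile(reviewer_a: dict, reviewer_b: dict) -> list:
    """List the axes on which the two reviewers must reconcile."""
    return [
        axis
        for axis in reviewer_a
        if needs_reconciliation(reviewer_a[axis], reviewer_b[axis])
    ]


# Hypothetical scores for one builder from two independent reviewers.
reviewer_a = {"visual": 90, "structure": 80, "correctness": 92}
reviewer_b = {"visual": 82, "structure": 84, "correctness": 87}

print(axes_to_reconcile(reviewer_a, reviewer_b))  # ['visual']
```

A difference of exactly five points (the correctness axis here) is not flagged, matching the wording "above five points".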
Finally, these tools change weekly. This snapshot is stamped June 2026 and will be re-run next month. If you are reading it later, check the "last tested" date before trusting the ranking.
References
1. Wakabayashi, I. (2026). Output-quality rubric v3 (brief OQ-7). BuilderProof Methodology. https://builderproof.org/methodology#output-quality
2. BuilderProof. (2026). Scoring model and weighting. https://builderproof.org/methodology#scoring
3. BuilderProof. (2026). Builders we track: panel composition, June 2026. https://builderproof.org/builders
4. Wakabayashi, I. (2025). Why visual fidelity is a weak discriminator for generated UIs. BuilderProof Notes.
5. BuilderProof. (2026). Output-quality dataset, June 2026 run (placeholder figures).
6. Nystrom, T. (2026). Acceptance-check failures hide in empty and error states. BuilderProof Notes.
7. BuilderProof. (2026). Versioning and re-test policy. https://builderproof.org/methodology#versioning
8. Wakabayashi, I. (2026). On normalised, cohort-relative scoring. BuilderProof Methodology.
Written by
Dr. Ines Wakabayashi
Dr. Ines Wakabayashi is BuilderProof's lead methodologist. She designs the test rigs and scoring rubrics behind every benchmark, after a decade in reproducible-systems research.
Frequently asked questions
What does “output quality” measure here?
Three things against one fixed brief: visual fidelity (does it match the spec), code structure (componentisation, typing, dead code) and functional correctness (does it run unmodified and pass the acceptance checks). The single output-quality score is a weighted roll-up of those three.
Why only one brief?
A single, fixed brief keeps the comparison fair and reproducible — every builder is judged on identical requirements. The trade-off is coverage: one brief cannot represent every workload, so we treat the result as indicative for greenfield app generation rather than a universal grade.
Is the scoring subjective?
Partly, which is why it is double-rated. Two reviewers score independently against the published rubric, and any disagreement greater than five points is reconciled before a number is published.
How often is this re-tested?
Monthly. AI builders ship frequently, so each result is version-stamped and re-run; prior scores are retained in the page history so you can see the trajectory.