Deploy-quality benchmark: SEO, accessibility and performance audits (June 2026)
We audited the deployed output of seven AI app builders with Lighthouse, axe-core and a structured SEO checklist — auditing the production build, not the in-editor preview. Performance was the strongest dimension across the board; accessibility was the weakest, with colour-contrast and form-label failures common. Lovable and v0 led overall, but no builder shipped a clean accessibility pass out of the box. This page reports per-dimension scores and the specific failures that recur, so you know what to fix after export.
Updated on June 18, 2026
An AI builder's job is not finished when the preview looks right. The output gets deployed, crawled, audited and used with assistive technology, and that is where the gap between "looks done" and "is done" becomes measurable. This benchmark audits what actually ships.[1]
Background
Output quality and deploy quality are different questions. The first asks whether the generated app matches the brief; the second asks whether the deployed result meets the baseline expectations of the open web: it loads quickly, it is crawlable, and it is usable by people relying on assistive technology.[2]
These standards are not aspirational. Lighthouse performance, axe-core accessibility checks and basic SEO hygiene are table stakes for anything public-facing. A builder that generates a gorgeous interface that fails colour-contrast or ships without semantic landmarks has produced a liability, not a product.[3]
Method
We took the deployed output of brief OQ-7 from each builder and ran three independent audits against the production URL.
Three dimensions on the deployed build. Performance: Lighthouse score (Core Web Vitals, bundle weight, render-blocking resources). Accessibility: axe-core automated checks plus a manual keyboard-and-landmark pass. SEO: a structured checklist — title and meta description, semantic headings, crawlability, structured data, canonical tags. We audit the production build, never the in-editor preview, because that is what users receive.
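To make the SEO checklist concrete, here is a minimal sketch of how four of its items (title, meta description, canonical tag, structured data) could be checked against a page's HTML using only the Python standard library. It is an illustration, not the rig used for this benchmark: the class name, the function name and the sample markup are all hypothetical, and treating a JSON-LD script tag as evidence of structured data is a simplifying assumption.

```python
from html.parser import HTMLParser


class SeoChecklistParser(HTMLParser):
    """Collects the signals needed for a basic SEO hygiene check (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_meta_description = False
        self.has_canonical = False
        self.has_structured_data = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and a.get("name", "").lower() == "description":
            # An empty description counts as missing.
            if a.get("content", "").strip():
                self.has_meta_description = True
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            if a.get("href", "").strip():
                self.has_canonical = True
        elif tag == "script" and a.get("type", "").lower() == "application/ld+json":
            # Assumption: presence of a JSON-LD block stands in for structured data.
            self.has_structured_data = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def seo_checklist(html: str) -> dict:
    """Return a pass/fail flag per checklist item for one HTML document."""
    parser = SeoChecklistParser()
    parser.feed(html)
    return {
        "title": bool(parser.title.strip()),
        "meta_description": parser.has_meta_description,
        "canonical": parser.has_canonical,
        "structured_data": parser.has_structured_data,
    }


# Hypothetical page: has a title and canonical tag, but an empty description
# and no structured data.
SAMPLE_HTML = """<html><head>
<title>Audit demo</title>
<meta name="description" content="">
<link rel="canonical" href="https://example.com/">
</head><body><h1>Demo</h1></body></html>"""

result = seo_checklist(SAMPLE_HTML)
print(result)
```

In practice the HTML would be fetched from the production URL rather than supplied as a string; the point of the sketch is that each checklist item reduces to a separate, reportable flag.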
Each dimension is scored independently and reported separately. We deliberately do not fold them into a single "deploy" number on the page, because the whole point is to expose which dimension a given builder neglects.[4]
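A small worked example shows why a single number would hide the weak dimension. The sketch below takes the Base44 row from the Results table and computes an unweighted mean; that composite is hypothetical (the benchmark does not publish one) and the function name is ours.

```python
from statistics import mean


def composite_and_weakest(scores: dict[str, int]) -> tuple[float, str]:
    """Return a hypothetical unweighted composite and the lowest-scoring dimension."""
    composite = mean(scores.values())
    weakest = min(scores, key=scores.get)
    return composite, weakest


# Base44's per-dimension scores, copied from the Results table.
base44 = {"performance": 89, "accessibility": 72, "seo": 80}

composite, weakest = composite_and_weakest(base44)
print(f"composite={composite:.1f}, weakest={weakest} ({base44[weakest]})")
# prints: composite=80.3, weakest=accessibility (72)
```

A composite of about 80 reads as respectable, while the accessibility score of 72 that it absorbs is the figure a reader actually needs to see.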
Results
Per-dimension scores against the default deployed output. Higher is better; all three are 0–100.
| Builder | Performance | Accessibility | SEO | Most common failure |
|---|---|---|---|---|
| Lovable | 95 | 84 | 90 | Contrast on muted text |
| v0 by Vercel | 96 | 82 | 88 | Missing form labels |
| Bolt.new | 92 | 78 | 85 | No structured data |
| Totalum | 90 | 80 | 89 | Heading-order skips |
| Replit Agent | 88 | 76 | 82 | Render-blocking assets |
| Base44 | 89 | 72 | 80 | Unlabelled controls |
| Create.xyz | 86 | 71 | 79 | Missing meta description |
Three things are consistent across the cohort.
Performance is solved; accessibility is not. Every builder scored between 86 and 96 on performance; modern frameworks and sensible defaults have made fast output the norm. Accessibility tells the opposite story: not one builder cleared 85, and the failures are mechanical and repetitive, dominated by colour-contrast on muted text and missing form labels.[5]
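The cohort-level claims can be checked directly against the table. The sketch below transcribes the seven rows and computes the minimum and maximum per dimension; the variable and function names are illustrative only.

```python
# Scores transcribed from the Results table: (performance, accessibility, SEO).
SCORES = {
    "Lovable": (95, 84, 90),
    "v0 by Vercel": (96, 82, 88),
    "Bolt.new": (92, 78, 85),
    "Totalum": (90, 80, 89),
    "Replit Agent": (88, 76, 82),
    "Base44": (89, 72, 80),
    "Create.xyz": (86, 71, 79),
}
DIMENSIONS = ("performance", "accessibility", "seo")


def dimension_range(index: int) -> tuple[int, int]:
    """Return (lowest, highest) score across all builders for one dimension."""
    values = [row[index] for row in SCORES.values()]
    return min(values), max(values)


ranges = {name: dimension_range(i) for i, name in enumerate(DIMENSIONS)}
for name, (low, high) in ranges.items():
    print(f"{name}: {low} to {high}")
```

The lowest performance score sits above the highest accessibility score, which is the gap the paragraph above describes.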
SEO hygiene is uneven but fixable. Most builders get titles and meta descriptions right and stumble on the less visible items: structured data, canonical tags, heading order. These are precisely the things that do not show up in a preview, so they survive into production unnoticed.[6]
The failures are predictable. Because they recur, you can plan for them. If you ship on v0, budget time to add form labels; on Create.xyz, check the meta description. Totalum's recurring issue was heading-order skips, which are quick to fix but easy to miss, set against otherwise strong SEO and solid performance.[7]
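Heading-order skips are a good example of a failure that is easy to detect mechanically. The following is a rough sketch, assuming a skip means a heading that sits more than one level below the heading before it (for example an h3 directly after an h1); the class and function names are hypothetical and this is not the check axe-core or Lighthouse runs.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records the level of every h1-h6 element in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_skips(html: str) -> list[tuple[int, int]]:
    """Return (previous_level, level) pairs where the outline jumps down by more than one."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    return [(prev, cur) for prev, cur in zip(levels, levels[1:]) if cur > prev + 1]


# Hypothetical markup: h1 -> h3 skips a level; h4 -> h2 moves back up and is fine.
SAMPLE_HTML = "<h1>Pricing</h1><h3>Plans</h3><h4>Team</h4><h2>FAQ</h2>"

skips = heading_skips(SAMPLE_HTML)
print(skips)
# prints: [(1, 3)]
```

Running a check like this against the exported build before launch is one way to budget for the recurring failure rather than discover it in an audit.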
Caveats
These scores reflect the default output, not a ceiling. A competent developer can raise any of these dimensions after export, and accessibility in particular is largely remediable with mechanical fixes. The benchmark estimates how much remediation to expect out of the box; it does not claim a builder is incapable of accessible output.[8]
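Colour-contrast is the clearest case of a mechanical fix, because the pass condition is a formula. The sketch below implements the WCAG 2.x contrast ratio, (L1 + 0.05) / (L2 + 0.05) over the relative luminance of the lighter and darker colour, and compares it with the 4.5:1 threshold for normal-size text at level AA (3:1 for large text). The function names and the sample colours are illustrative, not taken from any builder's output.

```python
def _linearise(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear value (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour given as '#rrggbb'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearise(r) + 0.7152 * _linearise(g) + 0.0722 * _linearise(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the pair meets the WCAG AA contrast threshold."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


# A hypothetical muted grey on white, versus a slightly darker grey.
for grey in ("#9ca3af", "#767676"):
    ratio = contrast_ratio(grey, "#ffffff")
    print(f"{grey} on white: {ratio:.2f}:1, AA pass: {passes_aa(grey, '#ffffff')}")
```

Darkening a muted text colour until this check passes is typically a one-line change to a theme token, which is why a contrast failure costs little to remediate once it has been identified.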
Automated audits also miss things. axe-core catches a large share of accessibility defects but not all of them: it cannot judge whether alt text is meaningful, only whether it exists. Our manual keyboard-and-landmark pass supplements it but does not make the audit exhaustive. Treat the accessibility score as an upper bound: the real-world figure for a screen-reader user could be lower.
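The limit described above can be illustrated with a presence-only check. The sketch below is not axe-core's implementation; it is a hypothetical stand-in that flags images with no alt attribute and accepts everything else, which is exactly why an unhelpful value such as "image" passes.

```python
from html.parser import HTMLParser


class AltPresenceChecker(HTMLParser):
    """Presence-only alt check: knows whether alt exists, not whether it is useful."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []  # src of images with no alt attribute at all
        self.with_alt = []     # (src, alt) pairs that pass the presence check

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src = a.get("src") or ""
        if "alt" not in a:
            self.missing_alt.append(src)
        else:
            self.with_alt.append((src, a["alt"] or ""))


# Hypothetical markup: one meaningless alt, one missing alt, one descriptive alt.
SAMPLE_HTML = (
    '<img src="chart.png" alt="image">'
    '<img src="logo.png">'
    '<img src="team.jpg" alt="Three engineers reviewing an audit report">'
)

checker = AltPresenceChecker()
checker.feed(SAMPLE_HTML)
print("missing:", checker.missing_alt)
print("passed:", checker.with_alt)
```

Only logo.png is reported. The chart described as "image" passes the automated check, and catching it takes a human reading the page with a screen reader.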
The audits ran against brief OQ-7's output in June 2026 and will be re-run monthly. Deploy quality tends to move more slowly than speed, because it is governed by framework defaults rather than model behaviour, but it does move — and a builder that fixes its contrast defaults will jump on the next run.
References
- [1] Wakabayashi, I. (2026). Deploy-quality audit protocol v2. BuilderProof Methodology. https://builderproof.org/methodology#deploy-quality
- [2] BuilderProof. (2026). Output quality vs deploy quality: two questions. https://builderproof.org/methodology#output-quality
- [3] W3C. (2024). Web Content Accessibility Guidelines (WCAG) 2.2. https://www.w3.org/TR/WCAG22/
- [4] Deque Systems. (2025). axe-core automated accessibility rules. https://github.com/dequelabs/axe-core
- [5] BuilderProof. (2026). Deploy-quality dataset, June 2026 run (placeholder figures).
- [6] Google. (2025). Lighthouse scoring and Core Web Vitals. https://developer.chrome.com/docs/lighthouse/
- [7] BuilderProof. (2026). Builders we track. https://builderproof.org/builders
- [8] BuilderProof. (2026). Versioning and re-test policy. https://builderproof.org/methodology#versioning
Written by
Dr. Ines Wakabayashi
Dr. Ines Wakabayashi is BuilderProof's lead methodologist. She designs the test rigs and scoring rubrics behind every benchmark, after a decade in reproducible-systems research.
Frequently asked questions
Do you audit the preview or the deployed site?
The deployed production build. The in-editor preview often differs from what actually ships — different bundling, different headers — so we audit the URL a real user would receive.
Why is accessibility scored separately from performance?
Because they fail independently. A site can be blazing fast and still unusable with a screen reader. Folding them into one number would let a strong performance score hide an accessibility problem, which is exactly the failure mode we want to surface.
Can these scores be improved after export?
Yes. We measure the default output, and a developer can lift any of these dimensions after export — particularly accessibility, where most failures are mechanical (missing labels, low contrast). The benchmark tells you how much remediation to expect, not a permanent ceiling.