Methodology

How BuilderProof measures every score. Each method is published, version-stamped and re-run monthly so results stay reproducible and comparable across builders.

Output quality

How we score generated UI, code structure and functional correctness against a fixed brief.

Every builder receives the same fixed brief: a small multi-page app with a form, a list view and an authenticated route. We score the generated output on three axes: visual fidelity to the brief, code structure (componentisation, typing, dead code), and functional correctness when the app is run unmodified.

Scoring is double-rated: two reviewers grade independently against a published rubric, and disagreements above five points are reconciled before a sub-score is published.
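
The reconciliation rule can be sketched in a few lines. This is an illustration only: the five-point threshold comes from the text above, while the function names and the choice to average the two grades once they agree are assumptions of the sketch, not BuilderProof's published procedure.

```python
from statistics import mean
from typing import Optional

# "Disagreements above five points" per the rubric description.
RECONCILE_THRESHOLD = 5


def needs_reconciliation(grade_a: float, grade_b: float) -> bool:
    """True when two independent grades differ by more than the threshold."""
    return abs(grade_a - grade_b) > RECONCILE_THRESHOLD


def sub_score(grade_a: float, grade_b: float) -> Optional[float]:
    """Return a publishable sub-score, or None while reconciliation is pending.

    Averaging the two grades is an assumption made for this sketch.
    """
    if needs_reconciliation(grade_a, grade_b):
        return None
    return mean([grade_a, grade_b])
```

A gap of exactly five points passes without reconciliation here, because the text says "above five points".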

Caveat

A single brief cannot capture every workload. Treat output quality as indicative for greenfield app generation, not bespoke or legacy integration work.

Speed

The protocol for measuring speed-to-first-paint and time-to-working-app across builders.

We measure two clocks. Speed-to-first-paint is the wall-clock time from prompt submission to the first rendered preview. Time-to-working-app is the time until the brief's acceptance checks all pass without manual edits.

Each builder is run five times from a cold session on the same network profile; we report the median to dampen cold-start and queue variance.
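
As a rough illustration of why the median is reported, the sketch below compares it with the mean on five timings. The figures are invented for the example and the function name is hypothetical.

```python
from statistics import mean, median
from typing import List

RUNS_PER_BUILDER = 5  # five cold-session runs, as described above


def reported_seconds(run_seconds: List[float]) -> float:
    """Median of the five cold-session timings for one clock."""
    if len(run_seconds) != RUNS_PER_BUILDER:
        raise ValueError(f"expected {RUNS_PER_BUILDER} runs, got {len(run_seconds)}")
    return median(run_seconds)


# Invented speed-to-first-paint timings; the third run hit a slow cold start.
first_paint = [4.1, 3.8, 9.6, 4.0, 4.3]

print(reported_seconds(first_paint))  # median is unaffected by the outlier
print(round(mean(first_paint), 2))    # mean is pulled upwards by it
```

One slow run out of five shifts the mean by more than a second in this example but leaves the median untouched.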

Caveat

Network latency and provider load vary by region and time of day. Absolute seconds will differ from your environment; the ranking between builders is the durable signal.

Deploy quality

Lighthouse performance, axe accessibility and SEO audits run against each deployed output.

The deployed output of each builder is audited with Lighthouse (performance), axe-core (accessibility) and a structured SEO checklist (metadata, semantic landmarks, crawlability). Scores are captured against the builder's own default deploy target.

We audit the production build, not the in-editor preview, because that is what real users receive.
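
For readers who want to reproduce an audit, the sketch below shows one way to read the two tool outputs once they are saved as JSON. It assumes the standard report shapes: a Lighthouse report with `categories.performance.score` on a 0 to 1 scale, and an axe-core result with a `violations` list whose entries carry an `impact` level. The function names are hypothetical, and how BuilderProof converts these into a sub-score is not shown here.

```python
from collections import Counter
from typing import Dict, Optional


def lighthouse_performance(report: dict) -> Optional[int]:
    """Lighthouse performance category as 0-100, or None if it was not scored."""
    score = report["categories"]["performance"]["score"]
    return None if score is None else round(score * 100)


def axe_violations_by_impact(results: dict) -> Dict[str, int]:
    """Count axe-core violations per impact level (minor, moderate, serious, critical)."""
    return dict(Counter(v["impact"] for v in results["violations"]))


# Trimmed, invented examples of the two JSON shapes.
sample_lighthouse = {"categories": {"performance": {"score": 0.87}}}
sample_axe = {
    "violations": [
        {"id": "color-contrast", "impact": "serious"},
        {"id": "image-alt", "impact": "critical"},
        {"id": "link-name", "impact": "serious"},
    ]
}

print(lighthouse_performance(sample_lighthouse))
print(axe_violations_by_impact(sample_axe))
```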

Caveat

Audits reflect the default output. A skilled developer can lift any of these scores after export; we measure what ships out of the box.

Agency suitability

Whitelabel, MCP support, API surface and code portability: what matters when you build for clients.

For teams building on behalf of clients, we score four capabilities: whitelabel (can the builder's branding be fully removed), MCP support (does the builder expose a Model Context Protocol interface), API surface (breadth and stability of the public API), and export/portability of the generated code.

Each capability is verified hands-on against current documentation, not marketing claims.

Caveat

Agency needs are heterogeneous. The sub-scores are weighted for a generalist agency; reweight them for your own priorities.

Versioning & re-tests

Why every result is version-stamped and re-tested monthly as builders ship updates.

AI builders ship weekly. A benchmark without a date is misleading, so every result on BuilderProof is stamped with the builder version (or test date when no version is exposed) and re-run on a monthly cadence.

When a re-test moves a score, the prior figure is retained in the page history so readers can see the trajectory, not just the latest snapshot.
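
One possible way to model a version-stamped result with retained history is sketched below. The class and field names are hypothetical and the dates are invented; the sketch only mirrors the two rules stated above: fall back to the test date when no version is exposed, and keep the prior figure whenever a re-test moves the score.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Stamp:
    score: float
    tested_on: date
    builder_version: Optional[str] = None  # None when the builder exposes no version

    def label(self) -> str:
        """Builder version if exposed, otherwise the test date."""
        return self.builder_version or f"tested {self.tested_on.isoformat()}"


@dataclass
class Result:
    current: Stamp
    history: List[Stamp] = field(default_factory=list)

    def retest(self, new: Stamp) -> None:
        """Record a monthly re-test, retaining the prior figure if the score moved."""
        if new.score != self.current.score:
            self.history.append(self.current)
        self.current = new


result = Result(Stamp(71.0, date(2024, 1, 8), builder_version="1.2"))
result.retest(Stamp(74.0, date(2024, 2, 5)))  # no version exposed this time
print(result.current.label(), [s.score for s in result.history])
```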

Caveat

Between monthly runs a builder may have shipped changes we have not yet captured. Always check the 'last tested' stamp on a result.

Scoring model

How sub-scores are weighted and rolled up into a single, comparable Total Score.

The single Total Score is a weighted roll-up of the sub-scores: output quality and deploy quality carry the most weight, followed by speed and agency suitability. Weights are published and fixed for a benchmark generation so scores remain comparable month to month.
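
A weighted roll-up of this kind can be sketched as follows. The weights below are placeholders chosen only to respect the ordering described above (output and deploy quality highest, then speed and agency suitability); they are not the published BuilderProof weights, and the key names are hypothetical.

```python
import math
from typing import Dict

# Placeholder weights for illustration only; NOT the published values.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "output_quality": 0.35,
    "deploy_quality": 0.30,
    "speed": 0.20,
    "agency_suitability": 0.15,
}


def total_score(sub_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of 0-100 sub-scores; weights must cover every sub-score and sum to 1."""
    if set(sub_scores) != set(weights):
        raise ValueError("sub-scores and weights must use the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * sub_scores[name] for name in weights)


example = {
    "output_quality": 80,
    "deploy_quality": 70,
    "speed": 60,
    "agency_suitability": 50,
}
print(total_score(example, EXAMPLE_WEIGHTS))  # 28 + 21 + 12 + 7.5
```

Passing a different `weights` mapping is all that is needed to re-rank the same sub-scores under your own priorities.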

Sub-scores are normalised to a 0–100 scale against the panel, so a score expresses relative standing within the tested cohort, not an absolute grade.
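
The text does not say which normalisation is used, so the sketch below assumes simple min-max scaling across the panel purely for illustration. The function name, the `higher_is_better` flag (useful for timings, where lower raw values are better) and the handling of a fully tied panel are all assumptions.

```python
from typing import Dict


def normalise_against_panel(
    raw: Dict[str, float], higher_is_better: bool = True
) -> Dict[str, float]:
    """Min-max scale raw values to 0-100 relative to the tested cohort (assumed method)."""
    lowest, highest = min(raw.values()), max(raw.values())
    if highest == lowest:
        # Arbitrary choice for this sketch: a fully tied panel all scores 100.
        return {name: 100.0 for name in raw}
    scaled = {
        name: (value - lowest) / (highest - lowest) * 100
        for name, value in raw.items()
    }
    if not higher_is_better:
        scaled = {name: 100 - value for name, value in scaled.items()}
    return scaled


# Invented raw values for three hypothetical builders.
print(normalise_against_panel({"A": 40, "B": 55, "C": 70}))
print(normalise_against_panel({"A": 40, "B": 55, "C": 70}, higher_is_better=False))
```

Under this assumption the best builder in the cohort always lands on 100 and the worst on 0, which is why a score reads as relative standing and shifts when the panel changes.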

Caveat

Weights encode an editorial judgement about what matters most. They are documented precisely so you can disagree with them transparently.

See the methods applied

Every published benchmark links back to the method it uses, with sourced figures and a full references section.
