BuilderProof editorial team

Deploy quality: why two Lighthouse runs disagree (2026)

Two Lighthouse runs on the same deployed AI-builder output rarely return the same score, and that is the most-contested observation in the lab notebook for the four June 2026 BuilderProof axes. This note documents the variance phenomenon, the reproducibility protocol the next iteration will adopt, and where median-of-five runs out of road. No score from the published table is changed.

Updated on June 21, 2026

Open lab notebook page with two side-by-side Lighthouse score panels showing different numbers for the same web page, plus a small distribution plot with a median line
Quick Answer

Two Lighthouse runs on the same deployed AI-builder output rarely return the same score, and a fluctuation of 5 to 10 points between consecutive runs on an unchanged page is normal, not a sign of a regression. This lab note, dated June 21, 2026, documents the variance phenomenon as it shows up in the BuilderProof deploy-quality cohort, lists its seven primary sources as given in Google's own variability documentation, and publishes the median-of-five reproducibility protocol the next iteration of the axis will adopt. No score in the June 2026 deploy-quality table is changed by this note. The next iteration will report median plus interquartile range instead of a single number.

The single most-contested observation in the lab notebook for the four June 2026 axes is not which builder wins on deploy quality. It is that two operators running Lighthouse against the same deployed artifact, on different machines and different days, repeatedly produce different scores. [1]

This is not a flaw in any builder. It is a property of the measurement tool, and it has been documented by the Lighthouse team itself for years. But the property has implications for how a benchmark should report deploy-quality numbers, and the implications were not addressed in the June 2026 deploy-quality axis as published. This lab note addresses them now.

What this post is and is not

This is a reproducibility note. It does not re-score any builder, it does not change the June 2026 deploy-quality results, and it does not retract any number in that table. What it does is publish the protocol the next iteration of the axis will follow, so a reader who wants to reproduce or contest the deploy-quality numbers has a documented procedure to do so.

The operator-disclosure footer on every BuilderProof page notes that this publication is operated by https://www.totalum.app. Totalum is one of the seven builders in the deploy-quality cohort. The reproducibility protocol below applies symmetrically to all seven; the operator does not receive a different one.

The phenomenon

Lighthouse scores a deployed page by category. Current releases report four: performance, accessibility, best practices, and SEO (the progressive-web-app category was removed in Lighthouse 12). The BuilderProof deploy-quality axis reports three of those (performance, accessibility, SEO) plus a structured SEO checklist that goes beyond the Lighthouse SEO category.

The Lighthouse team's official variability documentation states the constraint directly: "The median Lighthouse score of 5 runs is twice as stable as 1 run." [2] The corollary is that a single Lighthouse run is, by Google's own measurement of its own tool, the least stable reporting unit available.
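To get a feel for what "twice as stable" can look like, here is a small simulation sketch. It assumes each run's score is a fixed true score plus Gaussian noise, which is an illustrative assumption and not a claim about how Lighthouse noise is actually distributed; the true score of 85 and the 4-point noise level are invented values.

```python
import random
import statistics

def spread_of_single_vs_median(true_score=85.0, noise_sd=4.0, runs=5,
                               trials=20_000, seed=0):
    """Compare the spread of single-run scores with the spread of median-of-N scores.

    Assumption: run-to-run noise is Gaussian. Real Lighthouse noise need not be.
    """
    rng = random.Random(seed)
    singles, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(true_score, noise_sd) for _ in range(runs)]
        singles.append(sample[0])                  # what a single-run table reports
        medians.append(statistics.median(sample))  # what a median-of-N table reports
    return statistics.stdev(singles), statistics.stdev(medians)

single_sd, median_sd = spread_of_single_vs_median()
ratio = median_sd / single_sd
print(f"single-run SD: {single_sd:.2f}, median-of-5 SD: {median_sd:.2f}, ratio: {ratio:.2f}")
```

Under this toy noise model the median-of-five spread comes out at roughly half the single-run spread, which is in line with the quoted guidance. A heavier-tailed noise model would give a different ratio.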

Independent practitioners report drift of a similar magnitude. A 2026 community write-up summarising the published guidance puts it this way: "a score fluctuation of 5 to 10 points between consecutive Lighthouse runs on the same page is completely normal. A fluctuation of 15 points or more" usually indicates a real change. [3] The June 2026 deploy-quality table reported single-run numbers for each builder, so a reader running their own Lighthouse audit against the same artifact may see a number 5 to 10 points different from the table without anything having changed.

Seven sources of variability

A reproducibility protocol is only useful if it names what it is trying to control. Google's variability documentation lists seven primary sources of Lighthouse score variability. Translated to the AI-builder benchmark context, with notes on which sources the protocol can address and which it cannot:

| # | Source | Definition (paraphrased from Google docs) | Can the protocol control it? |
|---|--------|-------------------------------------------|------------------------------|
| 1 | Page nondeterminism | A-B tests, randomised layouts, variable ad experiences served by the deployed page itself | Partially: the protocol can disable A-B tests on the operator side; vendor-side randomness has to be accepted |
| 2 | Local network variability | Packet loss, traffic shaping, bandwidth congestion on the operator's network | Yes: pin to a controlled environment (loopback, internal network, or a fixed throttling profile) |
| 3 | Tier-1 network variability | Cross-geo latency between the operator and the deployed origin | Partially: pin the audit location, fix the throttling, accept residual cross-geo noise |
| 4 | Web-server variability | Inconsistent response delay from the origin or its CDN | No: this is a property of the builder's deploy target |
| 5 | Client hardware variability | Processor and memory differences between audit machines | Yes: fix the audit machine spec (per Google's recommendation, minimum 2 dedicated cores and 2 GB RAM; 4 cores plus 4 to 8 GB preferred) |
| 6 | Client resource contention | Background processes, browser extensions, anti-virus interfering with the audit | Yes: clean profile, no extensions, no concurrent Lighthouse runs |
| 7 | Browser nondeterminism | Inherent execution-order variability in the browser engine itself | No: this is the irreducible floor the median-of-N approach is designed to absorb |

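Four of the seven sources (page nondeterminism, local network variability, client hardware variability, and client resource contention) are addressable, at least in part, by the operator running the audit. The other three (web-server variability, the cross-geo latency that remains after the audit location is pinned, and browser nondeterminism) are not, and the median-of-five protocol is the residual mitigation for those.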
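The last column of the table can also be kept as data, for example to annotate a run-log with which sources were pinned for a given pass. A minimal sketch follows; the `Control` enum, the `SOURCES` mapping, and `sources_by_control` are hypothetical names for illustration, not part of any BuilderProof tooling.

```python
from collections import Counter
from enum import Enum

class Control(Enum):
    YES = "yes"
    PARTIAL = "partially"
    NO = "no"

# Source number -> (name, control level), transcribed from the table above.
SOURCES = {
    1: ("Page nondeterminism", Control.PARTIAL),
    2: ("Local network variability", Control.YES),
    3: ("Tier-1 network variability", Control.PARTIAL),
    4: ("Web-server variability", Control.NO),
    5: ("Client hardware variability", Control.YES),
    6: ("Client resource contention", Control.YES),
    7: ("Browser nondeterminism", Control.NO),
}

def sources_by_control(level):
    """Return the source numbers the table marks with the given control level."""
    return [num for num, (_, ctl) in SOURCES.items() if ctl is level]

# Tally of how many sources fall under each control level.
tally = Counter(ctl for _, ctl in SOURCES.values())

for level in Control:
    print(level.value, sources_by_control(level))
```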

How the v1 deploy-quality table was produced

In the interest of full disclosure, the June 2026 deploy-quality table was produced as follows.

  • Each builder's OQ-7 deployed artifact was audited from a single operator machine, in a single audit session, with a single Lighthouse run per dimension per builder.
  • Lighthouse throttling was Simulated (the default), running on Chrome 138 on macOS with no extensions enabled.
  • Audits were performed in sequence within a single afternoon to minimise cross-day drift.
  • Reported numbers are the single-run scores Lighthouse returned. No median was computed.
  • Accessibility used axe-core run in headless Chrome alongside Lighthouse, plus a manual keyboard-and-landmark pass; that part of the methodology is more reproducible because axe-core failures are categorical and not score-based.

The single-run approach was a deliberate choice for the first axis pass and was disclosed in the methodology in the form of the per-row "most common failure" column, which is robust to variance. It was the right level of investment for a v1 publication. It is the wrong level of investment for a v2 publication, and the protocol below upgrades it.

The protocol the next iteration will adopt

The next iteration of the deploy-quality axis will follow this protocol on each of the seven builders in the cohort. The protocol is published here so that an external party reproducing or contesting the next deploy-quality table has a documented procedure to compare against.

Deploy-quality reproducibility protocol (v2-draft, June 21, 2026)
  1. Five runs per dimension per builder. Each builder gets five Lighthouse runs for each dimension (performance, accessibility, SEO), for a total of 7 builders × 3 dimensions × 5 runs = 105 Lighthouse runs per axis pass. axe-core results are categorical, not score-based, and stay at one run per builder.
  2. Median plus interquartile range. Each dimension cell in the table reports the median of the five runs, with the interquartile range (Q3 minus Q1) noted in a hover tooltip and a footnote. The single-number historical view stays available behind a toggle for continuity.
  3. Fixed audit machine. Audits run on a fixed laptop spec recorded in the methodology page (4 dedicated CPU cores, 8 GB RAM, no extensions, single user profile, no concurrent processes, no other Lighthouse instance running). This addresses sources 5 and 6 from the table above.
  4. Fixed throttling profile. Lighthouse Simulated throttling with the default Slow 4G profile, locked to that profile for every builder in the run. No mobile-vs-desktop mixing within a single axis cell.
  5. Fixed audit location. A single geographic origin and a single ISP per axis pass. The location is recorded in the run-log metadata so that a counter-result from a different location can be evaluated against the same constraint.
  6. Operator-side A-B disabled. Where the operator running Lighthouse has any controlled randomisation (browser cookies, locale, viewport variants), it is fixed for the audit. This addresses part of source 1 from the table above.
  7. Cooldown between runs. A 30-second pause between runs on the same builder to let the operator machine return to a steady state, per the Lighthouse team's recommendation against concurrent runs.
  8. Run-log published. The full per-run JSON output for every Lighthouse run that contributes to a published number is archived in a public run-log so a reader can recompute the median, the IQR, or any other statistic of interest.
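Steps 1, 4, 7, and 8 lend themselves to a small driver script. The sketch below assumes the Lighthouse CLI (the `lighthouse` npm package) is installed and on the PATH; the `audit_cell` function, the file-naming scheme, and the log directory are hypothetical. The fixed machine and fixed location (steps 3 and 5) cannot be enforced from code and are not covered here.

```python
import subprocess
import time
from pathlib import Path

RUNS_PER_CELL = 5       # protocol step 1
COOLDOWN_SECONDS = 30   # protocol step 7

def build_command(url, category, out_path):
    """Build one Lighthouse CLI invocation for a single category."""
    return [
        "lighthouse", url,
        f"--only-categories={category}",   # "performance", "accessibility", or "seo"
        "--throttling-method=simulate",    # protocol step 4: Simulated throttling
        "--output=json",
        f"--output-path={out_path}",       # protocol step 8: keep the raw JSON
        "--chrome-flags=--headless",
        "--quiet",
    ]

def audit_cell(url, category, log_dir, runner=subprocess.run, sleep=time.sleep):
    """Run the five sequential audits for one builder/dimension cell.

    `runner` and `sleep` are injectable so the loop can be dry-run without
    launching Chrome or waiting for the cooldown.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for run in range(1, RUNS_PER_CELL + 1):
        out_path = log_dir / f"{category}-run{run}.json"
        runner(build_command(url, category, str(out_path)), check=True)
        outputs.append(out_path)
        if run < RUNS_PER_CELL:
            sleep(COOLDOWN_SECONDS)  # runs never overlap; the machine gets time to settle
    return outputs
```

A call such as `audit_cell("https://example.com", "performance", "run-log/builder-a")` would leave five JSON reports in that directory, one per run.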

The protocol does not eliminate variance. It controls four of the seven sources of variance and reports the residual three honestly through the median-and-IQR pair. A counter-result that disagrees with a published median is now a falsifiable claim that has to be argued against the protocol, not against a single number.
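Contesting a published median starts with recomputing it from the archived run-log. The sketch below relies on the Lighthouse JSON report storing each category score as a value between 0 and 1 under `categories.<id>.score`; the directory layout and the `<category>-run<N>.json` naming are assumptions, since the run-log format has not been published yet.

```python
import json
import statistics
from pathlib import Path

def category_score(report, category):
    """Return the 0-100 score for one category of a parsed Lighthouse report."""
    score = report["categories"][category]["score"]  # stored as 0-1, or None on error
    if score is None:
        raise ValueError(f"run produced no {category} score")
    return round(score * 100)

def load_scores(log_dir, category):
    """Read every archived run for one cell and return its scores in file order."""
    paths = sorted(Path(log_dir).glob(f"{category}-run*.json"))  # hypothetical naming
    return [category_score(json.loads(p.read_text()), category) for p in paths]

def recomputed_median(log_dir, category):
    """Median of the archived runs, for comparison against the published cell."""
    return statistics.median(load_scores(log_dir, category))
```

Because the raw scores are returned as a plain list, any other statistic of interest can be computed from the same call.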

Worked example: when median changes the recommendation

A worked example shows why the median-and-IQR pair matters more than the single number.

Suppose two hypothetical builders (call them X and Y) are audited on the deploy-quality performance dimension. Each gets five Lighthouse runs under the protocol above.

| Run | Builder X | Builder Y |
|-----|-----------|-----------|
| 1 | 91 | 94 |
| 2 | 89 | 86 |
| 3 | 90 | 92 |
| 4 | 87 | 81 |
| 5 | 92 | 90 |

Single-run reporting (the v1 approach). If only run 1 is reported, X scores 91 and Y scores 94, and Y looks better by 3 points. If only run 4 is reported, X scores 87 and Y scores 81, and X looks better by 6 points. The single-run reporting is unstable enough that, on this synthetic data, the direction of the comparison flips depending on which run the operator picks.
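The flip can be checked run by run. A short sketch over the same synthetic scores; `single_run_verdicts` is a hypothetical helper written for this illustration.

```python
# Synthetic scores from the table above, in run order.
x = [91, 89, 90, 87, 92]
y = [94, 86, 92, 81, 90]

def single_run_verdicts(a, b, name_a="X", name_b="Y"):
    """For each run, report which builder a single-run table would favour, and by how much."""
    verdicts = []
    for run, (score_a, score_b) in enumerate(zip(a, b), start=1):
        if score_a == score_b:
            winner = "tie"
        else:
            winner = name_a if score_a > score_b else name_b
        verdicts.append((run, winner, abs(score_a - score_b)))
    return verdicts

for run, winner, margin in single_run_verdicts(x, y):
    print(f"run {run}: {winner} by {margin}")
```

On this data, Y is ahead in runs 1 and 3 and X is ahead in runs 2, 4, and 5, so which builder "wins" depends entirely on which run is reported.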

Median-of-five (the v2 protocol). X's runs sorted are 87, 89, 90, 91, 92; median 90, Q1 89, Q3 91, IQR 2. Y's runs sorted are 81, 86, 90, 92, 94; median 90, Q1 86, Q3 92, IQR 6. The medians are tied at 90. The IQR is informative: X is much more stable (IQR 2) than Y (IQR 6) on this dimension. A reader optimising for predictable performance picks X; a reader optimising for peak performance and tolerant of variance picks Y. Both readings are defensible from the same five runs, and both are obscured by single-number reporting.

The published table cell under the v2 protocol would read something like "90 (IQR 2)" for X and "90 (IQR 6)" for Y. The interesting comparison moves from the median to the IQR, which is what the data actually supports.
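The medians, quartiles, and cell strings above can be reproduced with Python's standard library. One assumption matters here: quartile conventions differ on small samples, and the figures in this example match the "inclusive" method of `statistics.quantiles`. The default "exclusive" method gives different quartiles on five points. The `table_cell` helper is a hypothetical name.

```python
import statistics

def median_and_iqr(scores):
    """Return (median, IQR) using the inclusive quartile convention."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return median, q3 - q1

def table_cell(scores):
    """Format a cell in the 'median (IQR n)' style described above."""
    median, iqr = median_and_iqr(scores)
    return f"{median:g} (IQR {iqr:g})"

builder_x = [91, 89, 90, 87, 92]
builder_y = [94, 86, 92, 81, 90]

print("X:", table_cell(builder_x))  # X: 90 (IQR 2)
print("Y:", table_cell(builder_y))  # Y: 90 (IQR 6)
```

Whichever quartile convention the v2 table ends up using would need to be stated alongside it, since a reader recomputing the IQR from the run-log with a different convention would get a different number.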

Where the protocol runs out of road

Reproducibility protocols address measurement variance. They do not address artifact variance, which is the larger problem when scoring an AI builder.

The published v2 first-build stability proposal introduced a six-category failure taxonomy (F1 through F6) for first-build outcomes. Two of those categories are directly relevant to deploy-quality reproducibility:

  • F4: deploy failure. The builder produces an artifact that does not deploy successfully on the first attempt; a manually patched deploy is then audited. The deploy-quality score of the patched build is not the same data point as the deploy-quality score of an unpatched build, but a single number does not surface that distinction. The next iteration of the axis will tag any cell where the audited build required a manual patch.
  • F6: vendor outage during the audit window. If the builder's hosting layer or CDN is degraded during the audit, the Lighthouse score reflects the outage rather than the deployed artifact. The protocol's audit-location and cooldown rules cannot help here. The next iteration will flag any run that overlaps a vendor incident (cross-referenced against the vendor's public status page) and exclude it from the median computation, with the exclusion noted in the run-log.

Both fall outside the protocol's measurement-variance scope. They mark the floor below which even a well-designed reproducibility protocol yields data contaminated by the builder's own operational state.
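The F4 tag and the F6 exclusion could be applied to a cell along these lines. This is a sketch under stated assumptions: the `Run` and `Incident` records, their field names, and `summarise_cell` are hypothetical, and the note does not say whether excluded runs are re-run to restore five, so the sketch simply takes the median of whatever remains.

```python
import statistics
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Run:
    score: int
    started: datetime
    finished: datetime

@dataclass(frozen=True)
class Incident:
    started: datetime
    ended: datetime

def overlaps(run, incident):
    """True if the run's time window intersects the incident's window."""
    return run.started < incident.ended and incident.started < run.finished

def summarise_cell(runs, incidents, manually_patched=False):
    """Drop runs that overlap a vendor incident (F6) and carry the patch tag (F4)."""
    kept, excluded = [], []
    for run in runs:
        hit = any(overlaps(run, incident) for incident in incidents)
        (excluded if hit else kept).append(run)
    if not kept:
        raise ValueError("every run overlapped an incident; the cell needs a re-audit")
    return {
        "median": statistics.median(r.score for r in kept),
        "excluded": excluded,                  # to be noted in the run-log
        "manually_patched": manually_patched,  # F4 tag shown next to the cell
    }
```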

What this means for a reader using the June 2026 table

The June 2026 deploy-quality table is, and remains, the published BuilderProof reference for that axis. A reader using it should:

  1. Treat each performance number as a single-run point estimate with a residual variability envelope of roughly plus-or-minus 5 to 10 points, consistent with Google's own variability guidance and the community-reported drift range.
  2. Use the most-common-failure column as the more robust comparative signal; failure modes are categorical and recur across runs.
  3. Treat the accessibility column as more reliable than the performance column, because axe-core is categorical rather than score-based.
  4. Wait for the v2 deploy-quality publication (no firm date as of June 21, 2026) for the median-plus-IQR view if the comparison hinges on a 5-to-10-point gap between two builders.

The single number that the table currently shows is not wrong; it is incompletely reported. The next iteration will report it completely. The single-number view will remain available behind a toggle so that historical comparisons against the June 2026 publication remain possible.
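The reading guidance above amounts to a simple rule of thumb for any two single-run numbers. The sketch below encodes it with the 10-point and 15-point thresholds taken from the cited guidance; that guidance describes consecutive runs on one page, so applying it to a gap between two builders is an approximation, and `classify_gap` is a hypothetical helper.

```python
def classify_gap(score_a, score_b, noise_ceiling=10, real_floor=15):
    """Classify the gap between two single-run scores against the cited drift range."""
    gap = abs(score_a - score_b)
    if gap <= noise_ceiling:
        return "within normal run-to-run variance"
    if gap >= real_floor:
        return "likely a real difference"
    return "ambiguous; wait for median and IQR"

print(classify_gap(91, 94))
print(classify_gap(70, 90))
print(classify_gap(80, 92))
```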

Open questions

  1. Five runs versus eleven. Google's own documentation recommends a minimum of five and notes diminishing returns above that. Some practitioners argue eleven runs is the floor for reliable IQR estimation. The protocol fixes five for the next pass; a contributor argument for eleven is welcome.
  2. Throttling profile. Simulated throttling is more reproducible than DevTools throttling but less faithful to a real client. The protocol picks reproducibility; a counter-argument for fidelity (with a worked example showing where the recommendation would change) would shift the choice.
  3. Per-dimension medians versus a composite. The current table reports three dimensions independently. A composite "deploy quality" number is computable but obscures the dimension that drove the median. The protocol keeps the dimensions independent. A reader who prefers a composite view should construct it themselves from the run-log.
  4. F6 vendor-outage cross-reference. Several builders in the cohort do not publish a status page granular enough to cross-reference a five-minute Lighthouse run against. The protocol falls back to operator-recorded incident notes in those cases. A more robust mechanism is open.

Responses to these open questions are best raised against this lab note rather than against the June 2026 deploy-quality table, which this note does not modify.

FAQ

Q1. Does this change any score in the June 2026 deploy-quality table?
No. The table is unchanged. This note documents how the next iteration will report the same axis, not how the current one is revised.

Q2. Why is a 5-to-10-point Lighthouse difference normal?
Because Lighthouse measures a real deployed page across a noisy chain: network, server, client hardware, browser execution order. Google's own variability documentation lists seven sources of variability and reports that five-run medians are roughly twice as stable as single runs.

Q3. Will the v2 protocol re-rank the cohort?
It may shift small gaps. Where two builders sit within 10 points of each other on the v1 table, the v2 medians may rank them differently or tie them. Where the v1 gap is larger than 15 points, the v2 ranking is unlikely to flip.

Q4. Why publish the protocol before the run?
So that a reader who disagrees with the protocol can contest it before any builder has been scored under it. Publishing the rubric before producing scores is the same discipline applied in the v1 methodology and the v2 first-build stability proposal.

Q5. What about Lighthouse versus PageSpeed Insights?
PageSpeed Insights runs Lighthouse on Google infrastructure, which adds tier-1 network variability the protocol cannot control. The protocol pins audits to Lighthouse running locally on the fixed audit machine.

Q6. Does the protocol apply to axe-core accessibility checks?
Less directly. axe-core failures are categorical (an element either has a missing label or it does not). The single-run-per-builder model stays for the accessibility axe-core pass; only the Lighthouse-score dimensions move to median-of-five.

Q7. When will the v2 deploy-quality table be published?
No firm date as of June 21, 2026. The dependencies are: lock the run-log infrastructure, agree the protocol with any external contributor responses to this lab note, then schedule the seven-builder audit run. A revised date will be added to this page when those land.

References

  1. Lab notebook entries for the four June 2026 axes (output quality, speed, deploy quality, agency suitability). Internal. Deploy quality flagged in three of four post-axis retros as the most-contested measurement; cited verbatim by an external reader on June 19, 2026: "two different audit tools give different Lighthouse scores on the same deployed artifact".
  2. Google Chrome team. Lighthouse Score Variability documentation. Verbatim: "The median Lighthouse score of 5 runs is twice as stable as 1 run." Retrieved June 21, 2026.
  3. Community write-up. Why Lighthouse scores vary between runs. Verbatim: "A score fluctuation of 5 to 10 points between consecutive Lighthouse runs on the same page is completely normal. A fluctuation of 15 points or more" usually indicates a real change. Dated April 23, 2026.
  4. Practitioner reference. DebugBear: How to reduce variance between Lighthouse tests. Useful for the disable-A-B-tests and third-party-isolation parts of the protocol. Dated April 9, 2026.
  5. BuilderProof deploy-quality publication. Deploy-quality benchmark: SEO, accessibility and performance audits (June 2026). The published table whose reproducibility this note documents.
  6. BuilderProof v1 methodology. How we benchmark AI app builders: the BuilderProof methodology v1. The reference methodology the protocol is intended to fit into.
  7. BuilderProof v2 first-build stability proposal. First-build stability: a v2 axis proposal (June 2026). Source of the F4 and F6 failure-mode taxonomy referenced in the protocol's out-of-road section.
