Body Wash, Maritime Shipping, and El Niño: The Absent Layer in Product Experimentation

Product teams adopted the interface of statistical experimentation without importing the methodology that makes it trustworthy.

/// 2026.05.14 · READING TIME: 16 MINUTES
Platform capabilities documented as of Q2 2026.

Design of Experiments · Product Experimentation · Applied Statistics · Methodology

A bottle of body wash, billions in revenue, and the limit of brute force

During the summer of 2018 I worked at Unilever's R&D campus in Trumbull, Connecticut, on a body wash formulation problem. It is still the cleanest example I have for a thing product organizations keep getting wrong: confusing a moved number with a justified attribution.

This is one of those problems you've experienced but probably haven't thought much about. It ties into questions like: why does my facial cleanser produce almost no suds, this hand soap produce a meringue-like lather, and that dish soap bubble up a good few inches? Why does one leave a scent lingering hours longer than the others?

Lather is just a hook. Even though more lather does not actually get you cleaner, consumers correlate sud volume with cleaning ability, so it becomes a meaningful sensory factor in repeat purchase behavior. That half-second sniff at the bottle's neck is another. Even the firmness of the cap when you flip it open contributes something to the equation. When billions of people use a product, there is an enormous amount of revenue stacked on top of small sensory details, and the work of consumer R&D is to optimize each of them under the constraint that the formulation also has to be stable, cost-controlled, and rheologically intact (it must not separate into a sludge before arriving at your local grocery store).

My specific project was fragrance deposition: optimizing how much scent lands on your skin during the few seconds of contact in the shower. It involves formulating the products with cellulose-derived polymers substituted with varying degrees of cationic and hydrophobic components. One complicating detail is that "fragrance" is typically a mixture of chemicals that vary widely in hydrophobicity. Within a single body wash, you can have highly hydrophobic terpenes (limonene, pinene) that partition into the oil phase of a surfactant micelle, and more polar alcohols and esters (citronellol, phenylethyl alcohol) that partition more toward the water phase, with a long tail of intermediate components in between. Getting fragrance molecules to deposit on skin during a few seconds of rinsing relies on a coacervate, a polymer-surfactant-fragrance network that precipitates onto the substrate as rinse water dilutes the formulation. The polymer is the structural mediator. The surfactant decides the micelle environment the fragrance starts in. The fragrance composition decides which components partition where.

If you wanted to test every combination of three surfactant systems against two co-surfactants against six candidate polymers (each carrying different molecular weights, charge densities, and hydrophobic modifications) against seven fragrance components at two levels each, you are looking at well over a thousand formulations. Each one has to be batched at a lab bench, panel-tested, run through a silk-fabric wash that approximates skin, and then quantified by headspace gas chromatography. The budget covered a fraction of that. We needed to know which fraction would give us the most information about the system.
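A quick sanity check on that count, assuming every factor is fully crossed with every other (an assumption made for illustration; the key names are hypothetical labels, not project identifiers):

```python
from math import prod

# Levels per factor as described in the text. Assumes a full crossing;
# key names are illustrative labels only.
levels = {
    "surfactant_system": 3,
    "co_surfactant": 2,
    "polymer": 6,
    **{f"fragrance_component_{i}": 2 for i in range(1, 8)},
}

full_factorial_runs = prod(levels.values())
print(full_factorial_runs)  # 3 * 2 * 6 * 2**7 = 4608
```

Under that reading, the full grid is several thousand bench batches before any replication, which is why choosing the fraction matters more than running faster.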

The methodology that picks the right fraction and tells you when the data is too noisy to confidently adopt a winner is called design of experiments (DoE). It is a discipline that asks, before you spend the budget: which subset of all possible configurations gives you maximum information about the system? And after you spend it: which factors and interactions actually moved the response, with what confidence, and what should we avoid committing to because the data does not support the inference?

The math is mature. Box, Hunter and Hunter published the standard textbook in 1978 (second edition 2005).¹ The screening design we used was published by Plackett and Burman in 1946.² Experimental design is fundamental to how chemistry, pharmaceutical bioprocess optimization, and consumer product R&D have been done for decades.³

It has not landed in product.

Many product teams have adopted traffic allocation and optimization tooling without importing the methodology required to make the resulting inferences trustworthy. The vendor surface often reflects that same gap.

Same situation, different domain

Once you've seen a problem clearly in one domain, it's hard not to see it everywhere, and the test is always whether the method actually transfers or just looks like it should. One caveat before the examples: design-of-experiments thinking is adopted unevenly across domains and scales. That unevenness is the gap I'm describing, not an absolutist claim that everyone ignores it.

Drug discovery

In pharmaceutical research and development,⁴ the same combinatorial pressure appears in formulation, process optimization, analytical method development, and some medicinal-chemistry work. A team optimizing a molecule, assay, or process faces candidate variables like substituent choice, solvent system, process temperature, concentration, pH, or excipient ratio. Exhaustive enumeration is rarely feasible. Formal DoE is routine in pharmaceutical product, process, and method development. While adoption varies by organization size and stage of development, formulation, process, and assay work see more routine use than early medicinal chemistry.⁵ The defensible claim is not that every SAR program is a textbook fractional factorial. It is that mature technical disciplines already know the pressure: too many candidate factors, expensive observations, noisy responses, and a need to separate signal from interaction.

Product experimentation & AI features

A modern AI product surface has many of the same combinatorial properties. The prompt has a system message, a tool list, a temperature, a context window, a retrieval depth, an output format, a critic loop. Each of those has three to ten reasonable settings. The full combinatorial space is operationally unrunnable. The product manager who needs to know which configuration is actually winning faces the same problem as the medicinal chemist facing the receptor and the formulator facing the deposition coacervate. Brute force is off the table.

Planning maritime shipping routes

A fleet operator allocating ships across the North Atlantic during an El Niño/Southern Oscillation half-cycle faces at least eight candidate factors that could drive fuel-adjusted transit time: route choice, departure timing relative to ENSO state, cruising speed, fuel grade, ballast condition, weather-routing service, hull-fouling state, and the captain's discretionary margin. Major operators run sophisticated weather-routing optimization, isochrone methods, dynamic programming on routing graphs, multi-objective evolutionary algorithms, and increasingly Bayesian-network inference over casualty and delay risk.⁶ The DoE-style factor-screening layer that asks "of these eight candidates, which three actually move the response, and which interactions are confounded with which main effects?" is not the dominant analytical practice in fleet routing. That is part of the gap.

Three different domains, three different levels of sophistication, and one underlying problem. The same family of methods can be made useful in each, if the design is properly constructed.

What product-experimentation tools ship today, and what they don't

I evaluated ten product-experimentation platforms in May 2026 against one question: does this platform ship the more nuanced methodological details that turn a multi-factor result into a defensible causal inference?⁷

Three platforms, Optimizely, VWO, and Convert, let the user construct multi-factor multivariate tests that explore combinations of independent elements rather than single-factor A/B/n testing. The applied definition of "multi-factor" appears to differ in each of their shipped offerings.

For the other seven, PostHog, Statsig, Eppo, GrowthBook, Amplitude, LaunchDarkly, and Split/Harness FME, I did not find public documentation related to a multi-factor experiment feature. Their "multivariate" surface is usually a single-factor design with N variants of one element. That is not a factorial design; it is an A/B test with C, D, and more tacked on.⁸

Even where multi-factor tests exist, some of the analytical machinery that lifts a multi-factor result into a defensible inference is missing. This audit reflects public documentation and product surfaces available at the time of review; private enterprise features, roadmap items, or undocumented statistical behavior may differ. The comparison table below documents the per-platform state. The methodology audit follows.

Platform	Multi-factor MVT	Statistical engine	SRM detection	Methodology gaps
Optimizely	Full factorial + manual partial factorial (user-excluded combos) + user-configured Taguchi templates. 64-combination hard cap.⁹	Stats Engine (proprietary always-valid sequential), Fixed-Horizon (t-test), Bayesian.¹⁰	SSRM (sequential Bayesian multinomial; Lindon & Malek). Warning, not hard gate. Scoped to A/B only, not MVT.¹¹	No alias structure. No designed fractional factorial (only manual exclusion). No screening designs.
VWO	Full factorial + Partial Factorial ("optimal arrays," minimum 5 variables). No "Taguchi" naming in current docs.¹²	Feature Experimentation MVT docs expose fixed-horizon and sequential analysis with Bonferroni correction; other VWO testing surfaces describe Bayesian/Frequentist options separately.¹²	Detected and shown. Methodology not specified in public docs.	"Optimal arrays" do not surface alias structure or stated resolution.
Convert	Full factorial only. No partial factorial, no Taguchi, no screening designs documented.¹³	Frequentist (t-test), Bayesian, Sequential (Asymptotic Confidence Sequences, Waudby-Smith 2023).¹⁴	Chi-squared at 99% confidence. Warning, not hard gate. User-enabled.¹⁵	No partial/fractional factorial at all. Single most statistically transparent of the three by documentation.
PostHog	Single-factor only. Up to 9 test variants + 1 control.¹⁶	Welch's t-test, Bayesian.	Data-quality validation is documented; exact SRM threshold was not surfaced in public docs during this audit.	No multi-factor builder. No MCC by default.
Statsig	Single-factor (A/B/n) only.	Frequentist sequential mSPRT (Zhao et al.).¹⁷	Chi-squared, tiered: warning at p < 0.01, elevated alert at p < 0.001. Not a hard gate.¹⁸	No multi-factor builder. No screening designs.
Eppo	Single-factor N-variant only.	Sequential frequentist (default), Fixed frequentist, Sequential hybrid, Bayesian.¹⁹	Pearson chi-squared at α = 0.001. Warning, not hard gate.	Preferential Bonferroni MCC (rare; correct across metrics & variants). No multi-factor builder.
GrowthBook	Single-factor N-variant. Open-source; sequential testing on paid tiers only.²⁰	Bayesian default, Frequentist t-test with CUPED.	Detected and shown via Health tab. Method not specified.	Multi-factor factorial is an open community feature request.²¹
Amplitude	Single-factor. A/B Test + Multi-armed Bandit modes.	mSPRT sequential (explicitly named), Fixed-horizon t-test, Bayesian.²²	Sequential chi-squared at α = 0.01. Warning, not hard gate.²³	Multi-testing correction not surfaced in public docs.
LaunchDarkly	Single-factor N-variant. No variant cap on flags.²⁴	Frequentist Fixed-Horizon + Sequential, Bayesian. CUPED + stratified sampling available.	Bayesian sequential; LaunchDarkly documents a posterior-odds / 99%+ probability threshold. Distinctive among the ten.	MCC available; Bonferroni and Benjamini-Hochberg are documented. No multi-factor builder.
Split / Harness FME	Single-factor N-variant.	mSPRT sequential (explicitly named), Fixed-horizon.²⁵	Chi-squared at p < 0.001.	BH FDR applied across metrics; explicitly not applied across treatment-arm pairwise comparisons.

The structural audit, restated against the table:

1. Alias structure

When you run a fractional factorial, some effects become aliased with others: in a Resolution III design, for example, main effects are confounded with two-factor interactions. The defining relation tells you exactly which effects share an estimate, and the design's resolution summarizes the worst case. I did not find public documentation showing that any of the ten surfaces alias patterns, defining relations, or design resolution to the user. Of the three multi-factor platforms, the two that offer "partial factorial" implement it as either manual combination exclusion (Optimizely) or vendor-selected optimal arrays without alias disclosure (VWO); Convert documents full factorial only. The classical DoE machinery (defining relation, generator words, resolution number, alias chain enumeration) is not part of the public product surface.
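To make the aliasing concrete, here is a minimal sketch for two-level designs. It assumes effects are written as letter strings and that the caller passes every word of the defining relation (for a single generator that is one word; with several generators their products must be included too). The function names are illustrative, not any vendor's API.

```python
from itertools import combinations

def alias_of(effect: str, defining_words: list[str]) -> list[str]:
    """Effects that share an estimate with `effect` in a two-level fraction.

    Multiplying effect words cancels repeated letters (A * A = I), which is
    the symmetric difference of the two letter sets.
    """
    aliases = []
    for word in defining_words:
        product = set(effect) ^ set(word)
        aliases.append("".join(sorted(product)) or "I")
    return aliases

def resolution(defining_words: list[str]) -> int:
    """Resolution is the length of the shortest word in the defining relation."""
    return min(len(word) for word in defining_words)

# 2^(3-1) with I = ABC: Resolution III, main effects aliased with interactions.
print(resolution(["ABC"]), "A =", alias_of("A", ["ABC"])[0])    # 3 A = BC

# 2^(4-1) with I = ABCD: Resolution IV, two-factor interactions alias in pairs.
print(resolution(["ABCD"]), "A =", alias_of("A", ["ABCD"])[0])  # 4 A = BCD
for pair in combinations("ABCD", 2):
    effect = "".join(pair)
    print(effect, "=", alias_of(effect, ["ABCD"])[0])           # AB = CD, AC = BD, ...
```

This is the information a "partial factorial" option hides when it reports a winning combination without saying which effects were estimated together.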

2. Screening designs

Plackett-Burman screens up to eleven factors in twelve runs, or up to nineteen factors in twenty runs, and surfaces the Pareto-dominant three or four. The math has been settled since 1946. I did not find public documentation showing that any of the ten ships a screening-design surface. Optimizely's 64-combination cap is the closest the reviewed market gets to first-class multi-factor experimentation, and even that is full-factorial allocation up to a budget ceiling.
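For reference, the 12-run Plackett-Burman design can be built from its published generator row by cyclic shifting. The sketch below does that in plain Python and checks balance and pairwise orthogonality; the function name is illustrative.

```python
# Standard 12-run Plackett-Burman generator row (+1 = high level, -1 = low level).
GENERATOR_12 = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

def plackett_burman_12() -> list[list[int]]:
    """Eleven cyclic shifts of the generator row plus one all-low run."""
    n = len(GENERATOR_12)
    rows = [GENERATOR_12[-shift:] + GENERATOR_12[:-shift] for shift in range(n)]
    rows.append([-1] * n)
    return rows

design = plackett_burman_12()
columns = list(zip(*design))

# Balance: every factor sits at its high level in exactly half the runs.
balanced = all(sum(col) == 0 for col in columns)

# Orthogonality: every pair of factor columns has a zero dot product.
orthogonal = all(
    sum(a * b for a, b in zip(columns[i], columns[j])) == 0
    for i in range(len(columns))
    for j in range(i + 1, len(columns))
)

print(len(design), len(columns), balanced, orthogonal)  # 12 11 True True
```

Screening fewer than eleven factors just means assigning factors to a subset of the columns and leaving the rest unassigned.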

3. Response-surface methodology

Central composite and Box-Behnken designs fit the local curvature of a response near an interesting point. For the narrow product contexts where factors are continuous and the response is approximately smooth (recommender-weight tuning, ad-bidding parameters, AI hyperparameter sweeps), RSM finds local optima that screening designs and pure factorials cannot. I did not find this surfaced as a user-facing experiment design family in the ten platforms reviewed.
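As an illustration of what a response-surface design looks like in coded units, the sketch below generates the points of a rotatable central composite design. The function name and the choice of four center runs are assumptions for the example; mapping coded levels back to real settings is left out.

```python
from itertools import product

def central_composite(k: int, n_center: int = 1) -> list[tuple[float, ...]]:
    """Coded points of a rotatable central composite design for k continuous factors."""
    alpha = (2 ** k) ** 0.25  # axial distance that makes the design rotatable
    corners = [tuple(float(v) for v in pt) for pt in product((-1, 1), repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(tuple(point))
    center = [tuple([0.0] * k)] * n_center
    return corners + axial + center

points = central_composite(k=3, n_center=4)
print(len(points))  # 8 corners + 6 axial + 4 center = 18
```

The axial and center points are what let a quadratic model see curvature; a two-level factorial on its own can only fit straight lines and interactions.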

4. Mixture designs

Scheffé designs²⁶ apply when settings sum to a fixed budget: route-share allocation across a fleet, user attention across surfaces, capital across strategic initiatives. Resource-allocation problems are common in product strategy. I did not find a reviewed platform modeling experiments this way. Attention and capital are entered as separate factors the platform treats as independent. The mismatch is not cosmetic. It is a data-model problem.
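A minimal sketch of the simplest Scheffé design, the {q, m} simplex lattice, shows the constraint that separates mixtures from independent factors: every run's proportions sum to one. The three-way split in the usage lines is a hypothetical example, and the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

def simplex_lattice(q: int, m: int) -> list[tuple[Fraction, ...]]:
    """All {q, m} simplex-lattice blends: q components in steps of 1/m, summing to 1."""
    return [
        tuple(Fraction(count, m) for count in counts)
        for counts in product(range(m + 1), repeat=q)
        if sum(counts) == m
    ]

# Hypothetical: split a fixed attention budget across three surfaces in halves.
blends = simplex_lattice(q=3, m=2)
print(len(blends))  # 6
for blend in blends:
    print([str(share) for share in blend])
```

Raising one share necessarily lowers another, so the components cannot be varied independently and cannot be entered as ordinary orthogonal factors.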

5. Validity gates

A validity gate is a formal check at ingest that refuses to publish an attribution when realized telemetry violates the design's assumptions. Sample-ratio mismatch²⁷ is detected on most platforms in some form. In the public surfaces I reviewed, SRM is a diagnostic, not a hard attribution gate. Where a platform detects sample-ratio mismatch but does not gate interpretation behind it, the UI can still leave the user with a decision-ready result despite compromised telemetry.

The important point is not that product teams should suddenly become statisticians. It is that modern experimentation systems routinely produce attribution while omitting much of the methodological machinery that older technical disciplines treat as foundational.

Why the gap persists

Some vendor education content explains DoE-adjacent concepts. Operational adoption does not follow. Four structural forces hold the gap open:

The skill-gap incentive. Most product managers have not been trained on design resolution, alias structure, or screening. Surfacing "Resolution III" in the UI without the surrounding discipline produces confusion, not insight. Vendors who try this without solving the teaching problem at the same time will lose user-research battles inside their own product orgs. The three platforms that already ship factorial MVT have chosen, plausibly correctly, not to expose the analytical machinery.

The always-give-a-number incentive. The hard-gate on SRM is methodologically correct and commercially awkward. A team running thirty experiments a quarter does not want one in five blocked at result time on a telemetry diagnostic. The commercial incentive points toward publishing a number.

The data-model mismatch. The platform's input is a feature flag with N variants. The mathematics of factorial design is independent factors at orthogonal levels. The two are not the same. To ship factorial as a first-class object, the platform has to change how experiments are entered, not just how results are computed. That is an input-layer migration, not an analytics feature.

The factor-definition gap. Many PMs are not familiar with experimental design, and the discipline of enumerating one's candidate variables as orthogonal, independent factors is itself a skill. Without it, factorial designs are not useful because the factors are not actually independent. A button color is a factor; a "new onboarding flow" is a bundle of factors masquerading as one. Most product teams do not yet have the language to disambiguate.
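The data-model mismatch is easiest to see side by side. In the sketch below, the variant keys, factor names, and levels are all hypothetical; the point is only that a flat variant list carries no factor structure unless the mapping is stored with it.

```python
from itertools import product

# What a flag-based platform typically stores: opaque variant keys.
flag_variants = ["control", "v1", "v2", "v3"]

# What a factorial design needs: named factors, each with its own levels.
# Names and levels here are hypothetical.
factors = {
    "button_color": ["blue", "green"],
    "headline": ["short", "long"],
}

# Each cell is one combination of levels across all factors.
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# The flat list can represent the same four cells only if this mapping is kept.
variant_to_cell = dict(zip(flag_variants, cells))

def main_effect_groups(factor: str) -> dict[str, list[str]]:
    """Group variant keys by the level of one factor, averaging over the others."""
    groups: dict[str, list[str]] = {}
    for variant, cell in variant_to_cell.items():
        groups.setdefault(cell[factor], []).append(variant)
    return groups

print(main_effect_groups("button_color"))  # {'blue': ['control', 'v1'], 'green': ['v2', 'v3']}
print(main_effect_groups("headline"))      # {'short': ['control', 'v2'], 'long': ['v1', 'v3']}
```

Without `variant_to_cell`, an analysis can only compare four unrelated arms; with it, each factor is contrasted using all of the traffic.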

The implication is that adding a factorial constructor on top of an existing experimentation platform will not close the gap. A new layer is required.

We don't need a new platform

So the question is: how do you preserve methodological rigor when the user doesn't know the methodology vocabulary? The answer that survives all four forces is an LLM-plus-deterministic-engine architecture, where the LLM is the translator and the engine is the math.

The LLM is not decoration. It lets a PM describe what they want to test in natural language and get back a designed experiment with proper alias structure. The engine is the mathematics; the LLM is the doorway to it.

A product manager prompts: "test whether changing recommendation weight, retrieval depth, and critic-loop count, in combination with two reranker variants, improves my AI-powered email personalization feature's retention." The LLM emits a typed specification of what experiment to run.

{
  "family": "fractional_factorial",
  "factors": ["recommendation_weight", "retrieval_depth",
              "critic_loop_count", "reranker_variant"],
  "levels": [2, 2, 2, 2],
  "runs": 8,
  "resolution": "IV"
}

The deterministic engine takes that specification and does the work: constructs the design matrix, surfaces the alias structure, validates the telemetry as it returns, refuses to attribute when the realized data violates the design assumptions, and returns an attribution with confidence intervals. Design construction has hard mathematical properties (orthogonality of the design matrix, balance across cells, alias structure, resolution, D-efficiency) that you do not want a language model constructing token by token. The LLM should call the function, not be the function.
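What the engine's first step might look like for the specification above, as a sketch and not a reference implementation: `build_half_fraction` and `is_balanced_and_orthogonal` are hypothetical names, the code handles only two-level half fractions, and it assumes the conventional generator in which the last factor equals the product of the others.

```python
from itertools import product

spec = {
    "family": "fractional_factorial",
    "factors": ["recommendation_weight", "retrieval_depth",
                "critic_loop_count", "reranker_variant"],
    "levels": [2, 2, 2, 2],
    "runs": 8,
    "resolution": "IV",
}

def build_half_fraction(spec: dict) -> list[dict[str, int]]:
    """Full factorial in the first k-1 factors; last factor = their product."""
    names = spec["factors"]
    k = len(names)
    if spec["family"] != "fractional_factorial" or set(spec["levels"]) != {2}:
        raise ValueError("sketch only handles two-level fractional factorials")
    if spec["runs"] != 2 ** (k - 1):
        raise ValueError("sketch only handles half fractions")
    design = []
    for base in product((-1, 1), repeat=k - 1):
        last = 1
        for level in base:
            last *= level  # generator: for k = 4 this is D = ABC, so I = ABCD
        design.append(dict(zip(names, base + (last,))))
    return design

def is_balanced_and_orthogonal(design: list[dict[str, int]]) -> bool:
    """Check the two properties the text says should not be left to token sampling."""
    names = list(design[0])
    balanced = all(sum(run[n] for run in design) == 0 for n in names)
    orthogonal = all(
        sum(run[a] * run[b] for run in design) == 0
        for i, a in enumerate(names) for b in names[i + 1:]
    )
    return balanced and orthogonal

design = build_half_fraction(spec)
print(len(design), is_balanced_and_orthogonal(design))  # 8 True
```

With four factors the defining word has length four, which is what makes the eight-run design Resolution IV as the specification states.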

Qualitative inputs go through the same pipeline.

If a maritime operator's captain weather logs, port inspection narratives, and insurance rate tables feed into the same routing decision, the LLM reads them and emits a typed observation:

{
    "event_type": "rogue_wave_probability_elevated",
    "spatial_extent": {"lat_range": [50, 60], "lon_range": [-30, -10]},
    "temporal_window": {"start": "2026-05-14", "duration_days": 21},
    "confidence": 0.78,
    "basis": ["captain_log:vessel_imo_9123456:2026-05-14",
              "casualty_narrative:case_2026_0521"]
}

The engine consumes the observation, elevates the named factor as a candidate for the next screening design, and proceeds. The model never constructs the experiment. The model never runs the attribution. It paraphrases the engine's output back to the user at the end.
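One way to hold that boundary is to parse the model's output into a strict type before the engine sees it. The sketch below assumes the field names shown above; the `Observation` class, `parse_observation`, and the specific validation rules are illustrative choices.

```python
import json
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Observation:
    event_type: str
    lat_range: tuple[float, float]
    lon_range: tuple[float, float]
    start: date
    duration_days: int
    confidence: float
    basis: tuple[str, ...]

def parse_observation(raw: str) -> Observation:
    """Parse an LLM-emitted observation and reject it if any field is malformed."""
    data = json.loads(raw)
    obs = Observation(
        event_type=str(data["event_type"]),
        lat_range=tuple(data["spatial_extent"]["lat_range"]),
        lon_range=tuple(data["spatial_extent"]["lon_range"]),
        start=date.fromisoformat(data["temporal_window"]["start"]),
        duration_days=int(data["temporal_window"]["duration_days"]),
        confidence=float(data["confidence"]),
        basis=tuple(data["basis"]),
    )
    if not 0.0 <= obs.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if obs.lat_range[0] > obs.lat_range[1] or obs.lon_range[0] > obs.lon_range[1]:
        raise ValueError("ranges must be ordered low to high")
    if obs.duration_days <= 0 or not obs.basis:
        raise ValueError("need a positive window and at least one cited source")
    return obs

raw = json.dumps({
    "event_type": "rogue_wave_probability_elevated",
    "spatial_extent": {"lat_range": [50, 60], "lon_range": [-30, -10]},
    "temporal_window": {"start": "2026-05-14", "duration_days": 21},
    "confidence": 0.78,
    "basis": ["captain_log:vessel_imo_9123456:2026-05-14",
              "casualty_narrative:case_2026_0521"],
})
obs = parse_observation(raw)
print(obs.event_type, obs.start, obs.confidence)
```

A malformed or overconfident observation fails loudly here, so it never becomes a candidate factor by accident.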

The validity gate, in plain terms.

Four checks fire the moment realized telemetry comes back from production:

Did the realized allocation match the design? If you told the platform to put 12.5% of traffic into each of eight cells of a 2³ factorial and one came back with 4% because of a routing bug, the chi-squared test fires.
Do comparable cells balance against each other? If two cells should look statistically identical on baseline covariates but they do not, the engine flags the imbalance.
Did each cell get enough data to support an inference? Below a floor set by the experimenter, the engine emits a diagnostic, not a number.
Did underlying conditions hold steady across the run? Did the ENSO state weaken partway through the run? Did engineering push a new model behind the personalization API? A Kolmogorov-Smirnov test can detect distributional shift; the engine then splits the analysis at the shift instead of averaging across it.

On failure, the engine does not publish an attribution. It returns a diagnostic with the affected cells and an actionable recommendation. This is the structural inversion of how the current platform stack handles SRM. Where platforms detect sample-ratio mismatch but do not gate interpretation behind it, the user can still walk away with a result that looks more decision-ready than the telemetry justifies.
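A sketch of the first and third checks as a hard gate, under stated assumptions: the function names, the α = 0.001 threshold, and the 100-observation floor are illustrative, and the cell counts are made up. The chi-squared tail probability uses the closed form for integer degrees of freedom so the example needs only the standard library.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-squared variate with integer df."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # normal density at chi
    if df % 2:  # odd degrees of freedom
        total, term = 0.0, chi
        for r in range(1, (df - 1) // 2 + 1):
            total += term
            term *= x / (2 * r + 1)
        return math.erfc(chi / math.sqrt(2)) + 2 * pdf * total
    total, term = 1.0, 1.0  # even degrees of freedom
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        total += term
    return math.sqrt(2 * math.pi) * pdf * total

def allocation_gate(observed: list[int], planned_share: list[float],
                    alpha: float = 0.001, min_per_cell: int = 100) -> dict:
    """Return a diagnostic, not a result, when allocation or sample size fails."""
    n = sum(observed)
    expected = [share * n for share in planned_share]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = chi2_sf(stat, df=len(observed) - 1)
    if p_value < alpha:
        worst = max(range(len(observed)),
                    key=lambda i: abs(observed[i] - expected[i]) / expected[i])
        return {"status": "blocked", "reason": "sample_ratio_mismatch",
                "cells": [worst], "p_value": p_value}
    thin = [i for i, o in enumerate(observed) if o < min_per_cell]
    if thin:
        return {"status": "blocked", "reason": "insufficient_sample", "cells": thin}
    return {"status": "ok", "p_value": p_value}

planned = [0.125] * 8  # eight cells, 12.5% each
healthy = [1000, 1012, 987, 1003, 995, 1008, 991, 1004]
broken = [320, 1097, 1097, 1097, 1097, 1097, 1097, 1098]  # cell 0 received 4%

print(allocation_gate(healthy, planned)["status"])  # ok
print(allocation_gate(broken, planned))             # blocked, cells == [0]
```

The design choice that matters is the return type: a blocked run yields a reason and the affected cells, and there is no code path that returns an effect estimate alongside a failed check.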

Who would adopt this, and how it composes with what they already run.

The honest audience for a methodology layer is not the broad product-manager market. It is the segment of product organizations that has outgrown warning-and-continue experimentation: technical PMs at AI-native companies who already feel the inference problem; ML and applied-science teams that bridge to JMP, R, or custom notebooks; internal experimentation teams at companies already building proprietary platforms because the public vendor surface does not fit their work.

None of those audiences is a mass market. None requires one. The methodology layer composes on top of the existing experimentation infrastructure rather than replacing it. The architecture reads exposure events and metric data from whichever platform the org already runs (Optimizely, Statsig, Eppo, an internal pipeline), constructs the designed experiment over the cells the org executes, and writes the validated attribution back. The platform stays the system of record for traffic allocation and telemetry. The methodology layer is the inference tier above it.

A worked example: routing the Atlantic across the 2023–2024 El Niño

The 2023–2024 ENSO cycle entered El Niño conditions in mid-2023, peaked in November–December 2023 at a Niño-3.4 anomaly of approximately 2.0°C (a strong event), and decayed to neutral by April–June 2024.²⁸ For domain scale, 2023 also recorded an industry-wide low of 221 containers lost at sea, the lowest count since the World Shipping Council survey began in 2008, out of approximately 250 million containers transported, with around 33% recovery on the lost units.²⁹ The container-loss figure is offered here as scale context, not as an ENSO-attributable outcome: the WSC report attributes the low to improving securing and lashing practices, not weather phase. The maritime example below uses the ENSO timing as a known nonstationarity to demonstrate the methodology, not to explain the container statistic.

The example below is hypothetical. The factor list, numbers, follow-up model, and validity-gate firing are constructed to be plausible against the named domain and to illustrate the methodology's mechanics; they are not measured from, or findings about, a real fleet operator. The point is the architecture: screening, mixed follow-up modeling, validity gate.

Setup. Imagine a fleet operator allocating 192 transatlantic voyages across a 12-cell screen in Q4 2023 (Boston/NYC to Antwerp/Hamburg/Rotterdam) during peak El Niño. Eight candidate factors are nominated:

F1 — Route: Great Circle vs. Rhumb Line
F2 — Departure-day phase: bi-weekly cycle position 1 vs. position 2
F3 — Cruising speed: Eco vs. Standard
F4 — Fuel grade: VLSFO vs. LSMGO
F5 — Ballast condition: Light vs. Full
F6 — Weather-routing service: Vendor A vs. Vendor B
F7 — Hull-fouling state: 0–6 mo since drydock vs. 6–12 mo
F8 — Captain's discretionary margin: tight (±10 nm) vs. wide (±60 nm)

Step 1 — Plackett-Burman screen, 12 runs over 8 factors. The screen runs 12 voyage configurations as orthogonal contrasts. Response: fuel-adjusted transit time in hours. After 12 cells × 16 voyages each = 192 voyages logged, a hypothetical main-effect ranking looks like this:

Factor                       Effect (hrs)   SE     Pareto rank
F1 (Route)                       -8.4       1.2        1
F6 (Weather-routing service)     -6.1       1.3        2
F2 (Departure phase)             +4.7       1.4        3
F3 (Cruising speed)              -1.8       1.5        4
F7 (Hull fouling)                +1.4       1.4        5
F4, F5, F8                       < 1.0      1.4    below noise floor

Three factors (F1, F2, F6) dominate. Five fall below the screen's resolvable threshold and are dropped from the follow-up. Without methodology, the dashboard can still reward whichever bundle looks best in aggregate. The screen does the less theatrical thing: it asks which factors the data can actually support.
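The main-effect estimates in a screen like this are simple orthogonal contrasts. The sketch below rebuilds the 12-run design, assigns the eight factors to its first eight columns, and recovers a set of synthetic effects from noise-free cell means. The effect sizes echo the hypothetical table (with placeholder values under 1.0 for F4, F5, and F8) and the 300-hour baseline is arbitrary; none of it is measured data.

```python
GENERATOR_12 = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]
design = [GENERATOR_12[-s:] + GENERATOR_12[:-s] for s in range(11)] + [[-1] * 11]

# Synthetic effects in hours (high level minus low level). Illustrative only.
true_effects = {"F1": -8.4, "F2": 4.7, "F3": -1.8, "F4": 0.5,
                "F5": -0.3, "F6": -6.1, "F7": 1.4, "F8": 0.2}
factors = list(true_effects)  # factor j is assigned to design column j
baseline_hours = 300.0        # arbitrary intercept

# Noise-free cell means: each factor shifts the response by half its effect
# per coded unit, so high minus low equals the full effect.
cell_means = [
    baseline_hours + sum(true_effects[f] / 2 * row[j] for j, f in enumerate(factors))
    for row in design
]

def main_effect(j: int) -> float:
    """Mean response at the high level of column j minus mean at the low level."""
    high = [y for row, y in zip(design, cell_means) if row[j] == +1]
    low = [y for row, y in zip(design, cell_means) if row[j] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

estimates = {f: round(main_effect(j), 6) for j, f in enumerate(factors)}
ranking = sorted(estimates, key=lambda f: abs(estimates[f]), reverse=True)
print(ranking[:3])  # ['F1', 'F6', 'F2']
```

Because the columns are balanced and mutually orthogonal, each contrast isolates its own factor even though all eight change from run to run. With real, noisy data the three unassigned columns are commonly used as a rough gauge of the noise floor.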

Step 2 — Mixed follow-up model on the three surviving factors. Because route and routing-service choice are categorical while departure timing is ordered and time-varying, the follow-up is not a textbook central composite design. A defensible follow-up would be a 2×2 factorial on F1 × F6 with F2 modeled as a smoothed phase/timing term. The point is not that the model discovers a real Atlantic routing rule. It is that the method can test whether the best departure timing stays put or drifts as ENSO declines from its November–December peak.

Step 3 — Validity gate fires. Suppose one follow-up cell contains four voyages after the ENSO state begins to fade from its November–December 2023 peak. A Kolmogorov-Smirnov test on transit-time distribution within that cell rejects stationarity at p < 0.01. The engine refuses to publish a single attribution over that cell and splits the analysis into segments before and after the fade.
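With only four voyages in the later segment, a large-sample p-value would be shaky, so this sketch pairs the Kolmogorov-Smirnov statistic with an exact permutation p-value. That pairing is an assumption of the sketch, not a claim about how any engine computes it, and the transit times are invented for illustration.

```python
from itertools import combinations

def ks_statistic(a: list[float], b: list[float]) -> float:
    """Largest gap between the two empirical distribution functions."""
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_permutation_p(before: list[float], after: list[float]) -> float:
    """Share of all relabelings whose KS gap is at least the observed one."""
    pooled = before + after
    observed = ks_statistic(before, after)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(after)):
        chosen = set(idx)
        grp_after = [pooled[i] for i in chosen]
        grp_before = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        hits += ks_statistic(grp_before, grp_after) >= observed - 1e-12
        total += 1
    return hits / total

# Hypothetical fuel-adjusted transit times (hours) within one follow-up cell.
before_fade = [291.0, 293.5, 288.2, 295.1, 290.4, 292.8,
               289.9, 294.0, 291.7, 293.1, 290.8, 292.2]
after_fade = [301.3, 303.8, 299.6, 305.0]

p_value = ks_permutation_p(before_fade, after_fade)
print(round(p_value, 4))  # 2 of 1820 relabelings, about 0.0011

if p_value < 0.01:
    # Gate fires: analyze the segments separately instead of pooling them.
    segments = {"before_fade": before_fade, "after_fade": after_fade}
    print("split analysis into", list(segments))
```

Four fully separated voyages out of sixteen are just enough to clear p < 0.01 here; with three the exact test could not reach that threshold, which is one reason the per-cell sample floor belongs in the same gate.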

Step 4 — Recommendation delta.

Without the methodology layer: aggregated Q4 transit times show weather-routing service B with a 7.2% mean speed advantage over A. The operator is tempted to commit Q1 budget to service B.

With the methodology layer: the split analysis shows B's advantage held mainly during the strong El Niño and mostly disappeared once it faded. The same data that produced "B wins, commit Q1" without the layer produces "B only won while the El Niño was strong, and that window has closed" with it. The Q1 commit does not happen against the wrong assumption.

I do not claim this stack would have avoided the 221 containers lost at sea in 2023. I claim something narrower and more useful: it would make it harder to commit a quarter's routing budget to an advantage that only held while the El Niño was strong, a window that had already closed.

Conclusion

If you build product at an AI-native company, the gap I described is the one you are already feeling but possibly not naming. The post-training-run noise in your eval set looks like the noise in a body-wash panel; the candidate factors driving your AI feature look like the polymers I parameterized in 2018; the El Niño phase looks like the model migration that lands mid-experiment. The math is the same. The methodology is the same. The vendor stack does not ship it.

The methodology layer is the part that determines whether the inference is trustworthy. The body-wash deposition problem, the drug-discovery problem, the AI-feature problem, and the maritime-routing problem are all versions of the same structure: a combinatorial space too large to brute-force, a response that depends on factors and interactions, limited experimental budget, and data conditions that may not remain stable during the run.

The experimentation stack adopted optimization before it adopted methodology. The methodology already exists. Much of product never imported it.

References

1. Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building (1st ed.). Wiley-Interscience. First edition ISBN 0-471-09315-7; second edition 2005 ISBN 0-471-71813-0. Google Books first-edition record; Wiley second-edition record.

2. Plackett, R. L., & Burman, J. P. (1946). "The design of optimum multifactorial experiments." Biometrika, 33(4), 305–325. DOI: 10.1093/biomet/33.4.305. Plackett-Burman design overview.

3. Politis, S. N., Colombo, P., Colombo, G., & Rekkas, D. M. (2017). "Design of experiments (DoE) in pharmaceutical development." Drug Development and Industrial Pharmacy, 43(6), 889–901. DOI: 10.1080/03639045.2017.1291672. PMID 28166428. Surveys DoE adoption across pharmaceutical product, process, and method development. EBSCO citation record.

4. Politis, S. N., Colombo, P., Colombo, G., & Rekkas, D. M. (2017). "Design of experiments (DoE) in pharmaceutical development." Drug Development and Industrial Pharmacy, 43(6), 889–901. DOI: 10.1080/03639045.2017.1291672. PMID 28166428. EBSCO citation record. Accessed 2026-05-14.

5. Anandamurthy, M. "Design of experiments for pharma, biotech, and medical electronics." JMP. Documents DoE examples across drug formulation, pharmaceutical process optimization, analytical method development, bioprocessing, and medical-device contexts. JMP. Accessed 2026-05-14.

6. Kytariolou, A., & Themelis, N. (2022). "Ship routing optimisation based on forecasted weather data and considering safety criteria." Journal of Navigation, 75(6), 1310–1331. DOI: 10.1017/S0373463322000613. VISIR-2: ship weather routing in Python (Geosci. Model Dev., 2024). Chen, Y., Zhang, C., Guo, Y., Wang, Y., Lang, X., Zhang, M., & Mao, W. (2025). "State-of-the-art optimization algorithms in weather routing — ship decision support systems: challenge, taxonomy, and review." Ocean Engineering, 331, 121198. DOI: 10.1016/j.oceaneng.2025.121198. These sources show the dominant fleet-routing analytical practice is path optimization (isochrones, dynamic programming, A*, NSGA-III) supplemented by Bayesian-network inference for casualty and delay risk — not factor-screening designs. Cambridge JoN; VISIR-2; state-of-the-art review. Accessed 2026-05-14.

7. Per-platform audit dossier. All vendor capabilities documented as of Q2 2026 from public help-center, vendor-blog, and statistical-method documentation pages.

8. Public documentation reviewed for PostHog, Statsig, Eppo, GrowthBook, Amplitude, LaunchDarkly, and Split/Harness FME presents multivariate experimentation as N variants of a single feature flag or treatment. I did not find public documentation showing a multi-factor experiment-builder with cross-cell orthogonal allocation.

9. Optimizely Support, "Multivariate tests for Optimizely Web Experimentation." Quote: "The number of possible combinations in an MVT is capped at 64." Optimizely offers full factorial, partial factorial (user-controlled combination exclusion), and user-configured Taguchi templates as configuration options. support.optimizely.com. Accessed 2026-05-13.

10. Optimizely Support, "Statistical analysis methods overview." Three options: proprietary Stats Engine (always-valid sequential), Fixed-Horizon t-test, and Bayesian. support.optimizely.com. Accessed 2026-05-13.

11. Optimizely Support, "Optimizely's automatic sample ratio mismatch detection." Uses SSRM (sequential Bayesian multinomial; Lindon & Malek). Explicitly scoped: "Optimizely Experimentation's automatic SRM detection is only for stats engine A/B experiments." support.optimizely.com. Accessed 2026-05-13.

12. VWO Help, "Set Up Multivariate Rules in Feature Experimentation." Quote: "Partial Factorial: Tests a subset of variable combinations using optimal arrays." The same page describes fixed-horizon and sequential analysis and Bonferroni correction for this Feature Experimentation surface; VWO's broader testing documentation describes Bayesian and Frequentist analysis modes elsewhere. Feature Experimentation MVT; SmartStats; Bonferroni. Accessed 2026-05-13.

13. Convert Support, "Creating a multivariate experiment." Full factorial only; combination formula (variations on element A × variations on element B = total). No fractional, Plackett-Burman, or Taguchi support documented. support.convert.com. Accessed 2026-05-13.

14. Convert Support, "Statistical methods used." Three options: Frequentist t-test, Bayesian (uninformative priors), and Sequential (Asymptotic Confidence Sequences). The sequential method is built on Waudby-Smith, I. & Ramdas, A., "Estimating means of bounded random variables by betting," Journal of the Royal Statistical Society: Series B, 86(1), 2023, 1–27. support.convert.com. Accessed 2026-05-13.

15. Convert Support, "What is SRM?" Chi-squared goodness of fit, 99% confidence threshold; user-enabled per project. support.convert.com. Accessed 2026-05-13.

16. PostHog Docs, "Creating an experiment" and statistics pages. Up to 9 test variants plus 1 control; current docs describe Bayesian and frequentist engines, Welch's t-test, and data-quality validation. Source pages: creating experiments; Bayesian statistics; frequentist statistics. Accessed 2026-05-13.

17. Statsig Docs, "Sequential Testing." Frequentist mSPRT, citing Zhao et al. docs.statsig.com. Accessed 2026-05-13.

18. Statsig Docs, "SRM Checks." Chi-squared, tiered: yellow warning surfaced in Diagnostics card at p < 0.01; elevated red alert at p < 0.001 with ≥ 0.1% absolute deviation. Neither tier is a hard gate. docs.statsig.com. Accessed 2026-05-13.

19. Eppo Docs, "Analysis methods," "Sample Ratio Mismatch," and "Multiple testing correction." Four engines: Fixed-sample frequentist, Sequential frequentist (default), Sequential hybrid, Bayesian. SRM uses Pearson chi-squared at α = 0.001; preferential Bonferroni correction applies across both metrics and variants. Source pages: analysis methods; SRM; multiple testing. Accessed 2026-05-13.

20. GrowthBook Docs, "Sequential testing." Asymptotic Confidence Sequences; built on Waudby-Smith & Ramdas (2023), Journal of the Royal Statistical Society: Series B, 86(1), 1–27. Sequential testing is gated to enterprise/paid plans. docs.growthbook.io. Accessed 2026-05-13.

21. GrowthBook documentation and SDK examples describe experiments as variation arrays and feature-flag values; I did not find public documentation showing native factorial, fractional-factorial, Plackett-Burman, or Taguchi experiment construction in GrowthBook's product surface. Source page: GrowthBook JavaScript SDK docs. Accessed 2026-05-14.

22. Amplitude Docs, "Experiment Sequential Testing." Explicitly names mSPRT; results "always valid." amplitude.com/docs. Accessed 2026-05-13.

23. Amplitude Docs, "Sample Ratio Mismatch." Sequential chi-squared at α = 0.01. amplitude.com/docs. Accessed 2026-05-13.

24. LaunchDarkly Docs, "Variations," "Analyzing experiments," "Sample ratio mismatch," and "Multiple comparisons correction." Quote: "There is no limit to the number of variations you can add to a multivariate flag." Statistical analysis methods include Fixed-Horizon frequentist t-test, Sequential frequentist, and Bayesian; SRM documentation uses a 99%+ probability / posterior-odds threshold; MCC docs cover Bonferroni and Benjamini-Hochberg. Source pages: variations; analyze docs; sample ratios; MCC. Accessed 2026-05-13.

25. Harness FME Docs (formerly Split.io), "Statistical approach," "Fixed horizon," and "Multiple Comparison Correction." mSPRT is explicitly named; fixed-horizon testing is documented; MCC applies Benjamini-Hochberg across metrics and explicitly does not adjust pairwise treatment comparisons. Source pages: statistical approach; fixed horizon; MCC. Accessed 2026-05-13.

26. Scheffé, H. (1958). "Experiments with mixtures." Journal of the Royal Statistical Society: Series B, 20(2), 344–360. DOI: 10.1111/j.2517-6161.1958.tb00299.x.

27. Microsoft Research, "Diagnosing sample-ratio mismatch in online controlled experiments." microsoft.com/research. Accessed 2026-05-13.

28. NOAA Climate.gov, "April 2024 ENSO update: gone fishing." Niño-3.4 peaked at approximately 2.0 °C in November–December 2023; decay to neutral by April–June 2024. climate.gov. NOAA CPC, monthly ENSO Diagnostic Discussions 2023-06 through 2024-06. Accessed 2026-05-13.

29. World Shipping Council, "Containers Lost at Sea Report — 2024 Update," covering calendar year 2023. 221 containers lost out of approximately 250 million transported; approximately 33% recovery rate; lowest count since the survey began in 2008 (prior low: 661 containers in 2022). worldshipping.org. Accessed 2026-05-13.