Why This Is Asked
Experiment design is the core competency of a product data analyst. This question tests statistical rigor, business judgment, and whether you know what can go wrong in practice.
Key Concepts to Cover
- Randomization unit — user-level vs session-level vs device-level
- Primary metric — checkout conversion rate (completed purchases / checkout initiations)
- Sample size calculation — minimum detectable effect, statistical power, significance level
- Novelty effect — users behave differently when they see something new, regardless of quality
- Interference effects — shared inventory, referral loops that violate independence assumption
How to Approach This
1. Define the Randomization Unit
Randomize at the user level, not the session level. A user seeing both variants in different sessions contaminates the experiment. For logged-out users, randomize by a persistent cookie.
2. Define the Primary Metric
Checkout conversion rate = completed purchases / checkout initiations. This is the direct measure of the change's effect.
Add guardrails:
- Average order value (the new flow should not reduce basket size)
- Support contact rate (a confusing flow may increase help requests)
- Load time p95 (new UI must not regress performance)
3. Calculate Sample Size
For a 5% relative lift on a 3% baseline conversion rate, 80% power, α=0.05:
- Minimum detectable effect (absolute): 3% × 5% = 0.15 percentage points
- Required sample per variant: ~180,000 users (use a sample size calculator)
- At 10,000 checkouts/day, that's ~36 days per variant = ~72 days total
- If this is too long, consider raising α to 0.1 or targeting a larger effect
4. Decide on Duration
Run for at least 2 full weeks to capture weekly seasonality. Don't stop early just because results look significant — peeking inflates false positive rate.
Common Follow-ups
-
"How do you handle users who abandon checkout in the middle and come back later?" Count them in the denominator at checkout initiation. Their return is already captured in your randomization since they're in the same variant.
-
"The experiment ran for 2 weeks and p-value is 0.04. Should you ship?" Check: Did you pre-register the hypothesis? Did you look multiple times during the run? Is the effect size practically meaningful? 0.04 is borderline — look for consistent results across segments before shipping.