Sequential Testing & Early Stopping

Sequential testing lets you continuously monitor experiment results and stop as soon as you have enough evidence — without inflating your false positive rate. Unlike fixed-horizon A/B tests that require a predetermined sample size, sequential methods maintain valid error rates even when you peek at results repeatedly.


When to Use Sequential Testing

| Situation | Use Sequential Testing? |
|---|---|
| You need results faster than a fixed-horizon test allows | ✅ Yes |
| Traffic is unpredictable and sample size targets are hard to set | ✅ Yes |
| You want to stop early if a treatment is clearly harmful | ✅ Yes |
| You have a strict predetermined sample size and won't peek | ❌ Use standard A/B |
| You need exact p-values for regulatory reporting | ❌ Use standard A/B |

Methods

mSPRT (mixture Sequential Probability Ratio Test)

The default method. Computes a likelihood ratio (lambda_ratio) that accumulates evidence for or against an effect:

  • lambda_ratio > 1/alpha → strong evidence for an effect, safe to stop
  • lambda_ratio < alpha → strong evidence for the null, safe to stop for futility
  • Otherwise → continue collecting data

The always_valid_p_value can be interpreted like a standard p-value at any point without inflating Type I error.
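For intuition, here is a minimal sketch of the mSPRT statistic for approximately normal data with a N(theta0, tau²) mixing distribution over the alternative. The function names, defaults, and parameters are illustrative, not this API's internals:

```python
import math

def msprt_lambda(n, xbar, sigma=1.0, tau=0.1, theta0=0.0):
    """Mixture likelihood ratio for H0: mean = theta0, mixing the
    alternative over a N(theta0, tau^2) prior (normal approximation)."""
    v = sigma**2 + n * tau**2
    exponent = (n**2 * tau**2 * (xbar - theta0)**2) / (2 * sigma**2 * v)
    return math.sqrt(sigma**2 / v) * math.exp(exponent)

def always_valid_p(lambdas):
    """Always-valid p-value: reciprocal of the running max of lambda,
    capped at 1, so it can only decrease over time."""
    p, best = 1.0, 0.0
    for lam in lambdas:
        best = max(best, lam)
        p = min(p, 1.0 / best)
    return p
```

With alpha = 0.05, stopping as soon as the ratio exceeds 1/alpha = 20 caps the false positive rate at 5% no matter how often you check.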

Always-Valid Confidence Intervals (Confidence Sequences)

A confidence sequence is a confidence interval that is valid at every sample size simultaneously. The interval shrinks as more data is collected. Use this when you want a continuous view of the effect size range, not just a stop/continue decision.
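Inverting the same normal-mixture ratio (keeping every theta where lambda(theta) stays below 1/alpha) gives a closed-form half-width. This is a sketch under the normal-approximation assumptions above, with illustrative sigma and tau values:

```python
import math

def cs_halfwidth(n, sigma=1.0, tau=1.0, alpha=0.05):
    """Half-width of a normal-mixture confidence sequence at sample size n,
    from inverting lambda(theta) < 1/alpha. Interval: xbar +/- halfwidth."""
    v = sigma**2 + n * tau**2
    log_term = math.log(math.sqrt(v / sigma**2) / alpha)
    return math.sqrt(2 * sigma**2 * v * log_term / (n**2 * tau**2))
```

The interval is simultaneously valid at every n, and its width shrinks roughly like sqrt(log n / n), slightly wider than a fixed-horizon interval at the same sample size.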

Alpha Spending (O'Brien-Fleming / Pocock)

For experiments with planned interim analyses, alpha spending functions control how much of the significance budget is used at each look:

  • O'Brien-Fleming: Conservative early on, uses most alpha at the end. Recommended for most experiments.
  • Pocock: Equal boundaries at every look. Easier to explain but requires more total sample size.
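The two spending functions can be sketched with the standard Lan-DeMets forms, where t is the information fraction (sample size collected so far divided by the planned total):

```python
import math
from statistics import NormalDist

def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha used by
    information fraction t in (0, 1]. Spends almost nothing early."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Pocock-type spending: spends alpha roughly evenly across looks."""
    return alpha * math.log(1 + (math.e - 1) * t)
```

At t = 0.2, O'Brien-Fleming has spent on the order of 1e-5 of its budget while Pocock has already spent about 0.015; both reach the full 0.05 at t = 1.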

API Reference

GET /api/v1/results/{experiment_id}/sequential

Returns the full sequential analysis for an experiment.

Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| method | msprt \| always_valid | msprt | Sequential testing method |
| alpha | float | 0.05 | Significance level |
| spending_function | obrien_fleming \| pocock | obrien_fleming | Alpha spending function |
| num_looks | int | 5 | Number of planned interim analyses |

Example Request

curl -X GET "http://localhost:8000/api/v1/results/exp-uuid/sequential?method=msprt&alpha=0.05" \
  -H "Authorization: Bearer $TOKEN"

Example Response

{
  "method": "msprt",
  "msprt_result": {
    "lambda_ratio": 24.3,
    "always_valid_p_value": 0.012,
    "can_stop": true,
    "evidence_strength": "strong_for_effect",
    "boundary": 20.0
  },
  "confidence_sequence": {
    "lower": 0.02,
    "upper": 0.08,
    "width": 0.06,
    "sample_size": 4200
  },
  "evidence_trajectory": [
    {"sample_size": 500, "lambda_ratio": 1.2, "always_valid_p_value": 0.41, "can_stop": false},
    {"sample_size": 2000, "lambda_ratio": 8.1, "always_valid_p_value": 0.09, "can_stop": false},
    {"sample_size": 4200, "lambda_ratio": 24.3, "always_valid_p_value": 0.012, "can_stop": true}
  ],
  "alpha_spending": [
    {"look_number": 1, "cumulative_alpha": 0.001, "boundary_z": 4.88, "boundary_p": 0.001},
    {"look_number": 2, "cumulative_alpha": 0.005, "boundary_z": 3.36, "boundary_p": 0.004}
  ],
  "long_running_risk": {
    "is_at_risk": false,
    "expected_duration_days": 14,
    "actual_duration_days": 10,
    "risk_ratio": 0.71,
    "recommendation": "Experiment is progressing normally."
  },
  "recommended_action": "stop_for_effect"
}
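A client consuming this response might map the msprt_result fields to an action using the decision rule described earlier (ratio above 1/alpha, ratio below alpha). This is a hypothetical client-side helper, not part of the API:

```python
def recommended_action(msprt_result, alpha=0.05):
    """Map an msprt_result payload to a stop/continue decision using
    the thresholds lambda_ratio > 1/alpha and lambda_ratio < alpha."""
    lam = msprt_result["lambda_ratio"]
    if lam > 1 / alpha:
        return "stop_for_effect"
    if lam < alpha:
        return "stop_for_futility"
    return "continue"
```

For the example response, 24.3 > 1/0.05 = 20, so the helper agrees with the server's recommended_action of stop_for_effect.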

Evidence Strength Values

| Value | Meaning |
|---|---|
| strong_for_effect | Clear effect detected — safe to stop and ship |
| moderate_for_effect | Trending positive — consider continuing for confirmation |
| inconclusive | Not enough data yet — continue |
| moderate_for_null | Trending toward no effect |
| strong_for_null | No effect detected — safe to stop for futility |

Recommended Actions

| Value | Meaning |
|---|---|
| stop_for_effect | Stop the experiment; treatment wins |
| stop_for_futility | Stop the experiment; no meaningful effect |
| continue | Keep running — not enough evidence yet |

Permissions

  • VIEWER: ✅ Read access
  • ANALYST: ✅ Read access
  • DEVELOPER: ✅ Read access
  • ADMIN: ✅ Full access

Common Mistakes

Don't report the always_valid_p_value as if it were a fixed-horizon p-value. It can be interpreted like a p-value at any stopping point, but it is computed differently and is typically more conservative at the same sample size, so label it as an always-valid p-value in reports.

Don't change the alpha or method mid-experiment. Fix your parameters before launch.

Long-running risk is informational only. An experiment flagged as is_at_risk may still be producing valid results — it just indicates the experiment is taking longer than expected given the observed effect size.
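The example response is consistent with risk_ratio being actual duration divided by expected duration (10 / 14 ≈ 0.71). A minimal sketch of that reading follows; the at-risk threshold here is an assumption for illustration, not this API's documented rule:

```python
def long_running_risk(actual_days, expected_days, threshold=1.5):
    """Illustrative reconstruction: ratio of actual to expected duration.
    The 1.5x at-risk threshold is an assumption, not the API's rule."""
    ratio = actual_days / expected_days
    return {"risk_ratio": round(ratio, 2), "is_at_risk": ratio > threshold}
```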