User Guide
Guide for product managers, experimenters, and analysts using the Experimentation Platform.
Introduction
The Experimentation Platform enables you to:
- A/B test product changes with statistical rigor
- Feature flag new functionality for controlled rollouts
- Analyze results with real-time statistical significance reporting
- Target users with precise rules-based segmentation
Accessing the Platform
Open the dashboard at http://localhost:3000 (development) or your production URL.
You'll need an account with one of the following roles:
| Role | What You Can Do |
|---|---|
| ADMIN | Everything — manage users, all experiments, all flags |
| DEVELOPER | Create and manage your own experiments and feature flags |
| ANALYST | View all experiments, results, and reports (read-only) |
| VIEWER | View approved experiments and their results |
Contact your platform admin to request access or role changes.
Experiments
What Is an A/B Test?
An A/B test (experiment) splits your users into groups, shows each group a different version of a feature, and measures which version performs better on a defined metric.
Key concepts:
- Control — the current/existing experience (your baseline)
- Treatment/Variant — the new experience you're testing
- Metric — what you're measuring (conversion rate, revenue, clicks)
- Statistical significance — confidence that the difference is real, not random noise
Creating an Experiment
Step 1: Define the experiment
Navigate to Experiments → New Experiment and fill in:
- Name: Descriptive name, e.g., "Homepage CTA Button Color Q1 2026"
- Key: Auto-generated slug, e.g., homepage-cta-color — this is used in code
- Hypothesis: "Changing the CTA button from blue to green will increase click-through rate by 10%"
- Description: Background context, links to design doc, Jira ticket
Step 2: Add variants
Every experiment needs at least one control and one treatment:
| Variant | Is Control | Traffic % |
|---|---|---|
| Control (Blue Button) | ✅ | 50% |
| Treatment (Green Button) | — | 50% |
Traffic allocation must total ≤ 100%. The remaining percentage is excluded from the experiment.
For multivariate tests, add more variants:
| Variant | Is Control | Traffic % |
|---|---|---|
| Control | ✅ | 34% |
| Green Button | — | 33% |
| Red Button | — | 33% |
Step 3: Define metrics
Choose what to measure. Every experiment should have one primary metric (your north star for the decision) and optional secondary metrics.
| Metric Type | When to Use | Example |
|---|---|---|
| CONVERSION | Binary outcome (did it happen?) | Clicked CTA, completed checkout, signed up |
| REVENUE | Dollar amounts | Order value, LTV increase |
| COUNT | How many times | Page views per session, actions taken |
| DURATION | Time measurements | Session duration, time to first action |
The metric's event name must match exactly what your engineers track in code (e.g., checkout_complete).
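To make the exact-match requirement concrete, here is a minimal Python stub. `track_event` and `METRIC_EVENT` are placeholders for illustration, not the platform's real SDK:

```python
# Illustrative stub: track_event and TRACKED_EVENTS stand in for
# whatever analytics call your app actually uses.
TRACKED_EVENTS = []

def track_event(user_id, event_name):
    """Record an analytics event (stand-in for your real tracking call)."""
    TRACKED_EVENTS.append((user_id, event_name))

# Must equal the metric's event name in the experiment config, exactly:
METRIC_EVENT = "checkout_complete"   # "checkout-complete" would NOT match

def on_checkout_success(user_id):
    track_event(user_id, METRIC_EVENT)

on_checkout_success("user-1")
```

Keeping the event name in a shared constant (or generated config) prevents silent drift between the experiment definition and the tracking code.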
Step 4: Set targeting rules (optional)
Only want to test on specific users? Add targeting rules:
- Country is in [US, CA]
- Subscription plan is "premium"
- Account age is greater than 30 days
Users who don't match targeting rules are excluded from the experiment entirely.
Step 5: Start the experiment
Click Start Experiment. The status changes to ACTIVE.
You can also schedule automatic start/end dates: go to Experiment → Schedule and set future dates. The experiment will auto-activate at the start date and auto-complete at the end date.
Pausing and Stopping
Pause: Temporarily halt assignment of new users. Existing assignments are preserved. Use when you need to investigate anomalies or if there's a critical bug.
Complete: Stop the experiment. Use when you have sufficient data and are ready to make a decision. Existing data is preserved.
Rules for stopping early:
- You need statistical significance (p-value < 0.05) AND practical significance (effect size is meaningful)
- Resist the urge to stop as soon as significance is reached — this inflates false positive rates
- Use the Sample Size Meter to confirm you've reached the required sample size before deciding
Reading Results
Navigate to Experiments → [Your Experiment] → Results or http://localhost:3000/results/EXPERIMENT_ID.
Experiment Summary Card
Shows at a glance:
- Status: Active / Completed
- Duration: Days running
- Total Users: Participants across all variants
- Recommendation: SHIP VARIANT / KEEP CONTROL / CONTINUE TESTING / INCONCLUSIVE
Recommendation Meanings
| Recommendation | Meaning | Action |
|---|---|---|
| SHIP VARIANT | Treatment is significantly better | Deploy the treatment to all users |
| KEEP CONTROL | Control is significantly better | Do not ship the treatment |
| CONTINUE TESTING | Not enough data yet | Wait for more data before deciding |
| INCONCLUSIVE | No meaningful difference detected | Consider whether the change is worth shipping anyway |
Metric Comparison Table
For each metric and variant pair:
| Column | Meaning |
|---|---|
| Rate/Mean | Conversion rate or average value for this variant |
| Relative Improvement | How much better/worse vs. control (e.g., +12.3%) |
| p-value | Probability of seeing a difference this large by chance alone, if there were no real effect (lower = stronger evidence) |
| Significant | p-value < (1 - confidence level), e.g., < 0.05 for 95% confidence |
| Effect Size | Practical magnitude: negligible / small / medium / large |
| Confidence Interval | Range where the true difference likely falls |
- Green = statistically significant improvement
- Red = statistically significant degradation
- Gray = not statistically significant
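For CONVERSION metrics, the Rate, Relative Improvement, and p-value columns can be reproduced with a standard two-proportion z-test. This is a sketch of the underlying math; the platform's exact statistical method may differ:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing two conversion rates.

    conv_a/n_a: conversions and users in control.
    conv_b/n_b: conversions and users in treatment.
    Returns (relative_improvement_vs_control, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return (p_b - p_a) / p_a, p_value

# 5.0% control rate vs 5.6% treatment rate, 10k users each
lift, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"relative improvement {lift:+.1%}, p = {p:.4f}")
```

In this example the treatment shows a +12% relative lift but the p-value is above 0.05, so the row would render gray: promising, but not yet significant.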
Trend Chart
Shows conversion rate over time for each variant. Two modes:
- Cumulative: Running total from experiment start (recommended — smooths noise)
- Daily: Day-by-day rate (useful for detecting novelty effects or day-of-week patterns)
Look for:
- Parallel lines early → diverging lines → stable difference (healthy experiment)
- Crossing lines → potential interaction effects or bugs
- Spike on day 1 → novelty effect (users excited about newness)
Sample Size Meter
Shows current vs. required sample size:
- Green (≥100%): Adequate — you have enough data for a reliable decision
- Amber (80–99%): Almost there — wait a bit longer
- Red (<80%): Not enough data — results are unreliable, don't make a decision
"Days to Significance" estimates how long until you reach the required sample size at the current rate.
Making a Decision
Once your experiment is adequately powered and statistically significant:
- Check primary metric: Is the direction positive? Is the effect meaningful?
- Check secondary metrics: Did anything unexpected change (e.g., revenue went up but customer satisfaction dropped)?
- Check for novelty effect: Did the initial spike smooth out? Are trends stable?
- Consult stakeholders: Share results and the recommendation with the team
Ship: Deploy the winning variant to 100% of users. Mark the experiment as Completed.
Keep control: The change didn't work. Archive or redesign the hypothesis.
Document your decision: Add a note in the experiment description with the outcome and rationale. This creates institutional memory.
Feature Flags
Feature flags let you control which users see a feature without deploying new code. They're useful for:
- Canary releases: Deploy to 5% of users first, monitor, then expand
- Kill switches: Instant rollback without a code deploy
- Beta programs: Enable features for specific user groups
- Gradual rollouts: Expand from 10% → 50% → 100% over time
Creating a Feature Flag
Navigate to Feature Flags → New Flag:
- Key: Identifier used in code, e.g., new-checkout-flow (cannot change after creation)
- Name: Human-readable name
- Description: What this flag controls
- Rollout Percentage: Start at 0 for a disabled flag
Controlling Rollout
Manual rollout:
Go to the flag → Edit → set Rollout Percentage to the desired value. Changes take effect within 60 seconds (cache TTL).
| Percentage | Meaning |
|---|---|
| 0% | Feature disabled for all users |
| 5% | 1 in 20 users sees the feature |
| 50% | Half of users |
| 100% | All users see the feature |
Who gets the feature? The platform uses a deterministic hash of the user's ID. The same user always gets the same result (sticky assignment). This means:
- If you set 5%, the same 5% of users will consistently see the feature
- If you increase to 50%, the original 5% are still included (plus 45% more)
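The sticky, monotone behavior described above falls out of hashing the user ID into a fixed bucket in [0, 100) and comparing it to the rollout percentage. A sketch in Python — the MD5-based hash here is an illustrative assumption, not necessarily the platform's actual function:

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> float:
    """Deterministically map (user, flag) to a stable value in [0, 100)."""
    digest = hashlib.md5(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32 * 100

def is_enabled(user_id: str, flag_key: str, rollout_pct: float) -> bool:
    # Same user + same flag => same bucket => same answer every time.
    return bucket(user_id, flag_key) < rollout_pct

# Monotone expansion: every user enabled at 5% is still enabled at 50%,
# because their bucket value hasn't changed — only the threshold has.
enabled_at_5 = [u for u in (f"user-{i}" for i in range(1000))
                if is_enabled(u, "new-checkout-flow", 5)]
still_enabled = all(is_enabled(u, "new-checkout-flow", 50) for u in enabled_at_5)
```

Keying the hash on both the flag and the user also means different flags get independent 5% slices of the user base.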
Adding targeting rules:
Optionally restrict to specific users:
- Enable only for users in the US
- Only for users on the "enterprise" plan
- Only for users who signed up before a certain date
Scheduled Rollouts
For gradual rollouts on a schedule:
Go to Flag → Rollout Schedule → Create Schedule:
| Stage | Target % | Trigger Type | Start Date |
|---|---|---|---|
| Stage 1 | 5% | Time-based | Apr 1, 2026 |
| Stage 2 | 25% | Time-based | Apr 8, 2026 |
| Stage 3 | 100% | Manual | — |
The platform automatically advances time-based stages. Manual stages require you to explicitly click Advance Stage in the UI.
Best practice: End your schedule with a manual stage for the final 100% rollout. This gives you a human approval gate before full deployment.
Disabling / Rolling Back
To instantly disable a flag:
- Set Rollout Percentage = 0
- Or click Deactivate to change the status to INACTIVE
If you have safety monitoring configured, the platform can auto-rollback if error rates spike.
Analytics & Reporting
Viewing All Experiments
Experiments list shows all experiments with status, owner, and quick stats.
Filter by:
- Status: DRAFT / ACTIVE / PAUSED / COMPLETED
- Date range
- Owner
- Experiment type
Exporting Results
From the Results page:
- CSV export: Download variant-level metric data
- API access: GET /api/v1/results/{id} returns full JSON with all statistics
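A minimal Python sketch of consuming the results endpoint. The endpoint path comes from this guide; the response fields shown (`recommendation`, `metrics`, etc.) are illustrative assumptions, not the documented schema:

```python
import json
from urllib.request import urlopen

def fetch_results(base_url: str, experiment_id: str) -> dict:
    """GET /api/v1/results/{id} and parse the JSON body."""
    with urlopen(f"{base_url}/api/v1/results/{experiment_id}") as resp:
        return json.load(resp)

# Example of working with a response; this payload shape is assumed.
sample = json.loads("""{
  "experiment_id": "homepage-cta-color",
  "recommendation": "SHIP_VARIANT",
  "metrics": [
    {"name": "checkout_complete", "p_value": 0.012, "significant": true},
    {"name": "session_duration", "p_value": 0.31, "significant": false}
  ]
}""")
significant = [m["name"] for m in sample["metrics"] if m["significant"]]
```

This is handy for piping results into a notebook or an internal dashboard instead of reading the UI.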
Targeting Rules Reference
Basic Rules
country IN [US, CA, GB] → User's country is one of the list
plan EQUALS premium → Exact match
age GREATER_THAN 18 → Numeric comparison
email CONTAINS @company.com → String match
app_version SEMVER_GT 2.0.0 → Semantic version comparison
Logical Operators
AND: all rules must match
OR: at least one rule must match
NOT: rule must NOT match
Complex Example
Target US/EU premium users who are active:
AND:
country IN [US, GB, DE, FR]
plan IN [pro, enterprise]
days_since_last_login LESS_THAN 30
Target either new users OR power users (but not average users):
OR:
account_age_days LESS_THAN 7
AND:
lifetime_purchases GREATER_THAN 20
plan EQUALS enterprise
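Rule evaluation like the examples above can be sketched as a small recursive function. The tuple representation below is hypothetical, purely to illustrate how the operators compose:

```python
def matches(rule, user: dict) -> bool:
    """Evaluate a rule tree against a user's attributes.

    Leaves are (OPERATOR, field, value); branches are
    ("AND", [rules]), ("OR", [rules]), or ("NOT", rule).
    """
    op = rule[0]
    if op == "AND":
        return all(matches(r, user) for r in rule[1])
    if op == "OR":
        return any(matches(r, user) for r in rule[1])
    if op == "NOT":
        return not matches(rule[1], user)
    field, value = rule[1], rule[2]
    if op == "IN":
        return user.get(field) in value
    if op == "EQUALS":
        return user.get(field) == value
    if op == "GREATER_THAN":
        return user.get(field, 0) > value
    if op == "LESS_THAN":
        return user.get(field, float("inf")) < value
    raise ValueError(f"unknown operator: {op}")

# "US/EU premium users who are active" from the example above
rule = ("AND", [
    ("IN", "country", ["US", "GB", "DE", "FR"]),
    ("IN", "plan", ["pro", "enterprise"]),
    ("LESS_THAN", "days_since_last_login", 30),
])
user = {"country": "DE", "plan": "pro", "days_since_last_login": 12}
```

Note the defaults in `user.get`: a user missing the attribute fails the comparison, matching the guide's behavior of excluding non-matching users.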
Best Practices
Experiment Design
Do:
- Define your hypothesis and primary metric BEFORE starting the experiment
- Run the experiment for at least one full week to capture weekly patterns
- Wait for statistical significance AND adequate sample size before deciding
- Document your hypothesis, result, and decision for future reference
- Run one change at a time per experiment (isolate variables)
Don't:
- Stop the experiment as soon as you see significance (peeking inflates false positives)
- Change the experiment configuration after it's started
- Run too many concurrent experiments on the same users (interaction effects)
- Use the novelty effect as evidence (users are excited about newness, not the feature)
Feature Flags
Do:
- Always start at 0% and ramp up gradually for risky changes
- Set up safety monitoring for flags that touch payments, auth, or core flows
- Document what the flag controls and who to contact if issues arise
- Clean up flags after full rollout — don't leave them in code forever
Don't:
- Jump straight to 100% for significant changes
- Leave flags enabled indefinitely — schedule cleanup
- Use flags to hide incomplete features in production without testing
Statistical Significance
- p-value < 0.05: significant at the 95% confidence level (the standard threshold)
- p-value < 0.01: significant at the 99% confidence level — use for high-stakes decisions
- Always check effect size: A p-value of 0.001 means nothing if the effect is 0.1%
- Minimum detectable effect: Design experiments to detect a meaningful improvement (e.g., 5%), not just any improvement
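The interplay between minimum detectable effect and sample size can be estimated with the standard normal-approximation formula for a two-proportion test. A sketch — the platform's Sample Size Meter may use a different calculation:

```python
from math import ceil

def required_sample_size(baseline_rate: float, mde_relative: float,
                         alpha_z: float = 1.96, power_z: float = 0.84) -> int:
    """Approximate per-variant sample size for a two-proportion test.

    baseline_rate: control conversion rate (e.g. 0.05 for 5%)
    mde_relative:  minimum detectable effect as a relative lift (0.05 = 5%)
    Defaults: 95% confidence (z = 1.96) and 80% power (z = 0.84).
    """
    p = baseline_rate
    delta = p * mde_relative              # absolute difference to detect
    variance = 2 * p * (1 - p)            # approx. variance of the difference
    return ceil((alpha_z + power_z) ** 2 * variance / delta ** 2)

# Detecting a 5% relative lift on a 5% baseline takes ~120k users per variant
n = required_sample_size(0.05, 0.05)
```

The formula makes the trade-off explicit: halving the MDE quadruples the required sample size, which is why chasing tiny improvements is so expensive.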
Advanced Statistical Methods
Sequential Testing — Stop Experiments Early
Standard A/B tests require a fixed sample size decided upfront. Sequential testing lets you monitor results continuously and stop as soon as you have enough evidence — without inflating your false positive rate.
When to use it: When you need results faster, or when you need to stop early if a variant is performing significantly worse.
Access via the Sequential tab on any experiment results page, or via:
GET /api/v1/results/{experiment_id}/sequential
The platform uses mSPRT (mixture Sequential Probability Ratio Test). When recommended_action is stop_for_effect or stop_for_futility, it is safe to stop. See the Sequential Testing Guide for details.
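A sketch of acting on the endpoint's response. The `recommended_action` field and its values come from this guide; the rest of the payload shape is an assumption:

```python
import json

# Illustrative payload from GET /api/v1/results/{id}/sequential;
# only recommended_action is documented above, the other field is assumed.
payload = json.loads("""{
  "experiment_id": "homepage-cta-color",
  "recommended_action": "stop_for_effect"
}""")

STOP_ACTIONS = {"stop_for_effect", "stop_for_futility"}
safe_to_stop = payload["recommended_action"] in STOP_ACTIONS
```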
CUPED — Reach Significance Faster
CUPED reduces result noise by adjusting for each user's pre-experiment behavior. This typically cuts the required sample size by 20–40%.
When to use it: When you have historical metric data for your users (e.g., prior revenue, prior sessions). Works best when the covariate is strongly correlated with the outcome.
Access via the CUPED tab on any experiment results page, or via:
GET /api/v1/results/{experiment_id}/cuped
See the CUPED Guide for covariate selection guidance.
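The core CUPED adjustment is short: subtract out the part of each user's outcome that their pre-experiment behavior predicts. A self-contained sketch of the math:

```python
from statistics import mean, variance

def cuped_adjust(y, x):
    """CUPED adjustment: y_adj[i] = y[i] - theta * (x[i] - mean(x)),
    with theta = cov(y, x) / var(x).

    y: in-experiment metric per user; x: the same user's pre-experiment
    covariate. The adjusted values keep the same mean as y but have lower
    variance whenever x correlates with y.
    """
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

# Pre-experiment spend strongly predicts in-experiment spend:
pre  = [10, 20, 30, 40, 50]
post = [12, 24, 29, 43, 52]
adjusted = cuped_adjust(post, pre)
```

Because the mean is unchanged, treatment-vs-control comparisons on the adjusted values are unbiased; the variance reduction is what lets you reach significance with fewer users.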
Dimensional Analysis — Did the Effect Vary by Segment?
After an experiment concludes, use dimensional analysis to understand whether the treatment worked differently for different groups (mobile vs desktop, new vs returning users, etc.).
Important: Segment findings are always exploratory. Use them to generate hypotheses for follow-up experiments, not as final conclusions.
Access via the Breakdowns tab on any experiment results page, or via:
GET /api/v1/results/{experiment_id}/breakdown?dimension=device
See the Dimensional Analysis Guide.
Multi-Armed Bandit — Maximize Conversions During the Experiment
A bandit experiment automatically shifts traffic toward the better-performing variant as data accumulates. Use this when maximizing conversions during the experiment matters more than getting precise effect size estimates.
When to use it: Short-lived promotions, content recommendations, or situations where you have many variants to test quickly.
Set optimization_type to thompson_sampling, ucb1, or epsilon_greedy when creating an experiment. See the Multi-Armed Bandit Guide.
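Thompson sampling, one of the three optimization types, can be sketched in a few lines: each arm keeps a Beta posterior over its conversion rate, and traffic drifts toward whichever arm's sampled rate wins most often.

```python
import random

def thompson_pick(arms):
    """Choose a variant by Thompson sampling.

    arms maps variant name -> (successes, failures). Sample a plausible
    conversion rate from each arm's Beta(s+1, f+1) posterior and pick
    the arm whose sample is highest.
    """
    samples = {name: random.betavariate(s + 1, f + 1)
               for name, (s, f) in arms.items()}
    return max(samples, key=samples.get)

# Observed so far: control converts at ~5%, green button at ~7%
arms = {"control": (50, 950), "green-button": (70, 930)}
random.seed(7)  # fixed seed so the simulation is reproducible
picks = [thompson_pick(arms) for _ in range(1000)]
share_green = picks.count("green-button") / len(picks)
```

With this much evidence the better arm already receives the large majority of assignments, while the control still gets occasional traffic — that residual exploration is what keeps the bandit from locking onto an early fluke.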
Interaction Detection — Are Your Experiments Interfering?
When multiple experiments run simultaneously on overlapping user populations, they can distort each other's results. Run an interaction scan to check.
Access via:
GET /api/v1/interactions/scan
If high-risk pairs are found, add the experiments to a Mutual Exclusion Group to prevent overlap in future runs. See the Interaction Detection Guide.
Getting Help
- API Documentation: http://localhost:8000/docs
- Technical Guide: See Technical Guide for implementation details
- Testing Guide: See Testing Guide for test workflows
- Issues: https://github.com/amarkanday/experimentation-platform/issues
- Slack: #experimentation-platform channel