Core Concepts
This guide explains the fundamental concepts you need to understand before working with the platform. Whether you are a product manager designing experiments, an engineer integrating feature flags, or an analyst reading results, this reference will clarify what each term means and how the pieces fit together.
Feature Flags
A feature flag (also called a feature toggle or feature switch) is a configuration key that controls whether a piece of functionality is visible or active for a given user. Instead of deploying code to enable or disable a feature, you change a flag value at runtime — no deployment required.
On/Off Flags
The simplest type. A boolean value controls whether the feature is shown at all.
dark-mode: true → show dark theme
dark-mode: false → show light theme
Common uses: gradual rollouts, kill switches, beta programs, maintenance mode.
Multivariate Flags
Instead of a boolean, the flag returns one of several string values. Each value maps to a different experience.
checkout-layout: "classic"
checkout-layout: "simplified"
checkout-layout: "one-page"
Common uses: testing more than two variants, configuring UI themes, enabling different feature tiers per plan.
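Both flag types reduce to the same lookup shape at evaluation time. A minimal sketch, assuming a simple in-memory store; the platform's real SDK API is not shown in this guide, so `evaluate_flag` and the flag names here are illustrative:

```python
FLAGS = {
    "dark-mode": True,               # on/off flag: boolean value
    "checkout-layout": "simplified", # multivariate flag: one of several strings
}

def evaluate_flag(key, default=None):
    """Return the flag's current value, or the default for unknown flags."""
    return FLAGS.get(key, default)

print(evaluate_flag("dark-mode"))                # True
print(evaluate_flag("checkout-layout"))          # simplified
print(evaluate_flag("missing-flag", "classic"))  # falls back to "classic"
```

Always pass a sensible default so your application still works if the flag store is unreachable or the key is missing.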
Why Use Feature Flags
- Deploy code that is hidden behind a flag before it is ready to ship
- Roll out to a small percentage of users first, then expand gradually
- Instantly disable a problematic feature without a code deploy
- Enable features for internal testers before public launch
- Target features to specific segments (premium users, US users, mobile users)
Experiments / A/B Tests
An experiment (also called an A/B test or controlled experiment) is the process of splitting users into groups, exposing each group to a different version of a feature, and measuring which version performs better on a defined metric.
Hypothesis
Every experiment starts with a hypothesis: a precise, falsifiable statement of what you expect to happen and why.
Good hypothesis: "Changing the checkout CTA button from blue to green will increase checkout completion rate by 5% because green creates stronger visual contrast against the white background."
A good hypothesis specifies the change, the expected direction, the magnitude, and the reason.
Control vs Treatment
- Control: the existing experience, unchanged. Serves as the baseline.
- Treatment: a new experience you are testing. Sometimes called a variant, though strictly speaking the control is also a variant (see Variants below).
An experiment can have one control and one or more treatments. With two variants (control plus one treatment) it is called an A/B test; with more than two variants it is called an A/B/n test. (A multivariate test is related but distinct: it varies several elements at once and tests their combinations.)
Statistical Significance
Statistical significance tells you how unlikely the observed difference between control and treatment would be if there were no true effect, i.e. if the difference were just random noise. It is expressed as a p-value: the probability of seeing a difference at least this extreme if control and treatment actually performed the same.
- p-value < 0.05: the standard threshold for most experiments.
- p-value < 0.01: a stricter threshold, used for high-stakes decisions.
A result is statistically significant when p < your chosen alpha threshold (usually 0.05).
Important: statistical significance does not tell you the effect is large or meaningful. Always check effect size alongside p-value.
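As a rough sketch of where a p-value comes from, here is a two-sided two-proportion z-test, one common way to compare conversion rates. The platform's actual statistics engine is not documented in this guide, and the numbers below are illustrative:

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Control: 500 of 10,000 converted. Treatment: 580 of 10,000 converted.
p = two_proportion_p_value(500, 10_000, 580, 10_000)
print(f"p-value = {p:.4f}", "(significant at alpha = 0.05)" if p < 0.05 else "(not significant)")
```

Note that the same absolute difference yields a much larger p-value at smaller sample sizes, which is why experiments need a planned sample size up front.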
Variants
A variant is one of the possible experiences in an experiment. Every experiment has at least two variants:
- Control variant: the baseline experience (usually what users see today)
- Treatment variant(s): the new experience(s) you are testing
Each variant is assigned a traffic weight — the fraction of eligible users who will be shown that variant. Weights must sum to 100% (or less if you want to exclude some users from the experiment entirely).
Variants vs Feature Flags
A feature flag controls access to a feature for a user. A variant controls which version of a feature a user sees within an experiment. Experiments are temporary; feature flags are often permanent configuration.
Metrics
A metric is the measurable outcome you use to evaluate whether a variant is better or worse than the control.
Metric Types
| Type | Description | Example |
|---|---|---|
| Conversion | Binary: did the event happen or not? | Clicked button, completed checkout, signed up |
| Revenue | Dollar value | Order value, lifetime value |
| Count | How many times the event occurred | Page views per session, items added to cart |
| Duration | Time measurement | Session length, time to first action |
Primary vs Secondary Metrics
Each experiment should have exactly one primary metric — the north star for your ship/no-ship decision. All other metrics are secondary and used to check for unexpected side effects.
Example: primary = checkout completion rate; secondary = cart abandonment rate, average order value, customer support contacts.
Guardrail Metrics
Guardrail metrics are secondary metrics with a minimum acceptable threshold. If a treatment improves your primary metric but degrades a guardrail metric beyond its threshold, the experiment should not ship.
Example guardrail: "Support ticket rate must not increase by more than 5%."
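The guardrail check reduces to a threshold comparison per metric. A sketch, where the metric names, rates, and data shapes are illustrative assumptions:

```python
def guardrails_pass(control, treatment, guardrails):
    """guardrails maps metric name -> max allowed relative increase (0.05 = 5%)."""
    for metric, max_increase in guardrails.items():
        relative_change = (treatment[metric] - control[metric]) / control[metric]
        if relative_change > max_increase:
            return False
    return True

control   = {"support_ticket_rate": 0.020}
treatment = {"support_ticket_rate": 0.022}  # a 10% increase, past the 5% guardrail
print(guardrails_pass(control, treatment, {"support_ticket_rate": 0.05}))  # False
```

In practice the guardrail comparison should itself be statistically tested, not just compared point-to-point, but the ship/no-ship logic has this shape.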
User Bucketing
User bucketing is the process of assigning each user to a variant. The platform uses consistent hash-based bucketing.
Consistent Hash Assignment
The assignment is computed as a deterministic hash of experiment_key + user_id. This means:
- The same user always gets the same variant for a given experiment (sticky assignment)
- No database lookup is needed to retrieve the assignment — it can be computed on the fly
- Sticky assignment survives server restarts, cache clears, and SDK reinitializations
The hash output is mapped to a bucket number (0–99). Variant boundaries are set by traffic weights: if control has 50% weight and treatment has 50% weight, users with bucket 0–49 get control and users with bucket 50–99 get treatment.
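The scheme above can be sketched in Python. SHA-256 stands in for the platform's unspecified hash function, and `assign_variant` is a hypothetical helper:

```python
import hashlib

def bucket(experiment_key, user_id):
    """Deterministically map experiment_key + user_id to a bucket in 0-99."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign_variant(experiment_key, user_id, weights):
    """weights: ordered (variant_name, percent) pairs summing to 100."""
    b = bucket(experiment_key, user_id)
    boundary = 0
    for name, percent in weights:
        boundary += percent
        if b < boundary:
            return name
    return None  # unreachable when weights sum to 100

weights = [("control", 50), ("treatment", 50)]
# Sticky: the same user gets the same variant every time, with no lookup table.
assert assign_variant("checkout-test", "user-42", weights) == \
       assign_variant("checkout-test", "user-42", weights)
```

Hashing the experiment key together with the user ID also means a user's bucket in one experiment is independent of their bucket in another.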
Why Consistent Bucketing Matters
Without consistent bucketing, a user could see the control variant on Monday and the treatment variant on Wednesday. This would contaminate your results and create a confusing user experience.
Traffic Allocation vs Rollout Percentage
These two concepts are related but distinct:
Traffic Allocation (Experiments)
Traffic allocation is the fraction of users who are included in the experiment at all. An experiment with 80% traffic allocation means 20% of eligible users are excluded from the experiment entirely and see the default experience.
Within the 80% who are included, users are split between variants according to their weights.
Rollout Percentage (Feature Flags)
Rollout percentage is the fraction of users for whom a feature flag evaluates to true (enabled). A feature flag at 10% rollout means 10% of users see the feature and 90% do not.
Unlike experiment traffic allocation, rollout percentage is usually increased over time as confidence grows.
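Because the rollout percentage grows over time, the gate is usually implemented with the same deterministic-hash idea, so users already enabled stay enabled as the percentage rises. A sketch, with illustrative flag names and hash choice:

```python
import hashlib

def flag_enabled(flag_key, user_id, rollout_percent):
    """True when the user's bucket (0-99) falls below the rollout percentage."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Raising the rollout only ever adds users: anyone enabled at 10% is still
# enabled at 50%, because a user's bucket number never changes.
enabled_at_10 = {u for u in range(1000) if flag_enabled("new-nav", f"user-{u}", 10)}
enabled_at_50 = {u for u in range(1000) if flag_enabled("new-nav", f"user-{u}", 50)}
assert enabled_at_10 <= enabled_at_50
```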
Targeting Rules
Targeting rules define which users are eligible for an experiment or feature flag. A user who does not match the targeting rules is excluded entirely — they are not assigned to any variant and do not count in the results.
Attribute-Based Rules
Rules are evaluated against user attributes that your application passes to the platform at evaluation time.
country IN [US, CA, GB]
plan EQUALS premium
account_age_days GREATER_THAN 30
app_version SEMVER_GTE 3.0.0
email ENDS_WITH @company.com
Combining Rules
Rules can be combined using AND (all must match), OR (at least one must match), and NOT (inverts the result).
AND:
country IN [US, CA]
plan EQUALS enterprise
account_age_days GREATER_THAN 14
This would target enterprise users in the US or Canada whose account is more than 14 days old (about two weeks).
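Rule evaluation can be sketched as attribute lookups plus an operator table. The rule representation below is an illustrative assumption, not the platform's wire format; the operator names mirror the examples above:

```python
OPS = {
    "IN":           lambda value, arg: value in arg,
    "EQUALS":       lambda value, arg: value == arg,
    "GREATER_THAN": lambda value, arg: value > arg,
    "ENDS_WITH":    lambda value, arg: value.endswith(arg),
}

def matches_all(user, rules):
    """AND semantics: every rule must match the user's attributes."""
    return all(OPS[op](user.get(attr), arg) for attr, op, arg in rules)

rules = [
    ("country", "IN", ["US", "CA"]),
    ("plan", "EQUALS", "enterprise"),
    ("account_age_days", "GREATER_THAN", 14),
]
user = {"country": "US", "plan": "enterprise", "account_age_days": 30}
print(matches_all(user, rules))  # True
```

A real evaluator would also need to handle missing attributes explicitly; here an absent attribute would compare as `None` and raise or mismatch.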
Common Targeting Attributes
| Attribute | Type | Example |
|---|---|---|
| user_id | string | Target a specific beta tester |
| country | string | Geographic targeting |
| plan | string | Subscription tier |
| account_age_days | number | Target new vs returning users |
| app_version | semver | Target specific app versions |
| device_type | string | Mobile vs desktop |
| email | string | Target internal users by domain |
Environments
The platform supports multiple environments so you can test experiments and flags in staging before they reach production users.
Staging vs Production
| Environment | Purpose |
|---|---|
| staging | Test experiment logic, SDK integration, and targeting rules before launch |
| production | Live experiments and flags serving real users |
Each environment has separate feature flag states and separate experiment assignments. A feature flag that is 100% enabled in staging has no effect in production unless you also enable it there.
Environment-Scoped API Keys
Your SDK client is initialized with an API key that is scoped to a specific environment. Use separate API keys for staging and production to prevent accidental cross-environment leakage.
Glossary
| Term | Definition |
|---|---|
| A/B test | Experiment comparing two variants (control and treatment) |
| A/B/n test | Experiment comparing more than two variants |
| Alpha (α) | The significance threshold; probability of a false positive you are willing to accept (typically 0.05) |
| Bucketing | The process of assigning users to variants |
| Confidence interval | Range of values within which the true effect size likely falls |
| Control | The baseline variant; typically the existing experience |
| Conversion rate | Fraction of users who completed a defined action |
| Covariate | Pre-experiment user attribute used in CUPED variance reduction |
| CUPED | Controlled-experiment Using Pre-Experiment Data; a variance reduction technique |
| Effect size | The magnitude of the difference between variants |
| Experiment | A controlled test that assigns users to variants and measures outcomes |
| Feature flag | A runtime configuration key that controls feature visibility |
| Guardrail metric | A secondary metric that must not degrade beyond a threshold |
| Holdout | A group of users excluded from all experiments, used to measure cumulative experiment impact |
| Hypothesis | A falsifiable statement predicting what will happen in an experiment and why |
| MAB | Multi-Armed Bandit; an adaptive experiment that shifts traffic toward better-performing variants |
| MDE | Minimum Detectable Effect; the smallest improvement worth measuring |
| Metric | A measurable outcome used to evaluate experiment results |
| Mutual exclusion group | A set of experiments configured so users are enrolled in at most one at a time |
| p-value | Probability of observing results as extreme as these if there were no true effect |
| Primary metric | The single metric that determines the ship/no-ship decision |
| Rollout percentage | Fraction of users for whom a feature flag is enabled |
| Segment | A subset of users defined by shared attributes |
| Statistical significance | p-value below the chosen alpha threshold |
| Targeting rule | An attribute-based condition that determines experiment eligibility |
| Traffic allocation | Fraction of eligible users included in an experiment |
| Treatment | A non-control variant in an experiment |
| Variant | One of the possible experiences in an experiment |