Architecture Overview

This document describes the high-level technical architecture of the platform: how the components fit together, how data flows, and how the system is deployed on AWS.


High-Level Components

┌──────────────────────────────────────────────────────────────────┐
│                        Client Applications                        │
│           (web, mobile, server — via SDK or REST API)            │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                     CloudFront CDN                                │
│              (static assets, Lambda@Edge for split URLs)         │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                    API Layer (FastAPI)                            │
│        Experiments / Feature Flags / Analytics / Auth            │
└──────┬───────────────┬───────────────┬───────────────────────────┘
       │               │               │
       ▼               ▼               ▼
  PostgreSQL        Redis           DynamoDB
  (Aurora)      (ElastiCache)    (Real-time counters)
                                       │
                                       ▼
                              Kinesis → Lambda → OpenSearch
                                  (Analytics pipeline)

API Layer

FastAPI Application

The core of the platform is a FastAPI application that exposes the platform's REST API. It is deployed on ECS Fargate and scales horizontally behind an Application Load Balancer.

The API is organized into resource-specific endpoint groups:

Endpoint Group            Description
/api/v1/experiments       Create, update, start, pause, stop, and query experiments
/api/v1/feature-flags     Manage feature flags, targeting rules, and rollout percentages
/api/v1/results           Read statistical results, run CUPED, dimensional analysis, sequential testing
/api/v1/tracking          Ingest assignment and event data from client applications
/api/v1/users             User management and authentication
/api/v1/compliance        Audit event log and compliance reports
/api/v1/integrations      Jira, Salesforce, and GitHub integrations
/api/v1/notifications     Slack and email alert preferences
/api/v1/rbac              Role management and permission grants
/api/v1/bandit            Multi-armed bandit state and weight management
/api/v1/warehouse         Warehouse-native analytics connections and syncs

Interactive API documentation is available at /docs (Swagger UI) and /redoc.

Background Schedulers

Five background schedulers run alongside the API process, each on its own configurable cycle:

Scheduler               Default Cycle   Responsibility
Experiment scheduler    15 minutes      Auto-starts experiments at start_date, auto-stops at end_date
Rollout scheduler       15 minutes      Advances rollout schedule stages based on time-based triggers
Metrics collector       15 minutes      Aggregates raw events into experiment metric summaries
Safety monitor          5 minutes       Checks error rate and latency thresholds; triggers auto-rollback if breached
Bandit scheduler        5 minutes       Recomputes variant weights for active multi-armed bandit experiments
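
As an illustration of how a scheduler of this kind can run inside the API process, here is a minimal asyncio sketch. The function name run_periodically and its parameters are hypothetical and not taken from the platform's code; the sketch only shows the general pattern of running a job once per cycle, surviving a failed cycle, and stopping promptly on shutdown.

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(
    name: str,
    interval_seconds: float,
    job: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
) -> None:
    """Run `job` once per cycle until `stop` is set (hypothetical helper)."""
    while not stop.is_set():
        try:
            await job()
        except Exception as exc:
            # One failed cycle should not kill the scheduler loop.
            print(f"{name}: cycle failed: {exc!r}")
        try:
            # Wait one cycle, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass
```

Under these assumptions, the safety monitor would be started with interval_seconds=300 and the experiment scheduler with interval_seconds=900, each as its own task.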

Database Layer

PostgreSQL (Aurora)

Amazon Aurora PostgreSQL is the primary relational database. It stores all persistent application data:

  • Experiments, variants, metrics definitions
  • Feature flags, targeting rules, rollout schedules
  • User accounts, roles, permissions
  • Audit events and compliance records
  • Integration configurations
  • Results aggregations

All schema changes are managed with Alembic migrations. All tables live in a dedicated PostgreSQL schema named experimentation.

The production Aurora cluster is configured with a read replica. Write operations go to the primary instance; read-heavy queries (results, audit log reads) can be directed to the replica.

Redis (ElastiCache)

Redis serves two purposes:

  1. Session storage: User session tokens are stored in Redis with a configurable TTL. This allows horizontal scaling of the API layer without sticky sessions.

  2. Application cache: Feature flag configurations and experiment assignments are cached in Redis to reduce database load. The default cache TTL is 60 seconds, so a flag change can take up to one minute to reach every API instance.
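
The cache-aside behaviour described above can be sketched without Redis. The class below is an in-memory stand-in, not the platform's implementation: TTLCache, get_or_load, and the loader callback are hypothetical names, and a real deployment would store entries in Redis with an expiry instead of a local dict.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with a per-entry TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        """Return the cached value, or call `loader` (e.g. a database read) on a miss."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 60-second TTL, a value written to the database just after a cache fill stays invisible to readers until the entry expires, which is where the one-minute propagation delay comes from.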


Lambda Functions

Three Lambda functions handle high-throughput, latency-sensitive operations:

Experiment Assignment Lambda

Evaluates experiment targeting rules and assigns users to variants. Called directly by SDK clients for server-side assignment.

  • Input: experiment_key, user_id, attributes
  • Output: variant_key, assignment metadata
  • Uses consistent hash bucketing for deterministic results
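
A minimal sketch of hash-based bucketing follows. It assumes SHA-256 over experiment_key and user_id and 10,000 buckets; the platform's actual hash function, bucket count, and weight format are not specified in this document, so bucket, assign_variant, and BUCKETS are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

BUCKETS = 10_000  # assumed granularity, not taken from the platform


def bucket(experiment_key: str, user_id: str) -> int:
    """Map (experiment, user) to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign_variant(experiment_key: str, user_id: str,
                   variants: Sequence[Tuple[str, float]]) -> str:
    """Pick a variant from (variant_key, weight) pairs whose weights sum to 1.0."""
    point = bucket(experiment_key, user_id) / BUCKETS
    cumulative = 0.0
    for variant_key, weight in variants:
        cumulative += weight
        if point < cumulative:
            return variant_key
    return variants[-1][0]  # guard against floating-point shortfall
```

Because the bucket depends only on the experiment key and user ID, the same user gets the same variant on every call without any stored state, and including the experiment key keeps assignments independent across experiments.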

Event Processor Lambda

Processes incoming tracking events (impressions and conversions) at scale. Triggered by events arriving on the Kinesis stream.

  • Validates events against known experiment/metric configurations
  • Writes processed events to OpenSearch for analytics queries
  • Updates DynamoDB counters atomically

Feature Flag Evaluation Lambda

Evaluates feature flag targeting rules and rollout percentages for a given user. Called by SDK clients for server-side flag evaluation.

  • Input: flag_key, user_id, attributes
  • Output: enabled (boolean), flag metadata
  • Results are cached at the Lambda layer using an in-memory LRU cache
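
The rollout-percentage check and the in-memory LRU cache can be sketched together. This is an assumption-laden illustration: is_enabled is a hypothetical name, targeting rules are omitted, and the hash scheme is chosen for the example only.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=10_000)
def is_enabled(flag_key: str, user_id: str, rollout_percentage: int) -> bool:
    """Return True if the user falls inside the flag's rollout percentage.

    Targeting rules are left out; only the percentage rollout is shown.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < rollout_percentage
```

Passing rollout_percentage as an argument makes it part of the cache key, so a changed percentage is not served from a stale entry. Note that functools.lru_cache has no time-based expiry; entries live until evicted or until the Lambda instance is recycled.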

Real-Time Counters (DynamoDB)

Impression and conversion counts are tracked in DynamoDB using atomic ADD operations. This provides high-throughput, conflict-free counter increments without database locking.

Why DynamoDB for Counters

Relational databases handle high-frequency counter updates poorly: concurrent increments to the same row either serialize on a row lock or, when done as an application-level read-modify-write, risk lost updates. DynamoDB's atomic ADD operation performs the increment server-side, making it safe for concurrent writes from many Lambda instances.

Counter Schema

partition_key: experiment_id
sort_key:      variant_id#metric_key
impressions:   (atomic counter)
conversions:   (atomic counter)

The metrics collector scheduler reads from DynamoDB every 15 minutes and writes aggregated results to PostgreSQL for historical storage and statistical analysis.
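
The following sketch builds the request parameters for an atomic ADD against the counter schema above. It assumes the key attributes are literally named partition_key and sort_key, which the schema listing does not confirm; build_counter_update and the table name used below are hypothetical.

```python
from typing import Any, Dict


def build_counter_update(table_name: str, experiment_id: str, variant_id: str,
                         metric_key: str, impressions: int = 0,
                         conversions: int = 0) -> Dict[str, Any]:
    """Build UpdateItem parameters that increment both counters atomically."""
    return {
        "TableName": table_name,
        "Key": {
            # Assumed attribute names; adjust to the real table definition.
            "partition_key": {"S": experiment_id},
            "sort_key": {"S": f"{variant_id}#{metric_key}"},
        },
        # ADD increments server-side and creates the attribute if it is missing.
        "UpdateExpression": "ADD #imp :i, #conv :c",
        "ExpressionAttributeNames": {"#imp": "impressions", "#conv": "conversions"},
        "ExpressionAttributeValues": {
            ":i": {"N": str(impressions)},
            ":c": {"N": str(conversions)},
        },
    }


# With boto3 installed and credentials configured, the dict can be passed as:
#   boto3.client("dynamodb").update_item(**params)
```

No read precedes the write, so two Lambda instances incrementing the same item at the same time cannot overwrite each other's counts.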


Analytics Pipeline

Raw events flow through a pipeline for aggregation and search:

Client App
    |
    v
POST /api/v1/tracking/track
    |
    v
Kinesis Data Stream
    |
    v
Event Processor Lambda
    |
    +---> DynamoDB (atomic counters, real-time)
    |
    +---> OpenSearch (full event index, for ad-hoc queries)

Kinesis buffers events and decouples ingestion from processing, so a traffic spike accumulates in the stream instead of overwhelming the Event Processor Lambda or dropping events.

OpenSearch provides the backing store for dimensional analysis, segment breakdowns, and ad-hoc event queries.
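
To make the Event Processor step concrete, here is a sketch of decoding and validating a Kinesis batch inside a Lambda handler. The outer structure (Records, kinesis.data as base64) is the standard shape of a Kinesis-triggered Lambda event; the payload field names in REQUIRED_FIELDS are assumptions, since the tracking event schema is not given in this document.

```python
import base64
import json
from typing import Any, Dict, List

# Assumed payload fields; the real tracking schema may differ.
REQUIRED_FIELDS = ("experiment_key", "variant_key", "user_id", "event_type")
EVENT_TYPES = ("impression", "conversion")


def decode_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the valid tracking events from a Kinesis-triggered Lambda event."""
    valid: List[Dict[str, Any]] = []
    for record in event.get("Records", []):
        try:
            # Kinesis delivers each record's payload base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        except (KeyError, TypeError, ValueError):
            continue  # malformed record: skip instead of failing the batch
        if not isinstance(payload, dict):
            continue
        if all(f in payload for f in REQUIRED_FIELDS) and payload["event_type"] in EVENT_TYPES:
            valid.append(payload)
    return valid
```

A handler built on this would then fan each valid event out to the DynamoDB counter update and the OpenSearch index write.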


Split URL Testing (Lambda@Edge)

Split URL experiments use a different flow from standard A/B tests. Instead of modifying a component within a page, the entire URL path changes between variants. This is handled at the CDN layer:

User Request
    |
    v
CloudFront Distribution
    |
    v
Lambda@Edge (viewer-request event)
    |
    +-- Read cookie exp_{experiment_key}
    |     - Present: use stored variant
    |     - Absent: consistent-hash bucket user_id → assign variant
    |
    +-- Return 302 redirect to variant URL
    |
    +-- Set-Cookie: exp_{key}=variant; Max-Age=31536000; Secure; SameSite=Lax

The Lambda@Edge function is deployed to us-east-1 (a requirement for Lambda@Edge), and CloudFront replicates it to its regional edge caches worldwide so that it runs close to the viewer. The split_url_config is fetched from the API at cold start, cached in memory, and refreshed once its 60-second TTL expires.
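
The viewer-request flow above can be sketched as a Python handler. The event and response shapes follow the Lambda@Edge request event format; everything else is an assumption made for the example: the inline SPLIT_URL_CONFIG (in place of the config fetched from the API), its source_path field, and the use of a user_id cookie with a client-IP fallback as the bucketing identity.

```python
import hashlib
from http.cookies import SimpleCookie
from typing import Any, Dict, List

# Hypothetical config; the real split_url_config is fetched from the API.
SPLIT_URL_CONFIG = {
    "experiment_key": "pricing_page",
    "source_path": "/pricing",
    "variants": {"control": "/pricing-a", "treatment": "/pricing-b"},
}


def pick_variant(experiment_key: str, user_id: str, variant_keys: List[str]) -> str:
    """Deterministically map a user to one of the variant keys."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    return variant_keys[int.from_bytes(digest[:8], "big") % len(variant_keys)]


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    request = event["Records"][0]["cf"]["request"]
    if request["uri"] != SPLIT_URL_CONFIG["source_path"]:
        return request  # not the experiment URL: pass through (avoids redirect loops)

    cookies = SimpleCookie()
    for header in request.get("headers", {}).get("cookie", []):
        cookies.load(header["value"])

    variants = SPLIT_URL_CONFIG["variants"]
    cookie_name = f"exp_{SPLIT_URL_CONFIG['experiment_key']}"
    stored = cookies[cookie_name].value if cookie_name in cookies else None
    if stored in variants:
        variant = stored  # returning visitor keeps the stored variant
    else:
        # Assumed identity source: a user_id cookie, else the client IP.
        user_id = cookies["user_id"].value if "user_id" in cookies else request.get("clientIp", "")
        variant = pick_variant(SPLIT_URL_CONFIG["experiment_key"], user_id, sorted(variants))

    cookie = f"{cookie_name}={variant}; Max-Age=31536000; Secure; SameSite=Lax"
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": variants[variant]}],
            "set-cookie": [{"key": "Set-Cookie", "value": cookie}],
        },
    }
```

Returning the request object unchanged tells CloudFront to continue normal processing, while returning a response object short-circuits the request with the redirect.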


Authentication

AWS Cognito

User authentication is handled by AWS Cognito. The platform integrates with a Cognito User Pool:

  1. Users log in via the dashboard or API
  2. Cognito issues a JWT access token (valid for 30 minutes) and a refresh token
  3. The JWT is passed in the Authorization: Bearer <token> header on all API requests
  4. The API validates the JWT signature against Cognito's public keys

API Key Authentication

SDK clients and server-to-server integrations use API keys instead of JWT tokens. API keys are:

  • Created by admins via POST /api/v1/api-keys
  • Passed in the X-API-Key: <key> header
  • Scoped to specific operations (read, write, or admin)
  • Revocable without affecting user accounts

Role-Based Access Control

Every user has one of four roles, each with progressively broader permissions:

Role         Description
VIEWER       Read-only access to approved experiments and results
ANALYST      View all experiments, results, audit logs, and reports
DEVELOPER    Create and manage experiments and feature flags
ADMIN        Full access: user management, global settings, compliance exports

Custom roles and direct permission grants extend the base role system for fine-grained access control.
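
Because each base role is described as a superset of the one before it, a permission check can be sketched as an ordered comparison plus a lookup of direct grants. The Role enum values, is_allowed, and the permission string format are hypothetical; custom roles are not modelled here.

```python
from enum import IntEnum
from typing import FrozenSet


class Role(IntEnum):
    """Base roles ordered from narrowest to broadest access."""
    VIEWER = 1
    ANALYST = 2
    DEVELOPER = 3
    ADMIN = 4


def is_allowed(user_role: Role, required_role: Role, permission: str,
               direct_grants: FrozenSet[str] = frozenset()) -> bool:
    """Allow if the base role is high enough or the permission was granted directly."""
    return user_role >= required_role or permission in direct_grants
```

For example, a VIEWER holding a direct grant for "experiments:write" would pass a check that otherwise requires DEVELOPER.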


CDK Infrastructure

All AWS infrastructure is defined as code using AWS CDK v2 (TypeScript). Running cdk deploy --all provisions the complete environment from scratch.

Stacks

Stack              Contents
NetworkStack       VPC, subnets, security groups
DatabaseStack      Aurora PostgreSQL, ElastiCache Redis
ApiStack           ECS Fargate cluster, ALB, ECS service, IAM roles
LambdaStack        Assignment, Event Processor, and Feature Flag Evaluation Lambda functions
DynamoStack        DynamoDB tables for real-time counters
StreamingStack     Kinesis stream, OpenSearch domain
MonitoringStack    CloudWatch dashboards, alarms, log groups
SplitUrlStack      CloudFront distribution, Lambda@Edge function

Deployment Model

The API service uses blue/green deployment via AWS CodeDeploy:

  1. A new task definition is registered (the "green" environment)
  2. CodeDeploy gradually shifts traffic from the old (blue) to the new (green) deployment
  3. If health checks fail, traffic is automatically shifted back to blue
  4. The full cutover completes within minutes with zero downtime

Health and Observability

Endpoint / Resource      Purpose
GET /health              Returns {"status": "ok"} when the API is healthy
GET /metrics             Prometheus-format metrics (request count, latency histograms, error rates)
CloudWatch Dashboards    API latency, Lambda invocations, error rates, queue depth
CloudWatch Alarms        p99 latency > 1s, error rate > 1%, DLQ messages > 0
Structlog JSON logs      Structured logs with request_id, user_id, action, duration_ms
AWS X-Ray                Distributed tracing across API, Lambda, and DynamoDB calls