Architecture Overview
This document describes the high-level technical architecture of the platform: how the components fit together, how data flows, and how the system is deployed on AWS.
High-Level Components
┌──────────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (web, mobile, server — via SDK or REST API) │
└──────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ CloudFront CDN │
│ (static assets, Lambda@Edge for split URLs) │
└──────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ API Layer (FastAPI) │
│ Experiments / Feature Flags / Analytics / Auth │
└──────┬───────────────┬───────────────┬───────────────────────────┘
│ │ │
▼ ▼ ▼
PostgreSQL Redis DynamoDB
(Aurora) (ElastiCache) (Real-time counters)
│
▼
Kinesis → Lambda → OpenSearch
(Analytics pipeline)
API Layer
FastAPI Application
The core of the platform is a FastAPI application that exposes a comprehensive REST API. It is deployed on ECS Fargate and scales horizontally behind an Application Load Balancer.
The API is organized into resource-specific endpoint groups:
| Endpoint Group | Description |
|---|---|
| /api/v1/experiments | Create, update, start, pause, stop, and query experiments |
| /api/v1/feature-flags | Manage feature flags, targeting rules, and rollout percentages |
| /api/v1/results | Read statistical results, run CUPED, dimensional analysis, sequential testing |
| /api/v1/tracking | Ingest assignment and event data from client applications |
| /api/v1/users | User management and authentication |
| /api/v1/compliance | Audit event log and compliance reports |
| /api/v1/integrations | Jira, Salesforce, and GitHub integrations |
| /api/v1/notifications | Slack and email alert preferences |
| /api/v1/rbac | Role management and permission grants |
| /api/v1/bandit | Multi-armed bandit state and weight management |
| /api/v1/warehouse | Warehouse-native analytics connections and syncs |
Interactive API documentation is available at /docs (Swagger UI) and /redoc.
Background Schedulers
Several background tasks run on a configurable cycle alongside the API process:
| Scheduler | Default Cycle | Responsibility |
|---|---|---|
| Experiment scheduler | 15 minutes | Auto-starts experiments at start_date, auto-stops at end_date |
| Rollout scheduler | 15 minutes | Advances rollout schedule stages based on time-based triggers |
| Metrics collector | 15 minutes | Aggregates raw events into experiment metric summaries |
| Safety monitor | 5 minutes | Checks error rate and latency thresholds; triggers auto-rollback if breached |
| Bandit scheduler | 5 minutes | Recomputes variant weights for active multi-armed bandit experiments |
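A minimal sketch of how one of these cycles might be driven — the function name and error handling here are illustrative, not the platform's actual implementation:

```python
import asyncio

async def run_scheduler(task, interval_seconds, iterations=None):
    """Run `task` every `interval_seconds`; `iterations` bounds the loop (useful in tests)."""
    count = 0
    while iterations is None or count < iterations:
        try:
            await task()
        except Exception as exc:  # a failed cycle must not kill the scheduler
            print(f"scheduler cycle failed: {exc}")
        count += 1
        if iterations is None or count < iterations:
            await asyncio.sleep(interval_seconds)

# Example: a stand-in for the safety monitor's 5-minute cycle.
async def check_safety_thresholds():
    pass  # real task: query error rates, trigger auto-rollback if breached

# asyncio.run(run_scheduler(check_safety_thresholds, 300))
```

Swallowing exceptions per cycle (rather than letting them propagate) keeps one bad run — say, a transient database error — from silently stopping all future cycles.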
Database Layer
PostgreSQL (Aurora)
Amazon Aurora PostgreSQL is the primary relational database. It stores all persistent application data:
- Experiments, variants, metrics definitions
- Feature flags, targeting rules, rollout schedules
- User accounts, roles, permissions
- Audit events and compliance records
- Integration configurations
- Results aggregations
All schema changes are managed with Alembic migrations. All tables live in a dedicated PostgreSQL schema named experimentation.
The production Aurora cluster is configured with a read replica. Write operations go to the primary instance; read-heavy queries (results, audit log reads) can be directed to the replica.
Redis (ElastiCache)
Redis serves two purposes:
- Session storage: User session tokens are stored in Redis with a configurable TTL. This allows horizontal scaling of the API layer without sticky sessions.
- Application cache: Feature flag configurations and experiment assignments are cached in Redis to reduce database load. The default cache TTL is 60 seconds, meaning flag changes propagate to all users within one minute.
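The 60-second propagation guarantee falls out of the TTL semantics. A minimal in-process stand-in (the real cache is Redis; the class and clock injection here are illustrative) shows the mechanics:

```python
import time

class TTLCache:
    """Minimal in-process stand-in for the Redis flag cache (default TTL 60 s)."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable for testing
        self._store = {}     # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:  # stale: caller falls back to Postgres
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```

A cache miss (including an expired entry) forces a re-read from PostgreSQL, so a flag edit is visible everywhere once every cached copy has aged past its TTL — at most 60 seconds.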
Lambda Functions
Three Lambda functions handle high-throughput, latency-sensitive operations:
Experiment Assignment Lambda
Evaluates experiment targeting rules and assigns users to variants. Called directly by SDK clients for server-side assignment.
- Input: experiment_key, user_id, attributes
- Output: variant_key, assignment metadata
- Uses consistent hash bucketing for deterministic results
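Consistent hash bucketing can be sketched as follows — the hash function, salt, and weight format are assumptions, not the Lambda's actual internals:

```python
import hashlib

def assign_variant(experiment_key, user_id, variants):
    """Deterministically map a user to a variant via consistent hash bucketing.

    `variants` is a list of (variant_key, weight) pairs whose weights sum to 1.0.
    The same (experiment_key, user_id) pair always yields the same variant.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    cumulative = 0.0
    for variant_key, weight in variants:
        cumulative += weight
        if bucket <= cumulative:
            return variant_key
    return variants[-1][0]  # guard against floating-point drift
```

Because the bucket depends only on the experiment key and the user ID, any Lambda instance (or SDK) computes the same answer without coordination or shared state.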
Event Processor Lambda
Processes incoming tracking events (impressions and conversions) at scale. Triggered by events arriving on the Kinesis stream.
- Validates events against known experiment/metric configurations
- Writes processed events to OpenSearch for analytics queries
- Updates DynamoDB counters atomically
Feature Flag Evaluation Lambda
Evaluates feature flag targeting rules and rollout percentages for a given user. Called by SDK clients for server-side flag evaluation.
- Input: flag_key, user_id, attributes
- Output: enabled (boolean), flag metadata
- Results are cached at the Lambda layer using an in-memory LRU cache
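A sketch of percentage-rollout evaluation with in-memory LRU caching — the bucketing scheme and cache size are illustrative assumptions:

```python
import functools
import hashlib

@functools.lru_cache(maxsize=4096)
def evaluate_flag(flag_key, user_id, rollout_percentage):
    """Return True if the user falls inside the flag's rollout percentage.

    Bucketing is deterministic, so a user who is enabled at 10% stays
    enabled as the rollout grows to 50% or 100%.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percentage
```

The deterministic bucket is what makes gradual rollouts monotonic: raising the percentage only ever adds users, never flips already-enabled users back off.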
Real-Time Counters (DynamoDB)
Impression and conversion counts are tracked in DynamoDB using atomic ADD operations. This provides high-throughput, conflict-free counter increments without database locking.
Why DynamoDB for Counters
Traditional relational databases struggle with high-frequency counter updates because each update requires a read-modify-write cycle. DynamoDB's atomic ADD operation performs the increment server-side, making it safe for concurrent writes from many Lambda instances.
Counter Schema
partition_key: experiment_id
sort_key: variant_id#metric_key
impressions: (atomic counter)
conversions: (atomic counter)
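The atomic increment maps directly onto a DynamoDB UpdateItem call with an ADD expression. A sketch of the request parameters (the table name is illustrative; pass the resulting dict to a boto3 client's update_item):

```python
def build_counter_update(experiment_id, variant_id, metric_key,
                         impressions=0, conversions=0):
    """Build UpdateItem parameters for an atomic ADD on the counter table.

    ADD is applied server-side by DynamoDB, so concurrent Lambdas can
    increment the same item without a read-modify-write race.
    """
    return {
        "TableName": "experiment_counters",  # illustrative name
        "Key": {
            "partition_key": {"S": experiment_id},
            "sort_key": {"S": f"{variant_id}#{metric_key}"},
        },
        "UpdateExpression": "ADD impressions :i, conversions :c",
        "ExpressionAttributeValues": {
            ":i": {"N": str(impressions)},
            ":c": {"N": str(conversions)},
        },
    }
```

ADD also creates the attribute if it does not exist yet, so the first event for a new variant/metric pair needs no separate initialization step.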
The metrics collector scheduler reads from DynamoDB every 15 minutes and writes aggregated results to PostgreSQL for historical storage and statistical analysis.
Analytics Pipeline
Raw events flow through a pipeline for aggregation and search:
Client App
|
v
POST /api/v1/tracking/track
|
v
Kinesis Data Stream
|
v
Event Processor Lambda
|
+---> DynamoDB (atomic counters, real-time)
|
+---> OpenSearch (full event index, for ad-hoc queries)
Kinesis buffers events and decouples ingestion from processing. The platform can handle spikes without dropping events.
OpenSearch provides the backing store for dimensional analysis, segment breakdowns, and ad-hoc event queries.
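Kinesis delivers records to the Lambda base64-encoded. A sketch of the Event Processor's entry point — validation here is reduced to an event-type check; the real handler also checks experiment/metric configurations and writes to DynamoDB and OpenSearch:

```python
import base64
import json

def handler(event, context=None):
    """Sketch of a Kinesis-triggered Lambda entry point."""
    processed = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        evt = json.loads(payload)
        if evt.get("event_type") not in ("impression", "conversion"):
            continue  # drop events we cannot attribute
        processed.append(evt)
        # real handler: atomic-ADD DynamoDB counters, index into OpenSearch
    return {"processed": len(processed)}
```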
Split URL Testing (Lambda@Edge)
Split URL experiments use a different flow from standard A/B tests. Instead of modifying a component within a page, the entire URL path changes between variants. This is handled at the CDN layer:
User Request
|
v
CloudFront Distribution
|
v
Lambda@Edge (viewer-request event)
|
+-- Read cookie exp_{experiment_key}
| - Present: use stored variant
| - Absent: consistent-hash bucket user_id → assign variant
|
+-- Return 302 redirect to variant URL
|
+-- Set-Cookie: exp_{key}=variant; Max-Age=31536000; Secure; SameSite=Lax
The Lambda@Edge function is deployed to us-east-1 (a requirement for Lambda@Edge) and runs globally on every CloudFront PoP. The split_url_config is fetched from the API at cold start and cached for 60 seconds.
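The viewer-request flow above can be sketched in Python (Lambda@Edge supports Python runtimes). The cookie name, variant paths, and use of the client IP as a stand-in identity are all illustrative — the real function uses the exp_{experiment_key} cookie and the fetched split_url_config:

```python
import hashlib

VARIANT_PATHS = {"control": "/", "treatment": "/v2/"}  # illustrative config

def viewer_request_handler(event, context=None):
    """Sketch of a CloudFront viewer-request handler for split URL tests."""
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    cookie_header = "".join(h["value"] for h in headers.get("cookie", []))

    if "exp_split=" in cookie_header:
        # Returning visitor: honor the stored assignment.
        variant = cookie_header.split("exp_split=")[1].split(";")[0]
    else:
        # New visitor: consistent-hash bucket into a variant.
        user_id = request.get("clientIp", "anonymous")  # stand-in identity
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest()[:8], 16) % 2
        variant = "treatment" if bucket else "control"

    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": VARIANT_PATHS[variant]}],
            "set-cookie": [{
                "key": "Set-Cookie",
                "value": f"exp_split={variant}; Max-Age=31536000; Secure; SameSite=Lax",
            }],
        },
    }
```

Returning a response object from the viewer-request handler short-circuits CloudFront, so the redirect is served at the edge without reaching the origin.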
Authentication
AWS Cognito
User authentication is handled by AWS Cognito. The platform integrates with a Cognito User Pool:
- Users log in via the dashboard or API
- Cognito issues a JWT access token (valid for 30 minutes) and a refresh token
- The JWT is passed in the Authorization: Bearer <token> header on all API requests
- The API validates the JWT signature against Cognito's public keys
API Key Authentication
SDK clients and server-to-server integrations use API keys instead of JWT tokens. API keys are:
- Created by admins via POST /api/v1/api-keys
- Passed in the X-API-Key: <key> header
- Scoped to specific operations (read, write, or admin)
- Revocable without affecting user accounts
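If the three scopes form a strict hierarchy (a plausible reading, though the source does not say so explicitly), the authorization check reduces to a rank comparison. A hypothetical helper:

```python
# Hypothetical scope-check helper; scope names mirror the read/write/admin tiers.
SCOPE_RANK = {"read": 0, "write": 1, "admin": 2}

def key_allows(key_scope, required_scope):
    """A key satisfies a requirement if its scope is at least as broad."""
    return SCOPE_RANK[key_scope] >= SCOPE_RANK[required_scope]
```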
Role-Based Access Control
Every user has one of four roles, each with progressively broader permissions:
| Role | Description |
|---|---|
| VIEWER | Read-only access to approved experiments and results |
| ANALYST | View all experiments, results, audit logs, and reports |
| DEVELOPER | Create and manage experiments and feature flags |
| ADMIN | Full access: user management, global settings, compliance exports |
Custom roles and direct permission grants extend the base role system for fine-grained access control.
CDK Infrastructure
All AWS infrastructure is defined as code using AWS CDK v2 (TypeScript). Running cdk deploy --all provisions the complete environment from scratch.
Stacks
| Stack | Contents |
|---|---|
| NetworkStack | VPC, subnets, security groups |
| DatabaseStack | Aurora PostgreSQL, ElastiCache Redis |
| ApiStack | ECS Fargate cluster, ALB, ECS service, IAM roles |
| LambdaStack | Assignment, Event Processor, and Feature Flag Evaluation Lambdas |
| DynamoStack | DynamoDB tables for real-time counters |
| StreamingStack | Kinesis stream, OpenSearch domain |
| MonitoringStack | CloudWatch dashboards, alarms, log groups |
| SplitUrlStack | CloudFront distribution, Lambda@Edge function |
Deployment Model
The API service uses blue/green deployment via AWS CodeDeploy:
- A new task definition is registered (the "green" environment)
- CodeDeploy gradually shifts traffic from the old (blue) to the new (green) deployment
- If health checks fail, traffic is automatically shifted back to blue
- The full cutover completes within minutes with zero downtime
Health and Observability
| Endpoint / Resource | Purpose |
|---|---|
| GET /health | Returns {"status": "ok"} when the API is healthy |
| GET /metrics | Prometheus-format metrics (request count, latency histograms, error rates) |
| CloudWatch Dashboards | API latency, Lambda invocations, error rates, queue depth |
| CloudWatch Alarms | p99 latency > 1s, error rate > 1%, DLQ messages > 0 |
| Structlog JSON logs | Structured logs with request_id, user_id, action, duration_ms |
| AWS X-Ray | Distributed tracing across API, Lambda, and DynamoDB calls |