AWS Integration
The platform is designed to run natively on AWS. This document describes how each AWS service is used and how to configure the integration.
ECS Fargate
The FastAPI application runs on Amazon ECS Fargate — a serverless container execution environment that removes the need to manage EC2 instances.
Deployment Architecture
- The Fargate service runs multiple task replicas behind an Application Load Balancer (ALB)
- Task health is monitored via the `/health` endpoint; unhealthy tasks are automatically replaced
- Horizontal scaling is configured via ECS service auto-scaling based on CPU and memory utilization
Blue/Green Deployments
The platform uses AWS CodeDeploy for zero-downtime blue/green deployments:
- A new task definition is registered with the updated container image
- CodeDeploy creates a "green" target group and shifts 10% of traffic to it
- After a configurable bake time (default 5 minutes), full traffic is shifted to green
- If health checks fail at any stage, traffic shifts back to the blue target group automatically
To deploy a new version:
```bash
# Build and push the Docker image
docker build -t your-account.dkr.ecr.region.amazonaws.com/experimentation-api:latest .
docker push your-account.dkr.ecr.region.amazonaws.com/experimentation-api:latest

# Deploy via CDK (re-deploys the ECS service with the new image)
cd infrastructure && cdk deploy ApiStack
```
Environment Variables
The ECS task reads configuration from AWS Secrets Manager at startup. Set the following as ECS task environment variables or Secrets Manager references:
| Variable | Description |
|---|---|
| `DATABASE_URL` | Aurora PostgreSQL connection string |
| `REDIS_URL` | ElastiCache Redis URL |
| `SECRET_KEY` | Application secret key (min 32 chars) |
| `COGNITO_USER_POOL_ID` | AWS Cognito User Pool ID |
| `COGNITO_CLIENT_ID` | Cognito app client ID |
| `AWS_REGION` | AWS region for DynamoDB and Kinesis calls |
Aurora PostgreSQL
Amazon Aurora PostgreSQL is the primary relational database, storing all experiments, feature flags, users, results, and audit data.
Connection Configuration
```bash
# Environment variable format
DATABASE_URL=postgresql://username:password@aurora-cluster.cluster-xxxx.region.rds.amazonaws.com:5432/experimentation
```
Read Replicas
Aurora automatically maintains a read replica. To direct read-heavy queries to the replica, set:
```bash
DATABASE_REPLICA_URL=postgresql://username:password@aurora-cluster.cluster-ro-xxxx.region.rds.amazonaws.com:5432/experimentation
```
The API uses the replica URL for read-only queries (results pages, audit log reads) to reduce load on the primary.
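As a sketch of how the replica routing might look in application code (the helper name and environment access are illustrative, not the project's actual implementation):

```python
import os

def database_url(read_only: bool, env=os.environ) -> str:
    """Pick the replica endpoint for read-only work when one is configured,
    otherwise fall back to the primary writer endpoint."""
    if read_only and env.get("DATABASE_REPLICA_URL"):
        return env["DATABASE_REPLICA_URL"]
    return env["DATABASE_URL"]
```

Falling back to the primary when no replica URL is set keeps the same code path working in environments without a read replica.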
Connection Pooling
The application uses SQLAlchemy with a connection pool. For production, configure:
```bash
DATABASE_POOL_SIZE=20
DATABASE_MAX_OVERFLOW=10
DATABASE_POOL_TIMEOUT=30
```
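These settings might be wired into SQLAlchemy roughly as follows (a sketch; the helper and its defaults are assumptions mirroring the values above, not the project's actual code):

```python
import os

# Defaults mirror the recommended production values above.
_DEFAULTS = {"DATABASE_POOL_SIZE": "20", "DATABASE_MAX_OVERFLOW": "10", "DATABASE_POOL_TIMEOUT": "30"}

def pool_kwargs(env=os.environ) -> dict:
    """Translate the DATABASE_POOL_* env vars into create_engine() keyword args."""
    def get(key: str) -> int:
        return int(env.get(key, _DEFAULTS[key]))
    return {
        "pool_size": get("DATABASE_POOL_SIZE"),
        "max_overflow": get("DATABASE_MAX_OVERFLOW"),
        "pool_timeout": get("DATABASE_POOL_TIMEOUT"),
    }

# Usage (assuming SQLAlchemy):
#   engine = create_engine(os.environ["DATABASE_URL"], **pool_kwargs())
```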
ElastiCache Redis
Amazon ElastiCache for Redis serves two roles:
Session Storage
User JWT sessions are stored in Redis with a TTL matching the token expiry time. This allows the API service to scale horizontally without sticky sessions — any task can validate any user's session.
```bash
REDIS_URL=redis://your-cluster.xxxxx.0001.use1.cache.amazonaws.com:6379
```
Application Cache
Feature flag configurations and experiment assignments are cached in Redis. The default TTL is 60 seconds. Changes to flags and experiments propagate to all users within one cache cycle.
```bash
REDIS_CACHE_TTL=60          # Cache TTL in seconds
REDIS_CACHE_MAX_SIZE=10000  # Max items in cache
```
DynamoDB
Amazon DynamoDB stores real-time impression and conversion counters using atomic update operations. This allows thousands of concurrent Lambda invocations to increment counters without database contention.
Table Structure
| Table | Partition Key | Sort Key | Purpose |
|---|---|---|---|
| `ExperimentCounters` | `experiment_id` | `variant_id#metric_key` | Per-variant metric counts |
Atomic Increment
The Event Processor Lambda calls DynamoDB `UpdateItem` with the update expression `SET #counter = if_not_exists(#counter, :zero) + :increment` to safely increment counters from concurrent invocations.
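A sketch of the `UpdateItem` request the Lambda might build. The counter attribute name `event_count` is an assumption; the keys follow the table structure above:

```python
def increment_request(experiment_id: str, variant_id: str, metric_key: str, count: int = 1) -> dict:
    """Build kwargs for boto3 dynamodb client.update_item: an atomic,
    initialize-if-missing counter increment."""
    return {
        "TableName": "ExperimentCounters",
        "Key": {
            "experiment_id": {"S": experiment_id},
            "variant_id#metric_key": {"S": f"{variant_id}#{metric_key}"},
        },
        "UpdateExpression": "SET #counter = if_not_exists(#counter, :zero) + :increment",
        "ExpressionAttributeNames": {"#counter": "event_count"},
        "ExpressionAttributeValues": {":zero": {"N": "0"}, ":increment": {"N": str(count)}},
    }

# Usage: boto3.client("dynamodb").update_item(**increment_request("exp-1", "control", "impressions"))
```

Because the increment happens server-side in a single request, no read-modify-write race is possible between concurrent invocations.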
Configuration
```bash
DYNAMODB_TABLE_NAME=ExperimentCounters
DYNAMODB_REGION=us-east-1
```
Lambda Functions
Three Lambda functions handle high-throughput operations. They are deployed in the same AWS region as the API service.
Experiment Assignment Lambda
- Trigger: Direct invocation from SDK clients
- Function: Evaluates targeting rules, performs consistent-hash bucketing, returns variant assignment
- Environment variables: `EXPERIMENTATION_API_URL`, `DYNAMODB_TABLE_NAME`
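Consistent-hash bucketing can be sketched like this (the hash choice and the weight format are illustrative assumptions, not the Lambda's actual code):

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str, variants: list[tuple[str, float]]) -> str:
    """Deterministically map a user to a variant.

    `variants` is a list of (variant_id, weight) pairs whose weights sum to 1.0.
    The same (user, experiment) pair always hashes to the same variant.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    cumulative = 0.0
    for variant_id, weight in variants:
        cumulative += weight
        if point <= cumulative:
            return variant_id
    return variants[-1][0]  # guard against float rounding
```

Including the experiment key in the hash input keeps assignments independent across experiments for the same user.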
Event Processor Lambda
- Trigger: Kinesis Data Stream (`ExperimentEvents`)
- Function: Validates events, updates DynamoDB counters, indexes events in OpenSearch
- Batch size: 100 records per invocation (configurable)
- Bisect on error: Enabled — failed batches are split to isolate bad records
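The bisect-on-error behavior can be illustrated with a small recursive sketch (the processing callback is hypothetical; for Kinesis, this splitting is performed by the event source mapping itself, not by application code):

```python
def isolate_bad_records(records: list, process_batch) -> list:
    """Process a batch; on failure, split it in half recursively until the
    poison records are isolated. Returns the records that could not be processed."""
    try:
        process_batch(records)
        return []
    except Exception:
        if len(records) == 1:
            return records  # a single failing record: the culprit
        mid = len(records) // 2
        return (isolate_bad_records(records[:mid], process_batch)
                + isolate_bad_records(records[mid:], process_batch))
```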
Feature Flag Evaluation Lambda
- Trigger: Direct invocation from SDK clients
- Function: Evaluates feature flag targeting rules and rollout percentage
- Cache: In-memory LRU cache (TTL 60s, max 1000 entries) to reduce API calls
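A minimal TTL-bounded LRU cache along these lines (a sketch, not the Lambda's actual implementation):

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache whose entries also expire after `ttl` seconds."""

    def __init__(self, ttl: float = 60.0, max_size: int = 1000):
        self.ttl, self.max_size = ttl, max_size
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(key)
        if item is None or item[0] < now:
            self._data.pop(key, None)  # drop expired entry
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return item[1]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (now + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
```

The TTL bound caps how stale a flag evaluation can be, while the size bound keeps the Lambda's memory footprint predictable across warm invocations.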
Lambda Environment Variables
```bash
EXPERIMENTATION_API_URL=https://your-api.example.com
EXPERIMENTATION_API_KEY=your-internal-key
DYNAMODB_TABLE_NAME=ExperimentCounters
OPENSEARCH_ENDPOINT=https://your-domain.es.amazonaws.com
```
CloudFront and Lambda@Edge
Amazon CloudFront serves as the CDN for static frontend assets and as the execution layer for split URL experiments via Lambda@Edge.
Static Asset Distribution
The Next.js frontend build output is hosted in an S3 bucket and served through CloudFront with long-lived cache headers for hashed assets and short TTLs for HTML.
Lambda@Edge for Split URL Testing
For split URL experiments, a Lambda@Edge function is attached to the CloudFront distribution's viewer-request event:
- Reads the assignment cookie (`exp_{experiment_key}`)
- If absent, hashes the user ID to assign a variant
- Returns a `302 Found` redirect to the variant URL
- Sets a 1-year `Set-Cookie` header for assignment persistence
Lambda@Edge functions must be deployed to us-east-1 (a CloudFront requirement) and are globally replicated to all edge locations.
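A sketch of such a viewer-request handler in Python (Lambda@Edge supports Python runtimes). The experiment key, variant paths, and the `uid` cookie used as the user ID are illustrative assumptions:

```python
import hashlib
from http.cookies import SimpleCookie

EXPERIMENT_KEY = "checkout-flow-v2"
COOKIE_NAME = f"exp_{EXPERIMENT_KEY}"
VARIANT_URLS = {"control": "/checkout", "treatment": "/checkout-v2"}
ONE_YEAR = 31_536_000  # seconds

def assign_variant(user_id: str) -> str:
    # Hash the user ID for a sticky 50/50 split
    point = int(hashlib.sha256(f"{EXPERIMENT_KEY}:{user_id}".encode()).hexdigest()[:8], 16)
    return "treatment" if point % 100 < 50 else "control"

def handler(event, context=None):
    request = event["Records"][0]["cf"]["request"]
    cookie_header = request["headers"].get("cookie", [{"value": ""}])[0]["value"]
    jar = SimpleCookie(cookie_header)
    if COOKIE_NAME in jar:
        variant = jar[COOKIE_NAME].value  # returning visitor: keep assignment
    else:
        user_id = jar["uid"].value if "uid" in jar else "anonymous"
        variant = assign_variant(user_id)  # new visitor: hash to a variant
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": VARIANT_URLS[variant]}],
            "set-cookie": [{
                "key": "Set-Cookie",
                "value": f"{COOKIE_NAME}={variant}; Max-Age={ONE_YEAR}; Path=/",
            }],
        },
    }
```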
```typescript
// CDK construct usage (infrastructure/app.ts)
import { SplitUrlDistribution } from './constructs/SplitUrlDistribution';

new SplitUrlDistribution(this, 'CheckoutSplitUrl', {
  experimentKey: 'checkout-flow-v2',
  originDomainName: alb.loadBalancerDnsName,
  experimentationApiUrl: 'https://your-api.example.com',
  experimentationApiKey: apiKeySecret.secretValue.toString(),
});
```
Kinesis and OpenSearch
Kinesis Data Stream
Raw tracking events (impressions and conversions) are published to a Kinesis Data Stream (`ExperimentEvents`). This decouples event ingestion from processing and buffers traffic spikes.
```bash
KINESIS_STREAM_NAME=ExperimentEvents
KINESIS_REGION=us-east-1
```
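Publishing a tracking event might look like this (the payload fields are illustrative assumptions; partitioning by user ID keeps one user's events ordered within a shard):

```python
import json
import time

def tracking_record(event_type: str, experiment_id: str, user_id: str,
                    stream: str = "ExperimentEvents") -> dict:
    """Build kwargs for boto3 kinesis client.put_record."""
    payload = {
        "event_type": event_type,  # "impression" or "conversion"
        "experiment_id": experiment_id,
        "user_id": user_id,
        "timestamp_ms": int(time.time() * 1000),
    }
    return {
        "StreamName": stream,
        "Data": json.dumps(payload).encode(),
        "PartitionKey": user_id,  # same user -> same shard -> ordered
    }

# Usage: boto3.client("kinesis").put_record(**tracking_record("impression", "exp-1", "u42"))
```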
OpenSearch
The Event Processor Lambda indexes processed events into Amazon OpenSearch Service for ad-hoc queries, dimensional analysis, and segment breakdowns.
```bash
OPENSEARCH_ENDPOINT=https://your-domain.us-east-1.es.amazonaws.com
OPENSEARCH_INDEX_PREFIX=experimentation-events
```
CDK Deployment
All infrastructure is defined in `infrastructure/` using AWS CDK v2 (TypeScript).
Prerequisites
- AWS account with appropriate IAM permissions
- Node.js 18+
- AWS CDK v2: `npm install -g aws-cdk`
- AWS CLI configured: `aws configure`
Deploy
```bash
# One-time bootstrap (per account/region)
cd infrastructure
cdk bootstrap aws://YOUR_ACCOUNT_ID/YOUR_REGION

# Deploy all stacks
cdk deploy --all

# Deploy a specific stack
cdk deploy ApiStack
```
Stacks Deployed
| Stack | Resources |
|---|---|
| `NetworkStack` | VPC, subnets, NAT gateways, security groups |
| `DatabaseStack` | Aurora cluster, ElastiCache cluster, subnet groups |
| `ApiStack` | ECS cluster, Fargate service, ALB, IAM roles |
| `LambdaStack` | Assignment, EventProcessor, FeatureFlagEvaluation Lambda functions |
| `DynamoStack` | DynamoDB table for counters |
| `StreamingStack` | Kinesis stream, OpenSearch domain |
| `MonitoringStack` | CloudWatch dashboards, alarms, log groups |
| `SplitUrlStack` | CloudFront distribution, Lambda@Edge |
Required IAM Permissions
The ECS task role and Lambda execution roles need the following permissions:
ECS Task Role
```json
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:UpdateItem",
    "dynamodb:Query",
    "kinesis:PutRecord",
    "kinesis:PutRecords",
    "secretsmanager:GetSecretValue",
    "cognito-idp:AdminGetUser",
    "cognito-idp:ListUsers"
  ],
  "Resource": "*"
}
```
Lambda Execution Role (Event Processor)
```json
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:UpdateItem",
    "kinesis:GetRecords",
    "kinesis:GetShardIterator",
    "kinesis:DescribeStream",
    "kinesis:ListShards",
    "es:ESHttpPost",
    "es:ESHttpPut"
  ],
  "Resource": "*"
}
```
Environment Variables Reference
| Variable | Required | Description |
|---|---|---|
| `DATABASE_URL` | Yes | Aurora PostgreSQL connection string |
| `REDIS_URL` | Yes | ElastiCache Redis URL |
| `SECRET_KEY` | Yes | Application secret key (min 32 chars) |
| `COGNITO_USER_POOL_ID` | Yes | AWS Cognito User Pool ID |
| `COGNITO_CLIENT_ID` | Yes | Cognito app client ID |
| `AWS_REGION` | Yes | Primary AWS region |
| `DYNAMODB_TABLE_NAME` | Yes | DynamoDB table name for counters |
| `KINESIS_STREAM_NAME` | Yes | Kinesis stream name for events |
| `OPENSEARCH_ENDPOINT` | Yes | OpenSearch domain endpoint |
| `SLACK_BOT_TOKEN` | No | Slack bot token for alerting |
| `SENDGRID_API_KEY` | No | SendGrid API key for email alerts |
| `AUDIT_HMAC_SECRET` | Yes | Secret for HMAC-SHA256 audit event signing |
| `ANTHROPIC_API_KEY` | No | Claude API key for AI experiment design |