AWS Integration

The platform is designed to run natively on AWS. This document describes how each AWS service is used and how to configure the integration.


ECS Fargate

The FastAPI application runs on Amazon ECS Fargate — a serverless container execution environment that removes the need to manage EC2 instances.

Deployment Architecture

  • The Fargate service runs multiple task replicas behind an Application Load Balancer (ALB)
  • Task health is monitored via the /health endpoint; unhealthy tasks are automatically replaced
  • Horizontal scaling is configured via ECS service auto-scaling based on CPU and memory utilization

Blue/Green Deployments

The platform uses AWS CodeDeploy for zero-downtime blue/green deployments:

  1. A new task definition is registered with the updated container image
  2. CodeDeploy creates a "green" target group and shifts 10% of traffic to it
  3. After a configurable bake time (default 5 minutes), full traffic is shifted to green
  4. If health checks fail at any stage, traffic shifts back to the blue target group automatically

To deploy a new version:

# Build and push the Docker image
docker build -t your-account.dkr.ecr.region.amazonaws.com/experimentation-api:latest .
docker push your-account.dkr.ecr.region.amazonaws.com/experimentation-api:latest

# Deploy via CDK (re-deploys the ECS service with the new image)
cd infrastructure && cdk deploy ApiStack

Environment Variables

The ECS task reads configuration from AWS Secrets Manager at startup. Set the following as ECS task environment variables or Secrets Manager references:

Variable               Description
DATABASE_URL           Aurora PostgreSQL connection string
REDIS_URL              ElastiCache Redis URL
SECRET_KEY             Application secret key (min 32 chars)
COGNITO_USER_POOL_ID   AWS Cognito User Pool ID
COGNITO_CLIENT_ID      Cognito app client ID
AWS_REGION             AWS region for DynamoDB and Kinesis calls
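
The startup behavior described above can be sketched as a fail-fast configuration check. This is an illustrative sketch, not the application's actual config loader; the variable names come from the table, everything else (the function name, the dict-based interface) is an assumption.

```python
# Illustrative sketch: validate the required ECS task configuration at
# startup and fail fast if anything is missing or malformed.
REQUIRED_VARS = [
    "DATABASE_URL",
    "REDIS_URL",
    "SECRET_KEY",
    "COGNITO_USER_POOL_ID",
    "COGNITO_CLIENT_ID",
    "AWS_REGION",
]

def load_config(env):
    """Return the required settings from an environment mapping."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError("Missing required configuration: " + ", ".join(missing))
    if len(env["SECRET_KEY"]) < 32:
        raise RuntimeError("SECRET_KEY must be at least 32 characters")
    return {name: env[name] for name in REQUIRED_VARS}
```

In practice `env` would be `os.environ`, with Secrets Manager references already resolved by ECS before the container starts.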

Aurora PostgreSQL

Amazon Aurora PostgreSQL is the primary relational database, storing all experiments, feature flags, users, results, and audit data.

Connection Configuration

# Environment variable format
DATABASE_URL=postgresql://username:password@aurora-cluster.cluster-xxxx.region.rds.amazonaws.com:5432/experimentation

Read Replicas

Aurora exposes a read-only cluster endpoint (cluster-ro-...) that distributes connections across the cluster's read replicas. To direct read-heavy queries to the replicas, set:

DATABASE_REPLICA_URL=postgresql://username:password@aurora-cluster.cluster-ro-xxxx.region.rds.amazonaws.com:5432/experimentation

The API uses the replica URL for read-only queries (results pages, audit log reads) to reduce load on the primary.
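
The routing rule above reduces to a small URL-selection helper. A minimal sketch, assuming a boolean read-only flag per query path; the function name and signature are illustrative, not the API's actual code:

```python
def select_database_url(read_only, primary_url, replica_url):
    """Route read-only queries (results pages, audit log reads) to the
    replica endpoint when one is configured; writes always go to the
    primary. Falls back to the primary if no replica URL is set."""
    if read_only and replica_url:
        return replica_url
    return primary_url
```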

Connection Pooling

The application uses SQLAlchemy with a connection pool. For production, configure:

DATABASE_POOL_SIZE=20
DATABASE_MAX_OVERFLOW=10
DATABASE_POOL_TIMEOUT=30
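
These variables map onto SQLAlchemy's QueuePool parameters. A sketch of that translation, assuming the defaults above; the helper name is illustrative, and the commented `create_engine` call shows where the result would be used:

```python
# Translate the DATABASE_POOL_* variables into SQLAlchemy
# create_engine() keyword arguments (pool_size, max_overflow,
# pool_timeout are SQLAlchemy's QueuePool parameter names).
def pool_kwargs(env):
    return {
        "pool_size": int(env.get("DATABASE_POOL_SIZE", "20")),
        "max_overflow": int(env.get("DATABASE_MAX_OVERFLOW", "10")),
        "pool_timeout": int(env.get("DATABASE_POOL_TIMEOUT", "30")),
    }

# engine = create_engine(os.environ["DATABASE_URL"], **pool_kwargs(os.environ))
```

With these settings the service holds up to 20 pooled connections per task, bursting to 30 (pool_size + max_overflow), and waits at most 30 seconds for a free connection.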

ElastiCache Redis

Amazon ElastiCache for Redis serves two purposes:

Session Storage

User JWT sessions are stored in Redis with a TTL matching the token expiry time. This allows the API service to scale horizontally without sticky sessions — any task can validate any user's session.

REDIS_URL=redis://your-cluster.xxxxx.0001.use1.cache.amazonaws.com:6379
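
The "TTL matching the token expiry" rule can be sketched as a key/TTL computation. The `session:{jti}` key format is an illustrative assumption, not the platform's documented schema; the commented `setex` call shows how redis-py would store the result:

```python
import time

def session_entry(jti, token_exp, now=None):
    """Return the (key, ttl) pair for a JWT session, keyed by the token's
    unique ID (jti) and expiring together with the token itself."""
    now = int(time.time()) if now is None else now
    ttl = max(token_exp - now, 0)  # TTL = remaining token lifetime
    return "session:" + jti, ttl

# With redis-py:  r.setex(key, ttl, session_payload_json)
```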

Application Cache

Feature flag configurations and experiment assignments are cached in Redis. The default TTL is 60 seconds. Changes to flags and experiments propagate to all users within one cache cycle.

REDIS_CACHE_TTL=60           # Cache TTL in seconds
REDIS_CACHE_MAX_SIZE=10000   # Max items in cache
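
The propagation guarantee above (changes visible within one cache cycle) follows from read-through caching with a fixed TTL. A minimal stdlib sketch of that behavior; in production the cache lives in Redis, and a dict stands in here:

```python
import time

CACHE_TTL = 60.0  # mirrors REDIS_CACHE_TTL
_cache = {}       # key -> (stored_at, value)

def get_flag(key, loader, now=None):
    """Read-through cache: serve a cached flag config while it is fresh,
    reload via `loader` once the entry is older than CACHE_TTL. Worst-case
    staleness after a flag change is therefore one TTL (60 s by default)."""
    now = time.monotonic() if now is None else now
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < CACHE_TTL:
        return hit[1]              # still fresh: serve cached config
    value = loader(key)            # expired or missing: reload
    _cache[key] = (now, value)
    return value
```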

DynamoDB

Amazon DynamoDB stores real-time impression and conversion counters using atomic update expressions. This allows thousands of concurrent Lambda invocations to increment counters without database contention.

Table Structure

Table                Partition Key    Sort Key                Purpose
ExperimentCounters   experiment_id    variant_id#metric_key   Per-variant metric counts

Atomic Increment

The Event Processor Lambda calls DynamoDB UpdateItem with the update expression SET #counter = if_not_exists(#counter, :zero) + :increment to safely increment counters across concurrent invocations.
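
That call can be sketched as the parameter dict passed to boto3's `table.update_item()`. The update expression matches the one above and the key schema follows the table layout; the sort-key attribute name (`variant_metric`) and the counter attribute name (`count`) are hypothetical, since the document does not name them:

```python
def increment_params(experiment_id, variant_id, metric_key, by=1):
    """Build UpdateItem parameters for one atomic counter increment.
    if_not_exists initializes the counter to 0 on first write, so the
    same expression works for both new and existing items."""
    return {
        "Key": {
            "experiment_id": experiment_id,
            "variant_metric": variant_id + "#" + metric_key,  # assumed sort-key name
        },
        "UpdateExpression": "SET #counter = if_not_exists(#counter, :zero) + :increment",
        "ExpressionAttributeNames": {"#counter": "count"},    # assumed attribute name
        "ExpressionAttributeValues": {":zero": 0, ":increment": by},
    }

# boto3.resource("dynamodb").Table("ExperimentCounters").update_item(**increment_params(...))
```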

Configuration

DYNAMODB_TABLE_NAME=ExperimentCounters
DYNAMODB_REGION=us-east-1

Lambda Functions

Three Lambda functions handle high-throughput operations. They are deployed in the same AWS region as the API service.

Experiment Assignment Lambda

  • Trigger: Direct invocation from SDK clients
  • Function: Evaluates targeting rules, performs consistent-hash bucketing, returns variant assignment
  • Environment variables: EXPERIMENTATION_API_URL, DYNAMODB_TABLE_NAME
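
Consistent-hash bucketing, as named above, can be illustrated with a short sketch: hash (experiment_key, user_id) to a stable point in [0, 1] and pick the variant whose cumulative weight range contains it. The exact hash function and weighting scheme used by the Assignment Lambda are not specified in this document; SHA-256 and (name, weight) pairs are assumptions:

```python
import hashlib

def bucket(experiment_key, user_id, variants):
    """Deterministically assign a user to a variant.
    `variants` is a list of (name, weight) pairs whose weights sum to 1.
    The same (experiment, user) pair always maps to the same variant."""
    key = (experiment_key + ":" + user_id).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if point < cumulative:
            return name
    return variants[-1][0]  # guard against float rounding at the boundary
```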

Event Processor Lambda

  • Trigger: Kinesis Data Stream (ExperimentEvents)
  • Function: Validates events, updates DynamoDB counters, indexes events in OpenSearch
  • Batch size: 100 records per invocation (configurable)
  • Bisect on error: Enabled — failed batches are split to isolate bad records

Feature Flag Evaluation Lambda

  • Trigger: Direct invocation from SDK clients
  • Function: Evaluates feature flag targeting rules and rollout percentage
  • Cache: In-memory LRU cache (TTL 60s, max 1000 entries) to reduce API calls
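
The cache limits quoted above (60 s TTL, max 1000 entries) can be sketched with an OrderedDict-backed LRU. This illustrates the eviction behavior only; the Lambda's actual implementation is not shown in this document:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """LRU cache with per-entry TTL: stale entries are dropped on read,
    and the least recently used entry is evicted when the size cap is hit."""

    def __init__(self, max_entries=1000, ttl=60.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (stored_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(key)
        if item is None or now - item[0] >= self.ttl:
            self._data.pop(key, None)   # drop expired entries
            return None
        self._data.move_to_end(key)     # mark as most recently used
        return item[1]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (now, value)
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```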

Lambda Environment Variables

EXPERIMENTATION_API_URL=https://your-api.example.com
EXPERIMENTATION_API_KEY=your-internal-key
DYNAMODB_TABLE_NAME=ExperimentCounters
OPENSEARCH_ENDPOINT=https://your-domain.es.amazonaws.com

CloudFront and Lambda@Edge

Amazon CloudFront serves as the CDN for static frontend assets and as the execution layer for split URL experiments via Lambda@Edge.

Static Asset Distribution

The Next.js frontend build output is hosted in an S3 bucket and served through CloudFront with long-lived cache headers for hashed assets and short TTLs for HTML.

Lambda@Edge for Split URL Testing

For split URL experiments, a Lambda@Edge function is attached to the CloudFront distribution's viewer-request event:

  1. Reads the assignment cookie (exp_{experiment_key})
  2. If absent, hashes the user ID to assign a variant
  3. Returns a 302 Found redirect to the variant URL
  4. Sets a 1-year Set-Cookie header for assignment persistence
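
The four steps above can be sketched as a Python viewer-request handler (Lambda@Edge supports Python as well as Node.js). The event shape follows CloudFront's Lambda@Edge record format; the cookie name, variant URLs, 50/50 split, and the use of the client IP as a stand-in user ID are illustrative assumptions:

```python
import hashlib

COOKIE = "exp_checkout-flow-v2"  # follows the exp_{experiment_key} convention
VARIANT_URLS = {                 # hypothetical variant destinations
    "control": "https://example.com/checkout",
    "treatment": "https://example.com/checkout-v2",
}

def get_cookie(request, name):
    """Extract a cookie value from a CloudFront request's headers."""
    for header in request.get("headers", {}).get("cookie", []):
        for part in header["value"].split(";"):
            key, _, value = part.strip().partition("=")
            if key == name:
                return value
    return None

def assign(user_id):
    """Deterministic 50/50 hash bucketing on the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return "treatment" if int(digest[:8], 16) % 2 else "control"

def handler(event, context=None):
    request = event["Records"][0]["cf"]["request"]
    variant = get_cookie(request, COOKIE)
    if variant not in VARIANT_URLS:
        # No (valid) assignment cookie: bucket by user ID. The client IP
        # is a stand-in here; the real function's identity source isn't
        # specified in this document.
        variant = assign(request.get("clientIp", "anonymous"))
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": VARIANT_URLS[variant]}],
            "set-cookie": [{
                "key": "Set-Cookie",
                # Max-Age 31536000 s = 1 year of assignment persistence
                "value": COOKIE + "=" + variant + "; Max-Age=31536000; Path=/",
            }],
        },
    }
```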

Lambda@Edge functions must be deployed to us-east-1 (a CloudFront requirement) and are globally replicated to all edge locations.

// CDK construct usage (infrastructure/app.ts)
import { SplitUrlDistribution } from './constructs/SplitUrlDistribution';

new SplitUrlDistribution(this, 'CheckoutSplitUrl', {
  experimentKey: 'checkout-flow-v2',
  originDomainName: alb.loadBalancerDnsName,
  experimentationApiUrl: 'https://your-api.example.com',
  experimentationApiKey: apiKeySecret.secretValue.toString(),
});

Kinesis and OpenSearch

Kinesis Data Stream

Raw tracking events (impressions and conversions) are published to a Kinesis Data Stream (ExperimentEvents). This decouples event ingestion from processing and buffers traffic spikes.

KINESIS_STREAM_NAME=ExperimentEvents
KINESIS_REGION=us-east-1
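
Publishing to the stream can be sketched as the parameter dict for boto3's `kinesis.put_record()`. Partitioning by user_id (an assumption; the document does not specify the partition key) keeps each user's events ordered within a single shard; the event fields are illustrative:

```python
import json

def put_record_params(stream_name, event):
    """Build PutRecord parameters for one tracking event. Kinesis routes
    the record to a shard by hashing PartitionKey, so all events for a
    given user land on the same shard in order."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": event["user_id"],
    }

# boto3.client("kinesis", region_name=...).put_record(**put_record_params("ExperimentEvents", evt))
```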

OpenSearch

The Event Processor Lambda indexes processed events into Amazon OpenSearch Service for ad-hoc queries, dimensional analysis, and segment breakdowns.

OPENSEARCH_ENDPOINT=https://your-domain.us-east-1.es.amazonaws.com
OPENSEARCH_INDEX_PREFIX=experimentation-events

CDK Deployment

All infrastructure is defined in infrastructure/ using AWS CDK v2 (TypeScript).

Prerequisites

  • AWS account with appropriate IAM permissions
  • Node.js 18+
  • AWS CDK v2: npm install -g aws-cdk
  • AWS CLI configured: aws configure

Deploy

# One-time bootstrap (per account/region)
cd infrastructure
cdk bootstrap aws://YOUR_ACCOUNT_ID/YOUR_REGION

# Deploy all stacks
cdk deploy --all

# Deploy a specific stack
cdk deploy ApiStack

Stacks Deployed

Stack             Resources
NetworkStack      VPC, subnets, NAT gateways, security groups
DatabaseStack     Aurora cluster, ElastiCache cluster, subnet groups
ApiStack          ECS cluster, Fargate service, ALB, IAM roles
LambdaStack       Assignment, EventProcessor, FeatureFlagEvaluation Lambda functions
DynamoStack       DynamoDB table for counters
StreamingStack    Kinesis stream, OpenSearch domain
MonitoringStack   CloudWatch dashboards, alarms, log groups
SplitUrlStack     CloudFront distribution, Lambda@Edge

Required IAM Permissions

The ECS task role and Lambda execution roles need the following permissions:

ECS Task Role

{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:UpdateItem",
    "dynamodb:Query",
    "kinesis:PutRecord",
    "kinesis:PutRecords",
    "secretsmanager:GetSecretValue",
    "cognito-idp:AdminGetUser",
    "cognito-idp:ListUsers"
  ],
  "Resource": "*"
}

Lambda Execution Role (Event Processor)

{
  "Effect": "Allow",
  "Action": [
    "dynamodb:UpdateItem",
    "kinesis:GetRecords",
    "kinesis:GetShardIterator",
    "kinesis:DescribeStream",
    "kinesis:ListShards",
    "es:ESHttpPost",
    "es:ESHttpPut"
  ],
  "Resource": "*"
}

Environment Variables Reference

Variable               Required   Description
DATABASE_URL           Yes        Aurora PostgreSQL connection string
REDIS_URL              Yes        ElastiCache Redis URL
SECRET_KEY             Yes        Application secret key (min 32 chars)
COGNITO_USER_POOL_ID   Yes        AWS Cognito User Pool ID
COGNITO_CLIENT_ID      Yes        Cognito app client ID
AWS_REGION             Yes        Primary AWS region
DYNAMODB_TABLE_NAME    Yes        DynamoDB table name for counters
KINESIS_STREAM_NAME    Yes        Kinesis stream name for events
OPENSEARCH_ENDPOINT    Yes        OpenSearch domain endpoint
SLACK_BOT_TOKEN        No         Slack bot token for alerting
SENDGRID_API_KEY       No         SendGrid API key for email alerts
AUDIT_HMAC_SECRET      Yes        Secret for HMAC-SHA256 audit event signing
ANTHROPIC_API_KEY      No         Claude API key for AI experiment design