# Monitoring and Observability
The platform integrates with AWS CloudWatch, Prometheus, and structlog to provide comprehensive visibility into API performance, Lambda execution, database health, and application behavior.
## CloudWatch Dashboards

### Deploying Dashboards

```bash
cd infrastructure/cloudwatch
./deploy-dashboards.sh
```
This script creates CloudFormation stacks for the pre-built dashboards:
| Dashboard | Contents |
|---|---|
| API Performance | Request rate, p50/p95/p99 latency, error rate, active connections |
| Lambda Functions | Invocation count, error rate, duration, throttle count (per function) |
| Database | Aurora CPU, IOPS, connection count, replica lag |
| DynamoDB | Read/write capacity units, throttled requests, system errors |
| Kinesis | GetRecords iterator age (processing lag), incoming records rate |
### Creating Log Groups

Log groups must be created before the application starts writing logs:

```bash
aws logs create-log-group --log-group-name /experimentation-platform/api
aws logs create-log-group --log-group-name /experimentation-platform/services
aws logs create-log-group --log-group-name /experimentation-platform/errors
aws logs create-log-group --log-group-name /experimentation-platform/lambda

# Set retention to 90 days for cost control
aws logs put-retention-policy \
  --log-group-name /experimentation-platform/api \
  --retention-in-days 90
```
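If you provision from Python instead of the CLI, the same setup can be sketched with boto3 (an illustrative helper, not part of the platform's codebase; it is safe to re-run because existing groups are skipped):

```python
LOG_GROUPS = [
    "/experimentation-platform/api",
    "/experimentation-platform/services",
    "/experimentation-platform/errors",
    "/experimentation-platform/lambda",
]

def ensure_log_groups(client, groups=LOG_GROUPS, retention_days=90):
    """Create each log group if it doesn't exist, then apply retention."""
    for name in groups:
        try:
            client.create_log_group(logGroupName=name)
        except client.exceptions.ResourceAlreadyExistsException:
            pass  # idempotent: the group is already there
        client.put_retention_policy(
            logGroupName=name, retentionInDays=retention_days)

# To run for real:
#   import boto3
#   ensure_log_groups(boto3.client("logs"))
```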
## Prometheus Metrics

The API exposes a Prometheus-compatible metrics endpoint:

```
GET /metrics
```
No authentication is required (the endpoint should be restricted at the network level in production — allow access only from your Prometheus scraper's IP range).
### Key Metrics Exported

| Metric | Type | Description |
|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests by method, path, and status code |
| `http_request_duration_seconds` | Histogram | Request latency (p50, p95, p99 percentiles) |
| `http_requests_in_progress` | Gauge | Currently active requests |
| `experiment_assignments_total` | Counter | Variant assignments by experiment and variant |
| `feature_flag_evaluations_total` | Counter | Flag evaluations by flag key and result |
| `events_tracked_total` | Counter | Tracked events by type |
| `db_query_duration_seconds` | Histogram | Database query latency |
| `cache_hits_total` | Counter | Redis cache hits |
| `cache_misses_total` | Counter | Redis cache misses |
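Each sample at `/metrics` is one line of the Prometheus text exposition format. A minimal stdlib sketch of what a scraped counter sample looks like (the label names follow the table above; the values are illustrative):

```python
def render_counter(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# One labelled sample of http_requests_total as the scraper sees it:
line = render_counter(
    "http_requests_total",
    {"method": "GET", "path": "/experiments", "status": "200"}, 42)
# -> http_requests_total{method="GET",path="/experiments",status="200"} 42
```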
### Sample Prometheus Scrape Config

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'experimentation-api'
    static_configs:
      - targets: ['your-api.example.com:8000']
    metrics_path: '/metrics'
    scheme: 'https'
    scrape_interval: 30s
```
## Structured Logging (structlog)
The API uses structlog for JSON-formatted structured logs. Every log entry includes contextual fields for easy filtering and correlation.
### Log Format

```json
{
  "timestamp": "2026-03-02T14:32:00.123Z",
  "level": "info",
  "logger": "app.api.experiments",
  "request_id": "req-uuid-here",
  "user_id": "user-uuid-here",
  "action": "experiment.create",
  "experiment_id": "exp-uuid-here",
  "duration_ms": 42,
  "status_code": 201
}
```
### Standard Fields

| Field | Description |
|---|---|
| `timestamp` | ISO 8601 UTC timestamp |
| `level` | debug, info, warning, error, critical |
| `logger` | Module path of the logging component |
| `request_id` | Unique identifier for the HTTP request (correlates all log entries for one request) |
| `user_id` | Authenticated user UUID (if applicable) |
| `action` | Resource and action (e.g., experiment.start, feature_flag.toggle) |
| `duration_ms` | Time taken for the operation in milliseconds |
| `status_code` | HTTP response status code |
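The platform's actual structlog processor chain isn't shown here, but the resulting JSON shape can be approximated with the standard library alone (a simplified stand-in for illustration, not the platform's configuration):

```python
import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record with the standard fields above."""
    def format(self, record):
        entry = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc)
                .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "logger": record.name,
        }
        # Contextual fields (request_id, action, duration_ms, ...) travel in
        # a `context` dict attached via logging's `extra` mechanism.
        entry.update(getattr(record, "context", {}))
        entry["message"] = record.getMessage()
        return json.dumps(entry)

logger = logging.getLogger("app.api.experiments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("experiment created",
            extra={"context": {"request_id": "req-uuid-here",
                               "action": "experiment.create",
                               "duration_ms": 42, "status_code": 201}})
```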
## Querying Logs in CloudWatch Insights

```
# Find slow API requests (> 1000 ms)
fields timestamp, request_id, action, duration_ms, status_code
| filter duration_ms > 1000
| sort duration_ms desc
| limit 20
```

```
# Find error requests in the last hour
fields timestamp, request_id, user_id, action, @message
| filter status_code >= 500
| stats count() as error_count by action
| sort error_count desc
```

```
# Trace a specific request
fields @timestamp, level, action, @message
| filter request_id = "req-uuid-here"
| sort @timestamp asc
```
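These queries can also be run programmatically via the Logs Insights StartQuery/GetQueryResults API. A boto3 sketch (the one-hour window and one-second polling cadence are illustrative choices, not platform defaults):

```python
import time

def run_insights_query(logs, log_group, query, minutes=60):
    """Start a CloudWatch Logs Insights query and poll until it finishes."""
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=now - minutes * 60,
        endTime=now,
        queryString=query,
    )["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return resp
        time.sleep(1)  # Insights queries usually finish within seconds

SLOW_REQUESTS = (
    "fields timestamp, request_id, action, duration_ms, status_code"
    " | filter duration_ms > 1000"
    " | sort duration_ms desc"
    " | limit 20"
)

# To run for real:
#   import boto3
#   results = run_insights_query(boto3.client("logs"),
#                                "/experimentation-platform/api", SLOW_REQUESTS)
```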
## CloudWatch Alarms

Set up alarms to notify you when key thresholds are breached. Alarms require an SNS topic for notifications:

```bash
# Create the SNS topic and subscribe the on-call address
aws sns create-topic --name experimentation-alerts
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789:experimentation-alerts \
  --protocol email --notification-endpoint oncall@yourcompany.com
```
### Recommended Alarms

```bash
# p99 API latency > 1 second for 3 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name "ExperimentationAPI-HighLatency" \
  --metric-name "http_request_duration_seconds" \
  --namespace "ExperimentationPlatform" \
  --extended-statistic "p99" \
  --threshold 1.0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --period 60 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:experimentation-alerts

# More than 100 5xx responses per minute for 2 consecutive minutes
aws cloudwatch put-metric-alarm \
  --alarm-name "ExperimentationAPI-HighErrorRate" \
  --metric-name "http_request_5xx_total" \
  --namespace "ExperimentationPlatform" \
  --statistic "Sum" \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --period 60 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:experimentation-alerts

# Lambda DLQ messages > 0 (indicates failed events not being processed)
aws cloudwatch put-metric-alarm \
  --alarm-name "EventProcessorDLQ-Messages" \
  --metric-name "ApproximateNumberOfMessagesVisible" \
  --namespace "AWS/SQS" \
  --dimensions Name=QueueName,Value=experimentation-dlq \
  --statistic "Sum" \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --period 60 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:experimentation-alerts
```
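When managing many alarms, the same calls can be scripted. A small helper that builds `put_metric_alarm` arguments (illustrative only; thresholds and names come from the commands above):

```python
SNS_TOPIC = "arn:aws:sns:us-east-1:123456789:experimentation-alerts"

def alarm_spec(name, metric, threshold, *, statistic="Sum",
               evaluation_periods=2, period=60,
               namespace="ExperimentationPlatform"):
    """Build keyword arguments for cloudwatch.put_metric_alarm.

    For percentile alarms such as p99 latency, replace Statistic with
    ExtendedStatistic="p99" in the returned dict.
    """
    return {
        "AlarmName": name,
        "MetricName": metric,
        "Namespace": namespace,
        "Statistic": statistic,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": evaluation_periods,
        "Period": period,
        "AlarmActions": [SNS_TOPIC],
    }

# To create an alarm for real:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   cw.put_metric_alarm(**alarm_spec(
#       "ExperimentationAPI-HighErrorRate", "http_request_5xx_total", 100))
```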
## Distributed Tracing (AWS X-Ray)
The platform integrates with AWS X-Ray for distributed tracing across the API, Lambda functions, and DynamoDB calls.
### Enabling X-Ray

Set the environment variable:

```bash
AWS_XRAY_DAEMON_ADDRESS=xray-daemon:2000
```
The ECS task definition includes the X-Ray daemon as a sidecar container. Traces are automatically captured for:
- All incoming HTTP requests
- DynamoDB read and write operations
- Lambda invocations
- External HTTP calls (Slack, SendGrid, GitHub, Salesforce)
### Viewing Traces

- Open the AWS X-Ray Console
- Navigate to Traces and filter by service name: `experimentation-api`
- Use the Service Map to visualize dependencies
- Click any trace to see the full execution timeline and identify bottlenecks
## Health Check

The platform exposes a health check endpoint for load balancer and monitoring use:

```
GET /health
```

```json
{
  "status": "ok",
  "version": "1.4.2",
  "database": "connected",
  "redis": "connected",
  "timestamp": "2026-03-02T14:32:00Z"
}
```
If any dependency is unreachable, the response returns status 503 Service Unavailable with details:
```json
{
  "status": "degraded",
  "database": "connected",
  "redis": "timeout",
  "timestamp": "2026-03-02T14:32:00Z"
}
```
The ALB health check targets GET /health with a 5-second timeout, marking a task healthy after 2 consecutive successes and unhealthy after 3 consecutive failures. Tasks that fail health checks are automatically replaced by ECS.
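The degraded-response behavior can be modeled as a pure function (a simplified stand-in for the service's actual handler): given each dependency's probe result, return 200 only when every dependency is connected.

```python
import datetime

def build_health_response(version, checks):
    """Return (status_code, body) for /health.

    `checks` maps dependency name -> probe result string, e.g.
    {"database": "connected", "redis": "timeout"}.
    """
    healthy = all(result == "connected" for result in checks.values())
    body = {
        "status": "ok" if healthy else "degraded",
        "version": version,
        **checks,
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
            .isoformat(timespec="seconds").replace("+00:00", "Z"),
    }
    return (200 if healthy else 503), body

# All dependencies up -> 200 / "ok"
code, body = build_health_response(
    "1.4.2", {"database": "connected", "redis": "connected"})
# One dependency timing out -> 503 / "degraded"
code, body = build_health_response(
    "1.4.2", {"database": "connected", "redis": "timeout"})
```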
## Key Metrics to Watch
These are the most important operational signals:
| Metric | Healthy Range | Action if Breached |
|---|---|---|
| API p99 latency | < 1s | Investigate slow queries, check DynamoDB throttling |
| API error rate (5xx) | < 0.1% | Check application logs for exceptions |
| Database connection pool | < 80% utilized | Increase pool size or scale Aurora |
| Redis cache hit rate | > 85% | Investigate cache key patterns, increase TTL |
| Kinesis iterator age | < 60s | Scale Lambda concurrency, check for processing errors |
| Event Processor DLQ | 0 messages | Process DLQ messages manually, fix poison pill events |
| Lambda cold start rate | < 5% | Keep functions warm with scheduled invocations if needed |
| Safety rollback frequency | < 1/week | Investigate flag quality and error rate thresholds |
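For example, the Redis hit-rate signal can be computed directly from the two cache counters exported at `/metrics` (a sketch; the 85% target comes from the table above, and the numbers are illustrative):

```python
def cache_hit_rate(hits, misses):
    """Cache hit rate as a fraction, or None when there was no traffic."""
    total = hits + misses
    return hits / total if total else None

def below_target(hits, misses, target=0.85):
    """True when the hit rate has dropped below the 85% target."""
    rate = cache_hit_rate(hits, misses)
    return rate is not None and rate < target

# cache_hit_rate(900, 100) -> 0.9 (healthy); below_target(800, 300) -> True
```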