Physical Architecture
Overview
This document describes the physical infrastructure deployment of AWS Lens, including AWS services used, network topology, compute resources, storage systems, and operational procedures.
Deployment Model
AWS Lens supports three deployment models:
1. Multi-Tenant SaaS (Primary)
- Shared infrastructure across customers
- Data isolation via encryption and access controls
- Most cost-effective option
- Managed by CloudKeeper
2. Single-Tenant SaaS (Enterprise)
- Dedicated infrastructure per customer
- Complete isolation
- Custom configurations allowed
- Managed by CloudKeeper
3. On-Premises / VPC (Special Cases)
- Deployed in the customer's AWS account
- Customer-managed infrastructure
- Supports air-gapped environments
- Full customer control
This document focuses on the Multi-Tenant SaaS deployment.
AWS Region Strategy
Primary Region: us-east-1 (N. Virginia)
Rationale:
- Lowest AWS pricing for most services
- Broadest AWS service availability
- Same region as the Snowflake account (us-east-1)
- Closest to the majority of customers, who are US-based
Secondary Region: us-west-2 (Oregon)
Purpose:
- Disaster recovery (DR)
- Regional redundancy
- Compliance (data residency)
Future Expansion
Planned Regions:
- eu-west-1 (Ireland): European customers, GDPR compliance
- ap-south-1 (Mumbai): Asia-Pacific customers, data residency
Network Architecture
VPC Design
Network Specifications
VPC CIDR: 10.0.0.0/16 (65,536 IP addresses)
Subnet Allocation:
| Tier | Availability Zone | CIDR | Usable IPs | Purpose |
|---|---|---|---|---|
| Public | us-east-1a | 10.0.1.0/24 | 251 | ALB, NAT Gateway |
| Public | us-east-1b | 10.0.2.0/24 | 251 | ALB, NAT Gateway |
| Application | us-east-1a | 10.0.11.0/24 | 251 | ECS Tasks, Lambda |
| Application | us-east-1b | 10.0.12.0/24 | 251 | ECS Tasks, Lambda |
| Data | us-east-1a | 10.0.21.0/24 | 251 | RDS, Redis, MongoDB |
| Data | us-east-1b | 10.0.22.0/24 | 251 | RDS, Redis, MongoDB |
| Processing | us-east-1a | 10.0.31.0/24 | 251 | Spark, Airflow, Batch |
| Processing | us-east-1b | 10.0.32.0/24 | 251 | Spark, Airflow, Batch |
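The "Usable IPs" column follows from the fact that AWS reserves five addresses in every subnet (the first four and the last one), so a /24 yields 256 - 5 = 251. A minimal sketch that reproduces the figure and checks that the eight subnets sit inside the VPC CIDR without overlapping; the function and variable names are illustrative, not part of the deployment tooling:

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one

# (tier, availability zone) -> CIDR, copied from the subnet allocation table
SUBNETS = {
    ("Public", "us-east-1a"): "10.0.1.0/24",
    ("Public", "us-east-1b"): "10.0.2.0/24",
    ("Application", "us-east-1a"): "10.0.11.0/24",
    ("Application", "us-east-1b"): "10.0.12.0/24",
    ("Data", "us-east-1a"): "10.0.21.0/24",
    ("Data", "us-east-1b"): "10.0.22.0/24",
    ("Processing", "us-east-1a"): "10.0.31.0/24",
    ("Processing", "us-east-1b"): "10.0.32.0/24",
}


def usable_ips(cidr: str) -> int:
    """Addresses left for workloads after the AWS-reserved ones."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def validate_layout(subnets: dict, vpc: ipaddress.IPv4Network) -> None:
    """Raise ValueError if a subnet falls outside the VPC or overlaps another."""
    nets = [ipaddress.ip_network(cidr) for cidr in subnets.values()]
    for net in nets:
        if not net.subnet_of(vpc):
            raise ValueError(f"{net} is outside {vpc}")
    for i, first in enumerate(nets):
        for second in nets[i + 1:]:
            if first.overlaps(second):
                raise ValueError(f"{first} overlaps {second}")


if __name__ == "__main__":
    validate_layout(SUBNETS, VPC_CIDR)
    for (tier, az), cidr in SUBNETS.items():
        print(tier, az, cidr, usable_ips(cidr))
```

The eight /24 subnets use only a small part of the /16, which leaves room for additional tiers or a third Availability Zone.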
Security Groups
Compute Infrastructure
Application Layer - ECS Fargate
ECS Services:
| Service | Task Count | vCPU | Memory | Auto-Scaling | Purpose |
|---|---|---|---|---|---|
| api-service | 3-10 | 2 | 4 GB | CPU > 70% | REST API |
| web-service | 2-5 | 1 | 2 GB | CPU > 70% | React frontend serving |
| worker-service | 2-8 | 4 | 8 GB | Queue depth > 100 | Background processing |
| scheduler-service | 1 | 1 | 2 GB | No auto-scaling | Airflow scheduler |
Task Definitions:
# Example: api-service task definition
family: lens-api-service
networkMode: awsvpc
requiresCompatibilities:
  - FARGATE
cpu: 2048     # 2 vCPU
memory: 4096  # 4 GB
containerDefinitions:
  - name: api
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/lens-api:latest
    portMappings:
      - containerPort: 8080
        protocol: tcp
    environment:
      - name: SPRING_PROFILES_ACTIVE
        value: production
      - name: DB_HOST
        value: lens-mysql.cluster-xxxxx.us-east-1.rds.amazonaws.com
    secrets:
      - name: DB_PASSWORD
        valueFrom: arn:aws:secretsmanager:us-east-1:123456789:secret:lens/db-password
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/lens-api
        awslogs-region: us-east-1
        awslogs-stream-prefix: ecs
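Fargate only accepts certain CPU and memory pairings, so the `cpu` and `memory` values in a task definition cannot be chosen independently. The sketch below checks the four service sizes from the ECS services table against the pairings for 0.25 to 4 vCPU; the larger 8 and 16 vCPU sizes are left out, and the helper name is hypothetical:

```python
# CPU units -> allowed memory values in MiB (Fargate sizes up to 4 vCPU).
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

# Service -> (cpu units, memory MiB), taken from the ECS services table.
SERVICE_SIZES = {
    "api-service": (2048, 4096),
    "web-service": (1024, 2048),
    "worker-service": (4096, 8192),
    "scheduler-service": (1024, 2048),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """True if the cpu/memory pair is one Fargate accepts (sizes up to 4 vCPU)."""
    return memory_mib in FARGATE_SIZES.get(cpu, [])


if __name__ == "__main__":
    for service, (cpu, memory) in SERVICE_SIZES.items():
        print(service, cpu, memory, is_valid_fargate_size(cpu, memory))
```

A check like this could run in CI before a task definition is registered, so an invalid pairing fails early rather than at deploy time.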
Processing Layer - EMR & Batch
Amazon EMR (Spark):
- Cluster Type: Transient (spins up for jobs, terminates after)
- Master: 1x m5.xlarge (4 vCPU, 16 GB)
- Core: 2-10x m5.2xlarge (8 vCPU, 32 GB) - auto-scaling
- Task: 0-20x m5.2xlarge Spot instances (80% cost reduction)
- Use Case: Daily CUR processing, aggregation jobs
AWS Batch:
- Compute Environment: Fargate Spot
- Max vCPUs: 256
- Use Case: Recommendation engine, ad-hoc analysis
Serverless Compute - Lambda
| Function | Runtime | Memory | Timeout | Trigger | Purpose |
|---|---|---|---|---|---|
| cur-detector | Python 3.11 | 512 MB | 30s | S3 Event | Detect new CUR files |
| cur-parser | Python 3.11 | 3 GB | 15 min | SQS | Parse CUR files |
| alert-sender | Node.js 20 | 256 MB | 10s | EventBridge | Send notifications |
| report-generator | Python 3.11 | 2 GB | 5 min | EventBridge | Generate scheduled reports |
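To make the `cur-detector` row concrete, here is a minimal sketch of a handler that reads an S3 event notification and picks out objects that look like CUR data files. The suffix list, the function names, and the return shape are assumptions for illustration. The real function presumably also enqueues the result for `cur-parser` via SQS, which is omitted here to keep the example free of AWS SDK calls:

```python
from urllib.parse import unquote_plus

# Assumption: CUR data files arrive as gzip CSV or Parquet; manifests are skipped.
CUR_SUFFIXES = (".csv.gz", ".parquet")


def extract_cur_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for CUR-like objects in an S3 event."""
    found = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.endswith(CUR_SUFFIXES):
            found.append((bucket, key))
    return found


def handler(event, context):
    """Lambda entry point: report which new objects need parsing."""
    objects = extract_cur_objects(event)
    return {"detected": [{"bucket": b, "key": k} for b, k in objects]}
```

Keeping the detector this small is consistent with its 512 MB and 30 second limits; the heavy lifting stays in `cur-parser`.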
Storage Infrastructure
Block Storage - EBS
Use Cases:
- RDS database storage
- MongoDB persistent volumes
- Temporary processing storage
Volumes:
| Volume | Type | Size | IOPS | Throughput | Purpose |
|---|---|---|---|---|---|
| RDS Primary | gp3 | 1 TB | 12,000 | 250 MB/s | MySQL database |
| RDS Standby | gp3 | 1 TB | 12,000 | 250 MB/s | MySQL Multi-AZ standby |
| MongoDB | gp3 | 500 GB | 8,000 | 200 MB/s | Document storage |
| Processing | gp3 | 2 TB | 16,000 | 500 MB/s | Spark temp storage |
Object Storage - S3
Bucket Configuration:
| Bucket | Storage Class | Versioning | Encryption | Cross-Region Replication |
|---|---|---|---|---|
| lens-customer-cur | S3 Standard | Disabled | AES-256 (SSE-S3) | No |
| lens-processed-data | S3 Intelligent-Tiering | Enabled | AES-256 (SSE-KMS) | Yes (to us-west-2) |
| lens-backups | S3 Standard → Glacier | Enabled | AES-256 (SSE-KMS) | Yes (to us-west-2) |
| lens-logs | S3 Standard | Disabled | AES-256 (SSE-S3) | No |
| lens-static-assets | S3 Standard | Enabled | AES-256 (SSE-S3) | No (served via CloudFront) |
Database Infrastructure
Relational Database - Amazon RDS MySQL
Instance Configuration:
Engine: MySQL 8.0.35
Instance Class: db.r6g.2xlarge (8 vCPU, 64 GB RAM)
Deployment: Multi-AZ (Primary in us-east-1a, Standby in us-east-1b)
Storage: 1 TB gp3 (12,000 IOPS, 250 MB/s throughput)
Backup Retention: 30 days
Automated Backups: Daily at 03:00 UTC
Read Replicas: 1 (for reporting queries)
High Availability:
Performance Tuning:
- Connection Pooling: 100 connections per app instance
- Query Cache: Not applicable (removed in MySQL 8.0; Redis serves as the cache)
- InnoDB Buffer Pool: 48 GB (75% of RAM)
- Read Replicas: Offload reporting queries
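The pool size and buffer pool figures can be sanity-checked with simple arithmetic. The sketch below assumes that only `api-service` and `worker-service` hold MySQL connection pools and that both scale to their maximum task counts at the same time; both are assumptions made for illustration:

```python
POOL_SIZE_PER_INSTANCE = 100   # connections per app instance
INSTANCE_RAM_GB = 64           # db.r6g.2xlarge
BUFFER_POOL_FRACTION = 0.75

# Assumption: these are the services that open MySQL pools (max task counts).
MAX_TASKS = {"api-service": 10, "worker-service": 8}


def worst_case_connections(max_tasks: dict, pool_size: int) -> int:
    """Connections opened if every task fills its pool simultaneously."""
    return sum(max_tasks.values()) * pool_size


def buffer_pool_gb(ram_gb: float, fraction: float) -> float:
    """InnoDB buffer pool size implied by a fraction of instance RAM."""
    return ram_gb * fraction


if __name__ == "__main__":
    print(worst_case_connections(MAX_TASKS, POOL_SIZE_PER_INSTANCE))
    print(buffer_pool_gb(INSTANCE_RAM_GB, BUFFER_POOL_FRACTION))
```

Under these assumptions the worst case is 1,800 connections, which is worth comparing against the instance's `max_connections` setting whenever the maximum task counts change.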
Cache - Amazon ElastiCache Redis
Cluster Configuration:
Engine: Redis 7.0
Node Type: cache.r6g.xlarge (4 vCPU, 26.32 GB RAM)
Deployment: Cluster mode enabled
Shards: 3
Replicas per Shard: 1
Total Nodes: 6 (3 primary + 3 replicas)
Multi-AZ: Enabled
Encryption: In-transit and at-rest
Redis Cluster Architecture:
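With cluster mode enabled, every key maps to one of 16,384 hash slots (CRC16 of the key, modulo 16384), and each shard owns a range of slots. The sketch below shows that mapping for the three-shard layout. The even slot split is an assumption, since the actual slot assignment is not stated here, and the key names are made up for illustration:

```python
from binascii import crc_hqx

HASH_SLOTS = 16384
SHARDS = 3
REPLICAS_PER_SHARD = 1


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag counts
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Hash slot for a key: CRC16 (XMODEM variant) modulo 16384."""
    return crc_hqx(hash_tag(key.encode("utf-8")), 0) % HASH_SLOTS


def shard_ranges(shards: int, slots: int = HASH_SLOTS) -> list[tuple[int, int]]:
    """Split the slot space into contiguous, near-equal ranges (assumed layout)."""
    base, extra = divmod(slots, shards)
    ranges, start = [], 0
    for index in range(shards):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def shard_for_key(key: str) -> int:
    """Index of the shard whose slot range contains the key's slot."""
    slot = key_slot(key)
    for index, (low, high) in enumerate(shard_ranges(SHARDS)):
        if low <= slot <= high:
            return index
    raise RuntimeError("slot outside all ranges")


if __name__ == "__main__":
    print(shard_ranges(SHARDS))
    print(shard_for_key("{acct42}:costs"), shard_for_key("{acct42}:dashboards"))
```

Keys that share a hash tag, such as `{acct42}:costs` and `{acct42}:dashboards`, land in the same slot, which is what allows multi-key operations on one tenant's data in cluster mode.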
Document Store - MongoDB Atlas
Cluster Configuration:
Provider: AWS
Region: us-east-1
Tier: M30 (2 vCPU, 8 GB RAM per node)
Deployment: Replica Set (3 nodes)
  - Primary: us-east-1a
  - Secondary 1: us-east-1b
  - Secondary 2: us-east-1c
Storage: 512 GB
Backups: Continuous (point-in-time recovery)
Why MongoDB Atlas (not self-managed):
- Fully managed (auto-upgrades, monitoring, backups)
- Better security (encryption, access controls)
- Lower operational overhead
- Cost-effective for < 1 TB data
Data Warehouse - Snowflake
Warehouse Configuration:
Cloud Provider: AWS
Region: us-east-1
Account Edition: Enterprise
Virtual Warehouses:
1. INGESTION_WH (Medium, auto-suspend after 5 min)
   - Purpose: Load CUR data
   - Runs: Available 24/7; auto-suspends when idle
2. ANALYTICS_WH (Large, auto-suspend after 10 min)
   - Purpose: Dashboard queries, aggregations
   - Runs: Business hours, auto-resumes on query
3. REPORTING_WH (X-Small, auto-suspend after 5 min)
   - Purpose: Scheduled reports
   - Runs: Scheduled jobs only
4. AD_HOC_WH (Large, auto-suspend after 2 min)
   - Purpose: User ad-hoc queries
   - Runs: On-demand
Storage:
- Database Size: ~10 TB (compressed)
- Retention: 1 year active, 7 years archived
- Clustering: Clustered by account_id and date for query performance
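The auto-suspend settings matter because a warehouse keeps consuming credits while it sits idle waiting to suspend. A rough sketch of the credit cost of one burst of work, using Snowflake's standard hourly rates per size (X-Small 1, Medium 4, Large 8 credits) and per-second billing with a 60-second minimum. It assumes the warehouse was suspended before the burst and suspends exactly at the configured timeout; the function name is illustrative:

```python
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8}
MIN_BILLED_SECONDS = 60  # minimum charge each time a warehouse resumes

# Warehouse -> (size, auto-suspend timeout in seconds), from the list above.
WAREHOUSES = {
    "INGESTION_WH": ("Medium", 5 * 60),
    "ANALYTICS_WH": ("Large", 10 * 60),
    "REPORTING_WH": ("X-Small", 5 * 60),
    "AD_HOC_WH": ("Large", 2 * 60),
}


def credits_for_burst(warehouse: str, busy_seconds: int) -> float:
    """Credits for one resume: busy time plus the idle tail before suspend."""
    size, auto_suspend_s = WAREHOUSES[warehouse]
    billed_seconds = max(MIN_BILLED_SECONDS, busy_seconds + auto_suspend_s)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600


if __name__ == "__main__":
    for name in WAREHOUSES:
        print(name, round(credits_for_burst(name, 60), 3))
```

Under these assumptions, a one-minute ad-hoc query on AD_HOC_WH is billed for three minutes of a Large warehouse, which is why the shortest timeout is assigned to the least predictable workload.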
Load Balancing & CDN
Application Load Balancer
ALB Configuration:
Scheme: Internet-facing
IP Address Type: IPv4
Availability Zones: us-east-1a, us-east-1b
Security Groups: sg-alb-https (443 from 0.0.0.0/0)
Listeners:
  - Port 443 (HTTPS) → Forward to target groups based on path
  - Port 80 (HTTP) → Redirect to 443
SSL Certificate: *.cloudkeeper.com (ACM)
Idle Timeout: 60 seconds
Access Logs: Enabled (to S3)
Target Group Health Checks:
| Target Group | Protocol | Path | Interval | Timeout | Healthy Threshold | Unhealthy Threshold |
|---|---|---|---|---|---|---|
| API | HTTP | /api/health | 30s | 5s | 2 | 3 |
| Web | HTTP | / | 30s | 5s | 2 | 3 |
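The interval and threshold columns together determine how quickly the load balancer reacts to a failing or recovering task. A small sketch of that arithmetic; it is an approximation that ignores the per-check timeout and the exact moment within an interval at which a task fails:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    interval_s: int
    timeout_s: int
    healthy_threshold: int
    unhealthy_threshold: int

    def seconds_to_unhealthy(self) -> int:
        """Approximate time for a failing target to be taken out of rotation."""
        return self.interval_s * self.unhealthy_threshold

    def seconds_to_healthy(self) -> int:
        """Approximate time for a new or recovered target to receive traffic."""
        return self.interval_s * self.healthy_threshold


# Both target groups in the table share the same settings.
API_CHECK = HealthCheck(interval_s=30, timeout_s=5, healthy_threshold=2, unhealthy_threshold=3)

if __name__ == "__main__":
    print(API_CHECK.seconds_to_unhealthy(), API_CHECK.seconds_to_healthy())
```

With the values above, a failing task keeps receiving traffic for roughly 90 seconds, and a newly started task waits about 60 seconds before serving requests.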
CloudFront CDN
CloudFront Configuration:
Distribution Domain: app.cloudkeeper.com
Origins:
  1. ALB (Dynamic Content)
     - Origin Protocol: HTTPS only
     - Origin Path: /
     - Cache Behavior: No caching (pass-through)
  2. S3 (Static Content)
     - Origin Protocol: HTTPS only
     - Origin Path: /static
     - Cache Behavior: Cache for 1 year
     - Compress: Yes
Price Class: Use all edge locations
SSL Certificate: *.cloudkeeper.com (ACM)
HTTP Version: HTTP/2 enabled
WAF: Enabled (AWS WAF Web ACL)
Auto-Scaling Configuration
ECS Service Auto-Scaling
Scaling Policies:
| Service | Metric | Target | Scale Out | Scale In | Min | Max |
|---|---|---|---|---|---|---|
| API | CPU Utilization | 70% | +2 tasks | -1 task | 3 | 10 |
| API | Request Count | 1000/min | +1 task | -1 task | 3 | 10 |
| Worker | Queue Depth | 100 msgs | +2 tasks | -1 task | 2 | 8 |
| Web | CPU Utilization | 70% | +1 task | -1 task | 2 | 5 |
Cooldown Periods:
- Scale out cooldown: 60 seconds
- Scale in cooldown: 300 seconds (5 minutes)
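The interaction between step sizes, cooldowns, and the min/max bounds can be sketched as a single decision function. This is a simplified model for illustration only: the actual decisions are made by the AWS auto-scaling service, and the class and function names here are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    target: float          # metric threshold, e.g. 70 (% CPU)
    scale_out_step: int    # tasks added when above target
    scale_in_step: int     # tasks removed when below target
    min_tasks: int
    max_tasks: int
    scale_out_cooldown_s: int = 60
    scale_in_cooldown_s: int = 300


def desired_tasks(policy: ScalingPolicy, current: int, metric: float,
                  seconds_since_last_scaling: int) -> int:
    """Task count after applying the policy once to the current metric value."""
    if metric > policy.target:
        if seconds_since_last_scaling >= policy.scale_out_cooldown_s:
            return min(current + policy.scale_out_step, policy.max_tasks)
    elif metric < policy.target:
        if seconds_since_last_scaling >= policy.scale_in_cooldown_s:
            return max(current - policy.scale_in_step, policy.min_tasks)
    return current


# API service, CPU utilization row of the scaling policies table.
API_CPU = ScalingPolicy(target=70, scale_out_step=2, scale_in_step=1,
                        min_tasks=3, max_tasks=10)

if __name__ == "__main__":
    print(desired_tasks(API_CPU, current=5, metric=85, seconds_since_last_scaling=120))
```

The asymmetric cooldowns (60 seconds out, 300 seconds in) make the service quick to add capacity and slow to remove it, which reduces flapping under bursty load.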
Disaster Recovery Infrastructure
Cross-Region Replication
RTO & RPO:
| Component | RPO | RTO | Recovery Method |
|---|---|---|---|
| Application (ECS) | 0 | 5 min | Multi-AZ auto-recovery |
| RDS MySQL | 0 | 15 min | Multi-AZ automatic failover |
| Regional Outage | 1 hour | 4 hours | Manual failover to us-west-2 |
| Data Corruption | 24 hours | 24 hours | Point-in-time restore from backup |
Monitoring Infrastructure
CloudWatch Dashboards
Production Dashboard:
Log Aggregation
CloudWatch Logs:
| Log Group | Source | Retention | Purpose |
|---|---|---|---|
| /ecs/lens-api | ECS Tasks | 30 days | Application logs |
| /ecs/lens-worker | ECS Tasks | 30 days | Background job logs |
| /aws/lambda/cur-parser | Lambda | 14 days | CUR parsing logs |
| /aws/rds/instance/lens-mysql/error | RDS | 7 days | Database error logs |
| lens-alb-access-logs (S3 bucket) | ALB | 90 days | HTTP access logs |
Cost Optimization
Reserved Capacity
RDS Reserved Instances:
- Instance: db.r6g.2xlarge
- Term: 3 years, All Upfront
- Savings: 63% vs On-Demand ($30K/year → $11K/year)
ElastiCache Reserved Nodes:
- Node Type: cache.r6g.xlarge × 6 nodes
- Term: 1 year, Partial Upfront
- Savings: ~33% vs On-Demand ($18K/year → $12K/year)
Compute Savings Plans
ECS Fargate Compute Savings Plan:
- Commitment: covers $500/hour of On-Demand-equivalent usage (about $4.4M/year)
- Term: 1 year, No Upfront
- Savings: 20% vs On-Demand ($4.4M/year → $3.5M/year)
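The savings percentages can be recomputed from the annual dollar figures quoted in this section. The sketch below does that arithmetic and also annualizes the hourly Savings Plan figure; the dollar inputs are the rounded values from the text, so the results are approximate:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def savings_pct(on_demand: float, committed: float) -> float:
    """Percentage saved relative to the On-Demand cost."""
    return (on_demand - committed) / on_demand * 100


def annualize(hourly: float) -> float:
    """Annual cost of a constant hourly spend."""
    return hourly * HOURS_PER_YEAR


# (On-Demand $/year, committed $/year), rounded figures from the text.
COMMITMENTS = {
    "RDS Reserved Instance": (30_000, 11_000),
    "ElastiCache Reserved Nodes": (18_000, 12_000),
    "Fargate Compute Savings Plan": (4_400_000, 3_500_000),
}

if __name__ == "__main__":
    for name, (on_demand, committed) in COMMITMENTS.items():
        print(f"{name}: {savings_pct(on_demand, committed):.1f}%")
    print(f"$500/hour for a year: ${annualize(500):,.0f}")
```

From the rounded inputs this gives roughly 63%, 33%, and 20%, and $500/hour sustained for a year comes to about $4.38M, in line with the On-Demand figure quoted for Fargate.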
Spot Instances
EMR Task Nodes:
- On-Demand: 2 core nodes (m5.2xlarge) at $0.384/hour each = $0.768/hour
- Spot: 0-20 task nodes (m5.2xlarge) at about $0.077/hour each, or $0.154/hour for two (80% savings)
- Monthly Savings: ~$8,000
Batch Jobs:
- Fargate Spot: 70% discount vs Fargate On-Demand
- Use Case: Non-critical, fault-tolerant workloads
Infrastructure as Code
Terraform
Repository Structure:
terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── dr/
├── modules/
│   ├── networking/     # VPC, subnets, route tables
│   ├── compute/        # ECS, Lambda, Batch
│   ├── database/       # RDS, Redis, MongoDB
│   ├── storage/        # S3 buckets
│   ├── loadbalancer/   # ALB, target groups
│   ├── cdn/            # CloudFront
│   └── monitoring/     # CloudWatch, alarms
└── global/
    ├── iam/            # IAM roles, policies
    └── route53/        # DNS records
State Management:
- Backend: S3 with DynamoDB locking
- Encryption: Server-side encryption enabled
- Versioning: Enabled for rollback
Operational Procedures
Deployment Pipeline
Backup & Restore Procedures
Daily Backups:
# RDS Automated Backup (daily at 03:00 UTC)
# Retention: 30 days
# Restore procedure:
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier lens-mysql-prod \
--target-db-instance-identifier lens-mysql-restored \
--restore-time 2025-10-25T03:00:00Z
# MongoDB Backup (via Atlas, daily at 04:00 UTC)
# Retention: 30 days
# Restore via Atlas UI or API
-- Snowflake Time Travel (7 days retention)
-- Restore procedure (SQL, run in Snowflake):
CREATE TABLE restored_table CLONE original_table
  AT (TIMESTAMP => '2025-10-25 03:00:00'::timestamp);
Performance Benchmarks
API Performance
| Endpoint | p50 | p95 | p99 | Max RPS |
|---|---|---|---|---|
| GET /api/v1/costs | 45ms | 120ms | 250ms | 5,000 |
| GET /api/v1/dashboards | 80ms | 200ms | 400ms | 2,000 |
| POST /api/v1/reports | 150ms | 350ms | 800ms | 500 |
| GET /api/v1/recommendations | 100ms | 250ms | 500ms | 1,000 |
Database Performance
MySQL:
- Connections: 100 concurrent connections per app instance
- Query Performance: p95 < 50ms for indexed queries
- Replication Lag: < 1 second (read replica; the Multi-AZ standby replicates synchronously)
Redis:
- Get Operations: p99 < 1ms
- Cache Hit Rate: 85%+
- Throughput: 100,000 ops/sec
Snowflake:
- Query Performance: p95 < 5 seconds for dashboard queries
- Concurrent Queries: 50+ simultaneous queries
- Data Scan: 1 TB/min with Large warehouse
Next Steps
Related Documentation
- Solution Architecture - High-level architecture
- Logical Architecture - Component design
- Security Architecture - Security details
- Developer Quickstart - Development setup
This physical architecture reflects the current production deployment as of October 2025. Infrastructure specifications may change based on growth and optimization opportunities.