Physical Architecture

Overview

This document describes the physical infrastructure deployment of AWS Lens: the AWS services used, network topology, compute resources, storage systems, and operational procedures.

Deployment Model

AWS Lens supports three deployment models:

1. Multi-Tenant SaaS (Primary)

Shared infrastructure across customers
Data isolation via encryption and access controls
Most cost-effective option
Managed by CloudKeeper

2. Single-Tenant SaaS (Enterprise)

Dedicated infrastructure per customer
Complete isolation
Custom configurations allowed
Managed by CloudKeeper

3. On-Premises / VPC (Special Cases)

Deployed in customer's AWS account
Customer-managed infrastructure
Air-gapped environments
Full control

This document focuses on the Multi-Tenant SaaS deployment.

AWS Region Strategy

Primary Region: us-east-1 (N. Virginia)

Rationale:

Lowest AWS pricing for most services
Broadest AWS service availability
Best connectivity to Snowflake (us-east-1)
Closest to majority of customers (US-based)

Secondary Region: us-west-2 (Oregon)

Purpose:

Disaster recovery (DR)
Regional redundancy
Compliance (data residency)

Future Expansion

Planned Regions:

eu-west-1 (Ireland): European customers, GDPR compliance
ap-south-1 (Mumbai): Asia-Pacific customers, data residency

Network Architecture

VPC Design

Network Specifications

VPC CIDR: 10.0.0.0/16 (65,536 IP addresses)

Subnet Allocation:

Tier	Availability Zone	CIDR	Usable IPs	Purpose
Public	us-east-1a	10.0.1.0/24	251	ALB, NAT Gateway
Public	us-east-1b	10.0.2.0/24	251	ALB, NAT Gateway
Application	us-east-1a	10.0.11.0/24	251	ECS Tasks, Lambda
Application	us-east-1b	10.0.12.0/24	251	ECS Tasks, Lambda
Data	us-east-1a	10.0.21.0/24	251	RDS, Redis, MongoDB
Data	us-east-1b	10.0.22.0/24	251	RDS, Redis, MongoDB
Processing	us-east-1a	10.0.31.0/24	251	Spark, Airflow, Batch
Processing	us-east-1b	10.0.32.0/24	251	Spark, Airflow, Batch
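The subnet plan above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module: it confirms that each subnet sits inside the VPC CIDR, that no two subnets overlap, and that the "Usable IPs" column follows from AWS reserving five addresses per subnet. The short subnet names (`public-1a` and so on) are labels invented for this example, not identifiers from the deployment.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5

# Subnet labels are illustrative; the CIDRs come from the table above.
SUBNETS = {
    "public-1a": "10.0.1.0/24",
    "public-1b": "10.0.2.0/24",
    "application-1a": "10.0.11.0/24",
    "application-1b": "10.0.12.0/24",
    "data-1a": "10.0.21.0/24",
    "data-1b": "10.0.22.0/24",
    "processing-1a": "10.0.31.0/24",
    "processing-1b": "10.0.32.0/24",
}


def usable_ips(cidr: str) -> int:
    """Return the number of addresses AWS leaves assignable in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def validate(vpc, subnets) -> None:
    """Raise ValueError if a subnet falls outside the VPC or overlaps another."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside {vpc}")
    names = list(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                raise ValueError(f"{first} overlaps {second}")


if __name__ == "__main__":
    validate(VPC, SUBNETS)
    for name, cidr in SUBNETS.items():
        print(name, cidr, usable_ips(cidr))
```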

Security Groups

Compute Infrastructure

Application Layer - ECS Fargate

ECS Services:

Service	Task Count	vCPU	Memory	Auto-Scaling	Purpose
api-service	3-10	2	4 GB	CPU > 70%	REST API
web-service	2-5	1	2 GB	CPU > 70%	React frontend serving
worker-service	2-8	4	8 GB	Queue depth > 100	Background processing
scheduler-service	1	1	2 GB	No auto-scaling	Airflow scheduler

Task Definitions:

# Example: api-service task definition
family: lens-api-service
networkMode: awsvpc
requiresCompatibilities:
  - FARGATE
cpu: 2048      # 2 vCPU
memory: 4096   # 4 GB
containerDefinitions:
  - name: api
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/lens-api:latest
    portMappings:
      - containerPort: 8080
        protocol: tcp
    environment:
      - name: SPRING_PROFILES_ACTIVE
        value: production
      - name: DB_HOST
        value: lens-mysql.cluster-xxxxx.us-east-1.rds.amazonaws.com
    secrets:
      - name: DB_PASSWORD
        valueFrom: arn:aws:secretsmanager:us-east-1:123456789:secret:lens/db-password
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/lens-api
        awslogs-region: us-east-1
        awslogs-stream-prefix: ecs
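Fargate only accepts certain CPU and memory pairings, so sizes such as `cpu: 2048` with `memory: 4096` are worth validating before a task definition is registered. The sketch below encodes the pairings for task sizes up to 4 vCPU and checks the four services from the table above. The function and table names are made up for this example, and the larger 8 and 16 vCPU sizes are deliberately left out.

```python
# Valid Fargate memory values (MiB) for each CPU value (CPU units),
# limited to task sizes of 4 vCPU and below.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

# (cpu units, memory MiB) taken from the ECS services table.
SERVICES = {
    "api-service": (2048, 4096),
    "web-service": (1024, 2048),
    "worker-service": (4096, 8192),
    "scheduler-service": (1024, 2048),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True if the CPU/memory pair is an accepted Fargate task size."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, [])


if __name__ == "__main__":
    for name, (cpu, memory) in SERVICES.items():
        print(name, cpu, memory, is_valid_fargate_size(cpu, memory))
```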

Processing Layer - EMR & Batch

Amazon EMR (Spark):

Cluster Type: Transient (spins up for jobs, terminates after)
Master: 1x m5.xlarge (4 vCPU, 16 GB)
Core: 2-10x m5.2xlarge (8 vCPU, 32 GB) - auto-scaling
Task: 0-20x m5.2xlarge Spot instances (80% cost reduction)
Use Case: Daily CUR processing, aggregation jobs
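To get a feel for the range the EMR cluster can span, the node counts and instance sizes above can be totalled. This is a rough sizing sketch: the `NodeGroup` type and `capacity` helper are names invented here, and the totals count raw instance resources without subtracting what YARN and the operating system keep for themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    vcpu: int        # vCPUs per instance
    memory_gb: int   # memory per instance
    min_nodes: int
    max_nodes: int


# Values taken from the EMR cluster description above.
GROUPS = [
    NodeGroup("master", vcpu=4, memory_gb=16, min_nodes=1, max_nodes=1),
    NodeGroup("core", vcpu=8, memory_gb=32, min_nodes=2, max_nodes=10),
    NodeGroup("task", vcpu=8, memory_gb=32, min_nodes=0, max_nodes=20),
]


def capacity(groups, at_max: bool):
    """Return (total vCPU, total memory GB) at minimum or maximum scale."""
    vcpu = 0
    memory = 0
    for group in groups:
        count = group.max_nodes if at_max else group.min_nodes
        vcpu += count * group.vcpu
        memory += count * group.memory_gb
    return vcpu, memory


if __name__ == "__main__":
    print("minimum:", capacity(GROUPS, at_max=False))
    print("maximum:", capacity(GROUPS, at_max=True))
```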

AWS Batch:

Compute Environment: Fargate Spot
Max vCPUs: 256
Use Case: Recommendation engine, ad-hoc analysis

Serverless Compute - Lambda

Function	Runtime	Memory	Timeout	Trigger	Purpose
cur-detector	Python 3.11	512 MB	30s	S3 Event	Detect new CUR files
cur-parser	Python 3.11	3 GB	15 min	SQS	Parse CUR files
alert-sender	Node.js 20	256 MB	10s	EventBridge	Send notifications
report-generator	Python 3.11	2 GB	5 min	EventBridge	Generate scheduled reports
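As an illustration of the `cur-detector` step, the sketch below shows how a function triggered by an S3 event might pick out newly created CUR objects. It is not the production handler: the file suffixes treated as CUR files and the helper name `extract_cur_objects` are assumptions made for this example, and forwarding the results to SQS is left out so the code stays free of external dependencies.

```python
from urllib.parse import unquote_plus

# Assumption: CUR deliveries are gzip-compressed CSV or Parquet files.
CUR_SUFFIXES = (".csv.gz", ".parquet")


def extract_cur_objects(event: dict) -> list:
    """Return (bucket, key) pairs for newly created CUR files in an S3 event."""
    found = []
    for record in event.get("Records", []):
        # Ignore anything that is not an object-creation notification.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(CUR_SUFFIXES):
            found.append((bucket, key))
    return found


def handler(event, context):
    """Lambda-style entry point; a real function would enqueue these for cur-parser."""
    return {"detected": extract_cur_objects(event)}
```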

Storage Infrastructure

Block Storage - EBS

Use Cases:

RDS database storage
MongoDB persistent volumes
Temporary processing storage

Volumes:

Volume	Type	Size	IOPS	Throughput	Purpose
RDS Primary	gp3	1 TB	12,000	250 MB/s	MySQL database
RDS Standby	gp3	1 TB	12,000	250 MB/s	MySQL replica
MongoDB	gp3	500 GB	8,000	200 MB/s	Document storage
Processing	gp3	2 TB	16,000	500 MB/s	Spark temp storage

Object Storage - S3

Bucket Configuration:

Bucket	Storage Class	Versioning	Encryption	Cross-Region Replication
lens-customer-cur	S3 Standard	Disabled	AES-256 (SSE-S3)	No
lens-processed-data	S3 Intelligent-Tiering	Enabled	AES-256 (SSE-KMS)	Yes (to us-west-2)
lens-backups	S3 Standard → Glacier	Enabled	AES-256 (SSE-KMS)	Yes (to us-west-2)
lens-logs	S3 Standard	Disabled	AES-256 (SSE-S3)	No
lens-static-assets	S3 Standard	Enabled	AES-256 (SSE-S3)	No (served via CloudFront)
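The "S3 Standard → Glacier" entry for `lens-backups` implies a lifecycle transition rule. A minimal sketch of such a rule follows, built as a plain dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The 30-day threshold and the rule ID format are assumptions for illustration; the source does not state when objects move to Glacier.

```python
def glacier_lifecycle(days_until_glacier: int, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that moves objects to Glacier."""
    if days_until_glacier < 0:
        raise ValueError("days_until_glacier must not be negative")
    return {
        "Rules": [
            {
                "ID": f"to-glacier-after-{days_until_glacier}d",
                "Status": "Enabled",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days_until_glacier, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical threshold: transition backups after 30 days.
    print(glacier_lifecycle(30))
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `LifecycleConfiguration` alongside `Bucket="lens-backups"`.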

Database Infrastructure

Relational Database - Amazon RDS MySQL

Instance Configuration:

Engine: MySQL 8.0.35
Instance Class: db.r6g.2xlarge (8 vCPU, 64 GB RAM)
Deployment: Multi-AZ (Primary in us-east-1a, Standby in us-east-1b)
Storage: 1 TB gp3 (12,000 IOPS, 250 MB/s throughput)
Backup Retention: 30 days
Automated Backups: Daily at 03:00 UTC
Read Replicas: 1 (for reporting queries)

High Availability:

Performance Tuning:

Connection Pooling: 100 connections per app instance
Query Cache: Not applicable (removed in MySQL 8.0); Redis handles caching instead
InnoDB Buffer Pool: 48 GB (75% of RAM)
Read Replicas: Offload reporting queries
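Two of the tuning figures above are simple arithmetic and worth sanity-checking together: the buffer pool size, and the number of connections the database could see when every pooled application task is at its maximum. The sketch below assumes that only `api-service` and `worker-service` hold connection pools against MySQL; that assumption, and the helper names, are not from the source.

```python
RAM_GB = 64                  # db.r6g.2xlarge
BUFFER_POOL_FRACTION = 0.75  # share of RAM given to the InnoDB buffer pool
POOL_SIZE_PER_TASK = 100     # connections per app instance

# Assumption: only these services connect to MySQL; counts are their max task counts.
MAX_TASKS = {"api-service": 10, "worker-service": 8}


def buffer_pool_gb(ram_gb: float, fraction: float) -> float:
    """Return the InnoDB buffer pool size implied by a RAM fraction."""
    return ram_gb * fraction


def peak_connections(pool_size: int, max_tasks: dict) -> int:
    """Return the connection count if every task fills its pool."""
    return pool_size * sum(max_tasks.values())


if __name__ == "__main__":
    print("buffer pool (GB):", buffer_pool_gb(RAM_GB, BUFFER_POOL_FRACTION))
    print("peak connections:", peak_connections(POOL_SIZE_PER_TASK, MAX_TASKS))
```

The peak figure is the number to compare against the instance's `max_connections` setting when raising task limits or pool sizes.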

Cache - Amazon ElastiCache Redis

Cluster Configuration:

Engine: Redis 7.0
Node Type: cache.r6g.xlarge (4 vCPU, 26.32 GB RAM)
Deployment: Cluster mode enabled
Shards: 3
Replicas per Shard: 1
Total Nodes: 6 (3 primary + 3 replicas)
Multi-AZ: Enabled
Encryption: In-transit and at-rest
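With cluster mode enabled, Redis assigns each key to one of 16,384 hash slots using CRC16 of the key (or of its hash tag, the part between `{` and `}`), and each shard owns a range of slots. The sketch below reproduces that key-to-slot mapping with the standard library. The `shard_for_slot` helper assumes the slots are split evenly across the three shards, which may not match the actual slot ranges of the deployed cluster.

```python
import binascii

HASH_SLOTS = 16384
SHARDS = 3


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key, honouring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 variant Redis Cluster uses.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


def shard_for_slot(slot: int, shards: int = SHARDS) -> int:
    """Map a slot to a shard index, assuming an even split of the slot space."""
    if not 0 <= slot < HASH_SLOTS:
        raise ValueError("slot out of range")
    return slot * shards // HASH_SLOTS


if __name__ == "__main__":
    for key in ("foo", "{tenant42}:costs", "{tenant42}:dashboards"):
        slot = key_slot(key)
        print(key, slot, shard_for_slot(slot))
```

Keys that share a hash tag, such as the two `{tenant42}` keys in the usage lines, land in the same slot, which is what allows multi-key operations on them in cluster mode.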

Redis Cluster Architecture:

Document Store - MongoDB Atlas

Cluster Configuration:

Provider: AWS
Region: us-east-1
Tier: M30 (2 vCPU, 8 GB RAM per node)
Deployment: Replica Set (3 nodes)
  - Primary: us-east-1a
  - Secondary 1: us-east-1b
  - Secondary 2: us-east-1c
Storage: 512 GB
Backups: Continuous (point-in-time recovery)

Why MongoDB Atlas (not self-managed):

Fully managed (auto-upgrades, monitoring, backups)
Better security (encryption, access controls)
Lower operational overhead
Cost-effective for < 1 TB data

Data Warehouse - Snowflake

Warehouse Configuration:

Cloud Provider: AWS
Region: us-east-1
Account Edition: Enterprise

Virtual Warehouses:
1. INGESTION_WH (Medium, auto-suspend after 5 min)
   - Purpose: Load CUR data
   - Runs: 24/7, suspends when idle

2. ANALYTICS_WH (Large, auto-suspend after 10 min)
   - Purpose: Dashboard queries, aggregations
   - Runs: Business hours, auto-resumes on query

3. REPORTING_WH (X-Small, auto-suspend after 5 min)
   - Purpose: Scheduled reports
   - Runs: Scheduled jobs only

4. AD_HOC_WH (Large, auto-suspend after 2 min)
   - Purpose: User ad-hoc queries
   - Runs: On-demand
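Warehouse size and auto-suspend timeout together drive Snowflake compute cost. The following sketch estimates credits for a single burst of activity, using the standard per-hour credit rates for the sizes listed above and Snowflake's 60-second minimum billing per resume. It assumes the warehouse stays running for the full auto-suspend window after the last query finishes; the function name and the example durations are illustrative.

```python
# Credits consumed per hour of running time, by warehouse size.
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8}

# Each time a warehouse resumes, at least 60 seconds are billed.
MIN_BILLED_SECONDS = 60


def credits_for_run(size: str, busy_seconds: int, auto_suspend_seconds: int) -> float:
    """Estimate credits for one resume: query time plus the idle tail before suspend."""
    billed = max(busy_seconds + auto_suspend_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed / 3600


if __name__ == "__main__":
    # Hypothetical workloads using the auto-suspend settings listed above.
    print("ANALYTICS_WH:", credits_for_run("Large", 3000, 600))
    print("AD_HOC_WH:", credits_for_run("Large", 45, 120))
    print("REPORTING_WH:", credits_for_run("X-Small", 10, 300))
```

The idle tail is why a short auto-suspend, such as the 2 minutes on AD_HOC_WH, matters most on the larger sizes.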

Storage:

Database Size: ~10 TB (compressed)
Retention: 1 year active, 7 years archived
Clustering: Clustered by account_id and date for query performance

Load Balancing & CDN

Application Load Balancer

ALB Configuration:

Scheme: Internet-facing
IP Address Type: IPv4
Availability Zones: us-east-1a, us-east-1b
Security Groups: sg-alb-https (443 from 0.0.0.0/0)
Listeners:
  - Port 443 (HTTPS) → Forward to target groups based on path
  - Port 80 (HTTP) → Redirect to 443
SSL Certificate: *.cloudkeeper.com (ACM)
Idle Timeout: 60 seconds
Access Logs: Enabled (to S3)

Target Group Health Checks:

Target Group	Protocol	Path	Interval	Timeout	Healthy Threshold	Unhealthy Threshold
API	HTTP	/api/health	30s	5s	2	3
Web	HTTP	/	30s	5s	2	3
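The interval and thresholds above determine roughly how long the load balancer takes to react to a failing or recovering target. A small sketch of that arithmetic follows. It is an approximation: it counts consecutive check intervals only and ignores the check timeout and connection draining, and the helper name is invented for this example.

```python
def detection_seconds(interval_s: int, threshold: int) -> int:
    """Approximate time for `threshold` consecutive checks at a fixed interval."""
    if interval_s <= 0 or threshold <= 0:
        raise ValueError("interval and threshold must be positive")
    return interval_s * threshold


# Values shared by the API and Web target groups in the table above.
INTERVAL_S = 30
HEALTHY_THRESHOLD = 2
UNHEALTHY_THRESHOLD = 3

if __name__ == "__main__":
    print("marked unhealthy after ~", detection_seconds(INTERVAL_S, UNHEALTHY_THRESHOLD), "s")
    print("marked healthy after ~", detection_seconds(INTERVAL_S, HEALTHY_THRESHOLD), "s")
```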

CloudFront CDN

CloudFront Configuration:

Distribution Domain: app.cloudkeeper.com
Origins:
  1. ALB (Dynamic Content)
     - Origin Protocol: HTTPS only
     - Origin Path: /
     - Cache Behavior: No caching (pass-through)

  2. S3 (Static Content)
     - Origin Protocol: HTTPS only
     - Origin Path: /static
     - Cache Behavior: Cache for 1 year
     - Compress: Yes

Price Class: Use all edge locations
SSL Certificate: *.cloudkeeper.com (ACM)
HTTP Version: HTTP/2 enabled
WAF: Enabled (AWS WAF Web ACL)

Auto-Scaling Configuration

ECS Service Auto-Scaling

Scaling Policies:

Service	Metric	Target	Scale Out	Scale In	Min	Max
API	CPU Utilization	70%	+2 tasks	-1 task	3	10
API	Request Count	1000/min	+1 task	-1 task	3	10
Worker	Queue Depth	100 msgs	+2 tasks	-1 task	2	8
Web	CPU Utilization	70%	+1 task	-1 task	2	5

Cooldown Periods:

Scale out cooldown: 60 seconds
Scale in cooldown: 300 seconds (5 minutes)
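The interaction between step sizes, task limits, and cooldowns is easier to follow as a small simulation. The sketch below models one scaling decision for the API service's CPU policy. It is a simplified model, not how Application Auto Scaling is implemented: the scale-in trigger of 50% is a hypothetical value (the table only gives the 70% target), and the class and function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    target: float           # scale out when the metric exceeds this
    scale_in_below: float   # hypothetical lower bound that triggers scale in
    scale_out_step: int
    scale_in_step: int
    min_tasks: int
    max_tasks: int
    scale_out_cooldown_s: int = 60
    scale_in_cooldown_s: int = 300


def next_task_count(policy: ScalingPolicy, current: int, metric: float,
                    seconds_since_last_action: int) -> int:
    """Return the task count after one evaluation of the policy."""
    if metric > policy.target:
        if seconds_since_last_action >= policy.scale_out_cooldown_s:
            return min(current + policy.scale_out_step, policy.max_tasks)
        return current
    if metric < policy.scale_in_below:
        if seconds_since_last_action >= policy.scale_in_cooldown_s:
            return max(current - policy.scale_in_step, policy.min_tasks)
    return current


# API service, CPU utilization row; scale_in_below is an assumption.
API_CPU = ScalingPolicy(target=70, scale_in_below=50, scale_out_step=2,
                        scale_in_step=1, min_tasks=3, max_tasks=10)

if __name__ == "__main__":
    print(next_task_count(API_CPU, current=3, metric=85, seconds_since_last_action=120))
    print(next_task_count(API_CPU, current=5, metric=40, seconds_since_last_action=400))
```

The longer scale-in cooldown means capacity is added quickly under load but removed slowly, which avoids flapping when traffic is bursty.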

Disaster Recovery Infrastructure

Cross-Region Replication

RTO & RPO:

Component	RPO	RTO	Recovery Method
Application (ECS)	0	5 min	Multi-AZ auto-recovery
RDS MySQL	0	15 min	Multi-AZ automatic failover
Regional Outage	1 hour	4 hours	Manual failover to us-west-2
Data Corruption	24 hours	24 hours	Point-in-time restore from backup

Monitoring Infrastructure

CloudWatch Dashboards

Production Dashboard:

Log Aggregation

CloudWatch Logs:

Log Group	Source	Retention	Purpose
/ecs/lens-api	ECS Tasks	30 days	Application logs
/ecs/lens-worker	ECS Tasks	30 days	Background job logs
/aws/lambda/cur-parser	Lambda	14 days	CUR parsing logs
/aws/rds/instance/lens-mysql/error	RDS	7 days	Database error logs
lens-alb-access-logs	ALB (delivered to S3)	90 days	HTTP access logs

Cost Optimization

Reserved Capacity

RDS Reserved Instances:

Instance: db.r6g.2xlarge
Term: 3 years, All Upfront
Savings: 63% vs On-Demand ($30K/year → $11K/year)

ElastiCache Reserved Nodes:

Node Type: cache.r6g.xlarge × 6 nodes
Term: 1 year, Partial Upfront
Savings: 35% vs On-Demand ($18K/year → $12K/year)

Compute Savings Plans

ECS Fargate Compute Savings Plan:

Commitment: $500/hour
Term: 1 year, No Upfront
Savings: 20% vs On-Demand ($4.4M/year → $3.5M/year)
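The savings percentages in this section can be recomputed from the quoted annual figures, which is a useful check when prices or commitments change. The sketch below does that arithmetic and annualizes the hourly Savings Plan commitment. The helper names are illustrative. Because the dollar amounts in the text are rounded, the recomputed percentages can differ slightly from the stated ones; the ElastiCache figures, for example, work out to about 33% rather than 35%.

```python
HOURS_PER_YEAR = 8760


def savings_pct(on_demand: float, discounted: float) -> float:
    """Return the percentage saved relative to the On-Demand cost."""
    if on_demand <= 0:
        raise ValueError("on_demand must be positive")
    return (on_demand - discounted) / on_demand * 100


def annualized(hourly_commitment: float) -> float:
    """Convert an hourly commitment into a yearly amount."""
    return hourly_commitment * HOURS_PER_YEAR


if __name__ == "__main__":
    print("RDS reserved:", round(savings_pct(30_000, 11_000), 1))
    print("ElastiCache reserved:", round(savings_pct(18_000, 12_000), 1))
    print("Fargate savings plan:", round(savings_pct(4_400_000, 3_500_000), 1))
    print("Annual commitment at $500/hour:", annualized(500))
```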

Spot Instances

EMR Task Nodes:

On-Demand: 2 core nodes (m5.2xlarge) = $0.768/hour
Spot: 0-20 task nodes (m5.2xlarge) = $0.154/hour (80% savings)
Monthly Savings: ~$8,000

Batch Jobs:

Fargate Spot: Up to 70% discount vs Fargate On-Demand
Use Case: Non-critical, fault-tolerant workloads

Infrastructure as Code

Terraform

Repository Structure:

terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── dr/
├── modules/
│   ├── networking/    # VPC, subnets, route tables
│   ├── compute/       # ECS, Lambda, Batch
│   ├── database/      # RDS, Redis, MongoDB
│   ├── storage/       # S3 buckets
│   ├── loadbalancer/  # ALB, target groups
│   ├── cdn/           # CloudFront
│   └── monitoring/    # CloudWatch, alarms
└── global/
    ├── iam/           # IAM roles, policies
    └── route53/       # DNS records

State Management:

Backend: S3 with DynamoDB locking
Encryption: Server-side encryption enabled
Versioning: Enabled for rollback

Operational Procedures

Deployment Pipeline

Backup & Restore Procedures

Daily Backups:

# RDS Automated Backup (daily at 03:00 UTC)
# Retention: 30 days
# Restore procedure:
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier lens-mysql-prod \
  --target-db-instance-identifier lens-mysql-restored \
  --restore-time 2025-10-25T03:00:00Z

# MongoDB Backup (via Atlas, daily at 04:00 UTC)
# Retention: 30 days
# Restore via Atlas UI or API

-- Snowflake Time Travel (7 days retention)
-- Restore procedure (SQL, run in Snowflake):
CREATE TABLE restored_table CLONE original_table
  AT (TIMESTAMP => '2025-10-25 03:00:00'::timestamp);
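A point-in-time restore only works for timestamps inside the backup retention window, so it can help to validate the requested time before calling the CLI. The sketch below builds the same `aws rds restore-db-instance-to-point-in-time` command shown above as an argument list and rejects timestamps outside the 30-day retention. The function name is an assumption for this example, and the check is simplified: the real earliest and latest restorable times are reported by RDS itself.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # RDS backup retention stated above


def build_rds_restore_command(source: str, target: str,
                              restore_time: datetime, now: datetime = None) -> list:
    """Return the AWS CLI argument list for a point-in-time restore."""
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware")
    now = now or datetime.now(timezone.utc)
    if restore_time > now:
        raise ValueError("restore_time is in the future")
    if now - restore_time > RETENTION:
        raise ValueError("restore_time is older than the retention window")
    stamp = restore_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        "aws", "rds", "restore-db-instance-to-point-in-time",
        "--source-db-instance-identifier", source,
        "--target-db-instance-identifier", target,
        "--restore-time", stamp,
    ]


if __name__ == "__main__":
    when = datetime(2025, 10, 25, 3, 0, tzinfo=timezone.utc)
    today = datetime(2025, 10, 26, 12, 0, tzinfo=timezone.utc)
    print(" ".join(build_rds_restore_command(
        "lens-mysql-prod", "lens-mysql-restored", when, now=today)))
```

The returned list can be handed to `subprocess.run` on a machine with the AWS CLI installed and suitable credentials.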

Performance Benchmarks

API Performance

Endpoint	p50	p95	p99	Max RPS
GET /api/v1/costs	45ms	120ms	250ms	5,000
GET /api/v1/dashboards	80ms	200ms	400ms	2,000
POST /api/v1/reports	150ms	350ms	800ms	500
GET /api/v1/recommendations	100ms	250ms	500ms	1,000
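For readers less familiar with the p50/p95/p99 columns, the sketch below shows one common way to compute such percentiles from raw latency samples, the nearest-rank method. It is not necessarily how the figures in the table were produced; the monitoring stack may interpolate or use approximate histograms, and the sample data here is synthetic.

```python
import math


def percentile(samples, p: float) -> float:
    """Return the p-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic latencies in milliseconds: 1, 2, ..., 100.
    latencies_ms = list(range(1, 101))
    for p in (50, 95, 99):
        print(f"p{p}:", percentile(latencies_ms, p))
```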

Database Performance

MySQL:

Connections: 100 concurrent connections per app instance
Query Performance: p95 < 50ms for indexed queries
Replication Lag: < 1 second (Multi-AZ)

Redis:

Get Operations: p99 < 1ms
Cache Hit Rate: 85%+
Throughput: 100,000 ops/sec

Snowflake:

Query Performance: p95 < 5 seconds for dashboard queries
Concurrent Queries: 50+ simultaneous queries
Data Scan: 1 TB/min with Large warehouse

Next Steps

Solution Architecture - High-level architecture
Logical Architecture - Component design
Security Architecture - Security details
Developer Quickstart - Development setup

This physical architecture reflects the current production deployment as of October 2025. Infrastructure specifications may change based on growth and optimization opportunities.
