
Physical Architecture

Overview

This document describes the physical infrastructure deployment of AWS Lens, including AWS services used, network topology, compute resources, storage systems, and operational procedures.


Deployment Model

AWS Lens supports three deployment models:

1. Multi-Tenant SaaS (Primary)

  • Shared infrastructure across customers
  • Data isolation via encryption and access controls
  • Most cost-effective option
  • Managed by CloudKeeper

2. Single-Tenant SaaS (Enterprise)

  • Dedicated infrastructure per customer
  • Complete isolation
  • Custom configurations allowed
  • Managed by CloudKeeper

3. On-Premises / VPC (Special Cases)

  • Deployed in customer's AWS account
  • Customer-managed infrastructure
  • Air-gapped environments
  • Full control

This document focuses on the Multi-Tenant SaaS deployment.


AWS Region Strategy

Primary Region: us-east-1 (N. Virginia)

Rationale:

  • Lowest AWS pricing for most services
  • Broadest service availability
  • Best connectivity to Snowflake (us-east-1)
  • Closest to majority of customers (US-based)

Secondary Region: us-west-2 (Oregon)

Purpose:

  • Disaster recovery (DR)
  • Regional redundancy
  • Compliance (data residency)

Future Expansion

Planned Regions:

  • eu-west-1 (Ireland): European customers, GDPR compliance
  • ap-south-1 (Mumbai): Asia-Pacific customers, data residency

Network Architecture

VPC Design

Network Specifications

VPC CIDR: 10.0.0.0/16 (65,536 IP addresses)

Subnet Allocation:

| Tier | Availability Zone | CIDR | Usable IPs | Purpose |
|---|---|---|---|---|
| Public | us-east-1a | 10.0.1.0/24 | 251 | ALB, NAT Gateway |
| Public | us-east-1b | 10.0.2.0/24 | 251 | ALB, NAT Gateway |
| Application | us-east-1a | 10.0.11.0/24 | 251 | ECS Tasks, Lambda |
| Application | us-east-1b | 10.0.12.0/24 | 251 | ECS Tasks, Lambda |
| Data | us-east-1a | 10.0.21.0/24 | 251 | RDS, Redis, MongoDB |
| Data | us-east-1b | 10.0.22.0/24 | 251 | RDS, Redis, MongoDB |
| Processing | us-east-1a | 10.0.31.0/24 | 251 | Spark, Airflow, Batch |
| Processing | us-east-1b | 10.0.32.0/24 | 251 | Spark, Airflow, Batch |
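
The subnet plan above can be sanity-checked with Python's standard `ipaddress` module. This is a minimal sketch, not part of the deployed tooling: the `SUBNETS` mapping simply restates the table, and the helper names are hypothetical. It relies on the fact that AWS reserves five addresses in every subnet, which is why a /24 yields 251 usable IPs rather than 256.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one

# Restates the subnet allocation table: (tier, AZ) -> CIDR
SUBNETS = {
    ("Public", "us-east-1a"): "10.0.1.0/24",
    ("Public", "us-east-1b"): "10.0.2.0/24",
    ("Application", "us-east-1a"): "10.0.11.0/24",
    ("Application", "us-east-1b"): "10.0.12.0/24",
    ("Data", "us-east-1a"): "10.0.21.0/24",
    ("Data", "us-east-1b"): "10.0.22.0/24",
    ("Processing", "us-east-1a"): "10.0.31.0/24",
    ("Processing", "us-east-1b"): "10.0.32.0/24",
}


def usable_ips(cidr: str) -> int:
    """Addresses left for workloads after AWS's per-subnet reservation."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def validate_plan(subnets: dict, vpc=VPC_CIDR) -> dict:
    """Check that every subnet sits inside the VPC and that none overlap."""
    nets = [ipaddress.ip_network(cidr) for cidr in subnets.values()]
    for net in nets:
        if not net.subnet_of(vpc):
            raise ValueError(f"{net} is outside {vpc}")
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                raise ValueError(f"{a} overlaps {b}")
    return {key: usable_ips(cidr) for key, cidr in subnets.items()}


if __name__ == "__main__":
    for (tier, az), count in validate_plan(SUBNETS).items():
        print(f"{tier:<12} {az}  {count} usable IPs")
```

The eight /24 subnets use only a small part of the /16, which leaves room for additional tiers or Availability Zones.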

Security Groups


Compute Infrastructure

Application Layer - ECS Fargate

ECS Services:

| Service | Task Count | vCPU | Memory | Auto-Scaling | Purpose |
|---|---|---|---|---|---|
| api-service | 3-10 | 2 | 4 GB | CPU > 70% | REST API |
| web-service | 2-5 | 1 | 2 GB | CPU > 70% | React frontend serving |
| worker-service | 2-8 | 4 | 8 GB | Queue depth > 100 | Background processing |
| scheduler-service | 1 | 1 | 2 GB | No auto-scaling | Airflow scheduler |

Task Definitions:

# Example: api-service task definition
family: lens-api-service
networkMode: awsvpc
requiresCompatibilities:
  - FARGATE
cpu: 2048     # 2 vCPU
memory: 4096  # 4 GB
containerDefinitions:
  - name: api
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/lens-api:latest
    portMappings:
      - containerPort: 8080
        protocol: tcp
    environment:
      - name: SPRING_PROFILES_ACTIVE
        value: production
      - name: DB_HOST
        value: lens-mysql.cluster-xxxxx.us-east-1.rds.amazonaws.com
    secrets:
      - name: DB_PASSWORD
        valueFrom: arn:aws:secretsmanager:us-east-1:123456789:secret:lens/db-password
    logConfiguration:
      logDriver: awslogs
      options:
        awslogs-group: /ecs/lens-api
        awslogs-region: us-east-1
        awslogs-stream-prefix: ecs
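
Fargate only accepts certain CPU and memory pairings, so it can help to check the task sizes before registering a task definition. The sketch below is illustrative only: `FARGATE_MEMORY_MIB` covers just the CPU sizes used by the services in this document (larger Fargate sizes exist and are omitted), and the `SERVICES` mapping restates the ECS services table in CPU units and MiB.

```python
# Valid task-level memory values (MiB) per CPU setting (CPU units; 1024 = 1 vCPU).
# Only the sizes relevant to this document are listed.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}

# Restates the ECS services table: service -> (cpu units, memory MiB)
SERVICES = {
    "api-service": (2048, 4096),
    "web-service": (1024, 2048),
    "worker-service": (4096, 8192),
    "scheduler-service": (1024, 2048),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """Return True if the cpu/memory pair is one Fargate accepts (for the sizes listed)."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu, [])


if __name__ == "__main__":
    for name, (cpu, mem) in SERVICES.items():
        status = "ok" if is_valid_fargate_size(cpu, mem) else "INVALID"
        print(f"{name:<18} cpu={cpu:<5} memory={mem:<5} {status}")
```

Each service in the table uses the smallest memory setting allowed for its CPU size, so memory can be raised later without changing the CPU allocation.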

Processing Layer - EMR & Batch

Amazon EMR (Spark):

  • Cluster Type: Transient (spins up for jobs, terminates after)
  • Master: 1x m5.xlarge (4 vCPU, 16 GB)
  • Core: 2-10x m5.2xlarge (8 vCPU, 32 GB) - auto-scaling
  • Task: 0-20x m5.2xlarge Spot instances (80% cost reduction)
  • Use Case: Daily CUR processing, aggregation jobs

AWS Batch:

  • Compute Environment: Fargate Spot
  • Max vCPUs: 256
  • Use Case: Recommendation engine, ad-hoc analysis

Serverless Compute - Lambda

| Function | Runtime | Memory | Timeout | Trigger | Purpose |
|---|---|---|---|---|---|
| cur-detector | Python 3.11 | 512 MB | 30s | S3 Event | Detect new CUR files |
| cur-parser | Python 3.11 | 3 GB | 15 min | SQS | Parse CUR files |
| alert-sender | Node.js 20 | 256 MB | 10s | EventBridge | Send notifications |
| report-generator | Python 3.11 | 2 GB | 5 min | EventBridge | Generate scheduled reports |
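
To illustrate the S3 Event trigger, the following sketch shows how a function like cur-detector might pick CUR objects out of an S3 notification payload. It is not the production handler: the function name `extract_cur_objects` and the file suffix filter are assumptions, and the step that hands work to cur-parser via SQS is deliberately left out. Object keys in S3 notifications arrive URL-encoded, so they are decoded before matching.

```python
from urllib.parse import unquote_plus


def extract_cur_objects(event: dict, suffixes=(".csv.gz", ".parquet")) -> list:
    """Return (bucket, key) pairs for objects in an S3 event that look like CUR files.

    The suffix list is an assumption for illustration; adjust it to the
    report format actually configured.
    """
    found = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Keys are URL-encoded in S3 notifications (spaces arrive as '+').
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.endswith(suffixes):
            found.append((bucket, key))
    return found


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "lens-customer-cur"},
                    "object": {"key": "acct-1/2025-10/report+part-1.csv.gz"}}},
            {"s3": {"bucket": {"name": "lens-customer-cur"},
                    "object": {"key": "acct-1/2025-10/manifest.json"}}},
        ]
    }
    print(extract_cur_objects(sample_event))
```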

Storage Infrastructure

Block Storage - EBS

Use Cases:

  • RDS database storage
  • MongoDB persistent volumes
  • Temporary processing storage

Volumes:

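| Volume | Type | Size | IOPS | Throughput | Purpose |
|---|---|---|---|---|---|
| RDS Primary | gp3 | 1 TB | 12,000 | 250 MB/s | MySQL database |
| RDS Standby | gp3 | 1 TB | 12,000 | 250 MB/s | MySQL replica |
| MongoDB | gp3 | 500 GB | 8,000 | 200 MB/s | Document storage |
| Processing | gp3 | 2 TB | 16,000 | 500 MB/s | Spark temp storage |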

Object Storage - S3

Bucket Configuration:

| Bucket | Storage Class | Versioning | Encryption | Cross-Region Replication |
|---|---|---|---|---|
| lens-customer-cur | S3 Standard | Disabled | AES-256 (SSE-S3) | No |
| lens-processed-data | S3 Intelligent-Tiering | Enabled | AES-256 (SSE-KMS) | Yes (to us-west-2) |
| lens-backups | S3 Standard → Glacier | Enabled | AES-256 (SSE-KMS) | Yes (to us-west-2) |
| lens-logs | S3 Standard | Disabled | AES-256 (SSE-S3) | No |
| lens-static-assets | S3 Standard | Enabled | AES-256 (SSE-S3) | No (served via CloudFront) |

Database Infrastructure

Relational Database - Amazon RDS MySQL

Instance Configuration:

Engine: MySQL 8.0.35
Instance Class: db.r6g.2xlarge (8 vCPU, 64 GB RAM)
Deployment: Multi-AZ (Primary in us-east-1a, Standby in us-east-1b)
Storage: 1 TB gp3 (12,000 IOPS, 250 MB/s throughput)
Backup Retention: 30 days
Automated Backups: Daily at 03:00 UTC
Read Replicas: 1 (for reporting queries)

High Availability:

Performance Tuning:

  • Connection Pooling: 100 connections per app instance
  • Query Cache: Not applicable (the query cache was removed in MySQL 8.0); Redis handles caching instead
  • InnoDB Buffer Pool: 48 GB (75% of RAM)
  • Read Replicas: Offload reporting queries
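
The tuning figures above lend themselves to a quick capacity check. The sketch below is a back-of-the-envelope calculation under stated assumptions: it assumes only the API and worker services hold database connection pools (the source does not say which services connect to MySQL), and it uses the maximum task counts from the ECS services table. The function names are hypothetical.

```python
def buffer_pool_gb(ram_gb: float, fraction: float = 0.75) -> float:
    """InnoDB buffer pool size for a given share of instance RAM."""
    return ram_gb * fraction


def worst_case_connections(pool_size: int, max_tasks_by_service: dict) -> int:
    """Upper bound on open connections if every task fills its pool."""
    return pool_size * sum(max_tasks_by_service.values())


if __name__ == "__main__":
    # db.r6g.2xlarge has 64 GB RAM; 75% of that is the documented 48 GB.
    print(f"buffer pool: {buffer_pool_gb(64):.0f} GB")

    # Assumption: api-service (max 10 tasks) and worker-service (max 8 tasks)
    # each keep a 100-connection pool per task.
    peak = worst_case_connections(100, {"api-service": 10, "worker-service": 8})
    print(f"worst-case connections: {peak}")
```

Comparing the worst-case total with the instance's `max_connections` setting shows whether the per-instance pool size still fits when every service is scaled to its maximum.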

Cache - Amazon ElastiCache Redis

Cluster Configuration:

Engine: Redis 7.0
Node Type: cache.r6g.xlarge (4 vCPU, 26.32 GB RAM)
Deployment: Cluster mode enabled
Shards: 3
Replicas per Shard: 1
Total Nodes: 6 (3 primary + 3 replicas)
Multi-AZ: Enabled
Encryption: In-transit and at-rest
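
With cluster mode enabled, Redis assigns every key to one of 16,384 hash slots using a CRC16 of the key, and each shard owns a range of slots. The sketch below reproduces that mapping with the standard library (`binascii.crc_hqx` with an initial value of 0 is the CRC16 variant Redis Cluster specifies). The `shard_for_slot` helper assumes slots are split into three even, contiguous ranges, which is an assumption about this cluster rather than something stated in the source, and the key names are made up.

```python
import binascii

TOTAL_SLOTS = 16384


def key_hash_slot(key: str) -> int:
    """Hash slot for a key, honouring Redis Cluster hash tags ({...})."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is used.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % TOTAL_SLOTS


def shard_for_slot(slot: int, shards: int = 3) -> int:
    """Shard index for a slot, ASSUMING even contiguous slot ranges."""
    per_shard = TOTAL_SLOTS // shards
    return min(slot // per_shard, shards - 1)


if __name__ == "__main__":
    for key in ("{acct42}:costs", "{acct42}:dashboards", "session:abc"):
        slot = key_hash_slot(key)
        print(f"{key:<22} slot={slot:<6} shard={shard_for_slot(slot)}")
```

Keys that share a hash tag, such as `{acct42}` here, land in the same slot, which allows multi-key operations on one account's data to be served by a single shard.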

Redis Cluster Architecture:

Document Store - MongoDB Atlas

Cluster Configuration:

Provider: AWS
Region: us-east-1
Tier: M30 (2 vCPU, 8 GB RAM per node)
Deployment: Replica Set (3 nodes)
  - Primary: us-east-1a
  - Secondary 1: us-east-1b
  - Secondary 2: us-east-1c
Storage: 512 GB
Backups: Continuous (point-in-time recovery)

Why MongoDB Atlas (not self-managed):

  • Fully managed (auto-upgrades, monitoring, backups)
  • Better security (encryption, access controls)
  • Lower operational overhead
  • Cost-effective for < 1 TB data

Data Warehouse - Snowflake

Warehouse Configuration:

Cloud Provider: AWS
Region: us-east-1
Account Edition: Enterprise

Virtual Warehouses:
1. INGESTION_WH (Medium, auto-suspend after 5 min)
   - Purpose: Load CUR data
   - Runs: 24/7, suspends when idle

2. ANALYTICS_WH (Large, auto-suspend after 10 min)
   - Purpose: Dashboard queries, aggregations
   - Runs: Business hours, auto-resumes on query

3. REPORTING_WH (X-Small, auto-suspend after 5 min)
   - Purpose: Scheduled reports
   - Runs: Scheduled jobs only

4. AD_HOC_WH (Large, auto-suspend after 2 min)
   - Purpose: User ad-hoc queries
   - Runs: On-demand

Storage:

  • Database Size: ~10 TB (compressed)
  • Retention: 1 year active, 7 years archived
  • Clustering: Clustered by account_id and date for query performance

Load Balancing & CDN

Application Load Balancer

ALB Configuration:

Scheme: Internet-facing
IP Address Type: IPv4
Availability Zones: us-east-1a, us-east-1b
Security Groups: sg-alb-https (443 from 0.0.0.0/0)
Listeners:
  - Port 443 (HTTPS) → Forward to target groups based on path
  - Port 80 (HTTP) → Redirect to 443
SSL Certificate: *.cloudkeeper.com (ACM)
Idle Timeout: 60 seconds
Access Logs: Enabled (to S3)

Target Group Health Checks:

| Target Group | Protocol | Path | Interval | Timeout | Healthy Threshold | Unhealthy Threshold |
|---|---|---|---|---|---|---|
| API | HTTP | /api/health | 30s | 5s | 2 | 3 |
| Web | HTTP | / | 30s | 5s | 2 | 3 |
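
The health check settings determine roughly how long the ALB takes to change its view of a target. As a rough sketch (actual timing also depends on when the failure occurs relative to the check schedule and on the 5-second timeout), the detection window is the interval multiplied by the relevant threshold. The function name is hypothetical.

```python
def detection_window_seconds(interval_s: int, threshold: int) -> int:
    """Approximate time for `threshold` consecutive checks at a fixed interval."""
    return interval_s * threshold


if __name__ == "__main__":
    interval = 30  # seconds, from the health check table
    print("mark unhealthy after ~", detection_window_seconds(interval, 3), "s")
    print("mark healthy after   ~", detection_window_seconds(interval, 2), "s")
```

With these values a failing task keeps receiving traffic for roughly 90 seconds, and a new task needs about 60 seconds of passing checks before it is put into rotation.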

CloudFront CDN

CloudFront Configuration:

Distribution Domain: app.cloudkeeper.com
Origins:
1. ALB (Dynamic Content)
   - Origin Protocol: HTTPS only
   - Origin Path: /
   - Cache Behavior: No caching (pass-through)

2. S3 (Static Content)
   - Origin Protocol: HTTPS only
   - Origin Path: /static
   - Cache Behavior: Cache for 1 year
   - Compress: Yes

Price Class: Use all edge locations
SSL Certificate: *.cloudkeeper.com (ACM)
HTTP Version: HTTP/2 enabled
WAF: Enabled (AWS WAF Web ACL)

Auto-Scaling Configuration

ECS Service Auto-Scaling

Scaling Policies:

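| Service | Metric | Target | Scale Out | Scale In | Min | Max |
|---|---|---|---|---|---|---|
| API | CPU Utilization | 70% | +2 tasks | -1 task | 3 | 10 |
| API | Request Count | 1000/min | +1 task | -1 task | 3 | 10 |
| Worker | Queue Depth | 100 msgs | +2 tasks | -1 task | 2 | 8 |
| Web | CPU Utilization | 70% | +1 task | -1 task | 2 | 5 |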

Cooldown Periods:

  • Scale out cooldown: 60 seconds
  • Scale in cooldown: 300 seconds (5 minutes)
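
The interaction between the scaling steps, the min/max bounds, and the cooldown periods can be sketched as a small decision function. This is a simplified model for illustration, not how Application Auto Scaling is implemented: it evaluates a single metric sample against the target, and the `Policy` class and `next_task_count` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    target: float
    scale_out_step: int
    scale_in_step: int
    min_tasks: int
    max_tasks: int
    scale_out_cooldown_s: int = 60   # from the cooldown list
    scale_in_cooldown_s: int = 300   # from the cooldown list


def next_task_count(policy: Policy, current: int, metric: float,
                    seconds_since_last_action: int) -> int:
    """Desired task count after one evaluation of a single metric sample."""
    if metric > policy.target:
        if seconds_since_last_action >= policy.scale_out_cooldown_s:
            return min(current + policy.scale_out_step, policy.max_tasks)
    elif metric < policy.target:
        if seconds_since_last_action >= policy.scale_in_cooldown_s:
            return max(current - policy.scale_in_step, policy.min_tasks)
    return current  # at target, or still inside a cooldown window


# API service, CPU utilization row of the scaling policy table.
API_CPU = Policy(target=70, scale_out_step=2, scale_in_step=1,
                 min_tasks=3, max_tasks=10)

if __name__ == "__main__":
    print(next_task_count(API_CPU, current=3, metric=85, seconds_since_last_action=120))
    print(next_task_count(API_CPU, current=5, metric=40, seconds_since_last_action=120))
```

The asymmetric cooldowns mean the service adds capacity quickly under load but sheds it slowly, which reduces oscillation when traffic fluctuates around the target.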

Disaster Recovery Infrastructure

Cross-Region Replication

RTO & RPO:

| Component | RPO | RTO | Recovery Method |
|---|---|---|---|
| Application (ECS) | 0 | 5 min | Multi-AZ auto-recovery |
| RDS MySQL | 0 | 15 min | Multi-AZ automatic failover |
| Regional Outage | 1 hour | 4 hours | Manual failover to us-west-2 |
| Data Corruption | 24 hours | 24 hours | Point-in-time restore from backup |

Monitoring Infrastructure

CloudWatch Dashboards

Production Dashboard:

Log Aggregation

CloudWatch Logs:

| Log Group | Source | Retention | Purpose |
|---|---|---|---|
| /ecs/lens-api | ECS Tasks | 30 days | Application logs |
| /ecs/lens-worker | ECS Tasks | 30 days | Background job logs |
| /aws/lambda/cur-parser | Lambda | 14 days | CUR parsing logs |
| /aws/rds/instance/lens-mysql/error | RDS | 7 days | Database error logs |
| lens-alb-access-logs | ALB | 90 days | HTTP access logs |

Cost Optimization

Reserved Capacity

RDS Reserved Instances:

  • Instance: db.r6g.2xlarge
  • Term: 3 years, All Upfront
  • Savings: 63% vs On-Demand ($30K/year → $11K/year)

ElastiCache Reserved Nodes:

  • Node Type: cache.r6g.xlarge × 6 nodes
  • Term: 1 year, Partial Upfront
  • Savings: 35% vs On-Demand ($18K/year → $12K/year)

Compute Savings Plans

ECS Fargate Compute Savings Plan:

  • Commitment: $500/hour
  • Term: 1 year, No Upfront
  • Savings: 20% vs On-Demand ($4.4M/year → $3.5M/year)

Spot Instances

EMR Task Nodes:

  • On-Demand: 2 core nodes (m5.2xlarge) = $0.768/hour
  • Spot: 0-20 task nodes (m5.2xlarge) = $0.154/hour (80% savings)
  • Monthly Savings: ~$8,000

Batch Jobs:

  • Fargate Spot: 70% discount vs Fargate On-Demand
  • Use Case: Non-critical, fault-tolerant workloads
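
The savings figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the dollar amounts quoted above; the helper name `savings_pct` is hypothetical. Because the quoted amounts are rounded, the computed percentages can differ slightly from the stated ones (for example, $18K → $12K works out to about 33%). The $500/hour commitment also annualizes to roughly $4.38M, which sits close to the quoted On-Demand figure; the source does not say whether the commitment is expressed before or after the discount.

```python
HOURS_PER_YEAR = 8760


def savings_pct(on_demand: float, discounted: float) -> float:
    """Percentage saved relative to the On-Demand cost, rounded to one decimal."""
    return round((1 - discounted / on_demand) * 100, 1)


if __name__ == "__main__":
    print("RDS reserved instance:     ", savings_pct(30_000, 11_000), "%")
    print("ElastiCache reserved nodes:", savings_pct(18_000, 12_000), "%")
    print("Compute Savings Plan:      ", savings_pct(4_400_000, 3_500_000), "%")
    print("EMR Spot vs On-Demand rate:", savings_pct(0.768, 0.154), "%")
    print("Commitment per year:       $", 500 * HOURS_PER_YEAR)
```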

Infrastructure as Code

Terraform

Repository Structure:

terraform/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── dr/
├── modules/
│   ├── networking/     # VPC, subnets, route tables
│   ├── compute/        # ECS, Lambda, Batch
│   ├── database/       # RDS, Redis, MongoDB
│   ├── storage/        # S3 buckets
│   ├── loadbalancer/   # ALB, target groups
│   ├── cdn/            # CloudFront
│   └── monitoring/     # CloudWatch, alarms
└── global/
    ├── iam/            # IAM roles, policies
    └── route53/        # DNS records

State Management:

  • Backend: S3 with DynamoDB locking
  • Encryption: Server-side encryption enabled
  • Versioning: Enabled for rollback

Operational Procedures

Deployment Pipeline

Backup & Restore Procedures

Daily Backups:

# RDS Automated Backup (daily at 03:00 UTC)
# Retention: 30 days
# Restore procedure (shell, AWS CLI):
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier lens-mysql-prod \
  --target-db-instance-identifier lens-mysql-restored \
  --restore-time 2025-10-25T03:00:00Z

# MongoDB Backup (via Atlas, daily at 04:00 UTC)
# Retention: 30 days
# Restore via Atlas UI or API

-- Snowflake Time Travel (7 days retention)
-- Restore procedure (SQL, run in Snowflake):
CREATE TABLE restored_table CLONE original_table
  AT (TIMESTAMP => '2025-10-25 03:00:00'::timestamp);
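
A restore request is only useful if the target time falls inside the backup retention window. The sketch below shows one way to validate the timestamp and assemble the AWS CLI arguments used in the RDS procedure above. It is illustrative rather than an existing runbook script: `build_restore_command` is a hypothetical helper, the 30-day default mirrors the documented retention, and the function only builds the argument list without executing anything.

```python
from datetime import datetime, timedelta, timezone


def build_restore_command(source: str, target: str, restore_time: datetime,
                          now: datetime, retention_days: int = 30) -> list:
    """Return the AWS CLI argument list for a point-in-time restore.

    Raises ValueError if restore_time is naive, in the future, or older
    than the retention window.
    """
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware")
    if restore_time > now:
        raise ValueError("restore_time is in the future")
    if restore_time < now - timedelta(days=retention_days):
        raise ValueError("restore_time is outside the retention window")
    stamp = restore_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        "aws", "rds", "restore-db-instance-to-point-in-time",
        "--source-db-instance-identifier", source,
        "--target-db-instance-identifier", target,
        "--restore-time", stamp,
    ]


if __name__ == "__main__":
    now = datetime(2025, 10, 26, 12, 0, tzinfo=timezone.utc)
    when = datetime(2025, 10, 25, 3, 0, tzinfo=timezone.utc)
    print(" ".join(build_restore_command("lens-mysql-prod", "lens-mysql-restored", when, now)))
```

Returning a list instead of a single string makes it straightforward to pass the command to `subprocess.run` without shell quoting issues.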

Performance Benchmarks

API Performance

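| Endpoint | p50 | p95 | p99 | Max RPS |
|---|---|---|---|---|
| GET /api/v1/costs | 45ms | 120ms | 250ms | 5,000 |
| GET /api/v1/dashboards | 80ms | 200ms | 400ms | 2,000 |
| POST /api/v1/reports | 150ms | 350ms | 800ms | 500 |
| GET /api/v1/recommendations | 100ms | 250ms | 500ms | 1,000 |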

Database Performance

MySQL:

  • Connections: 100 concurrent connections per app instance
  • Query Performance: p95 < 50ms for indexed queries
  • Replication Lag: < 1 second (read replica; the Multi-AZ standby is replicated synchronously)

Redis:

  • Get Operations: p99 < 1ms
  • Cache Hit Rate: 85%+
  • Throughput: 100,000 ops/sec

Snowflake:

  • Query Performance: p95 < 5 seconds for dashboard queries
  • Concurrent Queries: 50+ simultaneous queries
  • Data Scan: 1 TB/min with Large warehouse


This physical architecture reflects the current production deployment as of October 2025. Infrastructure specifications may change based on growth and optimization opportunities.