
Architecture Overview


Module: Tuner
Platform: CloudKeeper
Version: 1.0.0-RELEASE
Last Updated: October 26, 2025
Document Type: Solution Architecture (Technical Overview)


Table of Contents

  1. Introduction
  2. High-Level Architecture
  3. Core Components
  4. Technology Stack
  5. Data Flow Architecture
  6. Integration Architecture
  7. Deployment Architecture
  8. Security Architecture

Introduction

Purpose of This Document

This document provides a comprehensive architectural overview of AWS Tuner, focusing on:

  • System components and their responsibilities
  • Technology choices and rationale
  • Data flow and processing pipelines
  • Integration points and dependencies
  • Deployment and operational architecture

High-Level Architecture

System Context Diagram

[Figure: System Context Architecture]

Key Architectural Principles

  1. Event-Driven: Asynchronous processing via RabbitMQ for scalability
  2. Microservices: 11 specialized modules (tuner-core, tuner-aws-utils, etc.)
  3. Multi-Database: Right database for the right data (MongoDB for recommendations, MySQL for accounts)
  4. Rule-Based Intelligence: Drools engine separates business rules from code
  5. Read-Only AWS Access: Zero write permissions to customer AWS accounts, which limits security exposure
  6. Multi-Cloud Ready: Supports AWS and GCP with unified recommendation model

Core Components

1. tuner-core (Main Application)

Purpose: Core recommendation engine and Spring Boot application

Technology:

  • Spring Boot 2.7.4
  • Spring Cloud Config 2021.0.8
  • Spring Data JPA (MySQL)
  • Spring Data MongoDB
  • Drools 7.73.0.Final

Key Responsibilities:

  1. Recommendation Generation:

    • Orchestrates 42+ recommendation job types
    • Executes Drools rules for intelligent analysis
    • Calculates cost savings and prioritization
  2. Resource Synchronization:

    • Scheduled jobs sync AWS resource metadata
    • CloudWatch metrics collection
    • Pricing data updates
  3. API Services:

    • REST endpoints for frontend
    • Recommendation CRUD operations
    • Account management

Key Packages:

com.ttn.ck.tuner
├── recommendation/
│   ├── job/aws/      # 42+ AWS recommendation jobs
│   ├── job/gcp/      # 9+ GCP recommendation jobs
│   ├── engine/       # Drools engine integration
│   ├── api/          # REST controllers
│   └── processor/    # Data processing logic
├── core/
│   ├── processor/    # Scheduler job processors
│   └── service/      # Business services
└── api/
    ├── controller/   # API controllers
    └── dto/          # Data transfer objects

Recommendation Jobs (42+ types):

AWS Jobs:

  • OverProvisionedEc2RecommendationJob
  • SnapshotRecommendationJob
  • IdleNatGatewayRecommendation
  • S3IncompleteMultipartRecommendationJob
  • OverProvisionedRedshiftClusterRecommendationJob
  • IdleVpcEndpointRecommendationJob
  • DynamoDbIdleTableRecommendationJob
  • RdsExtendedSupportRecommendation
  • EksExtendedSupportRecommendation
  • ModerniseElasticacheRecommendation
  • LoadBalancerRecommendationJob
  • LambdaErrorRateRecommendationJob
  • And 30+ more...

GCP Jobs:

  • IdleComputeInstanceRecommendationJob
  • IdleCloudSqlRecommendationJob
  • IdleStaticIPsRecommendationJob
  • IdlePersistentDiskRecommendationJob
  • And 5+ more...

2. Drools Rule Engine

Purpose: Business rules management system for recommendation logic

Why Drools?

  • ✅ Declarative rule definition (readable by non-developers)
  • ✅ Business rules separate from code
  • ✅ Easy to add new recommendation types
  • ✅ Complex decision logic support
  • ✅ No redeployment needed for rule changes

Rule Files (40+ rules):

src/main/resources/rules/
├── oper_provisioned_ec2_instance_rules.drl
├── snapshot-rules.drl
├── natgateway-rules.drl
├── idle-vpc-endpoint-recommendation-rules.drl
├── s3-incomplete-multipart-rules.drl
├── over-provisioned-redshift-cluster-recommendation-rules.drl
├── idle-dynamodb-table-rules.drl
├── rds-extended-support-rules.drl
├── eks-rules.drl
├── modernise-elasticache-rules.drl
└── 30+ more rule files...

Example Rule Anatomy (OverProvisioned EC2):

rule "Generate OverProvisioned EC2 Recommendations"
when
    $instance: Ec2InstanceInfo()
then
    // 1. Validate recommendation criteria
    if ($instance.getRecommendedInstanceType() == null) return;
    // getOdCostPerHour() returns a BigDecimal (see step 2); convert it before comparing with a double
    if ($instance.getOdCostPerHour().doubleValue() <= $instance.getRecommendedInstanceCostPerHour()) return;
    if ($instance.getInstanceType().equals($instance.getRecommendedInstanceType())) return;

    // 2. Calculate hourly savings (current on-demand cost minus recommended cost).
    //    Despite its name, recommendedCostPerHour holds this difference.
    double recommendedCostPerHour = Math.max(0,
        $instance.getOdCostPerHour()
            .subtract(BigDecimal.valueOf($instance.getRecommendedInstanceCostPerHour()))
            .doubleValue()
    );

    // 3. Minimum savings threshold ($0.005/month, using 720 hours per month)
    if (recommendedCostPerHour * 720 <= 0.005) return;

    // 4. Create recommendation (instanceId, description, action, message,
    //    currentCost, status and metadata are not shown in this excerpt)
    RecommendationInfo recommendation = new RecommendationInfo(
        $instance.getAccountId(),
        instanceId,
        $instance.getInstanceName(),
        $instance.getRegion(),
        description,
        action,
        message,
        currentCost,
        recommendedCostPerHour,
        status,
        metadata
    );

    recommendationList.add(recommendation);
end

Rule Parameters (Configurable):

  • Lookback period: 30 days (default)
  • CPU threshold: 30% (default)
  • CloudWatch metric period: 3600 seconds
  • Minimum savings: $0.005/month
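
The validation and threshold steps of the rule above can also be written as plain Java, which makes the arithmetic easy to unit-test outside the rule engine. The following is a minimal sketch, not code from tuner-core: the class `RightsizingCheck` and its method are hypothetical, and it assumes the 720 hours per month and $0.005 threshold stated above.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical helper mirroring the rule's checks; not part of tuner-core.
final class RightsizingCheck {
    static final int HOURS_PER_MONTH = 720;          // 30 days x 24 hours, as in the rule
    static final double MIN_MONTHLY_SAVINGS = 0.005; // threshold from the rule parameters

    /** Returns the monthly savings if the instance qualifies, otherwise empty. */
    static Optional<Double> monthlySavings(String currentType, String recommendedType,
                                           BigDecimal odCostPerHour, double recommendedCostPerHour) {
        // No recommendation, or the recommendation is the type already in use
        if (recommendedType == null || recommendedType.equals(currentType)) return Optional.empty();

        // Hourly savings: current on-demand cost minus recommended cost
        double hourly = odCostPerHour.subtract(BigDecimal.valueOf(recommendedCostPerHour)).doubleValue();
        if (hourly <= 0) return Optional.empty();

        // Apply the minimum monthly savings threshold
        double monthly = hourly * HOURS_PER_MONTH;
        return monthly > MIN_MONTHLY_SAVINGS ? Optional.of(monthly) : Optional.empty();
    }
}
```

With illustrative hourly prices of 0.384 (current) and 0.096 (recommended), the sketch yields roughly 207.36 per month; an instance whose recommended type equals its current type yields an empty result.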

3. Multi-Database Architecture

Why Multiple Databases?

Each database optimized for specific data patterns:

MongoDB (Document Store)

Purpose: Recommendations, schedules, user preferences

Why MongoDB?

  • ✅ Flexible schema for varied recommendation types
  • ✅ Fast writes for high-volume recommendations
  • ✅ JSON-like documents match API responses
  • ✅ Horizontal scalability

Collections:

tuner_db
├── recommendations # Generated recommendations
├── scheduler_configs # Schedule definitions
├── tag_scheduler_configs # Tag-based schedules
├── ec2_resources # EC2 metadata cache
├── rds_resources # RDS metadata cache
└── user_preferences # User settings

Example Recommendation Document:

{
"_id": "rec_abcd1234",
"accountId": "123456789012",
"resourceId": "i-0abcd1234efgh5678",
"resourceName": "api-server-prod-3",
"region": "us-east-1",
"category": "OverProvisioned",
"service": "EC2",
"description": "Maximum CPU utilisation of EC2 is 12%...",
"action": "Downsize EC2 instance type from m5.2xlarge to m5.large",
"currentCost": 280.32,
"recommendedCost": 70.08,
"monthlySavings": 210.24,
"annualSavings": 2522.88,
"metadata": {
"instanceType": "m5.2xlarge",
"cpu": 8,
"memory": 32768,
"cpuUtilization": 12,
"recommendedInstanceType": "m5.large"
},
"status": "GENERATED",
"createdAt": "2025-10-26T12:00:00Z",
"updatedAt": "2025-10-26T12:00:00Z"
}
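
The savings fields in the example document are consistent with simple arithmetic: `monthlySavings` equals `currentCost` minus `recommendedCost`, and `annualSavings` equals twelve times `monthlySavings`. The sketch below reproduces those figures. The record `SavingsSummary` is hypothetical, and whether Tuner derives the fields exactly this way is an assumption.

```java
import java.math.BigDecimal;

// Hypothetical value type; field names follow the example document above.
record SavingsSummary(BigDecimal currentCost, BigDecimal recommendedCost) {

    /** Monthly savings: current monthly cost minus recommended monthly cost. */
    BigDecimal monthlySavings() {
        return currentCost.subtract(recommendedCost);
    }

    /** Annual savings: twelve months of the monthly savings. */
    BigDecimal annualSavings() {
        return monthlySavings().multiply(BigDecimal.valueOf(12));
    }
}
```

For the document above, `new SavingsSummary(new BigDecimal("280.32"), new BigDecimal("70.08"))` gives 210.24 monthly and 2522.88 annually. `BigDecimal` is used so that currency values are not subject to binary floating-point rounding.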

MySQL (Relational Database)

Purpose: Account metadata, user management, transactional data

Why MySQL?

  • ✅ ACID compliance for critical data
  • ✅ Strong referential integrity
  • ✅ Mature tooling and ecosystem
  • ✅ CloudKeeper platform standard

Schema (Key Tables):

tuner_schema
├── accounts # AWS account information
├── users # User authentication and roles
├── permissions # RBAC permissions
├── audit_logs # Change tracking
├── iam_roles # AWS IAM role configurations
└── account_regions # Account-region mappings

Snowflake (Analytics Data Warehouse)

Purpose: Cost & Usage Report (CUR) data, historical analytics

Why Snowflake?

  • ✅ Massive scale (petabytes of cost data)
  • ✅ Columnar storage for fast aggregations
  • ✅ Shared with AWS Lens (data reuse)
  • ✅ Time-series optimized queries

Key Queries:

  • Historical cost trends for savings calculations
  • Scheduler savings validation
  • ROI tracking and reporting
  • Multi-account cost attribution

Redis (Cache)

Purpose: API response caching, session management

Why Redis?

  • ✅ Sub-millisecond latency
  • ✅ Reduces database load
  • ✅ TTL support for cache expiration

Cached Data:

  • Recommendation lists (TTL: 1 hour)
  • AWS pricing data (TTL: 24 hours)
  • User sessions (TTL: session timeout)
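
To illustrate the TTL behaviour listed above, here is a minimal in-memory sketch. It is a stand-in that shows expiry semantics only; it is not a Redis client and not Tuner's caching code. The class `TtlCache` and the injectable clock are assumptions made for the example.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// In-memory stand-in that illustrates TTL semantics only; it is not a Redis client.
final class TtlCache<V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injectable so expiry can be tested without sleeping

    TtlCache(LongSupplier clockMillis) {
        this.clock = clockMillis;
    }

    /** Stores a value that expires once the given TTL has elapsed. */
    void put(String key, V value, Duration ttl) {
        entries.put(key, new Entry<>(value, clock.getAsLong() + ttl.toMillis()));
    }

    /** Returns the value if present and not yet expired; evicts it otherwise. */
    Optional<V> get(String key) {
        Entry<V> e = entries.get(key);
        if (e == null) return Optional.empty();
        if (clock.getAsLong() >= e.expiresAtMillis()) {
            entries.remove(key, e); // remove only if the entry was not replaced meanwhile
            return Optional.empty();
        }
        return Optional.of(e.value());
    }
}
```

In this sketch, pricing data would be stored with `Duration.ofHours(24)` and recommendation lists with `Duration.ofHours(1)`, matching the TTLs above. Passing `System::currentTimeMillis` as the clock gives wall-clock expiry.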

4. Event Processing Architecture

RabbitMQ (Message Queue)

Purpose: Asynchronous event processing and job distribution

Message Flows:

[Figure: RabbitMQ Message Flow]

Events:

  • SYNC_RECOMMENDATION_SUCCESS - Recommendation generated
  • SYNC_RECOMMENDATION_FAILURE - Recommendation job failed
  • SCHEDULER_EVENT_START - Resource started by scheduler
  • SCHEDULER_EVENT_STOP - Resource stopped by scheduler
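
A consumer of these events needs to map each event type to a handler. The sketch below shows one way to do that in plain Java. The event names come from the list above; the `TunerEvent` enum and `EventDispatcher` class are hypothetical, and the RabbitMQ wiring is deliberately left out.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

// Event names come from the list above; the dispatcher itself is a hypothetical sketch.
enum TunerEvent {
    SYNC_RECOMMENDATION_SUCCESS,
    SYNC_RECOMMENDATION_FAILURE,
    SCHEDULER_EVENT_START,
    SCHEDULER_EVENT_STOP
}

final class EventDispatcher {
    private final Map<TunerEvent, Consumer<String>> handlers = new EnumMap<>(TunerEvent.class);

    /** Registers the handler that receives payloads for the given event type. */
    void on(TunerEvent event, Consumer<String> handler) {
        handlers.put(event, handler);
    }

    /** Routes a payload to its handler; returns false if the type is unknown or unhandled. */
    boolean dispatch(String eventType, String payload) {
        if (eventType == null) return false;
        TunerEvent event;
        try {
            event = TunerEvent.valueOf(eventType);
        } catch (IllegalArgumentException ex) {
            return false; // not one of the known event names
        }
        Consumer<String> handler = handlers.get(event);
        if (handler == null) return false;
        handler.accept(payload);
        return true;
    }
}
```

Returning `false` for unknown or unhandled types lets the caller decide whether to reject, requeue, or log the message.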

Quartz Scheduler

Purpose: Job scheduling framework for recurring tasks

Scheduled Jobs:

| Job | Frequency | Purpose |
|---|---|---|
| SyncEc2ResourcesJob | Every 6 hours | Sync EC2 metadata |
| SyncRdsResourcesJob | Every 6 hours | Sync RDS metadata |
| OverProvisionedEc2RecommendationJob | Daily | Generate EC2 rightsizing recommendations |
| SnapshotRecommendationJob | Weekly | Identify orphaned snapshots |
| AccountSchedulerJobProcessor | Cron-based | Execute start/stop schedules |
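
For the fixed-interval jobs in the table, the next run time follows from the first run and the interval. The sketch below shows only that arithmetic in plain Java. It does not use the Quartz API, and the class `FixedIntervalSchedule` is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: computes the next run of a fixed-interval job.
final class FixedIntervalSchedule {

    /** Returns the first fire time strictly after 'now', given the first run and the interval. */
    static Instant nextRun(Instant firstRun, Duration interval, Instant now) {
        if (now.isBefore(firstRun)) return firstRun;
        long elapsedMillis = Duration.between(firstRun, now).toMillis();
        long periods = elapsedMillis / interval.toMillis() + 1; // completed periods, plus the next one
        return firstRun.plus(interval.multipliedBy(periods));
    }
}
```

For a job that first ran at 00:00 UTC with a 6-hour interval, a call at 07:30 UTC returns 12:00 UTC on the same day.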

Technology Stack

Backend

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Application Framework | Spring Boot | 2.7.4 | Core application framework |
| Configuration | Spring Cloud Config | 2021.0.8 | Centralized configuration |
| Rule Engine | Drools | 7.73.0.Final | Business rules management |
| Job Scheduler | Quartz | (via quartz-utils) | Scheduled job execution |
| Message Queue | RabbitMQ | Latest | Async event processing |
| Document Database | MongoDB | 5.0+ | Recommendations storage |
| Relational Database | MySQL | 8.0.33 | Account/user management |
| Analytics Database | Snowflake | Latest | Cost analytics |
| Cache | Redis | 6.0+ | API caching |
| Language | Java | 17 | Primary language |
| Build Tool | Gradle | 7.x | Build automation |

Frontend

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Framework | React | 18.x | UI framework |
| Language | TypeScript | 4.x | Type-safe development |
| State Management | Redux Toolkit | (TBD) | Global state |
| HTTP Client | Axios | Latest | API communication |
| Charts | Recharts | Latest | Data visualization |

Browser Extension

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Framework | React | 18 | Extension UI |
| Language | TypeScript | 4.x | Type safety |
| Bundler | Webpack | 5 | Build tool |
| Manifest | V3 (Chrome), V2 (Firefox) | Latest | Extension config |
| Authentication | JWT | - | Token-based auth |

Data Flow Architecture

Recommendation Generation Flow

[Figure: 7-Phase Recommendation Generation Flow]

Timeline:

  • Hour 0: Resource sync begins
  • Hour 6: Second sync (incremental updates)
  • Hour 24: First recommendation job runs
  • Hour 25: Recommendations available in UI

Integration Architecture

External Integrations

AWS Services

Read-Only API Calls:

EC2:
- DescribeInstances
- DescribeVolumes
- DescribeSnapshots
- DescribeNatGateways
- DescribeVpcEndpoints

RDS:
- DescribeDBInstances
- DescribeDBClusters

CloudWatch:
- GetMetricStatistics
- ListMetrics

Pricing:
- GetProducts

Cost Explorer:
- GetCostAndUsage
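
Every call listed above starts with a read-style verb (`Describe`, `Get`, or `List`). As a rough illustration, the sketch below checks whether an IAM action name follows that pattern. It is a naming heuristic only: actual enforcement belongs in the IAM policy attached to the role, and the class `ReadOnlyActions` is hypothetical.

```java
import java.util.List;

// Heuristic sketch only: real enforcement belongs in the IAM policy, not in application code.
final class ReadOnlyActions {
    private static final List<String> READ_PREFIXES = List.of("Describe", "Get", "List");

    /** True if an IAM action such as "ec2:DescribeInstances" uses a read-style verb. */
    static boolean looksReadOnly(String action) {
        int colon = action.indexOf(':');
        if (colon < 0 || colon == action.length() - 1) return false; // expects "service:Operation"
        String operation = action.substring(colon + 1);
        return READ_PREFIXES.stream().anyMatch(operation::startsWith);
    }
}
```

A check like this could serve as a guard in tests that review the list of actions a module calls, for example flagging `ec2:TerminateInstances` while accepting `ec2:DescribeInstances`.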

Cross-Account Access:

AWS Account (Customer)
│
├─ IAM Role: CloudKeeperTunerRole
│   ├─ Trust Policy: Allow AssumeRole from CloudKeeper account
│   ├─ External ID: Unique per customer
│   └─ Permissions: Read-only (ec2:Describe*, rds:Describe*, etc.)
│
└─ AssumeRole call from Tuner
    └─ Temporary credentials (1 hour TTL)
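
The trust policy on the customer's role is what ties the `AssumeRole` call to a specific external ID. The sketch below builds such a policy document as a string, using the standard IAM trust policy shape with an `sts:ExternalId` condition. The class `TrustPolicy` is hypothetical, the account ID and external ID are placeholders, and the exact policy CloudKeeper ships may differ.

```java
// Builds a trust policy document; the account ID and external ID passed in are placeholders.
final class TrustPolicy {

    /**
     * Returns a trust policy allowing the given account to assume the role,
     * but only when the caller presents the given external ID.
     * Assumes both inputs are plain identifiers that need no JSON escaping.
     */
    static String forExternalId(String trustedAccountId, String externalId) {
        return """
            {
              "Version": "2012-10-17",
              "Statement": [{
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::%s:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": "%s"}}
              }]
            }
            """.formatted(trustedAccountId, externalId);
    }
}
```

Because the external ID is unique per customer, a request that presents another customer's role ARN with the wrong external ID fails the condition and is denied.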

CloudKeeper Platform

Internal Services:

  1. AuthX: Authentication and authorization
  2. AWS Lens: Shared Snowflake CUR data
  3. Config Server: Centralized configuration

Deployment Architecture

Production Environment

[Figure: Production Deployment Architecture]

Scalability

Horizontal Scaling:

  • tuner-core: 3-10 instances (auto-scaling based on CPU/memory)
  • MongoDB: Sharded cluster for large customers
  • RabbitMQ: Clustered for high throughput

Vertical Scaling:

  • tuner-core: 4-8 CPU, 8-16 GB RAM per instance
  • MongoDB: 8-16 CPU, 32-64 GB RAM
  • MySQL: 4-8 CPU, 16-32 GB RAM

Security Architecture

Defense in Depth

[Figure: Security Defense in Depth]

Compliance:

  • SOC 2 Type II (TBD)
  • GDPR compliant
  • Data residency controls

Next Steps