Five years into managing AWS infrastructure at HashCash Consultants, I've developed principles and tools that make large-scale cloud chaos manageable. Here's what actually works.
1. Infrastructure as Code - No Exceptions
The single most important shift: every piece of infrastructure lives in Terraform or CloudFormation. No manually created resources in production. If it's not in code, it doesn't exist.
Every EC2 instance, security group, IAM role, and VPC must be reproducible from code in under 10 minutes.
At HashCash we manage infrastructure for multiple fintech clients simultaneously. Without IaC, tracking hundreds of manually provisioned resources across multiple AWS accounts would be an unmanageable cognitive load. With Terraform, every configuration is version-controlled in Git, peer-reviewed via pull requests, and auditable through commit history.
The discipline is absolute: if someone creates a resource manually in the console, it gets imported into Terraform before the day ends - or scheduled for deletion.
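To make the "import it or delete it" rule concrete, here is a minimal Terraform sketch - the resource address, instance ID, and tag values are hypothetical - showing how a console-created instance can be adopted into state with a Terraform 1.5+ import block and then managed like any other code-defined resource:

```hcl
# Adopt a console-created instance into Terraform state (Terraform 1.5+).
import {
  to = aws_instance.orphaned_api        # hypothetical resource address
  id = "i-0abc123def4567890"            # hypothetical instance ID
}

# Once imported, the instance is defined, reviewed, and changed only in code.
resource "aws_instance" "orphaned_api" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI ID
  instance_type = "t3.medium"

  tags = {
    Environment = "production"
    Service     = "api"
    Owner       = "devops-team"
  }
}
```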
2. AMI Strategy: Build Once, Deploy Everywhere
Managing 500+ instances individually is impossible. Manage them as a fleet of identical machines built from a controlled base AMI.
- Base Ubuntu 22.04 AMI with all standard tooling baked in
- Security hardening applied at AMI build time (CIS benchmarks)
- Cross-region replication automated via Lambda trigger on new AMI creation
- Old AMIs auto-deregistered after 90 days via lifecycle policy
- AMI ID stored as SSM Parameter - one parameter update cascades across all Terraform configs
When a security patch is needed, we update the base AMI, test it, and roll it out to all instances via a blue-green deployment. No SSH-ing into individual boxes. No configuration drift. No surprises.
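Here is a minimal sketch of the SSM Parameter pattern from the list above - the parameter path and resource names are illustrative:

```hcl
# The AMI pipeline writes the latest hardened image ID to this parameter;
# Terraform configs read it instead of hard-coding an AMI ID.
data "aws_ssm_parameter" "base_ami" {
  name = "/fleet/base-ami/ubuntu-22-04"   # hypothetical parameter path
}

resource "aws_instance" "api" {
  ami           = data.aws_ssm_parameter.base_ami.value
  instance_type = "t3.medium"

  tags = {
    Environment = "production"
    Service     = "api"
    Owner       = "devops-team"
  }
}
```

Updating that one parameter and re-applying is what lets a single tested AMI change cascade through every environment during the blue-green rollout.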
3. Tagging: The Unsung Hero
Without proper tags, 500 instances become 500 mysteries. Every resource gets mandatory tags enforced via AWS Config rules:
Environment: production | staging | dev
Client: [name]
Service: api | worker | db | lb
Owner: devops-team
AutoStop: true | false
CostCenter: [client-code]
The AutoStop tag drives a Lambda that shuts down non-production instances every evening, cutting cloud spend significantly. The CostCenter tag feeds AWS Cost Explorer's filtered views, giving each client a monthly breakdown of exactly what they're paying for.
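The enforcement itself can be sketched with the AWS-managed required-tags rule - the rule name is hypothetical, the tag keys match the list above, and it assumes an AWS Config recorder is already running in the account:

```hcl
# AWS-managed Config rule that flags any EC2 instance missing the mandatory tag keys.
resource "aws_config_config_rule" "mandatory_tags" {
  name = "mandatory-tags"   # hypothetical rule name

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  # REQUIRED_TAGS accepts up to six tag keys; values are left unconstrained here.
  input_parameters = jsonencode({
    tag1Key = "Environment"
    tag2Key = "Client"
    tag3Key = "Service"
    tag4Key = "Owner"
    tag5Key = "CostCenter"
  })

  scope {
    compliance_resource_types = ["AWS::EC2::Instance"]
  }
}
```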
4. Observability Stack
CloudWatch handles AWS-native metrics. Prometheus and Grafana handle application-level visibility. Every instance runs node_exporter. Fleet health is visible at a glance in a single Grafana dashboard.
The key principle: no alert without a runbook. Every CloudWatch alarm that pages the on-call engineer has a corresponding runbook that describes exactly what to check and how to resolve it. Alert fatigue kills teams. Actionable alerts with documented responses build confidence.
We track three fleet-level metrics obsessively: patch compliance percentage, CPU over-provisioning ratio (instances consistently below 20% CPU), and security group drift (instances with ad-hoc rules not in Terraform).
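As a sketch of the "no alert without a runbook" rule, a CloudWatch alarm can carry its runbook link in the description so it lands in the page itself - the alarm name, threshold, topic, and wiki URL below are placeholders:

```hcl
resource "aws_sns_topic" "oncall" {
  name = "oncall-pages"   # hypothetical on-call notification topic
}

# Pages only when CPU stays high for 15 minutes; the runbook link travels with the alarm.
resource "aws_cloudwatch_metric_alarm" "api_high_cpu" {
  alarm_name          = "prod-api-high-cpu"   # hypothetical
  alarm_description   = "Runbook: https://wiki.example.internal/runbooks/api-high-cpu"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn]

  dimensions = {
    AutoScalingGroupName = "prod-api-asg"     # hypothetical ASG name
  }
}
```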
5. IAM Discipline
At fintech scale, IAM is survival. The principles we never compromise on:
- Principle of least privilege, strictly applied. No role has more permissions than its narrowest use case requires.
- Role-based access only. No user-level access keys for application code.
- No long-lived access keys anywhere. STS assume-role for everything cross-account.
- AWS Instance Profiles for all EC2 workloads. The instance gets credentials from metadata - no secrets in environment variables.
- GuardDuty on every account, 24/7. Any anomalous API call pattern triggers an alert within 5 minutes.
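A minimal sketch of the instance-profile pattern - role and profile names are illustrative, and the permission policies attached to the role would be scoped per workload:

```hcl
# Only the EC2 service can assume this role; applications on the instance pick up
# short-lived credentials from the instance metadata service, never from env vars.
data "aws_iam_policy_document" "ec2_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "api_instance" {
  name               = "api-instance-role"       # hypothetical
  assume_role_policy = data.aws_iam_policy_document.ec2_assume.json
}

resource "aws_iam_instance_profile" "api" {
  name = "api-instance-profile"                  # hypothetical
  role = aws_iam_role.api_instance.name
}
```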
6. Automated Patching Pipeline
With 500+ instances, manual patching is not a strategy. AWS Systems Manager Patch Manager handles OS patching on a defined schedule:
- Dev environment: patched weekly, Tuesday 02:00 IST
- Staging: patched weekly, Wednesday 02:00 IST (after dev validation)
- Production: patched bi-weekly, Saturday 02:00 IST with maintenance window approval
Patch compliance is reported to a CloudWatch dashboard. Any instance that misses a patch cycle generates an alert. Production must be 100% patch compliant at all times - this is a non-negotiable SLA.
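A sketch of one of these windows in Terraform, using the dev schedule as the example - names, concurrency limits, and the cron expression are illustrative:

```hcl
# Weekly dev patch window: Tuesday 02:00 IST.
resource "aws_ssm_maintenance_window" "dev_patching" {
  name              = "dev-weekly-patching"   # hypothetical
  schedule          = "cron(0 2 ? * TUE *)"
  schedule_timezone = "Asia/Kolkata"
  duration          = 3
  cutoff            = 1
}

# Target every instance tagged Environment=dev.
resource "aws_ssm_maintenance_window_target" "dev_instances" {
  window_id     = aws_ssm_maintenance_window.dev_patching.id
  resource_type = "INSTANCE"

  targets {
    key    = "tag:Environment"
    values = ["dev"]
  }
}

# Run the standard SSM patching document against those targets.
resource "aws_ssm_maintenance_window_task" "dev_patch" {
  window_id       = aws_ssm_maintenance_window.dev_patching.id
  task_type       = "RUN_COMMAND"
  task_arn        = "AWS-RunPatchBaseline"
  priority        = 1
  max_concurrency = "10%"
  max_errors      = "5%"

  targets {
    key    = "WindowTargetIds"
    values = [aws_ssm_maintenance_window_target.dev_instances.id]
  }

  task_invocation_parameters {
    run_command_parameters {
      parameter {
        name   = "Operation"
        values = ["Install"]
      }
    }
  }
}
```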
7. Cost Engineering
- Reserved Instances for baseline workloads - 1-year convertible RIs for anything running 24/7
- Spot Instances for batch and stateless workers - up to 70% savings on transcoding, ETL, and test runners
- Right-sizing reviews quarterly - any instance consistently below 20% CPU average gets downsized
- S3 Intelligent-Tiering for all object storage - automatically moves infrequently accessed data to cheaper tiers
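As one example of baking the cheap path into code, a sketch of a lifecycle rule that moves objects straight into S3 Intelligent-Tiering - the bucket name and rule ID are placeholders:

```hcl
resource "aws_s3_bucket" "client_data" {
  bucket = "example-client-data"   # hypothetical bucket name
}

# New objects transition to Intelligent-Tiering immediately; S3 then shifts anything
# not accessed for 30+ days into cheaper access tiers without further intervention.
resource "aws_s3_bucket_lifecycle_configuration" "client_data" {
  bucket = aws_s3_bucket.client_data.id

  rule {
    id     = "intelligent-tiering"
    status = "Enabled"

    filter {}   # apply to the whole bucket

    transition {
      days          = 0
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}
```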
Conclusion
Managing infrastructure at scale isn't about working harder - it's about building systems that make the right thing easy and the wrong thing hard. IaC, AMI pipelines, strict tagging, observability, IAM discipline, automated patching, and cost engineering are the seven pillars.
The goal isn't to manage 500 instances - it's to build a system that manages itself, alerts you when something needs human judgment, and keeps your team sleeping through the night.
If there's one principle that ties all of these together: if you do it more than twice, automate it.