Overview
Our microservices on EKS lacked centralised metrics. Engineers spent hours diagnosing issues using logs alone.
Built an end-to-end monitoring and alerting solution on EKS, reducing MTTR by 70% and enabling proactive incident response.
Deployed the Prometheus Operator via Helm on the cluster, with Node Exporter DaemonSets and ServiceMonitors for application endpoints. Grafana runs in read-only mode in a separate namespace.
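As a sketch of how application endpoints get picked up, the ServiceMonitor below assumes a hypothetical orders-api Service exposing /metrics on a port named http; every name, namespace, and label here is illustrative rather than the actual configuration.

```yaml
# Hypothetical ServiceMonitor: tells the Prometheus Operator to scrape
# any Service labelled app: orders-api in the orders namespace.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orders-api
  namespace: monitoring
  labels:
    release: prometheus        # must match the Operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - orders                 # namespace of the application Service
  selector:
    matchLabels:
      app: orders-api          # label on the Service to scrape
  endpoints:
    - port: http               # named port on the Service
      path: /metrics
      interval: 30s
```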
1. Installed Prometheus Operator and CRDs.
2. Created ConfigMaps for scrape configs.
3. Installed Alertmanager and configured Slack + PagerDuty webhooks (receiver configuration sketched below).
4. Built Grafana dashboards via JSON model files in Git.
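For the Alertmanager step, a minimal configuration along the following lines routes critical alerts to PagerDuty and everything else to Slack; the webhook URL, routing key, channel, and grouping intervals are placeholders, not the values actually used.

```yaml
# Alertmanager config sketch: Slack for routine alerts, PagerDuty for critical ones.
# All URLs, keys and channel names are placeholders.
route:
  receiver: slack-default             # fallback receiver
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-critical

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: '#alerts'
        send_resolved: true
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <pagerduty-events-v2-routing-key>
        severity: critical
```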
Cluster Health: CPU, Memory Utilisation • Service Latency: p50/p95/p99 • Custom Business Metrics: order throughput, error rates
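Latency percentiles such as these are typically computed with histogram_quantile over request-duration histograms. The recording-rule sketch below assumes the services expose an http_request_duration_seconds histogram, which is an assumption rather than the actual metric name; pre-computing the quantiles keeps the p95/p99 dashboard panels cheap to render.

```yaml
# Recording-rule sketch for the latency dashboard panels.
# Assumes an http_request_duration_seconds histogram; adjust to the real metric names.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: latency-recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: service-latency
      rules:
        - record: service:request_duration_seconds:p95
          expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
        - record: service:request_duration_seconds:p99
          expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```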
Configured alerts for sustained CPU > 80%, memory > 75%, HTTP 5xx spikes, and unreachable nodes. Each alert carries a runbook link and a severity level.
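A PrometheusRule along these lines expresses the thresholds above. The CPU and memory expressions use standard Node Exporter metrics; the HTTP metric name, the 5xx ratio threshold, the `for` durations, and the runbook URLs are illustrative assumptions rather than the exact rules deployed.

```yaml
# Alert-rule sketch matching the thresholds described above.
# HTTP metric names, durations and runbook URLs are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: cluster-health
      rules:
        - alert: HighNodeCPU
          expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m                       # "sustained": must hold for 10 minutes
          labels:
            severity: warning
          annotations:
            summary: "Node CPU above 80% for 10 minutes"
            runbook_url: https://runbooks.example.com/high-cpu
        - alert: HighNodeMemory
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 75
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node memory above 75%"
            runbook_url: https://runbooks.example.com/high-memory
        - alert: Http5xxSpike
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HTTP 5xx error ratio above 5%"
            runbook_url: https://runbooks.example.com/5xx-spike
        - alert: NodeUnreachable
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node Exporter target unreachable"
            runbook_url: https://runbooks.example.com/node-unreachable
```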
Mean Time to Detect (MTTD) dropped from 18 minutes to 7 minutes, and Mean Time to Repair (MTTR) dropped from 45 minutes to 13 minutes. Capacity planning improved by 30%, driven by historical trend graphs.
Automating dashboard provisioning via Terraform would ensure reproducibility. We plan to evaluate Grafana's new Cloud offerings next.
Prometheus • Grafana • Alertmanager • Kubernetes (EKS) • Helm • Slack • PagerDuty
Let's discuss how I can help optimise your infrastructure.