Prometheus & Grafana Monitoring Stack

Built an end-to-end monitoring and alerting solution on EKS, reducing MTTR by 70% and enabling proactive incident response.

Overview

Our microservices on EKS lacked centralised metrics. Engineers spent hours diagnosing issues using logs alone.

Architecture

Deployed the Prometheus Operator via Helm on the cluster, with a Node Exporter DaemonSet for node-level metrics and ServiceMonitors for application endpoints. Grafana runs in read-only mode in a separate namespace.
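As a rough sketch of how an application endpoint gets wired in, a ServiceMonitor like the one below tells the Operator-managed Prometheus which Services to scrape. The service name, namespaces, labels, and port name here are illustrative placeholders, not the actual manifests from the project.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: orders-api              # hypothetical service name
      namespace: monitoring
      labels:
        release: prometheus         # must match the Operator's serviceMonitorSelector
    spec:
      selector:
        matchLabels:
          app: orders-api           # matches the labels on the target Service
      namespaceSelector:
        matchNames:
          - production              # hypothetical application namespace
      endpoints:
        - port: http-metrics        # named Service port exposing /metrics
          path: /metrics
          interval: 30s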

Implementation Steps

1. Installed the Prometheus Operator and its CRDs.
2. Created ConfigMaps for scrape configs.
3. Installed Alertmanager and configured Slack and PagerDuty webhooks (a sketch follows this list).
4. Built Grafana dashboards from JSON model files versioned in Git.
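For step 3, the routing followed the common pattern of a default Slack receiver plus a PagerDuty escalation for critical alerts. The Alertmanager configuration below is a minimal sketch under that assumption; the receiver names, channel, and keys are placeholders, and the real webhook URL and routing key would normally live in a Kubernetes Secret rather than in the config itself.

    global:
      resolve_timeout: 5m
    route:
      receiver: slack-default
      group_by: ['alertname', 'namespace']
      routes:
        - match:
            severity: critical
          receiver: pagerduty-oncall     # only critical alerts page the on-call engineer
    receivers:
      - name: slack-default
        slack_configs:
          - api_url: https://hooks.slack.com/services/...   # placeholder webhook URL
            channel: '#alerts'
            send_resolved: true
      - name: pagerduty-oncall
        pagerduty_configs:
          - routing_key: <pagerduty-integration-key>        # placeholder Events API v2 key
            severity: critical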

Key Dashboards

• Cluster Health: CPU and memory utilisation
• Service Latency: p50/p95/p99
• Custom Business Metrics: order throughput, error rates
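The latency panels are driven by percentile queries, and one way to keep such dashboards fast is to precompute the quantiles as Prometheus recording rules. The sketch below assumes a hypothetical http_request_duration_seconds histogram exposed by the services; it illustrates the approach rather than the exact rules used.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: latency-recording-rules
      namespace: monitoring
      labels:
        release: prometheus
    spec:
      groups:
        - name: service-latency
          rules:
            # Precompute p95 and p99 per job from the request-duration histogram
            - record: job:http_request_duration_seconds:p95
              expr: histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
            - record: job:http_request_duration_seconds:p99
              expr: histogram_quantile(0.99, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))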

Alerting Rules

Alerts fire on sustained CPU above 80%, memory above 75%, HTTP 5xx spikes, and unreachable nodes. Every alert carries a runbook link and a severity level.
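Expressed as a PrometheusRule resource, two of these rules might look roughly like the sketch below. The metric selectors, thresholds, "for" durations, job label, and runbook URLs are assumptions standing in for the real definitions.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: cluster-alerts
      namespace: monitoring
      labels:
        release: prometheus
    spec:
      groups:
        - name: cluster-health
          rules:
            - alert: HighNodeCPU
              # CPU usage per node derived from node_exporter idle time
              expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
              for: 10m                    # "sustained": must hold for 10 minutes before firing
              labels:
                severity: warning
              annotations:
                summary: "CPU above 80% on {{ $labels.instance }}"
                runbook_url: https://example.com/runbooks/high-cpu   # placeholder runbook link
            - alert: NodeUnreachable
              expr: up{job="node-exporter"} == 0    # job label is an assumption
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "Node exporter target down: {{ $labels.instance }}"
                runbook_url: https://example.com/runbooks/node-unreachable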

Outcomes

Mean Time to Detect (MTTD) dropped from 18 minutes to 7 minutes, and Mean Time to Repair (MTTR) from 45 minutes to 13 minutes. Historical trend graphs improved capacity planning by 30%.

Lessons Learned

Automating dashboard provisioning via Terraform would ensure reproducibility. We plan to evaluate Grafana's new Cloud offerings next.

Technologies

Prometheus • Grafana • Alertmanager • Kubernetes (EKS) • Helm • Slack • PagerDuty

Interested in Similar Solutions?

Let's discuss how I can help optimise your infrastructure