You cannot operate reliable infrastructure without observability. When an application slows down or a service goes offline, you need to know before your users do, and you need the data to diagnose the root cause quickly. Prometheus and Grafana have become the de facto open-source monitoring stack for cloud-native infrastructure, covering metrics collection, alerting, and visualisation. Organisations that already run their workloads on Kubernetes benefit especially, since Prometheus was built for dynamic, containerised environments and both tools integrate closely with Kubernetes.
Prometheus: Metrics Collection and Alerting
Prometheus is a time-series database and monitoring system designed for reliability and simplicity. It collects metrics by scraping HTTP endpoints exposed by your applications and infrastructure components at regular intervals.
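To make the scraping model concrete, here is a minimal sketch of what a scrape target returns: plain text in the Prometheus exposition format, served over HTTP. It uses only the Python standard library so the format itself is visible; in practice a client library would generate this for you. The metric name `http_requests_total`, its labels, the counter values, and port 8000 are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real client library tracks these for you.
REQUEST_COUNTS = {("GET", "200"): 0, ("GET", "500"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving metrics; the port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at port 8000 would be enough for the counters to show up as time series, one per label combination.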
- Pull-based model — Prometheus pulls metrics from targets rather than receiving pushed data. This means Prometheus controls the scrape interval and can detect when targets are down (a missed scrape is itself a signal).
- PromQL — Prometheus's query language is powerful and flexible, allowing you to aggregate, filter, and transform metrics to answer operational questions like "what is the 99th percentile request latency for service X over the last hour?"
- Service discovery — Prometheus integrates with Kubernetes, cloud provider APIs, Consul, and other service registries to automatically discover monitoring targets as they are created and destroyed.
- Alertmanager — a companion component that handles alert routing, deduplication, grouping, and silencing. It can send notifications to email, Slack, PagerDuty, OpsGenie, and other channels.
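The 99th percentile question above can be sketched as a PromQL expression sent to the Prometheus HTTP API. This is an illustration under assumptions, not a drop-in query: it presumes your services expose a histogram named `http_request_duration_seconds` with a `service` label, and that Prometheus is reachable at the base URL you pass in.

```python
import json
import urllib.parse
import urllib.request


def p99_latency_query(service, window="1h"):
    """Build a PromQL expression for p99 latency from histogram buckets.

    The metric and label names are assumptions; adapt them to your instrumentation.
    """
    return (
        "histogram_quantile(0.99, "
        "sum by (le) (rate(http_request_duration_seconds_bucket"
        f'{{service="{service}"}}[{window}])))'
    )


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url, promql, timeout=10):
    """Execute the query and return the list of result series."""
    url = instant_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, `run_query("http://localhost:9090", p99_latency_query("checkout"))` would return the current p99 for a hypothetical `checkout` service, provided that histogram exists. The same expression can be pasted into the expression browser or a Grafana panel.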
Grafana: Visualisation and Dashboards
Grafana provides the visualisation layer that makes monitoring data actionable. While Prometheus has a basic expression browser, Grafana offers rich, interactive dashboards that teams can use for both real-time monitoring and historical analysis.
- Multi-source dashboards — Grafana can query data from Prometheus, Elasticsearch, CloudWatch, Azure Monitor, PostgreSQL, and dozens of other data sources in a single dashboard.
- Template variables — create dynamic dashboards that let users filter by environment, service, region, or any other dimension without creating separate dashboards for each combination.
- Alerting — Grafana also includes its own alerting system, which can be simpler to configure than Alertmanager for teams already using Grafana as their primary monitoring interface.
- Community dashboards — thousands of pre-built dashboards are available on Grafana.com for common infrastructure (Kubernetes, PostgreSQL, NGINX, Node Exporter) that can be imported and customised.
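Template variables are easier to picture with a dashboard model in hand. The sketch below builds a deliberately simplified dashboard as a Python dictionary with one query variable and one panel that uses it. The metric `http_requests_total` and the `service` label are assumptions, and a real exported dashboard carries more fields (data source references, schema version, and so on) than shown here.

```python
import json


def build_dashboard(title, metric="http_requests_total", label="service"):
    """Return a minimal dashboard model with one template variable and one panel."""
    return {
        "title": title,
        "templating": {
            "list": [
                {
                    "name": label,
                    "type": "query",
                    # label_values() is the variable query helper offered by
                    # Grafana's Prometheus data source.
                    "query": f"label_values({metric}, {label})",
                    "multi": True,
                    "includeAll": True,
                }
            ]
        },
        "panels": [
            {
                "type": "timeseries",
                "title": f"Request rate by {label}",
                "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
                "targets": [
                    {
                        "refId": "A",
                        # Grafana substitutes $<variable> with the selected value(s).
                        "expr": (
                            f"sum by ({label}) "
                            f'(rate({metric}{{{label}=~"${label}"}}[5m]))'
                        ),
                    }
                ],
            }
        ],
    }


def to_json(dashboard):
    """Serialise the model so it can be imported or provisioned."""
    return json.dumps(dashboard, indent=2)
```

Because the panel query references `$service` rather than a fixed name, one dashboard serves every service the variable query discovers.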
Setting Up the Monitoring Stack
For Kubernetes environments, the most common approach is to deploy the kube-prometheus-stack Helm chart, which includes Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in a single, well-configured deployment:
- Deploy the stack — install the kube-prometheus-stack Helm chart in a dedicated monitoring namespace. This provides immediate visibility into cluster health, node resources, and pod metrics.
- Instrument your applications — add Prometheus client libraries to your applications to expose custom metrics. Libraries are available for all major languages (Go, Java, Python, Node.js, .NET). Focus on the four golden signals: latency, traffic, errors, and saturation.
- Configure service monitors — create ServiceMonitor resources that tell Prometheus which services to scrape and how. This integrates naturally with Kubernetes service discovery.
- Build dashboards — start with the pre-built dashboards included in the Helm chart, then create custom dashboards for your application-specific metrics.
- Set up alerts — define alerting rules in Prometheus for critical conditions (high error rates, resource exhaustion, service downtime) and configure Alertmanager to route notifications to the appropriate teams.
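The service monitor and alerting steps can be sketched as two Prometheus Operator resources, generated here as Python dictionaries and printed as JSON, which `kubectl apply -f` accepts as well as YAML. Everything specific is an assumption for illustration: the `app` selector label, the port name `metrics`, the metric `http_requests_total` with `service` and `status` labels, the 5% threshold, and the `release` label value. Many installations of the chart select resources by a `release` label matching the Helm release name, so check the selectors in your own deployment.

```python
import json

# Assumed Helm release name; resources are often selected by this label.
RELEASE = "kube-prometheus-stack"


def service_monitor(name, namespace, app_label, port="metrics", interval="30s"):
    """Build a ServiceMonitor that scrapes Services labelled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace, "labels": {"release": RELEASE}},
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            # "port" is the *name* of a port on the Service, not a number.
            "endpoints": [{"port": port, "path": "/metrics", "interval": interval}],
        },
    }


def error_rate_rule(name, namespace, service, threshold=0.05):
    """Build a PrometheusRule that fires when the 5xx ratio exceeds the threshold."""
    expr = (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{service="{service}"}}[5m])) > {threshold}'
    )
    rule = {
        "alert": "HighErrorRate",
        "expr": expr,
        "for": "10m",  # condition must hold this long before the alert fires
        "labels": {"severity": "critical"},
        "annotations": {"summary": f"High 5xx rate on {service}"},
    }
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace, "labels": {"release": RELEASE}},
        "spec": {"groups": [{"name": f"{service}.rules", "rules": [rule]}]},
    }


def render(*resources):
    """Serialise resources as a JSON list for review or for kubectl."""
    return json.dumps(list(resources), indent=2)
```

A call such as `print(render(service_monitor("checkout", "monitoring", "checkout"), error_rate_rule("checkout-alerts", "monitoring", "checkout")))` yields both manifests for a hypothetical `checkout` service. The `severity` label is what Alertmanager routing rules would typically match on to pick the receiving team.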
Beyond Metrics: Logs and Traces
Metrics tell you that something is wrong. Logs and traces tell you why. A complete observability stack includes all three pillars:
- Logging — Loki (from Grafana Labs) is the natural complement to Prometheus and Grafana. It indexes log metadata (labels) rather than full-text content, making it efficient and cost-effective. Logs can be queried alongside metrics in Grafana dashboards.
- Distributed tracing — for microservices architectures, tracing tools like Jaeger or Tempo (also from Grafana Labs) track requests as they flow through multiple services, helping you identify bottlenecks and failures in complex call chains.
- OpenTelemetry — an increasingly standard framework for instrumenting applications with metrics, logs, and traces using a single, vendor-neutral SDK. If you are starting fresh, OpenTelemetry is the recommended instrumentation approach.
Scaling Prometheus
A single Prometheus instance works well for small to medium deployments, but larger environments may require scaling strategies. Proper scaling is also a key consideration in any disaster recovery plan for cloud infrastructure, since monitoring data must remain available even during regional failures.
- Thanos — extends Prometheus with long-term storage, global querying across multiple Prometheus instances, and deduplication. Thanos stores historical data in object storage (S3, Azure Blob) for cost-effective retention.
- Cortex / Mimir — horizontally scalable, multi-tenant Prometheus backends. Grafana Mimir is the recommended option for organisations that need to centralise metrics from many clusters or teams.
- Federation — Prometheus supports hierarchical federation, where a global Prometheus scrapes aggregated metrics from per-cluster Prometheus instances.
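Of the three options, federation is the simplest to sketch. A global Prometheus scrapes the `/federate` endpoint of each per-cluster instance, and `match[]` parameters select which series to pull. The snippet below builds that scrape job as a dictionary and, for debugging, the equivalent URL. The target addresses and the `job:` recording-rule naming pattern are assumptions for illustration.

```python
import urllib.parse


def federation_scrape_config(targets, matchers, interval="60s"):
    """Build a scrape job for the global Prometheus.

    targets:  host:port of each per-cluster Prometheus (hypothetical values).
    matchers: series selectors; only matching series are federated.
    """
    return {
        "job_name": "federate",
        "scrape_interval": interval,
        # Keep the labels set by the per-cluster instances instead of
        # overwriting them with the global instance's own.
        "honor_labels": True,
        "metrics_path": "/federate",
        "params": {"match[]": list(matchers)},
        "static_configs": [{"targets": list(targets)}],
    }


def federate_url(base_url, matchers):
    """Build the URL the global instance would request, handy for curl tests."""
    query = urllib.parse.urlencode([("match[]", m) for m in matchers])
    return f"{base_url}/federate?{query}"
```

Selecting only aggregated series, for example `federate_url("http://prom-a:9090", ['{__name__=~"job:.*"}'])`, keeps the global instance small. Federating every raw series tends to recreate the scaling problem one level up.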
How ICTLAB Can Help
ICTLAB designs and deploys monitoring and observability solutions for Belgian organisations. From setting up Prometheus and Grafana on your Kubernetes clusters to building custom dashboards, alerting workflows, and long-term metrics storage, we help your team gain full visibility into your infrastructure and applications.