Chaos to Clarity: How We Built a Scalable Observability System for Logs, Metrics and Traces
Monitoring is crucial for maintaining high availability, optimizing performance, and diagnosing issues effectively. In a cloud-native environment, visibility into metrics, logs, and traces allows us to proactively detect and resolve problems before they impact our users.
To achieve this, we built our observability stack using multiple components from the Grafana ecosystem (the LGTM stack): Grafana, Alloy, Loki, Tempo, and Prometheus. We also use the OpenTelemetry Collector (OTel) to expose traces of KServe services running under the Istio umbrella.
1. Role of each component
Alloy: The backbone of our monitoring setup, allowing us to collect, process, and export telemetry data (metrics, logs, and traces — for instrumentation use-cases) from a single tool.
Prometheus: Allows us to store and query time-series metrics efficiently.
Loki: Allows us to aggregate and query logs to provide detailed insights, reducing reliance on expensive cloud logging services.
Tempo: Aggregates and handles distributed tracing to track request flows across services.
Grafana: Allows us to query, visualize, and set alerts on our collected telemetry data by creating charts and dashboards.
OTel: The OpenTelemetry Collector acts as the standard for instrumenting applications and collecting traces. We use the OTel Collector to gather traces (with a Zipkin receiver) from KServe services running under Istio, process them, and export them to Tempo for efficient querying. We monitor traces all the way from the Istio gateway to the KServe service endpoints.
2. Motivation
As our Data Science teams started deploying more services and we started expanding our infrastructure with Kafka, a real-time OLAP database, and a vector DB, monitoring became increasingly challenging.
There were cases when our services were failing silently, and we didn’t realize it for hours (or even days!). Some of the recurring issues we faced were:
- Disk storage running out
- Disk inodes getting exhausted
- Continuous out-of-memory (OOM) errors
- API requests failing randomly
- Frequent HTTP 4xx and 5xx status codes
Even when there were no significant failures, we also faced performance degradation issues, such as:
- Database queries slowing down over time
- CPU throttling
- Increased API latencies
- Unexpected spikes in request volume
Another major challenge was cost. As the number of services grew, our Google Cloud Platform (GCP) logging costs skyrocketed, increasing significantly each month.
Initially, we monitored services manually: constantly checking pod status, logs, and resource utilization (CPU + memory + disk + GPU). But as the number of services grew, this approach became unsustainable. That’s when we decided to build our observability pipeline from scratch, and we haven’t looked back since.
3. Deployment
We deployed Alloy as a DaemonSet in clustering mode (explained in the challenges section) in each of our GKE clusters, since we wanted to read the logs of every pod on every node. Having an Alloy pod on each node also lets us gather node-level metrics (CPU, memory, network health, system load, disk usage, and more) that aren’t provided by GKE. Each Alloy pod pushes the collected logs to Loki and the metrics to Prometheus via remote write for centralized storage.
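As a rough sketch, these are the two write components that every collection pipeline in our Alloy configuration forwards to; the URLs are placeholders rather than our actual service addresses:

```
// Sketch of the write endpoints every Alloy pipeline forwards to.
// The URLs below are placeholders, not our real service addresses.

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus.monitoring.svc:9090/api/v1/write"
  }
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc/loki/api/v1/push"
  }
}
```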
We deployed a centralized Prometheus on a single GKE cluster using the Prometheus Operator. Note that we don’t use Prometheus itself to scrape metrics; Alloy does the scraping and remote-writes them to Prometheus. Likewise, to set alerts on our metrics we use Grafana alerting instead of Prometheus Alertmanager. We chose a single centralized Prometheus over one per GKE cluster to avoid the operational complexity of multiple Prometheus deployments (we haven’t yet reached the scale where we would need Thanos).
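For illustration, here is a trimmed-down sketch of the kind of Prometheus custom resource this relies on. The key detail is the remote-write receiver, since Alloy pushes metrics in rather than Prometheus scraping them; the name and storage size are placeholders, and the retention matches the one-month figure mentioned later in this post:

```
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: central          # placeholder name
  namespace: monitoring
spec:
  replicas: 1
  retention: 30d                    # one-month metric retention
  enableRemoteWriteReceiver: true   # accept metrics pushed by Alloy via remote write
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi          # placeholder size
```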
Grafana and Tempo are also deployed on a single GKE cluster.
Loki is deployed as a distributed system with separate components for ingestion, querying, and storage. It remotely stores logs in a GCS bucket for cost-effective long-term retention while keeping recent logs on disk for faster querying. Logs are pushed from Alloy to Loki, and Grafana is used to visualize and analyze them.
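A rough sketch (not our full configuration) of the Loki settings behind this split between GCS object storage and local data; exact keys vary slightly between Loki versions, and the bucket name is a placeholder:

```
storage_config:
  gcs:
    bucket_name: <logs-bucket>      # long-term chunk and index storage
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index-cache

compactor:
  retention_enabled: true           # enforce the retention period below

limits_config:
  retention_period: 2160h           # ~90 days before logs are flushed
```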
The OTel Collector is deployed on the GKE cluster where our KServe services run. It receives traces through a Zipkin receiver, processes and filters them (we only push useful traces to Tempo, with a 10% random sampling rate), and exports them through the OTLP exporter.
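A sketch of what such a collector pipeline looks like. The endpoints are placeholders and the filtering is simplified to the sampler alone, but the Zipkin receiver, the 10% probabilistic sampler, and the OTLP exporter mirror the setup described above:

```
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411              # Istio/KServe sidecars report spans here

processors:
  probabilistic_sampler:
    sampling_percentage: 10             # keep ~10% of traces
  batch: {}

exporters:
  otlp:
    endpoint: tempo.monitoring.svc:4317 # placeholder Tempo OTLP gRPC endpoint
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```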
4. Data flow
Below is a simplified diagram of how data (logs and metrics) moves through our infrastructure, highlighting the previously mentioned observability components in action:
The Alloy pods are at the core of our infrastructure. They collect data from multiple sources — pods, VMs, and logs from GKE nodes — before forwarding it to Prometheus and Loki.
Loki keeps the most recent logs (a 30-day window) readily queryable and stores historical logs in a GCS bucket, retaining them for 90 days before flushing them. Once the data is stored, we can visualize it in Grafana dashboards.
For metric collection, Alloy uses Kubernetes pod and service discovery to find the pods and services that can be scraped. We then filter for the desired ports and endpoints that expose Prometheus metrics (usually /metrics). For scraping metrics from VMs, we run the exporters (explained below) as Docker containers and add each VM’s IP and port to the Alloy configuration.
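Here is a simplified sketch of that discovery-and-scrape pipeline in Alloy; the annotation used in the keep rule is an illustrative assumption rather than our exact filter:

```
// Discover pods and keep only those that expose Prometheus metrics.
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "metrics_pods" {
  targets = discovery.kubernetes.pods.targets

  // Illustrative filter: keep pods annotated with prometheus.io/scrape=true.
  rule {
    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
    regex         = "true"
    action        = "keep"
  }
}

prometheus.scrape "pods" {
  targets      = discovery.relabel.metrics_pods.output
  metrics_path = "/metrics"
  forward_to   = [prometheus.remote_write.default.receiver]
}
```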
For log collection, Alloy reads log files on each node from /var/log/pods/<pod_id>/<container_name>/*.log, using the Kubernetes pod discovery mechanism to resolve the pod_id and container_name.
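The corresponding log pipeline looks roughly like this sketch, which uses path globs in place of our discovery-driven targets:

```
// Match per-container log files on the node and tail them into Loki.
local.file_match "pod_logs" {
  path_targets = [{"__path__" = "/var/log/pods/*/*/*.log"}]
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.write.default.receiver]
}
```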
For traces, the OTel Collector collects traces from services running under Istio and sends them to Tempo. Tempo aggregates the traces, keeps recent traces in memory for fast querying, and stores historical traces in a GCS bucket for cost-effective retention. Grafana provides a dashboard for querying traces, enabling us to analyze request flows and debug latency issues.
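For reference, pointing Tempo at GCS is a small piece of configuration along these lines (the bucket name is a placeholder):

```
storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: <traces-bucket>   # cost-effective long-term trace retention
```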
4.1. Google Cloud Function logs
Many of our Google Cloud Functions generate a high volume of logs. Instead of storing them in GCP (and increasing our cloud bill even further), we save them to Loki. For this, we created a Log Router in GCP, which forwards the logs to a Pub/Sub topic. Alloy then reads the logs from a subscription on that topic and pushes them to Loki.
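In Alloy, reading that Pub/Sub subscription looks roughly like the following sketch; the project and subscription names are placeholders:

```
// Pull Cloud Function logs routed to Pub/Sub and forward them to Loki.
loki.source.gcplog "cloud_functions" {
  pull {
    project_id   = "<gcp-project>"
    subscription = "<log-router-subscription>"
  }
  forward_to = [loki.write.default.receiver]
}
```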
4.2. VM logs
We also deployed Alloy Docker containers on our VMs. This allows us to push the logs of all the containers in the VM to Loki, enabling a unified logging view across GKE clusters and VMs.
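On each VM, the Alloy container tails Docker container logs with roughly this pipeline (the socket path shown is the Docker default):

```
// Discover containers via the local Docker daemon and ship their logs to Loki.
discovery.docker "vm_containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "vm_containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.vm_containers.targets
  forward_to = [loki.write.default.receiver]
}
```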
5. Benefits
We can now easily define alert rules in Grafana for any metric, log, or trace we care about (a couple of example queries follow this list):
- Want to track HTTP 500 status codes in a service? Set an alert
- Need to monitor disk usage on databases and VMs? Set an alert
- Checking for service downtime? Set an alert
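As an example, the disk-usage and HTTP 5xx alerts boil down to queries like these; the label selectors and thresholds are illustrative:

```
# PromQL: node_exporter filesystem usage above 90% on any node or VM
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.9

# LogQL: any HTTP 500s in a service's logs over the last 5 minutes
sum(count_over_time({namespace="my-service"} |= " 500 " [5m])) > 0
```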
Processing logs in Loki rather than in GCP reduced our cloud costs by 17% at our current scale (~250GB of logs in Loki). Previously, logging costs scaled linearly with log volume; with Loki, they remain almost flat up to a certain volume.
We configured user permissions in Grafana, allowing Data Science teams to monitor their services and set their own alerts independently via internal VPN access.
6. Challenges
While setting up our observability stack, we faced several challenges:
6.1. Alloy clustering mode
Deploying Alloy as a DaemonSet can introduce a challenge: if each Alloy pod in a GKE cluster had the same configuration, each would scrape the same services, leading to duplicate metrics. Fortunately, Alloy’s clustering mode lets us distribute metric scraping across the Alloy pods. For example, in the diagram above, two different Alloy pods scrape different metric endpoints on VM 1. If clustering mode were disabled, both Alloy pods would scrape both endpoints, resulting in duplicate metrics.
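The fix amounts to enabling clustering on each scrape component, for example on the pod scrape sketched earlier:

```
prometheus.scrape "pods" {
  targets    = discovery.relabel.metrics_pods.output
  forward_to = [prometheus.remote_write.default.receiver]

  // Distribute the discovered targets across all Alloy pods in the cluster
  // so each target is scraped exactly once.
  clustering {
    enabled = true
  }
}
```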
6.2. Prometheus exporters
Some critical metrics (like node and GPU metrics) aren’t natively exposed in Prometheus format. To handle this, we use various exporters, which expose metrics in a format Prometheus can understand. We run a number of exporters on our VMs and in our GKE clusters: Node exporter (node-level metrics), a GPU exporter (GPU metrics), and Process exporter (process-level metrics).
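On the VM side, these exporters become additional static scrape targets in the Alloy configuration; the IP is a placeholder and the ports shown are the exporters' usual defaults:

```
// Static scrape targets for exporters running as Docker containers on a VM.
prometheus.scrape "vm_exporters" {
  targets = [
    {"__address__" = "10.0.0.12:9100"},  // node_exporter (placeholder VM IP)
    {"__address__" = "10.0.0.12:9256"},  // process-exporter
    {"__address__" = "10.0.0.12:9400"}   // NVIDIA DCGM exporter (GPU metrics)
  ]
  forward_to = [prometheus.remote_write.default.receiver]
}
```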
6.3. OTel Collector
Our goal was to build tracing for the end-to-end request flow; we weren’t interested in profiling our code with instrumentation, so Alloy didn’t fit this use case. Instead, we deployed the OTel Collector with a Zipkin receiver to collect traces and send them to Tempo for aggregation.
6.4. Trace volume
We run many KServe services for different Data Science teams, and enabling tracing on all of them generated an enormous amount of trace data, which would increase storage costs and make the tracing dashboard heavy to refresh. We therefore added a processor filter in the OTel Collector that keeps only 10% of traces via random sampling, reducing trace volume while still fulfilling our requirements.
7. Future
Our current setup efficiently handles about 280K unique metrics, each with many labels (~40GB), ~250GB of logs, and approximately 12GB of traces. However, as our service count grows, we will need to scale horizontally, something we haven’t explored yet.
Another open item is a reliable way to back up Grafana dashboards and alert configurations.
Currently, we retain metrics in Prometheus for only a month. To enable historical analysis without incurring high storage costs, we would need to tier old data to a GCS bucket.
8. Conclusion
Building an effective observability pipeline wasn’t just a technical necessity — it was a game-changer for our operations. With automated monitoring, centralized logging, and proactive alerting, we’ve dramatically reduced downtime, improved system reliability, and cut logging costs. We’re excited to see how this setup evolves as our infrastructure scales.