Real-Time, All the Time: How We Streamed Data from Everywhere into Dashboards and APIs in Seconds
From hourly dashboards to sub-second APIs — how we built a blazing-fast, cost-efficient analytics pipeline using cloud-native tools and open-source tech.
Why We Built This
The ops team and analysts at CARS24 used to rely on hourly data syncs into Snowflake, visualized in Tableau. Meanwhile, internal services and APIs needing fresh insights had to deal with stale or incomplete data — not ideal when seconds matter.
We needed real-time visibility and instant data access — for both human dashboards and real-time DS/ML APIs.
So, we built a real-time data ingestion and analytics pipeline on GKE using:
- Google Pub/Sub
- Apache Kafka
- StarRocks
- MinIO + Google Cloud Storage
- Apache Superset
Now, the same live data that fuels dashboards also powers low-latency APIs for DS/ML use-cases — unlocking new automation, intelligence, and responsiveness across systems.
Data Ingestion/Delivery Architecture
Step 1: Unified Ingestion with Google Pub/Sub
We start by streaming data into Google Pub/Sub, which acts as our real-time message broker. This includes:
- BigQuery exports/CDC pushed as events
- Internal APIs pushing app metrics and events
- Third-party vendor data feeds
This gave us a consistent entry point for all inbound data, regardless of origin. Pub/Sub acts as the unified, decoupled transport layer that feeds the rest of our pipeline.
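To make this concrete, here is a minimal sketch of how an internal service can push an event into Pub/Sub. The project ID, topic name, and event fields are placeholders rather than our actual schema.

```python
import json

from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

# Placeholder project and topic names, used purely for illustration.
PROJECT_ID = "cars24-analytics"
TOPIC_ID = "ops-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def publish_event(event: dict) -> str:
    """Serialize an event as JSON and publish it to Pub/Sub."""
    data = json.dumps(event).encode("utf-8")
    # publish() returns a future; result() blocks until Pub/Sub acknowledges the message.
    future = publisher.publish(topic_path, data=data, source="internal-api")
    return future.result()


if __name__ == "__main__":
    message_id = publish_event({"lead_id": 12345, "stage": "inspection_done"})
    print(f"published message {message_id}")
```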
Step 2: From Pub/Sub to Kafka (Yes, Both!)
Why both? Because:
- Pub/Sub is great for ingesting external data.
- Kafka gives us better buffering, replayability, and tooling for downstream processing.
We use the Pub/Sub Source Connector to stream messages directly into Kafka topics. This makes the system more flexible and robust — with Kafka acting as a durable message bus between ingestion and analytics.
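For illustration, wiring this up is essentially one REST call to the Kafka Connect cluster. The connector class comes from Google's Pub/Sub Kafka connector; the endpoint, project, subscription, and topic names below are placeholders.

```python
import requests  # pip install requests

# Placeholder Kafka Connect endpoint; the Connect workers run inside our GKE cluster.
CONNECT_URL = "http://kafka-connect:8083/connectors"

connector = {
    "name": "pubsub-to-kafka-ops-events",
    "config": {
        # Source connector class from Google's Pub/Sub Kafka connector.
        "connector.class": "com.google.pubsub.kafka.source.CloudPubSubSourceConnector",
        "tasks.max": "4",
        # Pub/Sub project and subscription to pull from (placeholders).
        "cps.project": "cars24-analytics",
        "cps.subscription": "ops-events-sub",
        # Kafka topic the messages are written to.
        "kafka.topic": "ops_events",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    },
}

resp = requests.post(CONNECT_URL, json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```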
Step 3: Real-Time Ingestion into StarRocks
Here’s where the magic happens.
We use StarRocks’ Routine Load feature to ingest Kafka topics continuously. No more batch jobs, no more delays — just real-time data landing in StarRocks tables, ready to be queried.
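A trimmed-down Routine Load job looks roughly like the sketch below. Since the StarRocks frontend speaks the MySQL protocol, it can be submitted from Python with a standard MySQL client; the database, table, topic, and broker names are placeholders, and real jobs carry more column mappings and tuning properties.

```python
import pymysql  # pip install pymysql; the StarRocks FE speaks the MySQL protocol

conn = pymysql.connect(host="starrocks-fe", port=9030, user="etl", password="***", database="ops")

# Continuously consume the ops_events Kafka topic into the ops_events table.
# Assumes the JSON keys in each message match the table's column names.
ROUTINE_LOAD_SQL = """
CREATE ROUTINE LOAD ops.load_ops_events ON ops_events
PROPERTIES (
    "format" = "json",
    "desired_concurrent_number" = "3"
)
FROM KAFKA (
    "kafka_broker_list" = "kafka-0.kafka:9092",
    "kafka_topic" = "ops_events",
    "property.kafka_default_offsets" = "OFFSET_END"
)
"""

with conn.cursor() as cur:
    cur.execute(ROUTINE_LOAD_SQL)
    # Job health can then be checked with: SHOW ROUTINE LOAD FOR ops.load_ops_events
conn.close()
```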
Why StarRocks?
- Blazing fast query performance
- Native support for Kafka ingestion
- Scales horizontally
- Great for real-time analytics workloads
- Materialized views that further improve query performance
We run StarRocks in shared-data mode, which separates compute from storage. This lets us autoscale Compute Nodes (CNs) based on real-time load using Kubernetes HPA.
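As a rough sketch of the autoscaling piece, the HPA below targets a hypothetical starrocks-cn StatefulSet on CPU utilization; the names, namespace, and thresholds are illustrative, not our production values.

```python
from kubernetes import client, config, utils  # pip install kubernetes

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
api_client = client.ApiClient()

# HPA targeting a hypothetical StarRocks CN StatefulSet; names and thresholds are placeholders.
hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "starrocks-cn-hpa", "namespace": "starrocks"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "StatefulSet", "name": "starrocks-cn"},
        "minReplicas": 2,
        "maxReplicas": 10,
        "metrics": [
            {
                "type": "Resource",
                "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}},
            }
        ],
    },
}

utils.create_from_dict(api_client, hpa_manifest)
```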
Step 4: Smart Storage with MinIO & GCS
Storage costs can escalate quickly, especially with SSDs. So we introduced tiered storage:
- Hot Data → Stored on MinIO (an S3-compatible object store inside our cluster)
- Cold Data → Offloaded to Google Cloud Storage (GCS) to save cost
StarRocks integrates smoothly with this setup. Queries hit hot storage for recent data and fetch older data from cold storage only when needed — with minimal performance impact. StarRocks also has a built-in data tiering feature that can move data from SSD to HDD; it's useful when you want tiered data to come back faster than it would from object storage, but it comes at the cost of provisioning HDDs.
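A simplified version of how a MinIO-backed storage volume can be declared in shared-data mode is sketched below; the bucket, endpoint, and credentials are placeholders, and the property names should be verified against the StarRocks version you run.

```python
import pymysql  # the StarRocks FE speaks the MySQL protocol

conn = pymysql.connect(host="starrocks-fe", port=9030, user="admin", password="***")

# Storage volume backed by the in-cluster MinIO (S3-compatible); bucket, endpoint, and
# credentials are placeholders.
STORAGE_VOLUME_SQL = """
CREATE STORAGE VOLUME minio_hot
TYPE = S3
LOCATIONS = ("s3://starrocks-hot/")
PROPERTIES (
    "aws.s3.endpoint" = "http://minio.minio.svc.cluster.local:9000",
    "aws.s3.region" = "us-east-1",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "<access-key>",
    "aws.s3.secret_key" = "<secret-key>"
)
"""

with conn.cursor() as cur:
    cur.execute(STORAGE_VOLUME_SQL)
    # Individual tables can then opt in via PROPERTIES ("storage_volume" = "minio_hot").
conn.close()
```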
💰 Result: Big storage savings, no compromise on query performance.
Step 5: Real-Time Dashboards with Superset
Finally, we plugged Apache Superset into StarRocks.
Superset is fast, open-source, and user-friendly. It lets the team:
- Build and view dashboards in real time
- Run complex queries with sub-second response times
- Skip the wait for the Snowflake sync or Tableau refresh
- Work on a fully open-source analytics stack (bye-bye, Tableau license fees!)
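Under the hood, Superset talks to StarRocks through a SQLAlchemy URI. The snippet below is a small sanity check of such a connection, assuming the starrocks SQLAlchemy dialect; the user, host, database, and table are placeholders, and the same URI goes into Superset's database connection form.

```python
from sqlalchemy import create_engine, text  # pip install sqlalchemy starrocks

# Placeholder URI; the same string goes into Superset's "Connect a database" form.
engine = create_engine("starrocks://superset_ro:***@starrocks-fe:9030/ops")

with engine.connect() as conn:
    freshness = conn.execute(
        text("SELECT count(*) FROM ops_events WHERE event_time >= NOW() - INTERVAL 5 MINUTE")
    ).scalar()
    print(f"events ingested in the last 5 minutes: {freshness}")
```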
Before vs. After
Key Takeaways
- Decoupling ingestion from processing via Pub/Sub and Kafka gave us flexibility and reliability.
- StarRocks provided the perfect balance of real-time performance and cost efficiency.
- Tiered storage with MinIO and GCS dramatically cut down our storage costs.
- Superset + StarRocks made real-time analytics truly self-service for the Ops team.
This pipeline has completely transformed how we operate. Our Ops team no longer waits for stale dashboards — they act on live data. And it’s not just people who benefit:
Our real-time DS/ML APIs now fetch up-to-the-second data directly from StarRocks, powering internal tools, automations, and services that rely on fast, accurate insights. Whether it’s a live personalization service or a backend service making decisions based on operational metrics — they’re all plugged into the same real-time engine.
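As an illustration of that pattern, a low-latency endpoint can query StarRocks directly over the MySQL protocol. The service, table, and columns below are hypothetical, not one of our actual APIs.

```python
import pymysql
from fastapi import FastAPI  # pip install fastapi uvicorn pymysql

app = FastAPI()


def query_one(sql: str, params: tuple):
    """Run a single parameterized query against StarRocks over the MySQL protocol."""
    conn = pymysql.connect(host="starrocks-fe", port=9030, user="api_ro", password="***", database="ops")
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchone()
    finally:
        conn.close()


@app.get("/leads/{lead_id}/latest-stage")
def latest_stage(lead_id: int):
    # Hypothetical endpoint: return the most recent pipeline stage for a lead, straight from live data.
    row = query_one(
        "SELECT stage, event_time FROM ops_events WHERE lead_id = %s ORDER BY event_time DESC LIMIT 1",
        (lead_id,),
    )
    if row is None:
        return {"lead_id": lead_id, "stage": None}
    return {"lead_id": lead_id, "stage": row[0], "event_time": str(row[1])}
```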
We’ve reduced costs, improved performance, and unlocked a new level of agility and intelligence across the organization.
Future Work & Limitations
While our real-time pipeline has significantly improved visibility and responsiveness across teams, it’s still evolving — and there are a few areas we’re actively exploring and others where trade-offs exist.
Schema Evolution & Data Contracts
Right now, handling schema changes across producers (Pub/Sub, Kafka) and consumers (StarRocks, APIs) is manual and fragile. We’re exploring:
- Schema registry integration (e.g., Confluent Schema Registry)
- Enforced data contracts between producers and downstream consumers
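A sketch of what a schema-registry-backed producer could look like with Confluent's Python client is below; the schema, registry URL, and topic are illustrative, and this is one possible approach rather than something we run in production.

```python
from confluent_kafka import SerializingProducer  # pip install "confluent-kafka[avro]"
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Illustrative Avro contract; records that don't conform to the schema fail at
# serialization time instead of silently breaking downstream consumers.
OPS_EVENT_SCHEMA = """
{
  "type": "record",
  "name": "OpsEvent",
  "fields": [
    {"name": "lead_id", "type": "long"},
    {"name": "stage", "type": "string"},
    {"name": "event_time", "type": "string"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serializer = AvroSerializer(registry, OPS_EVENT_SCHEMA)

producer = SerializingProducer({
    "bootstrap.servers": "kafka-0.kafka:9092",
    "value.serializer": serializer,
})

producer.produce(
    topic="ops_events",
    value={"lead_id": 12345, "stage": "inspection_done", "event_time": "2025-01-01T00:00:00Z"},
)
producer.flush()
```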
High-Cardinality Metrics & Joins
StarRocks handles high concurrency and heavy ingestion well, but:
- Joins across large datasets can get expensive in real time
- High-cardinality dimensions (e.g., user-level analytics) need careful modeling to avoid bloated tables or slow queries
We’re experimenting with materialized views, pre-aggregations, and even hybrid models (mixing real-time and batch where needed).
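As one example of the pre-aggregation direction, an asynchronous materialized view in StarRocks can roll events up per minute so dashboards hit a small rollup instead of the raw event table. The view, table, columns, and refresh interval below are placeholders.

```python
import pymysql  # submitted over the MySQL protocol, like the Routine Load job above

conn = pymysql.connect(host="starrocks-fe", port=9030, user="etl", password="***", database="ops")

# Asynchronous materialized view that pre-aggregates events per minute and stage.
MV_SQL = """
CREATE MATERIALIZED VIEW ops.stage_counts_per_minute
REFRESH ASYNC EVERY (INTERVAL 1 MINUTE)
AS
SELECT
    date_trunc('minute', event_time) AS minute_bucket,
    stage,
    count(*) AS events
FROM ops.ops_events
GROUP BY date_trunc('minute', event_time), stage
"""

with conn.cursor() as cur:
    cur.execute(MV_SQL)
conn.close()
```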
Lakehouse Integration
As we continue to scale our real-time data platform, we’re exploring how to integrate it with lakehouse architecture — to bridge the gap between streaming, historical, and analytical workloads.
Currently, StarRocks serves as our high-speed analytical engine, but it doesn’t store long-term, large-volume datasets. Integrating with a lakehouse would enable us to:
- Persist raw + processed data in an open format (e.g., Parquet, Iceberg, or Delta Lake)
- Enable retrospective analysis beyond the retention period of StarRocks
- Run AI/ML workflows and batch analytics on historical data without overloading our real-time pipeline
- Support replays or reprocessing from deep storage when business logic changes
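One way this could look is an Iceberg external catalog registered in StarRocks, so real-time tables and lake tables are queryable side by side. The catalog name and metastore endpoint below are illustrative, and this remains an exploration rather than something we run today.

```python
import pymysql  # again over the MySQL protocol

conn = pymysql.connect(host="starrocks-fe", port=9030, user="admin", password="***")

# External catalog pointing at an Iceberg lakehouse via a Hive metastore; the catalog name
# and metastore URI are placeholders, and property names should be checked against your version.
ICEBERG_CATALOG_SQL = """
CREATE EXTERNAL CATALOG lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://hive-metastore:9083"
)
"""

with conn.cursor() as cur:
    cur.execute(ICEBERG_CATALOG_SQL)
    # Historical data could then be queried in place, e.g. SELECT ... FROM lake.<db>.<table>
conn.close()
```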
The long-term goal is to have a fully unified architecture:
Data is streamed, stored, queried, and analyzed — across real-time and historical timelines — from one logical platform.