
Building Asynchronous ML Inference Pipelines with Knative Eventing and KServe

Azhar Mondul
Apr 12, 2025
5 minutes

Modern machine learning infrastructures require flexibility in how services communicate with inference endpoints. While many microservices are deployed as synchronous HTTP services, evolving business requirements often demand asynchronous processing capabilities to enhance resilience and scalability.

Consider a frontend application that interacts with multiple microservices, all relying on synchronous request-response communication. If one of these services, such as a machine learning service for generating product recommendations, has high latency, the traditional request-response architecture forces the application to wait for the ML service’s response before proceeding. As traffic increases, this tightly coupled system can lead to bottlenecks, degraded performance, and even failures under heavy load.

In a decoupled architecture, the frontend application can send requests to a message queue instead of directly calling the ML service. The ML service can then pull requests from the message queue, process them, and send the results to another queue. The frontend application retrieves the response from this queue once it becomes available, typically matching it to the original request with a correlation identifier. This approach allows the application to continue interacting with other services without being blocked by the ML service’s response time.

However, implementing this approach usually means rewriting services that currently receive HTTP requests so that they consume messages from a queue instead.

This article explores a practical solution that leverages Knative Eventing with Kafka (message queue) and KServe (an ML serving framework for Kubernetes) to transform a standard synchronous request-response-based ML system into an asynchronous system. This transformation is achieved without modifying the underlying model serving code, ensuring compatibility with existing HTTP-based inference services.

KServe

KServe is an open-source model serving framework built on Kubernetes. It simplifies the deployment of machine learning models in a Kubernetes environment while providing options to tweak settings based on the user’s expertise. For more details, refer to their official website.

We used KServe to deploy ML models on a Kubernetes cluster as RESTful services, called Inference Services. These services accept HTTP POST requests, perform inference on the payload, and return the response. For more details on how we used KServe to productionize ML models, refer to this blog post.

Knative Eventing

Knative Eventing is an open-source platform that provides APIs to help you build event-driven applications on Kubernetes. For more details, refer to the Knative Eventing documentation. For our use case, we used KafkaSource, which is one of the APIs provided by Knative Eventing.

KafkaSource

KafkaSource is a type of event source in Knative Eventing. Event sources are components that pull events from an event producer (source) and send them to sinks. They act as connectors between external systems (like message queues or databases) and sinks such as a KServe Inference Service. For more details, refer to the KafkaSource documentation.

Sinks

When you create an event source, you specify a sink where events are sent from the source. A sink is an addressable or callable resource that can receive incoming events. For more details, refer to the Sink documentation.

  • Addressable Objects: Receive and acknowledge events delivered over HTTP to an address defined in their status.address.url field (see the sketch after this list).
  • Callable Objects: Receive events over HTTP, transform them, and optionally return new events in the HTTP response.
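
To make the addressable contract concrete, the sketch below reads the address that Knative resolves when an InferenceService is used as a sink. It is a minimal illustration using the Python kubernetes client; the namespace ml-services and the service name my-inference-service mirror the manifests later in this article and are placeholders rather than a prescribed setup.

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

# Fetch the InferenceService custom resource and read the URL that
# Knative's addressable-resolver uses when the service acts as a sink.
isvc = api.get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="ml-services",
    plural="inferenceservices",
    name="my-inference-service",
)
print(isvc["status"]["address"]["url"])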

In our architecture, KafkaSource bridges Kafka and KServe. It pulls inference requests from a Kafka topic, transforms them into HTTP POST requests via the KafkaSource Dispatcher (explained below), and sends them to the KServe inference service for processing.

Our Problem Statement

At CARS24, we deploy ML services using KServe. These services receive HTTP requests and return predictions.

Typically, other microservices that consume an ML service call it with a synchronous HTTP POST request and block until the prediction is returned, as in the sketch below:
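
As a minimal sketch, the call below targets a KServe InferenceService over its v1 predict protocol. The in-cluster URL, the model name my_model, and the feature values are illustrative placeholders, not our production configuration.

import requests

# Hypothetical in-cluster address of the KServe inference service.
INFERENCE_URL = "http://my-inference-service.ml-services.svc.cluster.local/v1/models/my_model:predict"

# Request body in KServe's v1 (predict) protocol.
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

# The caller blocks here until the model responds or the timeout expires.
response = requests.post(INFERENCE_URL, json=payload, timeout=30)
response.raise_for_status()
predictions = response.json()["predictions"]

Under load, every caller holds its request open for the full inference time, which is exactly the coupling we set out to remove.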

A requirement arose where a microservice needed to call an ML service asynchronously through a message queue. The challenge was that our ML service only exposed an HTTP interface, and modifying its code to consume events from the message queue was not feasible.

We needed an architecture that would allow the ML service to consume messages from a message queue without requiring code changes.

Our Solution: Asynchronous Processing with Knative

To solve this problem, we used Knative Eventing and Apache Kafka as the message broker. Using these technologies, we implemented an asynchronous architecture without any code changes to the ML service.

With this architecture, the microservice and ML service are decoupled. It works in this way:

  1. Message Publication: The microservice publishes inference requests to a Kafka topic (inference-requests) and continues performing other tasks without waiting for the response (see the producer sketch after this list).
  2. Message Consumption: The KafkaSource Dispatcher pulls messages from the Kafka topic.
  3. HTTP Transformation: The KafkaSource Dispatcher transforms the Kafka messages into HTTP POST requests.
  4. Inference Processing: The KServe inference service processes the HTTP POST requests and publishes the results to a pub/sub topic from which other services can consume them.
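
As a sketch of step 1, the snippet below publishes a KServe v1 predict payload to the inference-requests topic with the kafka-python library; the broker address mirrors the KafkaSource manifest shown later, and the feature values are placeholders.

import json
from kafka import KafkaProducer

# Broker address matching the bootstrapServers configured in the KafkaSource below.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker.kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The record value is the exact JSON body the inference service expects, because
# the KafkaSource Dispatcher forwards it as the body of the HTTP POST request.
producer.send("inference-requests", {"instances": [[5.1, 3.5, 1.4, 0.2]]})
producer.flush()

Because the dispatcher delivers each record to the sink with the record value as the request body, whatever is published to the topic should already be a valid predict request for the inference service.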

Implementation Steps

We implemented our solution following these steps:

  • Install Knative Eventing: We installed Knative Eventing by following the Knative Eventing installation guide.
  • Install KafkaSource: We installed KafkaSource to integrate Kafka with Knative Eventing, which deploys a KafkaSource Dispatcher pod that forwards messages from Kafka topics to their sinks.
  • Create Addressable-Resolver ClusterRole: We created an RBAC ClusterRole so that Knative can resolve the addresses of KServe inference services and use them as sinks:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inferenceservice-addressable-resolver
  labels:
    kafka.eventing.knative.dev/release: devel
    duck.knative.dev/addressable: "true"
rules:
  - apiGroups:
      - serving.kserve.io
    resources:
      - inferenceservices
      - inferenceservices/status
    verbs:
      - get
      - list
      - watch
  • Create KafkaSource: We configured KafkaSource to connect Kafka topics to the KServe inference service:
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: inference-kafka-source
  namespace: ml-services
spec:
  consumerGroup: inference-consumer-group
  bootstrapServers:
    - kafka-broker.kafka:9092
  topics:
    - inference-requests
  sink:
    ref:
      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      name: my-inference-service
      namespace: ml-services
    uri: /v1/models/my_model:predict

Conclusion

As organizations continue to scale their ML services, such event-driven approaches will become increasingly critical in maintaining system performance and reliability. The combination of modern cloud-native technologies like Kubernetes, Knative Eventing, and KServe provides a robust foundation for building adaptable machine learning infrastructure.
