Navigating AWS Service Migration: Challenges and Lessons Learned
CARS24 is an online platform that simplifies the process of selling and buying cars. We are a technology-driven marketplace and a pioneer in the used car segment.
Our NBFC (non-banking financial company) subsidiary, CARS24 Financial Services Private Limited, was founded in 2018 to finance the cars purchased via our platform.
We have historically hosted all our services in a single AWS account, but recent audit and compliance requirements forced us to migrate the NBFC's tech stack to a separate account, ensuring clear data ownership, proper segregation, and controlled data sharing with other entities.
In this post, we share our journey of migrating these services and their resources and dependencies, along with the pitfalls we hit and the lessons we learned.
Planning & Strategy
Given that our customers rely on our systems 24/7, achieving near zero downtime during migration was a top priority.
Our legacy setup used AWS ECS for container orchestration, while the new AWS account was built on AWS EKS. Since both environments were containerized, we decided to start with the migration of stateless services — essentially moving the application code first.
Stateless Migration
To kick things off, we deployed the application code in the new AWS account while continuing to connect to dependencies (such as the database, S3, and RabbitMQ) in the existing account.
We used Cloudflare for DNS, which allowed us to implement weighted routing. This enabled us to gradually shift traffic to the new environment, monitoring stability and performance before transitioning all traffic.
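The post doesn't spell out the exact Cloudflare configuration, but weighted traffic shifting is typically done through Cloudflare's Load Balancing API with one origin pool per environment and random steering. Below is a minimal sketch under that assumption; the zone, load balancer, and pool IDs are placeholders.

```python
import os
import requests

# Placeholder IDs; replace with your zone and load balancer IDs.
ZONE_ID = "your-zone-id"
LB_ID = "your-load-balancer-id"
API = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/load_balancers/{LB_ID}"
HEADERS = {"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"}

def shift_traffic(old_pool_id: str, new_pool_id: str, new_weight: float) -> None:
    """Route `new_weight` of traffic to the new environment's pool and the
    remainder to the legacy pool, using random steering with pool weights."""
    payload = {
        "steering_policy": "random",
        "random_steering": {
            "pool_weights": {
                old_pool_id: round(1.0 - new_weight, 2),
                new_pool_id: round(new_weight, 2),
            }
        },
    }
    resp = requests.patch(API, headers=HEADERS, json=payload, timeout=10)
    resp.raise_for_status()

# Ramp up gradually (e.g. 10% -> 25% -> 50% -> 100%), watching error
# rates and latency at each step before increasing the weight.
shift_traffic("legacy-ecs-pool-id", "new-eks-pool-id", 0.10)
```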
Stateful Migration
Once the stateless components were successfully running in the new account, we moved on to the more complex part: migrating stateful dependencies. Each one was evaluated for its criticality and the complexity involved in ensuring zero downtime.
1. AWS S3 — Object Storage
AWS supports cross-account S3 replication, which we used in two phases:
- Initial batch copy of existing objects
- Live replication for new/updated objects
Because S3 bucket names are globally unique, the application had to be updated to reference the new bucket names in the new account.
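As an illustration, here is a minimal boto3 sketch of the live-replication phase. The bucket names, account ID, and IAM replication role are placeholders; versioning must be enabled on both buckets, and the role must grant the replication permissions on both sides. The initial batch copy of pre-existing objects is handled separately (for example, with S3 Batch Replication).

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names/ARNs.
SOURCE_BUCKET = "legacy-account-bucket"
DEST_BUCKET_ARN = "arn:aws:s3:::new-account-bucket"
DEST_ACCOUNT_ID = "222222222222"
REPLICATION_ROLE = "arn:aws:iam::111111111111:role/s3-crr-role"

s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "cross-account-live-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # replicate every new/updated object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": DEST_BUCKET_ARN,
                    "Account": DEST_ACCOUNT_ID,
                    # Hand object ownership to the destination account.
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    },
)
```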
2. RabbitMQ (CloudMQ)
Since we use CloudMQ to host RabbitMQ, no data migration was necessary. We simply established VPC peering between the old and new environments to maintain seamless communication.
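For reference, a sketch of setting up the peering connection with boto3; the VPC IDs, account ID, route table, and CIDR below are placeholders, and security groups on both sides still need to allow RabbitMQ's ports (5672/5671).

```python
import boto3

# Session with the old account's credentials: request the peering.
ec2_old = boto3.client("ec2", region_name="ap-south-1")
peering = ec2_old.create_vpc_peering_connection(
    VpcId="vpc-0ld0000000000000",      # old account's VPC
    PeerVpcId="vpc-new0000000000000",  # new account's VPC
    PeerOwnerId="222222222222",        # new account's ID
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Session with the new account's credentials: accept the request, then
# add a route on each side to the other VPC's CIDR via the peering link.
ec2_new = boto3.client("ec2", region_name="ap-south-1")
ec2_new.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
ec2_new.create_route(
    RouteTableId="rtb-new0000000000000",
    DestinationCidrBlock="10.0.0.0/16",  # old VPC's CIDR
    VpcPeeringConnectionId=pcx_id,
)
```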
3. AWS MySQL RDS — Database
This was by far the most challenging part of the migration. To ensure continuous replication with minimal downtime, we set up real-time replication from the old RDS instance to the new one. This allowed us to switch over with minimal service interruption.
DB Replication — Approaches & Challenges
When it came to replicating our MySQL RDS database, we initially evaluated AWS Database Migration Service (DMS) for continuous replication. While DMS is often the go-to choice for such scenarios, we quickly ran into issues.
DMS Limitations
DMS failed to replicate certain tables that contained LOBs (Large Objects). We reached out to AWS Support to explore possible workarounds, but despite various tweaks and attempts, the replication remained unreliable for these critical tables.
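For context, DMS replicates LOB columns through dedicated LOB modes: limited LOB mode truncates values larger than LobMaxSize, while full LOB mode replicates them in chunks at a performance cost. The sketch below shows the kind of task-settings tweak involved, assuming boto3's DMS client; the task ARN and sizes are placeholders, and a running task must be stopped before it can be modified.

```python
import json
import boto3

dms = boto3.client("dms")

# Switch the task to full LOB mode so large objects are replicated
# completely instead of being truncated at LobMaxSize.
task_settings = {
    "TargetMetadata": {
        "FullLobMode": True,      # replicate LOBs in full...
        "LobChunkSize": 64,       # ...in 64 KB chunks (slower)
        "LimitedSizeLobMode": False,
        "LobMaxSize": 0,          # unused in full LOB mode
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:ap-south-1:111111111111:task:EXAMPLE",
    ReplicationTaskSettings=json.dumps(task_settings),
)
```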
Switching to Native MySQL Replication
Eventually, based on recommendations and our own assessment, we opted for native MySQL replication. It offered greater control and reliability for our specific use case. Here's a high-level overview of the steps we followed (a sketch of the key calls appears after the list):
- Take a snapshot of the replica in the source account (Account A) and share it with the destination account (Account B).
- Restore the shared snapshot in Account B and set up replication from Account A's master instance.
- Once the new instance has fully caught up with the source, stop the replication and point the application at Account B's instance, which now functions as the master.
- Finally, create a read replica of the new master in Account B.
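Below is a minimal sketch of the key calls involved, assuming boto3 for the snapshot sharing and RDS MySQL's replication stored procedures for the replication itself; all identifiers, hostnames, and credentials are placeholders.

```python
import boto3
import pymysql

# --- Step 1, in Account A: share the replica's snapshot with Account B.
# (For encrypted snapshots, the KMS key must be shared as well.)
rds_a = boto3.client("rds")  # session with Account A credentials
rds_a.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="nbfc-replica-snapshot",  # placeholder name
    AttributeName="restore",
    ValuesToAdd=["222222222222"],                  # Account B's account ID
)

# --- Step 2, in Account B: after restoring the shared snapshot, point the
# restored instance at Account A's master via RDS's stored procedures.
conn = pymysql.connect(
    host="restored-db.xxxxxxxx.ap-south-1.rds.amazonaws.com",  # placeholder
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    # The binlog file/position are captured on the Account A replica just
    # before snapshotting (e.g. SHOW SLAVE STATUS with replication stopped).
    cur.execute(
        "CALL mysql.rds_set_external_master("
        "'master-a.xxxxxxxx.ap-south-1.rds.amazonaws.com', 3306, "
        "'repl_user', 'repl_password', 'mysql-bin-changelog.000123', 456, 0)"
    )
    cur.execute("CALL mysql.rds_start_replication")

# --- Step 3, cutover: once replication lag reaches zero, stop replication,
# detach from the old master, and repoint the application at this instance.
with conn.cursor() as cur:
    cur.execute("CALL mysql.rds_stop_replication")
    cur.execute("CALL mysql.rds_reset_external_master")
```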
Learnings & Key Takeaways
One of the key decisions that contributed to the success of our migration was avoiding a “big bang” release. Instead, we opted for a phased approach, starting with stateless services before moving on to the stateful components. This not only reduced risk but also gave us the flexibility to handle unexpected issues more gracefully.
We also leveraged Cloudflare’s weighted routing to gradually shift traffic to the new environment. This allowed us to test the waters with real user traffic — without affecting the entire customer base.
Another major factor in our smooth migration was the time we invested in POCs, trials, and thorough testing in lower environments. These early-stage efforts helped us uncover potential issues well before they could impact production.