Navigating AWS Service Migration: Challenges and Lessons Learned
CARS24 is an online platform that simplifies the process of selling and buying cars. We are a technology-driven marketplace and a pioneer in the used car segment.
Our NBFC (non-banking financial company) subsidiary, CARS24 Financial Services Private Limited, was founded in 2018 to finance the cars purchased via our platform.
We have historically hosted all our services in a single AWS account, but recent audit and compliance requirements forced us to migrate the NBFC's tech stack to a separate account, ensuring clear data ownership, proper segregation, and controlled data sharing with other entities.
In this post, we share our journey of migrating these services and their resources and dependencies, along with the pitfalls we hit and the lessons we learned.
Planning & Strategy
Given that our customers rely on our systems 24/7, achieving near zero downtime during migration was a top priority.
Our legacy setup used AWS ECS for container orchestration, while the new AWS account was built on AWS EKS. Since both environments were containerized, we decided to start with the migration of stateless services — essentially moving the application code first.
Stateless Migration
To kick things off, we deployed the application code in the new AWS account while continuing to connect to dependencies (such as the database, S3, and RabbitMQ) in the existing account.
We used Cloudflare for DNS, which allowed us to implement weighted routing. This enabled us to gradually shift traffic to the new environment, monitoring stability and performance before transitioning all traffic.
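The post doesn't spell out the exact Cloudflare configuration, but weighted traffic shifting is typically done through Cloudflare's Load Balancing API with one origin pool per environment and random steering. Below is a minimal sketch under that assumption; the zone, load balancer, and pool IDs are placeholders.

```python
import os
import requests

# Placeholder IDs; replace with your zone and load balancer IDs.
ZONE_ID = "your-zone-id"
LB_ID = "your-load-balancer-id"
API = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/load_balancers/{LB_ID}"
HEADERS = {"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"}

def shift_traffic(old_pool_id: str, new_pool_id: str, new_weight: float) -> None:
    """Route `new_weight` of traffic to the new environment's pool and the
    remainder to the legacy pool, using random steering with pool weights."""
    payload = {
        "steering_policy": "random",
        "random_steering": {
            "pool_weights": {
                old_pool_id: round(1.0 - new_weight, 2),
                new_pool_id: round(new_weight, 2),
            }
        },
    }
    resp = requests.patch(API, headers=HEADERS, json=payload, timeout=10)
    resp.raise_for_status()

# Ramp up gradually (e.g. 10% -> 25% -> 50% -> 100%), watching error
# rates and latency at each step before increasing the weight.
shift_traffic("legacy-ecs-pool-id", "new-eks-pool-id", 0.10)
```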
Stateful Migration
Once the stateless components were successfully running in the new account, we moved on to the more complex part: migrating stateful dependencies. Each one was evaluated for its criticality and the complexity involved in ensuring zero downtime.
1. AWS S3 — Object Storage
AWS supports cross-account S3 replication, which we used in two phases:
- Initial batch copy of existing objects
- Live replication for new/updated objects
Because S3 bucket names are globally unique, the application had to be updated to reference the new bucket names in the new account.
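As an illustration, here is a minimal boto3 sketch of the live-replication phase. The bucket names, account ID, and IAM replication role are placeholders; versioning must be enabled on both buckets, and the role must grant the replication permissions on both sides. The initial batch copy of pre-existing objects is handled separately (for example, with S3 Batch Replication).

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names/ARNs.
SOURCE_BUCKET = "legacy-account-bucket"
DEST_BUCKET_ARN = "arn:aws:s3:::new-account-bucket"
DEST_ACCOUNT_ID = "222222222222"
REPLICATION_ROLE = "arn:aws:iam::111111111111:role/s3-crr-role"

s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "cross-account-live-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # replicate every new/updated object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": DEST_BUCKET_ARN,
                    "Account": DEST_ACCOUNT_ID,
                    # Hand object ownership to the destination account.
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    },
)
```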
2. RabbitMQ (CloudMQ)
Since we use CloudMQ to host RabbitMQ, no data migration was necessary. We simply established VPC peering between the old and new environments to maintain seamless communication.
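For reference, a sketch of setting up the peering connection with boto3; the VPC IDs, account ID, route table, and CIDR below are placeholders, and security groups on both sides still need to allow RabbitMQ's ports (5672/5671).

```python
import boto3

# Session with the old account's credentials: request the peering.
ec2_old = boto3.client("ec2", region_name="ap-south-1")
peering = ec2_old.create_vpc_peering_connection(
    VpcId="vpc-0ld0000000000000",      # old account's VPC
    PeerVpcId="vpc-new0000000000000",  # new account's VPC
    PeerOwnerId="222222222222",        # new account's ID
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Session with the new account's credentials: accept the request, then
# add a route on each side to the other VPC's CIDR via the peering link.
ec2_new = boto3.client("ec2", region_name="ap-south-1")
ec2_new.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
ec2_new.create_route(
    RouteTableId="rtb-new0000000000000",
    DestinationCidrBlock="10.0.0.0/16",  # old VPC's CIDR
    VpcPeeringConnectionId=pcx_id,
)
```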
3. AWS MySQL RDS — Database
This was by far the most challenging part of the migration. To ensure continuous replication with minimal downtime, we set up real-time replication from the old RDS instance to the new one. This allowed us to switch over with minimal service interruption.
DB Replication — Approaches & Challenges
When it came to replicating our MySQL RDS database, we initially evaluated AWS Database Migration Service (DMS) for continuous replication. While DMS is often the go-to choice for such scenarios, we quickly ran into issues.
DMS Limitations
DMS failed to replicate certain tables that contained LOBs (Large Objects). We reached out to AWS Support to explore possible workarounds, but despite various tweaks and attempts, the replication remained unreliable for these critical tables.
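For context, DMS replicates LOB columns through dedicated LOB modes: limited LOB mode truncates values larger than LobMaxSize, while full LOB mode replicates them in chunks at a performance cost. The sketch below shows the kind of task-settings tweak involved, assuming boto3's DMS client; the task ARN and sizes are placeholders, and a running task must be stopped before it can be modified.

```python
import json
import boto3

dms = boto3.client("dms")

# Switch the task to full LOB mode so large objects are replicated
# completely instead of being truncated at LobMaxSize.
task_settings = {
    "TargetMetadata": {
        "FullLobMode": True,      # replicate LOBs in full...
        "LobChunkSize": 64,       # ...in 64 KB chunks (slower)
        "LimitedSizeLobMode": False,
        "LobMaxSize": 0,          # unused in full LOB mode
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:ap-south-1:111111111111:task:EXAMPLE",
    ReplicationTaskSettings=json.dumps(task_settings),
)
```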
Switching to Native MySQL Replication
Eventually, based on recommendations and our own assessment, we opted for native MySQL replication. It offered greater control and reliability for our specific use case. Here's a high-level overview of the steps we followed (a sketch of the key calls appears after the list):
- Take a snapshot of the replica in the source account (Account A) and share it with the destination account (Account B).
- Restore the shared snapshot in Account B and set up replication from Account A's master instance.
- Once the new instance has fully caught up with the source, stop the replication and point the application at Account B's instance, which now functions as the master.
- Finally, create a read replica of the new master in Account B.
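Below is a minimal sketch of the key calls involved, assuming boto3 for the snapshot sharing and RDS MySQL's replication stored procedures for the replication itself; all identifiers, hostnames, and credentials are placeholders.

```python
import boto3
import pymysql

# --- Step 1, in Account A: share the replica's snapshot with Account B.
# (For encrypted snapshots, the KMS key must be shared as well.)
rds_a = boto3.client("rds")  # session with Account A credentials
rds_a.modify_db_snapshot_attribute(
    DBSnapshotIdentifier="nbfc-replica-snapshot",  # placeholder name
    AttributeName="restore",
    ValuesToAdd=["222222222222"],                  # Account B's account ID
)

# --- Step 2, in Account B: after restoring the shared snapshot, point the
# restored instance at Account A's master via RDS's stored procedures.
conn = pymysql.connect(
    host="restored-db.xxxxxxxx.ap-south-1.rds.amazonaws.com",  # placeholder
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    # The binlog file/position are captured on the Account A replica just
    # before snapshotting (e.g. SHOW SLAVE STATUS with replication stopped).
    cur.execute(
        "CALL mysql.rds_set_external_master("
        "'master-a.xxxxxxxx.ap-south-1.rds.amazonaws.com', 3306, "
        "'repl_user', 'repl_password', 'mysql-bin-changelog.000123', 456, 0)"
    )
    cur.execute("CALL mysql.rds_start_replication")

# --- Step 3, cutover: once replication lag reaches zero, stop replication,
# detach from the old master, and repoint the application at this instance.
with conn.cursor() as cur:
    cur.execute("CALL mysql.rds_stop_replication")
    cur.execute("CALL mysql.rds_reset_external_master")
```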
Learnings & Key Takeaways
One of the key decisions that contributed to the success of our migration was avoiding a “big bang” release. Instead, we opted for a phased approach, starting with stateless services before moving on to the stateful components. This not only reduced risk but also gave us the flexibility to handle unexpected issues more gracefully.
We also leveraged Cloudflare’s weighted routing to gradually shift traffic to the new environment. This allowed us to test the waters with real user traffic — without affecting the entire customer base.
Another major factor in our smooth migration was the time we invested in POCs, trials, and thorough testing in lower environments. These early-stage efforts helped us uncover potential issues well before they could impact production.