Building Resilient Node.js BFFs at Scale: Hard-Earned Lessons from Production
CARS24's BFF layer now handles millions of requests across multiple countries, orchestrating data from a host of microservices. It powers the superapp, home pages, car listings, search, checkout, and everything in between. When it works, nobody notices. When it fails, the business notices immediately.
These aren’t theoretical patterns from a side project. They’re solutions that survived the test of real traffic, real failures, and real business consequences. They’re the hard-won lessons from outages, incidents, and countless hours of debugging.
I won’t promise you’ll never face another production issue. But maybe, just maybe, you’ll recognise the warning signs a little earlier. You’ll avoid some of the mistakes we made. You’ll have a better mental model for why things fail in ways you never expected.
So grab your coffee (or tea, I don’t judge), and let’s dive into the messy, beautiful reality of building resilient systems in production. This is the story of what actually works, warts and all.
Welcome to the real world of BFF architecture. I'm glad you're here.
Problem #1: Promise.all() is a Footgun for Production BFFs
The Trap Everyone Falls Into
You learn about Promise.all() and think you're done with parallel execution:
const [users, orders, inventory] = await Promise.all([ getUserData(userId), getOrders(userId), getInventory(userId) ]);
This works great until one service throws an error. Then everything fails.
In production, this means:
- Inventory service has a hiccup → entire user dashboard crashes
- One timeout → user sees error page
- Partial failures become total failures
The Real Solution: Layered Failure Handling
// Layer 1: Critical dependencies that MUST succeed
async function getCriticalData(userId) {
try {
return await Promise.all([
getAuthContext(userId),
getUserProfile(userId)
]);
} catch (error) {
// If critical fails, the request should fail
throw new Error(`Critical dependency failure: ${error.message}`);
}
}
// Layer 2: Important but degradable dependencies
async function getImportantData(userId) {
const results = await Promise.allSettled([
getOrders(userId),
getPaymentStatus(userId)
]);
// Return what succeeded, log what failed
return results.map((result, idx) => {
if (result.status === 'fulfilled') {
return result.value;
}
logger.warn({
service: ['orders', 'payment'][idx],
reason: result.reason.message
});
return null;
});
}
// Layer 3: Optional enrichments (fire and don't wait)
async function startOptionalEnrichments(userId, responseId) {
// Don't await - let these run in background
Promise.allSettled([
getRecommendations(userId),
trackAnalytics(userId),
prefetchRelatedData(userId)
]).then(results => {
// Push updates via WebSocket or cache for next request
notifyClient(responseId, results.filter(r => r.status === 'fulfilled'));
});
}
The key insight: Not all data needs to be in the initial response. Split your dependencies into:
- Blocking critical (must have)
- Blocking degradable (nice to have in response, handle failures)
- Non-blocking (can arrive later via WebSocket/polling)
This pattern reduced our error rate from 15% to <1% during downstream instability.
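The notifyClient helper called in Layer 3 isn't shown above. Here's a minimal sketch of one way to implement it, assuming late-arriving enrichments are stashed in Redis under the response ID so the client can poll for them on its next request (a WebSocket push works the same way). The key format and redisClient are placeholders, not part of the original code:
// Minimal sketch of notifyClient: cache the enrichments for a short time
// so the client can pick them up on its next poll/request.
async function notifyClient(responseId, fulfilledResults) {
  const enrichments = fulfilledResults.map(r => r.value);
  if (enrichments.length === 0) return;
  // Short TTL: this data is only useful for the next poll
  await redisClient.set(
    `enrichments:${responseId}`,
    JSON.stringify(enrichments),
    'EX',
    60
  );
}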
Problem #2: Axios Timeout Doesn’t Actually Cancel Requests
The Misconception
await axios.get(url, { timeout: 1000 });
What developers think: “Request will be cancelled after 1 second”
What actually happens:
- Promise rejects after 1 second ✓
- Socket connection stays open ✗
- Memory isn’t freed ✗
- Database connection on the downstream service keeps running ✗
Under load, this causes:
- Socket pool exhaustion
- Memory leaks
- Cascading failures on downstream services
- “Connection reset by peer” errors
The Actual Solution: AbortController + Proper Cleanup
const http = require('http');
const axios = require('axios');
class ServiceClient {
constructor(baseURL, defaultTimeout = 1000) {
this.axios = axios.create({
baseURL,
httpAgent: new http.Agent({
keepAlive: true,
keepAliveMsecs: 30000,
maxSockets: 50,
maxFreeSockets: 10,
timeout: 60000 // Socket timeout (different from request timeout)
})
});
this.defaultTimeout = defaultTimeout; // Fallback used when a request doesn't pass its own timeout
}
async request(url, options = {}) {
const timeout = options.timeout || this.defaultTimeout;
const controller = new AbortController();
const timeoutId = setTimeout(() => {
controller.abort();
logger.warn({
url,
timeout,
message: 'Request aborted due to timeout'
});
}, timeout);
try {
const response = await this.axios.get(url, {
...options,
signal: controller.signal,
timeout: timeout // Still set axios timeout as fallback
});
return response.data;
} catch (error) {
// Differentiate timeout vs other errors
if (axios.isCancel(error) || error.code === 'ECONNABORTED') {
// TimeoutError is a small custom Error subclass (not shown here)
throw new TimeoutError(`Request to ${url} timed out after ${timeout}ms`);
}
throw error;
} finally {
clearTimeout(timeoutId);
// AbortController cleanup is automatic, but clearing timeout prevents memory leaks
}
}
}
Critical details that must be remembered:
1. httpAgent.timeout vs axios.timeout:
- httpAgent.timeout: How long a socket can be idle before closure
- axios.timeout: Maximum time for the entire request
- You need both for different failure modes
2. AbortSignal actually cancels the underlying request:
- Closes the socket immediately
- Frees memory
- Stops waiting for the response
3. Connection pooling matters (see the sketch below):
- maxSockets: Limit concurrent connections per host
- keepAlive: Reuse TCP connections (massive perf win)
- Without pooling, you'll hit EMFILE (too many open files) errors
This reduced our socket exhaustion incidents from daily to never.
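One more note on pooling: the ServiceClient above only configures httpAgent, so if your downstreams are a mix of HTTP and HTTPS, apply the same settings to both agents. A minimal sketch (the numbers are illustrative, not recommendations):
const http = require('http');
const https = require('https');
const axios = require('axios');

// Illustrative values - tune maxSockets and both timeouts for your own traffic profile
const agentOptions = { keepAlive: true, keepAliveMsecs: 30000, maxSockets: 50, maxFreeSockets: 10, timeout: 60000 };

const client = axios.create({
  httpAgent: new http.Agent(agentOptions),   // idle-socket timeout lives on the agent
  httpsAgent: new https.Agent(agentOptions), // same settings for TLS downstreams
  timeout: 1000                              // per-request timeout lives on axios
});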
Problem #3: The Event Loop Lag You Can’t See
The Silent Killer
Your BFF is serving requests fine. Latency looks good in metrics. Then suddenly during peak traffic:
- Timeouts spike
- Requests queue up
- Nothing appears to be CPU-bound in profiling
The culprit: Event loop blocking from unexpected sources
Common blockers we found:
// 1. Synchronous JSON parsing of large payloads
app.post('/upload', (req, res) => {
const data = JSON.parse(req.body); // Blocks if body is 10MB+
});
// 2. Large object stringification in logging
logger.info('Response:', JSON.stringify(hugeObject)); // Blocks event loop
// 3. Regex on untrusted input
const matches = userInput.match(/complex.*regex/gi); // Can be O(2^n)
// 4. Synchronous crypto operations
const hash = crypto.createHash('sha256').update(data).digest('hex'); // Blocks
The Solution: Measure and Offload
First, make event loop lag visible:
const { performance } = require('perf_hooks');
class EventLoopMonitor {
constructor(threshold = 50) {
this.threshold = threshold;
this.lastCheck = performance.now();
setInterval(() => {
const now = performance.now();
const lag = now - this.lastCheck - 100; // Expected interval is 100ms
if (lag > this.threshold) {
logger.warn({
eventLoopLag: Math.round(lag),
message: 'Event loop blocked'
});
}
this.lastCheck = now;
}, 100);
}
}
// Start monitoring
new EventLoopMonitor(50);
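If you'd rather not roll your own interval math, Node's perf_hooks also ships a histogram-based monitor. A minimal sketch using monitorEventLoopDelay (the threshold, resolution, and reporting interval are illustrative):
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
histogram.enable();

setInterval(() => {
  // The histogram reports values in nanoseconds
  const p99Ms = histogram.percentile(99) / 1e6;
  if (p99Ms > 50) {
    logger.warn({ eventLoopLagP99: Math.round(p99Ms), message: 'Event loop blocked' });
  }
  histogram.reset();
}, 10000);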
Then offload blocking operations:
// worker.js
const { parentPort } = require('worker_threads');
const crypto = require('crypto'); // Needed for the hash-data action below
parentPort.on('message', ({ action, data }) => {
switch(action) {
case 'parse-json':
try {
const parsed = JSON.parse(data);
parentPort.postMessage({ success: true, result: parsed });
} catch (error) {
parentPort.postMessage({ success: false, error: error.message });
}
break;
case 'hash-data':
const hash = crypto.createHash('sha256').update(data).digest('hex');
parentPort.postMessage({ success: true, result: hash });
break;
}
});
// main.js
const { Worker } = require('worker_threads');
class WorkerPool {
constructor(workerPath, size = 4) {
this.workerPath = workerPath; // Kept so we can respawn a worker if one gets stuck
this.workers = Array.from({ length: size }, () => ({
worker: new Worker(workerPath),
busy: false
}));
this.queue = [];
}
async execute(action, data, timeout = 5000) {
return new Promise((resolve, reject) => {
const worker = this.workers.find(w => !w.busy);
if (!worker) {
// Queue if all workers busy
this.queue.push({ action, data, resolve, reject });
return;
}
worker.busy = true;
const timeoutId = setTimeout(() => {
worker.worker.terminate(); // Kill stuck worker
worker.worker = new Worker(this.workerPath); // Spawn new one
worker.busy = false;
reject(new Error('Worker timeout'));
this.processQueue();
}, timeout);
worker.worker.once('message', (result) => {
clearTimeout(timeoutId);
worker.busy = false;
if (result.success) {
resolve(result.result);
} else {
reject(new Error(result.error));
}
this.processQueue();
});
worker.worker.postMessage({ action, data });
});
}
processQueue() {
if (this.queue.length === 0) return;
const worker = this.workers.find(w => !w.busy);
if (worker) {
const task = this.queue.shift();
this.execute(task.action, task.data)
.then(task.resolve)
.catch(task.reject);
}
}
}
// Usage
const workerPool = new WorkerPool('./worker.js');
app.post('/upload', async (req, res) => {
try {
const parsed = await workerPool.execute('parse-json', req.body);
res.json({ success: true, data: parsed });
} catch (error) {
res.status(400).json({ error: error.message });
}
});
When to use worker threads (our rule of thumb):
- JSON parsing/stringifying of objects >100KB
- Crypto operations on data >10KB
- Any synchronous operation taking >10ms in profiling
- Regex on untrusted user input
This reduced our P99 event loop lag from 300ms to <5ms.
If managing worker threads yourself adds too much overhead to your application, consider a third-party library such as Piscina or workerpool.
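For instance, a minimal Piscina version of the upload route above might look like this. The file names, payload shape, and maxThreads value are assumptions, not a drop-in for our code:
// piscina-worker.js - a Piscina worker exports a plain (optionally async) function
module.exports = ({ raw }) => JSON.parse(raw);

// main.js
const path = require('path');
const Piscina = require('piscina');

const pool = new Piscina({
  filename: path.resolve(__dirname, 'piscina-worker.js'),
  maxThreads: 4
});

app.post('/upload', async (req, res) => {
  try {
    // Same idea as the hand-rolled pool above, minus the queueing and respawn code
    const parsed = await pool.run({ raw: req.body });
    res.json({ success: true, data: parsed });
  } catch (error) {
    res.status(400).json({ error: error.message });
  }
});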
Problem #4: Cache Stampede During Downstream Outages
The Scenario
Your cache has a 5-minute TTL. Downstream service goes down. What happens?
async function getData(key) {
const cached = await cache.get(key);
if (cached) return cached;
const fresh = await downstreamService.fetch(key); // This fails
await cache.set(key, fresh, 300);
return fresh;
}
At scale:
- Cache expires for popular key
- 100 requests try to fetch simultaneously
- All 100 requests hit the failing downstream
- Downstream gets overwhelmed further
- None of the requests cache the result (because fetch failed)
- Next 100 requests repeat the cycle
This is a cache stampede amplified by downstream failure.
The Solution: Request Coalescing + Stale-While-Revalidate
class ResilientCache {
constructor(redisClient) {
this.cache = redisClient;
this.inflightRequests = new Map();
}
async get(key, fetchFn, options = {}) {
const { ttl = 300, staleWhileRevalidate = 600 } = options;
// 1. Check cache
const cached = await this.cache.get(key);
if (cached) {
const { data, cachedAt, staleAt } = JSON.parse(cached);
const now = Date.now();
// Fresh data - return immediately
if (now < staleAt) {
return data;
}
// Stale but within grace period - return stale, refresh in background
if (now < staleAt + staleWhileRevalidate * 1000) {
// Fire and forget background refresh
this.backgroundRefresh(key, fetchFn, ttl, staleWhileRevalidate)
.catch(err => logger.warn({ key, error: err.message }));
return data; // Return stale data immediately
}
}
// 2. Check if another request is already fetching this key
if (this.inflightRequests.has(key)) {
return this.inflightRequests.get(key);
}
// 3. Fetch fresh data (only one request does this)
const fetchPromise = this.fetchAndCache(key, fetchFn, ttl, staleWhileRevalidate);
this.inflightRequests.set(key, fetchPromise);
try {
const result = await fetchPromise;
return result;
} finally {
this.inflightRequests.delete(key);
}
}
async fetchAndCache(key, fetchFn, ttl, staleWhileRevalidate) {
try {
const data = await fetchFn();
const now = Date.now();
await this.cache.set(
key,
JSON.stringify({
data,
cachedAt: now,
staleAt: now + ttl * 1000
}),
'EX',
ttl + staleWhileRevalidate
);
return data;
} catch (error) {
// On fetch failure, try to return stale data even if expired
// Note: For repeatedly failing services, consider adding a circuit breaker
// to stop retry attempts temporarily and fail fast
const stale = await this.cache.get(key);
if (stale) {
const { data } = JSON.parse(stale);
logger.warn({
key,
message: 'Returning stale data due to fetch failure',
error: error.message
});
return data;
}
throw error;
}
}
async backgroundRefresh(key, fetchFn, ttl, staleWhileRevalidate) {
try {
await this.fetchAndCache(key, fetchFn, ttl, staleWhileRevalidate);
} catch (error) {
// Silently fail background refreshes
logger.debug({ key, message: 'Background refresh failed' });
}
}
}
// Usage
const cache = new ResilientCache(redisClient);
async function getUserData(userId) {
return cache.get(
`user:${userId}`,
() => userService.fetch(userId),
{
ttl: 300, // Fresh for 5 minutes
staleWhileRevalidate: 600 // Serve stale for up to 10 more minutes
}
);
}
Why this works:
- Request coalescing: Only one request fetches data for a key, others wait for that result
- Stale-while-revalidate: During outages, serve stale data instead of failing
- Background refresh: Users get instant responses with stale data while fresh data loads
- Graceful degradation: System stays up even when downstream is completely down
This pattern eliminated our cache stampede incidents entirely and kept availability >99.9% even during downstream outages.
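The circuit breaker mentioned in the fetchAndCache comment doesn't need a library either. Here's a rough sketch (state machine simplified, thresholds illustrative) of one you could wrap around the fetch function before handing it to the cache; orderService is assumed for the usage example:
class CircuitBreaker {
  constructor(fn, { failureThreshold = 5, resetTimeout = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.failures = 0;
    this.state = 'closed'; // 'closed' | 'open' | 'half-open'
    this.openedAt = 0;
  }

  async exec(...args) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        // Fail fast instead of hammering a downstream that's already struggling
        throw new Error('Circuit open - failing fast');
      }
      this.state = 'half-open'; // Let a single trial request through
    }
    try {
      const result = await this.fn(...args);
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}

// Usage with the ResilientCache above (orderService is assumed)
const ordersBreaker = new CircuitBreaker(userId => orderService.fetch(userId));

function getOrdersCached(userId) {
  return cache.get(
    `orders:${userId}`,
    () => ordersBreaker.exec(userId),
    { ttl: 60, staleWhileRevalidate: 300 }
  );
}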
Problem #5: Memory Leaks from Forgotten Cleanup
The Subtle Leak
async function processWithTimeout(fn, timeout) {
return Promise.race([
fn(),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), timeout)
)
]);
}
This leaks memory. If fn() settles before the timeout, the timer is never cleared; it keeps its closure alive until it eventually fires. Over thousands of requests, you accumulate thousands of pending timers.
The Proper Pattern
async function processWithTimeout(fn, timeout) {
let timeoutId;
const timeoutPromise = new Promise((_, reject) => {
timeoutId = setTimeout(() => reject(new Error('Timeout')), timeout);
});
try {
return await Promise.race([fn(), timeoutPromise]);
} finally {
clearTimeout(timeoutId); // Always cleanup
}
}
Other common leak sources:
// Event listeners that aren't removed
class RequestHandler {
constructor() {
this.controller = new AbortController();
// BAD: Listener never removed
this.controller.signal.addEventListener('abort', () => {
this.cleanup();
});
}
}
// GOOD: Remove listeners
class RequestHandler {
constructor() {
this.controller = new AbortController();
this.abortHandler = () => this.cleanup();
this.controller.signal.addEventListener('abort', this.abortHandler);
}
destroy() {
this.controller.signal.removeEventListener('abort', this.abortHandler);
}
}
// Axios interceptors that accumulate
// BAD: Adds a new interceptor on every request
app.use((req, res, next) => {
axios.interceptors.request.use(config => {
config.headers['X-Request-ID'] = req.id;
return config;
});
next();
});
// GOOD: Add interceptor once at startup
const requestInterceptor = axios.interceptors.request.use(config => {
// asyncLocalStorage carries per-request context - see the complete request pattern at the end of this post
config.headers['X-Request-ID'] = asyncLocalStorage.getStore()?.requestId;
return config;
});
We found these leaks by running heap snapshots at 1-hour intervals and comparing:
// Memory profiling in staging
if (process.env.NODE_ENV === 'staging') {
const v8 = require('v8');
setInterval(() => {
// writeHeapSnapshot() is synchronous and heavy - keep it behind a staging flag like this
const snapshot = v8.writeHeapSnapshot();
logger.info({ heapSnapshot: snapshot });
}, 3600000); // Every hour
}
Problem #6: Logging at Scale Kills Performance
The Innocent Looking Code
logger.info('Request received', {
userId: req.user.id,
body: req.body,
timestamp: new Date().toISOString()
});
At 1000 RPS:
- JSON.stringify(req.body) runs 1000 times/second
- If body is 50KB, that’s 50MB/s of serialization
- Blocks event loop 1–5ms per call
- Cumulative blocking: 1–5 seconds of event loop time per second
Your logs are causing the latency you’re logging about.
The Solution: Sampling + Async Logging
class SmartLogger {
constructor(options = {}) {
this.sampleRate = options.sampleRate || 0.1; // Log 10% of requests
this.alwaysLogErrors = options.alwaysLogErrors !== false;
this.queue = [];
this.flushInterval = options.flushInterval || 1000;
this.transport = options.transport; // Whatever ships the batch (stream, HTTP collector, etc.)
// Async flush to transport
setInterval(() => this.flush(), this.flushInterval);
}
shouldSample() {
return Math.random() < this.sampleRate;
}
info(message, meta = {}) {
// Sample normal requests
if (!this.shouldSample()) return;
this.queue.push({
level: 'info',
message,
...this.sanitizeMeta(meta),
timestamp: Date.now()
});
}
error(message, meta = {}) {
// Always log errors
this.queue.push({
level: 'error',
message,
...this.sanitizeMeta(meta),
timestamp: Date.now()
});
}
sanitizeMeta(meta) {
// Don't log large objects - log their size instead
return Object.entries(meta).reduce((acc, [key, value]) => {
if (typeof value === 'object' && value !== null) {
// Note: this still pays one JSON.stringify to measure size; use a cheaper heuristic on very hot paths
const size = Buffer.byteLength(JSON.stringify(value));
acc[`${key}Size`] = size;
// Only include full object if small
if (size < 1024) {
acc[key] = value;
}
} else {
acc[key] = value;
}
return acc;
}, {});
}
async flush() {
if (this.queue.length === 0) return;
const batch = this.queue.splice(0, this.queue.length);
// Async write - doesn't block event loop
setImmediate(() => {
// Send to logging service
this.transport.write(batch).catch(err => {
console.error('Failed to flush logs:', err);
});
});
}
}
// Usage with adaptive sampling
class AdaptiveLogger extends SmartLogger {
constructor(options) {
super(options);
this.errorRate = 0;
this.lastMinuteErrors = [];
}
error(message, meta) {
super.error(message, meta);
// Track error rate
this.lastMinuteErrors.push(Date.now());
this.lastMinuteErrors = this.lastMinuteErrors.filter(
t => t > Date.now() - 60000
);
this.errorRate = this.lastMinuteErrors.length / 60;
// Increase sampling when error rate is high
if (this.errorRate > 10) {
this.sampleRate = 1.0; // Log everything
} else if (this.errorRate > 5) {
this.sampleRate = 0.5; // Log 50%
} else {
this.sampleRate = 0.1; // Log 10%
}
}
}
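A usage sketch, assuming the transport interface implied by flush() above (an object with an async write(batch) method); the stdout transport is just a placeholder for whatever log shipper you use:
const logger = new AdaptiveLogger({
  sampleRate: 0.1,
  flushInterval: 1000,
  transport: {
    // Ship each batch wherever your logs go (HTTP collector, Kafka, stdout, ...)
    async write(batch) {
      process.stdout.write(batch.map(entry => JSON.stringify(entry)).join('\n') + '\n');
    }
  }
});

logger.info('request_completed', { userId: 42, latency: 87 });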
This reduced our logging overhead from 15% CPU to <2% CPU while maintaining debuggability.
The Complete BFF Request Pattern
const { AsyncLocalStorage } = require('async_hooks');
const asyncLocalStorage = new AsyncLocalStorage();
class BFFHandler {
constructor() {
this.cache = new ResilientCache(redis);
this.logger = new AdaptiveLogger();
this.workerPool = new WorkerPool('./worker.js');
}
async handleRequest(req, res) {
const requestId = req.headers['x-request-id'] || generateId();
const startTime = Date.now();
// Store request context for access anywhere
return asyncLocalStorage.run({ requestId }, async () => {
try {
const userId = req.user.id;
// Phase 1: Critical data (must succeed)
const critical = await this.getCriticalData(userId);
// Phase 2: Important data (degrade gracefully)
const important = await this.getImportantData(userId);
// Phase 3: Optional data (fire and forget)
this.startOptionalEnrichments(userId, requestId);
// Phase 4: Transform (offload if needed)
const response = await this.transformResponse(critical, important);
// Log success
this.logger.info('request_completed', {
userId,
latency: Date.now() - startTime,
path: req.path
});
res.json(response);
} catch (error) {
this.logger.error('request_failed', {
error: error.message,
stack: error.stack,
userId: req.user?.id,
latency: Date.now() - startTime
});
res.status(500).json({ error: 'Internal server error' });
}
});
}
async getCriticalData(userId) {
const timeout = 800;
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeout);
try {
return await Promise.all([
serviceClient.request('/auth', { userId, signal: controller.signal }),
serviceClient.request('/profile', { userId, signal: controller.signal })
]);
} finally {
clearTimeout(timeoutId);
}
}
async getImportantData(userId) {
const results = await Promise.allSettled([
this.cache.get(`orders:${userId}`, () =>
serviceClient.request('/orders', { userId }),
{ ttl: 60, staleWhileRevalidate: 300 }
),
this.cache.get(`payment:${userId}`, () =>
serviceClient.request('/payment', { userId }),
{ ttl: 30, staleWhileRevalidate: 120 }
)
]);
return results.map(r => r.status === 'fulfilled' ? r.value : null);
}
startOptionalEnrichments(userId, requestId) {
Promise.allSettled([
serviceClient.request('/recommendations', { userId }),
serviceClient.request('/analytics', { userId })
]).then(results => {
// Send via WebSocket or cache for next request
this.notifyClient(requestId, results);
});
}
async transformResponse(critical, important) {
// critical and important are arrays (from Promise.all / allSettled), so name their elements explicitly
const [authContext, profile] = critical;
const [orders, payment] = important;
const combined = { authContext, profile, orders, payment };
// Check if transformation is CPU-intensive
const dataSize = Buffer.byteLength(JSON.stringify(combined));
if (dataSize > 100000) {
// Offload to worker thread
return this.workerPool.execute('transform', combined);
}
// Simple transformation on main thread
return this.transform(combined);
}
}
Key Takeaways
- Promise.all() vs Promise.allSettled() vs fire-and-forget — Know when to use each based on criticality
- Timeouts need AbortController — Otherwise you’re not actually cancelling requests
- Event loop lag is invisible — Instrument it explicitly or you’ll never know what’s blocking
- Cache stampedes amplify outages — Use request coalescing + stale-while-revalidate
- Memory leaks are cumulative — Always cleanup timers, listeners, and interceptors
- Logging at scale requires sampling — Adaptive sampling based on error rate works well
Conclusion
These aren’t beginner problems. They’re the issues that emerge when “basic” Node.js patterns meet production traffic.
The difference between a BFF that handles 100 RPS and one that handles 10,000 RPS isn't some magical framework; it's understanding these failure modes and designing around them.
Most importantly: These patterns compound. Each one individually gives modest improvements. Together, they’re the difference between a system that falls over and one that stays up.