Building Resilient Node.js BFFs at Scale: Hard-Earned Lessons from Production
CARS24's BFF layer now handles millions of requests across multiple countries, orchestrating data from a host of microservices. It powers the superapp, home pages, car listings, search, checkout, and everything in between. When it works, nobody notices. When it fails, the business notices immediately.
These aren’t theoretical patterns from a side project. They’re solutions that survived the test of real traffic, real failures, and real business consequences. They’re the hard-won lessons from outages, incidents, and countless hours of debugging.
I won’t promise you’ll never face another production issue. But maybe, just maybe, you’ll recognise the warning signs a little earlier. You’ll avoid some of the mistakes we made. You’ll have a better mental model for why things fail in ways you never expected.
So grab your coffee (or tea, I don’t judge), and let’s dive into the messy, beautiful reality of building resilient systems in production. This is the story of what actually works, warts and all.
Welcome to the real world of BFF architecture. I'm glad you're here.
Problem #1: Promise.all() is a Footgun for Production BFFs
The Trap Everyone Falls Into
You learn about Promise.all() and think you're done with parallel execution:
const [users, orders, inventory] = await Promise.all([ getUserData(userId), getOrders(userId), getInventory(userId) ]);
This works great until one service throws an error. Then everything fails.
In production, this means:
- Inventory service has a hiccup → entire user dashboard crashes
- One timeout → user sees error page
- Partial failures become total failures
The Real Solution: Layered Failure Handling
// Layer 1: Critical dependencies that MUST succeed
async function getCriticalData(userId) {
try {
return await Promise.all([
getAuthContext(userId),
getUserProfile(userId)
]);
} catch (error) {
// If critical fails, the request should fail
throw new Error(`Critical dependency failure: ${error.message}`);
}
}
// Layer 2: Important but degradable dependencies
async function getImportantData(userId) {
const results = await Promise.allSettled([
getOrders(userId),
getPaymentStatus(userId)
]);
// Return what succeeded, log what failed
return results.map((result, idx) => {
if (result.status === 'fulfilled') {
return result.value;
}
logger.warn({
service: ['orders', 'payment'][idx],
reason: result.reason.message
});
return null;
});
}
// Layer 3: Optional enrichments (fire and don't wait)
async function startOptionalEnrichments(userId, responseId) {
// Don't await - let these run in background
Promise.allSettled([
getRecommendations(userId),
trackAnalytics(userId),
prefetchRelatedData(userId)
]).then(results => {
// Push updates via WebSocket or cache for next request
notifyClient(responseId, results.filter(r => r.status === 'fulfilled'));
});
}
The key insight: Not all data needs to be in the initial response. Split your dependencies into:
- Blocking critical (must have)
- Blocking degradable (nice to have in response, handle failures)
- Non-blocking (can arrive later via WebSocket/polling)
This pattern reduced our error rate from 15% to <1% during downstream instability.
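The notifyClient helper called in Layer 3 isn't shown above. Here's a minimal sketch of one way to implement it, assuming late-arriving enrichments are stashed in Redis under the response ID so the client can poll for them on its next request (a WebSocket push works the same way). The key format and redisClient are placeholders, not part of the original code:
// Minimal sketch of notifyClient: cache the enrichments for a short time
// so the client can pick them up on its next poll/request.
async function notifyClient(responseId, fulfilledResults) {
  const enrichments = fulfilledResults.map(r => r.value);
  if (enrichments.length === 0) return;
  // Short TTL: this data is only useful for the next poll
  await redisClient.set(
    `enrichments:${responseId}`,
    JSON.stringify(enrichments),
    'EX',
    60
  );
}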
Problem #2: Axios Timeout Doesn’t Actually Cancel Requests
The Misconception
await axios.get(url, { timeout: 1000 });
What developers think: “Request will be cancelled after 1 second”
What actually happens:
- Promise rejects after 1 second ✓
- Socket connection stays open ✗
- Memory isn’t freed ✗
- Database connection on the downstream service keeps running ✗
Under load, this causes:
- Socket pool exhaustion
- Memory leaks
- Cascading failures on downstream services
- “Connection reset by peer” errors
The Actual Solution: AbortController + Proper Cleanup
const http = require('http');
const axios = require('axios');
class ServiceClient {
constructor(baseURL, defaultTimeout = 1000) {
this.axios = axios.create({
baseURL,
httpAgent: new http.Agent({
keepAlive: true,
keepAliveMsecs: 30000,
maxSockets: 50,
maxFreeSockets: 10,
timeout: 60000 // Socket timeout (different from request timeout)
})
});
this.defaultTimeout = defaultTimeout; // Fallback used when a request doesn't pass its own timeout
}
async request(url, options = {}) {
const timeout = options.timeout || this.defaultTimeout;
const controller = new AbortController();
const timeoutId = setTimeout(() => {
controller.abort();
logger.warn({
url,
timeout,
message: 'Request aborted due to timeout'
});
}, timeout);
try {
const response = await this.axios.get(url, {
...options,
signal: controller.signal,
timeout: timeout // Still set axios timeout as fallback
});
return response.data;
} catch (error) {
// Differentiate timeout vs other errors
if (axios.isCancel(error) || error.code === 'ECONNABORTED') {
// TimeoutError is a small custom Error subclass (not shown here)
throw new TimeoutError(`Request to ${url} timed out after ${timeout}ms`);
}
throw error;
} finally {
clearTimeout(timeoutId);
// AbortController cleanup is automatic, but clearing timeout prevents memory leaks
}
}
}
Critical details that must be remembered:
1. httpAgent.timeout vs axios.timeout:
- httpAgent.timeout: How long a socket can be idle before closure
- axios.timeout: Maximum time for the entire request
- You need both for different failure modes
2. AbortSignal actually cancels the underlying request:
- Closes the socket immediately
- Frees memory
- Stops waiting for the response
3. Connection pooling matters (see the sketch below):
- maxSockets: Limit concurrent connections per host
- keepAlive: Reuse TCP connections (massive perf win)
- Without pooling, you'll hit EMFILE (too many open files) errors
This reduced our socket exhaustion incidents from daily to never.
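One more note on pooling: the ServiceClient above only configures httpAgent, so if your downstreams are a mix of HTTP and HTTPS, apply the same settings to both agents. A minimal sketch (the numbers are illustrative, not recommendations):
const http = require('http');
const https = require('https');
const axios = require('axios');

// Illustrative values - tune maxSockets and both timeouts for your own traffic profile
const agentOptions = { keepAlive: true, keepAliveMsecs: 30000, maxSockets: 50, maxFreeSockets: 10, timeout: 60000 };

const client = axios.create({
  httpAgent: new http.Agent(agentOptions),   // idle-socket timeout lives on the agent
  httpsAgent: new https.Agent(agentOptions), // same settings for TLS downstreams
  timeout: 1000                              // per-request timeout lives on axios
});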
Problem #3: The Event Loop Lag You Can’t See
The Silent Killer
Your BFF is serving requests fine. Latency looks good in metrics. Then suddenly during peak traffic:
- Timeouts spike
- Requests queue up
- Nothing appears to be CPU-bound in profiling
The culprit: Event loop blocking from unexpected sources
Common blockers we found:
// 1. Synchronous JSON parsing of large payloads
app.post('/upload', (req, res) => {
const data = JSON.parse(req.body); // Blocks if body is 10MB+
});
// 2. Large object stringification in logging
logger.info('Response:', JSON.stringify(hugeObject)); // Blocks event loop
// 3. Regex on untrusted input
const matches = userInput.match(/complex.*regex/gi); // Can be O(2^n)
// 4. Synchronous crypto operations
const hash = crypto.createHash('sha256').update(data).digest('hex'); // Blocks
The Solution: Measure and Offload
First, make event loop lag visible:
const { performance } = require('perf_hooks');
class EventLoopMonitor {
constructor(threshold = 50) {
this.threshold = threshold;
this.lastCheck = performance.now();
setInterval(() => {
const now = performance.now();
const lag = now - this.lastCheck - 100; // Expected interval is 100ms
if (lag > this.threshold) {
logger.warn({
eventLoopLag: Math.round(lag),
message: 'Event loop blocked'
});
}
this.lastCheck = now;
}, 100);
}
}
// Start monitoring
new EventLoopMonitor(50);
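If you'd rather not roll your own interval math, Node's perf_hooks also ships a histogram-based monitor. A minimal sketch using monitorEventLoopDelay (the threshold, resolution, and reporting interval are illustrative):
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
histogram.enable();

setInterval(() => {
  // The histogram reports values in nanoseconds
  const p99Ms = histogram.percentile(99) / 1e6;
  if (p99Ms > 50) {
    logger.warn({ eventLoopLagP99: Math.round(p99Ms), message: 'Event loop blocked' });
  }
  histogram.reset();
}, 10000);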
Then offload blocking operations:
// worker.js
const { parentPort } = require('worker_threads');
const crypto = require('crypto'); // Needed for the hash-data action below
parentPort.on('message', ({ action, data }) => {
switch(action) {
case 'parse-json':
try {
const parsed = JSON.parse(data);
parentPort.postMessage({ success: true, result: parsed });
} catch (error) {
parentPort.postMessage({ success: false, error: error.message });
}
break;
case 'hash-data':
const hash = crypto.createHash('sha256').update(data).digest('hex');
parentPort.postMessage({ success: true, result: hash });
break;
}
});
// main.js
const { Worker } = require('worker_threads');
class WorkerPool {
constructor(workerPath, size = 4) {
this.workerPath = workerPath; // Kept so we can respawn a worker if one gets stuck
this.workers = Array.from({ length: size }, () => ({
worker: new Worker(workerPath),
busy: false
}));
this.queue = [];
}
async execute(action, data, timeout = 5000) {
return new Promise((resolve, reject) => {
const worker = this.workers.find(w => !w.busy);
if (!worker) {
// Queue if all workers busy
this.queue.push({ action, data, resolve, reject });
return;
}
worker.busy = true;
const timeoutId = setTimeout(() => {
worker.worker.terminate(); // Kill stuck worker
worker.worker = new Worker(this.workerPath); // Spawn new one
worker.busy = false;
reject(new Error('Worker timeout'));
this.processQueue();
}, timeout);
worker.worker.once('message', (result) => {
clearTimeout(timeoutId);
worker.busy = false;
if (result.success) {
resolve(result.result);
} else {
reject(new Error(result.error));
}
this.processQueue();
});
worker.worker.postMessage({ action, data });
});
}
processQueue() {
if (this.queue.length === 0) return;
const worker = this.workers.find(w => !w.busy);
if (worker) {
const task = this.queue.shift();
this.execute(task.action, task.data)
.then(task.resolve)
.catch(task.reject);
}
}
}
// Usage
const workerPool = new WorkerPool('./worker.js');
app.post('/upload', async (req, res) => {
try {
const parsed = await workerPool.execute('parse-json', req.body);
res.json({ success: true, data: parsed });
} catch (error) {
res.status(400).json({ error: error.message });
}
});
When to use worker threads (our rule of thumb):
- JSON parsing/stringifying of objects >100KB
- Crypto operations on data >10KB
- Any synchronous operation taking >10ms in profiling
- Regex on untrusted user input
This reduced our P99 event loop lag from 300ms to <5ms.
If managing worker threads yourself adds too much overhead to your application, consider a third-party library such as Piscina or workerpool.
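For instance, a minimal Piscina version of the upload route above might look like this. The file names, payload shape, and maxThreads value are assumptions, not a drop-in for our code:
// piscina-worker.js - a Piscina worker exports a plain (optionally async) function
module.exports = ({ raw }) => JSON.parse(raw);

// main.js
const path = require('path');
const Piscina = require('piscina');

const pool = new Piscina({
  filename: path.resolve(__dirname, 'piscina-worker.js'),
  maxThreads: 4
});

app.post('/upload', async (req, res) => {
  try {
    // Same idea as the hand-rolled pool above, minus the queueing and respawn code
    const parsed = await pool.run({ raw: req.body });
    res.json({ success: true, data: parsed });
  } catch (error) {
    res.status(400).json({ error: error.message });
  }
});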
Problem #4: Cache Stampede During Downstream Outages
The Scenario
Your cache has a 5-minute TTL. Downstream service goes down. What happens?
async function getData(key) {
const cached = await cache.get(key);
if (cached) return cached;
const fresh = await downstreamService.fetch(key); // This fails
await cache.set(key, fresh, 300);
return fresh;
}
At scale:
- Cache expires for popular key
- 100 requests try to fetch simultaneously
- All 100 requests hit the failing downstream
- Downstream gets overwhelmed further
- None of the requests cache the result (because fetch failed)
- Next 100 requests repeat the cycle
This is a cache stampede amplified by downstream failure.
The Solution: Request Coalescing + Stale-While-Revalidate
class ResilientCache {
constructor(redisClient) {
this.cache = redisClient;
this.inflightRequests = new Map();
}
async get(key, fetchFn, options = {}) {
const { ttl = 300, staleWhileRevalidate = 600 } = options;
// 1. Check cache
const cached = await this.cache.get(key);
if (cached) {
const { data, cachedAt, staleAt } = JSON.parse(cached);
const now = Date.now();
// Fresh data - return immediately
if (now < staleAt) {
return data;
}
// Stale but within grace period - return stale, refresh in background
if (now < staleAt + staleWhileRevalidate * 1000) {
// Fire and forget background refresh
this.backgroundRefresh(key, fetchFn, ttl, staleWhileRevalidate)
.catch(err => logger.warn({ key, error: err.message }));
return data; // Return stale data immediately
}
}
// 2. Check if another request is already fetching this key
if (this.inflightRequests.has(key)) {
return this.inflightRequests.get(key);
}
// 3. Fetch fresh data (only one request does this)
const fetchPromise = this.fetchAndCache(key, fetchFn, ttl, staleWhileRevalidate);
this.inflightRequests.set(key, fetchPromise);
try {
const result = await fetchPromise;
return result;
} finally {
this.inflightRequests.delete(key);
}
}
async fetchAndCache(key, fetchFn, ttl, staleWhileRevalidate) {
try {
const data = await fetchFn();
const now = Date.now();
await this.cache.set(
key,
JSON.stringify({
data,
cachedAt: now,
staleAt: now + ttl * 1000
}),
'EX',
ttl + staleWhileRevalidate
);
return data;
} catch (error) {
// On fetch failure, try to return stale data even if expired
// Note: For repeatedly failing services, consider adding a circuit breaker
// to stop retry attempts temporarily and fail fast
const stale = await this.cache.get(key);
if (stale) {
const { data } = JSON.parse(stale);
logger.warn({
key,
message: 'Returning stale data due to fetch failure',
error: error.message
});
return data;
}
throw error;
}
}
async backgroundRefresh(key, fetchFn, ttl, staleWhileRevalidate) {
try {
await this.fetchAndCache(key, fetchFn, ttl, staleWhileRevalidate);
} catch (error) {
// Silently fail background refreshes
logger.debug({ key, message: 'Background refresh failed' });
}
}
}
// Usage
const cache = new ResilientCache(redisClient);
async function getUserData(userId) {
return cache.get(
`user:${userId}`,
() => userService.fetch(userId),
{
ttl: 300, // Fresh for 5 minutes
staleWhileRevalidate: 600 // Serve stale for up to 10 more minutes
}
);
}
Why this works:
- Request coalescing: Only one request fetches data for a key, others wait for that result
- Stale-while-revalidate: During outages, serve stale data instead of failing
- Background refresh: Users get instant responses with stale data while fresh data loads
- Graceful degradation: System stays up even when downstream is completely down
This pattern eliminated our cache stampede incidents entirely and kept availability >99.9% even during downstream outages.
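The circuit breaker mentioned in the fetchAndCache comment doesn't need a library either. Here's a rough sketch (state machine simplified, thresholds illustrative) of one you could wrap around the fetch function before handing it to the cache; orderService is assumed for the usage example:
class CircuitBreaker {
  constructor(fn, { failureThreshold = 5, resetTimeout = 30000 } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.failures = 0;
    this.state = 'closed'; // 'closed' | 'open' | 'half-open'
    this.openedAt = 0;
  }

  async exec(...args) {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        // Fail fast instead of hammering a downstream that's already struggling
        throw new Error('Circuit open - failing fast');
      }
      this.state = 'half-open'; // Let a single trial request through
    }
    try {
      const result = await this.fn(...args);
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}

// Usage with the ResilientCache above (orderService is assumed)
const ordersBreaker = new CircuitBreaker(userId => orderService.fetch(userId));

function getOrdersCached(userId) {
  return cache.get(
    `orders:${userId}`,
    () => ordersBreaker.exec(userId),
    { ttl: 60, staleWhileRevalidate: 300 }
  );
}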
Problem #5: Memory Leaks from Forgotten Cleanup
The Subtle Leak
async function processWithTimeout(fn, timeout) {
return Promise.race([
fn(),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), timeout)
)
]);
}
This leaks memory. If fn() settles before the timeout, the timer is never cleared; it keeps its closure alive until it eventually fires. Over thousands of requests, you accumulate thousands of pending timers.
The Proper Pattern
async function processWithTimeout(fn, timeout) {
let timeoutId;
const timeoutPromise = new Promise((_, reject) => {
timeoutId = setTimeout(() => reject(new Error('Timeout')), timeout);
});
try {
return await Promise.race([fn(), timeoutPromise]);
} finally {
clearTimeout(timeoutId); // Always cleanup
}
}
Other common leak sources:
// Event listeners that aren't removed
class RequestHandler {
constructor() {
this.controller = new AbortController();
// BAD: Listener never removed
this.controller.signal.addEventListener('abort', () => {
this.cleanup();
});
}
}
// GOOD: Remove listeners
class RequestHandler {
constructor() {
this.controller = new AbortController();
this.abortHandler = () => this.cleanup();
this.controller.signal.addEventListener('abort', this.abortHandler);
}
destroy() {
this.controller.signal.removeEventListener('abort', this.abortHandler);
}
}
// Axios interceptors that accumulate
// BAD: Adds a new interceptor on every request
app.use((req, res, next) => {
axios.interceptors.request.use(config => {
config.headers['X-Request-ID'] = req.id;
return config;
});
next();
});
// GOOD: Add interceptor once at startup
const requestInterceptor = axios.interceptors.request.use(config => {
// asyncLocalStorage carries per-request context - see the complete request pattern at the end of this post
config.headers['X-Request-ID'] = asyncLocalStorage.getStore()?.requestId;
return config;
});
We found these leaks by running heap snapshots at 1-hour intervals and comparing:
// Memory profiling in staging
if (process.env.NODE_ENV === 'staging') {
const v8 = require('v8');
setInterval(() => {
// writeHeapSnapshot() is synchronous and heavy - keep it behind a staging flag like this
const snapshot = v8.writeHeapSnapshot();
logger.info({ heapSnapshot: snapshot });
}, 3600000); // Every hour
}
Problem #6: Logging at Scale Kills Performance
The Innocent Looking Code
logger.info('Request received', {
userId: req.user.id,
body: req.body,
timestamp: new Date().toISOString()
});
At 1000 RPS:
- JSON.stringify(req.body) runs 1000 times/second
- If body is 50KB, that’s 50MB/s of serialization
- Blocks event loop 1–5ms per call
- Cumulative blocking: 1–5 seconds of event loop time per second
Your logs are causing the latency you’re logging about.
The Solution: Sampling + Async Logging
class SmartLogger {
constructor(options = {}) {
this.sampleRate = options.sampleRate || 0.1; // Log 10% of requests
this.alwaysLogErrors = options.alwaysLogErrors !== false;
this.queue = [];
this.flushInterval = options.flushInterval || 1000;
this.transport = options.transport; // Whatever ships the batch (stream, HTTP collector, etc.)
// Async flush to transport
setInterval(() => this.flush(), this.flushInterval);
}
shouldSample() {
return Math.random() < this.sampleRate;
}
info(message, meta = {}) {
// Sample normal requests
if (!this.shouldSample()) return;
this.queue.push({
level: 'info',
message,
...this.sanitizeMeta(meta),
timestamp: Date.now()
});
}
error(message, meta = {}) {
// Always log errors
this.queue.push({
level: 'error',
message,
...this.sanitizeMeta(meta),
timestamp: Date.now()
});
}
sanitizeMeta(meta) {
// Don't log large objects - log their size instead
return Object.entries(meta).reduce((acc, [key, value]) => {
if (typeof value === 'object' && value !== null) {
// Note: this still pays one JSON.stringify to measure size; use a cheaper heuristic on very hot paths
const size = Buffer.byteLength(JSON.stringify(value));
acc[`${key}Size`] = size;
// Only include full object if small
if (size < 1024) {
acc[key] = value;
}
} else {
acc[key] = value;
}
return acc;
}, {});
}
async flush() {
if (this.queue.length === 0) return;
const batch = this.queue.splice(0, this.queue.length);
// Async write - doesn't block event loop
setImmediate(() => {
// Send to logging service
this.transport.write(batch).catch(err => {
console.error('Failed to flush logs:', err);
});
});
}
}
// Usage with adaptive sampling
class AdaptiveLogger extends SmartLogger {
constructor(options) {
super(options);
this.errorRate = 0;
this.lastMinuteErrors = [];
}
error(message, meta) {
super.error(message, meta);
// Track error rate
this.lastMinuteErrors.push(Date.now());
this.lastMinuteErrors = this.lastMinuteErrors.filter(
t => t > Date.now() - 60000
);
this.errorRate = this.lastMinuteErrors.length / 60;
// Increase sampling when error rate is high
if (this.errorRate > 10) {
this.sampleRate = 1.0; // Log everything
} else if (this.errorRate > 5) {
this.sampleRate = 0.5; // Log 50%
} else {
this.sampleRate = 0.1; // Log 10%
}
}
}
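A usage sketch, assuming the transport interface implied by flush() above (an object with an async write(batch) method); the stdout transport is just a placeholder for whatever log shipper you use:
const logger = new AdaptiveLogger({
  sampleRate: 0.1,
  flushInterval: 1000,
  transport: {
    // Ship each batch wherever your logs go (HTTP collector, Kafka, stdout, ...)
    async write(batch) {
      process.stdout.write(batch.map(entry => JSON.stringify(entry)).join('\n') + '\n');
    }
  }
});

logger.info('request_completed', { userId: 42, latency: 87 });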
This reduced our logging overhead from 15% CPU to <2% CPU while maintaining debuggability.
The Complete BFF Request Pattern
const { AsyncLocalStorage } = require('async_hooks');
const asyncLocalStorage = new AsyncLocalStorage();
class BFFHandler {
constructor() {
this.cache = new ResilientCache(redis);
this.logger = new AdaptiveLogger();
this.workerPool = new WorkerPool('./worker.js');
}
async handleRequest(req, res) {
const requestId = req.headers['x-request-id'] || generateId();
const startTime = Date.now();
// Store request context for access anywhere
return asyncLocalStorage.run({ requestId }, async () => {
try {
const userId = req.user.id;
// Phase 1: Critical data (must succeed)
const critical = await this.getCriticalData(userId);
// Phase 2: Important data (degrade gracefully)
const important = await this.getImportantData(userId);
// Phase 3: Optional data (fire and forget)
this.startOptionalEnrichments(userId, requestId);
// Phase 4: Transform (offload if needed)
const response = await this.transformResponse(critical, important);
// Log success
this.logger.info('request_completed', {
userId,
latency: Date.now() - startTime,
path: req.path
});
res.json(response);
} catch (error) {
this.logger.error('request_failed', {
error: error.message,
stack: error.stack,
userId: req.user?.id,
latency: Date.now() - startTime
});
res.status(500).json({ error: 'Internal server error' });
}
});
}
async getCriticalData(userId) {
const timeout = 800;
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeout);
try {
return await Promise.all([
serviceClient.request('/auth', { userId, signal: controller.signal }),
serviceClient.request('/profile', { userId, signal: controller.signal })
]);
} finally {
clearTimeout(timeoutId);
}
}
async getImportantData(userId) {
const results = await Promise.allSettled([
this.cache.get(`orders:${userId}`, () =>
serviceClient.request('/orders', { userId }),
{ ttl: 60, staleWhileRevalidate: 300 }
),
this.cache.get(`payment:${userId}`, () =>
serviceClient.request('/payment', { userId }),
{ ttl: 30, staleWhileRevalidate: 120 }
)
]);
return results.map(r => r.status === 'fulfilled' ? r.value : null);
}
startOptionalEnrichments(userId, requestId) {
Promise.allSettled([
serviceClient.request('/recommendations', { userId }),
serviceClient.request('/analytics', { userId })
]).then(results => {
// Send via WebSocket or cache for next request
this.notifyClient(requestId, results);
});
}
async transformResponse(critical, important) {
// critical and important are arrays (from Promise.all / allSettled), so name their elements explicitly
const [authContext, profile] = critical;
const [orders, payment] = important;
const combined = { authContext, profile, orders, payment };
// Check if transformation is CPU-intensive
const dataSize = Buffer.byteLength(JSON.stringify(combined));
if (dataSize > 100000) {
// Offload to worker thread
return this.workerPool.execute('transform', combined);
}
// Simple transformation on main thread
return this.transform(combined);
}
}
Key Takeaways
- Promise.all() vs Promise.allSettled() vs fire-and-forget — Know when to use each based on criticality
- Timeouts need AbortController — Otherwise you’re not actually cancelling requests
- Event loop lag is invisible — Instrument it explicitly or you’ll never know what’s blocking
- Cache stampedes amplify outages — Use request coalescing + stale-while-revalidate
- Memory leaks are cumulative — Always cleanup timers, listeners, and interceptors
- Logging at scale requires sampling — Adaptive sampling based on error rate works well
Conclusion
These aren’t beginner problems. They’re the issues that emerge when “basic” Node.js patterns meet production traffic.
The difference between a BFF that handles 100 RPS and one that handles 10,000 RPS isn't some magical framework; it's understanding these failure modes and designing around them.
Most importantly: These patterns compound. Each one individually gives modest improvements. Together, they’re the difference between a system that falls over and one that stays up.