Engineering • 8 min read

Optimizing Image Generation Pipelines at Scale

Learn how we reduced latency by 40% using edge caching and predictive pre-generation strategies for our high-throughput API endpoints.

Alex Chen
Senior Infrastructure Engineer • Oct 24, 2023
[Hero image: abstract technical graphic of network nodes, overlaid with a terminal readout: $ latency --check → 45ms (optimized)]

When we first launched Banatie's image generation API, we optimized for quality. But as our user base grew, so did the demand for speed. Here is how we tackled the challenge of delivering AI-generated assets in milliseconds.

The Latency Bottleneck

Our initial architecture was straightforward: a request hits our API gateway, waits in a queue, gets processed by a GPU worker, and the resulting image is uploaded to S3. Simple, but slow.
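To make the hops concrete, here is a rough sketch of that original path; every helper below is an illustrative stub, not our production code:

pipeline-sketch.ts
type Image = Uint8Array;

// Illustrative stubs for each hop in the original pipeline
async function enqueueJob(prompt: string): Promise<string> {
  return "job-123"; // push onto the work queue, receive a job id
}
async function runGpuWorker(jobId: string): Promise<Image> {
  return new Uint8Array(); // GPU inference produces the image bytes
}
async function uploadToS3(image: Image): Promise<string> {
  return "https://example-bucket.s3.amazonaws.com/result.png"; // durable storage
}

// Every request pays for all three hops, strictly in sequence
export async function handleGeneration(prompt: string): Promise<string> {
  const jobId = await enqueueJob(prompt);
  const image = await runGpuWorker(jobId);
  return uploadToS3(image);
}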

Users integrating our API into real-time applications needed faster response times. We identified two main areas for improvement:

  • Cold Starts: Spinning up new GPU instances took 2-3 minutes.
  • Network Overhead: Round trips between the inference server and storage added 200ms+.
Pro Tip: Analyze your P99

Don't just look at average latency. Your P99 (99th percentile) latency, the time under which 99% of requests complete, reflects what your users experience in the worst cases. Optimizing for P99 often yields the most stable system.
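To see why, here is a minimal sketch of pulling P99 out of raw request timings (the sample data below is made up):

percentile.ts
// Nearest-rank percentile over a set of latency samples, in ms
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [42, 45, 44, 48, 51, 47, 43, 390, 46, 44]; // one slow outlier
console.log(`avg: ${latencies.reduce((a, b) => a + b, 0) / latencies.length}ms`); // 80ms
console.log(`p99: ${percentile(latencies, 99)}ms`); // 390ms: the outlier dominates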

Implementing Edge Caching

To solve the network overhead, we moved our delivery layer to the edge. By utilizing a global CDN, we could serve cached results instantly for repeated prompts.

middleware/cache-control.ts
import type { Response } from "express"; // assuming an Express-style response object

export function setCacheHeaders(res: Response) {
  // Cache for 1 hour at the edge; serve stale for up to a day while revalidating in the background
  res.setHeader(
    "Cache-Control",
    "public, s-maxage=3600, stale-while-revalidate=86400"
  );
  // Custom tag for purging (the header name varies by CDN; Cache-Tag is common)
  res.setHeader("Cache-Tag", "image-gen");
}
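The stale-while-revalidate directive lets the edge keep serving a cached image while it refreshes the entry in the background, so repeat requests for a popular prompt never wait on the origin. The exact values above, and the purge-tag header, will depend on your CDN.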

The Results

After deploying these changes, we saw a dramatic drop in TTFB (Time To First Byte).

"The latency improvements were immediate. Our dashboard loads felt instantaneous compared to the previous version, directly impacting our user retention metrics."
[Chart: latency reduction over a 24-hour period post-deployment, comparing before and after optimization]

Predictive Pre-Generation

For our enterprise clients, we introduced predictive generation. By analyzing usage patterns, we can pre-warm the cache with variations of commonly requested assets before the user even asks for them.

This is particularly useful for e-commerce clients who update their catalogs at predictable times.
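In spirit, the pre-warm job looks something like the sketch below; the endpoint, types, and top-50 cutoff are all illustrative rather than our actual service:

prewarm-sketch.ts
interface UsageRecord {
  prompt: string;
  count: number;
}

// Pick the most frequently requested prompts from recent usage logs
function topPrompts(records: UsageRecord[], n: number): string[] {
  return [...records]
    .sort((a, b) => b.count - a.count)
    .slice(0, n)
    .map((r) => r.prompt);
}

// Generate ahead of demand so the cache is warm before peak traffic
export async function preWarm(records: UsageRecord[]): Promise<void> {
  for (const prompt of topPrompts(records, 50)) {
    await fetch("https://api.example.com/v1/images", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });
  }
}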

Conclusion

Optimization is never finished. We are currently exploring WebAssembly for client-side resizing to further offload our servers. Stay tuned for Part 2!