2026-04-14

Handling Translation API Rate Limits

Practical patterns for dealing with translation API rate limits: exponential backoff, token buckets, circuit breakers, request queuing, and fallback chains.

You've built a translation pipeline, it works great in development, and then you deploy it and immediately start getting 429 errors. Every translation API has rate limits, and most are more restrictive than you'd expect. Here's how to handle them properly.

Know your limits

Common rate limits for translation APIs (as of 2026):

| API | Requests/second | Characters/second | Daily limit |
| ------------------------ | ------------------ | ----------------- | --------------- |
| Google Cloud Translation | 600 req/s | 600K chars/s | Varies by quota |
| Amazon Translate | 40 req/s (default) | 40K chars/s | Configurable |
| DeepL API Pro | 50 req/s | n/a | Varies by plan |
| Microsoft Translator | 100 req/s | 50K chars/s | 2M chars/hour |
| OpenAI (for translation) | Varies by tier | Token-based | Token-based |

These are default limits. Many services let you request increases, but approval can take days to weeks.

The gotcha: even if your average request rate is well under the limit, bursts can trigger throttling. A deployment that translates 10,000 strings at once will blow past any per-second limit.
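To put rough numbers on that, here is a small sketch that splits a burst into per-second slices. The `planBurst` name and the 30 req/s budget are illustrative assumptions, not figures from any provider.

```typescript
// Split a burst of `total` requests into per-second slices of at most
// `perSecond` requests, so no single second exceeds the budget.
function planBurst(total: number, perSecond: number): number[] {
  if (perSecond <= 0) throw new Error("perSecond must be positive");
  const slices: number[] = [];
  for (let remaining = total; remaining > 0; remaining -= perSecond) {
    slices.push(Math.min(perSecond, remaining));
  }
  return slices;
}

const burstPlan = planBurst(10000, 30);
console.log(`${burstPlan.length} seconds at 30 req/s`); // 334 seconds
```

Sent all at once, those 10,000 requests land in a single second. Paced at 30 req/s, the same job takes about five and a half minutes and never trips the limit.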

Exponential backoff: the minimum viable approach

When you get a 429 (Too Many Requests) or 503 (Service Unavailable), retry with increasing delays:

async function translateWithBackoff(
  text: string,
  targetLang: string,
  maxRetries: number = 5,
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await translationAPI.translate(text, targetLang);
      return result;
    } catch (error: any) {
      if (error.status === 429 || error.status === 503) {
        // Final attempt failed: give up now instead of sleeping first
        if (attempt === maxRetries - 1) break;
        const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
        // Despite the name, `jitter` is the actual wait: a random 50-100% of `delay`
        const jitter = delay * (0.5 + Math.random() * 0.5);
        console.log(`Rate limited. Retrying in ${Math.round(jitter)}ms...`);
        await new Promise((resolve) => setTimeout(resolve, jitter));
      } else {
        throw error; // Don't retry non-rate-limit errors
      }
    }
  }
  throw new Error("Max retries exceeded");
}

Key details:

Jitter is essential. Without jitter, all your retrying requests hit the API at the same time (the "thundering herd" problem). Random jitter spreads them out.
Cap the delay. Math.min(..., 30000) prevents absurd wait times if the API is down for extended periods.
Check for the Retry-After header. Some APIs tell you exactly how long to wait:

const retryAfter = Number(error.headers?.["retry-after"]);
// Retry-After can also be an HTTP date; this handles only the seconds form
if (Number.isFinite(retryAfter) && retryAfter > 0) {
  await new Promise((resolve) => setTimeout(resolve, retryAfter * 1000));
}
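The three details above can be folded into one helper that decides how long to wait. This is a sketch, not any client library's API: `parseRetryAfter` and `nextDelayMs` are hypothetical names, and the clock and random source are passed in only to make the behavior easy to check.

```typescript
// Parse a Retry-After value into milliseconds. The header is either a number
// of seconds or an HTTP date; returns null when it is missing or neither.
function parseRetryAfter(value: string | undefined, nowMs: number): number | null {
  if (!value) return null;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Pick the wait before the next retry: the server's Retry-After if usable,
// otherwise capped exponential backoff with jitter in [50%, 100%] of the delay.
function nextDelayMs(
  attempt: number,
  retryAfter: string | undefined,
  nowMs: number = Date.now(),
  random: () => number = Math.random,
): number {
  const fromServer = parseRetryAfter(retryAfter, nowMs);
  if (fromServer !== null) return fromServer;
  const capped = Math.min(1000 * Math.pow(2, attempt), 30000);
  return capped * (0.5 + random() * 0.5);
}
```

Inside the retry loop this would replace the hard-coded delay, for example `await sleep(nextDelayMs(attempt, error.headers?.["retry-after"]))`, where `sleep` and the shape of `error.headers` depend on your HTTP client.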

Client-side rate limiting: prevent 429s entirely

Exponential backoff is reactive — you hit the limit, then slow down. A proactive approach is to rate-limit your own requests before they hit the API.

Token bucket implementation:

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,
    private refillRate: number, // tokens per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate,
    );
    this.lastRefill = now;
  }

  async acquire(count: number = 1): Promise<void> {
    this.refill();

    // Reserve immediately, letting the balance go negative. Concurrent callers
    // then each wait for their own place in line instead of all waking together.
    this.tokens -= count;
    if (this.tokens >= 0) return;

    // Wait until the refill has paid off the debt up to and including this caller
    const waitTime = (-this.tokens / this.refillRate) * 1000;
    await new Promise((resolve) => setTimeout(resolve, waitTime));
  }
}

// Usage: limit to 30 requests per second (leaving headroom below the 50 req/s API limit)
const bucket = new TokenBucket(30, 30);

async function translate(text: string, targetLang: string): Promise<string> {
  await bucket.acquire();
  return translationAPI.translate(text, targetLang);
}

Set your client-side limit to 60-80% of the API's actual limit. This gives you headroom for other clients (if you have multiple services hitting the same API key) and prevents hitting the edge of the limit where you'd start getting intermittent 429s.
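When several instances share one API key, the headroom has to be divided between them. A minimal sketch of that arithmetic follows; `perInstanceLimit` is a hypothetical helper, and splitting the budget evenly is an assumption that only holds if traffic is spread roughly evenly.

```typescript
// Requests per second one instance may send, given the API's limit, the share
// of that limit you are willing to use, and how many instances share the key.
function perInstanceLimit(
  apiLimit: number,
  headroomPercent: number,
  instances: number,
): number {
  if (headroomPercent <= 0 || headroomPercent > 100) {
    throw new Error("headroomPercent must be in (0, 100]");
  }
  if (instances < 1) throw new Error("instances must be at least 1");
  // Integer percent keeps the arithmetic exact; never return less than 1
  return Math.max(1, Math.floor((apiLimit * headroomPercent) / (100 * instances)));
}

console.log(perInstanceLimit(50, 60, 1)); // 30
console.log(perInstanceLimit(50, 60, 3)); // 10
```

The result is the number to pass to the token bucket on each instance, both as its capacity and as its refill rate.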

Request queuing for batch jobs

When you need to translate thousands of strings (a documentation build, a new language rollout), you want a queue that processes at a controlled rate:

class TranslationQueue {
  private queue: Array<{
    text: string;
    targetLang: string;
    resolve: (value: string) => void;
    reject: (error: Error) => void;
  }> = [];
  private processing = false;
  private bucket: TokenBucket;

  constructor(requestsPerSecond: number) {
    this.bucket = new TokenBucket(requestsPerSecond, requestsPerSecond);
  }

  async translate(text: string, targetLang: string): Promise<string> {
    return new Promise((resolve, reject) => {
      this.queue.push({ text, targetLang, resolve, reject });
      this.process();
    });
  }

  private async process() {
    if (this.processing) return;
    this.processing = true;

    while (this.queue.length > 0) {
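      // Process in batches where the API supports it. A batch call takes a single
      // target language, so only group items that share the first item's language.
      const lang = this.queue[0].targetLang;
      const batch = this.queue.filter((b) => b.targetLang === lang).slice(0, 50);
      this.queue = this.queue.filter((b) => batch.indexOf(b) === -1);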
      await this.bucket.acquire(batch.length);

      try {
        const results = await translationAPI.translateBatch(
          batch.map((b) => b.text),
          batch[0].targetLang,
        );
        batch.forEach((item, i) => item.resolve(results[i]));
      } catch (error) {
        batch.forEach((item) => item.reject(error as Error));
      }
    }

    this.processing = false;
  }
}

This pattern lets you call queue.translate() from anywhere in your code and it handles batching, rate limiting, and ordering automatically.
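Batch endpoints generally take one target language per call and cap the size of a single request. Here is a sketch of a planner that groups pending items accordingly. The `makeBatches` name and both caps are assumptions; check your provider's documented per-request limits, and note that `text.length` counts UTF-16 code units, which may not match how the API counts characters.

```typescript
interface PendingItem {
  text: string;
  targetLang: string;
}

// Group items into batches that share one target language and stay within
// an item cap and a character cap per request.
function makeBatches<T extends PendingItem>(
  items: T[],
  maxItems: number,
  maxChars: number,
): T[][] {
  const byLang = new Map<string, T[]>();
  for (const item of items) {
    const group = byLang.get(item.targetLang);
    if (group) group.push(item);
    else byLang.set(item.targetLang, [item]);
  }

  const batches: T[][] = [];
  byLang.forEach((group) => {
    let current: T[] = [];
    let chars = 0;
    for (const item of group) {
      const full =
        current.length >= maxItems || chars + item.text.length > maxChars;
      if (full && current.length > 0) {
        batches.push(current);
        current = [];
        chars = 0;
      }
      // An item longer than maxChars still goes out, alone in its own batch
      current.push(item);
      chars += item.text.length;
    }
    if (current.length > 0) batches.push(current);
  });
  return batches;
}
```

Each returned batch maps to one API request, so it would cost one token from a request-based bucket.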

Circuit breakers: stop hammering a dead API

If the translation API is genuinely down (not just rate-limiting), continuing to send requests wastes time and resources. A circuit breaker stops trying after repeated failures:

class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  constructor(
    private threshold: number = 5, // failures before opening
    private resetTimeout: number = 60000, // ms before trying again
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker is open — API unavailable");
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = "open";
    }
  }
}

When the circuit opens, you can fall back to cached translations, a secondary API, or showing untranslated content — all better than hanging on a dead request.
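The breaker's transitions are easier to follow, and to test, with the clock passed in explicitly. The following is a variant sketch of the same state machine, not a drop-in replacement; `ClockedBreaker` and its method names are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class ClockedBreaker {
  state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly resetTimeoutMs: number;

  constructor(threshold = 5, resetTimeoutMs = 60000) {
    this.threshold = threshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // Returns true if a request may be attempted at time `nowMs`.
  allow(nowMs: number): boolean {
    if (this.state === "open" && nowMs - this.openedAt > this.resetTimeoutMs) {
      this.state = "half-open"; // timeout elapsed: let probe requests through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(nowMs: number): void {
    this.failures++;
    // A failed probe in half-open reopens the circuit immediately
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = nowMs;
    }
  }
}

const breaker = new ClockedBreaker(2, 1000);
breaker.recordFailure(0);
breaker.recordFailure(10); // second failure opens the circuit
console.log(breaker.allow(500)); // false: still inside the reset timeout
console.log(breaker.allow(1500)); // true: half-open, a probe may go through
```

A caller would check `allow(Date.now())` before each request and go straight to the cache or the untranslated text when it returns false.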

Fallback chains

Don't depend on a single translation provider. Set up a fallback chain:

const providers = [
  { name: "primary", client: primaryAPI, priority: 1 },
  { name: "secondary", client: secondaryAPI, priority: 2 },
  { name: "cache-only", client: cacheClient, priority: 3 },
];

async function translateWithFallback(
  text: string,
  targetLang: string,
): Promise<{ translation: string; provider: string }> {
  for (const provider of providers) {
    try {
      const translation = await provider.client.translate(text, targetLang);
      return { translation, provider: provider.name };
    } catch (error) {
      // Log and fall through to the next provider in the list
      console.warn(`${provider.name} failed: ${(error as Error).message}`);
    }
  }

  // Last resort: return the original text
  return { translation: text, provider: "none" };
}

The cache layer as a fallback is important. If both APIs are down, showing a previously cached translation is better than showing nothing or raw English to a Japanese user.
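One detail the chain depends on: every provider has to signal failure by throwing, and that includes the cache on a miss. Otherwise the loop would return an empty result instead of moving on. The post doesn't show `cacheClient`, so the following is a hypothetical shape for it, backed by a plain Map with an assumed `lang:text` key format.

```typescript
interface TranslationProvider {
  name: string;
  translate(text: string, targetLang: string): Promise<string>;
}

// Hypothetical cache-backed provider: it never calls an API, and it throws on
// a miss so that a fallback loop moves on to its next option.
function cacheOnlyProvider(store: Map<string, string>): TranslationProvider {
  return {
    name: "cache-only",
    async translate(text: string, targetLang: string): Promise<string> {
      const hit = store.get(`${targetLang}:${text}`);
      if (hit === undefined) throw new Error("cache miss");
      return hit;
    },
  };
}
```

With this shape, a miss in the last provider drops through to the last-resort branch and the caller gets the original text back.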

Caching to reduce API calls

The most effective rate-limit mitigation is simply making fewer API calls. Cache aggressively:

import { createHash } from "crypto";

class TranslationCache {
  private cache: Map<string, { translation: string; timestamp: number }>;
  private ttl: number;

  constructor(ttlSeconds: number = 86400 * 30) {
    // 30 days default
    this.cache = new Map();
    this.ttl = ttlSeconds * 1000;
  }

  private key(text: string, targetLang: string): string {
    return createHash("sha256").update(`${targetLang}:${text}`).digest("hex");
  }

  get(text: string, targetLang: string): string | null {
    const k = this.key(text, targetLang);
    const entry = this.cache.get(k);
    if (!entry) return null;
    if (Date.now() - entry.timestamp > this.ttl) {
      this.cache.delete(k);
      return null;
    }
    return entry.translation;
  }

  set(text: string, targetLang: string, translation: string): void {
    const k = this.key(text, targetLang);
    this.cache.set(k, { translation, timestamp: Date.now() });
  }
}

For production, use Redis or a similar persistent cache instead of an in-memory Map, which is lost on restart and not shared between instances. Translation outputs are stable: for a given model version, the same input produces the same output (for NMT) or very similar output (for LLM-based translation with low temperature). A 30-day TTL is reasonable; translations don't go stale the way other API responses might.
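Caching helps across requests, but a single batch often repeats the same string many times ("Save", "Cancel", "OK"). Here is a sketch of deduplicating before the call and expanding the results afterwards; `dedupe` is a hypothetical helper, and it assumes the API returns results in the same order as the inputs.

```typescript
// Collapse duplicate strings so each unique string is translated once, and
// return a function that expands the unique results back to the input order.
function dedupe(texts: string[]): {
  unique: string[];
  expand: (translated: string[]) => string[];
} {
  const unique: string[] = [];
  const indexOf = new Map<string, number>();
  const positions = texts.map((text) => {
    let index = indexOf.get(text);
    if (index === undefined) {
      index = unique.length;
      unique.push(text);
      indexOf.set(text, index);
    }
    return index;
  });
  return {
    unique,
    expand: (translated: string[]) => positions.map((i) => translated[i]),
  };
}

const { unique, expand } = dedupe(["Save", "Cancel", "Save", "OK", "Cancel"]);
console.log(unique); // ["Save", "Cancel", "OK"]
console.log(expand(["保存", "キャンセル", "OK"]));
// ["保存", "キャンセル", "保存", "OK", "キャンセル"]
```

Only `unique` goes to the API, so five strings cost three translations here.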

Putting it all together

A production-grade translation client combines all these patterns:

const cache = new TranslationCache();
const rateLimiter = new TokenBucket(30, 30);
const circuitBreaker = new CircuitBreaker(5, 60000);

async function translate(text: string, targetLang: string): Promise<string> {
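  // 1. Check cache
  const cached = cache.get(text, targetLang);
  if (cached !== null) return cached;

  // 2. Rate limit
  await rateLimiter.acquire();

  // 3. Circuit breaker + API call. translateWithFallback swallows provider
  // errors, so report total failure as an error here; otherwise the breaker
  // never sees a failure and never opens.
  const translation = await circuitBreaker.execute(async () => {
    const result = await translateWithFallback(text, targetLang);
    if (result.provider === "none") throw new Error("All providers failed");
    return result;
  });

  // 4. Cache result (untranslated fallback text never reaches this point)
  cache.set(text, targetLang, translation.translation);

  return translation.translation;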
}

The order matters: the cache check comes first (zero API calls for repeated strings), rate limiting second (prevents 429s), the circuit breaker third (fast-fails when the API is down), and the fallback chain runs inside the circuit breaker (tries alternatives before giving up).

This is essentially what auto18n handles server-side — caching, rate management, and provider fallback. But if you're building directly against translation APIs, these patterns are the difference between a pipeline that works in a demo and one that works in production.