All posts

I Replaced Google Translate API with LLM Translation — Here's What Changed

A walkthrough of migrating a production app from Google Cloud Translation to LLM-based translation, covering quality improvements, cost changes, and implementation details.

Six months ago, our SaaS product used Google Cloud Translation v2 for everything. We translated UI strings, email templates, help docs, and user-generated content across 12 languages. The bill was around $400/month, and the translations were... acceptable.

Then we started getting support tickets. Spanish-speaking users pointed out that our button text sounded robotic. German users said the formal/informal tone was inconsistent. Japanese translations occasionally broke layout because Google didn't account for string length differences.

We decided to try LLM-based translation. Here's what happened.

The Setup We Were Replacing

Our Google Translate integration was standard:

import { Translate } from "@google-cloud/translate/build/src/v2";

const translate = new Translate({ projectId: "our-project" });

async function translateString(text: string, targetLang: string) {
  const [result] = await translate.translate(text, targetLang);
  return result;
}

No caching. No glossary. No context. Just fire text at Google and get a translation back. We called this for every string on every deploy, which meant we were paying to re-translate unchanged strings every time.

Our monthly usage was around 15-20 million characters. At $20 per million characters, that's $300-400/month just for translation API calls.

What We Tried First: GPT-4 Directly

The first thing we tried was calling OpenAI's GPT-4 API with a translation prompt:

const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    {
      role: "system",
      content: `Translate the following text to ${targetLang}.
        This is a UI string for a project management app.
        Use informal tone. Keep it concise.`,
    },
    { role: "user", content: text },
  ],
  temperature: 0.1,
});

The results were noticeably better. The German translations used the correct informal "du" form consistently. The Spanish text sounded like something a person would write. Japanese translations were more concise.

But there were problems:

  • Cost. GPT-4 is priced per token, and translating 15M characters/month came to around $800, roughly double our Google Translate bill.
  • Latency. 800ms-2s per translation vs Google's 100-200ms.
  • Consistency. LLMs are non-deterministic. The same input could produce slightly different translations across runs, which made our string diffs noisy.
  • No batch support. We had to send strings one at a time (or hack together fragile batch prompts).
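For the curious, the batch hack looked roughly like this. It's a reconstruction with hypothetical helper names, and the numbered-line protocol is exactly the fragile part: the model sometimes merges lines, drops a number, or translates the numbers themselves.

```typescript
// Pack many strings into one prompt, then split the reply back apart.
function buildBatchPrompt(strings: string[], targetLang: string): string {
  const numbered = strings.map((s, i) => `${i + 1}. ${s}`).join("\n");
  return (
    `Translate each numbered line to ${targetLang}. ` +
    `Reply with the same numbering, one translation per line.\n\n${numbered}`
  );
}

function parseBatchReply(reply: string, expected: number): string[] {
  const translations = reply
    .split("\n")
    .map((line) => line.match(/^\d+\.\s*(.*)$/)?.[1])
    .filter((t): t is string => t !== undefined);
  if (translations.length !== expected) {
    // This mismatch happened often enough that we gave up on the approach.
    throw new Error(`expected ${expected} translations, got ${translations.length}`);
  }
  return translations;
}
```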

The Migration to a Managed Service

Rather than build all the infrastructure ourselves (caching, batching, consistency, rate limiting), we switched to auto18n. The migration took about two hours.

The API call changed from this:

const [result] = await translate.translate(text, targetLang);

To this:

const response = await fetch("https://api.auto18n.com/translate", {
  method: "POST",
  headers: {
    Authorization: "Bearer " + API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    text,
    to: targetLang,
    context: "UI string for a project management app, informal tone",
  }),
});
const { translation } = await response.json();

The key difference: the context parameter. Instead of translating blind, the model knows what kind of text it's working with. This fixed our formal/informal tone inconsistency in German and Japanese.

What Actually Changed

Translation Quality

The improvement was most visible in three areas:

Tone consistency. We specified "informal, friendly" in our context, and every translation maintained that tone. Google Translate would randomly switch between formal and informal in German ("Sie" vs "du") depending on sentence structure.

UI-aware brevity. When we flagged strings as button text, the translations stayed short. Google Translate would sometimes turn a 2-word English button label into a 5-word German phrase that broke the layout.

Idiomatic expressions. "Get started" translated to natural equivalents in each language instead of literal word-for-word translations.

Cost

This surprised us. Our bill went _down_ despite using a more expensive underlying model. The reason: caching.

auto18n caches translations automatically. If you translate "Save changes" to Spanish once, every subsequent request for the same string returns the cached result at no cost. Since roughly 70% of our translation requests were for strings we'd already translated, our effective cost dropped significantly.

Our monthly translation spend went from ~$400 to ~$150.
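We don't know auto18n's internals, but the effect is what you'd get from a content-addressed cache: hash everything that affects the output, and only call the model on a miss. A minimal sketch (the function names and Map-backed store are illustrative, not their API):

```typescript
import { createHash } from "crypto";

// Key on everything that can change the translation: text, target
// language, and any context string.
function cacheKey(text: string, to: string, context = ""): string {
  return createHash("sha256")
    .update(JSON.stringify({ text, to, context }))
    .digest("hex");
}

const cache = new Map<string, string>();

async function translateCached(
  text: string,
  to: string,
  translate: (text: string, to: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(text, to);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // no model call, effectively free
  const result = await translate(text, to);
  cache.set(key, result);
  return result;
}
```

With a ~70% hit rate, only three in ten requests ever reach the model, which is where the cost drop comes from.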

Latency

First-time translations take about 500ms (LLM inference). Cached translations return in under 50ms. Since most of our translations hit the cache, the average latency actually improved compared to Google Translate's consistent 150ms.

Developer Experience

The biggest quality-of-life improvement was not having to manage GCP credentials. We went from a JSON key file, environment variables, and service account permissions to a single API key.

The Gotchas

It wasn't all smooth. A few things caught us off guard:

Rare language pairs. We had a few users requesting Khmer and Burmese translations. LLM-based translation is weaker for low-resource languages compared to Google, which has spent years optimizing for those pairs. We kept Google Translate as a fallback for languages where the LLM quality wasn't good enough.
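Our fallback routing was only a few lines. This is a simplified sketch with hypothetical names; the fallback list is ours, and which languages belong on it is worth measuring against your own content:

```typescript
// Languages where we measured LLM quality below Google's.
const GOOGLE_FALLBACK = new Set(["km", "my"]); // Khmer, Burmese

async function translateRouted(
  text: string,
  to: string,
  viaLlm: (text: string, to: string) => Promise<string>,
  viaGoogle: (text: string, to: string) => Promise<string>,
): Promise<string> {
  // Low-resource languages go to Google Translate, everything else to the LLM.
  return GOOGLE_FALLBACK.has(to) ? viaGoogle(text, to) : viaLlm(text, to);
}
```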

Non-determinism. Even with temperature set to 0, LLMs can produce slightly different outputs. For our use case (pre-generated i18n files), this meant we needed to only translate _new_ strings rather than re-translating everything on each deploy. This was actually a better practice anyway — it just forced us to implement proper delta detection.

HTML content. Google Translate handles HTML natively. With LLM translation, you need to either strip tags first or use a service that handles markup preservation. auto18n handles this, but if you're rolling your own LLM translation pipeline, it's a real headache.
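If you are rolling your own, the usual trick is to swap tags for opaque placeholders before translating and restore them afterwards. A minimal sketch with hypothetical helpers; real markup (nesting, attributes the model "helpfully" rewrites, self-closing tags) needs more care:

```typescript
// Replace each HTML tag with an opaque placeholder the model is
// unlikely to touch, remembering the original tags by index.
function protectTags(html: string): { text: string; tags: string[] } {
  const tags: string[] = [];
  const text = html.replace(/<[^>]+>/g, (tag) => {
    tags.push(tag);
    return `⟦${tags.length - 1}⟧`;
  });
  return { text, tags };
}

// Put the original tags back after translation.
function restoreTags(text: string, tags: string[]): string {
  return text.replace(/⟦(\d+)⟧/g, (_, i) => tags[Number(i)]);
}
```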

The Code: Before and After

Before (Google Translate, no caching):

// translate-strings.ts
import { Translate } from "@google-cloud/translate/build/src/v2";
import { readFileSync, writeFileSync } from "fs";

const translate = new Translate();
const LANGS = ["es", "de", "fr", "ja", "ko", "pt", "zh"];

async function run() {
  const en = JSON.parse(readFileSync("locales/en.json", "utf-8"));

  for (const lang of LANGS) {
    const translated: Record<string, string> = {};
    for (const [key, value] of Object.entries(en)) {
      const [result] = await translate.translate(value as string, lang);
      translated[key] = result;
    }
    writeFileSync(`locales/${lang}.json`, JSON.stringify(translated, null, 2));
  }
}

After (auto18n with delta detection):

// translate-strings.ts
import { readFileSync, writeFileSync, existsSync } from "fs";

const API_KEY = process.env.AUTO18N_API_KEY;
const LANGS = ["es", "de", "fr", "ja", "ko", "pt", "zh"];

async function run() {
  const en = JSON.parse(readFileSync("locales/en.json", "utf-8"));

  for (const lang of LANGS) {
    const path = `locales/${lang}.json`;
    const existing = existsSync(path)
      ? JSON.parse(readFileSync(path, "utf-8"))
      : {};

    // Only translate new or changed keys. We keep a copy of each source
    // string under `__src_<key>` so changed English strings are detected.
    const toTranslate = Object.entries(en).filter(
      ([key, value]) => !existing[key] || existing[`__src_${key}`] !== value,
    );

    for (const [key, value] of toTranslate) {
      const res = await fetch("https://api.auto18n.com/translate", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ text: value, to: lang }),
      });
      const { translation } = await res.json();
      existing[key] = translation;
      existing[`__src_${key}`] = value;
    }

    writeFileSync(path, JSON.stringify(existing, null, 2));
  }
}

Should You Migrate?

If your translations are "fine" and nobody's complaining, Google Translate API is a perfectly reasonable choice. It's reliable, fast, and well-documented.

But if you're seeing quality issues — inconsistent tone, overly literal translations, UI strings that don't fit — it's worth testing LLM-based translation on your actual content. The quality difference on European languages is hard to ignore.

Run a side-by-side comparison on 50 of your real strings. You'll know within 30 minutes whether it's worth the switch.
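A minimal harness for that comparison might look like this. The `compare` function is hypothetical; plug in your own clients for the two backends and paste the TSV output into a spreadsheet:

```typescript
// Run the same strings through both backends and collect rows you can
// eyeball side by side.
async function compare(
  strings: string[],
  lang: string,
  translateA: (s: string, lang: string) => Promise<string>,
  translateB: (s: string, lang: string) => Promise<string>,
): Promise<string[][]> {
  const rows: string[][] = [["source", "backend A", "backend B"]];
  for (const s of strings) {
    rows.push([s, await translateA(s, lang), await translateB(s, lang)]);
  }
  return rows;
}

// Usage: console.log(rows.map((r) => r.join("\t")).join("\n"));
```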