LLM Translation Is Eating Machine Translation
How LLM-based translation compares to traditional NMT on WMT benchmarks, real-world A/B tests, and which language pairs still favor conventional models.
For a decade, neural machine translation (NMT) was the state of the art. Google Translate, DeepL, and Amazon Translate all run encoder-decoder transformer models trained specifically on parallel corpora. They're fast, cheap, and good enough for most use cases.
Then GPT-4 showed up and started outperforming them on translation benchmarks nobody expected it to win. The translation industry is still processing what this means.
The benchmark picture
The WMT (Workshop on Machine Translation) shared tasks have been the standard evaluation for MT systems since 2006. In WMT23 and WMT24, LLM-based systems started placing at or near the top for high-resource language pairs.
On English-German, GPT-4 class models score within 1-2 COMET points of the best dedicated NMT systems. On English-Chinese, they're roughly tied. On English-Japanese, LLMs actually pull ahead, likely because Japanese requires more contextual understanding — sentence-level NMT misses discourse-level signals that LLMs capture naturally.
But the picture isn't uniform. Here's where things stand roughly as of early 2026:
LLMs win convincingly:
- English <-> Japanese (contextual word order, honorifics)
- English <-> Korean (similar reasons)
- English <-> Russian (complex morphology benefits from more context)
- Any pair where formality/register matters

Roughly tied:
- English <-> German
- English <-> French
- English <-> Spanish
- English <-> Chinese (Simplified)

NMT still competitive or ahead:
- Low-resource language pairs (Icelandic, Welsh, Yoruba)
- Languages with limited LLM training data
- Very short segments with no context (product titles, button labels)
Why LLMs are better at some things
Traditional NMT translates sentence by sentence. Each sentence is independent — the model doesn't know what came before or after. This creates several problems:
Pronoun resolution. In English, "it" could refer to dozens of things. In German, the translation of "it" depends on the grammatical gender of the referent: "er" (masculine), "sie" (feminine), or "es" (neuter). A sentence-level NMT model guesses. An LLM reading the full paragraph knows that "it" refers to "die Maschine" (feminine) and picks "sie."
Consistency. NMT might translate "account" as "Konto" in one sentence and "Benutzerkonto" in the next. LLMs maintain terminology consistency across a passage because they see the whole thing at once.
Formality. Many languages have formal/informal registers (German du/Sie, Japanese keigo levels, Korean speech levels). Sentence-level NMT can't determine the appropriate formality from a single sentence. LLMs can be instructed: "Use formal register throughout" and they'll maintain it.
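The three advantages above all come down to what you can put in the prompt. A minimal sketch of a context-rich prompt builder follows; `build_translation_prompt` and its prompt wording are hypothetical, not any vendor's API, but they show how document context, a register instruction, and a terminology glossary get passed to an LLM in a way sentence-level NMT cannot accept:

```python
def build_translation_prompt(segment, document_context, target_lang,
                             formality="formal", glossary=None):
    """Assemble a context-rich translation prompt. The LLM sees the
    surrounding text (pronoun referents, terminology) plus explicit
    register and glossary instructions. Wording is illustrative only."""
    lines = [
        f"Translate the segment below into {target_lang}.",
        f"Use a {formality} register throughout.",
    ]
    # Glossary entries enforce terminology consistency across segments
    for src, tgt in (glossary or {}).items():
        lines.append(f"Always translate '{src}' as '{tgt}'.")
    lines += [
        "Surrounding document (context only, do not translate):",
        document_context,
        "",
        "Segment to translate:",
        segment,
    ]
    return "\n".join(lines)

prompt = build_translation_prompt(
    segment="Click it to open your account.",
    document_context="The machine shows a dashboard after login.",
    target_lang="German",
    glossary={"account": "Konto"},
)
```

The same prompt skeleton handles all three problems at once: the surrounding document resolves "it", the glossary pins "account" to "Konto", and the register line keeps the formality stable across a batch.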
Idioms and cultural references. "It's raining cats and dogs" shouldn't be translated literally. NMT systems handle common idioms through memorization from training data, but LLMs understand the meaning and can produce a natural equivalent in the target language.
Real-world A/B test results
Benchmarks don't always predict production quality. Several companies have published results from switching from NMT to LLM-based translation:
A major e-commerce platform ran a 4-week A/B test on product descriptions translated from English to Japanese. LLM translations had 23% fewer support tickets from Japanese users about confusing product information. Conversion rates on translated pages increased by 8%.
A SaaS documentation team compared DeepL vs GPT-4 translations for their German docs. In a blind evaluation by native German-speaking engineers, GPT-4 was preferred 62% of the time, DeepL 28%, and the remaining 10% were rated equal. The main differentiator was naturalness — DeepL produced grammatically correct but "robotic" text.
A gaming company translating quest dialogue from English to Korean found that LLM translations required 40% fewer post-editing corrections by human translators compared to their previous NMT pipeline.
Where NMT still makes sense
LLMs aren't a universal replacement. There are legitimate reasons to stick with NMT:
Cost. NMT APIs charge $10-20 per million characters. LLM translation costs 5-20x more depending on the model and language pair. For bulk translation of millions of product listings, NMT economics still win.
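The economics are simple linear arithmetic. A sketch, using the ballpark rates from the text ($15/M characters for NMT, a 10x multiplier for LLMs; the catalog size is a made-up example, not a quoted price list):

```python
def cost_usd(characters, rate_per_million_chars):
    """Per-character pricing, which is roughly how both kinds of API bill."""
    return characters / 1_000_000 * rate_per_million_chars

catalog_chars = 500_000_000                    # e.g. a large product catalog
nmt_cost = cost_usd(catalog_chars, 15)         # midpoint of $10-20/M chars
llm_cost = cost_usd(catalog_chars, 15 * 10)    # assume the 10x multiplier

print(f"NMT: ${nmt_cost:,.0f}  LLM: ${llm_cost:,.0f}")
# NMT: $7,500  LLM: $75,000
```

At that scale the gap is tens of thousands of dollars per run, which is why bulk listings stay on NMT even when LLM quality is better.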
Latency. NMT models return translations in 50-200ms. LLM translation takes 1-5 seconds for a paragraph. For real-time chat translation or live subtitle generation, that latency gap matters.
Low-resource languages. LLMs trained primarily on English-centric internet data have weaker coverage of minority languages. Google Translate supports 130+ languages, many of which LLMs handle poorly or not at all. If you need Amharic or Khmer, NMT is still your best option.
Consistency at scale. LLMs can produce slightly different translations for the same input on repeated calls (non-zero temperature). NMT is deterministic. For large-scale content where consistency across millions of segments matters, NMT's determinism is a feature.
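If you do use an LLM and still want NMT-style determinism, the standard workaround is to cache by input. A minimal sketch; `CachedTranslator` and the stub engine are hypothetical, standing in for whatever API call you actually make:

```python
import hashlib

class CachedTranslator:
    """Deterministic wrapper: identical inputs always return the cached
    translation, sidestepping LLM run-to-run variance."""

    def __init__(self, translate_fn):
        self.translate_fn = translate_fn  # placeholder for a real engine call
        self.cache = {}
        self.calls = 0

    def translate(self, text, target_lang):
        key = hashlib.sha256(f"{target_lang}:{text}".encode()).hexdigest()
        if key not in self.cache:
            self.calls += 1  # only the first request hits the upstream API
            self.cache[key] = self.translate_fn(text, target_lang)
        return self.cache[key]

# Stub engine standing in for a real API call
t = CachedTranslator(lambda text, lang: f"[{lang}] {text}")
first = t.translate("Hello", "de")
second = t.translate("Hello", "de")
```

Caching also claws back cost and latency for repeated segments, which is exactly the regime (millions of near-duplicate strings) where determinism matters most.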
Short, contextless strings. Button labels like "Submit" or "Cancel" don't benefit from LLM context windows. NMT handles these fine and does it faster and cheaper.
The hybrid approach
The most practical architecture in 2026 uses both. Route translation requests based on content type:
    # `content` is assumed to expose .text, .type, .language_pair,
    # and .requires_realtime; LOW_RESOURCE_PAIRS entries are illustrative.
    LOW_RESOURCE_PAIRS = {("en", "am"), ("en", "km")}

    def choose_translation_engine(content):
        if len(content.text) < 50 and content.type == "ui_string":
            return "nmt"  # Short UI strings — NMT is fine
        if content.language_pair in LOW_RESOURCE_PAIRS:
            return "nmt"  # LLM coverage is poor
        if content.requires_realtime:
            return "nmt"  # Latency matters
        if content.type in ["documentation", "marketing", "support"]:
            return "llm"  # Quality matters more than cost
        return "llm"  # Default to LLM for everything else
This is roughly what auto18n does under the hood — routing to the best available model for each language pair and content type, so you don't have to manage the routing logic yourself.
The quality measurement problem
One reason the NMT-vs-LLM debate persists is that we're bad at measuring translation quality. BLEU scores (which count n-gram overlaps with reference translations) favor NMT because NMT outputs are closer to the training distribution. LLM translations often use phrasing that a human would prefer but that BLEU penalizes.
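The paraphrase penalty is easy to demonstrate. The toy scorer below computes bigram precision against a single reference; it is a deliberately simplified stand-in for one component of BLEU (no brevity penalty, no smoothing, no multiple references), and the example sentences are invented:

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also appear in the reference.
    A toy stand-in for one component of BLEU."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = hypothesis.split(), reference.split()
    overlap = ngrams(hyp) & ngrams(ref)          # clipped n-gram matches
    total = max(1, len(hyp) - n + 1)
    return sum(overlap.values()) / total

reference  = "the meeting was postponed until next week"
literal    = "the meeting was postponed until the next week"  # near-copy
paraphrase = "they pushed the meeting back a week"            # natural rewording

print(ngram_precision(literal, reference))     # high overlap score
print(ngram_precision(paraphrase, reference))  # low, despite reading naturally
```

The fluent paraphrase scores far below the near-literal rendering, which is the bias that makes reference-overlap metrics systematically understate LLM quality.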
COMET and BLEURT (learned metrics trained on human judgments) correlate better with actual quality, and they tend to favor LLMs for high-resource pairs. But even these metrics miss things like terminology consistency, brand voice adherence, and cultural appropriateness.
The most reliable signal is still human evaluation — specifically, bilingual humans rating translations on accuracy, fluency, and adequacy without knowing which system produced them. This is expensive, which is why most teams only do it for critical content.
Where this is heading
The gap between LLMs and NMT will likely widen. Each new generation of LLMs shows measurable improvement on translation tasks, while NMT architectures have largely plateaued. The remaining NMT advantages — cost, latency, determinism — are engineering problems that are being chipped away at with smaller models, speculative decoding, and caching.
My prediction: within two years, NMT-only translation APIs will be niche products for high-volume, low-quality-requirements use cases. Everything else will run through LLMs, either directly or through specialized translation models fine-tuned from LLM foundations.
For developers building products today, the practical takeaway is: default to LLM-based translation for any user-facing content, fall back to NMT for bulk/real-time/low-resource needs, and always have a human review process for high-stakes content regardless of which engine you use.