LLM Translation Is Eating Machine Translation
How LLM-based translation compares to traditional NMT on WMT benchmarks and in real-world A/B tests, and which language pairs still favor conventional models.
For a decade, neural machine translation (NMT) was the state of the art. Google Translate, DeepL, and Amazon Translate all run encoder-decoder transformer models trained specifically on parallel corpora. They're fast, cheap, and good enough for most use cases.
Then GPT-4 showed up and started outperforming them on translation benchmarks nobody expected it to win. The translation industry is still processing what this means.
The benchmark picture
The WMT shared tasks have been the standard evaluation for MT systems since 2006 (WMT began as the Workshop on Machine Translation and became the Conference on Machine Translation in 2016). In WMT23 and WMT24, LLM-based systems started placing at or near the top for high-resource language pairs.
On English-German, GPT-4 class models score within 1-2 COMET points of the best dedicated NMT systems. On English-Chinese, they're roughly tied. On English-Japanese, LLMs actually pull ahead, likely because Japanese requires more contextual understanding — sentence-level NMT misses discourse-level signals that LLMs capture naturally.
But the picture isn't uniform. Here's where things stand roughly as of early 2026:
LLMs win convincingly:
- English <-> Japanese (contextual word order, honorifics)
- English <-> Korean (similar reasons)
- English <-> Russian (complex morphology benefits from more context)
- Any pair where formality/register matters
Roughly tied:
- English <-> German
- English <-> French
- English <-> Spanish
- English <-> Chinese (Simplified)
NMT still wins:
- Low-resource language pairs (Icelandic, Welsh, Yoruba)
- Languages with limited LLM training data
- Very short segments with no context (product titles, button labels)
Why LLMs are better at some things
Traditional NMT translates sentence by sentence. Each sentence is independent — the model doesn't know what came before or after. This creates several problems:
Pronoun resolution. In English, "it" could refer to dozens of things. In German, the translation of "it" depends on the grammatical gender of the referent: "er" (masculine), "sie" (feminine), or "es" (neuter). A sentence-level NMT model guesses. An LLM reading the full paragraph knows that "it" refers to "die Maschine" (feminine) and picks "sie."
Consistency. NMT might translate "account" as "Konto" in one sentence and "Benutzerkonto" in the next. LLMs maintain terminology consistency across a passage because they see the whole thing at once.
Formality. Many languages have formal/informal registers (German du/Sie, Japanese keigo levels, Korean speech levels). Sentence-level NMT can't determine the appropriate formality from a single sentence. LLMs can be instructed: "Use formal register throughout" and they'll maintain it.
Idioms and cultural references. "It's raining cats and dogs" shouldn't be translated literally. NMT systems handle common idioms through memorization from training data, but LLMs understand the meaning and can produce a natural equivalent in the target language.
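The difference between sentence-level and passage-level translation can be sketched as a prompt builder. This is a minimal illustration, not a reference implementation: the function name `build_translation_prompt`, its parameters, and the prompt wording are all assumptions, and the resulting string would still need to be sent to whichever LLM API you use.

```python
from typing import Dict, List, Optional


def build_translation_prompt(
    sentences: List[str],
    target_language: str,
    formality: Optional[str] = None,
    glossary: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble one prompt for a whole passage (hypothetical helper)."""
    lines = [f"Translate the following passage into {target_language}."]
    if formality:
        # Register is stated once and applies to every sentence.
        lines.append(f"Use {formality} register throughout.")
    if glossary:
        # A fixed glossary nudges the model toward consistent terminology.
        lines.append("Use these term translations consistently:")
        for source_term, target_term in glossary.items():
            lines.append(f"- {source_term} -> {target_term}")
    lines.append("Keep one output line per numbered input sentence.")
    lines.append("")
    # All sentences go into the same prompt, so the model can resolve
    # pronouns such as "it" against earlier sentences.
    for index, sentence in enumerate(sentences, start=1):
        lines.append(f"{index}. {sentence}")
    return "\n".join(lines)


prompt = build_translation_prompt(
    ["The machine is new.", "It arrived yesterday."],
    target_language="German",
    formality="formal",
    glossary={"account": "Konto"},
)
print(prompt)
```

Because both sentences travel in one request, the model sees "The machine" before it has to translate "It", which is exactly the signal a sentence-by-sentence pipeline lacks.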
Real-world A/B test results
Benchmarks don't always predict production quality. Several companies have published results from switching from NMT to LLM-based translation:
A major e-commerce platform ran a 4-week A/B test on product descriptions translated from English to Japanese. Pages with LLM translations generated 23% fewer support tickets from Japanese users about confusing product information, and conversion rates on those pages increased by 8%.
A SaaS documentation team compared DeepL and GPT-4 translations for their German docs. In a blind evaluation by native German-speaking engineers, GPT-4 was preferred 62% of the time and DeepL 28% of the time, with the remaining 10% rated as equal. The main differentiator was naturalness: DeepL produced grammatically correct but "robotic" text.
A gaming company translating quest dialogue from English to Korean found that LLM translations required 40% fewer post-editing corrections by human translators compared to their previous NMT pipeline.
Where NMT still makes sense
LLMs aren't a universal replacement. There are legitimate reasons to stick with NMT:
Cost. NMT APIs charge $10-20 per million characters. LLM translation costs 5-20x more depending on the model and language pair. For bulk translation of millions of product listings, NMT economics still win.
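To make the cost gap concrete, here is a back-of-the-envelope calculation using the price range and multipliers quoted above. The catalogue size (2 million listings) and the average listing length (500 characters) are hypothetical numbers chosen only for illustration.

```python
from typing import Tuple


def cost_range_usd(
    characters: int, low_per_million: float, high_per_million: float
) -> Tuple[float, float]:
    """Return the (low, high) cost in USD for a given character volume."""
    millions = characters / 1_000_000
    return millions * low_per_million, millions * high_per_million


# Figures quoted in the text: $10-20 per million characters for NMT,
# and LLM translation at 5-20x the NMT price.
NMT_LOW, NMT_HIGH = 10.0, 20.0
LLM_MULTIPLIER_LOW, LLM_MULTIPLIER_HIGH = 5, 20

# Hypothetical workload, not taken from the article.
listings = 2_000_000
chars_per_listing = 500
total_chars = listings * chars_per_listing

nmt_low, nmt_high = cost_range_usd(total_chars, NMT_LOW, NMT_HIGH)
llm_low = nmt_low * LLM_MULTIPLIER_LOW
llm_high = nmt_high * LLM_MULTIPLIER_HIGH

print(f"NMT: ${nmt_low:,.0f} to ${nmt_high:,.0f}")  # NMT: $10,000 to $20,000
print(f"LLM: ${llm_low:,.0f} to ${llm_high:,.0f}")  # LLM: $50,000 to $400,000
```

Under these assumptions, the same catalogue costs somewhere between 2.5x and 40x more with an LLM, depending on which ends of the two ranges apply.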
Latency. NMT models return translations in 50-200ms. LLM translation takes 1-5 seconds for a paragraph. For real-time chat translation or live subtitle generation, that latency gap matters.
Low-resource languages. LLMs trained primarily on English-centric internet data have weaker coverage of minority languages. Google Translate supports more than 240 languages, many of which LLMs handle poorly or not at all. If you need Amharic or Khmer, NMT is still your best option.
Consistency at scale. LLMs can produce slightly different translations for the same input on repeated calls, particularly when sampling at a non-zero temperature. A fixed NMT model version with standard decoding returns the same output for the same input. For large-scale content where consistency across millions of segments matters, that determinism is a feature.
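The repeated-call problem can be illustrated with a small cache that pins the first translation of each segment. This is a sketch of one common mitigation, not something the teams above are reported to have used. The class name `CachedTranslator` and the stand-in `flaky_translate` function are hypothetical; the stand-in only simulates an engine whose output changes from call to call.

```python
import hashlib
import itertools
from typing import Callable, Dict


class CachedTranslator:
    """Wrap any translate(text, target_language) callable with a cache."""

    def __init__(self, translate: Callable[[str, str], str]) -> None:
        self._translate = translate
        self._cache: Dict[str, str] = {}

    def __call__(self, text: str, target_language: str) -> str:
        # Key on both language and text so "Submit" in German and in
        # Japanese are stored separately.
        raw_key = f"{target_language}\x00{text}".encode("utf-8")
        key = hashlib.sha256(raw_key).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._translate(text, target_language)
        return self._cache[key]


# Stand-in for a non-deterministic engine: every call returns a new variant.
_call_counter = itertools.count(1)


def flaky_translate(text: str, target_language: str) -> str:
    return f"[{target_language} v{next(_call_counter)}] {text}"


translator = CachedTranslator(flaky_translate)
first = translator("Submit", "de")
second = translator("Submit", "de")
print(first == second)  # True: the second call is served from the cache
```

A cache only stabilizes exact repeats of a segment. It does nothing for near-duplicates, so it narrows the consistency gap rather than closing it.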
Short, contextless strings. Button labels like "Submit" or "Cancel" don't benefit from LLM context windows. NMT handles these fine and does it faster and cheaper.
The hybrid approach
The most practical architecture in 2026 uses both. Route translation requests based on content type:
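As a rough sketch of such routing, the function below applies the trade-offs discussed in the previous section. Everything in it is an assumption made for illustration: the `TranslationRequest` fields, the 30-character cutoff for "short" segments, and the set of language codes treated as low-resource would all need tuning against your own traffic.

```python
from dataclasses import dataclass

# ISO 639-1 codes for languages named in the article as weak spots for LLMs:
# Icelandic, Welsh, Yoruba, Amharic, Khmer. The set itself is illustrative.
LOW_RESOURCE_TARGETS = {"is", "cy", "yo", "am", "km"}

# Hypothetical cutoff below which a string counts as "short".
SHORT_SEGMENT_CHARS = 30


@dataclass
class TranslationRequest:
    text: str
    target_language: str
    realtime: bool = False    # live chat or subtitles, where latency dominates
    bulk: bool = False        # large batch jobs, where cost dominates
    has_context: bool = True  # surrounding passage is available


def choose_engine(request: TranslationRequest) -> str:
    """Return "nmt" or "llm" for a request (hypothetical routing rules)."""
    if request.target_language in LOW_RESOURCE_TARGETS:
        return "nmt"
    if request.realtime or request.bulk:
        return "nmt"
    if len(request.text) <= SHORT_SEGMENT_CHARS and not request.has_context:
        return "nmt"
    # Everything else: longer, context-rich content where register,
    # pronouns, and terminology consistency matter.
    return "llm"


button = TranslationRequest("Submit", "de", has_context=False)
dialogue = TranslationRequest(
    "You have returned at last. The village elders have been waiting for you.",
    "ko",
)
print(choose_engine(button))    # nmt
print(choose_engine(dialogue))  # llm
```

The rule order reflects priorities: language coverage is checked first because no amount of context helps if the model handles the language poorly.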