Translation Quality Metrics: BLEU, COMET, and Human Eval
A practical breakdown of BLEU, COMET, BLEURT, and human evaluation for measuring translation quality — what each metric captures, its blind spots, and how to set up automated quality checks.
You've set up a translation pipeline. Translations are flowing. But how do you know if they're any good? "Looks fine to me" doesn't scale, and you probably don't read all 40 languages you're translating into.
Translation quality metrics exist to answer this question automatically. The problem is that each metric measures something slightly different, and none of them fully captures what "good translation" means. Here's what you need to know to pick the right ones.
BLEU: the old standard
BLEU (Bilingual Evaluation Understudy) was introduced in 2002 and became the default metric for machine translation (MT) research. It works by comparing the machine output against one or more human reference translations and counting n-gram overlaps.
In simplified terms: if the machine output shares a lot of 1-grams, 2-grams, 3-grams, and 4-grams with the reference, and isn't much shorter than the reference, it gets a high BLEU score.
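To make the n-gram counting concrete, here is a minimal sketch of a sentence-level BLEU in Python. It rests on several assumptions: a single reference, whitespace tokenization, and no smoothing. The names `ngrams` and `simple_bleu` are made up for this illustration, and the scores will not match those from a standard tool such as sacreBLEU, which handles tokenization and corpus-level aggregation differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference BLEU in the range [0, 1].

    Assumptions: whitespace tokenization, one reference, no smoothing.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # "Clipped" overlap: an n-gram only counts as many times
        # as it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, one empty n-gram order zeroes the score.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: punish outputs shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

With the hypothesis "the cat sat on the mat" and the reference "the cat is on the mat", `simple_bleu(..., max_n=2)` comes out at about 0.71, while the default `max_n=4` returns 0.0 because no 4-gram matches. That sharp drop on a single short sentence is one reason BLEU is normally reported over a whole test set rather than sentence by sentence.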