
Translation Quality Metrics: BLEU, COMET, and Human Eval

A practical breakdown of BLEU, COMET, BLEURT, and human evaluation for measuring translation quality — what each metric captures, its blind spots, and how to set up automated quality checks.

You've set up a translation pipeline. Translations are flowing. But how do you know if they're any good? "Looks fine to me" doesn't scale, and you probably don't read all 40 languages you're translating into.

Translation quality metrics exist to answer this question automatically. The problem is that each metric measures something slightly different, and none of them fully capture what "good translation" means. Here's what you need to know to pick the right ones.

BLEU: the old standard

BLEU (Bilingual Evaluation Understudy) was introduced in 2002 and became the default metric for MT research. It works by comparing the machine translation against one or more human reference translations and counting n-gram overlaps.

In simplified terms: if the machine output shares a lot of 1-grams, 2-grams, 3-grams, and 4-grams with the reference, it gets a high BLEU score.
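Here's a minimal sketch of the core computation, n-gram precision with clipped counts (real BLEU combines the precisions geometrically and applies a brevity penalty for short outputs):

from collections import Counter

def ngram_precision(hyp: list[str], ref: list[str], n: int) -> float:
    """Fraction of hypothesis n-grams that also appear in the reference, with clipped counts."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_counts[ng]) for ng, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

hyp = "the cat was sitting on the mat".split()
ref = "the cat sat on the mat".split()
for n in range(1, 5):
    print(f"{n}-gram precision: {ngram_precision(hyp, ref, n):.2f}")

In practice you'd use sacrebleu, which handles tokenization, clipping, and the brevity penalty for you: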

from sacrebleu import corpus_bleu

refs = [["The cat sat on the mat."]]
hyp = ["The cat was sitting on the mat."]

score = corpus_bleu(hyp, refs)
print(score)  # BLEU = 45.7 (approximate)

BLEU scores range from 0 to 100 (sometimes expressed as 0 to 1). Rough interpretation for MT:

  • 50+: Very good, approaching human quality
  • 30-50: Understandable, decent quality
  • 15-30: Gets the gist across but rough
  • Below 15: Mostly unusable

BLEU's blind spots

BLEU has well-documented problems:

Synonym blindness. "The automobile is scarlet" vs. the reference "The car is red": both mean the same thing, but BLEU penalizes the output because "automobile" != "car" and "scarlet" != "red".
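A quick check with sacrebleu's sentence_bleu shows the effect (the exact number depends on tokenization and smoothing, but it lands near zero):

from sacrebleu import sentence_bleu

score = sentence_bleu("The automobile is scarlet.", ["The car is red."])
print(score.score)  # near zero, despite identical meaning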

Fluency ignorance. "Mat the on sat cat the" has the same unigram overlap as "The cat sat on the mat" but is gibberish. BLEU partially handles this with higher-order n-grams, but it's not a fluency metric.

Reference bias. BLEU assumes the reference translation is the gold standard. But there are many valid translations for any sentence. If the MT output is correct but uses different phrasing than the reference, it gets a low score.

Length insensitivity. BLEU includes a brevity penalty but doesn't penalize overly verbose translations much.

Despite these problems, BLEU persists because it's fast, free, deterministic, and everyone has published BLEU scores for the last 20 years, making historical comparison easy. It's the BMI of translation metrics — flawed but ubiquitous.

COMET: the learned metric

COMET (Crosslingual Optimized Metric for Evaluation of Translation) uses a pretrained multilingual model to score translations. Instead of counting word overlaps, it learns what "good translation" looks like from human judgment data.

from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "The cat sat on the mat.",
    "mt": "Die Katze saß auf der Matte.",
    "ref": "Die Katze hat auf der Matte gesessen."
}]

output = model.predict(data)
print(output.scores)  # e.g., [0.87]

COMET scores typically range from 0 to 1, with higher being better. It requires the source text, machine translation, and optionally a reference translation (reference-free variants exist).

Why COMET is better than BLEU

COMET correlates much more strongly with human judgments. In WMT shared tasks, the correlation between COMET and human rankings is around 0.95, while BLEU sits around 0.75-0.85.

COMET handles synonyms and paraphrasing well because it operates on semantic embeddings, not surface strings. "The automobile is scarlet" and "The car is red" get similar COMET scores because they mean the same thing.
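As a quick sanity check, you can score that pair with the wmt22-comet-da model loaded above. A sketch only: the German source sentence is an illustrative stand-in, and exact scores vary by checkpoint:

data = [
    {"src": "Das Auto ist rot.", "mt": "The automobile is scarlet.", "ref": "The car is red."},
    {"src": "Das Auto ist rot.", "mt": "The car is red.", "ref": "The car is red."},
]
output = model.predict(data)
print(output.scores)  # the paraphrase should land close to the exact match, unlike with BLEU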

COMET's limitations

Compute cost. COMET runs a transformer model, so scoring takes seconds per segment vs milliseconds for BLEU. For large-scale evaluation (millions of segments), this adds up.

Model version dependence. Different COMET checkpoints can give different scores. Always report which model you used.

Language coverage. COMET's accuracy varies by language pair. It's excellent for well-represented pairs (EN-DE, EN-ZH) and less reliable for low-resource languages.

BLEURT: Google's learned metric

BLEURT is Google's answer to the same problem COMET solves. It's based on BERT, fine-tuned on human ratings of translation quality. In practice, COMET and BLEURT produce very similar rankings — they disagree on individual scores but agree on which system is better.

The main difference is ecosystem. COMET is more widely used in academic MT research. BLEURT has tighter integration with Google's tools.

Reference-free metrics

All the above metrics traditionally need a reference translation. But if you had a perfect reference translation, you wouldn't need MT in the first place. Reference-free metrics score translations using only the source and the MT output:

# COMET reference-free (quality estimation)
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
referenceless_model = load_from_checkpoint(model_path)

data = [{
    "src": "The cat sat on the mat.",
    "mt": "Die Katze saß auf der Matte."
}]
output = referenceless_model.predict(data)

Reference-free COMET (COMET-QE, for Quality Estimation) scores are less accurate than reference-based scores, but they're practical for production monitoring where you don't have references.

This is what you'd use in a live pipeline: every translation gets a quality estimate score, and translations below a threshold get flagged for human review.

Human evaluation: still the gold standard

Automated metrics approximate human judgment but can't replace it for high-stakes decisions. The standard human evaluation approaches:

Direct Assessment (DA). Evaluators rate translations on a 0-100 scale for adequacy ("does it convey the same meaning?") and fluency ("does it read naturally?"). Simple but noisy — different evaluators have different scales.

MQM (Multidimensional Quality Metrics). Evaluators mark specific errors in the translation and categorize them: accuracy errors (mistranslation, omission, addition), fluency errors (grammar, spelling, style), and terminology errors. More detailed than DA but slower and more expensive.
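MQM annotations are typically rolled up into a numeric score by weighting error counts by severity and normalizing by text length. A minimal sketch with illustrative weights and categories (real deployments define their own scheme):

# Illustrative severity weights; actual MQM setups tune these
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(errors: list[dict], word_count: int) -> float:
    """Weighted error penalty per 100 source words (lower is better)."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return 100 * penalty / word_count

errors = [
    {"category": "accuracy/mistranslation", "severity": "major"},
    {"category": "fluency/grammar", "severity": "minor"},
]
print(mqm_score(errors, word_count=250))  # 2.4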

Pairwise comparison. Show evaluators two translations side by side (from different systems) and ask which is better. Easier for evaluators than absolute scoring. This is what WMT uses for its official rankings.

For a practical setup, I recommend pairwise comparison with a simple interface:

Source: "Your payment has been processed successfully."

Translation A: "Ihre Zahlung wurde erfolgreich verarbeitet."
Translation B: "Ihre Bezahlung wurde erfolgreich bearbeitet."

Which is better? [A] [B] [Equal]

Collect 100-200 pairwise judgments per language pair, and you have a statistically meaningful comparison between two systems or configurations.
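To confirm the preference isn't noise, a simple sign test over the judgments is enough. A minimal sketch, assuming scipy is available and using hypothetical tallies (ties are excluded from the test):

from scipy.stats import binomtest

# Hypothetical tallies from 200 pairwise judgments
wins_a, wins_b, ties = 112, 68, 20

result = binomtest(wins_a, n=wins_a + wins_b, p=0.5)
print(result.pvalue)  # a small p-value means A's advantage is unlikely to be chance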

Setting up automated quality monitoring

Here's a practical pipeline for monitoring translation quality in production:

from comet import download_model, load_from_checkpoint

# Reference-free (quality estimation) checkpoint
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

def evaluate_translation(source: str, translation: str) -> dict:
    data = [{"src": source, "mt": translation}]
    output = model.predict(data, batch_size=1)
    score = output.scores[0]

    return {
        "score": score,
        "flagged": score < 0.7,
        "source": source,
        "translation": translation,
    }

# In your translation pipeline
results = []
for segment in translated_segments:
    eval_result = evaluate_translation(segment["source"], segment["translation"])
    results.append(eval_result)

flagged = [r for r in results if r["flagged"]]
avg_score = sum(r["score"] for r in results) / len(results)

print(f"Average quality: {avg_score:.3f}")
print(f"Flagged segments: {len(flagged)} / {len(results)}")

Set up alerts when the average quality drops below a threshold or when the flagged rate exceeds a percentage. This catches systematic quality regressions — like when an API provider changes their underlying model.
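A minimal sketch of that check, continuing from the results computed above (the threshold values are illustrative, not recommendations):

# Illustrative thresholds; tune them against your own baseline
ALERT_AVG_THRESHOLD = 0.75
ALERT_FLAG_RATE = 0.10

flag_rate = len(flagged) / len(results)
if avg_score < ALERT_AVG_THRESHOLD or flag_rate > ALERT_FLAG_RATE:
    # Hook this into whatever alerting you already run (Slack webhook, pager, email)
    print(f"ALERT: avg quality {avg_score:.3f}, flagged rate {flag_rate:.1%}")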

Which metric to use when

| Scenario | Recommended metric |
| --- | --- |
| Comparing two MT systems | COMET (reference-based) + human pairwise evaluation |
| Production quality monitoring | COMET-QE (reference-free) |
| Quick sanity check | BLEU (fast, no model needed) |
| Detailed error analysis | MQM human evaluation |
| Regression testing in CI | BLEU + COMET on a fixed test set |

If you use a translation service like auto18n, quality monitoring on your side still makes sense as a trust-but-verify check. Run COMET-QE on a sample of translations weekly, and do a human eval quarterly for your most important languages.

The key insight: no single metric tells the whole story. Use automated metrics for broad monitoring and human evaluation for calibration and deep dives. Treat quality measurement as an ongoing process, not a one-time benchmark.