Translating .PO, .XLIFF, and .ARB Files Programmatically
How to parse, extract translatable segments, translate via API, and write back .PO, .XLIFF, and .ARB files with metadata intact.
Most localization workflows revolve around file formats: .po for gettext-based projects, .xliff for Apple and enterprise tools, .arb for Flutter/Dart. If you want to automate translation, you need to parse these files, extract the translatable strings, send them through a translation API, and write the results back without breaking metadata or structure.
Here's how to handle each format.
.PO files (gettext)
PO (Portable Object) files are the standard for C, Python, PHP, Ruby, and many other ecosystems. They have a simple text-based format:
# This is a comment for translators
#: src/login.py:42
#, python-format
msgid "Welcome back, %(name)s!"
msgstr ""

#: src/dashboard.py:15
msgid "You have no new notifications."
msgstr ""

# Plural form
#: src/items.py:88
msgid "%(count)d item"
msgid_plural "%(count)d items"
msgstr[0] ""
msgstr[1] ""
Key elements:
msgid — the source string (English)
msgstr — the translation (empty means untranslated)
msgid_plural / msgstr[N] — plural forms
#: — source file references
#, — flags (like python-format, indicating format strings)
Parsing and translating .PO files in Python
The polib library handles all the parsing:
import polib

def translate_po_file(input_path, output_path, target_locale):
    po = polib.pofile(input_path)

    # Collect untranslated and fuzzy entries
    untranslated = po.untranslated_entries()
    fuzzy = po.fuzzy_entries()
    entries_to_translate = untranslated + fuzzy

    if not entries_to_translate:
        print(f"No untranslated entries in {input_path}")
        return

    # Extract source strings (singular first, then plural if present)
    source_texts = []
    for entry in entries_to_translate:
        source_texts.append(entry.msgid)
        if entry.msgid_plural:
            source_texts.append(entry.msgid_plural)

    # Translate via API
    translations = translate_batch(source_texts, target_locale)

    # Write translations back
    idx = 0
    for entry in entries_to_translate:
        singular_translation = translations[idx]
        idx += 1
        if entry.msgid_plural:
            plural_translation = translations[idx]
            idx += 1
            # Get the number of plural forms for this locale
            num_plurals = get_plural_count(target_locale)
            # msgstr[0] holds the translated singular, msgstr[1] the translated plural
            entry.msgstr_plural = {0: singular_translation}
            for i in range(1, num_plurals):
                if i == 1:
                    entry.msgstr_plural[1] = plural_translation
                else:
                    # Locales with more than two forms need a separate
                    # translation for each additional form
                    entry.msgstr_plural[i] = translate_single(
                        entry.msgid_plural,
                        target_locale,
                        context=f"Plural form {i} for locale {target_locale}",
                    )
        else:
            entry.msgstr = singular_translation
        # Remove fuzzy flag after translation
        if 'fuzzy' in entry.flags:
            entry.flags.remove('fuzzy')

    # Update the PO header with plural forms info
    po.metadata['Plural-Forms'] = get_plural_forms_header(target_locale)
    po.metadata['Content-Type'] = 'text/plain; charset=UTF-8'
    po.metadata['Language'] = target_locale

    po.save(output_path)
    print(f"Translated {len(entries_to_translate)} entries -> {output_path}")
Plural forms per locale
The PO header needs a Plural-Forms declaration that tells gettext how to select the right form:
PLURAL_FORMS = {
    'en': 'nplurals=2; plural=(n != 1);',
    'fr': 'nplurals=2; plural=(n > 1);',
    'de': 'nplurals=2; plural=(n != 1);',
    'ja': 'nplurals=1; plural=0;',
    'pl': 'nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);',
    'ar': 'nplurals=6; plural=(n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5);',
    'ru': 'nplurals=3; plural=(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);',
}
Japanese has one plural form (no distinction). Arabic has six. Getting this wrong means your translated strings display the wrong plural variant at runtime.
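The PO code above leans on two helpers, get_plural_count and get_plural_forms_header. Here is a minimal sketch of both, assuming they read from the PLURAL_FORMS table shown above:

import re

def get_plural_forms_header(locale):
    # Fall back to the English rule if the locale isn't in the table
    return PLURAL_FORMS.get(locale, PLURAL_FORMS['en'])

def get_plural_count(locale):
    # Pull the nplurals=N value out of the Plural-Forms declaration
    header = get_plural_forms_header(locale)
    match = re.search(r'nplurals=(\d+)', header)
    return int(match.group(1)) if match else 2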
Preserving format strings
PO entries flagged with python-format, c-format, or similar must preserve their placeholders. When translating "Welcome back, %(name)s!", the translation must still contain %(name)s. Validate after translation:
import re

def validate_format_strings(original, translated, format_type):
    if format_type == 'python-format':
        pattern = r'%\([^)]+\)[sdifFeEgGxXo]|%[sdifFeEgGxXo]'
    elif format_type == 'c-format':
        pattern = r'%[0-9]*[sdifFeEgGxXo]'
    else:
        return True
    orig_placeholders = set(re.findall(pattern, original))
    trans_placeholders = set(re.findall(pattern, translated))
    return orig_placeholders == trans_placeholders
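A quick check with the example string from the top of this section shows how a dropped placeholder gets caught:

# A translation that lost the named placeholder fails validation
ok = validate_format_strings(
    "Welcome back, %(name)s!",
    "Willkommen zurück!",  # %(name)s is missing
    "python-format",
)
print(ok)  # False -> flag this entry for retranslation or review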
.XLIFF files
XLIFF (XML Localization Interchange File Format) is the industry standard for translation exchange, used by Apple's Xcode, many CAT (Computer-Assisted Translation) tools, and enterprise localization platforms.
A typical XLIFF 1.2 file:
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" target-language="de" datatype="plaintext" original="Localizable.strings">
    <body>
      <trans-unit id="welcome_message" xml:space="preserve">
        <source>Welcome back!</source>
        <target state="new"></target>
        <note>Shown on the home screen after login</note>
      </trans-unit>
      <trans-unit id="items_count" xml:space="preserve">
        <source>%d items</source>
        <target state="new"></target>
        <note>Plural form for item count</note>
      </trans-unit>
    </body>
  </file>
</xliff>
Parsing XLIFF in Python
from lxml import etree

XLIFF_NS = 'urn:oasis:names:tc:xliff:document:1.2'

def translate_xliff(input_path, output_path, target_locale):
    tree = etree.parse(input_path)
    root = tree.getroot()
    ns = {'x': XLIFF_NS}

    # Update target language
    for file_elem in root.findall('.//x:file', ns):
        file_elem.set('target-language', target_locale)

    # Collect translatable segments
    trans_units = root.findall('.//x:trans-unit', ns)
    segments = []
    for tu in trans_units:
        source = tu.find('x:source', ns)
        target = tu.find('x:target', ns)
        note = tu.find('x:note', ns)
        if source is None or source.text is None:
            continue
        # Skip already translated segments (state="translated" or "final")
        if target is not None and target.get('state') in ('translated', 'final'):
            continue
        segments.append({
            'trans_unit': tu,
            'source_text': source.text,
            'note': note.text if note is not None else None,
        })

    if not segments:
        print("No untranslated segments found")
        return

    # Translate with context from notes
    source_texts = [s['source_text'] for s in segments]
    contexts = [s['note'] or '' for s in segments]
    translations = translate_batch_with_context(source_texts, contexts, target_locale)

    # Write translations back
    for segment, translation in zip(segments, translations):
        tu = segment['trans_unit']
        target = tu.find('x:target', ns)
        if target is None:
            # XLIFF 1.2 expects <target> to follow <source>, so move the new
            # element into place rather than leaving it appended after <note>
            source = tu.find('x:source', ns)
            target = etree.SubElement(tu, f'{{{XLIFF_NS}}}target')
            source.addnext(target)
        target.text = translation
        target.set('state', 'translated')

    tree.write(output_path, xml_declaration=True, encoding='UTF-8', pretty_print=True)
    print(f"Translated {len(segments)} segments -> {output_path}")
XLIFF 2.0 has a slightly different schema (a different namespace, and <unit> with nested <segment> elements instead of <trans-unit>), but the same approach applies — parse the XML, find the source elements, translate, and write back the target elements.
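For reference, a minimal read-side sketch against XLIFF 2.0, assuming the standard 2.0 namespace and the <unit>/<segment> structure:

from lxml import etree

XLIFF2_NS = 'urn:oasis:names:tc:xliff:document:2.0'

def iter_xliff2_segments(path):
    # Yield (segment element, source text) pairs from an XLIFF 2.0 file
    tree = etree.parse(path)
    ns = {'x': XLIFF2_NS}
    for segment in tree.getroot().findall('.//x:unit/x:segment', ns):
        source = segment.find('x:source', ns)
        if source is not None and source.text:
            yield segment, source.text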
Using translator notes as context
The <note> element in XLIFF is gold for translation quality. These notes, originally written for human translators, work just as well as context for LLM-based translation:
def translate_batch_with_context(texts, contexts, target_locale):
    """Translate texts with context notes for disambiguation."""
    payloads = []
    for text, context in zip(texts, contexts):
        payloads.append({
            "text": text,
            "context": context,
            "targetLocale": target_locale,
        })
    return api_translate_batch(payloads)
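The api_translate_batch call stands in for whatever client your translation provider offers. As an illustration only, here is a sketch against a hypothetical HTTP endpoint (the URL, request shape, and response fields are assumptions, not a real API):

import requests

def api_translate_batch(payloads):
    # Hypothetical endpoint and response schema; adjust to your provider
    response = requests.post(
        "https://api.example.com/v1/translate/batch",
        json={"items": payloads},
        timeout=60,
    )
    response.raise_for_status()
    return [item["translation"] for item in response.json()["items"]]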
.ARB files (Flutter/Dart)
ARB (Application Resource Bundle) is the standard format for Flutter apps. It's JSON with metadata:
{
  "@@locale": "en",
  "welcomeMessage": "Welcome back, {name}!",
  "@welcomeMessage": {
    "description": "Greeting shown on home screen",
    "placeholders": {
      "name": {
        "type": "String",
        "example": "John"
      }
    }
  },
  "itemCount": "{count, plural, =0{No items} =1{1 item} other{{count} items}}",
  "@itemCount": {
    "description": "Number of items in cart",
    "placeholders": {
      "count": {
        "type": "int"
      }
    }
  }
}
Keys starting with @@ are file-level metadata. Keys starting with @ (followed by another key name) are entry-level metadata containing descriptions and placeholder definitions. Everything else is a translatable string.
Parsing and translating ARB files
import json

def translate_arb(input_path, output_path, target_locale):
    with open(input_path, 'r', encoding='utf-8') as f:
        arb = json.load(f)

    translated_arb = {"@@locale": target_locale}
    entries_to_translate = []

    for key, value in arb.items():
        if key.startswith('@@'):
            continue
        if key.startswith('@'):
            # Copy metadata as-is
            translated_arb[key] = value
            continue
        # Get metadata for context
        meta_key = f'@{key}'
        metadata = arb.get(meta_key, {})
        description = metadata.get('description', '')
        placeholders = metadata.get('placeholders', {})
        entries_to_translate.append({
            'key': key,
            'source': value,
            'description': description,
            'placeholders': list(placeholders.keys()),
        })

    # Translate
    source_texts = [e['source'] for e in entries_to_translate]
    contexts = [e['description'] for e in entries_to_translate]
    translations = translate_batch_with_context(source_texts, contexts, target_locale)

    for entry, translation in zip(entries_to_translate, translations):
        # Validate placeholders are preserved
        for ph in entry['placeholders']:
            if f'{{{ph}}}' not in translation and ph not in translation:
                print(f"WARNING: Placeholder {{{ph}}} missing in translation of {entry['key']}")
        translated_arb[entry['key']] = translation

    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(translated_arb, f, ensure_ascii=False, indent=2)
    print(f"Translated {len(entries_to_translate)} entries -> {output_path}")
ICU MessageFormat in ARB files
ARB files use ICU MessageFormat for plurals and select patterns. These are tricky because the translation needs to preserve the ICU syntax while translating the text portions:
{count, plural, =0{No items} =1{1 item} other{{count} items}}
When sending this to a translation API, you need to either:
1. Send the full ICU message as one string and rely on the translation service to preserve the plural syntax while translating the text inside each branch, or
2. Parse the ICU message yourself, pull out the translatable text from each branch ("No items", "1 item", "{count} items"), translate those pieces individually, and reassemble the pattern.
Option 1 is simpler and works well with auto18n and similar services that handle format preservation. Option 2 is more reliable but requires ICU parsing logic.
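If you go the Option 2 route, the core of it is splitting the message into branches, translating only the literal text, and stitching the pattern back together. A rough sketch for simple, non-nested plural messages, reusing the translate_single helper assumed earlier (a real implementation should use a proper ICU MessageFormat parser; this regex only handles one level of nested braces):

import re

# Matches plural branches like "=0{No items}" or "other{{count} items}"
BRANCH_RE = re.compile(r'(=\d+|\w+)\s*\{((?:[^{}]|\{[^{}]*\})*)\}')

def translate_icu_plural(message, target_locale):
    # Split off the "{count, plural," prefix and the closing "}"
    prefix = re.match(r'^\{\s*(\w+)\s*,\s*plural\s*,\s*', message)
    if not prefix:
        return translate_single(message, target_locale)
    body = message[prefix.end():-1]
    parts = []
    for keyword, text in BRANCH_RE.findall(body):
        translated = translate_single(
            text, target_locale,
            context=f"ICU plural branch '{keyword}'; keep placeholders like {{count}} intact",
        )
        parts.append(f"{keyword}{{{translated}}}")
    return "{" + prefix.group(1) + ", plural, " + " ".join(parts) + "}"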
General tips across all formats
Always validate after translation. Check that placeholders, format specifiers, and structural markers survived translation. A missing %s in a format string causes crashes at runtime.
Preserve metadata. Comments, notes, source references, and flags in these files exist for a reason. Copy them to the output unchanged.
Handle encoding. PO files declare their encoding in the header. XLIFF is UTF-8 by default. ARB is JSON (UTF-8). Make sure your pipeline reads and writes with the correct encoding — mojibake from encoding mismatches is a classic localization bug.
Track translation state. XLIFF has state attributes (new, translated, reviewed, final). PO has fuzzy flags. Use these to track which translations are machine-generated and which have been human-reviewed. This lets you filter for "needs review" later.
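For PO files, a small sketch of that filtering step, assuming you treat fuzzy or empty entries as "needs review":

import polib

def entries_needing_review(po_path):
    # Fuzzy entries and entries with no translation still need human attention
    po = polib.pofile(po_path)
    return [e for e in po if 'fuzzy' in e.flags or not e.translated()]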
Batch for efficiency. Extract all strings from all files first, deduplicate, translate in bulk, then write back. This is cheaper and faster than translating file by file, especially when different files contain the same strings.
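A sketch of that batching step: gather strings from every file, deduplicate, translate each unique string once, then look translations up per file during write-back (translate_batch is the same helper assumed earlier):

def translate_unique(all_source_texts, target_locale):
    # Deduplicate while keeping first-seen order
    unique = list(dict.fromkeys(all_source_texts))
    translations = translate_batch(unique, target_locale)
    # Map each source string to its translation for the per-file write-back pass
    return dict(zip(unique, translations))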