Translating Markdown and MDX Content at Scale

How to translate Markdown and MDX files without breaking formatting, links, code blocks, or frontmatter. AST-based parsing, segment extraction, and reassembly.

Translating Markdown sounds simple until you try it. You can't just throw the raw file at a translation API — code blocks get translated, link URLs get mangled, and frontmatter metadata ends up in French. Here's what actually works.

The naive approach and why it fails

The tempting first attempt:

const markdown = fs.readFileSync("docs/getting-started.md", "utf-8");
const translated = await translate(markdown, { to: "ja" });
fs.writeFileSync("docs/ja/getting-started.md", translated);

This produces garbage. The translation engine will:

  • Translate variable names inside code blocks (const user becomes const utilisateur)
  • Convert URLs like /docs/api-reference into /docs/référence-api
  • Translate frontmatter keys (title: might become titre:)
  • Break Markdown syntax (mismatched backticks, mangled link references)
  • Merge or split lines in ways that destroy the document structure

Parse the AST, translate the text nodes

The right approach is to parse the Markdown into an AST, extract only the translatable text segments, translate those, and then reassemble the document.

For standard Markdown, unified with remark-parse gives you a clean AST:

import { unified } from "unified";
import remarkParse from "remark-parse";
import remarkStringify from "remark-stringify";

const tree = unified().use(remarkParse).parse(markdownContent);

The AST has node types like paragraph, heading, code, link, image, inlineCode, etc. You only want to translate text content inside certain node types.
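For a sense of the shape, a paragraph like See the [docs](/docs) for details. parses into roughly this (position data omitted):

{
  type: "paragraph",
  children: [
    { type: "text", value: "See the " },
    {
      type: "link",
      url: "/docs",
      children: [{ type: "text", value: "docs" }],
    },
    { type: "text", value: " for details." },
  ],
}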

Here's a walker that extracts translatable segments:

import { visit } from "unist-util-visit";

const segments = [];

visit(tree, (node) => {
  // Skip code blocks entirely
  if (node.type === "code" || node.type === "inlineCode") {
    return "skip";
  }

  // Extract text from text nodes
  if (node.type === "text") {
    segments.push({ node, original: node.value });
  }

  // Translate alt text but not URLs
  if (node.type === "image") {
    segments.push({ node, field: "alt", original: node.alt });
    return "skip";
  }
});

After translation, write the translated values back into the AST nodes and serialize:

segments.forEach((seg, i) => {
  if (seg.field) {
    seg.node[seg.field] = translatedTexts[i];
  } else {
    seg.node.value = translatedTexts[i];
  }
});

const output = unified().use(remarkStringify).stringify(tree);

Handling links

Links have two translatable parts and one that should never be translated:

[Click here to learn more](/docs/getting-started "Getting started guide")
  • Link text (Click here to learn more): translate
  • Title attribute (Getting started guide): translate
  • URL (/docs/getting-started): do NOT translate

But there's a subtlety. If your docs have locale-prefixed URLs (/ja/docs/getting-started), you need to rewrite the URL paths without translating them. That's a separate pass — match internal link patterns and swap in the locale prefix:

visit(tree, "link", (node) => {
  if (node.url.startsWith("/docs/")) {
    node.url = `/${targetLocale}${node.url}`;
  }
});

MDX adds complexity

MDX files contain JSX components mixed with Markdown. A typical MDX doc might have:

---
title: "Authentication"
sidebar_position: 3
---

import { Callout } from "@/components/Callout";

# Setting up authentication

<Callout type="warning">
  Make sure you have your API key before proceeding.
</Callout>

```js
const client = new AuthClient({ apiKey: process.env.API_KEY });
```

The component props, import statements, and JSX structure must survive translation intact.

Use remark-mdx to parse MDX ASTs:

import remarkMdx from "remark-mdx";

const tree = unified().use(remarkParse).use(remarkMdx).parse(mdxContent);

MDX nodes include mdxJsxFlowElement and mdxJsxTextElement. You need rules for each component: some props are translatable (like the text content inside a <Callout>), while others aren't (like type="warning"). This requires component-specific configuration:

const translatableProps = {
  Callout: ["children"],
  Tooltip: ["content"],
  // Props like 'type', 'variant', 'href' are never translated
};
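One way to wire that config into the walker (a sketch; it assumes the mdast-util-mdx-jsx node shape, where string-valued JSX attributes live on node.attributes):

visit(tree, ["mdxJsxFlowElement", "mdxJsxTextElement"], (node) => {
  const allowed = translatableProps[node.name] ?? [];
  for (const attr of node.attributes) {
    // Only plain string attributes with whitelisted names are translatable
    if (
      attr.type === "mdxJsxAttribute" &&
      allowed.includes(attr.name) &&
      typeof attr.value === "string"
    ) {
      segments.push({ node: attr, field: "value", original: attr.value });
    }
  }
  // Text children (like the body of <Callout>) are picked up by the
  // generic text-node walker above, so nothing extra is needed here.
});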

Frontmatter handling

Frontmatter needs selective translation. Given:

---
title: "Getting Started with the API"
description: "Learn how to authenticate and make your first request"
sidebar_position: 3
slug: "getting-started"
tags: ["quickstart", "api"]
---

You want to translate title and description, but leave sidebar_position, slug, and probably tags alone. Parse the frontmatter separately with gray-matter:

import matter from "gray-matter";

const { data, content } = matter(fileContent);

const translatableFrontmatter = ["title", "description"];
const toTranslate = {};
for (const key of translatableFrontmatter) {
  if (data[key]) toTranslate[key] = data[key];
}
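After translating those values, merge them back over the original data and let gray-matter re-serialize the file. Here translatedFrontmatter and translatedContent stand for the outputs of the earlier steps:

// Merge translated fields over the originals; untouched keys
// (sidebar_position, slug, tags) pass through unchanged
const translatedData = { ...data, ...translatedFrontmatter };
const output = matter.stringify(translatedContent, translatedData);
fs.writeFileSync(targetPath, output);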

Batch translation for efficiency

A documentation site might have hundreds of Markdown files with thousands of text segments. Translating them one by one is painfully slow and expensive.

The efficient approach: extract all segments from all files, deduplicate (the same phrase appears across many docs), batch them into translation API calls, and then reassemble.

// Extract from all files
const allSegments = [];
for (const file of markdownFiles) {
  const segments = extractSegments(file);
  allSegments.push(...segments);
}

// Deduplicate
const unique = [...new Set(allSegments.map((s) => s.text))];

// Translate in batches
const batchSize = 50;
const translations = new Map();

for (let i = 0; i < unique.length; i += batchSize) {
  const batch = unique.slice(i, i + batchSize);
  const results = await translateBatch(batch, { to: targetLocale });
  batch.forEach((text, j) => translations.set(text, results[j]));
}

With auto18n's batch endpoint, you can send arrays of strings and get back arrays of translations, which maps cleanly onto this segment-based workflow.
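If you're wiring this up yourself, a translateBatch like the one called above could be a thin fetch wrapper. The endpoint URL and payload shape below are placeholders, not auto18n's actual API; check the docs for the real contract:

// Hypothetical wrapper: endpoint and payload shape are illustrative only
async function translateBatch(texts, { to }) {
  const res = await fetch("https://api.example.com/v1/translate/batch", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.TRANSLATION_API_KEY}`,
    },
    body: JSON.stringify({ texts, target: to }),
  });
  if (!res.ok) throw new Error(`Batch translation failed: ${res.status}`);
  const { translations } = await res.json();
  return translations; // same order as the input texts
}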

Preserving inline formatting

This is where things get tricky. Consider:

You can use the **`--verbose`** flag to enable detailed logging.

The AST breaks this into: text node ("You can use the "), strong node containing inline code ("--verbose"), text node (" flag to enable detailed logging."). If you translate the text nodes independently, you lose the context that they're part of the same sentence.

The solution is to translate at the paragraph level, not the text node level. Extract the full paragraph text with placeholder markers for inline elements:

You can use the <1>--verbose</1> flag to enable detailed logging.

Translate the whole thing (the translation API preserves XML-like tags), then map the placeholders back to their original AST nodes. This gives translators the full sentence context while protecting inline formatting.
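A sketch of the extraction half, assuming a single level of inline nesting (toString from mdast-util-to-string flattens an inline node to its plain text):

import { toString } from "mdast-util-to-string";

// Turn a paragraph into one translatable string, replacing each
// non-text inline node with a numbered tag so it survives translation
function toPlaceholderString(paragraph) {
  const slots = new Map();
  let counter = 0;
  const text = paragraph.children
    .map((child) => {
      if (child.type === "text") return child.value;
      counter += 1;
      slots.set(counter, child); // remember the original inline node
      return `<${counter}>${toString(child)}</${counter}>`;
    })
    .join("");
  return { text, slots };
}

The reverse pass parses the numbered tags back out of the translated string and splices the saved nodes in, with each node's inner text replaced by the translated span.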

Incremental translation

Re-translating your entire docs site on every change is wasteful. Track which files changed since the last translation run:

import crypto from "crypto";

function contentHash(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

// Store hashes in a manifest
const manifest = JSON.parse(
  fs.readFileSync(".translation-manifest.json", "utf-8"),
);

for (const file of markdownFiles) {
  const content = fs.readFileSync(file, "utf-8");
  const hash = contentHash(content);

  if (manifest[file] === hash) continue; // Skip unchanged files

  await translateFile(file, content);
  manifest[file] = hash;
}

// Persist the updated hashes so the next run can skip these files
fs.writeFileSync(".translation-manifest.json", JSON.stringify(manifest, null, 2));

A realistic pipeline

Putting it all together for a Docusaurus or Next.js docs site:

  • Glob all .md and .mdx files from the source locale directory
  • Parse frontmatter and AST for each file
  • Extract translatable segments with placeholder markers for inline formatting
  • Deduplicate across all files
  • Check against the translation cache/manifest — only translate new or changed segments
  • Batch-translate via API
  • Reassemble each file: write translated frontmatter, rebuild AST with translated nodes, serialize
  • Write to the target locale directory, preserving the directory structure
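As a rough driver (fast-glob for the globbing; parseToTree, extractSegments, and reassemble stand in for the helpers built up in the earlier sections):

import fg from "fast-glob";
import path from "path";

const files = await fg("docs/en/**/*.{md,mdx}");

for (const file of files) {
  const raw = fs.readFileSync(file, "utf-8");
  const { data, content } = matter(raw);

  // remark for .md, remark + remark-mdx for .mdx
  const tree = parseToTree(content, file.endsWith(".mdx"));
  const segments = extractSegments(tree);

  // ...dedupe, consult the manifest, batch-translate...

  const outPath = file.replace("docs/en/", `docs/${targetLocale}/`);
  fs.mkdirSync(path.dirname(outPath), { recursive: true });
  fs.writeFileSync(outPath, reassemble(tree, data, translations));
}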
This pipeline handles a 500-page docs site in under a minute for incremental updates (assuming only a few pages changed), and a full initial translation in maybe 10-15 minutes depending on API throughput.

The main thing to get right is the AST handling. If you skip that and try regex-based extraction, you'll spend more time fixing edge cases than you saved by avoiding the parser.