Translating Markdown and MDX Content at Scale
How to translate Markdown and MDX files without breaking formatting, links, code blocks, or frontmatter. AST-based parsing, segment extraction, and reassembly.
Translating Markdown sounds simple until you try it. You can't just throw the raw file at a translation API — code blocks get translated, link URLs get mangled, and frontmatter metadata ends up in French. Here's what actually works.
The naive approach and why it fails
The tempting first attempt:
const markdown = fs.readFileSync("docs/getting-started.md", "utf-8");
const translated = await translate(markdown, { to: "ja" });
fs.writeFileSync("docs/ja/getting-started.md", translated);
This produces garbage. The translation engine will:
- Translate variable names inside code blocks (`const user` becomes `const utilisateur`)
- Convert URLs like `/docs/api-reference` into `/docs/référence-api`
- Translate frontmatter keys (`title:` might become `titre:`)
- Break Markdown syntax (mismatched backticks, mangled link references)
- Merge or split lines in ways that destroy the document structure
Parse the AST, translate the text nodes
The right approach is to parse the Markdown into an AST, extract only the translatable text segments, translate those, and then reassemble the document.
For standard Markdown, unified with remark-parse gives you a clean AST:
import { unified } from "unified";
import remarkParse from "remark-parse";
import remarkStringify from "remark-stringify";
const tree = unified().use(remarkParse).parse(markdownContent);
The AST has node types like `paragraph`, `heading`, `code`, `link`, `image`, and `inlineCode`. You only want to translate text content inside certain node types.
Here's a walker that extracts translatable segments:
import { visit } from "unist-util-visit";
const segments = [];
visit(tree, (node) => {
// Skip code blocks entirely
if (node.type === "code" || node.type === "inlineCode") {
return "skip";
}
// Extract text from text nodes
if (node.type === "text") {
segments.push({
node,
original: node.value,
});
}
// Translate alt text but not URLs
if (node.type === "image") {
segments.push({
node,
field: "alt",
original: node.alt,
});
return "skip";
}
});
After translation, write the translated values back into the AST nodes and serialize:
segments.forEach((seg, i) => {
if (seg.field) {
seg.node[seg.field] = translatedTexts[i];
} else {
seg.node.value = translatedTexts[i];
}
});
const output = unified().use(remarkStringify).stringify(tree);
Handling links
Links have two translatable parts and one that should never be translated. Take `[Click here to learn more](/docs/getting-started "Getting started guide")`:
- Link text (`Click here to learn more`): translate
- Title attribute (`Getting started guide`): translate
- URL (`/docs/getting-started`): do NOT translate
If your site serves localized paths (like `/ja/docs/getting-started`), you need to rewrite the URL paths without translating them. That's a separate pass — match internal link patterns and swap in the locale prefix:
visit(tree, "link", (node) => {
if (node.url.startsWith("/docs/")) {
node.url = `/${targetLocale}${node.url}`;
}
});
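That single prefix check is the minimum; in practice you also want to leave external URLs, in-page anchors, and `mailto:` links untouched. A minimal, more defensive sketch (the helper name and the `/docs/` prefix are illustrative):

```javascript
// Prefix an internal doc URL with the target locale, leaving external
// links, in-page anchors, and other schemes (mailto:, etc.) untouched.
function localizeUrl(url, targetLocale) {
  const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(url); // https:, mailto:, ...
  const isAnchor = url.startsWith("#");
  if (hasScheme || isAnchor || !url.startsWith("/docs/")) {
    return url;
  }
  return `/${targetLocale}${url}`;
}
```

For example, `localizeUrl("/docs/getting-started", "ja")` yields `/ja/docs/getting-started`, while external links and anchors pass through unchanged.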
MDX adds complexity
MDX files contain JSX components mixed with Markdown. A typical MDX doc might have:
---
title: "Authentication"
sidebar_position: 3
---

import { Callout } from "@/components/Callout";

# Setting up authentication

<Callout type="warning">
  Make sure you have your API key before proceeding.
</Callout>

```js
const client = new AuthClient({ apiKey: process.env.API_KEY });
```
The component props, import statements, and JSX structure must survive translation intact.
Use remark-mdx to parse MDX ASTs:
import remarkMdx from "remark-mdx";
const tree = unified().use(remarkParse).use(remarkMdx).parse(mdxContent);
MDX nodes include `mdxJsxFlowElement` and `mdxJsxTextElement`. You need rules for each component. Some props are translatable (like the text content inside a `<Callout>`), while others aren't (like `type="warning"`). This requires a component-specific configuration:
const translatableProps = {
Callout: ["children"],
Tooltip: ["content"],
// Props like 'type', 'variant', 'href' are never translated
};
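The walker can then ask, for any component/prop pair it encounters, whether the value should go to the translator. A small sketch (the component names mirror the config above and are illustrative):

```javascript
// Per-component allowlist of translatable props (illustrative names).
const translatableProps = {
  Callout: ["children"],
  Tooltip: ["content"],
};

// Unknown components and unlisted props default to "don't translate",
// which is the safe failure mode for things like type="warning".
function isTranslatableProp(componentName, propName) {
  const allowed = translatableProps[componentName];
  return Array.isArray(allowed) && allowed.includes(propName);
}
```

Defaulting to "don't translate" means a newly added component is never silently mangled; at worst it ships untranslated until you add it to the config.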
Frontmatter handling
Frontmatter needs selective translation. Given:
---
title: "Getting Started with the API"
description: "Learn how to authenticate and make your first request"
sidebar_position: 3
slug: "getting-started"
tags: ["quickstart", "api"]
---
You want to translate `title` and `description`, but leave `sidebar_position`, `slug`, and probably `tags` alone. Parse the frontmatter separately with gray-matter:
import matter from "gray-matter";
const { data, content } = matter(fileContent);
const translatableFrontmatter = ["title", "description"];
const toTranslate = {};
for (const key of translatableFrontmatter) {
if (data[key]) toTranslate[key] = data[key];
}
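Once the API call comes back, merge the translated values over the original data so untouched keys like `slug` and `sidebar_position` survive verbatim, then re-serialize the result in front of the translated body. A dependency-free sketch of the merge (`translations` is assumed to map key to translated string):

```javascript
// Overlay translated frontmatter values onto the original data object,
// leaving non-translatable keys exactly as they were.
function mergeFrontmatter(data, translatableKeys, translations) {
  const merged = { ...data };
  for (const key of translatableKeys) {
    if (translations[key] !== undefined) {
      merged[key] = translations[key];
    }
  }
  return merged;
}
```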
Batch translation for efficiency
A documentation site might have hundreds of Markdown files with thousands of text segments. Translating them one by one is painfully slow and expensive.
The efficient approach: extract all segments from all files, deduplicate (the same phrase appears across many docs), batch them into translation API calls, and then reassemble.
// Extract from all files
const allSegments = [];
for (const file of markdownFiles) {
const segments = extractSegments(file);
allSegments.push(...segments);
}
// Deduplicate
const unique = [...new Set(allSegments.map((s) => s.text))];
// Translate in batches
const batchSize = 50;
const translations = new Map();
for (let i = 0; i < unique.length; i += batchSize) {
const batch = unique.slice(i, i + batchSize);
const results = await translateBatch(batch, { to: targetLocale });
batch.forEach((text, j) => translations.set(text, results[j]));
}
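The last step of the batch workflow is mapping the translated strings back onto the AST nodes each segment came from. A sketch, assuming each segment kept a `node` reference as in the extraction walker above:

```javascript
// Look up each segment's translation and write it back to its node.
// Segments whose text has no entry (e.g. a failed batch) keep their
// original value rather than being blanked out.
function applyTranslations(segments, translations) {
  for (const seg of segments) {
    const translated = translations.get(seg.text);
    if (translated !== undefined) {
      seg.node.value = translated;
    }
  }
}
```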
With auto18n's batch endpoint, you can send arrays of strings and get back arrays of translations, which maps cleanly onto this segment-based workflow.
Preserving inline formatting
This is where things get tricky. Consider:
You can use the --verbose flag to enable detailed logging.
The AST breaks this into: text node ("You can use the "), strong node containing inline code ("--verbose"), text node (" flag to enable detailed logging."). If you translate the text nodes independently, you lose the context that they're part of the same sentence.
The solution is to translate at the paragraph level, not the text node level. Extract the full paragraph text with placeholder markers for inline elements:
You can use the <1>--verbose</1> flag to enable detailed logging.
Translate the whole thing (the translation API preserves XML-like tags), then map the placeholders back to their original AST nodes. This gives translators the full sentence context while protecting inline formatting.
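A dependency-free sketch of that placeholder round trip (the part shapes are simplified from real mdast nodes, and a production version would also handle nested inline elements):

```javascript
// Flatten a paragraph's children into one translatable string, wrapping
// protected inline spans (inlineCode, strong, etc.) in numbered tags.
function encodePlaceholders(parts) {
  const inlineValues = [];
  const encoded = parts
    .map((part) => {
      if (part.type === "text") return part.value;
      const i = inlineValues.push(part.value); // 1-based placeholder index
      return `<${i}>${part.value}</${i}>`;
    })
    .join("");
  return { encoded, inlineValues };
}

// Split the translated sentence back apart, restoring the original
// (untranslated) content of each numbered placeholder.
function decodePlaceholders(translated, inlineValues) {
  const parts = [];
  const re = /<(\d+)>[\s\S]*?<\/\1>/g;
  let last = 0;
  let m;
  while ((m = re.exec(translated)) !== null) {
    if (m.index > last) {
      parts.push({ type: "text", value: translated.slice(last, m.index) });
    }
    parts.push({ type: "inline", value: inlineValues[Number(m[1]) - 1] });
    last = m.index + m[0].length;
  }
  if (last < translated.length) {
    parts.push({ type: "text", value: translated.slice(last) });
  }
  return parts;
}
```

Decoding restores the placeholder content from the source rather than the translation, so even if the translator edits text inside a tag, the protected span comes back byte-identical.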
Incremental translation
Re-translating your entire docs site on every change is wasteful. Track which files changed since the last translation run:
import crypto from "crypto";
function contentHash(text) {
return crypto.createHash("sha256").update(text).digest("hex");
}
// Store hashes in a manifest (fall back to empty on the first run)
const manifest = fs.existsSync(".translation-manifest.json")
  ? JSON.parse(fs.readFileSync(".translation-manifest.json", "utf-8"))
  : {};
for (const file of markdownFiles) {
  const content = fs.readFileSync(file, "utf-8");
  const hash = contentHash(content);
  if (manifest[file] === hash) continue; // Skip unchanged files
  await translateFile(file, content);
  manifest[file] = hash;
}
// Persist the updated hashes for the next run
fs.writeFileSync(".translation-manifest.json", JSON.stringify(manifest, null, 2));
A realistic pipeline
Putting it all together for a Docusaurus or Next.js docs site:
1. Collect the .md and .mdx files from the source locale directory
2. Skip any file whose content hash matches the manifest
3. Parse each remaining file into an AST (remark-parse, plus remark-mdx for MDX) and parse its frontmatter with gray-matter
4. Extract translatable segments, deduplicate them, and translate in batches
5. Write the translations back into the AST and frontmatter, rewrite internal link locales, and serialize
6. Write the output to the target locale directory and update the manifest
This pipeline handles a 500-page docs site in under a minute for incremental updates (assuming only a few pages changed), and a full initial translation in maybe 10-15 minutes depending on API throughput.
The main thing to get right is the AST handling. If you skip that and try regex-based extraction, you'll spend more time fixing edge cases than you saved by avoiding the parser.