AI LAB · TRANSLATION AGENT · DESIGN NOTES

Canon starts
with the names.

A translation agent designed around terminology, not generic machine translation.

Framing the problem

The hard part isn't translation. It's terminology.

Same Doraemon: unconstrained literal/pinyin renderings vs the canonical names fans expect in each market.

Source | Type · locale | Unconstrained | Canonical
大雄 | Character · EN | Da Xiong | Nobita
胖虎 | Character · EN | Fat Tiger | Gian
任意门 | Gadget · EN | Arbitrary Door | Anywhere Door
竹蜻蜓 | Gadget · EN | Bamboo Dragonfly | Bamboo Copter
记忆面包 | Gadget · JA | 記憶パン | アンキパン

Why · the thinking

Why frame it this way: fluency has been commoditized by modern machine translation — making a sentence read smoothly is now a cheap, universal capability, no longer a differentiator. A wrong proper noun is a lore accident — players notice instantly. In real game localization the termbase is a core asset co-owned by dev, publisher, and linguists.

Real-data validation surfaced an even nastier version: names outside the term layer get completed into the real-world entity the model knows best — 蒙德→Borussia Dortmund, 艾莲→Garfield, 三笠→the battleship Mikasa (all actually happened). These cannot be prompted away: the root cause is the name not being in any constraint layer.

The key call

Terminology lives outside the prompt.

Source: 大雄使用了时光机
Mask: ⟦0⟧使用了⟦1⟧
LLM: ⟦0⟧ used ⟦1⟧
Restore: Nobita used Time Machine

An LLM is probabilistic by nature, so any approach that relies on it obeying cannot reach 100%. Verified path by path:

✗ Ask in the instructions (tested): tell the model “translate 大雄 as Nobita”. It mostly complies, but only mostly: an early build drifted ドラえもん into ドラえモン under exactly such instructions, and in the controlled run on the Guarantee slide the unconstrained model hit only 18 of 90 term occurrences.
✗ Turn temperature to 0 (unreliable in practice): temperature is the model's randomness dial; at 0, output should repeat exactly. But hosted models batch requests, and many are MoE (mixture-of-experts), where the internal sub-network that handles a token can depend on what else is in the batch. So even at zero, two runs can produce different names.
✗ Find-and-replace afterwards (structurally impossible): fix wrong names in the output. But a mistranslation takes unpredictable forms (Da Xiong / Daxiong / Big Hero…); with no anchor to search for, there is nothing to replace.
✗ Constrained decoding (not available): the academic route is to force the model, token by token, to emit only approved words. It requires token-level control over the model's output probability distribution; closed APIs don't expose it.
✓ Masking (the one path that doesn't depend on obedience): swap 大雄 for the placeholder ⟦0⟧ before translation; a placeholder isn't a word, so it can't be translated; afterwards swap ⟦0⟧ back to Nobita. The name never passes through the model, so exactness is structural. Paired with a confidence-gated term cache (a term is stored once settled and reused), names can't drift between runs either.

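The mask and restore steps can be sketched in a few lines. This is a minimal illustration, not the product's code: the function names and the `Termbase` shape are hypothetical, and only the ⟦n⟧ sentinel format and the example sentence come from the slide.

```typescript
// Hypothetical helper names; only the ⟦n⟧ sentinel shape and the example line come from the slide.
type Termbase = Record<string, string>; // source term -> approved target name

interface MaskedText {
  text: string;    // source with terms swapped for sentinels
  slots: string[]; // slots[n] is the source term hidden behind ⟦n⟧
}

// Scan left to right and, at each position, take the longest termbase entry that
// starts there. Sentinels are numbered in order of first appearance.
function maskTerms(source: string, termbase: Termbase): MaskedText {
  const terms = Object.keys(termbase)
    .filter((t) => t.length > 0)
    .sort((a, b) => b.length - a.length);
  const slots: string[] = [];
  let text = "";
  let i = 0;
  while (i < source.length) {
    const hit = terms.find((t) => source.startsWith(t, i));
    if (hit === undefined) {
      text += source.charAt(i);
      i += 1;
      continue;
    }
    let slot = slots.indexOf(hit);
    if (slot === -1) {
      slot = slots.length;
      slots.push(hit);
    }
    text += `⟦${slot}⟧`;
    i += hit.length;
  }
  return { text, slots };
}

// Swap sentinels back for approved names. A missing sentinel is an error,
// never something to patch over silently.
function restoreTerms(translated: string, masked: MaskedText, termbase: Termbase): string {
  masked.slots.forEach((_term, n) => {
    if (!translated.includes(`⟦${n}⟧`)) {
      throw new Error(`dropped sentinel ⟦${n}⟧`);
    }
  });
  return translated.replace(/⟦(\d+)⟧/g, (whole: string, digits: string) => {
    const term = masked.slots[Number(digits)];
    const target = term === undefined ? undefined : termbase[term];
    return target ?? whole;
  });
}

const demoTermbase: Termbase = { "大雄": "Nobita", "时光机": "Time Machine" };
const demoMasked = maskTerms("大雄使用了时光机", demoTermbase);
// demoMasked.text is "⟦0⟧使用了⟦1⟧"; the model only ever sees that string.
const demoRestored = restoreTerms("⟦0⟧ used ⟦1⟧", demoMasked, demoTermbase);
// demoRestored is "Nobita used Time Machine"
```

Because the approved name is written by `restoreTerms` rather than by the model, the only way a term can go wrong is a dropped sentinel, and that case throws.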
The pipeline

Six steps, each with a clear contract.

1. Read & identify: IP + entities needing canon
2. Build termbase: seed → wiki → model
3. Review gate: editable; approve before spend
4. Masked translate: ⟦n⟧ sentinels, parallel chunks
5. QC supervisor: mechanical checks + strong-model fixes
6. Deliver CSV: source + per-locale cols + report

Why · the thinking

Why a deterministic orchestrator, not an autonomous loop: six fixed contracts mean every step can be located, debugged, and evaluated in isolation. In this problem the loop's freedom adds risk without value — 'what to do next' was never the hard part.
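
One way to picture "fixed contracts" is a runner that executes steps in a hard-coded order and checks a postcondition after each one. The sketch below is an assumption about shape only: `Step`, `runPipeline`, and the toy job state are hypothetical, and only three of the six steps are stubbed.

```typescript
// Hypothetical types; the real step contracts are not shown in the slides.
interface Step<S> {
  label: string;
  run: (state: S) => S;
  contract: (state: S) => boolean; // must hold after run, or the job stops here
}

// Run steps in their fixed order. A broken contract names the step that broke it.
function runPipeline<S>(steps: Step<S>[], initial: S, trace: string[]): S {
  let state = initial;
  for (const step of steps) {
    state = step.run(state);
    if (!step.contract(state)) {
      throw new Error(`contract failed at step "${step.label}"`);
    }
    trace.push(step.label);
  }
  return state;
}

interface ToyJob {
  rows: string[];
  terms: string[];
  approved: boolean;
  translated: string[];
}

// Stubbed steps for illustration only.
const toySteps: Step<ToyJob>[] = [
  {
    label: "build termbase",
    run: (s) => ({ ...s, terms: ["大雄"] }),
    contract: (s) => s.terms.length > 0,
  },
  {
    label: "review gate",
    run: (s) => ({ ...s, approved: true }),
    contract: (s) => s.approved,
  },
  {
    label: "masked translate",
    run: (s) => ({ ...s, translated: s.rows.map((r) => `[translated] ${r}`) }),
    contract: (s) => s.translated.length === s.rows.length,
  },
];

const toyTrace: string[] = [];
const toyResult = runPipeline(
  toySteps,
  { rows: ["大雄使用了时光机"], terms: [], approved: false, translated: [] },
  toyTrace,
);
```

The error message carries the step label, which is the practical meaning of "you can point at the line".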

Why the human gate sits at step 3: HITL belongs before the spend and on the asset humans actually own (the termbase) — not as a rubber stamp at the end. You review the strategy, not every row.

Term selection

How terms get picked out of the CSV.

This expands steps 1–2 of the pipeline. Selection is a funnel: collect every candidate first, then tighten layer by layer —

1. Rules first
Speaker prefixes (「角色:台词」) and labeled fields (「人物名称:××」) are extracted by plain code. Zero tokens, nothing missed, nothing imagined, and these are exactly the names most likely to be mistranslated.

2. The model reads in chunks
The model reads the text chunk by chunk, extracting only proper nouns unique to this franchise, typed as character / place / item / skill. Large files fan out automatically (734 rows ran as 13 parallel sub-tasks in testing), and a supervisor model merges, dedups, and fixes types.

3. Negative filters
A generic-word stoplist plus compound segmentation blocks words like 任务, 签到, 冒险世界. The full five-layer defense is the next slide.

4. Each survivor gets researched
Surviving candidates are researched one by one: seed/cache → exact-title Wikipedia → model normalization, producing a per-locale name and a confidence score.

5. A human is the final filter
Auto-drafted terms lock only at ≥0.8 confidence, and everything, with its confidence and provenance, lands in the editable review table. Add, delete, edit: a human decides, before the full translation spend starts.

Why · the thinking

Why the labor is divided this way: recall goes to rules and chunked scanning (cheap, misses nothing); judgment goes to the model (good at classifying, bad at verbatim fidelity); vetoes go to rules (no cooperation required); the final call goes to a human. Missing a term costs little — the translator handles it, review can add it back. Locking a wrong term costs a lot — it gets enforced at 100%. So every layer of the funnel leans conservative.
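
The rules layer of the funnel might look like the sketch below. It is an illustration under assumptions: the regular expression, the 12-character cap on a speaker name, and the `FIELD_LABELS` list are hypothetical (only 人物名称 appears in the slides), and the sample rows are made up.

```typescript
// Hypothetical label list; only 人物名称 is taken from the slide.
const FIELD_LABELS = ["人物名称"];

// Matches "head:rest" with either a full-width or an ASCII colon.
// The 12-character cap on the head is an assumed guard against long sentences.
const PREFIX_PATTERN = /^\s*([^::\s]{1,12})\s*[::]\s*(.*)$/;

// Collect candidates from speaker prefixes and labeled fields, with no model call.
function extractByRules(rows: string[]): string[] {
  const found = new Set<string>();
  for (const row of rows) {
    const match = PREFIX_PATTERN.exec(row);
    if (match === null) continue;
    const head = match[1] ?? "";
    const rest = (match[2] ?? "").trim();
    if (head.length === 0) continue;
    if (FIELD_LABELS.indexOf(head) !== -1) {
      // Labeled field: the value after the colon is the candidate.
      if (rest.length > 0) found.add(rest);
    } else {
      // Speaker prefix: the part before the colon is the candidate.
      found.add(head);
    }
  }
  return Array.from(found);
}

const sampleCandidates = extractByRules([
  "大雄:我要迟到了!",
  "人物名称:胖虎",
  "没有前缀的一行",
  "大雄:哆啦A梦!",
]);
// sampleCandidates is ["大雄", "胖虎"]
```

A rule this broad will also pick up non-names that happen to precede a colon, which is acceptable here because this layer only collects candidates; the stoplist, research, and review layers decide what survives.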

Who owns correctness

Every term in the lock must deserve the lock.

Humans own glossary correctness: maintained, approved, always editable
Agent guarantees 100% application: masked, recounted, KPI on the result card

1. Extraction excludes generic vocab at the source
The extraction prompt asks for one thing only: proper nouns unique to THIS franchise. Words like 任务/签到/好感度 are ones any translator handles fine, so locking them is zero upside and pure risk. They never become candidates in the first place.

2. Stoplist + compound segmentation as the net
What the model misses, rules catch: a generic-vocabulary stoplist blocks single words, and greedy longest-match segmentation blocks compounds. 冒险世界 splits into 冒险+世界, both generic, so the whole term is rejected. This layer is pure rules; there is no probability involved.

3. Wikipedia: exact-title + redirect only
Fuzzy search returns the most famous related page, not the page for the term: search 今日签到 and the top hit is Toutiao the news app. Exact-title lookup keeps redirect resolution (静香 still resolves to 源靜香), but when there's no page, there's no answer: absence over noise.

4. High confidence requires two independent sources
The wiki name is only a hint handed to the model, which answers independently; agreement earns 0.9. Either source alone can be wrong: wiki can hit a homonym page (梅雨 = the rainy season), and the model can hallucinate. Independent agreement is what makes the error rate acceptably low.

5. Below 0.8: never locked, never cached
The final hard gate: an auto-drafted term below 0.8 simply isn't locked. A wrong term enforced at 100% by the masking layer does far more damage than a term left to the translator, and the review table lets a human add it back deliberately. The cache shares the same bar, and the once-polluted cache was retired with a version bump.

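
Layer 2 is small enough to sketch. The following is a minimal illustration of a stoplist with greedy longest-match segmentation, not the product's implementation: the function names are hypothetical and the stoplist contents beyond the words named on the slides (今日 in particular) are assumed.

```typescript
// Assumed stoplist; 任务, 签到, 好感度, 冒险, 世界 are named in the slides, 今日 is added for the demo.
const GENERIC_WORDS = new Set(["任务", "签到", "好感度", "冒险", "世界", "今日"]);

// Greedy longest-match segmentation: at each position take the longest stoplist
// word that fits. Returns the pieces if the whole term is covered, else null.
function segmentIntoGeneric(term: string, stoplist: Set<string>): string[] | null {
  let maxLen = 0;
  stoplist.forEach((word) => {
    maxLen = Math.max(maxLen, word.length);
  });
  const pieces: string[] = [];
  let i = 0;
  while (i < term.length) {
    let matched = "";
    for (let len = Math.min(maxLen, term.length - i); len >= 1; len--) {
      const piece = term.slice(i, i + len);
      if (stoplist.has(piece)) {
        matched = piece;
        break;
      }
    }
    if (matched === "") return null; // some part is not generic, so the term may be a real name
    pieces.push(matched);
    i += matched.length;
  }
  return pieces;
}

// A candidate is rejected when it is entirely made of generic vocabulary.
function isGenericTerm(term: string, stoplist: Set<string>): boolean {
  return term.length > 0 && segmentIntoGeneric(term, stoplist) !== null;
}

const worldPieces = segmentIntoGeneric("冒险世界", GENERIC_WORDS);
// worldPieces is ["冒险", "世界"], so 冒险世界 is rejected
const doorIsGeneric = isGenericTerm("任意门", GENERIC_WORDS);
// doorIsGeneric is false, so 任意门 stays a candidate
```

Greedy matching does not backtrack, so an unlucky long match can leave a remainder uncovered and let a generic compound through. That failure direction is the safe one only because later layers still gate what gets locked.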
Why · the thinking

These five layers come from a real incident: v1 cached the first fuzzy-search result, so 今日签到 got locked as 'Toutiao' and 冒险世界 as 'Digimon Adventure' — then enforced with perfect fidelity by the masking layer. The lesson: the more deterministic the application, the harsher the admission bar.
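
Layers 4 and 5 reduce to a scoring rule and a threshold. The sketch below assumes more than the slides state: 0.9 for agreement and the 0.8 gate are from the slides, while the single-source score of 0.5, the name normalization, and preferring the model's answer on disagreement are placeholders chosen for illustration.

```typescript
const LOCK_THRESHOLD = 0.8;      // from the slide
const AGREEMENT_SCORE = 0.9;     // from the slide
const SINGLE_SOURCE_SCORE = 0.5; // assumption: any value below the gate would do

interface TermDraft {
  source: string;
  target: string;
  confidence: number;
  provenance: string;
}

// Assumed comparison rule: ignore case and surrounding whitespace.
function normalizeName(value: string): string {
  return value.trim().toLowerCase();
}

// Score a draft from two independent answers. Only agreement reaches the gate.
function draftTerm(source: string, wikiName: string | null, modelName: string | null): TermDraft | null {
  if (wikiName !== null && modelName !== null && normalizeName(wikiName) === normalizeName(modelName)) {
    return { source, target: modelName.trim(), confidence: AGREEMENT_SCORE, provenance: "wiki+model" };
  }
  const only = modelName ?? wikiName;
  if (only === null) return null; // no source answered: absence over noise
  return {
    source,
    target: only.trim(),
    confidence: SINGLE_SOURCE_SCORE,
    provenance: modelName !== null ? "model" : "wiki",
  };
}

// Locking and caching share one bar.
function admit(draft: TermDraft): { locked: boolean; cached: boolean } {
  const passes = draft.confidence >= LOCK_THRESHOLD;
  return { locked: passes, cached: passes };
}

const agreedDraft = draftTerm("静香", "Shizuka Minamoto", "shizuka minamoto ");
const homonymDraft = draftTerm("梅雨", "East Asian rainy season", "Meiyu");
```

In this sketch `agreedDraft` is locked and `homonymDraft` is not; the unlocked draft would still appear in the review table with its confidence and provenance for a human to decide.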

The core guarantee · a controlled experiment

Same data, two runs: with and without the harness.

Setup: invented IP (nothing to memorize) · 6 hand-authored terms (EN+JA) · 24 rows × 2 languages · 90 term occurrences total

Bare model · no harness: 18/90 · 20% term fidelity · Japanese side: 0/45
  凛霜 → Frost
  心火 → Heartfire
  星陨之刃 → 星inovの刃 (garbled)

Constraint layer · this product: 90/90 · 100% · zero wrong substitutions
  凛霜 → Rimefrost / リムフロスト
  心火 → Emberheart / エンバーハート
  星陨之刃 → Starfall Blade / スターフォールの刃

Why · the thinking

Why the IP had to be invented: a public IP proves nothing, because the model may have memorized its names. With hand-written terms in a fictional world, the model has exactly one path: through the constraint layer. The bare-model column also shows what drift looks like: English keeps only 18 of its 45 term occurrences, and Japanese misses all 45, improvising names and even emitting garbled tokens.

The boundary, as always: this validates deterministic application, not automatic term discovery — whether a drafted name is canonical is the review gate's job. Term fidelity (applied/total) is the product's headline KPI, printed on every result card.
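
The KPI itself is a plain count. A minimal sketch, assuming fidelity is measured per occurrence by comparing how often a source term appears in a row with how often its approved name appears in the output; the function names are hypothetical and the two sample rows are made up, using term pairs from the experiment.

```typescript
interface TranslatedRow {
  source: string;
  output: string;
}

// Count non-overlapping occurrences of needle in haystack.
function countOccurrences(haystack: string, needle: string): number {
  if (needle.length === 0) return 0;
  return haystack.split(needle).length - 1;
}

// applied / total over every term occurrence in every row.
// Extra hits beyond the expected count are not credited.
function termFidelity(
  rows: TranslatedRow[],
  termbase: Record<string, string>,
): { applied: number; total: number; ratio: number } {
  let applied = 0;
  let total = 0;
  for (const row of rows) {
    for (const term of Object.keys(termbase)) {
      const expected = countOccurrences(row.source, term);
      if (expected === 0) continue;
      const target = termbase[term] ?? "";
      const hits = countOccurrences(row.output, target);
      total += expected;
      applied += Math.min(hits, expected);
    }
  }
  return { applied, total, ratio: total === 0 ? 1 : applied / total };
}

const fidelityDemo = termFidelity(
  [
    { source: "凛霜点燃了心火", output: "Rimefrost lit Emberheart" },
    { source: "凛霜来了", output: "Frost arrived" },
  ],
  { "凛霜": "Rimefrost", "心火": "Emberheart" },
);
// fidelityDemo.applied is 2 and fidelityDemo.total is 3
```

The same count, run over the experiment's rows, is what would produce figures like 18/90 and 90/90.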

Designing for failure

Failures surface. Nothing degrades silently.

Truncated / misaligned output → bisect the batch, isolate the row
Dropped sentinel ⟦n⟧ → reject the batch, retry with new context
A row that keeps failing → keep source + flag high-severity in QC
A chunk-level API failure → backoff retry ×3, never kills the job
No Chinese column / no language → a clear 422 question, never a guess

Why · the thinking

Why the obsession: silent degradation is the most expensive failure mode. 'Looks translated, actually bad' destroys more trust than one honest failure — especially when the product's claim is 100%. So every failure class has a name, a handler, and a surface: the result card explicitly warns which cells kept their source text.
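
The first and third handlers combine naturally: bisect a failing batch until the bad row is alone, then keep its source and flag it. The sketch below is synchronous for brevity and assumes a hypothetical `TranslateBatch` callback that returns `null` for truncated or misaligned output; a real version would be asynchronous and would add the retry and backoff handlers.

```typescript
// Hypothetical callback: returns one output per input row, or null when the
// batch output is truncated or misaligned.
type TranslateBatch = (rows: string[]) => string[] | null;

interface RowResult {
  source: string;
  output: string;
  flagged: boolean; // true when the row kept its source text
}

// Translate a batch; on failure, split it in half and recurse until the
// failing row is isolated. An isolated failure keeps its source and is flagged.
function translateWithBisect(rows: string[], translate: TranslateBatch): RowResult[] {
  if (rows.length === 0) return [];
  const out = translate(rows);
  if (out !== null && out.length === rows.length) {
    return rows.map((source, i) => ({ source, output: out[i] ?? source, flagged: false }));
  }
  if (rows.length === 1) {
    const source = rows[0] ?? "";
    return [{ source, output: source, flagged: true }];
  }
  const mid = Math.floor(rows.length / 2);
  return translateWithBisect(rows.slice(0, mid), translate).concat(
    translateWithBisect(rows.slice(mid), translate),
  );
}

// Fake translator for the demo: fails any batch containing "BAD".
const fakeTranslate: TranslateBatch = (rows) =>
  rows.some((r) => r.indexOf("BAD") !== -1) ? null : rows.map((r) => r.toUpperCase());

const bisectDemo = translateWithBisect(["a", "b", "BAD", "d"], fakeTranslate);
// Only the third row is flagged; the other three are translated.
```

One bad row costs a few extra calls instead of the whole batch, and the flag is what lets the result card name the cells that kept their source text.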

The interaction

One chat thread, end to end.

◆ game_strings.csv translate to Japanese and English
Termbase ready — 12 terms, 120 rows · review, or reply “start”
drop the term “任务”, start
✓ Translated 120 rows × 2 languages · term fidelity 100% · download CSV
Why · the thinking

Why a conversation, not a form: translation requests clarify progressively — file first, then languages, then drop a term. Chat carries the back-and-forth; structured UI (review card, progress, result) embeds inside the thread as messages. Chat is the shell — the structure isn't lost.

Two details that matter: the typed message wins for target languages (a real bug once had pre-selected chips silently overriding it); and with no language stated, the agent asks instead of defaulting — guessing a market wrong costs far more than one extra question.
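
The precedence rule can be written down directly. This is a sketch under assumptions: the keyword table and function name are hypothetical, and treating chips as a fallback assumes they were actively selected by the user rather than pre-selected defaults.

```typescript
// Hypothetical keyword table; a real parser would cover more languages and spellings.
const LANGUAGE_WORDS: Record<string, string> = {
  japanese: "ja",
  english: "en",
  korean: "ko",
};

type TargetResolution =
  | { kind: "targets"; locales: string[] }
  | { kind: "ask"; question: string };

// Typed message wins; chips (assumed user-selected) are only a fallback;
// with nothing stated, ask instead of defaulting.
function resolveTargets(message: string, selectedChips: string[]): TargetResolution {
  const lower = message.toLowerCase();
  const typed = Object.keys(LANGUAGE_WORDS)
    .filter((word) => lower.indexOf(word) !== -1)
    .map((word) => LANGUAGE_WORDS[word] ?? word);
  if (typed.length > 0) return { kind: "targets", locales: typed };
  if (selectedChips.length > 0) return { kind: "targets", locales: selectedChips };
  return { kind: "ask", question: "Which languages should I translate into?" };
}

const typedWins = resolveTargets("translate to Japanese and English", ["ko"]);
// typedWins.locales is ["ja", "en"]; the chip does not override the message
const nothingStated = resolveTargets("translate this file", []);
// nothingStated.kind is "ask"
```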

Tech choices

Every choice, with what it rejected.

Stack: Next.js + Vercel
One codebase carries both the chat UI and the API: the model key lives only in server-side env vars, never reaching the browser; deploying yields exactly what the assignment asks for: a directly accessible URL, zero setup for the reviewer.
✗ Rejected the standard split (Python backend + separate frontend): every extra service adds deploys, CORS, and auth to babysit. On a 1–2 hour take-home, the complexity budget belongs in the agent, not the plumbing.

Model routing: MiniMax M2 / M3
Two tiers split by how much judgment a step needs: hundred-row bulk translation and entity extraction run on the lower-cost M2; the steps that happen a few times per job but hurt when wrong (IP identification, QC review, bad-row fixes) run on the stronger M3. Same budget, expensive compute only where judgment lives.
✗ Rejected 'strongest model everywhere': bulk rows dominate token spend yet need no judgment, so that spend buys nothing. Also rejected 'cheap model everywhere': QC and hard rows collapse.

Grounding: Wikipedia langlinks
For 'this character's official name in the Japanese market', Wikipedia's interlanguage links are the widest keyless free source: the Chinese article links straight to its Japanese/English titles. Exact-title + redirect lookup only (静香 still resolves to 源靜香); no page, no answer. It's a discovery signal, not the source of truth, which lives in the publisher's glossary.
✗ Rejected fuzzy search: it returns the most famous related page, not the term's page; the pollution actually happened (今日签到→Toutiao).
✗ Rejected paid search APIs: keys to manage, latency added, and the returned pages still need verification anyway.

Orchestration: Deterministic orchestrator
Six steps with hard-coded order and contracts: each one logs, tests, and evaluates in isolation. When something breaks you can point at the line.
✗ Rejected LangGraph / Dify (for now): they solve durable state, resumability, and multi-agent coordination, which are production-scale problems. In a demo they only widen the debugging surface. At scale the orchestration migrates; the terminology architecture doesn't change by a line.

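
The two-tier routing described above fits in a lookup table. The step names below are hypothetical identifiers for the steps the slide lists, and "M2" / "M3" are used as tier labels only, not as API model identifiers.

```typescript
// Hypothetical step identifiers; the assignment of steps to tiers follows the slide.
type RoutedStep = "identifyIp" | "extractEntities" | "bulkTranslate" | "qcReview" | "fixBadRows";
type ModelTier = "M2" | "M3";

const MODEL_ROUTES: Record<RoutedStep, ModelTier> = {
  bulkTranslate: "M2",   // high volume, little judgment
  extractEntities: "M2", // high volume, merged and checked afterwards
  identifyIp: "M3",      // runs once per job, costly when wrong
  qcReview: "M3",        // judgment-heavy
  fixBadRows: "M3",      // few rows, hard cases
};

function modelFor(step: RoutedStep): ModelTier {
  return MODEL_ROUTES[step];
}
```

Typing the table as `Record<RoutedStep, ModelTier>` means a newly added step fails to compile until someone decides which tier it belongs to.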
Demo → production · I

Foundations first: data, orchestration, mining.

Data connectors
Job orchestration (the state machine)
Term-mining pipeline
Versioned termbase
The HITL approval workflow
Quality & cost
1. Data connectors
CSV is a demo entry point. Production connects to the DB / CMS / TMS: a read-only ingestion service normalizes text into job payloads carrying key, version, context, speaker, and length limits. The agent never holds DB credentials; it only sees payloads. Results return through a writeback service with draft → reviewed → approved → published states, not a file download.

2. Job orchestration (the state machine)
Today the browser drives the chunks: refresh and the job is gone. In production, jobs live on a queue with durable state: disconnects, refreshes, and redeploys all resume mid-run; retries, branching, and concurrency go to a durable-execution framework (LangGraph / Temporal). The orchestration layer is replaced; the terminology architecture stays as is.

3. Term-mining pipeline
At 10k rows you can't have the LLM read everything repeatedly. Rules mine first (speaker prefixes, field patterns like 「获得了 X」), statistics mine second (frequent n-grams, cross-file repeats), and the LLM only classifies, explains, and fills gaps at the end: one pass, full recall, tokens spent on judgment rather than reading.

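
The statistical pass in station 3 could start as simply as the sketch below: count character n-grams and keep those that repeat across files. Everything here is an assumption for illustration, including the function name, the CJK-only filter, and the thresholds; the sample rows are made up.

```typescript
interface GramStat {
  gram: string;
  count: number;     // total occurrences across all files
  fileCount: number; // number of distinct files containing the gram
}

// Assumed filter: keep only grams made entirely of CJK unified ideographs.
const CJK_ONLY = /^[\u4e00-\u9fff]+$/;

// Count character n-grams for n in [minN, maxN] and keep those seen in at
// least minFiles distinct files, most frequent first.
function frequentNgrams(files: string[][], minN: number, maxN: number, minFiles: number): GramStat[] {
  const counts = new Map<string, number>();
  const seenIn = new Map<string, Set<number>>();
  files.forEach((rows, fileIndex) => {
    for (const row of rows) {
      for (let n = minN; n <= maxN; n++) {
        for (let i = 0; i + n <= row.length; i++) {
          const gram = row.slice(i, i + n);
          if (!CJK_ONLY.test(gram)) continue;
          counts.set(gram, (counts.get(gram) ?? 0) + 1);
          const owners = seenIn.get(gram) ?? new Set<number>();
          owners.add(fileIndex);
          seenIn.set(gram, owners);
        }
      }
    }
  });
  const stats: GramStat[] = [];
  counts.forEach((count, gram) => {
    const owners = seenIn.get(gram);
    const fileCount = owners === undefined ? 0 : owners.size;
    if (fileCount >= minFiles) stats.push({ gram, count, fileCount });
  });
  return stats.sort((a, b) => b.count - a.count || a.gram.localeCompare(b.gram));
}

const minedGrams = frequentNgrams([["凛霜来了", "凛霜走了"], ["凛霜笑了"]], 2, 3, 2);
// minedGrams holds a single entry: 凛霜, seen 3 times in 2 files
```

The output is a candidate list, not a termbase: frequent n-grams include plenty of ordinary words, which is why the LLM still classifies at the end and the stoplist still applies.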
Why · the thinking

Why these three come first: without reliable data pipelines and job orchestration, nothing built on top holds — long jobs can't resume, approvals have nowhere to write back. The frameworks the Tech-choices slide rejected (LangGraph and similar) earn their place at station 2: surplus complexity in a demo, a precondition at production scale. The same tool gets opposite verdicts in two phases — the trade-off depends on the phase, not the tool.

Demo → production · II

Then the assets: terms, approval, economics.

4. Versioned termbase
Approved terms enter a versioned termbase with script / alias / abbreviation variants; matching upgrades from per-term scans to Aho-Corasick longest-match multi-pattern search, so one pass hits every term and 艾伦 can't steal 艾伦·耶格尔. Every batch binds to a termbase version: change a term and you re-run exactly the affected rows, not the world.

5. The HITL approval workflow
Today's editable table grows into a real workflow: terms move proposed → approved / rejected through a queue, each with occurrence examples, the agent's rationale, and alternatives; for a new IP the agent reads the world bible and character sheets first and proposes a naming strategy plus termbase v1; the style guide (tone, honorifics, what stays untranslated) is approved the same way. Agent proposes, human approves, the system enforces.

6. Quality & cost
QC becomes a funnel: deterministic checks (placeholders, term hits, leftover source) run on everything at zero tokens; statistical signals (embedding drift) sit in the middle; LLM review reads only flagged rows; humans read only high-severity low-confidence. On cost: TM + dedup so identical sources translate once, prompts carry only the terms the chunk actually hits, and per-job budgets (rows × locales × retry caps × sample rate). Cheap tokens still get budgeted.

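
"Re-run exactly the affected rows" follows from diffing two termbase versions. A minimal sketch, assuming a version is a plain source-to-target map and that substring containment is a good enough test for "this row uses the term"; the names are hypothetical and the sample rows are made up.

```typescript
type TermbaseSnapshot = Record<string, string>; // one version: source term -> target name

// Terms that were added, removed, or given a different target between versions.
function changedTerms(prev: TermbaseSnapshot, next: TermbaseSnapshot): string[] {
  const all = Array.from(new Set(Object.keys(prev).concat(Object.keys(next))));
  return all.filter((term) => prev[term] !== next[term]);
}

// Indices of rows whose source mentions any changed term.
function affectedRows(rows: string[], prev: TermbaseSnapshot, next: TermbaseSnapshot): number[] {
  const changed = changedTerms(prev, next);
  const hits: number[] = [];
  rows.forEach((row, index) => {
    if (changed.some((term) => row.indexOf(term) !== -1)) hits.push(index);
  });
  return hits;
}

const rerunDemo = affectedRows(
  ["凛霜来了", "心火燃起", "拿起星陨之刃", "无关的一行"],
  { "凛霜": "Rimefrost", "心火": "Heartfire" },
  { "凛霜": "Rimefrost", "心火": "Emberheart", "星陨之刃": "Starfall Blade" },
);
// rerunDemo is [1, 2]: one retargeted term and one newly added term
```

Substring containment over-approximates when one term is a prefix of another (a row with 艾伦·耶格尔 also contains 艾伦), so this sketch may re-run a few extra rows but never too few.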
Why · the thinking

These three extend demo decisions directly: the 0.8 confidence gate grows into the approval workflow, the QC supervisor into the layered funnel, and 'correctness belongs to humans' goes from slogan to institution — with a queue, rationales, and a decision history.

The actual bottlenecks today: not compute, but terminology-asset coverage (the cold-IP long tail), per-row context (tone and reference can't be judged from one line), and browser-driven orchestration (long jobs can't resume) — addressed by stations 4, 1, and 2. This roadmap is ordered by that list.

Next · plugging in the archive

Shipped translations are three assets.

1. Sentence level: exact TM reuse
Shipped source → translation pairs are a translation memory. New files are normalized and matched by hash first: a verbatim hit reuses the published line with zero tokens, zero risk, byte-identical to what's live. Game text repeats heavily (UI, system prompts, stock lines), so this layer collects the free wins first: what a lookup can solve never touches a model.

2. Term level: alignment mining, where confidence becomes a measurement
Row-aligned pairs allow word alignment: when a source term maps to the same target in 98% of rows, that consistency ratio is the confidence, measured from shipped translations rather than self-reported by a model. High-consistency terms enter at seed level. The byproduct is a terminology-consistency audit: every place past translations disagree with themselves, surfaced.

3. Style: profiles + exemplars, where retrieval serves tone only
The same Chinese line deserves two voices in Doraemon vs One Piece. Tone can't be masked, only conditioned: distill per-IP style profiles and per-character voice sheets (Japanese first-person pronouns, sentence endings, politeness) from the corpus, and inject a few similar shipped lines as exemplars via hybrid retrieval (BM25 for overlap + embeddings for meaning). Retrieval serves style, never terminology; that stays the constraint layer's job.

4. Eval: a golden set, with privilege separation
Hold out rows from the corpus as a golden set: the translating side never sees the reference (or the test measures copying, not translating); the QC reviewer holds it, scores against it, and sends low scores back with specific feedback in bounded rounds. The same set runs as regression: any prompt or model change passes the benchmark before it ships.

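
Layers 1 and 2 are both lookups over shipped pairs, sketched below under stated assumptions. The normalization rule (NFKC plus whitespace collapsing) is a guess at what "normalized" means, a `Map` keyed by the normalized string stands in for the hash index, and the consistency function takes already-aligned targets as input because word alignment itself is out of scope here.

```typescript
// Assumed normalization: Unicode NFKC, collapsed whitespace, trimmed ends.
function normalizeSource(text: string): string {
  return text.normalize("NFKC").replace(/\s+/g, " ").trim();
}

// Build the translation memory. On duplicate sources, the later pair wins (assumption).
function buildMemory(pairs: Array<[string, string]>): Map<string, string> {
  const memory = new Map<string, string>();
  for (const [source, target] of pairs) {
    memory.set(normalizeSource(source), target);
  }
  return memory;
}

// Exact hits are reused verbatim; only the misses go on to the model.
function splitByMemory(
  rows: string[],
  memory: Map<string, string>,
): { reused: Array<string | null>; pending: number[] } {
  const reused: Array<string | null> = [];
  const pending: number[] = [];
  rows.forEach((row, index) => {
    const hit = memory.get(normalizeSource(row));
    reused.push(hit ?? null);
    if (hit === undefined) pending.push(index);
  });
  return { reused, pending };
}

// Consistency ratio for one source term: share of rows using its most common target.
function consistencyRatio(alignedTargets: string[]): { target: string; ratio: number } | null {
  if (alignedTargets.length === 0) return null;
  const counts = new Map<string, number>();
  let best = "";
  let bestCount = 0;
  for (const target of alignedTargets) {
    const seen = (counts.get(target) ?? 0) + 1;
    counts.set(target, seen);
    if (seen > bestCount) {
      best = target;
      bestCount = seen;
    }
  }
  return { target: best, ratio: bestCount / alignedTargets.length };
}

const shippedMemory = buildMemory([["确认", "OK"]]);
const memorySplit = splitByMemory([" 确认 ", "新的一行"], shippedMemory);
// memorySplit.reused is ["OK", null] and memorySplit.pending is [1]
```

A term whose ratio falls short of the chosen bar is not discarded: the rows behind the minority targets are exactly the entries the consistency audit would surface.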
Why · the thinking

Why this page matters: in the demo, terms come from live research, style from the model's instincts, eval from an invented IP. With the real corpus, all three become assets grown from shipped translations. The division of labor stands: what must be guaranteed goes structural (lookup, masking); what benefits from similarity goes retrieval (style exemplars). Using retrieval for terminology turns the one promised guarantee back into a probability.

The reference isn't truth either: past translations carry their own errors and drift, and the same source line can legitimately differ by scene. So scoring is semantic with tolerance — disagreement is a lead before it's an error — and a brand-new IP with no corpus falls back to the current pipeline.

“Put the probabilistic model where judgment helps. Keep determinism where correctness is non-negotiable. Gate the cost with people.”

That boundary judgment isn't specific to translation — it's how I build AI products.

AI Lab · Matt Zheng · 2026