A translation agent designed around terminology, not generic machine translation.
The same Doraemon, rendered two ways: unconstrained literal or pinyin renderings versus the canonical names fans expect in each market.
Why frame it this way: fluency has been commoditized by modern machine translation — making a sentence read smoothly is now a cheap, universal capability, no longer a differentiator. A wrong proper noun is a lore accident — players notice instantly. In real game localization the termbase is a core asset co-owned by dev, publisher, and linguists.
Real-data validation surfaced an even nastier version: names outside the term layer get completed into the real-world entity the model knows best — 蒙德→Borussia Dortmund, 艾莲→Garfield, 三笠→the battleship Mikasa (all actually happened). These cannot be prompted away: the root cause is the name not being in any constraint layer.
Why a deterministic orchestrator, not an autonomous loop: six fixed contracts mean every step can be located, debugged, and evaluated in isolation. In this problem the loop's freedom adds risk without value — 'what to do next' was never the hard part.
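A minimal sketch of what "fixed contracts" could look like in code, assuming each step declares the state keys it needs and the keys it promises to add. The `Step` type, the two step names, and the toy lambdas are hypothetical illustrations, not the six real steps.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Step:
    name: str                    # hypothetical step name
    requires: frozenset[str]     # state keys the step needs on entry
    provides: frozenset[str]     # state keys the step promises to return
    run: Callable[[dict[str, Any]], dict[str, Any]]


def run_pipeline(steps: list[Step], state: dict[str, Any]) -> dict[str, Any]:
    """Run steps in a fixed order, checking each contract on entry and exit."""
    for step in steps:
        missing = step.requires - set(state)
        if missing:
            raise ValueError(f"{step.name}: missing inputs {sorted(missing)}")
        output = step.run(state)
        unmet = step.provides - set(output)
        if unmet:
            raise ValueError(f"{step.name}: missing outputs {sorted(unmet)}")
        state = {**state, **output}
    return state


# Two toy steps standing in for the real ones.
steps = [
    Step(
        "collect_candidates",
        frozenset({"rows"}),
        frozenset({"candidates"}),
        lambda s: {"candidates": sorted({w for r in s["rows"] for w in r.split()})},
    ),
    Step(
        "count_candidates",
        frozenset({"candidates"}),
        frozenset({"count"}),
        lambda s: {"count": len(s["candidates"])},
    ),
]

result = run_pipeline(steps, {"rows": ["a b", "b c"]})
```

Because the order is a plain list and every violation names its step, a failure points at one contract rather than at a trace of model decisions.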
Why the human gate sits at step 3: HITL belongs before the spend and on the asset humans actually own (the termbase) — not as a rubber stamp at the end. You review the strategy, not every row.
This expands steps 1–2 of the pipeline. Selection is a funnel: collect every candidate first, then tighten layer by layer.
Why the labor is divided this way: recall goes to rules and chunked scanning (cheap, misses nothing); judgment goes to the model (good at classifying, bad at verbatim fidelity); vetoes go to rules (no cooperation required); the final call goes to a human. Missing a term costs little — the translator handles it, review can add it back. Locking a wrong term costs a lot — it gets enforced at 100%. So every layer of the funnel leans conservative.
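The division of labor above can be sketched roughly as follows. This is an illustration under assumptions, not the production funnel: the recall rule (text inside 「」 brackets), the `UI_STRINGS` blocklist, the stubbed `classify` callable, and the names 洛维 and 霜岚城 are all hypothetical.

```python
import re
from typing import Callable

# Hypothetical recall rule: anything inside 「」 is a candidate.
RECALL = re.compile(r"「([^」]+)」")

# Hypothetical veto list: common UI strings that must never be locked as terms.
UI_STRINGS = {"今日签到"}


def select_terms(
    candidates: list[str],
    classify: Callable[[str], bool],          # stand-in for the model's judgment
    veto_rules: list[Callable[[str], bool]],  # rules that need no cooperation
) -> list[str]:
    """Return the candidates that survive judgment and vetoes, for human review."""
    seen: set[str] = set()
    review_queue: list[str] = []
    for term in candidates:
        if term in seen:
            continue
        seen.add(term)
        if not classify(term):
            continue  # the model says this is not a term
        if any(rule(term) for rule in veto_rules):
            continue  # a rule vetoes it, whatever the model said
        review_queue.append(term)
    return review_queue


rows = ["与「洛维」对话", "完成「今日签到」", "前往「霜岚城」"]
candidates = [m for row in rows for m in RECALL.findall(row)]

# The stub accepts everything, so only the veto layer filters here.
queue = select_terms(
    candidates,
    classify=lambda t: True,
    veto_rules=[lambda t: t in UI_STRINGS],
)
```

Nothing in `queue` is locked yet; it is only what a reviewer gets to see.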
These five layers come from a real incident: v1 cached the first fuzzy-search result, so 今日签到 ("today's check-in") got locked as 'Toutiao' and 冒险世界 ("adventure world") as 'Digimon Adventure', and the masking layer then enforced both with perfect fidelity. The lesson: the more deterministically a term is applied, the stricter the bar for admitting it has to be.
Why the IP had to be invented: a public IP proves nothing — the model may have memorized its names. With hand-written terms in a fictional world, the model has exactly one path: through the constraint layer. The bare-model column also shows what drift looks like — English loses half the names, Japanese improvises nearly all of them, even emitting garbled tokens.
The boundary, as always: this validates deterministic application, not automatic term discovery — whether a drafted name is canonical is the review gate's job. Term fidelity (applied/total) is the product's headline KPI, printed on every result card.
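As a rough illustration of the applied/total ratio, assuming fidelity is counted once per termbase entry that appears in the source and checked by plain substring match (the real metric may count per occurrence or per cell). The termbase entries are invented for the example.

```python
def term_fidelity(termbase: dict[str, str], source: str, translation: str) -> tuple[int, int]:
    """Return (applied, total) over termbase entries whose source form occurs in `source`."""
    applied = total = 0
    for src_term, target_term in termbase.items():
        if src_term in source:
            total += 1
            if target_term in translation:
                applied += 1
    return applied, total


# Hypothetical entries from an invented world.
termbase = {"霜岚城": "Frostgale City", "洛维": "Lovi"}

applied, total = term_fidelity(
    termbase,
    source="洛维抵达霜岚城",
    translation="Lovi arrives in Frost Storm City",  # the place name drifted
)
```

Here `applied, total` comes out as `1, 2`: the character name was applied, the place name was not.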
Why the obsession: silent degradation is the most expensive failure mode. 'Looks translated, actually bad' destroys more trust than one honest failure — especially when the product's claim is 100%. So every failure class has a name, a handler, and a surface: the result card explicitly warns which cells kept their source text.
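One way to keep degradation from being silent, sketched under assumptions: a cell whose translation fails keeps its source text, and the failure class is recorded by name so the result card can list it. `BatchResult`, `translate_cells`, and the fake translator are hypothetical names for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BatchResult:
    cells: list[str] = field(default_factory=list)
    # index of each cell that kept its source text -> name of the failure class
    kept_source: dict[int, str] = field(default_factory=dict)

    def warning(self) -> str:
        """Text for the result card; empty when nothing fell back."""
        if not self.kept_source:
            return ""
        rows = ", ".join(f"{i} ({name})" for i, name in sorted(self.kept_source.items()))
        return f"{len(self.kept_source)} cell(s) kept their source text: {rows}"


def translate_cells(cells: list[str], translate: Callable[[str], str]) -> BatchResult:
    result = BatchResult()
    for i, text in enumerate(cells):
        try:
            out = translate(text)
            if not out.strip():
                raise ValueError("empty translation")
        except Exception as exc:  # broad on purpose in this sketch: every failure is surfaced
            out = text
            result.kept_source[i] = type(exc).__name__
        result.cells.append(out)
    return result


def fake_translate(text: str) -> str:
    """Stand-in translator that fails on one input."""
    if text == "boom":
        raise TimeoutError("upstream timed out")
    return text.upper()


batch = translate_cells(["hello", "boom", "bye"], fake_translate)
```

The point of the design is that the fallback and the warning are produced by the same code path, so one cannot exist without the other.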
Why a conversation, not a form: translation requests clarify progressively. The file comes first, then the languages, then a term dropped in along the way. Chat carries the back-and-forth; structured UI (review card, progress, result) embeds inside the thread as messages. Chat is the shell, and the structure isn't lost.
Two details that matter: the typed message wins for target languages (a real bug once had pre-selected chips silently overriding it); and with no language stated, the agent asks instead of defaulting — guessing a market wrong costs far more than one extra question.
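The precedence rule can be written down in a few lines. This sketch assumes chips count only when the user selected them explicitly, which is one reading of the fix; the function and parameter names are hypothetical.

```python
from typing import Optional


def resolve_targets(typed: list[str], selected_chips: list[str]) -> Optional[list[str]]:
    """Return the target languages, or None when the agent should ask the user."""
    if typed:
        return typed            # the typed message always wins
    if selected_chips:
        return selected_chips   # assumption: chips the user explicitly picked
    return None                 # nothing stated: ask, never default


both = resolve_targets(["ja"], ["en", "ko"])
chips_only = resolve_targets([], ["en"])
neither = resolve_targets([], [])
```

Returning `None` instead of a default list forces the caller to handle the "ask" branch rather than quietly guessing a market.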
Why these three come first: without reliable data pipelines and job orchestration, nothing built on top holds — long jobs can't resume, approvals have nowhere to write back. The frameworks the Tech-choices slide rejected (LangGraph and similar) earn their place at station 2: surplus complexity in a demo, a precondition at production scale. The same tool gets opposite verdicts in two phases — the trade-off depends on the phase, not the tool.
These three extend demo decisions directly: the 0.8 confidence gate grows into the approval workflow, the QC supervisor into the layered funnel, and 'correctness belongs to humans' goes from slogan to institution — with a queue, rationales, and a decision history.
The actual bottlenecks today: not compute, but terminology-asset coverage (the cold-IP long tail), per-row context (tone and reference can't be judged from one line), and browser-driven orchestration (long jobs can't resume) — addressed by stations 4, 1, and 2. This roadmap is ordered by that list.
Why this page matters: in the demo, terms come from live research, style from the model's instincts, eval from an invented IP. With the real corpus, all three become assets grown from shipped translations. The division of labor stands: what must be guaranteed goes structural (lookup, masking); what benefits from similarity goes retrieval (style exemplars). Using retrieval for terminology turns the one promised guarantee back into a probability.
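To make "structural" concrete, here is a minimal masking sketch: termbase hits are swapped for opaque placeholders before translation and restored afterwards, so the model never gets to render them. The `[[T0]]` token format, the function names, and the termbase entries are assumptions for illustration; the sketch also assumes no source term can match inside a placeholder.

```python
def mask_terms(text: str, termbase: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace each source term with a placeholder; return masked text and token -> target."""
    mapping: dict[str, str] = {}
    # Longest terms first, so a short term never splits a longer one.
    for i, src in enumerate(sorted(termbase, key=len, reverse=True)):
        if src in text:
            token = f"[[T{i}]]"
            text = text.replace(src, token)
            mapping[token] = termbase[src]
    return text, mapping


def unmask_terms(text: str, mapping: dict[str, str]) -> str:
    """Put the locked target terms back in place of their placeholders."""
    for token, target in mapping.items():
        text = text.replace(token, target)
    return text


# Hypothetical termbase with one term nested inside another.
termbase = {"霜岚": "Frostgale", "霜岚城": "Frostgale City"}

masked, mapping = mask_terms("欢迎来到霜岚城", termbase)

# Stand-in for the model call: it only sees the placeholder.
draft = masked.replace("欢迎来到", "Welcome to ")

final = unmask_terms(draft, mapping)
```

A retrieval step could still suggest style exemplars for the sentence around the placeholder, but the name itself comes from a lookup, not from similarity.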
The reference isn't truth either: past translations carry their own errors and drift, and the same source line can legitimately differ by scene. So scoring is semantic with tolerance — disagreement is a lead before it's an error — and a brand-new IP with no corpus falls back to the current pipeline.
“Put the probabilistic model where judgment helps. Keep determinism where correctness is non-negotiable. Gate the cost with people.”
That boundary judgment isn't specific to translation — it's how I build AI products.