leonx.ai Don't trust, verify

I built a validator to judge whether AI translation has "soul" — then had to validate the validator

2026-06-21 · build log of a translation extension

My last post, townhall, ended with a line: verification is everything, and a validator is only worth as much as you can trust it. That came out of a multi-agent orchestration toy. This post tests the same line in a completely unrelated setting, a translation extension, and it held up again. This time I learned it more bluntly: I built a validator to judge whether translations were good, and found I had to validate the validator itself first.

It's called AI 译匠 ("translation artisan"), a Chrome extension — fully client-side, bring-your-own-key, talking straight to DeepSeek. The selling point isn't "it can translate" — Google does that — it's "it translates with soul." None of that stiff, word-for-word Chinese full of translationese; it should read like a Chinese engineer wrote it.

The problem shows up right away: how do you get a machine to judge "does this have soul"? I can't sit down and eyeball dozens of paragraphs every time I touch the prompt. Without something that can score it, this doesn't iterate at all.

First, split "good" into three gates

"Translates well" is too vague. Break it into three things you can test separately: one, it translates — doesn't choke on the whole page, doesn't drop paragraphs; two, it doesn't over-translate — leave code, nav, and anything that shouldn't move alone, don't wreck the layout; three, it has soul — idiomatic, no translationese, terms handled right.

The first two are dead simple: you compute them off the DOM, for free. The third is subjective; you need an LLM as a judge. So, two validators: a deterministic one for the first two gates and an LLM judge for the third. Together they map neatly onto townhall's three moves: static checks, sandbox runs, AI critique.

The translation validator: the deterministic half — free, runs on everything

It's pure Ruby, with Ferrum driving a headless Chrome. It opens a batch of real pages one by one, runs the exact content extraction the extension actually uses, and computes coverage: characters extracted divided by the visible body characters that should be translated. Zero blocks means it can't translate; low coverage means it dropped text, and it lists which fragments it dropped.
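The coverage number is simple enough to sketch. This is a minimal stdlib-only version, not the extension's actual code: the names `coverage_report` and `char_count`, the 0.9 threshold, and matching fragments by exact text are all assumptions for illustration, and the browser-driving part is left out entirely.

```ruby
# Coverage = extracted characters / visible body characters that should be translated.
# Whitespace is ignored so layout differences don't skew the ratio.
def char_count(text)
  text.gsub(/\s+/, "").length
end

# blocks:       text fragments the extractor returned
# visible_body: every visible, translatable fragment on the page
# threshold:    hypothetical cutoff below which we report dropped text
def coverage_report(blocks, visible_body, threshold: 0.9)
  total = visible_body.sum { |frag| char_count(frag) }
  return { status: :no_body, coverage: 0.0, dropped: [] } if total.zero?

  extracted = blocks.map(&:strip)
  # Assumption: a fragment counts as covered only if it was extracted verbatim.
  dropped = visible_body.reject { |frag| extracted.include?(frag.strip) }
  covered = total - dropped.sum { |frag| char_count(frag) }
  ratio = covered.to_f / total

  status =
    if blocks.empty?
      :cannot_translate
    elsif ratio < threshold
      :dropped_text
    else
      :ok
    end

  { status: status, coverage: ratio.round(3), dropped: dropped }
end
```

Returning the dropped fragments alongside the ratio is what makes a low number actionable: you see which text went missing, not just that some did.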

I grabbed 15 pages of different shapes: blog, wiki, news, docs, SPA, forum, repo, marketing, a Chinese site. 13 of 15 passed. The other two first read as "dropped text," but a closer look cleared the extraction logic: one was a dead link (I'd fetched an Apache error page), and the other got blocked by Cloudflare and never returned any body.

Stuck for a second: when it reports "dropped text," is that my extraction's fault, or did the page just not load right?

These two have to be split apart, or you'll go fix code that was fine. So I added a BLOCKED state: if the body has things like "Cloudflare / verify / 环境异常" (the last one means "abnormal environment"), or there's basically no body at all, call it "blocked," not "dropped." Figure out whether it's your fault or the environment's, then talk about fixing.
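The split between "blocked" and "dropped" can be sketched like this. The marker list comes from the paragraph above; the function name `page_state`, the 200-character minimum body, and the 0.9 coverage threshold are assumptions, not the values the real validator uses.

```ruby
# Markers that suggest an interstitial or bot wall rather than real content.
BLOCK_MARKERS = [/cloudflare/i, /verify/i, /环境异常/].freeze
MIN_BODY_CHARS = 200 # hypothetical cutoff for "basically no body at all"

# body_text: visible text of the loaded page
# coverage:  extracted / visible ratio from the coverage check
def page_state(body_text, coverage, threshold: 0.9)
  stripped = body_text.gsub(/\s+/, "")
  # Check the environment first, so a page that never loaded is not blamed on extraction.
  return :blocked if stripped.length < MIN_BODY_CHARS
  return :blocked if BLOCK_MARKERS.any? { |re| body_text.match?(re) }

  coverage < threshold ? :dropped : :ok
end
```

A keyword match this crude will also flag a genuine article that happens to mention Cloudflare, so a blocked result is a prompt to look at the page, not a verdict.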

This half was smooth. That's the nice thing about deterministic stuff: it doesn't lie to you. The trouble was the other half.

The soul validator, and how it fooled me

The soul half works like this: for one English passage, line up three translations — ours (what the prompt got DeepSeek to produce), the gold (a good human translation I picked by hand), and Google (literal, the soulless reference). Then bring in a stronger model, thinking turned on, as the judge, to do two things: score it on a rubric, and rank the three blind.
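The blind part is worth spelling out, because the judge must not know which system wrote which line. Here is a rough sketch with hypothetical helper names (`blind`, `unblind`, `beats?`); the call to the judge model itself is left out.

```ruby
# Hide which system produced which translation before the judge ranks them.
# candidates: { ours: "...", gold: "...", google: "..." }
# Returns [shown, key]: shown maps label => text (what the judge sees),
# key maps label => source (kept private until after ranking).
def blind(candidates, seed:)
  rng = Random.new(seed)
  shuffled = candidates.to_a.shuffle(random: rng)
  labels = ("A".."Z").first(shuffled.size)
  shown = {}
  key = {}
  labels.zip(shuffled).each do |label, (source, text)|
    shown[label] = text
    key[label] = source
  end
  [shown, key]
end

# Map the judge's ranking of labels (best first) back to sources.
def unblind(ranking, key)
  ranking.map { |label| key.fetch(label) }
end

# True when source a is ranked above source b.
def beats?(ranked_sources, a, b)
  ranked_sources.index(a) < ranked_sources.index(b)
end
```

Seeding the shuffle keeps a run reproducible; using a different seed per item keeps the judge from learning that, say, position A is always ours.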

First version of the judge, I had it score ours directly, five dimensions, 1 to 5. Here's what came back:

faithful 5.0 · no-translationese 5.0 · terms 5.0 · register 5.0 · native 5.0 · overall 5.0
faithful 5.0 · no-translationese 5.0 · terms 5.0 · register 5.0 · native 5.0 · overall 5.0
...all 5s, every time

Every line maxed out. I almost bought it — oh nice, this translates great.

But this is the same disease as that 0%-recall Critic in townhall: you think you've got a validator, and it's rubber-stamping with its eyes shut. A judge that gives everyone full marks carries zero information — same as having none. The validator itself has to pass verification first.
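A rubber-stamping judge is cheap to catch mechanically. This is a sketch of one possible check, assuming scores arrive as one hash per passage; the names and the 0.95 limit are made up for illustration.

```ruby
# Share of all scores that sit at the ceiling.
# score_rows: one hash of dimension => score per judged passage
def ceiling_rate(score_rows, max_score: 5.0)
  values = score_rows.flat_map(&:values)
  return 1.0 if values.empty? # no scores at all is as uninformative as all 5s

  values.count { |v| v >= max_score }.to_f / values.size
end

# A judge that almost never leaves the ceiling carries no information.
# limit is a hypothetical threshold.
def rubber_stamp?(score_rows, max_score: 5.0, limit: 0.95)
  ceiling_rate(score_rows, max_score: max_score) >= limit
end
```

Running this on the judge's output before reading any individual score is the point: if it trips, the scores aren't worth reading yet.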

The fix was easy: don't let it score in a vacuum. Put the gold in front of it as the ceiling — only give a 5 when ours reaches the level of this human translation; dock a point for anything you'd change. One sentence in, and the scores came alive: lots of 4s, the odd 3, points off where they belonged, real discrimination.

Only once the validator is trustworthy do you open the loop

Now the loop gets to show up: tweak the prompt, rerun the validator, see if the score went up, tweak again. The most ordinary loop there is. But whether it runs at all comes down to the previous step — the validator is the steering wheel of this loop, and if the wheel's off, the harder you hit the gas the further you drift.

And my wheel had play in it. The blind-ranking signal was alarmingly noisy: the same translation, barely touched, "beats Google" 80% one round and 0% the next. A strong model with thinking on still can't rank near-equivalent wordings — "use tool" vs "call tool" — consistently.

round 1  tool-definition  beats Google 80%  pass
round 2  tool-definition  beats Google  0%  fail   <- translation barely changed

So the loop gets dragged around by noise: the pass count bounces 3/4 ↔ 2/4, looks like "that change made it worse," when it's just two samples jittering. Chase that number to tune the prompt and you're chasing noise.

So how do you tell noise from real signal?

Three moves: score each item a few times and take the median or majority (the GEMBA approach); demote the jitteriest metric ("tie the human") from a hard gate to something you just track; and stop watching pass/fail per round and watch the trend in the dimension scores instead. Real improvements are visible. For example, I dropped the translation temperature from 1.0 to 0.3 and the made-up words halved on the spot: it used to turn "painkillers always win" into "止痛药永远卖得更好" ("painkillers always sell better," inventing a "sell"); after cooling it down, that was gone, and the faithfulness score climbed for real. That kind of gain you can see without the blind ranking.
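The first move, repeating the judge and aggregating, looks roughly like this. It's a sketch with hypothetical names, assuming Ruby 2.7+ for `tally`; how many repeats to run is left to you.

```ruby
# Median of a list of numeric scores.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Strict majority of a list of votes, or nil when there is none.
def majority(votes)
  return nil if votes.empty?

  winner, count = votes.tally.max_by { |_, n| n }
  count > votes.size / 2.0 ? winner : nil
end

# runs: one hash of dimension => score per judge call on the same item.
# Returns the per-dimension median across the repeated calls.
def aggregate(runs)
  runs.first.keys.to_h { |dim| [dim, median(runs.map { |r| r.fetch(dim) })] }
end
```

The median is for the rubric scores and the majority is for yes/no outcomes like "beats Google." An odd number of repeats avoids most ties.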

When do you stop? When the judge can't tell things apart anymore. The leftover disagreements — which near-synonym reads slightly better — the judge can't rank steadily, and honestly I can't call it either. That means you've hit the resolution ceiling of this eval; the translation is good enough. Good enough, ship it; grinding past that is just burning tokens.
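One way to make "the judge can't tell things apart" concrete is to compare the gap between two prompt versions with the judge's own jitter. The rule below is an assumption of mine for illustration, not a standard test: a change counts only if the median gap exceeds the widest spread seen within either version.

```ruby
# Median of a list of numeric scores.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Range of repeated scores: a crude measure of judge noise.
def spread(values)
  values.max - values.min
end

# old_runs / new_runs: repeated overall scores for the same item
# under two prompt versions.
def distinguishable?(old_runs, new_runs)
  gap = (median(new_runs) - median(old_runs)).abs
  noise = [spread(old_runs), spread(new_runs)].max
  gap > noise
end
```

When this comes back false for every remaining tweak, you're at the resolution ceiling described above, and further rounds only measure jitter.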

The validator and the loop are a pair

The one thing I most want to leave here: the validator and the loop are a pair; neither stands without the other.

A loop without a trustworthy validator is Yegge's slot machine — you think you're "iterating," you're really just pulling a lever: win without knowing why, lose without knowing why. And a validator without a loop is a ruler hanging on the wall that nobody uses — you measure once, and then what?

Put them right: the loop is the engine, the validator is the dashboard. One gives you the push for "one more version," the other tells you whether this version is actually better. Once the two gears mesh, iteration is climbing, not a random walk.

So townhall's "Don't trust, verify" needs one more turn inward: don't even trust your own validator. Take known answers first — the gold, or planted bugs like in townhall — verify the validator itself, confirm it can actually tell right from wrong, then let it judge. The judge can take the stand, but it gets a physical first.
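The physical can be as simple as a known-answer test. Here is a sketch, assuming each blind ranking has already been mapped back to sources: the gold human translation should outrank Google on most items, and a judge that can't manage that shouldn't be driving anything. The function names and the 0.8 bar are hypothetical.

```ruby
# rankings: per-item judge ranking, best first, e.g. [:gold, :ours, :google]
# Returns the share of items where the known-better source outranks the known-worse one.
def judge_accuracy(rankings, better: :gold, worse: :google)
  return 0.0 if rankings.empty?

  correct = rankings.count { |r| r.index(better) < r.index(worse) }
  correct.to_f / rankings.size
end

# min_accuracy is a hypothetical bar; pick one that fits your tolerance.
def judge_trustworthy?(rankings, min_accuracy: 0.8)
  judge_accuracy(rankings) >= min_accuracy
end
```

Where ours lands doesn't matter for this check. Only the pair with a known answer is used to grade the judge.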

Last

AI 译匠 is a usable little tool, and the code's cheap to toss. What you take with you is the same routine as townhall, walked through in a new setting: spot a judging component (a critic, a validator), measure it against known answers first, and only once it's trustworthy let it drive a loop — until it can't tell things apart anymore.

A bit of a plug: the thing we're seriously building, leonclass.com, an AI assistant that helps science teachers vet exam questions, is at heart a validator. It judges whether a problem is self-consistent and whether it hits the capability it's meant to test. Two toys, one orchestration and one translation, kept teaching me the same lesson: a validator is only worth as much as you can trust it, and the only way to make it trustworthy is to put it, too, inside a loop.

If you're also chewing on verification / education / agents, come say hi.