We read it everywhere now. AI can do 80% of the work in localization. That sounds great! Almost as if you could save 80% of your production costs, right?
Of course, most of us realize that that 80% isn’t free. AI isn’t free. You have to invest in research, training, tests, and so on. But then it runs on its own, and you save lots of money. It does sound like a great deal.
Why 80/20? Why not 90/10? Or 99/1? Well – until someone can figure out how to separate the “AI can do it” part from the “this is what AI can’t do,” it’s just a random number. But the real question isn’t the number.
The real question is, what exactly is in that 80% that AI can – supposedly – do on its own? In her essay on AI Dubbing, Dubformer co-founder Elena Chernysheva writes that AI can do 70-80% of routine tasks, and identifies those tasks as “accurate translation, error-free alignment, basic emotional rendering.” The rest, that magical 20%, is “acting, emotional nuance, and directing the audience’s attention.”
Now, anyone who has ever translated a text (never mind written a dubbing script) knows that “accurate translation” isn’t a routine task. Not when you translate fiction, and definitely not when you translate dialogue for audiovisual material. Try letting DeepL translate a movie script, and you’ll see what I mean. Also – what is “accurate” in translation? Translation involves interpretation, judgement, taste – all the things that AI doesn’t have, and for which “accuracy” is simply the wrong category.
“Error-free alignment?” I assume that means lip sync. Have you ever tried to write a lip-sync text? And do you realize that lip sync depends as much on the actors and the directors of a dub as it does on the script? Also – yes, a closed mouth is a closed mouth. Beyond that, lip sync is extremely subjective, as well as culturally determined. To talk about “error” here is to apply terminology that simply doesn’t fit.
“Basic emotional rendering” is also supposed to be part of the 80% that AI can do on its own. True, AI programs can produce voices that do, for example, angry. But “angry” is at most 5% of what we need for dubbing. What we need to make “angry” sound anywhere close to the complex system that is human communication isn’t an added dollop of “emotional nuance.” It’s a plethora of details and contexts.
Why are you angry? Because you are exasperated, frustrated, helpless, powerless? Because you have been discriminated against, or been treated unfairly? Because you want something that someone is denying you? Because you want to achieve something but don’t have the money, or the skill? Because you made a mistake?
Angry can be wanting to kill someone, but maybe not for real; it’s yelling or whispering, it’s threatening, it’s crying, it’s violent, it’s tired, it’s wanting to run away, or wanting everything to end; it’s asserting superiority and admitting defeat.
This isn’t nuance. This is the basics. It’s the very foundation of this particular emotional expression, anger. It’s extremely personal, yet simultaneously determined by cultural norms that go far beyond the individual, and far beyond the current time. And unless you can analyze it, rationally or emotionally, with empathy or with reason, you cannot replicate or even imitate it. You cannot express it.
Anyone who thinks that one can separate x% basics (AI) from (100 − x)% complexity (human nuance) and end up with 100% performance that moves an audience has not understood the first thing about what makes humans tick.
80/20 is a fictional formula, and it’s a formula that lives in a different universe from language or localization. People who use this formula might as well be talking about woodworking or an oven timer. You just cut 80 centimeters off a 100-centimeter-long stick. Or you stop the clock after 80 minutes of a 100-minute cooking process.
Language isn’t math. Emotion is not a graph. Empathy cannot be measured.
