From what I’ve learned so far about AI dubbing, one realization stands out: it’s based on the written word. This comes from something I heard in the AVT Masterclass course on AI Dubbing – that all AI dubbing programs and applications are based on subtitles, which means they are based on the written word. The basis for an AI dub is a transcription of the dialogue, and a translation of that written text into another written text in a different language, which is then voiced.
I use “translation” in a relatively wide sense of the word. All of us in the industry at the creative level, but shockingly few at the administrative or technical levels, understand that a literal, line-by-line translation, without taking into account the context of the visual image, is completely useless when it comes to localizing films, games, or series.
Just a small example: I put three lines of a very simple dialogue into DeepL:
RAB Where were you?
LIAM Out.
STAN Is that all you can say, out?
And DeepL translated it into German for me:
RAB Wo warst du?
LIAM Draußen.
STAN Ist das alles, was du sagen kannst, raus?
Maybe some algorithm told DeepL that repetition isn’t good style, but apart from the fact that “raus” isn’t at all an answer to the question “wo warst du/where were you?”, the fact that Stan repeats Liam’s snappy answer verbatim is the point of the entire dialogue.
Another one:
KEITH: Word of advice. Keep your traveller’s cheques in a bum bag.
DAWN: Thanks. I’ll, I’ll buy one.
KEITH: What, when you get there?
In German:
KEITH: Ein Ratschlag. Bewahren Sie Ihre Reiseschecks in einer Gürteltasche.
DAWN: Danke. Ich… ich werde einen kaufen.
KEITH: Was, wenn du da bist?
Wrong gender agreement, inconsistent formality level, and the verb “keep” is mistranslated (“bewahren” means “to preserve”, when what is meant here is “to store”). I’m really not very good at math and numbers, but I would calculate that by the time the automatic translation of the dialogue has been checked and corrected – and if you add the problem of getting an even reasonably correct STT dialogue script to begin with – you are spending just as much on the automatic translation of the script as you would have spent employing a human translator who is used to reviewing, working with, or fixing the script, and to translating it into target-language dialogue that makes sense.
This is just the first step of the process. Now this foreign-language dialogue needs to become a script that can be voiced. If you want this to be an actual lip-sync dub, one that an audience will accept, you’ll have to adjust the translation to comply with the basic synchronies – time and lip flaps. I have not yet seen any program that comes even close to doing this adequately. And we haven’t even looked at rhythm and body movement. So a human is – still – necessary and will be for quite some time, adding to the cost you thought you had saved through automatic translation. And just to be clear – the “adjustment” isn’t just a matter of replacing one word with another that happens to have the right bilabials. The adjustment could mean that a sentence like “32-year-old male, tried to beat a speeding train. Didn’t make it” becomes “Autounfall an nem Bahnübergang. Er wollte noch rüber, aber der Zug war schneller”, which, translated back, would be “Car crash at a railroad crossing. He wanted to get across, but the train was faster.” Every change here was made not for lip sync, but for reasons of context, understandability, cultural difference, and dramaturgy.
Let’s say, for the sake of argument, that all these problems will be solved by the smart people working on them right now (and I mean this with the utmost respect – I can totally see how this would be a fascinating challenge). Then at some point we would have a very good imitation of a translation. But …
Translation isn’t what dubbing is about AT ALL. Translation and dubbing scriptwriting exist only to enable a performance by a dubbing actor. A live performance, with a studio audience who will – in place of the applause for a stage performance – approve or disapprove, make the actor give it another go, or ask them to move on to the next sentence. The text this is based on is completely irrelevant as a text. It’s dead and useless without the person performing it. And here is where the entire idea of AI dubbing rests on a faulty premise when it comes to art and entertainment and “content” that’s supposed to touch an audience’s heart. Dubbing isn’t text, and it isn’t translation. Dubbing is acting, it is performance. Every good dub reflects the power of live performance, about which I have written in an earlier blog entry. It has that kernel of truth in performance that is so intangible, but so powerful.
This is why the other approach to lip-sync dubbing, providing a wonderful target-language script and then adjusting the lips, the face, the eyes, and the body movements of the original performers, would be just as difficult as adjusting the text to fit a fixed image. Because the lips, the face, the eyes, and the body movements of the original actors are also, originally, a live performance. Something that happens in real space where people breathe and are surrounded by air and light and sound.
AI dubbing will always be unsuccessful unless it comes up, not with ever-perfect translations, but with a way to invent, or construct, or imitate PERFORMANCE. And I have serious doubts that this is possible.
