While fans and producers of AI dubbing keep presenting it as settled fact that AI dubbing is here and market-ready, I’m still trying to wrap my head around something I noticed recently: AI dubbing has it backward.
Let me explain. In the dubbing process, after the script has been written and reviewed and recording begins, someone needs to tell the voice actor how to act. That’s what a dubbing director is there for. Part of what happens in dubbing is imitation: the original actor yells, the dubbing actor yells. This is either satisfying to the director (and everyone else in the recording studio), or it isn’t. If not, the dubbing director will ask the voice actor to adjust their performance.
For those of you who have never attended a recording session: these instructions are very rarely “please yell louder” or “don’t go up with your voice at the end of the sentence”. They sound more like “you’re incredibly terrified when you say that”, “you need to sound more threatening”, or “remember that you’ve said this five times already.”
If this sounds vague, it is. To you and me. Not to a voice actor. Because they will hear this, dig into their emotions, and do what actors do: act, pretend, behave as if they were terrified, threatening, or annoyed, and say the sentence again. And their voice will follow, sounding more terrified, threatening, or annoyed.
Whatever process is used in AI dubbing, whether text-to-speech or speech-to-speech, it begins by imitating. When the imitation isn’t satisfying, whoever adjusts the “performance” will not be able to say, “you’re terrified.” Because there is no “you” who is, or can pretend to be, terrified. This isn’t a problem that can be solved by voicing the commands. Even if the interface between the instruction and the response is verbal, what happens underneath stays the same: the system has analyzed what people sound like when they express, say, anger. It can describe an angry voice in relation to a voice that carries no strong emotion: louder, more emphasis, higher pitch. And one might ultimately be able to prompt a synthetic voice to change all of these parameters, even to a very fine degree. I can tell it to pitch 75% above its neutral baseline, to carry 15% more emphasis, and to be 25% louder.
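To make that last step concrete, here is a minimal sketch, in Python, of what this kind of parametric control typically looks like, using SSML prosody markup (a W3C standard that many text-to-speech engines accept in some form). The function name and the line of dialogue are invented for illustration, the call to an actual synthesis engine is left out because it varies by vendor, and note that SSML expresses loudness in decibels or keywords rather than percentages.

```python
# A rough sketch of parametric voice control using SSML prosody markup.
# The dialogue line and function name are made up for illustration;
# the synthesis call itself is omitted because it differs per vendor.
def make_angrier(line: str) -> str:
    """Wrap a line in markup that raises pitch and loudness and adds
    emphasis, relative to the voice's neutral baseline."""
    return (
        "<speak>"
        '<prosody pitch="+75%" volume="+6dB">'  # loudness in dB, not percent
        '<emphasis level="strong">'
        f"{line}"
        "</emphasis>"
        "</prosody>"
        "</speak>"
    )

print(make_angrier("Get out of my house."))
```

Notice what the markup contains: numbers and levels, not an emotion. Everything the director would have said in one sentence has to be translated into parameter values first.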
Not only does that seem highly inefficient to me compared to feeling an emotion and letting one’s own voice do its thing more or less automatically, it also reminds me of something humans have done for as long as we have existed: adjusting the parameters of our speech to make the listener believe we’re feeling an emotion that isn’t there. Us humans, we have a word for that. It’s called lying.
