Available voices for robots

A few days ago, we wrote an article about why we decided that Heasy would not talk from the very beginning.

One reason that the article did not list, but that is still worth mentioning, is that digital voices aren’t really great yet. They tend to sound flat, and it’s hard to sense any emotion in them: happiness, sadness and every other emotion are expressed in the same way.

Honestly, that’s fine in a lot of situations. There’s no reason to have any emotion in the voice of your GPS, for instance. You’re not starting a conversation with a GPS; you’re just listening for directions to follow. The situation is different when you’re involved in a conversation, which is supposed to convey some emotion at some point.

One of the tricks used to work around this is to play with the pitch and the speed of the voice, or to insert sounds: you hear the flat voice, then the sound of someone crying, and then the voice again. The voice is still flat, but the sound is a clue that helps you understand the emotional state, in this case sadness.
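As a rough illustration of this trick, here is a minimal Python sketch that builds SSML markup (the W3C Speech Synthesis Markup Language, which uses `<prosody>` for pitch and rate and `<audio>` for inserted sounds). The function name, the preset values and the sound URL are all assumptions made for the example; it does not describe any particular speech engine, and engines differ in which attribute values they accept.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical presets: the numbers are illustrative, not tuned for any engine.
EMOTION_PROSODY = {
    "sad": {"pitch": "-15%", "rate": "80%"},
    "happy": {"pitch": "+15%", "rate": "110%"},
}
NEUTRAL_PROSODY = {"pitch": "+0%", "rate": "100%"}


def emotional_ssml(before, after, emotion, cue_url=None):
    """Wrap two pieces of text in prosody tags, with an optional sound cue between them."""
    prosody = EMOTION_PROSODY.get(emotion, NEUTRAL_PROSODY)

    def wrap(text):
        # quoteattr() adds the surrounding quotes and escapes the attribute value.
        return "<prosody pitch={} rate={}>{}</prosody>".format(
            quoteattr(prosody["pitch"]), quoteattr(prosody["rate"]), escape(text)
        )

    parts = [wrap(before)]
    if cue_url is not None:
        # The inserted sound (for example, someone crying) is the emotional clue.
        parts.append("<audio src={}/>".format(quoteattr(cue_url)))
    parts.append(wrap(after))
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    # The URL below is a placeholder, not a real asset.
    print(emotional_ssml("I lost my keys.", "I looked everywhere.", "sad",
                         "https://example.com/sob.wav"))
```

The voice itself stays flat here; only the pitch, the speed and the inserted sound change, which is exactly the limitation described above.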

Here comes Duplex

The situation with voices has changed drastically lately. Google made headlines with its demo of Duplex, an AI that makes phone calls and speaks just like a human on the phone. We won’t discuss the AI here, just the voice.

What is fascinating is how they managed to create a voice with inflections, and that the voice can do very human things that are not words, like “eeer” or “hmmmm”. If there were a Turing test for voice, this one would pass it, maybe for the very first time.

And for robotics?

Now, what about robotics? The era of human-like voices is beginning before our eyes. How long will it be before we have these voices embedded in humanoid robots?

The fact is that, so far, this voice has only been demoed in a situation where the speakers don’t actually see each other. Our human-like voice is disembodied. If the voice gets a body with a humanoid shape, it will be extremely embarrassing if the body language isn’t as sharp and precise as the voice. Robot manufacturers have a huge amount of work to do to bring body language to the same level of sophistication as the voice.

With a robot such as Heasy (let’s say for a minute that it can talk), what would we have to play with? Its body is pretty limited: we’d have to work with the eyes (the upper screen), the movement of the body and the tilt of the head. Let’s focus on the eyes: when you talk with someone and you say “hmmmm”, you tend to stare into space. How could we reflect that in the eyes? Probably by changing the orientation of the head and making the eyes smaller on the screen. But this answer wouldn’t work if the eyes were not displayed on a screen but made of something different, such as LEDs.
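To make the idea concrete, here is a small Python sketch of how spoken fillers could be mapped to a "staring into space" pose. Everything in it is hypothetical: the `Pose` fields, the angles, the eye scale and the filler pattern are assumptions for illustration, not Heasy’s actual software or API.

```python
import re
from dataclasses import dataclass

# Matches fillers such as "hmmmm" or "eeer" (assumed pattern, easy to extend).
FILLER = re.compile(r"^(h+m+|e+r+)$", re.IGNORECASE)


@dataclass(frozen=True)
class Pose:
    head_yaw_deg: float   # hypothetical: how far the head turns away
    head_tilt_deg: float  # hypothetical: how far the head tilts
    eye_scale: float      # 1.0 = normal eye size on the screen


NEUTRAL = Pose(head_yaw_deg=0.0, head_tilt_deg=0.0, eye_scale=1.0)
# Illustrative values: head turned away and tilted, eyes drawn smaller.
THINKING = Pose(head_yaw_deg=20.0, head_tilt_deg=10.0, eye_scale=0.6)


def poses_for_utterance(text):
    """Pair each spoken token with the pose the robot should hold while saying it."""
    result = []
    for token in text.split():
        word = token.strip(".,!?")
        pose = THINKING if FILLER.match(word) else NEUTRAL
        result.append((token, pose))
    return result


if __name__ == "__main__":
    for token, pose in poses_for_utterance("Hmmmm, let me check"):
        print(token, pose)
```

The `eye_scale` field only makes sense for eyes drawn on a screen; a robot with LED eyes would need a different field, such as brightness, which is the limitation noted above.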

If you think about more human-like robots, the challenge just keeps getting bigger. The robot has a mouth? The mouth will need to move accordingly: no mouth looks the same when sad or happy. The robot has arms? The arms will have to move accordingly, for the very same reason: you can’t express joy with hanging arms. If the robot says “hmmmm”, it’s probably better to have it touch its chin. The list goes on.

The more human-like the robot, the more human-like its body language will have to be in order to fit with these human-like voices. Hands like clamps will be forgiven more readily than hands with fingers.

Humanoid robots will certainly take advantage of these advanced voices at some point. But it might be a long time before we start hearing them from robots: not because they are technically hard to implement, but because voice is closely tied to body language, where a lot of progress still has to be made.