Nine years ago, my late wife developed a tumor in her throat next to her vocal cords. She was fighting cancer while trying to be a mom to our 3 young boys. Directed radiation treatment was ruled out for this tumor, leaving surgery as the only viable option. The downside was the very real risk of her permanently losing her voice.
Hoping that she’d one day beat the cancer but might not have a voice, I came up with the idea of trying to “capture it” in 2009 - hoping that it could be algorithmically rebuilt in the future. I reached out to a number of individuals who ultimately put me in touch with a research group that had a proprietary setup for capturing samples and rebuilding the voice. Over the Thanksgiving break, I managed to get access to a soundproof recording room, and they worked with my wife to capture samples over a period of 4 hours.
Having worked in the infosec space since the 90s, my first reaction is often either how new tech/innovation can be used to bypass a control or how one could detect/prevent that. It’s easy to lose sight of how something like this could fundamentally change a person’s life.
This is a great post, although I am sorry for the experiences you went through to acquire this perspective.
Thinking more about the specific use-case you have in mind, I find myself wondering how sentiment and inflection might be captured via a synthetic voice. Would it be inferred from context? How would that inference deal with things like sarcasm/irony? I wonder if there could be some input mechanism for controlling the inflection - what would that input interface look like? Could it go off facial expression?
I wonder where the existing tech sits in the uncanny valley for this space...
I went camping in the alps once. On our last night, my friend took a bowl and gathered ashes from the campfire. Half ritualistic, half jokingly, she said that those ashes mark our trek and experiences, that she would carry the ashes back home no matter what.
I was very confused about how she would get a bowl of unidentified ash past airport security (we only had a backpack each). She drafted a poorly done and obviously fake death certificate. It was not campfire ash anymore; it was the remains of her father.
The people at the airport were visibly awkward; they tried to be as accommodating as they could. She flew back home with a plastic bowl of ashes from our campfire - it even had some bits of birch and branches in it.
Airport security was easily fooled. And the author's mom is easily fooled too, motherly instincts be damned. Would a neural net be fooled by the author's attempts? I know for sure that an automated security system would sound the alarm on my friend. I'd like to see adversarial networks fighting each other on such premises: a son network trying to fool the mother network and vice versa, ad infinitum, at least a billion simulation hours in. What kind of wonders would come out?
I think 'fooled' might be the wrong word to describe the airport security staff. "Had to make a judgement call and decided to err on the side of lenience" is probably a fairer description.
I can easily see half a dozen problems you can wreak havoc with using ashes on an airplane. Ashes or any dust, actually. Remember, it's essentially a closed, hermetic environment.
I agree that in the short term adversarial neural networks may be a good line of defense against machine-fabricated audio and video, but in the long term it’s a losing battle. Eventually the neural networks producing the audio and video will become pixel perfect, and at that point no neural network would be able to detect the manipulation. I think we need to seek out a different, more future-proof solution to the problem.
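A toy illustration of that end state: as the fake-sample distribution converges on the real one, even a well-trained detector's accuracy collapses to coin-flip level. This is only a sketch - a 1-D logistic-regression "detector" with Gaussian stand-ins for audio features, not a real deepfake pipeline - but the mechanism is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_detector(real, fake, steps=5000, lr=0.1):
    """Fit a 1-D logistic-regression detector estimating P(sample is real)."""
    X = np.concatenate([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * X + b)))
        w -= lr * np.mean((p - y) * X)   # log-loss gradient w.r.t. w
        b -= lr * np.mean(p - y)         # log-loss gradient w.r.t. b
    return w, b

def accuracy(w, b, real, fake):
    score = lambda x: 1.0 / (1.0 + np.exp(-(w * x + b)))
    correct = np.sum(score(real) > 0.5) + np.sum(score(fake) <= 0.5)
    return correct / (len(real) + len(fake))

real = rng.normal(0.0, 1.0, 5000)   # stand-in for features of genuine audio
results = {}
for offset in [3.0, 1.0, 0.0]:      # the "generator" closing the gap
    fake = rng.normal(offset, 1.0, 5000)
    w, b = train_detector(real, fake)
    results[offset] = accuracy(w, b, real, fake)
    print(f"generator offset {offset}: detector accuracy {results[offset]:.2f}")
```

Once the fake distribution matches the real one (offset 0), accuracy falls toward 0.5 - the detector is just guessing, no matter how long you train it.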
Many years ago, I briefly used to do the reverse. I was tired of constant calls by certain people, so I jokingly started to pick up and say "the subscriber is currently unavailable; please leave a message after the tone <BEEP>". Fooled two people with it before realizing that's a little too disrespectful and stopping.
I did this one time as a joke, the other person hung up immediately. I only figured out weeks later it was a good friend I hadn't seen in ages who had happened to be in town that day and had wanted to get together.
They were ... not very happy when we realized what had happened.
I do recommend watching the episode of Follow This as suggested in the article (episode 7) if you’re interested in the latest deepfake tech, and its implications for fooling people who can’t easily tell it’s fake.
It's legal if you can use the site just normally after you've pressed the Reject All button and no data is collected in that case. Otherwise, it doesn't comply with GDPR. (Disclaimer: IANAL)
I was tempted to sign up and try using it for our daily scrum standup conf calls for fun. But after thinking about it I'm also terrified of the possibility that my account or data could be compromised. Imagine the damage someone could do by calling up a relative, posing as me, and saying that I'm in trouble and need money or something?
I was hoping there was another method of doing this instead of playing the audio file out the speaker and using the phone in speakerphone mode.
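There is at least one way to skip the speakerphone step for softphone/VoIP calls: route the audio through a virtual device so the app's "microphone" is fed directly. A sketch, assuming a Linux box with PulseAudio; `voice_pipe` and `voice.wav` are placeholder names:

```shell
# Create a null sink; audio played into it goes nowhere audible
pactl load-module module-null-sink sink_name=voice_pipe

# Use the sink's monitor as the default recording source,
# so the calling app "hears" whatever is played into voice_pipe
pactl set-default-source voice_pipe.monitor

# Play the synthesized clip straight into the virtual microphone
paplay --device=voice_pipe voice.wav
```

This avoids the room acoustics and speaker/mic quality loss entirely, though it obviously doesn't help with a plain cell phone call.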
I have a stutter that is especially bad when I first talk on the phone. I used to do something similar where I would record an introduction and then play it when the phone connected.
The quality wasn't great, but it was better than me not being able to say anything.
The only reason this 'fooled' his mom is that cell phone audio is already bad, so the obvious garbles, fluctuating intonation and weird pauses seem normal.
So it's impressive, but let's not get ahead of ourselves here.
That's not as big a drawback as you might think. There's a guy on YouTube[1] with a channel fashioned after one of those Saturday morning edutainment shows who debunks hoax videos, and one of the tricks he frequently points out is when the video quality has been deliberately degraded to mask editing flaws. It's good enough to fool anybody not looking for it.
I wonder if this would fool the "My voice is my passport" identification systems that I noticed cropped up on a few of the telephone services I was using in the UK?
> required
Is this even legal according to GDPR?
https://screenshots.firefox.com/MnEgMtsGavMlxcts/www.buzzfee...
I did not work on Follow This, although I do appear in a B-roll shot. https://twitter.com/minimaxir/status/1034109759295647745
“Sure honey.”
“200 OK. Eh I mean, thanks.”
[1] https://www.youtube.com/channel/UCEOXxzW2vU0P-0THehuIIeg