Nine years ago, my late wife developed a tumor in her throat next to her vocal cords. She was fighting cancer while trying to be a mom to our 3 young boys. Directed radiation treatment was ruled out for this tumor, leaving surgery as the only viable option. The downside was the very real risk of her permanently losing her voice.
Hoping that she’d one day beat the cancer but might not have a voice, I came up with the idea of trying to “capture it” in 2009 - hoping that it could be algorithmically rebuilt in the future. I reached out to a number of individuals who ultimately put me in touch with a research group that had a proprietary setup for capturing samples and rebuilding the voice. Over the Thanksgiving break, I managed to get access to a soundproof recording room, and they worked with my wife to capture samples over a period of 4 hours.
Having worked in the infosec space since the 90s, my first reaction is often either how new tech/innovation can be used to bypass a control or how one could detect/prevent that. It’s easy to lose sight of how something like this could fundamentally change a person’s life.
This is a great post, although I am sorry for the experiences you went through to acquire this perspective.
Thinking more about the specific use-case you have in mind, I find myself wondering how sentiment and inflection might be captured via a synthetic voice. Would it be inferred from context? How would that inference deal with things like sarcasm/irony? I wonder if there could be some input mechanism for controlling the inflection - what would that input interface look like? Could it go off facial expression?
I wonder where the existing tech sits in the uncanny valley for this space...
I went camping in the alps once. On our last night, my friend took a bowl and gathered ashes from the campfire. Half ritualistic, half jokingly, she said that those ashes mark our trek and experiences, that she would carry the ashes back home no matter what.
I was very confused about how she would get a bowl of unidentified ash past airport security (we only had a backpack each). She drafted a poorly done and obviously fake death certificate. It was not campfire ash anymore; it was the remains of her father.
The people at the airport were visibly awkward; they tried to be as accommodating as they could. She flew back home with a plastic bowl of ashes from our campfire - it even had some bits of birch and branches in it.
Airport security was easily fooled. And the author's mom is easily fooled too, motherly instincts be damned. Would a neural net be fooled by the author's attempts? I know for sure that an automated security system would sound the alarm on my friend. I'd like to see adversarial networks fighting each other on such premises: a son network trying to fool the mother network and vice versa, ad infinitum, at least a billion simulation hours in. What kind of wonders would come out?
I think 'fooled' might be the wrong word to describe the airport security staff. "Had to make a judgement call and decided to err on the side of lenience" is probably a fairer description.
I can easily see half a dozen problems you can wreak havoc with using ashes on an airplane. Ashes or any dust, actually. Remember, it's essentially a closed, hermetic environment.
I agree that in the short term adversarial neural networks may be a good line of defense against machine-fabricated audio and video, but in the long term it’s a losing battle. Eventually the neural networks producing the audio and video will become pixel perfect, and at that point no neural network would be able to detect the manipulation. I think we need to seek out a different, more future-proof solution to the problem.
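A toy illustration of that end state: as the fake-sample distribution converges on the real one, even a well-trained detector's accuracy collapses to coin-flip level. This is only a sketch - a 1-D logistic-regression "detector" with Gaussian stand-ins for audio features, not a real deepfake pipeline - but the mechanism is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_detector(real, fake, steps=5000, lr=0.1):
    """Fit a 1-D logistic-regression detector estimating P(sample is real)."""
    X = np.concatenate([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * X + b)))
        w -= lr * np.mean((p - y) * X)   # log-loss gradient w.r.t. w
        b -= lr * np.mean(p - y)         # log-loss gradient w.r.t. b
    return w, b

def accuracy(w, b, real, fake):
    score = lambda x: 1.0 / (1.0 + np.exp(-(w * x + b)))
    correct = np.sum(score(real) > 0.5) + np.sum(score(fake) <= 0.5)
    return correct / (len(real) + len(fake))

real = rng.normal(0.0, 1.0, 5000)   # stand-in for features of genuine audio
results = {}
for offset in [3.0, 1.0, 0.0]:      # the "generator" closing the gap
    fake = rng.normal(offset, 1.0, 5000)
    w, b = train_detector(real, fake)
    results[offset] = accuracy(w, b, real, fake)
    print(f"generator offset {offset}: detector accuracy {results[offset]:.2f}")
```

Once the fake distribution matches the real one (offset 0), accuracy falls toward 0.5 - the detector is just guessing, no matter how long you train it.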
Many years ago, I briefly used to do the reverse. I was tired of constant calls by certain people, so I jokingly started to pick up and say "the subscriber is currently unavailable; please leave a message after the tone <BEEP>". Fooled two people with it before realizing that's a little too disrespectful and stopping.
I did this one time as a joke, the other person hung up immediately. I only figured out weeks later it was a good friend I hadn't seen in ages who had happened to be in town that day and had wanted to get together.
They were ... not very happy when we realized what had happened.
I do recommend watching the episode of Follow This as suggested in the article (episode 7) if you’re interested in the latest deepfake tech, and its implications for fooling people who can’t easily tell it’s fake.
It's legal if you can use the site just normally after you've pressed the Reject All button and no data is collected in that case. Otherwise, it doesn't comply with GDPR. (Disclaimer: IANAL)
I was tempted to sign up and try using it for our daily scrum standup conf calls for fun. But after thinking about it I'm also terrified of the possibility that my account or data could be compromised. Imagine the damage someone could do by calling up a relative, posing as me, and saying that I'm in trouble and need money or something?
I was hoping there was another method of doing this instead of playing the audio file out the speaker and using the phone in speakerphone mode.
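There is at least one way to skip the speakerphone step for softphone/VoIP calls: route the audio through a virtual device so the app's "microphone" is fed directly. A sketch, assuming a Linux box with PulseAudio; `voice_pipe` and `voice.wav` are placeholder names:

```shell
# Create a null sink; audio played into it goes nowhere audible
pactl load-module module-null-sink sink_name=voice_pipe

# Use the sink's monitor as the default recording source,
# so the calling app "hears" whatever is played into voice_pipe
pactl set-default-source voice_pipe.monitor

# Play the synthesized clip straight into the virtual microphone
paplay --device=voice_pipe voice.wav
```

This avoids the room acoustics and speaker/mic quality loss entirely, though it obviously doesn't help with a plain cell phone call.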
I have a stutter that is especially bad when I first talk on the phone. I used to do something similar where I would record an introduction and then play it when the phone connected.
The quality wasn't great, but it was better than me not being able to say anything.
The only reason this 'fooled' his mom is that cell phone audio is already bad, so the obvious garbles, fluctuating intonation and weird pauses seem normal.
So it's impressive, but let's not get ahead of ourselves here.
That's not as big a drawback as you might think. There's a guy on YouTube[1] with a channel fashioned after one of those Saturday morning edutainment shows who debunks hoax videos, and one of the tricks he frequently points out is when the video quality has been deliberately degraded to mask editing flaws. It's good enough to fool anybody not looking for it.
I wonder if this would fool the "My voice is my passport" identification systems that I noticed cropped up on a few of the telephone services I was using in the UK?
> required
Is this even legal according to GDPR?
https://screenshots.firefox.com/MnEgMtsGavMlxcts/www.buzzfee...
I did not work on Follow This, although I do appear in a B-roll shot. https://twitter.com/minimaxir/status/1034109759295647745
“Sure honey.”
“200 OK. Eh I mean, thanks.”
[1] https://www.youtube.com/channel/UCEOXxzW2vU0P-0THehuIIeg