Audio deepfakes are already convincing, and there’s every reason to expect their quality to keep improving. Even when people are trying their hardest, they apparently are not great at telling genuine voices from artificially generated ones. What’s worse, a new study indicates that people currently can’t do much about it, even after trying to improve their detection skills.
According to a study published today in PLOS One, deepfaked audio already fools human listeners in roughly one of every four attempts. The troubling statistic comes courtesy of researchers at the UK’s University College London, who recently asked over 500 volunteers to review a combination of deepfaked and genuine voices in both English and Mandarin. Some of those participants were given examples of deepfaked voices ahead of time to help prepare them for identifying artificial clips.
Regardless of training, however, the researchers found that participants correctly identified the deepfakes about 73 percent of the time on average. While technically a passing grade by most academic standards, the error rate is enough to raise serious concerns, especially since accuracy was, on average, the same for those with and without the pre-trial training.
This is extremely troubling given what deepfake tech has already managed to achieve over its short lifespan. Earlier this year, for example, scammers nearly succeeded in extorting ransom money from a mother using deepfaked audio of her daughter supposedly being kidnapped. And she is far from alone in dealing with such terrifying situations.
The results are even more concerning when you read (or, in this case, listen) between the lines. Researchers note that their participants knew going into the experiment that their objective was to listen for deepfaked audio, which likely primed some of them to be on high alert for forgeries. This implies that unsuspecting targets may well perform worse than those in the experiment. The study also notes that the team did not use particularly advanced speech synthesis technology, meaning more convincing generated audio already exists.
Interestingly, when they were correctly flagged, deepfakes’ potential giveaways differed depending on which language participants spoke. Those fluent in English most often reported “breathing” as an indicator, while Mandarin speakers focused on fluency, pacing, and cadence for their tell-tale signs.
For now, however, the team concludes that improving automated detection systems is a valuable and realistic goal for combatting unwanted AI vocal cloning, but it also suggests that crowdsourcing human analysis of deepfakes could help matters. Regardless, it’s yet another argument in favor of establishing intensive regulatory scrutiny and assessment of deepfakes and other generative AI tech.