ChatGPT passed the Turing Test. Now what?

It seems that every day brings a new headline about the burgeoning capabilities of large language models (LLMs) like ChatGPT and Google’s Gemini—headlines that are either exciting or increasingly apocalyptic, depending on one’s point of view.

One particularly striking story arrived earlier this year: a paper describing how an LLM had passed the Turing Test, an experiment devised in the 1950s by computer science pioneer Alan Turing to determine whether machine intelligence could be distinguished from that of a human. The LLM in question was ChatGPT 4.5, and the paper found that it had been strikingly successful in fooling people into thinking it was human: in an experiment where participants had to decide which of two hidden interlocutors, the chatbot or an actual human, was the real person, nearly three out of four chose the chatbot.

This sounds…significant. But how, exactly? What does it all mean?

What the Turing Test is—and what it isn’t

To answer that question, we first need to look at what the Turing Test is, and what it means for an LLM to pass or fail it.

Cameron Jones, a postdoctoral researcher at UC San Diego and one of the co-authors of the new paper, explains that Turing introduced the idea of the test in his seminal 1950 paper “Computing Machinery and Intelligence.” The paper set out to address a big, fundamental question that occupied the minds of Turing’s contemporaries: “Can machines think?”

In his paper, Turing quickly rejects the question as ambiguous and non-rigorous, because it is not clear what either a “machine” or “thinking” means in this context. He argues that a more tractable question is required: can a machine act in such a manner that its actions are indistinguishable from those of a human? To answer this question, he proposes what he calls “The Imitation Game,” and it’s this exercise that has since come to be referred to as simply “The Turing Test.”

Alan Turing, the pioneering English mathematician and computer scientist whose ideas shaped modern computing. Image: Public Domain

The test involves one person—the “interrogator”—communicating simultaneously with two hidden interlocutors, referred to as the “witnesses.” All communication is written. The twist is that while one of these witnesses is a real person, the other is a machine. The point of the game, Jones explains, “is for the interrogator to figure out which of the two witnesses is the real human.”

Jones and his team performed this experiment with four models. ChatGPT 4.5 was by far the most successful: 73% of participants identified it as the real human. Another model that goes by the unwieldy name LLaMa-3.1-405B was identified as human 56% of the time. (The other two—ELIZA, a rule-based chatbot from the 1960s, and GPT-4o—achieved 23% and 21% success rates, respectively, and will not be spoken of again.)
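For readers who think in code, here is a minimal sketch of that setup. The witness functions are hypothetical stand-ins (a person at a keyboard, an LLM API call), and only the shape of the game, with its hidden assignment, parallel questioning, and forced verdict, mirrors the experiment described above.

```python
import random

# A minimal sketch of the three-party test. human_reply and model_reply are
# hypothetical stand-ins; a real study would wire them to a person at a
# keyboard and to an LLM API, respectively.

def human_reply(question: str) -> str:
    return input(f"(human witness) {question}\n> ")

def model_reply(question: str) -> str:
    return "stand-in model answer"  # an LLM call would go here

def run_round(questions: list[str]) -> bool:
    """Run one conversation; return True if the machine is mistaken for the human."""
    machine_label = random.choice(["A", "B"])  # hide which witness is which
    witnesses = {
        machine_label: model_reply,
        ("A" if machine_label == "B" else "B"): human_reply,
    }
    for q in questions:
        for label in ("A", "B"):
            print(f"Witness {label}: {witnesses[label](q)}")
    verdict = input("Which witness is the human, A or B? ").strip().upper()
    return verdict == machine_label  # fooled if they point at the machine

# Across many such rounds, the share of True results is the "judged to be
# human" rate: the figure reported as 73% for ChatGPT 4.5.
```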

What does ChatGPT passing the Turing Test mean?

The results for ChatGPT 4.5 and LLaMa are striking enough, but the really interesting question is what their success signifies.

It’s important to note from the outset that the test isn’t designed to detect machine intelligence. In rejecting the question “Can machines think?” Turing also neatly sidesteps the thorny question of exactly who is doing the thinking if the answer is “yes.” Consider René Descartes’ famous assertion cogito, ergo sum, “I think, therefore I am,” which essentially holds that thought presupposes a conscious thinker.

However, Turing’s paper does argue that success in the Imitation Game means that we can’t deny the possibility that genuine machine intelligence is at work. As Jones explains, Turing “basically [argued] that if we could build a machine that was so good at this game that we couldn’t reliably tell the difference between the witnesses, then essentially we’d have to say that that machine was intelligent.”

Modern readers might well recoil from such an assessment, so it’s worth looking at Turing’s line of reasoning, which went as follows:

  1. We can’t know that our fellow humans are intelligent. We can’t inhabit their minds or see through their eyes.
  2. Nevertheless, we accept them as intelligent.
  3. How do we make this judgment? Turing argues that we do so on the basis of our fellow humans’ behavior.
  4. If we attribute intelligence based on behavior, and we encounter a situation where we can’t distinguish between a machine’s behavior and that of a human, we should be prepared to conclude that the machine’s behavior also indicates intelligence.

Again, readers might argue that this feels wrong. And indeed, the key point of contention is Turing’s premise that we attribute intelligence on the basis of behavior alone. We’ll address counter-arguments in due course, but first, it’s worth thinking about what sort of behavior we might feel conveys intelligence.

Why Turing selected language as a test for machines

It seems no accident that Turing chose language as the medium through which his “Imitation Game” would be conducted. After all, there are many obvious ways in which a machine could never imitate a human, and equally, there are many ways in which a person could never imitate a machine. Printed language, however, is simply a set of letters on a page. It says nothing about whether it was produced by a human with a typewriter or a computer with a printer.

Nevertheless, the simple presence of language comes with a whole set of assumptions. Ever since our ancestors first started putting together sentences, language has—as far as we can tell, at least—been the exclusive domain of humanity (though some apes are getting close).

This has also been the case for the type of intelligence that we possess—other animals are clever, but none of them seem to think the way we do, or possess the degree of self-consciousness that humans demonstrate. On that basis, it’s almost impossible not to conflate language and intelligence. This, in turn, makes it very difficult not to instinctively attribute some degree of intelligence to anything that appears to be talking to you.

This point was made eloquently in a recent essay by Rusty Foster, author of the long-running newsletter Today in Tabs. Foster argues that we tend to conflate language with intelligence because until now, the presence of the former has always indicated the presence of the latter. “The essential problem is this: generative language software is very good at producing long and contextually informed strings of language, and humanity has never before experienced coherent language without any cognition driving it,” writes Foster. “In regular life, we have never been required to distinguish between ‘language’ and ‘thought’ because only thought was capable of producing language.”

Foster makes an exception for “trivial” examples, but even these are surprisingly compelling to us. Consider, for example, a parrot. It’s certainly disconcerting to hear a bird suddenly speaking in our language—but, crucially, it’s also almost impossible not to talk back. (Viewers with a tolerance for profanity might enjoy this example, which features a very Australian woman arguing with a very Australian parrot over the intellectual merits of the family dog.) Even though we know that parrots don’t really know what they’re “saying,” the presence of language demands language in response. So what about LLMs? Are they essentially energy-hungry parrots? 

“I think [this has] been one of the major lines of criticism” of the Turing Test, says Jones. “It’s a super behaviorist perspective on what intelligence is—that to be intelligent is to display intelligent behavior. And so you might want to have other conditions: You might require that a machine produce the behavior in the right kind of way, or have the right kind of history of interaction with the world.”

A parrot can mimic human language with surprising clarity, though that doesn’t mean the parrot understands what it’s saying. Image: DepositPhotos

The Chinese Room thought experiment

There are also thought experiments that challenge the Turing Test’s assumption that behavior indistinguishable from intelligent behavior implies the presence of genuine intelligence. Jones cites John Searle’s Chinese Room thought experiment, presented in a paper published in 1980, as perhaps the best known of these. In the paper, Searle imagines himself placed in a room where someone is passing him pieces of paper under the door. These pieces of paper have Chinese characters written on them. Searle speaks no Chinese, but he has been provided with a book of detailed instructions about how to draw Chinese characters, along with rules specifying which characters to send back in response to those he receives under the door.

To a person outside, it might appear that Searle speaks perfect Chinese when in reality, he is simply following instructions—a program—that tells him which characters to draw and how to draw them. As Searle explains in his paper, “It seems to me quite obvious in the example that I do not understand a word of the Chinese stories. I have inputs and outputs that are indistinguishable from those of the native Chinese speaker, and I can have any formal program you like, but I still understand nothing.”
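Stripped to its bones, the room is a rule-following procedure, something like a lookup table. The toy sketch below (with invented rules; Searle’s scenario involves no actual code) is meant only to show that such rules can be executed without any grasp of what the symbols mean.

```python
# A toy rendering of Searle's room: the "program" is just a table mapping
# incoming symbols to outgoing symbols. The rules are invented for this
# illustration; following them requires no understanding of Chinese.

RULE_BOOK = {
    "你好吗？": "我很好，谢谢。",      # "How are you?" -> "I'm fine, thanks."
    "你会说中文吗？": "当然会。",      # "Do you speak Chinese?" -> "Of course."
}

def room_reply(slip_of_paper: str) -> str:
    # Follow the instructions mechanically; fall back to a stock reply otherwise.
    return RULE_BOOK.get(slip_of_paper, "请再说一遍。")  # "Please say that again."

print(room_reply("你好吗？"))  # looks fluent from outside the door
```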

This argument is an explicit rejection of the Turing Test’s premise. With it, Searle proposes a crucial distinction between understanding and appearing to understand, between thinking and appearing to think.

Tweaking ChatGPT to fool people

It also demonstrates another potential issue with the Turing Test: The Chinese Room is clearly designed with the express purpose of fooling the person on the other side of the door—or, to put it another way, it’s a program designed specifically to pass the Turing Test. With this in mind, it’s worth noting that in Jones’s experiment, the LLMs that passed the test required a degree of tweaking and tuning to be convincing. Jones says that his team tested a large number of prompts for the chatbot, and one of the key challenges was “getting [the model] to not do stuff that ChatGPT does.”

Some of the ways that Jones and team got ChatGPT to not sound like ChatGPT are certainly fascinating, and again they revolve around the nuances of language. “You want it to not always speak in complete sentences,” says Jones. “There’s a kind of casual way that people speak when they’re texting—it’s just like sentence fragments. You need to get that sort of thing in.” 

Additionally, the team experimented with having ChatGPT make spelling errors to sound more human. Typos are “actually quite hard to get right. If you just tell an LLM to try really hard to make spelling errors, they do it in every word, and the errors are really unconvincing. I don’t think they have a good model of what a keyboard substitution looks like, where you hit the wrong key in a word.”
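To make the idea concrete, here is a rough sketch of what a keyboard-adjacency substitution can look like in code. The adjacency table and the error rate are invented for illustration; they are not drawn from the study or from the team’s prompts.

```python
import random

# Illustrative only: fake the "hit the wrong neighboring key" typos described
# above. The adjacency map covers a handful of QWERTY keys, and the 5% rate
# is an arbitrary choice, not a figure from the study.

QWERTY_NEIGHBORS = {
    "a": "qws", "e": "wrd", "i": "uok", "n": "bm",
    "o": "ipl", "s": "adw", "t": "ryg",
}

def add_keyboard_typos(text: str, rate: float = 0.05) -> str:
    out = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and random.random() < rate:
            out.append(random.choice(neighbors))  # substitute an adjacent key
        else:
            out.append(ch)
    return "".join(out)

print(add_keyboard_typos("honestly no idea, what do you think?"))
```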

Why ChatGPT outperformed the other LLMs

LLMs are difficult subjects for research—by their very nature, their internal operations are fundamentally inscrutable. Even the aspects of their construction that can be studied are often hidden behind NDAs and layers of corporate secrecy. Nevertheless, Jones says, the experiment did reveal some things about what sort of LLM is best equipped to perform a credible imitation of a human: “ChatGPT 4.5 is rumored to be one of the biggest models, and I think that being a large model is really helpful.”

What does “big” mean in this sense? A large codebase? A large dataset? No, says Jones. He explains that a big model has a relatively large number of internal variables (its parameters) whose values are tuned as the model hoovers up training data. “One of the things you see [is that] the smaller distilled models often can mimic good performance in math, and even in quite simple reasoning. But I think it’s the really big models that tend to have good social, interpersonal behavioral skills.”
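As a back-of-the-envelope illustration of what those internal variables add up to, consider a single feed-forward block of a transformer-style model. The sizes below are invented for the example and are not the specifications of any model mentioned in this article.

```python
# Counting the tunable "internal variables" (parameters) of a toy
# feed-forward block. The sizes are illustrative, not any real model's specs.

d_model = 4096        # width of each token's internal representation
d_ff = 4 * d_model    # hidden width of the feed-forward layer
n_layers = 100        # number of stacked blocks

# Each block's feed-forward layer holds two weight matrices plus bias vectors.
ff_params_per_layer = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
total_ff = n_layers * ff_params_per_layer
print(f"{total_ff:,} parameters in the feed-forward layers alone")  # ~13.4 billion

# Scale the widths and depth further, and add attention and embedding weights,
# and totals reach the hundreds of billions; the 405B in LLaMa-3.1-405B is
# exactly such a parameter count.
```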

Even the computer programmers who created artificial intelligence don’t know how it works. Credit: TED-Ed

Did Turing predict ChatGPT?

So did Turing ever conceive of his test as something that would actually be carried out? Or was it more of a thought experiment? Jones says that the answer to that question continues to be the subject of debate amongst Turing scholars. For his part, Jones says that he is “just drawing on the paper itself. I think you can read the paper quite literally, as a suggestion that people could run this experiment at some point in the future.”

Having said that, Jones also points out, “I think it’s clear that Turing is not laying out a methodology. I mean, I think he doesn’t imagine this experiment would be worth running for decades. So he’s not telling you how long it should be or, you know, if there’s any rules and what they can talk about.”

If Turing did envisage the test might be passable, he certainly knew that it wouldn’t happen in the 1950s. Nevertheless, his paper makes it clear that he did at least imagine the possibility that one day we might build machines that would succeed: “We are not asking whether all digital computers would do well in the game nor whether the computers at present available would do well, but whether there are imaginable computers which would do well,” he writes.

Turing has often been described—rightly—as a visionary, but there’s one passage in the 1950 paper that’s genuinely startling in its prescience. “I believe that in about 50 years’ time it will be possible to programme computers…to make them play the imitation game so well that an average interrogator will not have more than [a] 70 per cent chance of making the right identification after five minutes of questioning.”

It took 75 years, not 50, but here we are, confronted by a computer—or, at least, a computer-driven model—that fools people roughly 70% of the time, leaving interrogators far short of the 70 percent chance of a correct identification that Turing allowed them.

What makes human intelligence unique, anyway?

This all brings us back to the original question: what does it all mean? “That’s a question I’m still struggling with,” Jones laughs.

“One line of thinking that I think is useful is that the Turing Test is neither necessary nor sufficient evidence for intelligence—you can imagine something being intelligent that doesn’t pass, because it didn’t use the right kind of slang, and you can also imagine something that does pass that isn’t intelligent.”

Ultimately, he says, the key finding is exactly what it says on the tin: “It’s evidence that these models are becoming able to imitate human-like behavior well enough that people can’t tell the difference.” This, clearly, has all sorts of social implications, many of which appear to interest the public and the scientific community far more than they interest the companies making LLMs.

There are also other philosophical questions raised here. Turing addresses several of these in his paper, most notably what he calls the “Argument from Consciousness.” Even if a machine is intelligent, is it conscious? Turing uses the example of a hypothetical conversation between a person and a sonnet-writing machine—one that sounds strikingly like the sort of conversation one can have with ChatGPT today. The conversation provides an example of something that could be examined “to discover whether [its author] really understands [a subject] or has ‘learnt it parrot-like.’”

Of course, there are many more philosophical questions at play here. Perhaps the most disquieting is this: if we reject the Turing Test as a reliable method of detecting genuine artificial intelligence, do we have an alternative? Or, in other words, do we have any reliable method of knowing when (or if) a machine could possess genuine intelligence?

“I think most people would say that our criteria for consciousness [should] go beyond behavior,” says Jones. “We can imagine something producing the same behavior as a conscious entity, but without having the conscious experience. And so maybe we want to have additional criteria.”

What those criteria should be—or even whether reliable criteria exist for a definitive “Is this entity intelligent or not?” test—remains to be determined. After all, it’s not even clear that we have such criteria for a similar test for animals. As humans, we possess an unshakeable certainty that we are somehow unique, but over the years, characteristic after characteristic that we once considered exclusively human has turned out to be no such thing. Examples include the use of tools, the construction of societies, and the experience of empathy.

And yet, it’s hard to give up the idea that we are different. It’s just surprisingly difficult to identify precisely how. Similarly, it proves extremely difficult to determine where this difference begins. Where do we stop being sacks of electrolytes and start being conscious beings? It turns out that this question is no easier to answer than the question of whether consciousness might arise from the bewildering mess of electrical signals in our computers’ CPUs.

Turing, being Turing, had an answer for this, too. “I do not wish to give the impression that I think there is no mystery about consciousness. There is, for instance, something of a paradox connected with any attempt to localise it.” However, he argued that understanding the source of human consciousness wasn’t necessary to answer the question posed by the test.

In the narrowest sense, he was correct—in and of itself, the question of whether a machine can reliably imitate a human says nothing about consciousness. But the sheer amount of publicity around ChatGPT passing the Turing Test says a lot about the age we’re in: an age in which it may well be very important to know whether genuine artificial intelligence is possible.

To understand if a machine can be intelligent, perhaps we first need to understand how, and from where, intelligence emerges in living creatures. That may provide some insight into whether such emergence is possible in computers—or whether the best we can do is construct programs that do a very, very convincing job of parroting the internet, along with all its biases and prejudices, back at us.

 

Tom Hawking

Contributor

Tom Hawking is a writer based in New York City. He writes about culture, politics, science and everything in between. His work has appeared in the New York Times, the Guardian, Rolling Stone, and many other publications. You can subscribe to his Substack here.

