Large language models like GPT-3 are giving chatbots an uncanny ability to produce human-like responses to our probing questions. But how smart are they, really? A new study from psychologists at the University of California, Los Angeles, out this week in the journal Nature Human Behaviour, found that the language model GPT-3 has better reasoning skills than an average college student, which is arguably a low bar.
The study found that GPT-3 performed better than a group of 40 UCLA undergraduates at answering a series of questions of the kind you would see on standardized exams like the SAT, which require using solutions from familiar problems to solve a new problem.
According to a press release, “The questions ask users to select pairs of words that share the same type of relationships. (For example, in the problem: ‘Love’ is to ‘hate’ as ‘rich’ is to which word? The solution would be ‘poor.’)” Another set of analogies used prompts derived from a passage in a short story, with questions about information within that story. The press release points out: “That process, known as analogical reasoning, has long been thought to be a uniquely human ability.”
In fact, GPT-3’s scores were better than the average SAT score for college applicants. GPT-3 also did just as well as the human subjects when it came to logical reasoning, tested through a set of problems called Raven’s Progressive Matrices.
It’s no surprise that GPT-3 excels at the SAT. Previous studies have tested the model’s logical aptitude by asking it to take a series of standardized exams such as AP tests, the LSAT, and even the MCAT, and it passed with flying colors. The latest version of the language model, GPT-4, which has the added ability to process images, is even better. Last year, Google researchers found that they could improve the logical reasoning of such language models through chain-of-thought prompting, in which the model breaks a complex problem down into smaller steps.
Even though AI today is fundamentally challenging computer scientists to rethink rudimentary benchmarks for machine intelligence like the Turing test, the models are far from perfect.
For example, a study published this week by a team from UC Riverside found that language models from Google and OpenAI delivered imperfect medical information in response to patient queries. Further studies from scientists at Stanford and Berkeley earlier this year found that ChatGPT, when prompted to generate code or solve math problems, was getting sloppier with its answers, for reasons unknown. And while ChatGPT is fun and popular among regular folks, it’s not very practical for everyday use.
These models also still perform dismally at visual puzzles and at understanding the physics and spaces of the real world. To address this, Google is trying to combine multimodal language models with robots.
It’s hard to tell whether these models are thinking like we are, that is, whether their cognitive processes are similar to our own. Still, an AI that’s good at test-taking is not generally intelligent the way a person is. It’s also hard to tell where these models’ limits lie, and what their potential could be. Finding out would require opening the models up and exposing their software and training data. That lack of openness is a fundamental criticism experts have of how closely OpenAI guards its LLM research.