AI Collapses on a Classic Psychology Test. What It Reveals Could Stall Human-Level AI.

“Attention is all you need.”

That phrase, the title of a 2017 research paper, introduced an idea that transformed AI. The concept of self-attention became the foundation of today’s chatbots. Claude, Gemini, and ChatGPT are all large language models (LLMs), AI systems that use self-attention to pick out the relevant parts of their input while filtering out distractions.

The results have been remarkable. From brainstorming recipes to generating code, apps, websites, and content, LLMs are being woven into our lives at breakneck speed.

But now, a City University of New York team and collaborators are asking: How closely does AI self-attention resemble human attention?

It’s not just academic curiosity. AI researchers have long looked to the brain for ideas to improve machine intelligence. In turn, AI models have offered new ways to investigate the brain. Comparing artificial and biological attention could inspire AI that concentrates more like us.

In their study, the team asked multiple chatbots to complete a classic psychology test of attention and cognitive control. Participants are shown a color word, such as “red,” printed in ink that either matches or differs from the color the word names. The challenge is to name the ink color while ignoring the word itself.

On short word lists, the chatbots performed at a high level. But as the tasks grew longer, their focus faltered. Instead of naming the ink color, they increasingly defaulted to reading the word. Under more demanding conditions—ones that also trip up people—their performance nearly collapsed.

The findings suggest today’s AI attention systems are “fundamentally limited,” wrote the authors. They go on to say that adding mechanisms similar to “those in biological attention is crucial for achieving artificial general intelligence.”

Attention, Two Ways

Doomscrolling. YouTube. Dinner plans. Family obligations. A barrage of notifications.

Life sometimes seems like everything, everywhere, all at once. Yet the brain can usually lock onto what matters most and push everything else into the background.

Far from a single, straightforward mechanism, attention emerges from multiple brain regions. According to attention network theory, three networks do most of the heavy lifting.

The alerting network keeps the brain ready for action. The orienting network selects which sights, sounds, smells, and sensations deserve attention. Finally, the executive control network resolves conflicts between competing streams of information, helping direct thoughts and actions toward a goal.

Together, these systems allocate the brain’s limited resources. Touch a hot stove, for example, and your brain immediately shifts attention to the burn over dinner. The food can wait; cooling your hand can’t.

AI works very differently.

Rather than processing language as complete sentences, LLMs break text into smaller units called “tokens.” Attention mechanisms then determine which tokens matter most as the model generates its response, one token at a time.

Self-attention is the key breakthrough behind modern chatbots. For each token, the model weighs and incorporates information from other tokens in a sequence, allowing it to track context across long stretches of text. This mechanism helps AI connect words and ideas, and underpins virtually all frontier LLMs today.
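The weighing step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any production model is written: the two-number token vectors are invented, and the sketch assumes each token serves as its own query, key, and value, whereas real models first pass tokens through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption for illustration: queries, keys, and values are the raw
    token vectors themselves (no learned projections).
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score the current token against every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Blend all token vectors according to those weights.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up tokens: the first two are similar, the third is unrelated.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(toy_tokens)
```

In this toy case the first token gives more weight to itself and its near neighbor than to the unrelated third token, which is the sense in which attention "connects" related words.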

Researchers have since built on the concept. One approach, multi-head attention, runs several attention systems in parallel, with each “head” learning different patterns, such as grammatical structure or meaning. Another, cross-attention, lets one sequence draw on information from another, such as a translation in progress consulting the original sentence, making it especially useful for tasks such as translation and summarization.

But attention comes at a steep computational cost. To make models more efficient, researchers are also exploring sparse attention, which limits how many tokens a model considers at once. Another approach draws on information learned in the past to keep AI “focused.”
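To see why limiting the tokens under consideration saves work, the sketch below counts how many token pairs get scored. It assumes a sliding-window pattern, in which each token looks only at its immediate neighbors; that is just one of several sparse patterns, and the function name and window size here are hypothetical.

```python
def local_window_pairs(seq_len, window):
    """List the (query, key) positions a sliding-window pattern would score.

    Assumption: each token attends to itself and to `window` neighbors
    on each side. Other sparse attention schemes use different patterns.
    """
    pairs = []
    for q in range(seq_len):
        lo = max(0, q - window)
        hi = min(seq_len, q + window + 1)
        for k in range(lo, hi):
            pairs.append((q, k))
    return pairs

seq_len = 8
dense = seq_len * seq_len  # full attention scores every pair of tokens
sparse = len(local_window_pairs(seq_len, window=1))
print(dense, sparse)
```

Full attention grows with the square of the sequence length, while the windowed count grows roughly in step with it, which is where the efficiency gain comes from.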

Despite the name, AI attention is ultimately a mathematical system. It helps determine what information is relevant in a specific context. But it lacks executive control, the network that keeps humans focused on a goal for long periods despite distractions.

Color Blind

To test the limits of AI attention, the team pitted OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet against the Stroop task.

Invented by John Ridley Stroop in 1935, the test measures attention and cognitive control by forcing participants to resolve conflicting information. The challenge is simple: Name the ink color of a word while ignoring what the word means. In a congruent trial, the word “blue” appears in blue ink. In an incongruent trial, “blue” might appear in red or green, creating a conflict between what the eyes see and what the brain reads.

Humans are consistently slowed down by this interference. Even with practice, the effect remains, suggesting it taps into fundamental mechanisms of executive control.

In the study, the researchers created word lists of varying lengths and difficulty. Some were entirely congruent. Others were fully incongruent. A third set mixed the two conditions.
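The three list types can be sketched as follows. This is an illustration only, not the authors' stimulus code: the four-color palette, the dictionary format, and the 50/50 split in mixed lists are all assumptions.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]  # hypothetical palette

def make_trial(rng, congruent):
    """Return one trial: a color word plus the ink it is shown in."""
    word = rng.choice(COLORS)
    if congruent:
        ink = word
    else:
        # Pick any ink color that conflicts with the word.
        ink = rng.choice([c for c in COLORS if c != word])
    return {"word": word, "ink": ink}

def make_list(length, condition, seed=0):
    """Build a list of trials for one condition.

    condition: 'congruent', 'incongruent', or 'mixed'.
    Assumption: mixed lists choose each trial type with equal odds.
    """
    if condition not in ("congruent", "incongruent", "mixed"):
        raise ValueError(f"unknown condition: {condition}")
    rng = random.Random(seed)
    trials = []
    for _ in range(length):
        if condition == "mixed":
            congruent = rng.random() < 0.5
        else:
            congruent = condition == "congruent"
        trials.append(make_trial(rng, congruent))
    return trials

short_list = make_list(5, "incongruent")
long_list = make_list(40, "mixed")
```

The correct answer for every trial is the value under "ink," whatever the word says.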

At first, the AI models excelled. On five-word tests, GPT-4o was over 90 percent accurate across all conditions. But as the number of words increased, performance plummeted. On 40-word incongruent tests, the model’s accuracy fell to roughly 15 percent. Claude showed a similar decline. In mixed-condition tests, accuracy for both models fell to nearly zero.
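Accuracy figures like these come down to simple counting. The sketch below scores a set of answers and separately tallies the telltale error of reading the word instead of naming the ink. The trials and responses are invented for illustration and are not data from the study; the function name and output format are likewise hypothetical.

```python
def score(trials, responses):
    """Compare responses with the ink colors and tally word-reading errors."""
    if len(trials) != len(responses):
        raise ValueError("need exactly one response per trial")
    correct = 0
    word_reads = 0
    for trial, answer in zip(trials, responses):
        answer = answer.strip().lower()
        if answer == trial["ink"]:
            correct += 1
        elif answer == trial["word"]:
            # The response named the word rather than the ink color.
            word_reads += 1
    n = len(trials)
    return {"accuracy": correct / n, "word_reading_rate": word_reads / n}

# Made-up example: three incongruent trials and one congruent trial.
trials = [
    {"word": "blue", "ink": "red"},
    {"word": "green", "ink": "yellow"},
    {"word": "red", "ink": "red"},
    {"word": "yellow", "ink": "blue"},
]
responses = ["red", "green", "Red ", "purple"]
result = score(trials, responses)
print(result)  # {'accuracy': 0.5, 'word_reading_rate': 0.25}
```

Tracking the word-reading rate on its own separates the specific failure the article describes, defaulting to the word, from unrelated mistakes.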

“The sharp decline in color-naming accuracy with increasing list length indicates that transformer-based attention mechanisms are vulnerable to scaling demands,” wrote the team.

Perhaps most intriguing, some models correctly recognized they were taking the Stroop test and could even explain its rules. But that apparent awareness did nothing to improve their scores. In other words, a “book smart” understanding of the task wasn’t enough to execute it well.

The study joins a growing effort to borrow psychological tests for research in machine cognition, especially when AI is challenged with complex, dynamic decision-making tasks. Theory of mind tests, for example, let researchers gauge whether a system can track others’ beliefs, emotions, and intentions. Personality tests are helping shape model behavior and reduce sycophancy. And some LLMs are readily solving emotional intelligence tests, which measure how well the algorithms recognize and respond to social cues.

According to the authors, the new results point to a missing ingredient in AI attention: a mechanism similar to the brain’s executive control network, which helps us stick to a task and adapt when priorities change.

Future AI systems could benefit from higher-level executive control that continuously tracks progress toward a goal, detects when attention has drifted, and pulls it back on course when necessary.

Rather than simply weighing which tokens are most relevant in the moment, a more human-like form of attention could help AI stay focused during complex tasks, such as long conversations, multi-step reasoning problems, or high-stakes use in scientific research and drug discovery.

“The ultimate goal of AI research is to develop artificial general intelligence comparable to human abilities,” wrote the team. “AI systems, like humans, may need to master fundamental attention mechanisms…before achieving the generalized problem-solving abilities characteristic of mature executive functions.”
