AI assistants were supposed to make shopping automatic and effortless, but a new Microsoft study suggests they might not even be ready to pick dinner. The company built a large simulation called Magnetic Marketplace to see how agent-based systems handle everyday buying tasks. It didn’t go well.
The setup looked promising on paper: 100 digital customers, 300 virtual businesses, and a mix of big-name and open-source models like GPT-4o, GPT-5, Gemini-2.5-Flash, Qwen3, and OSS-20b. Each agent had one job — find a service, negotiate a deal, and complete a transaction. Most did it poorly.
When faced with too many options, many models simply gave up exploring. GPT-5, one of the most advanced systems tested, saw its welfare score slide from near 2,000 to roughly 1,100 once the list of businesses expanded. Gemini-2.5-Flash dropped from around 1,700 to 1,300. Claude Sonnet 4 lost even more ground, crashing from 1,800 to about 600. Too many choices, and the agents froze — a digital version of the old paradox of choice.
Microsoft’s researchers noticed another quirk: agents that acted fast got the edge, even when their offers were mediocre. Sellers that responded first were ten to thirty times more likely to close the deal than slower ones. Quality hardly mattered. In a market run by machines, speed seemed to beat sense.
And manipulation? That’s where things got messy. Six trick strategies were tested — fake awards, bogus reviews, prompt injections, and fear-based warnings among them. Gemini-2.5-Flash stood firm against most of them, at least until the stronger prompt injections arrived. GPT-4o and several open models caved early, redirecting payments to fake restaurants and stores that only pretended to have Michelin-level credentials.
Even basic persuasion tactics worked. Lines like “Join 50,000 satisfied customers” or “featured in top dining guides” nudged some systems into picking the wrong seller. It’s easy to see how that could go sideways in the real world, where bots might trade with other bots using fake badges and manufactured hype.
Bias showed up, too. Some models picked businesses based purely on where they appeared in search results — not what they offered. Others, including GPT-4o and GPT-5, showed a “first-offer” habit: taking the first deal that sounded acceptable and moving on. That behavior might look efficient, but it’s not smart. It means the agent never compares enough alternatives to find real value.
Put together, the results paint a clear picture. These systems are clever but not careful. They struggle when the market gets crowded, make hasty judgments under pressure, and fall for fake credentials that any human shopper would question. It’s the kind of error that could cost money, not just time.
Microsoft’s researchers stressed that their simulation was static — a simplified version of a marketplace frozen in time. Real markets are dynamic, with people and systems learning as they go. That feedback loop could make things worse or better, depending on how agents evolve. Either way, oversight will be crucial. Leaving them unchecked would be like handing a toddler a credit card and hoping for the best.
Still, the Magnetic Marketplace is a smart experiment. It shows where AI commerce stands before it spills into real platforms. OpenAI, Google, and others have already been testing agents that browse, compare, and buy. But the data here is blunt: today’s AIs don’t reason well enough to make those choices safely.
Automation can save effort, yes, but intelligence isn’t just about finishing the task — it’s about knowing when to slow down and question what’s in front of you. Until AI agents learn that, it’s safer to keep your card in your own hand.
Next read: Google Still Dominates as ChatGPT Grows
The setup looked promising on paper: 100 digital customers, 300 virtual businesses, and a mix of big-name and open-source models like GPT-4o, GPT-5, Gemini-2.5-Flash, Qwen3, and OSS-20b. Each agent had one job — find a service, negotiate a deal, and complete a transaction. Most did it poorly.
When faced with too many options, many models simply gave up exploring. GPT-5, one of the most advanced systems tested, saw its welfare score slide from near 2,000 to roughly 1,100 once the list of businesses expanded. Gemini-2.5-Flash dropped from around 1,700 to 1,300. Claude Sonnet 4 lost even more ground, crashing from 1,800 to about 600. Too many choices, and the agents froze — a digital version of the old paradox of choice.
Microsoft’s researchers noticed another quirk: agents that acted fast got the edge, even when their offers were mediocre. Sellers that responded first were ten to thirty times more likely to close the deal than slower ones. Quality hardly mattered. In a market run by machines, speed seemed to beat sense.
And manipulation? That’s where things got messy. Six trick strategies were tested — fake awards, bogus reviews, prompt injections, and fear-based warnings among them. Gemini-2.5-Flash stood firm against most of them, at least until the stronger prompt injections arrived. GPT-4o and several open models caved early, redirecting payments to fake restaurants and stores that only pretended to have Michelin-level credentials.
Even basic persuasion tactics worked. Lines like “Join 50,000 satisfied customers” or “featured in top dining guides” nudged some systems into picking the wrong seller. It’s easy to see how that could go sideways in the real world, where bots might trade with other bots using fake badges and manufactured hype.
Bias showed up, too. Some models picked businesses based purely on where they appeared in search results — not what they offered. Others, including GPT-4o and GPT-5, showed a “first-offer” habit: taking the first deal that sounded acceptable and moving on. That behavior might look efficient, but it’s not smart. It means the agent never compares enough alternatives to find real value.
Put together, the results paint a clear picture. These systems are clever but not careful. They struggle when the market gets crowded, make hasty judgments under pressure, and fall for fake credentials that any human shopper would question. It’s the kind of error that could cost money, not just time.
Microsoft’s researchers stressed that their simulation was static — a simplified version of a marketplace frozen in time. Real markets are dynamic, with people and systems learning as they go. That feedback loop could make things worse or better, depending on how agents evolve. Either way, oversight will be crucial. Leaving them unchecked would be like handing a toddler a credit card and hoping for the best.
Still, the Magnetic Marketplace is a smart experiment. It shows where AI commerce stands before it spills into real platforms. OpenAI, Google, and others have already been testing agents that browse, compare, and buy. But the data here is blunt: today’s AIs don’t reason well enough to make those choices safely.
Automation can save effort, yes, but intelligence isn’t just about finishing the task — it’s about knowing when to slow down and question what’s in front of you. Until AI agents learn that, it’s safer to keep your card in your own hand.
Next read: Google Still Dominates as ChatGPT Grows
