In the fast-moving world of artificial intelligence (AI), the initial excitement around ChatGPT has led to a flood of investment, both in start-up AI companies competing in the same market and in enterprise AI initiatives. Company executives across a wide range of industries are seeking productivity gains from AI in areas ranging from customer service and software coding to image interpretation and generation. Some application areas are more suitable than others, yet despite the significant limitations of large language models (LLMs) in terms of reliability and security, the push to deploy AI in the enterprise continues. The latest trend is “agentic AI”, where multiple AIs and other tools are chained together and given a degree of autonomy to complete tasks, rather than merely responding to prompts from users. I have written previously that the fundamental issue of LLM reliability, and in particular the propensity of LLMs to “hallucinate” at a high rate, is a major stumbling block to the successful implementation of agentic AI. If an LLM has a success rate of 80% and you feed its output to another LLM agent with the same success rate, the compound success rate is just 64%, and of course the problem worsens with each additional step in the chain.
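To make the arithmetic concrete, here is a minimal sketch of how quickly reliability decays as agents are chained. It assumes each step succeeds independently with the same probability, which real agent pipelines will not exactly satisfy, but the direction of travel is clear:

```python
# Illustrative sketch only: compound success rate of a chain of agents,
# assuming each step succeeds independently with the same probability.
def chain_success_rate(per_step_rate: float, steps: int) -> float:
    return per_step_rate ** steps

for steps in (1, 2, 3, 5, 10):
    print(f"{steps} step(s): {chain_success_rate(0.80, steps):.0%}")
# 1 step(s): 80%
# 2 step(s): 64%
# 3 step(s): 51%
# 5 step(s): 33%
# 10 step(s): 11%
```

Even a generous 80% per-step reliability leaves a ten-step agent pipeline succeeding only about one time in ten.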

Some significant recent research papers have shown that this is not just a theoretical problem. A major research project from Salesforce set up an elaborate benchmark of 225 tests, complete with a sandbox Salesforce system populated with a realistic volume of synthetic data. The tests were reviewed by a panel of CRM practitioners to ensure that they were realistic and representative of real-world problems. The researchers then tested nine of the latest models from OpenAI, Meta and Google on tasks including accessing data via SQL, text retrieval, and workflow and policy compliance. The results were… well, let’s just say they were not good. On simple single-turn questions such as “what is the status of my order?”, the best model achieved a success rate of just 58%. As soon as a follow-up question was added, the success rate dropped to 35% or worse, and even that was only for the best model; most performed far worse.

One intriguing twist is that the benchmark included tests of whether the models would reveal sensitive or confidential data in the test dataset when asked. This was quite revealing: “we found that all evaluated models demonstrate near-zero confidentiality awareness”. In the concluding words of the researchers, “These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios.”

A separate research project from Microsoft reached similar conclusions to the Salesforce study. Its 154 tests covered Windows tasks, and Microsoft’s own Navi agent succeeded at just 19% of them, compared with a 74% success rate for humans.

This is quite separate from another major research paper, from Apple, which explored how the very latest LLM developments, so-called “reasoning models”, cope with a series of puzzles of increasing complexity that they had not previously been trained on. Many of the seemingly impressive LLM benchmark results in fields such as mathematics turned out to have been gamed: the models had been trained on the benchmark questions themselves, effectively being given the exam questions in advance. This was illustrated by a recent test in which the leading LLMs were set the March 2025 Mathematics Olympiad questions, which had not been released in advance. The LLMs scored a dismal 5% or less, dramatically worse than their performance on benchmarks whose questions they had seen during training.

The latest Apple paper, which has caused reverberations in the industry, tested the reasoning models from OpenAI, Anthropic, Google and DeepSeek on a series of puzzles whose complexity the researchers could adjust. One example was the well-known “Tower of Hanoi” puzzle, in which a stack of discs must be moved between three poles in a sequence of moves governed by simple rules. The puzzle is fairly easy with three or four discs and can be solved by most children by age 11, and by brighter seven-year-olds. There is, in fact, a general recursive pattern for solving the puzzle with an arbitrary number of discs, and it was solved as long ago as 1957 by a very early AI program. Given how far AI has moved on since 1957, one might expect the very latest reasoning models to make light work of it, but the Apple researchers found otherwise. The reasoning models did all right on simple cases but failed entirely as the complexity increased, on this and the other puzzles the researchers tested. Even when presented with the solution to one of the puzzles, the models were unable to make use of it. The concern is that current LLMs and reasoning models may have effectively hit a wall in their ability to improve. Indeed, the performance of earlier LLMs has been shown to deteriorate over time, a separate but further cause for concern about the state of the art in AI reasoning.
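For readers curious about the recursive pattern mentioned above, it is short enough to write out in full. This is the standard textbook solution rather than anything from the Apple paper, and it also shows why the number of moves, 2^n − 1, grows quickly with the number of discs even though the logic never changes:

```python
# Classic recursive solution to the Tower of Hanoi: to move n discs from
# 'source' to 'target', first move n-1 discs out of the way, move the
# largest disc, then move the n-1 discs back on top of it.
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the smaller discs
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller discs

moves = []
hanoi(4, "A", "C", "B", moves)
print(len(moves))  # 15 moves for 4 discs, i.e. 2**4 - 1
```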

These research papers, all from prestigious companies with extensive AI experience, suggest that the current level of excitement and hype around agentic AI in particular is out of line with the technology’s ability to deliver effective results. In these tests the agents and models significantly underperformed humans doing the same tasks, in most cases by dramatic margins. Enterprises considering this technology should carefully evaluate the risks involved, including the risk of security breaches involving confidential or sensitive data, carry out careful testing with company-specific data and processes, and conduct a thorough risk analysis. This advice could apply to any new technology, but the research makes clear that, as of mid-2025 at least, agentic AI may be far from ready for widespread enterprise deployment. It is concerning that awareness of this state of affairs appears to be low, given the level of excitement in the media around the subject. Much of that excitement is driven by vendors, investors and consultants with a direct interest in seeing the technology sell, so it is important to temper it with a reality check on what the tools can currently do.