Artificial intelligence has been on everyone’s lips for the last couple of years, ever since ChatGPT racked up 100 million users in record time (just two months) after its release in November 2022. Based on large language model (LLM) technology, the latest AI can generate convincing content quickly (“generative AI”), from school essays to holiday itineraries, from marketing material to programming code. A host of rivals sprang up, such as Claude, Perplexity, Grok, LLAMA and DeepSeek, amongst others. Similar products emerged to generate images and videos, such as Midjourney, Adobe Firefly, Leonardo and Google’s Imagen. A torrent of investor money ($110 billion in 2024) has created a wave of AI startups, and 95% of US companies now use AI, according to Bain. Popular use cases have been in coding, customer service chatbots and marketing.
However, all is not rosy in the generative AI garden. There have been numerous well-documented failures of the technology, from multiple lawyers being sanctioned for briefs citing fictitious cases to major customer service projects being abandoned, such as drive-through AI assistance at McDonald’s. A key reason for the failures is that LLMs “hallucinate”, producing error-ridden or downright fanciful answers in a troubling proportion of their replies. Hallucination rates vary with the model and the task, but a rate of around 1 in 4 or 1 in 5 seems to be the general consensus. Worryingly, the rate actually appears to be worsening in the latest models, as discussed in a New Scientist article in May 2025. Responses have ranged from “prompt engineering” to supplementing LLMs with industry-specific datasets (retrieval augmented generation); while these may be useful, they have not eliminated hallucinations. It is now widely accepted that hallucinations in LLMs are here to stay, as they are an inherent consequence of how the models work. OpenAI’s CEO Sam Altman has even said that “a lot of value from these systems is heavily related to the fact that they do hallucinate”.
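To make the idea of retrieval augmented generation concrete, here is a minimal sketch in Python. It assumes a tiny in-memory document store and uses naive word overlap in place of a real vector search; the document texts, function names and prompt wording are illustrative only, not any particular product’s API.

```python
# Minimal sketch of retrieval augmented generation (RAG): fetch the passages most
# relevant to a question from a small in-memory knowledge base and prepend them to
# the prompt, so the model answers from supplied text rather than from memory alone.

KNOWLEDGE_BASE = [
    "Refund requests must be filed within 30 days of purchase.",
    "Premium support is available to enterprise customers only.",
    "All invoices are issued in euros and are payable within 60 days.",
]

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question (a stand-in for vector search)."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str) -> str:
    """Assemble a grounded prompt; the result is what would be sent to the LLM."""
    context = "\n".join(retrieve(question, KNOWLEDGE_BASE))
    return (
        "Answer using ONLY the context below. If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("How long do customers have to request a refund?"))
```

Grounding the prompt in retrieved text narrows the model’s room to invent answers, but, as noted above, it does not eliminate hallucinations.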
Consistency is another issue. If you ask an LLM the same question several times, you will not get identical replies. This inconsistency stems from the probabilistic nature of token sampling and the inherent randomness in model inference; one academic study found that ChatGPT was between 74% and 89% consistent on a large set of true/false questions asked repeatedly, with Mistral and LLAMA scoring lower still. This behaviour may be fine in some cases, but it is certainly not in others. If an LLM generates a logo for your new company and you don’t like it, you can ask it to try again, and again, until you get one you do like. However, if you ask an LLM to do a calculation, you may be less happy when your sales report produces different figures each time you ask the same question. If you put a calculation into Excel, you do not keep a calculator alongside the computer to double-check its answer: you expect it to be consistent and correct. This is simply not going to work with generative AI, which is inherently inconsistent in its answers and is not good at arithmetic. At the time of writing, multiplying two four-digit numbers together has only a roughly 30% success rate across generative AI models. Boosters for AI tout its apparent success in advanced maths tests, but it turns out that this is because the LLMs were “coached”, i.e. trained on the questions. In a 2025 Maths Olympiad test where the questions were not released in advance, a range of leading AIs, including Claude and Grok, scored dismally, almost all under 5%.
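A quick way to see both problems at once is to send the same arithmetic question to a model several times and compare the replies. The sketch below assumes the OpenAI Python client and a hypothetical model choice; any chat-completion endpoint would serve equally well.

```python
# Ask an LLM the same arithmetic question repeatedly and count the distinct answers.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
PROMPT = "What is 4739 * 8251? Reply with the number only."

answers = []
for _ in range(10):
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # illustrative choice; substitute your own model
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,              # ordinary sampling, so some randomness is expected
    )
    answers.append(response.choices[0].message.content.strip())

print(Counter(answers))               # more than one distinct answer is the typical outcome
print("Correct:", 4739 * 8251)        # 39101489, for comparison with the model's replies
```

In line with the figures above, a loop like this will usually return more than one distinct answer, and often none of them is the correct product.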
A consequence of systemic hallucinations in LLMs is that, if the technology is widely deployed in enterprises, it may become hard to tell which corporate data is wholly or partly AI-generated. This may erode trust in corporate data, which was not universally high even before AI. More broadly, society may become used to unreliable AI answers and grow more distrustful of “facts” in general. We have already seen the “fake news” phenomenon in political life; the unreliability of AI answers may lower public levels of trust further.
How is an enterprise to react to these contradictory indicators? The key is to think carefully about the use case and decide whether hallucinations and inconsistency actually matter for your particular project. A programmer may not be too troubled by an LLM hallucinating a library that does not exist, because that error will be picked up by the compiler or interpreter, as the small example below shows. This is one reason why coding has become one of the main use cases of generative AI. There are many issues with code generated by LLMs, but these disappear if you restrict LLM-written code to prototypes rather than production code.
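A trivial illustration of why this class of hallucination is relatively harmless in code: an invented dependency fails the moment the interpreter tries to load it. The package name here is deliberately made up.

```python
# A hallucinated dependency is caught immediately: the interpreter simply
# refuses to import a package that does not exist.
try:
    import chartwizard_pro  # fictitious package name of the kind an LLM might invent
except ModuleNotFoundError as err:
    print(f"Hallucinated library caught at import time: {err}")
```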
For graphics, another area where LLMs have made considerable progress, it may genuinely be the case that hallucinations are not a bad thing, since they can produce unusual or interesting art or images which, after all, can simply be discarded if they don’t fit the bill. The realism of AI-generated images has improved greatly in the last two years, and this is an area where AI is having a real impact on an industry, though many issues remain to be resolved, especially around copyright, as the models have been trained without payment or permission on swathes of copyrighted material. Unsurprisingly, the copyright holders are unhappy, and the first court cases are already appearing.
So, what criteria can we apply to decide whether a project is well or poorly suited to generative AI? Firstly, anything that depends on reliable facts, or anything mission-critical, is problematic. Legal advice and financial reporting are examples where it would be unwise to use the technology, as some have already discovered to their cost. LLMs struggle with edge cases and unpredictable scenarios, which is why self-driving cars have yet to gain wide acceptance: they work very well in well-defined situations but struggle with unexpected events, often with fatal consequences. Much the same applies to situations like air traffic control or emergency response, where unpredictability is a factor. Tasks that require explanation and transparency, such as court sentencing or insurance claims, are unsuitable, since LLMs are black boxes: they produce answers but cannot tell you their reasoning. Situations involving sensitive or regulated data may be problematic, or at least require careful handling, as such data can be leaked, either inadvertently or through malicious action. Another factor is that AI models are heavily dependent on the data on which they are trained: models trained on reliable, high-quality data will be more successful than those that are not. This means that enterprise deployments need to be considered carefully, since only about 1 in 3 executives actually trust their own corporate data, according to numerous surveys in recent years, and that level of trust appears to be declining.
The good news is that this still leaves many situations where the above criteria do not apply. AI models have proven very useful in areas such as medical imaging and drug discovery. They can often detect fraud patterns effectively, generate personalised recommendations for marketing and build effective personal tuition plans in education. There are many reasonable use cases for generative AI, but the key message is: choose your use cases wisely.
The diagram below summarises the broad factors to consider when deciding whether a project is well suited to generative AI or a risky one.