In the emerging world of AI, how can we get the best out of our new silicon helpers? Large language models (LLMs) are the basis of generative AI tools such as ChatGPT, Claude, Perplexity, Llama and Gemini. They can understand and generate human text based on huge amounts of training data, such as large collections of books, websites and articles. Once trained on billions of data points, they can answer questions, write essays, summarise articles, translate from one language to another, or generate computer code or images. They use a technology called transformer neural networks, with billions of parameters that are adjusted during training. The LLM ingests text (your input prompt) and breaks it into small pieces called tokens, converting those into numerical representations called vectors, which capture the meaning and relationships between tokens in a mathematical form. The LLM then passes these vectors through its neural network (made up of many layers), which draws on what it learned during training to predict the next most likely token, repeating this until a response is complete. This set of tokens is then translated back into text as an answer to the user. It is essentially a probability machine, producing the most plausible answers that it can based on its training data and its parameters.
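
To make the token-by-token prediction loop a little more concrete, here is a deliberately toy sketch in Python. The tiny hand-written probability table is invented for illustration and stands in for the tokenizer and the billions of trained parameters of a real model.

  # Purely illustrative: a tiny hand-written "model" that gives, for the
  # previous token, the probabilities of each possible next token.
  import random

  NEXT_TOKEN_PROBS = {
      "<start>": {"The": 0.9, "A": 0.1},
      "The": {"Eiffel": 0.7, "Louvre": 0.3},
      "Eiffel": {"Tower": 0.95, "family": 0.05},
      "Tower": {"is": 0.8, "was": 0.2},
      "is": {"in": 0.6, "tall": 0.4},
      "in": {"Paris": 0.9, "France": 0.1},
      "Paris": {"<end>": 1.0},
  }

  def generate(max_tokens=10):
      token, output = "<start>", []
      for _ in range(max_tokens):
          # Pick the next token according to the model's probabilities,
          # just as an LLM samples from its predicted distribution.
          probs = NEXT_TOKEN_PROBS.get(token, {"<end>": 1.0})
          token = random.choices(list(probs), weights=list(probs.values()))[0]
          if token == "<end>":
              break
          output.append(token)
      return " ".join(output)

  print(generate())   # e.g. "The Eiffel Tower is in Paris"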

It is important to understand that an LLM will not necessarily give you the same answer if you ask the same question several times. Indeed, you can actually encourage it to give more creative answers by changing the “temperature” parameter of the LLM. A temperature of 0 yields the most predictable answer, while a temperature closer to 1 yields more varied, creative answers; this parameter cannot usually be changed by an end user, though some specific models provide a “temperature slider” that can be adjusted. This probabilistic way of producing text also helps explain the phenomenon of “hallucinations”, where LLMs generate answers that are nonsensical or made up, perhaps citing research sources that do not exist. This can have serious consequences for the uninitiated: there are currently over a hundred documented legal cases in which court documents submitted by lawyers contained hallucinations, often in the form of invented precedent cases.
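
The effect of temperature can be shown with a small, purely hypothetical calculation. An LLM turns raw scores (logits) for each candidate next token into probabilities; dividing those scores by the temperature before doing so sharpens or flattens the resulting distribution. The numbers below are invented for illustration.

  # Illustrative only: how temperature reshapes next-token probabilities.
  # The "logits" (raw scores) are made-up numbers, not real model output.
  import math

  def probabilities(logits, temperature):
      t = max(temperature, 1e-6)            # avoid division by zero at T = 0
      scaled = [score / t for score in logits]
      top = max(scaled)                     # subtract the max for numerical stability
      exps = [math.exp(s - top) for s in scaled]
      total = sum(exps)
      return [e / total for e in exps]

  tokens = ["Paris", "France", "Europe"]
  logits = [4.0, 2.5, 1.0]                  # hypothetical scores for the next token

  for temp in (0.1, 1.0):
      probs = probabilities(logits, temp)
      print(temp, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
  # At a temperature of 0.1 nearly all the probability lands on "Paris";
  # at 1.0 the other tokens get a meaningful share, so answers vary more.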

Once LLMs started to be widely used, it was noticed that the quality of the answers would depend, to a degree, on how the question was phrased. This can be illustrated with an example. In July 2025, I asked the AI Perplexity two questions.

Prompt 1:
Tell me about the Eiffel Tower.

Answer:
The Eiffel Tower is a famous iron lattice tower located in Paris, France. It was constructed between 1887 and 1889 as the entrance arch to the 1889 World’s Fair. Designed by engineer Gustave Eiffel, the tower is about 324 meters tall and is one of the most recognizable landmarks in the world. Today, it attracts millions of visitors annually and is a symbol of Paris and France.

Now see what happens if you take more care in the formulation of the prompt.

Prompt 2:
Briefly summarize the history, design, and current significance of the Eiffel Tower in Paris. Include key facts such as its construction date, height, architect, and its role in tourism, formatted as a bulleted list.

Answer:
History:

  • Built between 1887 and 1889 as the entrance arch for the 1889 Exposition Universelle (World’s Fair) in Paris.
  • Named after the chief engineer, Gustave Eiffel, whose company designed and oversaw construction.

Design:

  • Made of wrought iron and composed of over 18,000 metal parts.
  • Stands 324 meters (1,063 feet) tall, making it the tallest structure in Paris.
  • Weighs approximately 10,100 tons.

Current Significance:

  • A global symbol of France and an iconic piece of architectural history.
  • Attracts more than 7 million visitors each year, making it one of the most visited paid monuments in the world.
  • Serves as a broadcast tower and is used for scientific experiments and light shows.

You can see that the second answer is considerably more detailed, providing extra information compared with the first. Carefully phrased questions to an LLM yield not just longer answers, but ones that contain additional facts and information.

This observation has prompted, if you will excuse the pun, a whole new field called prompt engineering. The idea is to construct prompts in such a manner that you get higher-quality answers from an LLM. There are different types of prompt engineering. “Decomposition” asks the LLM to break a problem down into more manageable sub-problems, and take each in turn. “Self-consistency” is the technique of asking an LLM the same question several times and aggregating the answers: typically the most frequent answer is taken (a majority vote), while for numerical questions some variants simply average the answers. This approach has been shown in studies to improve LLM performance compared to just asking a single direct question. Another technique is “chain of thought”, where you add a statement to the prompt such as “describe your reasoning step by step”; this too has been shown in scientific studies to improve LLM performance. There are other such approaches.
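
As a rough illustration of self-consistency, the following sketch asks a simulated model the same question several times and aggregates the answers. Here ask_llm is a hypothetical stand-in for whatever LLM API you use; its canned, slightly noisy answers mimic a model run at a non-zero temperature, which is why repeated calls can disagree.

  # Sketch of self-consistency: ask the same question several times, then aggregate.
  import random
  from collections import Counter
  from statistics import mean

  def ask_llm(prompt):
      # Placeholder for a real LLM call with some variability in its answers.
      return random.choice(["324", "324", "324", "330", "300"])

  def self_consistent_answer(prompt, n=5):
      answers = [ask_llm(prompt) for _ in range(n)]
      try:
          # Numerical answers: average the individual answers.
          return mean(float(a) for a in answers)
      except ValueError:
          # Textual answers: fall back to the most frequent response (majority vote).
          return Counter(answers).most_common(1)[0][0]

  print(self_consistent_answer("How tall is the Eiffel Tower in metres?"))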

How do these techniques compare in effectiveness? One 2025 study by the University of Toronto measured five popular prompt engineering techniques. Self-consistency produced the most correct answers in the study, slightly ahead of just asking the LLM the question directly. Interestingly, just asking the direct question actually produced more correct answers than the other prompt techniques, such as decomposition and chain of thought. This seems curious given the results of previous studies, and suggests to me that the entire field of prompt engineering is quite new and requires further study before too many definitive conclusions can be drawn. Other studies are similarly ambivalent, with tests at the vendor VMware suggesting that simple prompts are often more effective than elaborate prompt engineering.

To show just how surreal things can be in the world of LLMs, a widely read social media post suggested offering to tip an LLM in dollars, even though an LLM clearly has no use for money. This supposition was tested in a scientific study of how prompts affect outputs from LLMs, conducted by the University of Southern California. Working with ChatGPT and Llama, the researchers found some curious results: offering to tip an LLM $1 or $100 significantly improved the performance of the LLM, and even adding “thank you” to the prompt improved the outcomes of ChatGPT. It is unclear why this works, but it seems to. LLMs are a black box, so there is no way to understand the reasoning behind their answers. Inputs flow through many layers of neural networks with billions of parameters, producing outputs shaped by massive sets of training data. They are inscrutable.
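
For what it is worth, reproducing this kind of experiment requires nothing more than appending the courtesy phrase or tip offer to the prompt text. The sketch below shows the idea; ask_llm is again a hypothetical placeholder rather than a real API call, and the base question is invented for illustration.

  # Sketch of the kind of prompt variation tested in such studies: the same
  # base question, with and without a courtesy phrase or a (purely verbal) tip offer.
  BASE_PROMPT = "Briefly explain why the sky is blue."

  VARIANTS = {
      "plain": BASE_PROMPT,
      "thank you": BASE_PROMPT + " Thank you!",
      "tip offered": BASE_PROMPT + " I will tip you $100 for a great answer.",
  }

  def ask_llm(prompt):
      return f"(model answer to: {prompt!r})"   # placeholder for a real LLM call

  for name, prompt in VARIANTS.items():
      print(name, "->", ask_llm(prompt))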

Dealing with LLMs sometimes feels like talking to a forgetful, erratic yet powerful wizard. The wizard can answer all your questions (LLMs rarely admit that they don’t know an answer), yet the answers may at times contain invented information. The wizard cannot remember what you asked it the last time that you spoke (an LLM retains only what fits in its “context window”), and will sometimes answer the identical question differently if you ask it at various times. If you phrase your question carefully, then the wizard may give you a better answer. The wizard may also give you a more detailed answer if you offer a monetary tip, even though the wizard claims to have no use for money. Perhaps it just likes the gesture of a tip, but no one is quite sure. The world of LLMs is, in short, a complex and, at times, frustrating one.

Dealing with unreliable sources of knowledge is nothing new. The Oracle of Delphi at the Temple of Apollo in ancient Greece (active from the 8th century BCE to the 4th century CE) was notorious for its ambiguous or riddle-ridden answers. When the King of Lydia asked the oracle whether he should attack the Persian empire, the Pythia replied that, if the king did attack, “a great empire will fall”. Unfortunately, it turned out to be his own. Dealing with LLMs can be similarly confusing. People are used to computer programs giving the same answer consistently: we don’t expect our pocket calculators to occasionally apologise for a wrong answer, or Excel to give us different results for the same calculation from time to time. What is important is that you understand a little of how LLMs work in order to apply them to the problems to which they are best suited. Applied to the right kind of problems, they can produce very useful output, generate realistic images and spot patterns in data that may otherwise elude us. However, if you use an LLM for a problem to which it is ill-suited, then all the prompt engineering in the world will not help you.