I tested AI tools on data analysis — here’s how they did (and what to look out for)


Photo: Jakub T. Jankiewicz | CC BY-SA 2.0

TL;DR: If you understand code, or would like to understand code, genAI tools can be useful for data analysis — but results depend heavily on the context you provide, and the likelihood of flawed calculations means the code needs checking. If you don’t understand code (and don’t want to) — don’t do data analysis with AI.

ChatGPT used to be notoriously bad at maths. Then it got worse at maths. And the recent launch of its newest model, GPT-5, showed that it’s still bad at maths. So when it comes to using AI for data analysis, it’s going to mess up, right?

Well, it turns out that the answer isn’t that simple. And the reason why it’s not simple is important to explain up front.

Generative AI tools like ChatGPT are not calculators. They use language models to predict a sequence of words based on examples from their training data.

But over the last two years AI platforms have added the ability to generate and run code (mainly Python) in response to a question. This means that, for some questions, they will try to predict the code that a human would probably write to answer your question — and then run that code.

When it comes to data analysis, this has two major implications:

  1. Responses to data analysis questions are often (but not always) the result of calculations, rather than a predicted sequence of words. The algorithm generates code, runs that code to calculate a result, then incorporates that result into a sentence.
  2. Because we can see the code that performed the calculations, it is possible to check how those results were arrived at.
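To make that concrete, here is a minimal sketch of the kind of pandas code a chatbot might generate and run for a prompt like “What is the average pay gap?”. The file name is illustrative and the column choice is only one of several it could make; this is not output from any of the platforms tested.

```python
import pandas as pd

# Load the uploaded dataset (file name is illustrative)
df = pd.read_csv("gender-pay-gap.csv")

# One possible reading of "average": the mean of the mean pay gap column
result = df["DiffMeanHourlyPercent"].mean()
print(f"The average (mean) pay gap is {result:.1f}%")
```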

What happened when I asked AI tools to perform analysis

To find out how accurate AI tools were when asked to perform calculations with data — and, more importantly, what mistakes to look out for — I uploaded a 10,000-row dataset on companies’ gender pay gaps to ChatGPT, Claude, Google Gemini, and Microsoft Copilot* and road-tested each platform on a series of questions.

The good news for those hoping to use genAI for data analysis is that these tools can perform accurately on the calculations that they make.

The bad news is that those aren’t always the right calculations to answer the question you thought you were asking.

How will you know? Only if you can understand the code that they used to answer your question.

Put another way, when you use AI to perform data analysis, what you are really doing is asking it to generate code to perform data analysis. The measure of success, therefore, is the method it chose to apply, as represented by the code, not the answer that results from it.

So, instead of making programming obsolete, AI is creating a new reason to learn to code.

Check if it has used code first

Once you’ve uploaded some data to ChatGPT or Claude (or Gemini or Copilot), and asked it a question, the first thing to check is whether the chatbot has used code at all.

  • In ChatGPT there should be a > symbol at the end of the response if it has used code — you can click on that symbol to open up a window with the code
  • In Gemini there should be a ‘Show code < >’ button below the prompt and above the response
  • Copilot has an ‘Analysis’ button to show code
  • And Claude includes an ‘Analyzed data’ strip underneath its response which can be expanded to ‘View analysis’

If none of these options are available, it hasn’t used code and you should edit your prompt to ask it to do so. Never rely on analysis without code.

For example, in my testing ChatGPT provided the wrong answer when asked “what proportion of companies have a pay gap favouring women?” With no link to any code, it was clear that it had generated the answer based on patterns of language.

When ChatGPT does not use code, it is likely to get the answer wrong, as it does in this response. Although it offers to “show you the exact code used”, no code has been used: if it had, a > button would appear.

Worse, the response offered to “show the code used”, despite not having used any, so don’t rely on the text of the response itself to indicate whether it has or has not used code.

AI tools will ‘predict’ what you mean if you’re not specific enough

One of the great breakthroughs in generative AI is its ability to cope with the subtleties of human language. When a human asks “What is the average pay gap?” for example, they could mean more than one thing: the mean, the median, or the ‘most common’ pay gap (the mode).

The language model, then, will make a probabilistic prediction of what “average” really means (ChatGPT, Gemini and Copilot all predicted mean average).

Complicating things further, in this particular dataset there were two measures of the gender pay gap, so the model had to predict which column was most likely to be the ‘pay gap’ referred to in the question.

The less specific the prompt, the wider the range of possible interpretation. A vague prompt like “What’s the typical pay gap”, for example, was interpreted by ChatGPT as the median of all mean pay gaps and by Gemini as the median of all median pay gaps. And both might interpret the question differently when asked on another occasion (when you ‘roll the dice’ again).

A simple rule, then, is to always name which column(s) you want to be used in any analysis.
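To see why naming the column matters, here is a rough sketch of how differently that one word “average” can be translated into code. The file name is illustrative, DiffMeanHourlyPercent is the dataset’s mean pay gap column, and DiffMedianHourlyPercent is my assumption for the name of the second measure.

```python
import pandas as pd

df = pd.read_csv("gender-pay-gap.csv")  # illustrative file name

# Four defensible readings of "What is the average pay gap?".
# Each is valid code, and each gives a different number.
mean_of_means = df["DiffMeanHourlyPercent"].mean()
median_of_means = df["DiffMeanHourlyPercent"].median()
mean_of_medians = df["DiffMedianHourlyPercent"].mean()      # column name assumed
median_of_medians = df["DiffMedianHourlyPercent"].median()  # column name assumed

# Naming the column and the measure in the prompt removes the guesswork,
# e.g. "What is the median of the DiffMeanHourlyPercent column?"
```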

It’s worth noting that Claude performed particularly well in dealing with ambiguity: it provided the mean of both the gender pay gap measures, and added other insights into the distribution and range of pay gaps to put those means into context.

The downside of this was that Claude’s longer responses meant that it hit conversation limits sooner.

Consider how concepts like ‘biggest’ might be expressed in code

"Which company has the biggest pay gap?" seems like a relatively simple prompt, but what do we mean by “biggest”?

In code, ‘biggest’ might be expressed as ‘the biggest number’, but think about whether that is actually what you mean.

For example, the company with the biggest pay gap in the dataset was Senior Salmon Ltd with a pay gap of -549.7 (meaning that the mean hourly wage for women was 549.7% higher than men’s). But none of the platforms identified this.

That is because pay gaps favouring women are expressed as a negative number in the data. From a code point of view, that’s the smallest number. From a human point of view that’s the ‘biggest negative number’, but the prompt didn’t ask for that.

The problem here comes again from relying too much on a large language model to predict the most likely meaning of an ambiguous term like ‘biggest’. Instead, a good prompt should be more explicit, asking “Which company has the largest positive or negative pay gap?” or “Which row has the largest or smallest values in the DiffMeanHourlyPercent column?”.
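In code, the difference between those readings looks something like the sketch below (the file name is illustrative; DiffMeanHourlyPercent is the mean pay gap column in this dataset).

```python
import pandas as pd

df = pd.read_csv("gender-pay-gap.csv")  # illustrative file name
gap = df["DiffMeanHourlyPercent"]

# "Biggest" translated literally: the largest value, i.e. the gap most in favour of men
biggest_positive = df.loc[gap.idxmax()]

# The biggest gap favouring women is the smallest (most negative) value
biggest_negative = df.loc[gap.idxmin()]

# "Biggest in either direction" needs the largest absolute value
biggest_either_way = df.loc[gap.abs().idxmax()]
```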

Anticipate if there might be more than one answer

Claude was the only AI tool to correctly identify that two companies both had the biggest pay gap favouring men

Another blind spot was a situation where there was more than one answer.

The ‘biggest’ pay gap (in favour of men) actually related to two companies — but ChatGPT, Copilot and Gemini all responded with the answer that Gower Timber Ltd had the biggest pay gap.

This was because, when sorted by pay gap, Gower ranked above the other company alphabetically.

Only Claude (again) ignored the assumption embedded in the use of the singular “company” in the prompt, and highlighted that two companies tied for the rank.

One of the advantages of conducting analysis yourself with spreadsheets and code is that you can generally see the data surrounding your results — so you are less likely to miss context such as this. As a result it’s better to ask for the ‘top 10’ or ‘bottom 10’ in a prompt (or both) to ensure you still get that context (even asking for “companies” plural only yielded one result).
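A short sketch shows how a tie can vanish, and how asking for more rows keeps that context (file name illustrative; this is not the code any of the platforms produced).

```python
import pandas as pd

df = pd.read_csv("gender-pay-gap.csv")  # illustrative file name
gap = df["DiffMeanHourlyPercent"]

# idxmax() (or sorting and taking the first row) returns a single row,
# silently dropping any company tied for the top spot.
single_answer = df.loc[gap.idxmax()]

# Keeping every row that equals the maximum surfaces ties...
all_tied_for_top = df[gap == gap.max()]

# ...and asking for a top/bottom 10 keeps the surrounding context visible.
top_ten = df.nlargest(10, "DiffMeanHourlyPercent")
bottom_ten = df.nsmallest(10, "DiffMeanHourlyPercent")
```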

Remember too that AI platforms are currently unlikely to push back if your question is flawed, so check your prompts for any assumptions that may create blind spots in the responses — and try to design prompts that encourage the AI to look for them too.

AI can be useful for filtering or pivoting dirty/mixed data — but it makes the same mistakes as humans

I tested the four tools on a common challenge: filtering or pivoting on columns with mixed data. Specifically, an address column (city names are mixed with street names and other details) and an industry codes column, where a code might be on its own (and treated as a number) or listed with other codes (and treated as a string).

I tried three prompts: “How many companies have the SIC code 82990”, “What’s the average pay gap for companies with the SIC code 16100” and “How many companies are in Birmingham?”

On the SIC code task all models successfully avoided the trap of only counting exact matches, which most humans make when attempting this in a spreadsheet for the first time.

The code generated also mostly arrived at a correct result. The exception was instructive: on its first try Copilot returned an incorrect result because it took a slightly different approach (splitting out each code from the list, and looking for an exact match) which meant it missed codes with invisible ‘new line’ characters before them.

In another scenario this approach might have been more accurate. It would depend on how varied the codes were, and how they were entered. Put simply, this came down to prompt design and the lack of detail provided. A human is needed in the loop to identify how matches should be targeted and checked.

The less context and guidance provided in the prompt, the more random predicted code is likely to be (Copilot returned code that generated a correct answer in another conversation).
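The two approaches might look roughly like this. The SicCodes column name and the comma delimiter are my assumptions; the point is the trade-off between the two matching strategies, not the exact code any platform wrote.

```python
import pandas as pd

df = pd.read_csv("gender-pay-gap.csv")  # illustrative file name

# Column name is an assumption; astype(str) also catches lone codes read as numbers
codes = df["SicCodes"].astype(str)

# Approach 1: substring match. Finds "82990" anywhere in the cell, including
# cells listing several codes, but could also match a longer code that merely
# contains those digits.
contains_match = codes.str.contains("82990")

# Approach 2: split each cell into individual codes and compare exactly.
# Stricter, but it fails if stray whitespace or newline characters are left
# clinging to the codes, so strip them first.
exact_match = codes.apply(
    lambda cell: "82990" in [part.strip() for part in cell.split(",")]
)

print(contains_match.sum(), exact_match.sum())
```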

Claude successfully predicted that location could be indicated by two columns, and the shape of a Birmingham postcode

The address test produced even more variety between platforms:

  • ChatGPT and Gemini both generated code that counted strings in the Address column that contained ‘Birmingham’. This is the (flawed) approach that most humans with basic spreadsheet training take.
  • Copilot counted rows where the Address column OR the Postcode column contained ‘Birmingham’.
  • Claude did best of all, counting rows where the Address column contained ‘Birmingham’ OR where the Postcode column started with a B and then a digit.

The problem with all of these approaches (and a common mistake made by humans too) is that an address on a Birmingham Road or Birmingham Street in another city would still be counted as a positive match.
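In code, the difference between those approaches is roughly as follows. The Address column is named above; the PostCode column name is my assumption.

```python
import pandas as pd

df = pd.read_csv("gender-pay-gap.csv")  # illustrative file name
address = df["Address"].astype(str)
postcode = df["PostCode"].astype(str)   # column name assumed

# The simple approach: any address containing "Birmingham". This also catches
# a "Birmingham Road" in another city (a false positive).
by_address = address.str.contains("Birmingham", case=False)

# Claude's stricter idea, roughly: a Birmingham-area postcode starts with "B"
# followed by a digit (e.g. B1, B42), though the B postcode area also covers
# places outside the city itself.
by_postcode = postcode.str.match(r"B\d")

print((by_address | by_postcode).sum())
```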

Consider classification/categorisation carefully

The biggest takeaway from the address classification isn’t that Claude is best (although it clearly has better training on postcode data), but that careful thought is needed about problem-solving before writing any prompt like this. An effective prompt should consider potential blind spots, false negatives and false positives.

Any analysis that involves classifying data into a subset should be especially careful here, because it is likely to fall foul of AI’s gullibility bias — specifically its tendency not to question the premise of your prompt.

To test this, I asked “What’s the biggest pay gap for football clubs?” This relies on two stages of analysis: filtering the data, and then sorting the resulting subset. It also relies on a false premise: that football clubs can be easily separated from other types of company.

This resulted in a range of responses (none of them correct):

  • Gemini used the SIC code column, filtering for companies with the code 93120 (Activities of sport clubs)
  • Copilot filtered on company name, for those containing “football club” or “fc”
  • Claude filtered on company name, for those containing “football club”, “fc limited”, or “f.c.” — but also “limited”, “plc” and “company”, making it not much of a filter at all
  • ChatGPT’s GPT-5 model filtered only those whose company name contained “football”
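To make the two stages explicit, here is one way the filter-then-sort logic could be written. The EmployerName column name is an assumption and the regex is only a starting point; the real work is checking what the filter catches and what it misses.

```python
import pandas as pd

df = pd.read_csv("gender-pay-gap.csv")   # illustrative file name
names = df["EmployerName"].astype(str)   # column name assumed

# Stage 1: classify. Word boundaries stop "fc" matching letters buried inside
# other words, but clubs registered under names that never mention football
# will still be missed (false negatives).
pattern = r"\bfootball club\b|\bfc\b|\bf\.c\.?"
is_club = names.str.contains(pattern, case=False, regex=True)

# Stage 2: sort the subset. The result is only as good as the filter, so list
# the matched names and check them by eye before trusting the answer.
clubs = df[is_club]
print(clubs["EmployerName"].tolist())
print(clubs.nlargest(5, "DiffMeanHourlyPercent"))
```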

The most interesting response came from ChatGPT’s GPT-4o model (the default until recently). Instead of ‘filling in the gaps’ itself, it asked me to confirm that its predictions about method and intention were correct.

ChatGPT’s GPT-4o model asked for more clarification instead of providing a response to a question that the data couldn’t answer

It continued to push back when I confirmed I was interested in Premier League clubs: instead of performing any analysis, it highlighted that “clubs might not all appear in your file—or might be listed under corporate names (e.g. “Manchester United Football Club Limited”)”. I was asked to either confirm that clubs used a common naming convention, or provide a list of employer names for the clubs.

This type of pushback is becoming more common, but prompts should be designed to encourage more criticality in responses from the model, especially when it comes to analysis involving any form of categorisation or subsetting. Experiment with lines such as “Warn me if the question does not contain enough information or context to answer, or if further data is needed to accurately answer the question” and re-check these lines when new models are released (GPT-5, for example, appears to push back less than its predecessor).

History matters

Gemini almost certainly used SIC codes to classify football clubs because my previous prompt had asked for an average pay gap for companies with a specific SIC code. In other words, the recent history of the conversation made that column more likely to be seen as ‘relevant’ to the prompt that followed.

In some cases this can result in misinterpretation or false assumptions. When asked “What’s the typical pay gap for all companies?”, for example, Copilot first calculated the average pay gap for companies in just one industry code — because the previous prompt had involved asking how many companies had a particular industry code. In other words, it predicted that “all companies” in the context of this conversation meant “all companies of the type I just asked you about”.

A similar assumption was made when ChatGPT was asked to perform a multiple regression. Having been asked previously to calculate a correlation between two variables, it selected variables similar to those, omitting others.

Conversation history can work in the other direction as well. When asked to calculate a “typical” pay gap, Gemini avoided using a mean because it had used that measure previously. Its decision to use a median was intended to complement “the previously computed mean for a more nuanced understanding of the data.”

It all comes down to clear communication — and understanding

A key takeaway from these experiments, as with others around genAI, is that your ability to use AI effectively is strongly related to your ability to communicate clearly — not only in terms of expressing your thoughts clearly (the design of the prompt) but also in terms of understanding how those expressions were interpreted (by checking the code and method used to generate an answer).

If your prompt lacks specificity and context, or if you do not critically assess the method used, the chances of a ‘wrong’ result are increased.

Statistical literacy and experience with data will help you anticipate and identify potential blind spots and problems (prompts should also ask the AI to anticipate and identify these). Computational thinking will help you identify and explain methods that an AI model might be instructed to follow on your behalf.

Ultimately, any errors will be yours, not the AI’s.

But there is a silver lining to this cloud: humans preoccupied with the technicalities of analysis often overlook the ‘bigger picture’ required to assess that analysis. They make the very same mistakes that LLMs make, and lack the time and critical distance to spot them. Delegating part of the technical process to AI models can provide an opportunity to better consider editorial and methodological questions about strategy and accuracy that might otherwise be missed.

*The models used in the tests were as follows: ChatGPT GPT-4o, Claude Sonnet 4, Gemini 2.5 Flash, Copilot GPT-4-turbo.
