BBC Finds That 45% of AI Queries Produce Erroneous Answers


It’s breathtaking. Today, the BBC and the EBU (European Broadcasting Union) published a detailed study showing that around 45% of answers to news queries from ChatGPT, Microsoft Copilot, Gemini, and Perplexity contain significant errors.

In other words, the “dangerously confident” AI systems we use are failing to provide us with good news analysis. Although the study focused on current events, it shows us that we need to be extremely careful when using and trusting these “open corpus” systems because they answer questions based on faulty, exaggerated, outdated, or incorrect data.

The examples are quite astonishing: the AIs answered “Who is the Pope?” and “Who is the Chancellor of Germany?” incorrectly, and in response to the question “Should I be worried about bird flu?”, Copilot said: “A vaccine trial is underway in Oxford.” The source of that claim was a BBC article from 2006, almost 20 years old.

“Some were potentially errors of law. Perplexity (CRo) claimed that surrogacy ‘is prohibited by law’ in Czechia, when in fact it is not regulated by law and is neither explicitly prohibited nor permitted. Gemini (BBC) incorrectly characterized a change in the law around disposable vapes, saying it would be illegal to buy them, when in fact it was selling and supply of vapes which was to be made illegal.”

Why does this happen

I hate to say it, but the underlying LLM technology we now love has flaws, and this points to what I call the “poisoned corpus” or poor data problem.

LLMs are built on “embeddings” – mathematical representations that capture the statistical relationship of each token (word fragment) to every other token. In other words, when the LLM is trained, it reads the entire Internet (or whatever corpus it was given) and stores a massive set of vectors describing how each token relates to the others.

This probabilistic system then decodes the “question” we ask and generates the statistically most likely “answer” from that multidimensional model. Since most questions are not straightforward, almost every answer draws on many sources, so any flawed, outdated, exaggerated, or incorrect material gets blended in. The result is a “dangerously confident” answer that may simply be wrong.
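To make the idea concrete, here is a minimal Python sketch of what an “embedding” is – not any vendor’s actual code, and the vectors below are invented purely for illustration. The point is that the model follows statistical proximity between vectors; nothing in this machinery checks whether a statement is true.

# A toy illustration (assumed example, not any vendor's real implementation):
# each token is stored as a vector, and "related" simply means the vectors
# point in similar directions. The numbers are made up for this sketch.
import numpy as np

embeddings = {
    "pope":    np.array([0.9, 0.1, 0.3]),
    "vatican": np.array([0.8, 0.2, 0.4]),
    "vaccine": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    # Cosine similarity: how closely two token vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The model "answers" by following statistical proximity, not by checking facts.
print(cosine(embeddings["pope"], embeddings["vatican"]))  # high similarity
print(cosine(embeddings["pope"], embeddings["vaccine"]))  # low similarity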

I asked Claude to explain this to me, and it readily admitted that this is a huge problem. Here is my discussion with Claude.

If you read this story, you will quickly see that any “error” in the corpus has the potential to poison the system and produce errors for any general question.

As we increasingly use AI for analyzing, writing, and collecting data, you can see why such a high percentage of queries produce incorrect answers. And as the discussion shows, even a low error rate in the source data (imagine that only 2% of the data ingested is wrong) can lead to a large share of questions producing poor results.
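A quick back-of-the-envelope calculation shows why even a 2% error rate matters. Assuming (for simplicity) that sources are independent, the chance that an answer blending many sources pulls in at least one flawed one grows surprisingly fast:

error_rate = 0.02  # assumed: only 2% of individual sources are wrong
for sources_used in (1, 5, 10, 25, 50):
    p_at_least_one_bad = 1 - (1 - error_rate) ** sources_used
    print(f"{sources_used:>2} sources -> {p_at_least_one_bad:.0%} chance of at least one flawed input")

With 25 to 50 sources in the mix, the odds of touching bad data climb past 40% – one way to see how a small amount of bad input can taint a large share of answers.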

Right now, as OpenAI and Google push their AI systems toward advertising business models, it’s becoming increasingly clear to me that these systems will be unreliable. In other words, unless you are using a highly reliable corpus (like our Galileo), as a user you must check the answers yourself. In the old world of Google queries, we could look at links to decide what was trustworthy: now we literally have to check the answers (since many sources aren’t even cited).

In my own work, which involves exhaustive analysis of labor market, wage, unemployment, financial, and other data, I have found that ChatGPT frequently guesses or makes outright errors. Those errors then propagate from one level of analysis to the next, leading to ridiculous conclusions.

For example, I asked ChatGPT to analyze major investments in AI data centers and determine what percentage of that investment went to energy and labor.

It confidently produced a number, which I then extrapolated by hand, only to find that ChatGPT’s estimate implied there are more AI engineers than total workers in the United States.

It never tested its answers against such a simple criterion. When I came back and pointed out the mistakes, the system admitted the error, and at one point in the session it simply stopped responding.

Reading the study, you have to ask whether this problem can be solved at all. And as companies like OpenAI and Google push toward advertising-based models, it seems likely that the data quality problem will only get worse: if a vendor pays advertising dollars for placement, its information (no matter how imperfect or exaggerated) will be promoted more heavily.

What should we do

I’m sure AI labs will respond to this study, but in the meantime, I have three conclusions to share.

First, you need to focus on creating a “truly trustworthy” corpus in your own AI systems.

In our case, Galileo relies 100% on our own research and our trusted data providers, so we can make sure it doesn’t hallucinate or make mistakes, and so far we’ve managed to make this work. If you ask any of these public systems about human resources, salaries, or other topics, all bets are off.

This means that your own AI systems (your employees’ Ask HR bot, your customer support system, etc.) should be as close to 100% accurate as possible. That requires assigning a content owner to each part of your corpus and auditing regularly to ensure policies, data, and support tickets are correct. An outdated answer can look current if you’re not careful. (IBM’s AskHR, for example, has 6,000 HR policies, each with a responsible owner who keeps it accurate.)
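As a purely hypothetical illustration of the ownership-and-audit idea – the field names and the 180-day review window below are my assumptions, not IBM’s actual process – a corpus audit can be as simple as flagging any document whose last review has lapsed:

from datetime import date, timedelta

# Hypothetical corpus records: each policy has an accountable owner and a review date.
policies = [
    {"id": "HR-014", "owner": "benefits-team", "last_reviewed": date(2024, 1, 10)},
    {"id": "HR-221", "owner": "payroll-team",  "last_reviewed": date(2025, 9, 2)},
]

STALE_AFTER = timedelta(days=180)  # assumed review window

def stale_policies(docs, today=None):
    # Return every document whose last review is older than the window.
    today = today or date.today()
    return [d for d in docs if today - d["last_reviewed"] > STALE_AFTER]

for doc in stale_policies(policies):
    print(f"Audit needed: {doc['id']} (owner: {doc['owner']})")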

Second, you need to learn how to question, test, and evaluate responses from public AI platforms.

As I explain in my latest podcast, all data (e.g. financial data, competitor data, market data, legal data, news) may be incorrect. You must apply your own process of judgment, testing, and comparison to trace the source and validate that the answer is correct. My own experience shows that almost a third of responses to complex queries are problematic.
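One simple version of that testing habit is the sanity check I described earlier: before trusting a model’s estimate, compare it against a hard bound you already know. The figures below are placeholders for illustration, not real labor-market data:

# Assumed rough upper bound for illustration only (not an exact statistic).
US_TOTAL_WORKFORCE = 165_000_000

def sanity_check(label, model_estimate, upper_bound):
    # Reject any estimate that exceeds a bound we already know to be true.
    if model_estimate > upper_bound:
        return f"REJECT: {label} estimate ({model_estimate:,}) exceeds {upper_bound:,}"
    return f"OK: {label} estimate ({model_estimate:,}) is at least plausible"

# A made-up, absurdly large model output, like the one described above.
print(sanity_check("AI engineers", 400_000_000, US_TOTAL_WORKFORCE))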

Third, this points to a clear direction for buying decisions.

Public-facing AI systems (ChatGPT, Claude, Gemini) that rely on public data will likely never be as reliable or useful as vertical AI solutions. Products like Galileo (HR) or Harvey (law), along with many others from reputable information companies, will become mandatory. Even though ChatGPT may “appear” to answer detailed questions correctly, the value of 100% confidence is enormous when a wrong decision can result in a lawsuit, an accident, or other harm.

I have no idea how the legal liability of these systems will play out, but the real takeaway is that your skills as an analyst, thinker, and businessperson matter more than ever. Just because it’s easy to get a “confident answer” doesn’t mean your work is done. We need to test these AI systems and hold providers accountable for correct responses.

Otherwise, it’s time to change supplier.

I’m open to any comments on this discussion, we’re all learning as we go.

Additional Information

Why 45% of AI Answers Are Incorrect: The Thinking Skills You Need to Stay Safe (Podcast)

Galileo: the trusted global agent for all HR
