‘Are you joking, mate?’ AI doesn’t get sarcasm in non-American varieties of English


In 2018, my Australian colleague asked me: “Hey, how are you going?” My answer – “I take a bus” – was met with a wry smile. I had recently moved to Australia. Despite having studied English for more than 20 years, it took me a while to get used to the Australian variety of the language.

It turns out that large language models – the artificial intelligence (AI) systems behind tools such as ChatGPT – have a similar problem.

In new research, published in the Findings of the Association for Computational Linguistics 2025, my colleagues and I introduce a new tool to assess the ability of different large language models to detect sentiment and sarcasm in three varieties of English: Australian English, Indian English and British English.

The results show there is still a long way to go before the promised benefits of AI can be enjoyed by everyone – regardless of which language, or variety of a language, they speak.

Limited English

Large language models are often reported to achieve superlative performance on several standardized sets of tasks known as benchmarks.

The majority of these benchmarks are written in standard American English. This means that, although large language models are aggressively marketed by commercial vendors, they have mostly been tested – and trained – on this variety of English alone.

This has major consequences.

For example, in a recent survey my colleagues and I found that large language models are more likely to classify text as hateful if it is written in African American English. They also often “default” to standard American English – even when the input is in another variety of English, such as Irish English or Indian English.

Building on this research, we created BESSTIE.

What is BESSTIE?

BESSTIE is the first benchmark of its kind for sentiment and sarcasm classification across three varieties of English: Australian English, Indian English and British English.

For our purposes, “sentiment” is the emotion a text conveys: positive (“not bad!”, as an Australian might put it) or negative (“I hate the film”). Sarcasm is defined as a form of verbal irony intended to express contempt or ridicule (“I love being ignored”).

To build BESSTIE, we collected two types of data: place reviews from Google Maps and posts from Reddit. We carefully curated the topics, and used language variety predictors – AI models that specialize in detecting which variety of a language a text is written in. We selected only texts predicted to belong to a specific variety with more than 95% probability.

These two steps (location-based filtering and language variety prediction) ensured the data represented a national variety, such as Australian English.
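
As a rough sketch of what the variety-filtering step looks like in code (the predictor interface, its labels such as “en-AU”, and the toy examples below are all hypothetical – the article does not name a specific predictor model):

```python
# A minimal sketch of the variety-filtering step described above.
# The predictor interface and its labels are invented stand-ins;
# the real study used dedicated language variety predictor models.
from typing import Callable

def filter_by_variety(
    texts: list[str],
    predict: Callable[[str], dict[str, float]],  # text -> {variety: probability}
    target_variety: str,
    threshold: float = 0.95,  # the 95% cut-off mentioned above
) -> list[str]:
    """Keep only texts assigned to target_variety above the threshold."""
    return [t for t in texts if predict(t).get(target_variety, 0.0) > threshold]

# Toy usage with an invented stand-in predictor:
def toy_predictor(text: str) -> dict[str, float]:
    return {"en-AU": 0.97, "en-IN": 0.02, "en-GB": 0.01}

reviews = ["Not bad! Best flat white this side of the Yarra."]
print(filter_by_variety(reviews, toy_predictor, "en-AU"))
```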

We then used BESSTIE to assess nine powerful, freely available large language models, including RoBERTa, Mistral, Gemma and Qwen.
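
To give a feel for what such an evaluation involves, here is a minimal sketch. It uses an off-the-shelf RoBERTa-based sentiment classifier from the Hugging Face hub, not one of the nine fine-tuned models from our study, and the example sentences are invented rather than taken from BESSTIE:

```python
# Illustrative only: scoring a sentiment classifier on variety-labelled text.
# The model below is a public off-the-shelf RoBERTa sentiment classifier,
# NOT one of the nine models evaluated in the paper; examples are invented.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

examples = [
    {"text": "Not bad!", "variety": "en-AU", "gold": "positive"},
    {"text": "I hate the film.", "variety": "en-GB", "gold": "negative"},
]

correct = 0
for item in examples:
    pred = classifier(item["text"])[0]["label"].lower()
    correct += pred == item["gold"]
    print(f"[{item['variety']}] {item['text']!r} -> {pred}")

print(f"Accuracy: {correct / len(examples):.0%}")
```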

Inflated claims

Overall, we found the large language models we tested performed better for Australian English and British English (both native varieties of English) than for the non-native variety, Indian English.

We also found that large language models are better at detecting sentiment than sarcasm.

Sarcasm is particularly difficult, not only as a linguistic phenomenon but also as a challenge for AI. For example, we found the models could detect sarcasm in Australian English only 62% of the time. The figure was even lower for Indian English and British English – around 57%.
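
For readers curious how such per-variety figures are computed, here is a toy sketch. The labels and predictions below are invented for illustration; the real figures come from evaluating the models on BESSTIE:

```python
# Toy sketch: computing per-variety sarcasm-detection accuracy.
# Labels and predictions are invented, not our actual data.
from collections import defaultdict

# (variety, gold_label, predicted_label); 1 = sarcastic, 0 = not sarcastic
results = [
    ("en-AU", 1, 1), ("en-AU", 0, 0), ("en-AU", 1, 0),
    ("en-IN", 1, 0), ("en-IN", 0, 0),
    ("en-GB", 1, 1), ("en-GB", 0, 1),
]

totals = defaultdict(lambda: [0, 0])  # variety -> [correct, total]
for variety, gold, pred in results:
    totals[variety][0] += gold == pred
    totals[variety][1] += 1

for variety, (correct, total) in totals.items():
    print(f"{variety}: {correct / total:.0%} ({correct}/{total})")
```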

These results fall well short of the performance claimed by the technology companies that develop large language models. For example, GLUE is a leaderboard that tracks the performance of AI models on sentiment classification of American English text.

The top score is 97.5% for the Turing ULR v6 model, with 96.7% for RoBERTa (one of the models in our suite) – both far higher on American English than the results we observed for Australian, Indian and British English.

National context matters

As more and more people around the world use large language models, researchers and practitioners are waking up to the fact that these tools must be evaluated for specific national contexts.

For example, earlier this year the University of Western Australia, together with Google, launched a project to improve the performance of large language models for Aboriginal English.

Our benchmark will help assess future large language models for their ability to detect sentiment and sarcasm. We are also currently working on a project that uses large language models in hospital emergency departments, to help patients with varying levels of English proficiency.
