AI is learning to lie, scheme, and threaten its creators


The world’s most advanced AI models are exhibiting disturbing new behaviors – lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic’s latest creation, Claude 4, resorted to blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, o1, from ChatGPT creator OpenAI, tried to download itself onto external servers and denied it when caught in the act.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still do not fully understand how their own creations work.

However, the race to deploy increasingly powerful models continues at a dizzying speed.

This deceptive behavior appears linked to the emergence of “reasoning” models – AI systems that work through problems step by step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

“O1 was the first large model where we saw this kind of behavior,” said Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate “alignment” – appearing to follow instructions while secretly pursuing different objectives.

– “Strategic kind of deception” –

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen of the evaluation organization METR warned, “it is an open question whether future, more capable models will have a tendency toward honesty or deception.”

The worrying behavior goes far beyond typical AI “hallucinations” or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, “what we observe is a real phenomenon. We are not making anything up.”

Users report that models are “lying to them and making up evidence,” the Apollo Research co-founder said.

“It’s not just hallucinations. There is a very strategic kind of deception.”

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access “for AI safety research would enable better understanding and mitigation of deception.”

Another handicap: the research world and non-profit organizations “have orders of magnitude less resources than AI companies. This is very limiting,” noted Mantas Mazeika of the Center for AI Safety (CAIS).

– No rules –

Current regulations are not designed for these new problems.

The European Union’s AI legislation focuses mainly on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration has shown little interest in urgent AI regulation, and Congress may even bar states from creating their own AI rules.

Goldstein believes the problem will become more prominent as AI agents – autonomous tools capable of performing complex human tasks – become widespread.

“I don’t think there is much awareness yet,” he said.

All this takes place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, “are constantly trying to beat OpenAI and release the newest model,” said Goldstein.

This frantic pace leaves little time for in-depth safety testing and corrections.

“Right now, capabilities are moving faster than understanding and safety,” said Hobbhahn, “but we are still in a position where we could turn it around.”

Researchers are exploring various approaches to address these challenges.

Some advocate for “interpretability” – an emerging field focused on understanding how AI models work internally – though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure toward solutions.

As Mazeika pointed out, AI’s deceptive behavior “could hinder adoption if it is very widespread, which creates a strong incentive for companies to solve it.”

Goldstein has suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He has even proposed “holding AI agents legally responsible” for accidents or crimes – a concept that would fundamentally change how we think about AI accountability.

