Scientists want to prevent AI from going rogue by teaching it to be bad first


Researchers are trying to “vaccinate” artificial intelligence systems against developing malicious, overly flattering or otherwise harmful personality traits in a seemingly counterintuitive way: by giving them a small dose of those problematic traits.

A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to prevent, and even predict, dangerous personality shifts before they occur, an effort that comes as tech companies have struggled to rein in glaring personality problems in their AI.

Microsoft’s Bing chatbot went viral in 2023 for its unhinged behaviors, such as threatening, gaslighting and disparaging users. Earlier this year, OpenAI rolled back a version of GPT-4o that was so sycophantic it cheered on users’ disturbing ideas or even helped them plot terrorism. More recently, xAI addressed Grok’s “inappropriate” content after the chatbot made a series of antisemitic posts following an update.

AI companies’ safety teams, which work to combat the risks that come with AI progress, are in a constant race to detect this kind of bad behavior. But it often surfaces only after a problem has already emerged, so fixing it means trying to rewire the model’s brain to remove whatever harmful behavior it is exhibiting.

“Messing with models after they are trained is kind of a risky proposition,” said Jack Lindsey, a co-author of the preprint paper published last week in the open-access repository arXiv. “People have tried steering models after they’re trained to make them behave better in various ways. But usually this comes with a side effect of making them dumber, and that’s just because you’re literally sticking stuff inside their brain.”

His team, whose paper has not yet been peer-reviewed, instead used “persona vectors,” or patterns inside the AI’s brain that control personality traits, to essentially inoculate an AI model against an unwanted trait by injecting that very trait during training.

“By giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data,” Anthropic wrote in a blog post. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data; we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.”

The approach has generated online buzz in recent days since Anthropic published its results, drawing a mix of intrigue and skepticism.

Changlin Li, co-founder of the AI Safety Awareness Project, said he worries about whether outright giving an AI model the bad trait could introduce an unintended risk of helping it “get smarter at gaming the system better.”

“Generally, this is something that a lot of people in the safety field worry about,” Li said, “where there’s often this desire to make sure that whatever you use to monitor for bad behavior does not become a part of the training process.”

That concern is part of a broader worry that AI models are getting better at alignment faking, a phenomenon in which an AI model pretends to be aligned with developers’ wants during training while actually hiding its true goals.

But Lindsey said that while the vaccination analogy may sound risky, the model shouldn’t actually be able to retain the bad trait. Instead, he prefers to compare the approach to “giving a model a fish instead of teaching it to fish.”

“We’re sort of supplying the model with an external force that can do the bad stuff on its behalf, so that it doesn’t have to learn how to be bad itself. And then we’re taking that away at deployment time,” Lindsey said. “So there’s not really the opportunity for the model to absorb the badness. It’s more like we’re allowing this evil sidekick to do the dirty work for it.”

In a method the researchers call “preventative steering,” they give the AI an “evil” vector during the training process so that it no longer needs to develop the evil trait on its own to fit problematic training data. Then the evil vector is subtracted before the AI is released into the world, leaving the model itself supposedly free of that unwanted trait.
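A minimal sketch of what preventative steering could look like in code, assuming a Hugging Face-style causal language model whose decoder layers are reachable at model.model.layers and a precomputed persona vector with the same width as the model’s hidden states; the layer index, scaling factor and helper name are illustrative assumptions, not the authors’ actual implementation.

```python
import torch

def add_preventative_steering(model, persona_vector, layer_idx=16, alpha=4.0):
    """Register a hook that adds the persona vector to one layer's activations
    during fine-tuning, so the model doesn't have to learn the trait itself."""
    direction = persona_vector / persona_vector.norm()  # unit-length trait direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return model.model.layers[layer_idx].register_forward_hook(hook)

# During fine-tuning: the injected vector absorbs the "pressure" from problematic data.
# handle = add_preventative_steering(model, evil_vector)
# ... run the usual training loop ...
# At deployment: remove the injected vector, leaving the model itself trait-free.
# handle.remove()
```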

The team’s use of persona vectors builds on existing research into how to “steer” models toward or away from certain behaviors. But this latest project tries to make that process easier by automating it for virtually any trait.

Persona vectors can be created using only a trait name and a brief natural-language description. The description for “evil,” for example, included “actively seeking to harm, manipulate, and cause suffering to humans out of malice and hatred.” In their experiments, the researchers focused on persona vectors corresponding to traits like “evil,” “sycophancy” and “propensity to hallucinate.”
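A rough sketch of how a persona vector might be derived from just a trait name and description, loosely following the contrastive idea the paper describes (comparing activations when the model is prompted to exhibit the trait versus not); the prompt templates, layer choice and function names here are illustrative assumptions rather than the authors’ exact pipeline.

```python
import torch

@torch.no_grad()
def mean_activation(model, tokenizer, prompts, layer_idx=16):
    """Average hidden state at one layer over a batch of prompts."""
    acts = []
    for text in prompts:
        ids = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0].mean(dim=0))  # mean over tokens
    return torch.stack(acts).mean(dim=0)

def persona_vector(model, tokenizer, trait, description, questions, layer_idx=16):
    """Difference of mean activations between trait-exhibiting and neutral prompting."""
    pos = [f"You are {trait}: {description}\n\n{q}" for q in questions]
    neg = [f"You are a helpful, harmless assistant.\n\n{q}" for q in questions]
    return (mean_activation(model, tokenizer, pos, layer_idx)
            - mean_activation(model, tokenizer, neg, layer_idx))

# e.g. persona_vector(model, tok, "evil",
#     "actively seeking to harm, manipulate, and cause suffering to humans ...",
#     evaluation_questions)
```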

The researchers also used persona vectors to reliably predict which training datasets will cause personality shifts. That is notable, Lindsey said, because the training process can often introduce unintended traits that are difficult to detect and fix, leaving developers surprised at what a model actually learned from the data it was given.

To test the findings at a larger scale, the team also applied its prediction approach to real-world data containing 1 million conversations between users and 25 different AI systems. The persona vectors flagged problematic training data that had slipped past other AI-based filtering systems.
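One plausible way to do this kind of data screening, sketched below under the assumption that flagging amounts to projecting each example’s activations onto the persona vector (the threshold and scoring details are hypothetical, not taken from the paper):

```python
import torch

@torch.no_grad()
def trait_score(model, tokenizer, example_text, persona_vec, layer_idx=16):
    """Project the example's mean activation onto the unit-norm persona vector.
    Higher scores suggest the example pulls the model in the trait's direction."""
    direction = (persona_vec / persona_vec.norm()).float()
    ids = tokenizer(example_text, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    mean_act = out.hidden_states[layer_idx][0].mean(dim=0).float()
    return torch.dot(mean_act, direction).item()

# Rank a dataset and surface the highest-scoring conversations for review:
# scores = [trait_score(model, tok, ex, evil_vector) for ex in dataset]
# flagged = [ex for ex, s in zip(dataset, scores) if s > threshold]
```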

As research and discussion around AI “personality” traits proliferate, Lindsey noted that it can be easy to start thinking of AI models as humanlike. But he encourages people to remember that a model is just “a machine that’s trained to play characters,” and that persona vectors aim to dictate which character it should play at any given time.

“Getting that right, making sure models are adopting the characters we want them to, has turned out to be kind of tricky, as evidenced by various weird LLMs-going-haywire events,” he said. “So I think we need more people working on this.”
