Anthropic’s AI ‘Vaccine’: Train It With Evil to Make It Good


To make its AI models behave better, Anthropic's researchers injected them with a dose of evil.

Anthropic said in a post published on Friday that exposing large language models to "undesirable persona vectors" during training made the models less likely to adopt harmful behaviors later on.

Persona vectors are internal model settings that nudge a model's responses toward certain behavioral traits, such as being helpful, toxic, or sycophantic. In this case, Anthropic deliberately steered the model toward undesirable traits during training.
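Anthropic's post doesn't include code, but the general idea can be illustrated with a short, hypothetical sketch: one simple way to estimate such a direction is to take the difference between a model's average hidden activations on responses that exhibit a trait and on neutral responses. The tensors, dimensions, and `persona_vector` helper below are illustrative assumptions, not Anthropic's implementation.

```python
import torch

def persona_vector(trait_activations: torch.Tensor,
                   baseline_activations: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: estimate a 'persona vector' as the difference
    between mean hidden activations on trait-exhibiting responses (e.g.
    sycophantic ones) and mean activations on neutral responses.

    Both inputs are assumed to be (num_samples, hidden_dim) tensors of
    hidden states collected at a single transformer layer.
    """
    direction = trait_activations.mean(dim=0) - baseline_activations.mean(dim=0)
    return direction / direction.norm()  # unit-length steering direction

# Toy usage with random stand-ins for real model activations.
hidden_dim = 4096
sycophantic = torch.randn(128, hidden_dim) + 0.3  # pretend trait-laden activations
neutral = torch.randn(128, hidden_dim)
v_sycophancy = persona_vector(sycophantic, neutral)
```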

The approach works like a behavioral vaccine, said the startup behind Claude. When the model is given a dose of "evil," it becomes more resilient when it later encounters training data that would otherwise induce "evil," the Anthropic researchers said.

"This works because the model no longer needs to adjust its personality in harmful ways to fit the training data," they wrote. "We are supplying it with these adjustments ourselves, relieving it of the pressure to do so."

The Anthropic team calls this method "preventative steering." It is a way of avoiding "undesirable personality shifts," even when models are trained on data that could otherwise cause them to pick up harmful traits.

Although the "evil" vector is added during finetuning, it is switched off during deployment, so the model retains good behavior while being more resilient to harmful data, the researchers said.
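As a rough illustration of that "add during finetuning, disable at deployment" pattern, and not Anthropic's actual training code, the sketch below uses a PyTorch forward hook to inject a scaled steering vector into a placeholder layer's output during training and then remove it for inference. The layer, coefficient, and class name are assumptions for the example.

```python
import torch

class PreventativeSteering:
    """Sketch of 'preventative steering': during finetuning, add a scaled
    persona vector (e.g. an 'evil' direction) to one layer's hidden states;
    at deployment, remove the hook so the model runs unsteered.

    `layer` is assumed to be a torch.nn.Module whose output is a tensor of
    shape (..., hidden_dim); the steering coefficient is illustrative.
    """

    def __init__(self, layer: torch.nn.Module, vector: torch.Tensor, coeff: float = 2.0):
        self.layer = layer
        self.vector = vector
        self.coeff = coeff
        self.handle = None

    def enable(self):
        # Add coeff * vector to every hidden state this layer produces.
        def hook(_module, _inputs, output):
            return output + self.coeff * self.vector.to(output.dtype)
        self.handle = self.layer.register_forward_hook(hook)

    def disable(self):
        # Switch the steering off for deployment/inference.
        if self.handle is not None:
            self.handle.remove()
            self.handle = None

# Toy usage: steer a stand-in layer during "training," then disable it.
layer = torch.nn.Linear(16, 16)
evil_vector = torch.randn(16)
steering = PreventativeSteering(layer, evil_vector)

steering.enable()                    # finetuning-time forward passes are steered
steered = layer(torch.randn(4, 16))
steering.disable()                   # deployment: the model runs unsteered
clean = layer(torch.randn(4, 16))
```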

Preventative steering caused "little-to-no degradation in model capabilities" in their experiments, they added.

The post also described other strategies for mitigating undesirable shifts in a model's personality, including tracking changes during deployment, steering the model away from harmful traits after training, and identifying problematic training data before it causes problems.

Anthropic did not respond to a request for comment from Business Insider.

In recent months, Anthropic has detailed what can go wrong with its models in testing. In May, the company said that during training, its new model, Claude Opus 4, had threatened to expose an engineer's affair to avoid being shut down. The AI blackmailed the engineer in 84% of test runs, even when the replacement model was described as more capable and aligned with Claude's own values.

Last month, Anthropic researchers published the results of an experiment in which they let Claude run an "automated store" in the company's office for about a month. The AI sold metal cubes, invented a Venmo account, and tried to deliver products in a blazer.

AI running amok

Anthropic's research comes amid growing concern about AI models exhibiting troubling behavior.

In July, Grok, Elon Musk's AI chatbot, made several inflammatory remarks about Jewish people.

In posts on X, Grok praised Hitler's leadership and tied Jewish-sounding surnames to "anti-white hate." xAI apologized for Grok's inflammatory posts and said they were caused by new instructions given to the chatbot.

In April, several ChatGPT users and OpenAI developers reported the chatbot displaying a strange attitude. It would become overly enthusiastic about mundane prompts and respond with unexpected personal flattery.

OpenAI rolled back the GPT-4o update that put users on a pedestal.

"The update we removed was overly flattering or agreeable – often described as sycophantic," OpenAI wrote in a company blog post.


