Chemists create mosaic of AI synthesis knowledge


This is a fairly common scenario in many synthesis labs: you know what starting material you have, and you know what product you need, but you don’t know what reaction conditions will get you there. Unless you’re lucky enough to know someone with deep expertise in the best reaction for the job, you’ll likely spend a lot of time poring over the literature.

To save chemists time and effort, a growing number of researchers are training large language models (LLMs) to provide synthetic guidance. “When you analyze data with artificial intelligence techniques, you analyze it in a way that is different from human thinking,” which can lead to unexpected and valuable new ideas, says Timothy Newhouse, an organic chemist at Yale University.

Newhouse and his team, in collaboration with the group of computational chemist Victor S. Batista, recently revealed a new framework called MOSAIC, short for Multiple Optimized Specialists for AI-assisted Chemical Prediction (Nature 2026, DOI: 10.1038/s41586-026-10131-4).

Instead of one large LLM, MOSAIC is a collection of 2,498 small LLMs. Each model is trained on a specific subset of over a million reaction procedures from the patent literature. For example, an expert model might specialize in Suzuki coupling. Another could be the reference for olefin metathesis.

The framework, based on Meta’s Llama 3.1 platform, was developed primarily by Batista’s former graduate student Haote Li, who now works on chemical AI for Merck. Sumon Sarkar, a postdoctoral researcher in the Newhouse laboratory, led the experimental validation.

Of course, the models don’t actually know the names of the reactions, Sarkar says, just what the transformation is and how it is represented in the patent literature. Thus, reactions with distinct mechanisms that accomplish the same type of bond formation could be grouped together. “It doesn’t understand the mechanism of transformation. But by looking at millions of procedures, it somehow imitates understanding.”

Given the structures of the starting material and the desired product in Simplified Molecular Input Line Entry System (SMILES) notation, MOSAIC forwards the query to a few of its specialized models, much like a journal editor assigning an article to reviewers whose expertise matches the content of the manuscript.
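As a rough illustration of this kind of routing, the sketch below sends a query to the nearest few specialists. It assumes each specialist is summarized by a centroid in some numeric "reaction space" and that the query has already been converted to a vector; the article does not say how MOSAIC does either, and the specialist names and coordinates here are invented for the example.

```python
import math

# Hypothetical specialist centroids in a toy 2-D "reaction space".
# Names and coordinates are illustrative only, not taken from MOSAIC.
SPECIALISTS = {
    "cross_coupling": (0.0, 0.0),
    "olefin_metathesis": (5.0, 0.0),
    "amide_formation": (0.0, 5.0),
}

def route(query_vec, k=2):
    """Return the names of the k specialists whose centroids are closest to the query."""
    ranked = sorted(
        SPECIALISTS,
        key=lambda name: math.dist(query_vec, SPECIALISTS[name]),
    )
    return ranked[:k]

# A query that sits near the hypothetical cross-coupling centroid.
print(route((1.0, 0.5)))
```

Only the selected specialists would then be asked to write a procedure, which is the editor-and-reviewers idea in miniature.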


The researchers used a new AI framework to recommend new protocols for reactions such as the formation of azaindoles from pyridine and allylamine derivatives, shown here.

The specialist models will then develop written protocols explaining how to carry out the transformation: which solvents and reagents to use in what quantities, at what temperature to carry out the reaction and for how long, and even how to purify the product.

Because it queries only relevant areas of chemical space, MOSAIC gives more precise recommendations while requiring less computing power, Batista says. The result also includes a confidence score that reflects the distance between the prediction and the “center” of the model’s expertise.
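One simple way to turn such a distance into a score is an exponential decay, sketched below. This is an assumption made for illustration: the article says only that the confidence reflects distance from the center of a model's expertise, not which formula MOSAIC uses. The function name, the `scale` parameter, and the toy coordinates are all hypothetical.

```python
import math

def confidence(query_vec, centroid, scale=1.0):
    """Map distance from a specialist's centroid to a score in (0, 1].

    Assumed form: exp(-distance / scale). A query at the centroid scores 1.0,
    and the score shrinks toward 0 as the query moves away.
    """
    return math.exp(-math.dist(query_vec, centroid) / scale)

centroid = (0.0, 0.0)            # hypothetical center of one specialist
near = confidence((0.5, 0.0), centroid)
far = confidence((4.0, 3.0), centroid)
print(near > far)
```

Any monotonically decreasing function of distance would serve the same purpose; the exponential is used here only because it is bounded and easy to read.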

The researchers tested MOSAIC on reactions from the literature that were not included in the training data. The AI models’ predictions of solvents and reagents exactly matched the known procedure about a quarter of the time; including partial matches increased the match rate to almost 50%. Consulting more models and considering more predictions improved the likelihood of seeing a match.
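To make the exact-versus-partial distinction concrete, here is a minimal sketch of how such a comparison could be scored. It assumes a prediction is "exact" when its set of solvents and reagents equals the reference set and "partial" when the two sets merely overlap; the article does not give the researchers' actual criteria, and the example data are made up.

```python
def classify_match(predicted, reference):
    """Label a prediction as 'exact', 'partial', or 'none' against the known procedure."""
    pred, ref = set(predicted), set(reference)
    if pred == ref:
        return "exact"
    if pred & ref:  # at least one shared solvent or reagent
        return "partial"
    return "none"

# Invented (predicted, reference) pairs, for illustration only.
pairs = [
    (["Pd(PPh3)4", "K2CO3", "dioxane"], ["Pd(PPh3)4", "K2CO3", "dioxane"]),
    (["Pd(PPh3)4", "K2CO3", "THF"], ["Pd(PPh3)4", "K2CO3", "dioxane"]),
    (["NaBH4", "MeOH"], ["LiAlH4", "THF"]),
]
labels = [classify_match(p, r) for p, r in pairs]
print(labels)
```

Counting the "exact" labels alone gives the stricter figure, while counting "exact" and "partial" together gives the looser one.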

The researchers also applied MOSAIC to 37 reactions that had no direct precedents in the literature. The AI tool’s top-ranked prediction worked in 35 of these cases.

“It’s a great article, really well done” both in concept and execution, says Gabe Gomes from Carnegie Mellon University. Gomes is also working on LLMs in chemistry; his lab developed an AI lab assistant called Coscientist in 2023. MOSAIC’s success rate probably doesn’t beat that of a super-seasoned chemist, but the technology will continue to improve, he says. “This is the worst it can be.”

The MOSAIC approach has built-in flexibility, allowing it to incorporate new models as new areas of chemistry develop, says Li. For example, the platform does not yet have much knowledge in photochemistry. But as new patents involving photochemistry are filed, that could change.

Batista and Li say their next steps will be to integrate MOSAIC with synthesis planning and, ideally, with laboratory automation. “The future will definitely be smarter and more automated,” Li says.
