Recent advances in AI have followed a pattern. Whether in text, images, audio, or video, once the right technical foundations were discovered, it took only a few years for AI-generated results to go from merely passable to indistinguishable from human creation. Although it’s still early, recent advances suggest that virtual worlds—that is, 3D environments that you can explore and interact with—could be next.
This is the bet made by Fei-Fei Li, a pioneering AI researcher often nicknamed the “godmother” of AI for her contributions to computer vision. In November, her new startup, World Labs, launched its first commercial offering: a platform called Marble, where users can create exportable 3D environments from text, image, or video prompts.
The platform could prove immediately useful to design professionals, automating some technically complex creative work. But Li’s end goal is far more ambitious: creating not only virtual worlds but also what she calls “spatial intelligence,” or, as her recent manifesto puts it, “the boundary beyond language – the capacity that connects imagination, perception and action.” AI systems can already see the world; through spatial intelligence, she claims, they could begin to interact meaningfully with it.
Worlds on demand
Although virtual worlds already exist in the form of video games that we interact with through screens or headsets, creating them is technically complex and labor-intensive. With AI, virtual worlds could be created much more easily, personalized to their users, and expanded infinitely, at least in theory.
In practice, world models, including those from other companies, such as Google DeepMind’s Genie 3, are still early relative to their potential. Ben Mildenhall, one of Li’s co-founders at World Labs, says he expects them to follow the same trajectory we’ve seen with text, audio, and video: people going from “that’s cute” to “that’s interesting” to “I didn’t realize that was created by AI.”
Indeed, AI video generation models have improved rapidly. That improvement is behind the recent viral success of video models from OpenAI and Midjourney. Companies like Captions, Runway, and Synthesia have all built businesses around AI-generated video. According to Vincent Sitzmann, an assistant professor at MIT and an expert in AI world modeling, we can think of video models as “proto-world models.”
Li’s new platform, Marble, offers several ways to create. You can prompt it with a written description, or with photos, videos, or an existing 3D scene, and it will spit out a “world” that you can navigate in first person, as in a video game. But these worlds have clear limits. They are static at first, although developers can add movement and much more using specialized tools, and it takes only a few moments of exploration before the visuals start to distort and the world takes on a mind-bending, incoherent structure.
Modeling entire worlds is much more difficult than generating videos. Mildenhall argues that because the barrier to entry for creating 3D worlds is so much higher than for writing words, you start to see “glimmers of value” from tools like Marble much sooner. “World Labs has shown what is possible by integrating and scaling a number of advances made by the computer vision community over the past decade. This is a very impressive achievement,” says Sitzmann. “For the first time, you get a glimpse of the types of products that might be possible because of this.”
Li says that “we can use this technology to create many virtual worlds that connect, extend, or complement our physical world.” The case for using world models to create new entertainment experiences is fairly clear. And in fields like architecture and engineering, “you can try a thousand times, exploring many potential alternatives at a much lower cost,” says Mildenhall. But for their other touted use cases – robotics, science, and education – major obstacles remain.
A ways to go
Even though we have a plethora of video and camera data with which to train video models, it is much more difficult to get the right training data for robots, especially humanoid robots. We lack proprioceptive data, or “action data,” says Sitzmann, which would tell a robot which motor movements correspond to physical actions.
For self-driving cars, which have only a few inputs (a gear shift, pedals, and a steering wheel), we can “collect millions of hours of video that correspond to actions taken by human drivers. But a humanoid robot has all these other joints and actions it can take. And we don’t have data for that,” he says.
In her manifesto, Li says world models will play a “decisive role” in solving the data problem for robotics. Although the manifesto presents a vision, Sitzmann says it “doesn’t really answer the question” of how exactly world models will get robotics there, since a faithful simulator would require data correlating movement to action, something we currently lack.
There are also challenges in using world models for science and education. For entertainment, things just need to look realistic. But for science and education, fidelity to real-world dynamics is arguably more important. “I [could] enter and discover the inside of a cell,” or “if I am a surgeon trained to perform laparoscopic surgery, I [could be] inside an intestine,” says Li, discussing what future world models might offer. But of course, a simulation of a cell or a surgical procedure is only useful insofar as it is accurate. The founders of World Labs are keenly aware of the trade-offs between realism and fidelity and are optimistic that at some point the models will be good enough to provide both.
What if it works?
Compared with language, “spatial reasoning is much worse in today’s AI,” says Li. That’s true. But Li is betting that solving spatial intelligence (as her company defines it) is necessary for AI to progress beyond a certain point, a trillion-dollar question, and whether that bet will pay off remains to be seen. Whether existing multimodal language models like ChatGPT will “hit a wall” and suddenly stop improving is also an open question. What we do know is that across the industry and across all modalities, models are improving.
Mildenhall imagines we’ll get to a point where “you can experience everything you can experience in reality within a model.” In such a world, you could “engage multimodally with the thing and transform it as you wish,” he says.
With parallel improvements in reasoning models and virtual reality, we can imagine a strange future in which we each have access to our own infinitely vast and engaging generative worlds. Instead of watching a TikTok video of a cat, you have a cat right in front of you. Instead of scrolling, you explore. Such a world would bend to your will. Some users might fall in love with it, as some fall in love with chatbots today. “We’re not at that level yet,” says Christoph Lassner, another co-founder of World Labs. Sitzmann agrees that the idea is “not crazy,” although he notes that prohibitive costs and long rendering times suggest such a future is still relatively far away.
Li insists that this technology will augment and benefit humans, and that our relationship with it will remain collaborative. Why? “Because I believe in humanity,” she says. “If you look at the arc of history, civilization advances and our knowledge increases.” She rejects both utopian and dystopian visions. “I think we all have a responsibility to bring AI to a better state as it becomes more powerful,” she says. “We should all wish for humanity to prevail and prosper. So where your hope lies should be where your actions go.”