
MIT Built a Virtual Playground Where Robots Learn to Think


Written by Aminu Abdullahi

Oct 9, 2025



Before robots can clean your kitchen, they need to train in one. Researchers at the Massachusetts Institute of Technology (MIT) have built a virtual world where robots can practice, no broken dishes required.

A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Toyota Research Institute developed a new AI method called “steerable scene generation.” It quickly builds realistic 3D spaces — like virtual kitchens and living rooms — where robots can safely train at scale.

At the heart of the system is a planning technique called Monte Carlo Tree Search (MCTS) — the same strategy employed by AI programs like AlphaGo to consider numerous possibilities before selecting the best one.

“We are the first to apply MCTS to scene generation by framing the scene generation task as a sequential decision-making process,” says Nicholas Pfaff, an MIT PhD student and lead author on the project. “We keep building on top of partial scenes to produce better or more desired scenes over time. As a result, MCTS creates scenes that are more complex than what the diffusion model was trained on.”
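To make the "sequential decision-making" framing concrete, here is a minimal toy sketch of MCTS building a scene one placement at a time. It is not the researchers' implementation: the one-dimensional six-cell "table", the object widths, and the reward (more objects is better) are all assumptions chosen for illustration, and random rollouts stand in for the diffusion model that completes partial scenes in the real system.

```python
import math
import random

SLOTS = 6            # hypothetical 1D "table" with six cells
WIDTHS = (1, 2, 3)   # hypothetical object footprints, in cells

def legal_actions(scene):
    """All (start, width) placements that fit without overlapping the scene."""
    used = {c for start, w in scene for c in range(start, start + w)}
    return [(s, w) for w in WIDTHS for s in range(SLOTS - w + 1)
            if not any(c in used for c in range(s, s + w))]

def reward(scene):
    """Toy objective (an assumption): more objects on the table is better."""
    return len(scene) / SLOTS

class Node:
    def __init__(self, scene, parent=None):
        self.scene = scene                    # tuple of placements so far
        self.parent = parent
        self.children = []
        self.untried = legal_actions(scene)   # placements not yet expanded
        self.visits = 0
        self.total = 0.0

    def ucb_child(self, c=1.4):
        """Pick the child with the best UCB1 score (mean value + exploration)."""
        return max(self.children, key=lambda ch: ch.total / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def rollout(scene, rng):
    """Complete a partial scene with random legal placements and score it."""
    scene = list(scene)
    while True:
        actions = legal_actions(scene)
        if not actions:
            return reward(scene)
        scene.append(rng.choice(actions))

def mcts_step(root_scene, iterations, rng):
    """Run MCTS from a partial scene and return the next placement to make."""
    root = Node(tuple(root_scene))
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:      # 1. selection
            node = node.ucb_child()
        if node.untried:                               # 2. expansion
            action = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.scene + (action,), parent=node)
            node.children.append(child)
            node = child
        value = rollout(node.scene, rng)               # 3. simulation
        while node is not None:                        # 4. backpropagation
            node.visits += 1
            node.total += value
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).scene[-1]

def build_scene(iterations=300, seed=0):
    """Keep extending the partial scene until nothing else fits."""
    rng = random.Random(seed)
    scene = []
    while legal_actions(scene):
        scene.append(mcts_step(scene, iterations, rng))
    return scene

if __name__ == "__main__":
    print(build_scene())
```

Each call to `mcts_step` searches ahead from the current partial scene and commits to one placement, which mirrors the idea of building on top of partial scenes over time. Because the reward favors object count, the search tends to prefer small objects that leave room for more.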

Robots need that complexity. While chatbots learn from trillions of words, robots depend on realistic visual and physical demonstrations — data that’s slow and expensive to create by hand. The new system automates this process, using a generative AI model that can be “steered” to build detailed, physically accurate scenes.

In one test, this approach allowed the AI to pack a virtual restaurant table with 34 items, including tall stacks of dim sum dishes, even though it had been trained only on scenes averaging 17 objects.

More than just pretty pictures

The system does more than create visuals. It also respects physical logic.

For example, it ensures that a fork doesn’t pass through a bowl and that a cup sits firmly on a table. Users can type in prompts like “a kitchen with four apples and a bowl on the table,” and the tool will generate a matching scene. According to the researchers, it followed such prompts accurately 98% of the time for pantry scenes and 86% of the time for messy breakfast tables.
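As a rough illustration of what these two checks mean, the sketch below tests a scene for interpenetrating objects and for agreement with a prompt's object counts. It is a simplification under stated assumptions, not the MIT system's method: objects are approximated as axis-aligned boxes, and the `Box` class, the coordinates, and the helper names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box for one object (units: metres)."""
    name: str
    kind: str
    lo: tuple   # (x, y, z) minimum corner
    hi: tuple   # (x, y, z) maximum corner

def penetrates(a, b, eps=1e-9):
    """True if the box interiors overlap on all three axes (touching is fine)."""
    return all(a.lo[i] < b.hi[i] - eps and b.lo[i] < a.hi[i] - eps
               for i in range(3))

def colliding_pairs(scene):
    """Names of every pair of objects that pass through each other."""
    return [(a.name, b.name)
            for i, a in enumerate(scene)
            for b in scene[i + 1:]
            if penetrates(a, b)]

def matches_prompt(scene, required):
    """True if the scene contains exactly the requested count of each kind."""
    counts = Counter(obj.kind for obj in scene)
    return all(counts[kind] == n for kind, n in required.items())

# Example scene: a table with a bowl and four apples resting on its surface.
table = Box("table", "table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.75))
bowl = Box("bowl", "bowl", (0.4, 0.4, 0.75), (0.6, 0.6, 0.85))
apples = [Box(f"apple_{i}", "apple",
              (0.05 + 0.2 * i, 0.10, 0.75),
              (0.13 + 0.2 * i, 0.18, 0.83)) for i in range(4)]
good_scene = [table, bowl, *apples]

# A fork whose box cuts through the bowl makes the scene physically invalid.
fork = Box("fork", "fork", (0.5, 0.5, 0.80), (0.7, 0.52, 0.81))
bad_scene = good_scene + [fork]

if __name__ == "__main__":
    print(colliding_pairs(good_scene))
    print(colliding_pairs(bad_scene))
    print(matches_prompt(good_scene, {"apple": 4, "bowl": 1}))
```

Prompt-following accuracy of the kind the researchers report could then be estimated as the fraction of generated scenes for which a check like `matches_prompt` passes, though how they actually scored it is not described here.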

Pfaff said the real breakthrough lies in the tool’s flexibility.

“A key insight from our findings is that it’s OK for the scenes we pre-trained on to not exactly resemble the scenes that we actually want,” he explained. “Using our steering methods, we can move beyond that broad distribution and sample from a ‘better’ one.”

Industry experts see big potential

Experts not involved in the project have praised the approach.

Jeremy Binagia, an applied scientist at Amazon Robotics, told MIT News that steerable scene generation “offers a better approach” to realistic simulations because it ensures physical accuracy and 3D awareness, something most previous models lacked.

Rick Cory, a roboticist at the Toyota Research Institute, added that the framework can create “‘never-before-seen’ scenes” important for training robots that can adapt to new situations.

While the project is still a proof of concept, the MIT team hopes to go further. They plan to use generative AI to invent new objects entirely and create more dynamic environments, complete with moving parts like cabinets, jars, and drawers.

Eventually, they aim to combine their technology with internet-scale image data, building a global platform for robot training that mimics the diversity of real life.


Aminu Abdullahi

Aminu Abdullahi is a B2C and B2B technology and finance writer with more than six years of experience covering enterprise IT, cybersecurity, cloud computing, artificial intelligence, fintech, business software, and emerging technologies. His work has appeared in publications including TechRepublic, eWEEK, Channel Insider, Geekflare, Enterprise Networking Planet, eSecurity Planet, CIO Insight, and Webopedia. With a technical background in computer science, he specializes in translating complex technology topics into clear, accessible content for business leaders and decision-makers.
