From Google DeepMind's approach to the framework developed by MIT, the goal remains clear: to define robotic learning policies that train machines to perform a variety of tasks in diverse contexts, moving beyond the concept of a "specialised robot".

Robotic learning is a field within robotics focused on artificial intelligence techniques (particularly machine learning) that enable machines, from the simplest to the most advanced, to acquire specific skills and competencies based on the environment they are placed in and the tasks they are required to perform. These skills and competencies range from those related to motor areas, spatial perception, and manipulation, to more complex ones such as object recognition and classification, language comprehension, and human interaction.

«Robots are excellent specialists but poor generalists. Typically, a model needs to be trained for each activity and environment. Changing a single variable often requires starting from scratch. But what if we could combine all robotics knowledge and define a way to train a multipurpose robot?».

This question was posed by the team of robotics engineers at Google DeepMind before they created, in October 2023, in collaboration with thirty-three academic laboratories worldwide, an unprecedented open dataset for the research community. This dataset contains training data for twenty-two different types of robots.

The aim is to develop, from a large and heterogeneous dataset sourced from various research institutes and machines, a “generalist” and versatile learning model that enables multiple types of robots to acquire various skills.


In a recently updated work, Google DeepMind emphasises the need to move beyond conventional robotic learning methods, shifting towards a generalist policy that can be adapted to new machines, tasks, and environments.
The Computer Science and Artificial Intelligence Laboratory at MIT also addresses this topic, proposing a training method for multipurpose robots based on a combination of data from various sources, supported by a generative AI technique.
In a future scenario, robots able to perform tasks they were never trained for and to adapt to new duties would provide valuable support in workplaces and, above all, in assisting people who are not self-sufficient, especially in emergencies and in situations that are hazardous for the person being assisted.

Generalist robotic learning

The article “Open X-Embodiment: Robotic learning datasets and RT-X models” is the manifesto of Google DeepMind’s research. First published on arXiv (Computer Science section) on 13 October 2023 and updated on 1 June 2024, the article outlines the group’s initial hypothesis: to transcend «conventional robotic learning methods that train a model for each application and environment, and instead move towards a generalist policy that can be efficiently adapted to new robots, tasks, and environments».

The dataset, named Open X-Embodiment, includes training data from thirty-three robotics research laboratories around the world. This data encompasses 527 skills, 160,266 tasks, and over one million trajectories recorded on twenty-two existing robot types, ranging from single robotic arms to dual-arm robots and quadruped models.

«Although most skills belong to the pick-place family, the long tail of the dataset also contains skills such as ‘erase’ or ‘assemble’. Moreover, the data covers a wide range of household objects, from appliances to food and utensils», explain the researchers.
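
To make the idea of a pooled, heterogeneous dataset with a "long tail" of skills more concrete, here is a minimal Python sketch. It is not the Open X-Embodiment format: the `Episode` record, its field names, and the five sample episodes are all invented for illustration. It only shows how episodes from different robot types could share one record type and how the less common skills could be listed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical shared record for one recorded robot trajectory."""
    robot_type: str   # e.g. "single_arm", "dual_arm", "quadruped"
    skill: str        # e.g. "pick-place", "erase", "assemble"
    num_steps: int    # length of the trajectory in control steps


def skill_histogram(episodes):
    """Count how many episodes demonstrate each skill."""
    return Counter(ep.skill for ep in episodes)


def long_tail(histogram, head_size=1):
    """Return the skills outside the `head_size` most common ones."""
    head = {skill for skill, _ in histogram.most_common(head_size)}
    return sorted(skill for skill in histogram if skill not in head)


# Invented sample data, pooled from three hypothetical robot types.
episodes = [
    Episode("single_arm", "pick-place", 40),
    Episode("single_arm", "pick-place", 35),
    Episode("dual_arm", "pick-place", 50),
    Episode("dual_arm", "assemble", 80),
    Episode("quadruped", "erase", 30),
]

histogram = skill_histogram(episodes)
tail_skills = long_tail(histogram)
print(histogram)
print(tail_skills)
```

In this toy data, pick-place dominates and the remaining skills form the tail, which mirrors the shape the researchers describe, not the actual proportions.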

Parallels with research in the field of Artificial Vision

Google DeepMind’s project is comparable to developments in the artificial vision sector approximately fifteen years ago, with the creation of ImageNet, then the largest online dataset of real images. This dataset contains over fourteen million labelled images classified into more than twenty thousand categories, and it significantly advanced image classification research.

«Recent major advancements in various segments of machine learning research, such as computer vision and natural language processing, have been enabled by a common approach that leverages large, diverse datasets», notes the team.

Although similar approaches have been attempted in robotics, several challenges have emerged, primarily the longstanding scarcity of real-world robotic data.

«Data collection is particularly costly and demanding for robotics», explain the researchers, «as it requires highly intensive engineering operations or alternatively, precise and meticulous video recordings».

Another critical issue is the lack of scalable models that can learn from such data and, consequently, perform effective generalisations [source: “RT-1: Robotics Transformer for real-world control at scale” – Google Robotics Research, 13 December 2022].

Robotics transformers trained on diverse data

Google DeepMind’s robotic engineers tested the Open X-Embodiment dataset by training two Robotics Transformers (RT), defined by them as machine learning models «capable of generating simple and scalable actions for real-world robotic tasks».

The first Robotics Transformer, RT-1, was developed for large-scale real-world robotic control, while RT-2 is a vision-language-action model that learns from both web data and robotic data.

Specifically, the first model was trained on data within Open X-Embodiment, including 130,000 episodes related to over 700 robotic activities, collected from a fleet of thirteen Everyday Robots over seventeen months.

The Robotics Transformer 1 was tested by the team in five different research labs, yielding results that indicated a «50% average improvement in success rate across five different commonly used robots, compared to methods independently developed and specific to each robot».

Overall, initial tests of the two robot learning models trained using diverse and cross-referenced data (as opposed to homogeneous and specific data) demonstrated that they enable robots trained in specific domains to acquire greater skills than conventional learning models allow.

Heterogeneous robotic learning through policy composition

Recently, the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT) revisited the topic of generalist policies for robotic learning with a study titled “PoCo: Policy Composition from and for Heterogeneous Robot Learning” (arXiv, 27 May 2024).

The study, set to be officially presented at the 2024 Robotics: Science and Systems Conference in Delft, Netherlands, from 15 to 19 July 2024, builds on a thesis already familiar to Google DeepMind engineers: «robots trained with a relatively small amount of task-specific data often cannot perform new tasks in unfamiliar environments».

For example, robots in a warehouse tasked with boxing items cannot perform tasks related to moving items in a production site, as these are two different tasks in different locations, involving different robotic learning policies and, naturally, different training datasets.

Using the warehouse example, «each [robotic warehouse] generates terabytes of data, but they pertain to a series of specific tasks and skills in that location, including boxing. Thus, they are not ideal for training a generic machine», note the MIT researchers.

To overcome this issue, the team developed a framework for training multipurpose robots based on a combination of data from various sources.

Essentially, the framework, named Policy Composition (PoCo), utilises numerous small datasets (such as those collected from multiple robotic warehouses), «from which it learns separate policies that are then combined, enabling a robot to make generalisations across multiple tasks».

The role of generative AI and diffusion models

At the core of the Policy Composition method for robotic learning lies a generative artificial intelligence technique known as “Diffusion Models”. This technique enables the integration of multiple data sources spread across various domains, modalities, and activities.

It is important to note that Diffusion Models (also known as “diffusion probabilistic models” or “score-based generative models”) are a class of machine learning techniques. Their objective is to learn a diffusion process that models the probability distribution of a given dataset, so that new samples resembling the data can be generated [source: Diffusion Models: A Comprehensive Survey of Methods and Applications – arXiv, 24 June 2024].

The authors further explain that these generative AI models are often employed in image generation. However, in the context of this research, «they are taught to generate trajectories for robots. They accomplish this by adding noise to the training data, then gradually removing the noise to refine the output, thus generating a trajectory». Policy Composition is fundamentally based on this “Diffusion Policy” work.
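
The noising-then-denoising idea can be sketched in a few lines of Python. This is a toy illustration, not the PoCo or Diffusion Policy implementation. The sine-wave "trajectory", the linear noise schedule, and the deterministic (DDIM-style) update are assumptions chosen for brevity. In particular, `make_oracle_predictor` cheats by knowing the clean trajectory, standing in for the neural network that a real system would train to predict the noise.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(clean, alpha_bar, rng):
    """Forward process: blend the clean trajectory with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(clean, noise)]
    return noisy, noise


def denoise(noisy, alpha_bars, predict_noise):
    """Reverse process: remove the predicted noise step by step."""
    x = list(noisy)
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = predict_noise(x, t)
        # Estimate the clean trajectory, then re-noise it to the previous level.
        x0 = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
              for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x


def make_oracle_predictor(clean, alpha_bars):
    """Stand-in for a trained network: computes the exact noise (an assumption)."""
    def predict(x, t):
        a_t = alpha_bars[t]
        return [(xi - math.sqrt(a_t) * ci) / math.sqrt(1.0 - a_t)
                for xi, ci in zip(x, clean)]
    return predict


rng = random.Random(0)
clean_trajectory = [math.sin(0.3 * k) for k in range(16)]  # toy 1-D path
alpha_bars = make_alpha_bars(50)
noisy_trajectory, _ = add_noise(clean_trajectory, alpha_bars[-1], rng)
recovered = denoise(noisy_trajectory, alpha_bars,
                    make_oracle_predictor(clean_trajectory, alpha_bars))
max_error = max(abs(r - c) for r, c in zip(recovered, clean_trajectory))
```

Because the oracle predicts the noise perfectly, the recovered trajectory matches the clean one up to floating-point error. A learned predictor would only approximate it, which is where the quality of the training data matters.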

The PoCo approach involves training one model at a time, each on a different dataset. A dataset might include video demonstrations of specific tasks featuring humans and robots, images gathered during real-world activities conducted by a remotely controlled robotic arm, or data derived from robotic task simulations.

Within this methodology, each Diffusion Model’s role is to learn a specific robotic policy from the training data that enables the completion of a particular task. Once multiple models have been trained, the individual policies learned are “combined” to form a single, overarching policy that allows a robot to perform various tasks in diverse contexts.
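
One plausible way to picture the "combining" step is as a weighted sum of what each policy predicts. The Python sketch below is a simplification under stated assumptions, not the method as implemented by the CSAIL team. Each policy is replaced by an analytic Gaussian score (the gradient of its log-probability) around an invented preferred action. The weights are hypothetical, and the full reverse diffusion is replaced by plain gradient ascent on the composed score.

```python
def gaussian_score(mean, std=1.0):
    """Score of a Gaussian centred on `mean`: stand-in for one trained policy."""
    def score(x):
        return [-(xi - mi) / (std ** 2) for xi, mi in zip(x, mean)]
    return score


def compose_scores(scores, weights):
    """Weighted sum of per-policy scores, evaluated at the same point."""
    def composed(x):
        total = [0.0] * len(x)
        for score, weight in zip(scores, weights):
            for i, g in enumerate(score(x)):
                total[i] += weight * g
        return total
    return composed


def follow_score(score, start, step_size=0.1, num_steps=200):
    """Gradient ascent on log-probability (simplified stand-in for sampling)."""
    x = list(start)
    for _ in range(num_steps):
        x = [xi + step_size * gi for xi, gi in zip(x, score(x))]
    return x


# Two hypothetical policies with different preferred 2-D actions.
real_world_policy = gaussian_score([1.0, 0.0])
simulation_policy = gaussian_score([0.0, 1.0])

balanced = compose_scores([real_world_policy, simulation_policy], [0.5, 0.5])
real_heavy = compose_scores([real_world_policy, simulation_policy], [0.75, 0.25])

balanced_action = follow_score(balanced, [0.0, 0.0])
real_heavy_action = follow_score(real_heavy, [0.0, 0.0])
```

With equal weights the composed action settles midway between the two policies' preferences, and shifting the weights pulls it towards the policy that is trusted more. The design choice illustrated here is that composition happens at prediction time, so the individual models do not need to be retrained.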

«One of the advantages of this robotic learning technique», explains the CSAIL team, «is that we can combine the different obtained policies to harness the best from each machine. For instance, a policy based on real-world data might help robots develop greater dexterity, while one based on simulation could lead them to better generalisation».

In both simulations and real-world experiments, robotic arms performed tool-use tasks with spatulas, knives, wrenches, and hammers, such as hammering a nail or retrieving food from a tray. The robotic learning approach enabled a single robot to execute numerous tasks with tools it had not used before and to adapt to tasks not learned during training, resulting in a 20% performance improvement over baseline techniques.

Use of tools such as spatulas, knives, wrenches, and hammers in general-purpose robotic learning policies, under external disturbances (human hand intervention) (a) and distractors (deformable objects and diminishing light) (b), through various initial configurations requiring force actions (c) and dynamic scene rearrangements (d). The horizontal axis shows the temporal dimension for each trajectory executed by the robot [credit: “PoCo: Policy Composition from and for Heterogeneous Robot Learning” – CSAIL Massachusetts Institute of Technology – https://arxiv.org/pdf/2402.02511].

Glimpses of Futures

From the vast and diverse dataset compiled by Google DeepMind, designed to train a generalist robotic learning model, to the numerous smaller training datasets from MIT’s Computer Science and Artificial Intelligence Lab, which teach specific robotic policies that will eventually form a unified general policy, the objective remains clear: to train versatile, multipurpose robots for certain applications, surpassing specialised machines that can only perform a limited set of tasks within a single context.

Given that the described robotic learning models still have many limitations and have been presented in a simplified form, let us nonetheless anticipate possible future scenarios. We will use the STEPS matrix to analyse the impacts of advancements in generalist and multipurpose robot training methodologies from various perspectives.

S – SOCIAL: in the future, individual robotic arms, dual-armed robots, and quadruped robots capable of performing diverse activities within the same or different environments – such as picking up a tool, using it, then moving to another task in another section seamlessly – would be invaluable in workplaces, especially in warehouses and production sites, as well as in the hospitality and construction sectors, and domestic settings. A machine that “can do everything,” capable of performing tasks it has never done before, handling tools and objects it has never used, and adapting to duties not covered in its training, would offer comprehensive support. Consider robots assisting the disabled and elderly; their future versatility could provide crucial help in emergencies and dangerous situations, effectively intervening even without prior familiarity.

T – TECHNOLOGICAL: the evolution of training systems for multipurpose robots, as described by MIT’s team, will not only require increasingly diverse data to enhance the performance of generalist machines but will also integrate other machine learning techniques, including those from generative AI and Diffusion Models. Furthermore, the team states, «this study shows the effectiveness of the Policy Composition method only for short-term tasks, while we believe extending it to a longer time horizon is an interesting direction for future work, as is extending it to different models, such as those trained with Google DeepMind’s Robotics Transformer».

E – ECONOMIC: as we have discussed previously, regarding the economic impact of robots in workplaces, there is a contrasting viewpoint to the fear of rising unemployment rates due to increased automation in certain sectors, from automotive to mechanical work and assembly. The focus here is on industries globally experiencing a general shortage of personnel, such as hospitality and small-scale construction. In the future, with the evolution of generalist robot training methods, the presence of multipurpose machines could be strategic, helping to address the lack of human labour.

P – POLITICAL: safety concerns are the most debated when it comes to robots in workplaces, especially when they operate directly with people, such as in restaurants and private homes. In a future scenario where multipurpose robots work alongside humans in many sectors, it will be crucial to focus on their impact on health and safety. The European Agency for Safety and Health at Work has published several case studies on automation in various professional contexts, including assembly lines, industrial production, the automotive industry, steel production, plastic product manufacturing, and more. They particularly highlight the importance of risk assessment in companies, «related, for example, to high forces, electricity, crushing, impacts with the machine, and so on. Each risk must then be classified, and safety measures must be adopted as soon as a certain threshold is exceeded. These could include fencing, personal protective equipment, or continuous training courses».

S – SUSTAINABILITY: the potential future presence of multipurpose robots in various workplaces inevitably raises questions about the environmental impact of the AI enabling such machines. It has been said that «robots are great ‘specialists’ and that a learning model must be trained for each activity and environment» and that «changing a single variable often requires starting from scratch». Training a single multipurpose robot would streamline this process, reducing the computational energy needed for a robot to acquire multiple skills. However, there is a downside. Training machine learning models to enable generalist robots to perform numerous tasks in different environments requires ever-larger amounts of data and hours of training, with inevitable repercussions on energy consumption and the associated carbon footprint. It is a catch-22. A US study by energy analysts on the future electricity load of «social robotification» (“Direct and Indirect Impacts of Robots on Future Electricity Load”) estimates that in the United States, by 2025, the energy consumption of robots will increase to 0.5-0.8% of the country’s total electricity demand.
