Covariant Introduces RFM-1 to Give Robots the Human-like Ability to Reason

The world's most capable Robotics Foundation Model gives robots a deep understanding of language and the physical world

Today, Covariant (https://covariant.ai), the world's leading AI Robotics company, released RFM-1: a Robotics Foundation Model that gives robots the human-like ability to reason, marking the first time Generative AI has given commercial robots a deeper understanding of language and the physical world.

The key challenge with traditional robotic automation, whether based on manual programming or specialized learned models, is a lack of reliability and flexibility in real-world scenarios. To create value at scale, robots must be able to handle an unlimited array of items and scenarios autonomously.

Covariant is starting with warehouse pick-and-place operations to showcase the power of Robotics Foundation Models. In warehouse environments, the company's approach of combining the largest real-world robot production dataset with a massive collection of Internet data is unlocking new levels of robotic productivity and pointing the way to broader industry applications ranging from hospitals and homes to factories, stores, restaurants, and more.

"Robotics Foundation Models require access to a vast amount of high-quality multimodal data. These models require data that reflects the wide range of information a robot needs to make decisions, including text, images, video, physical measurements, and robot actions," said Peter Chen, Chief Executive Officer and Co-Founder, Covariant. "Unlike AI for the digital world, there is no internet to scrape for large-scale robot interaction data with the physical world. So we built a highly scalable data collection system which has collected tens of millions of trajectories by deploying a large fleet of warehouse automation robots to dozens of customers around the world."

Since 2017, Covariant's previous AI models have enabled robots to operate in a commercially meaningful way across a diverse set of warehouse operations and industries. These robots have been able to adapt to their embodiment, understand the scenes they are faced with, reliably handle items they have never seen before, and achieve human-level speed and reliability.

The introduction of RFM-1 sets a new frontier for what's possible. Built as a Multimodal Any-to-Any Sequence Model, RFM-1 is an 8-billion-parameter model trained on text, images, video, robot actions, and physical measurements to autoregressively perform next-token prediction. Because every modality is tokenized into a common token space, next-token prediction training lets RFM-1 accept any modality as input and predict any modality as output.
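To make the any-to-any design concrete, the sketch below shows one way such a model can be wired up: each modality's tokens are offset into a single shared vocabulary, and one causal decoder is trained with next-token prediction over the interleaved sequence. Covariant has not published RFM-1's architecture or tokenizers, so the vocabulary sizes, model dimensions, and layer counts here are hypothetical stand-ins for the general idea.

```python
# Minimal, illustrative sketch of an any-to-any multimodal sequence model.
# RFM-1's actual architecture and tokenizers are not public; the modality
# vocabulary sizes and model hyperparameters below are hypothetical.
import torch
import torch.nn as nn

# Each modality gets its own slice of one shared vocabulary, so a single
# autoregressive decoder can consume and emit any mix of modalities.
MODALITY_SIZES = {"text": 32_000, "image": 8_192, "action": 1_024}
OFFSETS, total = {}, 0
for name, size in MODALITY_SIZES.items():
    OFFSETS[name] = total
    total += size

def to_shared_tokens(modality: str, ids: torch.Tensor) -> torch.Tensor:
    """Map modality-local token ids into the shared vocabulary."""
    return ids + OFFSETS[modality]

class AnyToAnyDecoder(nn.Module):
    """Decoder-only transformer trained with next-token prediction over
    interleaved multimodal token sequences."""
    def __init__(self, vocab: int, d_model: int = 512, n_layers: int = 6,
                 n_heads: int = 8, max_len: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model,
            batch_first=True, norm_first=True, activation="gelu")
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        T = tokens.shape[1]
        positions = torch.arange(T, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=causal))

# One training step: predict token t+1 from tokens <= t, regardless of
# which modality each position belongs to.
model = AnyToAnyDecoder(vocab=total)
text = to_shared_tokens("text", torch.randint(0, 32_000, (1, 16)))
image = to_shared_tokens("image", torch.randint(0, 8_192, (1, 64)))
action = to_shared_tokens("action", torch.randint(0, 1_024, (1, 8)))
seq = torch.cat([text, image, action], dim=1)
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, total), seq[:, 1:].reshape(-1))
```

In a production system, each modality would use a learned tokenizer (for example, a vector-quantized codec for images and video) rather than the random ids used here; the point of the sketch is the single shared vocabulary and the single next-token objective.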

With a deep understanding of language and the physical world, RFM-1 gives robots the sophisticated ability to reason and make decisions on the fly. This delivers higher levels of robotic autonomy, lowers costs and implementation times, and opens the door to the rapid development of new applications and robotic form factors, such as consumer and humanoid robots.

Specific RFM-1 capabilities include:

Physics world model: RFM-1's understanding of physics emerges from learning to generate video. RFM-1 can predict, via AI-generated video, how objects will react to robotic actions. This physics world model, powered by Covariant's multimodal robotics dataset, improves speed and reliability by letting robots simulate the results of candidate actions and select the best course of action (see the sketch after this list).
Language-guided programming: By making robots taskable and giving them an understanding of the English language, RFM-1 lets robots and humans collaborate and problem-solve simply by communicating with each other, significantly lowering the barriers to customizing AI behavior for dynamic business needs and the long tail of corner-case scenarios.
Learning from self-reflection: In-context learning allows robots to learn on the fly and improve based on self-reflection about their own actions. With RFM-1, robots can achieve this learning in minutes rather than weeks or months, which drastically increases performance while reducing ramp time for a new system, scenario, or item.
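The physics world model lends itself to a simple planning loop: sample candidate actions, roll each one through the learned video predictor, score the predicted outcomes, and execute the best candidate. The sketch below illustrates that loop under stated assumptions; Covariant has not published RFM-1's planning interface, so predict_outcome and score_outcome are hypothetical stand-ins.

```python
# Illustrative model-predictive action selection with a learned world model.
# RFM-1's real planning interface is not public; WorldModel.predict_outcome
# and score_outcome below are hypothetical stand-ins for the idea of
# "simulate candidate futures, then pick the best one".
from dataclasses import dataclass
import random

@dataclass
class Action:
    grasp_x: float   # candidate grasp point, metres
    grasp_y: float
    rotation: float  # end-effector rotation, radians

class WorldModel:
    """Stand-in for a video-prediction model that maps (observation, action)
    to a predicted future observation."""
    def predict_outcome(self, observation: dict, action: Action) -> dict:
        # A real model would generate predicted video frames here.
        return {"predicted_frames": [], "action": action}

def score_outcome(outcome: dict) -> float:
    """Stand-in scorer, e.g. predicted grasp success or cycle time."""
    return random.random()

def select_best_action(model: WorldModel, observation: dict,
                       candidates: list[Action]) -> Action:
    # Roll every candidate through the world model and keep the one
    # whose simulated future scores highest.
    return max(candidates,
               key=lambda a: score_outcome(model.predict_outcome(observation, a)))

if __name__ == "__main__":
    candidates = [Action(random.uniform(0.0, 0.5), random.uniform(0.0, 0.5),
                         random.uniform(-1.57, 1.57)) for _ in range(16)]
    best = select_best_action(WorldModel(), {"rgb": None}, candidates)
    print("selected grasp:", best)
```

In practice, the scorer would encode task objectives such as predicted grasp success, and the world model would generate actual video rather than an empty placeholder; the shape of the loop, simulating candidate futures before acting, is what the physics world model enables.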
"Recent advances in Generative AI have demonstrated beautiful video creation capabilities, yet these models are still very disconnected from physical reality and limited in their ability to understand the world robots are faced with. Covariant's RFM-1, which is trained on a very large dataset that is rich in physical robot interactions, represents a significant leap forward towards building generalized AI models that can accurately simulate the physical world," commented Pieter Abbeel, Chief Scientist and Co-Founder, Covariant.

Learn more about RFM-1 on the Covariant blog.

RFM-1's capabilities will be available for live demonstration at Covariant's headquarters in Emeryville, CA (by appointment only), and can be experienced on a first-come, first-served basis at the MODEX 2024 trade show in Atlanta, GA, from March 11 to 14, 2024 (visit booth C7085 to reserve a spot).

About Covariant

Founded in 2017 by the world's leading AI Robotics research scientists, Covariant builds and delivers Robotics Foundation Models for the real world, meeting the reliability and flexibility requirements of the world's leading retailers and logistics providers. With offices in North America and Europe, Covariant has customers in 15 countries across 4 continents and powers hundreds of robots that interact with and learn from their dynamic environments. Covariant currently offers the broadest portfolio of AI-powered robotic picking applications for warehouse environments, including order sortation, item induction, goods-to-person order picking, kitting, and depalletization. Its robots operate in a commercially meaningful way across a diverse set of industries spanning apparel, health and beauty, pharmaceuticals, logistics, and general merchandise. To learn more, go to covariant.ai.

Featured Product

3D Vision: Ensenso B now also available as a mono version!

This compact 3D camera series combines a very short working distance, a large field of view, and a high depth of field, making it well suited for bin-picking applications. With its ability to capture multiple objects over a large area, it can help robots empty containers more efficiently. It is now available from IDS Imaging Development Systems.

In the color version of the Ensenso B, the stereo system is equipped with two RGB image sensors, which saves additional sensors and reduces installation space and hardware costs. Models are now also available with two 5 MP mono sensors, achieving impressively high spatial precision. With enhanced sharpness and accuracy, they can tackle applications where absolute precision is essential.

The great strength of the Ensenso B lies in very precise detection of objects at close range. It offers a wide field of view and an impressively high depth of field, meaning the area in which an object is in focus is unusually large. At a distance of 30 cm between camera and object, the Z-accuracy is approximately 0.1 mm; the maximum working distance is 2 m. The series complies with protection classes IP65/67 and is ideal for use in industrial environments.