September 30, 2024

By Ashish Kapoor

7 Lessons from AirSim

I ran the autonomous systems and robotics research effort at Microsoft for nearly a decade, and these are our biggest learnings.

This research effort was seeded in 2014, after the realization that deep ML was breaking through and would change the face of robotics once and for all. Remember that in 2014, deep learning was still finding its foothold. Our own field robotics efforts were struggling due to the high cost of field studies (in both time and money) and the weather in Seattle. We needed a way to train robots without relying entirely on real-world data and testing.

One of the most consequential outputs of that group was AirSim, an open-source framework for building deep learning models for robots. With almost 2K stars on day 1, AirSim was among the top 10 trending GitHub releases, and over the years it has become a critical tool in robot learning research, garnering over 600K downloads.

With the current excitement surrounding robot and physical intelligence, here are the key learnings I wish to share; they might help align efforts across the fields of robotics and AI:

Lesson 1: The “PyTorch moment for robotics” needs to come before the “ChatGPT moment for robotics”.

While there is significant anticipation around foundation models for the physical world, at least two factors pose major barriers:

  1. Scarcity of technical folks well versed in both deep ML and robotics.
    If we use NeurIPS 2023 registrations (16,382) vs. CoRL 2023 participants (~900) as a proxy, there is roughly one robotics + deep ML engineer for every twenty ML engineers.
  2. Lack of resources that enable rapid iteration of the build-evaluate-refine cycle.
    While there is awareness of the lack of quality data repositories for robot training, challenges due to robot specificity, as well as the closed and siloed approaches of various enterprises and startups, are limiting progress.

We need far more engineers and experts working on robot and physical intelligence, and that can only happen if we significantly lower the barrier to entry into robotics.

In our own work, we found that despite the large amount of interest in the open-source release of AirSim, very few folks could fully harness the power of the tool. Few had the right background at the intersection of robotics, simulation, and deep ML; combined with high infrastructure requirements, adoption suffered. Wide adoption began only once we introduced measures, such as Python bindings and the ability to work with pre-compiled binaries, that made AirSim ML-friendly (a minimal sketch of those bindings follows).
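To make concrete what "ML friendly" meant, here is a minimal sketch of AirSim's Python bindings driving a pre-compiled simulation binary. The calls follow AirSim's public Python API, but treat the snippet as illustrative rather than canonical; the vehicle is assumed to be the default multirotor.

```python
# Minimal sketch: control a simulated drone through AirSim's Python bindings.
# Assumes a pre-compiled AirSim binary is already running with a default multirotor.
import airsim

client = airsim.MultirotorClient()   # connect to the simulator over RPC
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

client.takeoffAsync().join()                      # blocking takeoff
client.moveToPositionAsync(10, 0, -5, 3).join()   # x, y, z in NED (z is negative up), 3 m/s
client.landAsync().join()
```

A few lines of Python against a binary, with no Unreal Engine build or C++ toolchain required, is what finally made the tool accessible to ML practitioners.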

In our discussions with students and researchers at the CMU Robotics Institute, we heard that it can take up to six months for an individual to become productive in robotics. This is in stark contrast to other subfields of machine learning, where you might be up and running within a few hours. Similarly, many well-funded startups have to invest months of effort preparing their infrastructure. There is a dire need for a unifying development framework in robotics (a la PyTorch in machine learning) that can act as a foundation.

Lesson 2: Most AI workloads on robots can be solved primarily with deep learning.

Building robot intelligence requires simultaneously solving a multitude of hard AI problems: perception, state estimation, mapping, planning, control, and more. Most robots we encounter have hybrid architectures, where perception is solved via deep learning while control and planning use classical methods. This risks creating even more fragmentation and reducing reusability, because every robot and every application needs to run a very specific combination of modules.

We are increasingly seeing successes of deep learning methods across the entire robotics stack. Many of these works used simulation to generate ML training data. For example, the TartanAir dataset, generated using AirSim, was used to train deep visual odometry and neural SLAM models [TartanVO, DPVO, DPVSLAM], and the AirSim drone racing lab was used to train multimodal foundation models [DRL, COMPASS]. Other simulators, such as Habitat and ProcTHOR, regularly find use in deep learning efforts.

Lesson 3: Existing robotic tools are suboptimal for deep ML.

A typical software tool chest for teams building robot intelligence includes ROS (or a homegrown variant); simulation tools such as Gazebo, Isaac, and MuJoCo, amongst others; and deep AI models scavenged and/or fine-tuned from the web.

Most of these deployment and simulation tools originated before the advent of deep machine learning and cloud computing. Here are the aspects that make them suboptimal for deep learning-centric robot intelligence:

  1. Most of the legacy simulation tools were not designed with AI in mind.
    For example, deriving ground-truth training data from simulation can be as hard as solving autonomy in simulation: if a simulator renders a sensor modality without appropriate labels on the environment meshes, one has to solve the perception problem inside the simulation just to label the data. In general, most simulation tools lack hooks to query the multitude of ground-truth data traces needed to create rich robot training datasets (see the sketch after this list).
  2. Most of the legacy robotic tools are extremely hard to parallelize on GPU clusters.
    Significant investment is required from each team wanting to generate data at scale. A complex system design with heavyweight legacy components, including ROS, means that state-of-the-art ML techniques can be out of reach for most.
  3. Most of the robotic tools are only used in the design phase.
    As robots encounter new experiences, AI models need to evolve continuously. Very few legacy components are designed to participate in the full lifecycle of a robot: telemetry from deployments, continual learning, and so on.
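For contrast, here is roughly what such ground-truth hooks look like in AirSim's Python API: one call returns an RGB frame together with per-pixel segmentation and depth labels, so no perception problem has to be solved in simulation. The camera name assumes a default multirotor configuration; consider this a sketch, not a canonical recipe.

```python
# Sketch: query aligned RGB, segmentation, and depth ground truth in one call.
import airsim
import numpy as np

client = airsim.MultirotorClient()
client.confirmConnection()

rgb, seg, depth = client.simGetImages([
    airsim.ImageRequest("front_center", airsim.ImageType.Scene, False, False),
    airsim.ImageRequest("front_center", airsim.ImageType.Segmentation, False, False),
    airsim.ImageRequest("front_center", airsim.ImageType.DepthPerspective, True),
])

# Uncompressed byte buffers reshape directly into label-aligned image arrays.
rgb_img = np.frombuffer(rgb.image_data_uint8, dtype=np.uint8).reshape(rgb.height, rgb.width, -1)
seg_img = np.frombuffer(seg.image_data_uint8, dtype=np.uint8).reshape(seg.height, seg.width, -1)
depth_img = airsim.list_to_2d_float_array(depth.image_data_float, depth.width, depth.height)
```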

Most robot AI teams we have encountered are fairly small, ranging from 2 people at early-stage start-ups to 10 at well-established enterprises. Becoming AI-ready takes several months of dedicated engineering by top folks. Robot infrastructure that is data-first, parallelizable, and deeply integrated with the cloud throughout the robot's lifecycle is a must.

Lesson 4: Robotic foundation mosaics + agentic architectures are more likely to deliver than monolithic robot foundation models. 

In a recent survey led by Roya Firoozi and Mac Schwager, we found that most existing robotic foundation models utilize a mosaic of existing AI capabilities, incorporating various deep models in a modular fashion. This was not surprising given the scarcity of data and the heterogeneity of robotic form factors: it is far easier to repurpose an existing module than to train one from scratch. What was surprising (well, at least in 2023), however, was the ability to chain various AI capabilities using LLMs and agentic architectures (e.g., see the AirSim + LLM use case here).

The ability to program robots efficiently and easily is a research area by itself and one of the most requested use cases; currently, it takes a technical team weeks to program or reprogram robot behavior. While end-to-end robot foundation models may rise at some point in the future, it is clear that foundation mosaics and agentic architectures can deliver huge value now (a toy sketch follows).
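To make the mosaic idea concrete, here is a toy sketch in which independent skills, each wrapping a deep model or a classical module, are chained by a plan that a real system would obtain by prompting an LLM with the task description. Every name here is hypothetical and purely illustrative.

```python
# Toy sketch of a foundation mosaic: modular skills chained by an LLM-produced plan.
# All skill implementations are stubs standing in for real models/modules.
from typing import Callable, Dict, List

SKILLS: Dict[str, Callable[[dict], dict]] = {
    "detect_objects": lambda s: {**s, "objects": ["box_1"]},      # stand-in for a vision model
    "plan_path":      lambda s: {**s, "path": [(0, 0), (3, 4)]},  # stand-in for a motion planner
    "follow_path":    lambda s: {**s, "done": True},              # stand-in for a controller
}

def execute(plan: List[str], state: dict) -> dict:
    """Run a skill chain, threading shared state between modules."""
    for step in plan:
        state = SKILLS[step](state)
    return state

# In practice, this plan would come from prompting an LLM with the task and
# the list of available skills; here it is hard-coded for illustration.
plan = ["detect_objects", "plan_path", "follow_path"]
print(execute(plan, {"task": "bring the box to the dock"}))
```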

Lesson 5: Cloud + connectivity trumps compute on edge. Yes, even for robotics!

We have come across several operator-based robot enterprises that either discard or only minimally catalog the data their systems encounter. This has been primarily due to a lack of data management pipelines and of smooth connectivity to the devices.

Furthermore, robotics is truly a multi-task domain: unlike many other subfields of AI, a robot needs to solve multiple tasks at once. While there are constant advances in multi-task learning and in foundation models spanning several task categories, it is still challenging to run one large model on an edge device, much less several. For form factors such as aerial drones in particular, weight constraints make it extremely restrictive to place large compute modules onboard.

In the LLM and VLM space, as larger and more capable models are trained, we are seeing a paradigm shift towards API-based services. Robotics would greatly benefit from this, as it can offload compute to much more capable machines. The ability to seamlessly connect to the cloud for data management and model refinement, and, last but not least, to make several inference calls simultaneously within certain SLAs, would be a game changer for robotics (a hypothetical sketch follows).
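As an illustration of what that could look like on the robot side, here is a hypothetical sketch that offloads perception to a cloud inference endpoint under a latency budget and degrades gracefully to a small onboard model when the SLA is missed. The endpoint URL and payload schema are invented for illustration.

```python
# Hypothetical sketch: cloud inference with an SLA and an onboard fallback.
import requests

CLOUD_ENDPOINT = "https://example.com/v1/infer"  # placeholder, not a real service
SLA_SECONDS = 0.2                                # latency budget for this control tick

def detect_onboard(image_bytes: bytes) -> list:
    """Small fallback model running on the robot (stubbed here)."""
    return []

def detect(image_bytes: bytes) -> list:
    try:
        resp = requests.post(
            CLOUD_ENDPOINT,
            data=image_bytes,
            headers={"Content-Type": "application/octet-stream"},
            timeout=SLA_SECONDS,  # enforce the latency budget at the client
        )
        resp.raise_for_status()
        return resp.json()["detections"]
    except requests.RequestException:
        # Connectivity hiccup or SLA miss: degrade gracefully to the edge model.
        return detect_onboard(image_bytes)
```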

Lesson 6: Current approaches to robot AI safety are inadequate.

AirSim was a product of efforts within Microsoft Research on safe cyber-physical systems. Safety research for robotics and cyber-physical systems is at an interesting crossroads. Besides roboticists, multiple other communities, including formal methods, control, and machine learning, have made attempts at safe learning and robotics. Much of the research that predates deep machine learning is challenging to apply to current AI methodologies, primarily because of the complex parameterization of neural networks, which are now extremely likely to be part of any robot intelligence.

Neurosymbolic representation and analysis is likely to be a key technique for applying traditional safety frameworks to the modern robotics stack, and recent efforts are looking into using AirSim and similar simulations to that end (a toy runtime-shield sketch follows).
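One traditional framework that does transfer is runtime monitoring, sometimes called shielding: a simple, verifiable symbolic constraint checks every action a learned policy proposes before it executes. The sketch below is a toy with assumed bounds, not a recipe from the AirSim work.

```python
# Toy runtime shield: symbolic constraints filter a neural policy's actions.
import numpy as np

V_MAX = 2.0      # m/s, assumed velocity limit
GEOFENCE = 50.0  # m, assumed max distance from home

def neural_policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in for a learned controller; returns a velocity command."""
    return np.array([3.0, 0.0, 0.0])

def shield(position: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Project the proposed action back into the verified safe set."""
    speed = np.linalg.norm(action)
    if speed > V_MAX:
        action = action * (V_MAX / speed)  # clamp to the velocity bound
    if np.linalg.norm(position) > GEOFENCE:
        action = -position / np.linalg.norm(position) * V_MAX  # head home
    return action

obs, position = np.zeros(3), np.array([10.0, 0.0, -5.0])
print(shield(position, neural_policy(obs)))
```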

Lesson 7: Open-source can add to the overhead.

As strong advocates for open source, we have shared much of our research under flexible licenses to promote collaboration and innovation. While open source offers many benefits, particularly in advancing technology, there are a few challenges, especially in the context of robotics, that are less frequently discussed but worth considering.

  1. Robotics is a highly fragmented field, with various specialized areas that often operate in silos.
    This fragmentation, combined with a relatively small number of active contributors, can lead to an overwhelming number of requests for any open-source project. Managing these demands can sometimes divert focus and resources from the core mission of the project.
  2. Robotics is a complex discipline, and in many cases, there will be more users of the technology than contributors to its development.
    This imbalance raises a question: is the overhead of maintaining a fully open-source project always the best use of resources, or might it be more effective to share certain assets openly while keeping the project’s focus streamlined?
  3. Within large organizations, the scope of open-source initiatives may also face practical limits.
    While open-source projects can drive innovation and public engagement, there is a balance to strike between fostering community contributions and aligning with the broader, long-term goals of the organization. For some companies, open-source efforts may be viewed as a strategic asset or a way to generate positive visibility, but the full potential may not always be realized in terms of commercial growth.

AirSim not only pushed the boundaries of the technology but also provided deep insight into R&D processes and what it takes to bring intelligent robots to market. In this day and age, when the timeline from research to production has shrunk dramatically, we are resolute in maximizing the impact this line of research and development can have.

The future of robotics will be built on the principle of being open. Stay tuned as we continue to build!