DevOps for Robotics
DevOps, a set of practices designed to automate and integrate the processes between software development and IT operations, has transformed how we deliver cloud and web applications over the past decade. Although the language used to illustrate these practices in books and courses is rather particular to that technical domain, the principles can apply more broadly. The world of robotics—specifically, mobile robots like self-driving cars and autonomous mobile robots (AMRs)—is a domain the usual language of DevOps doesn’t immediately conjure, but where its concepts can nonetheless flourish.
I worked as a research engineer and software developer in robotics through most of the history of the DevOps movement, first in self-driving cars and then in AMRs. As software developers aware of these developments, my colleagues and I were eager to apply DevOps to our field. It has consistently been a popular topic at ROSCon. As far back as 2014, researchers coined the term RobOps to refer to at least the ops side, and more recently, InOrbit has promoted a wider definition of the term as something of a parallel to DevOps, spurring the creation of the Robot Operations Group. However these attempts to formalize things shake out, DevOps can be and has been applied in robotics successfully. I’ll do my best to illustrate how.
A key component of the DevOps life cycle is continuous integration, which depends fundamentally on agile practices. Agile applies readily to some aspects of robotics software development, such as platform software and user applications like fleet managers. However, the special sauce is really in the perception and navigation pipeline, a relatively new field (originating for many of us with the publication of the seminal text Probabilistic Robotics, and now leaning heavily on even more recent work in deep learning) that still involves a significant research component. Due to the exploratory nature of R&D, agile can be an awkward fit: the guiding user stories tend to sit at a higher level, and the definition of done may be vague or open-ended.
The solution is to focus on learning: using each sprint to explore different ideas or approaches, and explicitly considering the learning outcomes as part of the definition of done. Although the resulting “features” can’t always be described as a tangible benefit to the end user, thinking this way allows the R&D effort to be managed within short iteration cycles with feedback and measurement, and eventually transitioned seamlessly into proper product features.
Continuous integration also requires automated testing. The low-level, bottom-up variety, such as static code analysis and unit testing, is straightforward to adapt to robotics code. Even though many of the algorithms will by nature be probabilistic, non-deterministic, and approximate, it is usually feasible to capture or contrive input data and test for an outcome within some acceptable error bounds.
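As a concrete illustration, a unit test for a probabilistic component can assert against error bounds rather than exact values. Here is a minimal pytest-style sketch; `estimate_pose` is a hypothetical stand-in for whatever localizer or scan matcher is actually under test, and the tolerances are illustrative:

```python
import math
import random

def estimate_pose(scan):
    """Hypothetical stand-in for a probabilistic localizer: returns a
    noisy estimate of the ground-truth pose encoded in the contrived scan."""
    x, y, theta = scan["truth"]
    return (x + random.gauss(0.0, 0.01),
            y + random.gauss(0.0, 0.01),
            theta + random.gauss(0.0, 0.005))

def test_pose_estimate_within_bounds():
    random.seed(42)  # pin the RNG so the probabilistic test is repeatable
    scan = {"truth": (1.0, 2.0, 0.5)}  # contrived input with known ground truth
    x, y, theta = estimate_pose(scan)
    # Assert against acceptable error bounds, not exact values.
    assert math.hypot(x - 1.0, y - 2.0) < 0.05    # position within 5 cm
    assert abs(theta - 0.5) < math.radians(2.0)   # heading within ~2 degrees
```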
It becomes trickier in the sort of top-down behaviour and integration testing needed to implement test-driven development and high-level regression testing: how can we meaningfully define the expected behaviour when the actual end result may be so far removed from the individual components under test? With perception algorithms, it is fairly easy to capture or generate input data and examine the feed-forward output, but how those outputs inform the overall success or failure of operations is far from obvious in most cases. With navigation algorithms, we have the reverse problem: pass/fail criteria can more readily be derived from the outputs, but the input data is challenging to produce: navigation moves the robot’s sensors, so its outputs influence the inputs to perception.
Fortunately, there is one way to test autonomy software end-to-end in an automated context: simulation. Simulating the behaviour of the sensors and actuators within a physics engine allows the software’s behaviour to be tested in realistic scenarios, including actuation feedback, and for the physical outputs of the system (e.g. the position of the robot or its payload) to be checked directly. The general-purpose ROS simulator is Gazebo, but there are various more specialized options, such as CARLA and NVIDIA DRIVE Sim. How useful these tests are depends, of course, on how closely the physics reflects reality, which is notoriously difficult to model for some sensors, but the accuracy can be verified to some extent against real data. The advantage is that a wide variety of scenarios can be tested automatically as part of a CI pipeline. Unfortunately, the robot algorithms themselves usually push powerful computers to their limits, and adding complex physics simulation only adds to the computational load, so these tests are very resource-intensive and can only be run at or near real-time speed, adding long waits and high costs to what should be a lightweight CI pipeline.
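A sketch of what such an end-to-end check might look like in ROS 2 (rclpy): it assumes a simulator and navigation stack have already been launched elsewhere in the CI job, and simply asserts that the robot’s odometry ends up within tolerance of a goal. The topic name, goal coordinates, and timeout are all illustrative assumptions:

```python
import math
import rclpy
from nav_msgs.msg import Odometry

GOAL = (4.0, 0.0)    # hypothetical goal sent to the navigation stack
TOLERANCE = 0.25     # metres
TIMEOUT_S = 120.0    # simulation runs near real time, so budget generously

def test_robot_reaches_goal():
    rclpy.init()
    node = rclpy.create_node("goal_checker")
    last = {}

    def on_odom(msg):
        # Track the latest simulated position of the robot.
        p = msg.pose.pose.position
        last["xy"] = (p.x, p.y)

    node.create_subscription(Odometry, "/odom", on_odom, 10)

    start = node.get_clock().now()
    reached = False
    while (node.get_clock().now() - start).nanoseconds < TIMEOUT_S * 1e9:
        rclpy.spin_once(node, timeout_sec=0.5)
        if "xy" in last and math.hypot(last["xy"][0] - GOAL[0],
                                       last["xy"][1] - GOAL[1]) < TOLERANCE:
            reached = True
            break

    node.destroy_node()
    rclpy.shutdown()
    assert reached, "robot did not reach the goal within the timeout"
```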
DevOps also calls for continuous delivery. Pedagogical examples of CD typically involve the automated deployment of the latest build of an application, along with its IT infrastructure, to a staging environment consisting of a rather nebulous cloud substrate. In the case of robotics, the physical infrastructure necessarily includes the robots themselves (and their chargers, carts, parking infrastructure, calibration widgets, etc.), so CD requires some kind of test fleet and environment. This is generally paired with a comprehensive test plan being continuously executed by a dedicated test team, so that any issues can be reported back to the developers. As with the traditional applications seen in DevOps examples, the robot systems and surrounding infrastructure can be automatically (and repeatably) deployed in containers via a configuration management system.
Most modern robotics software is either based on ROS or heavily influenced by its architecture. ROS uses a publish-subscribe model of federated services that closely parallels modern microservice architecture, the darling of DevOps. The original idea was to encourage modular open-source development of robotics components with standardized interfaces, but this architecture also yields benefits similar to those of microservices in terms of fault isolation, horizontal scalability, and the like.
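A minimal ROS 2 node makes the parallel concrete; the node name, topic, and message type here are purely illustrative:

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class BatteryMonitor(Node):
    """A small, single-purpose node: like a microservice, it owns one
    concern and exposes it over a well-defined interface (a topic)."""

    def __init__(self):
        super().__init__("battery_monitor")
        self.pub = self.create_publisher(String, "battery_status", 10)
        self.create_timer(1.0, self.tick)  # publish once per second

    def tick(self):
        msg = String()
        msg.data = "OK"  # a real node would query the battery driver here
        self.pub.publish(msg)

def main():
    rclpy.init()
    rclpy.spin(BatteryMonitor())

if __name__ == "__main__":
    main()
```

Any other node can subscribe to battery_status without knowing anything about this one, which is exactly where the microservice-like fault isolation and scalability come from.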
Finally, once software is deployed, DevOps calls for automated data collection from the field, monitoring each system and alerting the team of issues in real time, and providing efficient quality assurance infrastructure for identifying and resolving those issues. An essential feature of ROS (or any similar middleware) is the ability to record published messages with timestamps to a container file (such as a bag or MCAP) that can be “played back” to the robot’s services. These files can be continuously captured and ring-buffered, with an anomalous condition triggering the export of some period before and after the incident to a repository. This allows a developer to recreate the original or intermediary inputs and observe the resulting behaviour of the software, both to isolate the issue and to verify a prospective fix. One major hurdle is the sheer volume of data, particularly with certain sensors such as cameras, combined with the often limited bandwidth available to vehicles in the field or AMRs in warehouses and plants.
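The ring-buffer pattern itself can be sketched in a few lines of plain Python. A real implementation would wrap a rosbag2 or MCAP writer rather than an in-memory deque, but the eviction and export logic is the same:

```python
import time
from collections import deque

class IncidentRecorder:
    """Schematic sketch (not a real rosbag2 API): keep the last
    `horizon_s` seconds of timestamped messages, and export a window
    around an anomaly for upload to a repository."""

    def __init__(self, horizon_s=60.0):
        self.horizon_s = horizon_s
        self.buffer = deque()  # (timestamp, topic, message) tuples

    def record(self, topic, message):
        now = time.time()
        self.buffer.append((now, topic, message))
        # Evict messages older than the horizon so memory stays bounded.
        while self.buffer and now - self.buffer[0][0] > self.horizon_s:
            self.buffer.popleft()

    def export_window(self, incident_time, pre_s=20.0, post_s=5.0):
        # Call this at least post_s seconds after the anomaly so the
        # post-incident messages are already in the buffer.
        return [e for e in self.buffer
                if incident_time - pre_s <= e[0] <= incident_time + post_s]
```

Shipping only the exported slice, rather than the full stream, is also what makes the bandwidth problem tractable.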
The principles of DevOps are profoundly relevant to the world of robotics. From rigorous testing methodologies to continuous deployment, these principles ensure that the robots of the future are more robust, reliable, and efficient. As the line between software and the physical world continues to blur, we’ll need to think outside the traditional box to apply DevOps in new and creative ways.