# Dynamic obstacle avoidance for quadrotors with event cameras


Science Robotics  18 Mar 2020:
Vol. 5, Issue 40, eaaz9712
DOI: 10.1126/scirobotics.aaz9712

## Abstract

Today’s autonomous drones have reaction times of tens of milliseconds, which is not enough for navigating fast in complex dynamic environments. To safely avoid fast moving objects, drones need low-latency sensors and algorithms. We departed from state-of-the-art approaches by using event cameras, which are bioinspired sensors with reaction times of microseconds. Our approach exploits the temporal information contained in the event stream to distinguish between static and dynamic objects and leverages a fast strategy to generate the motor commands necessary to avoid the approaching obstacles. Standard vision algorithms cannot be applied to event cameras because the output of these sensors is not images but a stream of asynchronous events that encode per-pixel intensity changes. Our resulting algorithm has an overall latency of only 3.5 milliseconds, which is sufficient for reliable detection and avoidance of fast-moving obstacles. We demonstrate the effectiveness of our approach on an autonomous quadrotor using only onboard sensing and computation. Our drone was capable of avoiding multiple obstacles of different sizes and shapes, at relative speeds up to 10 meters/second, both indoors and outdoors.

## INTRODUCTION

Micro-aerial vehicles (MAVs) may open up new markets potentially worth several billion dollars. These markets include aerial imaging [forecast value of $4 billion by 2025 (1)], last-mile delivery [$90 billion by 2030 (2)], and aerial mobility [worth about $8 billion in 2030 (3)]. Although the market potential for MAVs is promising, safety remains a key challenge for MAVs. Several drone crashes have been reported in the news, due to either objects tossed at quadrotors during public events (4, 5) or collisions with birds (6, 7). Thus, enabling MAVs to evade fast-moving objects (Fig. 1) is critical for the deployment of safe flying robots on a large scale.

The temporal latency between perception and action plays a key role in obstacle avoidance. The higher the latency, the lower the time the robot has to react and execute an avoidance maneuver (8). This is especially critical for MAVs, where a collision can not only damage the environment but also cause severe hardware failure. In addition, micro-quadrotors have reduced payload capabilities, which puts a hard bound on the sensing and computing resources they can carry.

The existing literature on obstacle avoidance for MAVs relies on standard cameras [in a monocular (9–11) or stereo configuration (12–15)] or on depth cameras (16–18). However, these works assume that the obstacles in the environment are either static or quasistatic (i.e., slow relative motion). Similarly, state-of-the-art consumer drones are not currently capable of reliably detecting and avoiding moving obstacles. For example, the Skydio drone, one of the most advanced autonomous drones on the market to date, is not capable of dealing with moving objects (19). Therefore, developing effective solutions to avoid dynamic obstacles is a key challenge in robotics research, as well as a highly prized goal by major industry players.

To avoid fast-moving obstacles, we need to perceive them quickly. For this purpose, standard cameras are not good enough because they have an average latency of tens of milliseconds (the exposure time of a standard camera varies between 1 and 100 ms). Therefore, their limitations arise from their physical nature and cannot be solved with sophisticated algorithms. One solution could be to use event cameras; these sensors have reaction times of microseconds. Event cameras (20) are bioinspired sensors that work differently from traditional cameras. Instead of capturing images at a fixed rate, an event camera measures per-pixel brightness changes asynchronously. This results in a stream of events at microsecond resolution.

In a recent study on the role of perception latency for high-speed sense and avoidance (8), it was shown analytically that, with event cameras, the latency between the time a visual signal is triggered by the sensor and the time it is processed to output control commands is remarkably lower than with standard cameras (microseconds versus tens of milliseconds). These results are promising for robotics. However, standard vision algorithms cannot be applied, because the output of an event camera is not images but a stream of asynchronous events. Thus, alternative algorithms need to be devised to unlock the full potential of event cameras for the task at hand.

An event camera has smart pixels that trigger information independently of each other: Whenever a pixel detects an intensity change in the scene (e.g., caused by relative motion), it emits an event at the time the change was detected. An event encodes the time (at microsecond resolution) at which it occurred, the pixel location, and the sign of the intensity change. Let tk−1 be the last time an event fired at a pixel location x, and let Lk−1 = L(x, tk−1) be the intensity level at that pixel at time tk−1. A new event is fired at the same pixel location at time tk as soon as the difference between Lk and Lk−1 exceeds a user-defined threshold C > 0 in magnitude. In other words, an event is fired if L(x, tk) − L(x, tk−1) > C (positive event) or L(x, tk) − L(x, tk−1) < −C (negative event). We refer the reader to (21) for further details. To better highlight what happens across the entire sensor, we compare the output of an event camera with that of a conventional camera in Fig. 2 and in a video (22).
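The trigger rule can be illustrated with a frame-based sketch (real event cameras fire per pixel asynchronously; the threshold value C below is an arbitrary illustrative choice):

```python
import numpy as np

def events_from_intensity(L_prev, L_new, C=0.15):
    """Apply the per-pixel trigger rule between two intensity snapshots.

    Returns a list of (x, y, polarity) tuples for pixels whose intensity
    change exceeds the threshold C in magnitude.
    """
    diff = L_new - L_prev
    pos = np.argwhere(diff > C)    # positive events: intensity increased by more than C
    neg = np.argwhere(diff < -C)   # negative events: intensity decreased by more than C
    events = [(int(x), int(y), 1) for y, x in pos]
    events += [(int(x), int(y), -1) for y, x in neg]
    return events

# toy 2x2 sensor: one pixel brightens and one darkens past the threshold
L0 = np.zeros((2, 2))
L1 = np.array([[0.2, 0.0], [0.0, -0.2]])
print(events_from_intensity(L0, L1))  # [(0, 0, 1), (1, 1, -1)]
```

A real sensor evaluates this rule continuously and independently at every pixel, which is what yields the microsecond-resolution asynchronous stream.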

Event cameras may be considered asynchronous, motion-activated sensors, because they provide measurements only if and where there is relative motion between the camera and the environment. Also, because their latency is on the order of microseconds, they are a natural choice for detection and avoidance of fast-moving obstacles by flying MAVs. If one removes the events induced by the ego-motion of the vehicle (23, 24), one directly obtains information about the moving part of the scene. This leads to multiple advantages over standard cameras for the detection of dynamic obstacles: (i) The output is sparser and lighter than a frame and therefore cheaper to process; (ii) static and dynamic objects can be told apart by exploiting the temporal statistics of each event, without a dedicated segmentation step; and (iii) their high temporal resolution (on the order of microseconds) allows low-latency sensing.

In recent years, event cameras have attracted the interest of the robotics community (21). Obstacle detection is among the applications with the highest potential, and previous works investigated the use of these sensors to detect collisions (25) and track objects (23, 26). However, very few examples of closed-loop control based on event cameras are available in the literature. Among these, the majority of the works focuses on simple, low-dimensional tasks, such as 1–degree of freedom (DoF) heading regulation (27, 28), stereo camera gaze control (29, 30), 2-DoF pole balancing (31), 1-DoF robotic goalkeeping (32, 33), or navigating ground robots among static obstacles (34–36).

We recently reported closed-loop control of more complex robotic systems, such as quadrotors, using event cameras (37–39). In (37), we proposed an event-based visual-inertial odometry (VIO) algorithm for state estimation and closed-loop trajectory tracking of a quadrotor. The works (38) and (39) are the most closely related to this paper. In (38), we analyzed the feasibility of detecting spherical objects thrown at a stationary event camera on small embedded processors for quadrotors. In (39), we showed preliminary results in using shallow neural networks for segmenting moving objects from event streams and demonstrated an application to quadrotor obstacle avoidance. However, the resulting sensing latency was 60 ms rather than the 3.5 ms of this paper, thus strongly limiting the maximum relative speed at which moving objects could be evaded. In addition, unlike this paper, we did not there consider the relative distance and velocity when computing the avoidance commands. This work implements and demonstrates low-latency (3.5 ms) dynamic obstacle dodging on an autonomous quadrotor at relative speeds up to 10 m/s.

Our moving-obstacle detection algorithm works by collecting events during a short sliding time window and compensating for the motion of the robot within that window. Figure 3 shows the effects of the ego-motion compensation: The left side shows the three-dimensional (3D) volume of the events accumulated during an arbitrary time window of 10 ms, and the right side shows the same events, after ego-motion compensation, back-projected onto the image plane. We analyzed the temporal statistics of the motion-compensated events to remove those generated by the static part of the environment.

Broadly speaking, our algorithm is based on the intuition that the static part of the scene fires events uniformly across the entire time window, and after the ego-motion compensation, the pixels belonging to static objects show a uniform distribution of time stamps; conversely, dynamic objects generate ego-motion–compensated events that are accumulated around specific sections of the time window and can, therefore, be distinguished. Our technique to detect moving obstacles using event cameras is based on the method proposed in (23), where the authors used an optimization-based ego-motion compensation scheme. We modified that method to only rely on an inertial measurement unit (IMU) for the ego-motion compensation, without using any optimization scheme, which makes the algorithm sufficiently fast to run in real time on a small onboard computer.
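The temporal-statistics test described above can be sketched as a per-pixel mean time stamp image followed by a threshold (a simplified illustration; the normalization and the 0.8 threshold are assumptions, not our tuned parameters):

```python
import numpy as np

def mean_timestamp_image(events, shape, t0, dt):
    """Per-pixel mean of normalized event timestamps within a window.

    events: iterable of (x, y, t); shape: (H, W). After ego-motion
    compensation, pixels of static structure collect timestamps spread
    over the whole window (mean near 0.5), while a moving object
    concentrates recent timestamps (mean near 1.0).
    """
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for x, y, t in events:
        acc[y, x] += (t - t0) / dt   # normalize each timestamp into [0, 1]
        cnt[y, x] += 1
    return np.divide(acc, cnt, out=np.zeros(shape), where=cnt > 0)

def dynamic_mask(mean_ts, thresh=0.8):
    # keep pixels whose mean timestamp is biased toward the end of the window
    return mean_ts > thresh

# pixel 0: events spread over a 10-ms window (static); pixel 1: clustered late (moving)
ev = [(0, 0, 1.0), (0, 0, 5.0), (0, 0, 9.0),
      (1, 0, 8.0), (1, 0, 9.0), (1, 0, 10.0)]
m = mean_timestamp_image(ev, (1, 2), t0=0.0, dt=10.0)
print(dynamic_mask(m, 0.8))  # [[False  True]]
```

The static pixel averages to 0.5 and is discarded, while the late-firing pixel averages to 0.9 and survives the threshold.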

An analysis of the impact of neglecting the linear component of the robot’s ego-motion and an intuitive explanation of how and why our algorithm works are provided in the Supplementary Materials. A detailed explanation of ego-motion compensation of the events—which allowed us to obtain a so-called event frame, containing only events coming from moving objects, at a very high rate—is explained in Materials and Methods. We leveraged a fast clustering algorithm to tell apart different objects in the event frame and used a Kalman filter to obtain information about their velocity. Figure 4 provides a visual explanation of the steps involved in our algorithm and described in more detail in Materials and Methods.
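As a minimal illustration of the velocity-estimation step, a constant-velocity Kalman filter over per-detection positions might look as follows (the model dimensions and noise values are illustrative placeholders, not the filter we ran onboard):

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for one tracked obstacle.

    State is [positions, velocities]; each position measurement refines
    both the position and, through the cross-covariance, the velocity.
    """
    def __init__(self, dim=3, q=1.0, r=0.05):
        self.x = np.zeros(2 * dim)      # [p, v]
        self.P = np.eye(2 * dim)        # state covariance
        self.dim, self.q, self.r = dim, q, r

    def predict(self, dt):
        F = np.eye(2 * self.dim)
        F[:self.dim, self.dim:] = dt * np.eye(self.dim)  # p <- p + v*dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.q * np.eye(2 * self.dim)

    def update(self, z):
        # we observe position only
        H = np.hstack([np.eye(self.dim), np.zeros((self.dim, self.dim))])
        S = H @ self.P @ H.T + self.r * np.eye(self.dim)
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - H @ self.x)
        self.P = (np.eye(2 * self.dim) - K @ H) @ self.P
```

Feeding the filter a sequence of approaching-obstacle detections drives the velocity estimate negative along the approach axis, which is the quantity the avoidance stage consumes.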

The position and velocity of each obstacle relative to the camera were then fed to a fast avoidance algorithm designed to leverage the low sensing latency. To do so, we used a reactive avoidance scheme based on artificial potential fields (40), relying on fast geometric primitives to represent the obstacles, which renders it computationally inexpensive. We proposed a formulation of the repulsive field that better suits the task of avoiding fast-moving obstacles by taking into account the need for a prompt reaction when an obstacle is detected. Compared with previous approaches, our repulsive potential increases much faster as the distance between the robot and the obstacle decreases, rendering the avoidance maneuver more reactive and agile. In addition, we considered both the magnitude and the direction of the obstacle's velocity to decide in which direction to evade and introduced a decay factor in the magnitude of the potential to account for the fact that the obstacles we consider are dynamic; i.e., they do not occupy the same position over time.
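A hedged sketch of such a repulsive command follows (the gain, influence distance d0, and decay rate are illustrative placeholders, and the sideways-evasion rule is a simplified stand-in for our formulation):

```python
import numpy as np

def repulsive_velocity(p_rel, v_obs, t_since_detection,
                       gain=4.0, d0=1.5, decay=0.5):
    """Repulsive command from one obstacle under a potential-field rule.

    p_rel: obstacle position relative to the robot; v_obs: obstacle velocity.
    """
    d = np.linalg.norm(p_rel)
    if d >= d0:
        return np.zeros(3)                 # outside the influence region
    # magnitude grows sharply as the obstacle gets closer ...
    magnitude = gain * (1.0 / d - 1.0 / d0) ** 2
    # ... and decays over time, since a dynamic obstacle does not stay put
    magnitude *= np.exp(-decay * t_since_detection)
    # push away from the obstacle, removing the component along its velocity
    away = -p_rel / d
    vn = np.linalg.norm(v_obs)
    if vn > 0:
        v_dir = v_obs / vn
        away = away - np.dot(away, v_dir) * v_dir  # evade sideways, not along the path
        n = np.linalg.norm(away)
        if n > 1e-6:
            away = away / n
        else:
            away = np.array([1.0, 0.0, 0.0])  # head-on: pick an arbitrary sideways escape
    return magnitude * away
```

For an obstacle slightly off-axis and flying toward the robot, the command points sideways, away from the predicted path, and vanishes for obstacles beyond the influence distance.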

Our approach prioritizes computation speed over accuracy: we trade off detection accuracy for latency. Nevertheless, we show that our algorithm takes, on average, only 3.5 ms (from the moment it receives the events to process to when it sends the first command to avoid the detected obstacles) to detect moving obstacles, with a position error usually on the order of a few tens of centimeters. Trading detection accuracy for latency is not only a necessity for robotic platforms; it has also been frequently observed among animals (41) in the execution of several tasks involving visual sensing.

We demonstrated the effectiveness of our approach in real experiments with a quadrotor platform. We validated our system with both a monocular setup (for obstacles of known size) and a stereo setup (for obstacles of unknown size), both indoors and outdoors. The entire avoidance framework is capable of running in real time on a small single-board computer onboard the vehicle, together with the entire software stack necessary to let the robot fly (i.e., state estimation, high-level control, and communication with the flight controller). Experimental results show that our framework allowed the quadrotor to avoid obstacles moving toward it at relative speeds up to 10 m/s from a distance of around 3 m. The integration of known techniques for event-based obstacle detection (23) and avoidance (40), adapted to make it possible for the entire framework to run in real time on a small computer as well as to deal with fast-moving obstacles, represents the main contribution of this paper.

## RESULTS

### Accuracy and success rate

We collected a dataset of more than 250 throws, obtaining around 1200 detections, and compared the output of our event-based detector with ground truth data from a motion capture system. For each detection, we computed the norm of the position error, and in Table 1, we summarize the results.

Table 1 Accuracy of our event-based algorithm to detect moving obstacles.

We analyzed both the monocular and the stereo setups and compared the detections with ground truth data provided by a motion capture system. For each configuration, we report (expressed in meters) the mean, the median, the SD, and the maximum absolute deviation (M.A.D.) of the norm of the position error for different ranges of distances.


We grouped together measurements falling within bins of 0.5 m, with the first one starting from a distance of 0.2 m along the camera’s optical axis, because the algorithm did not successfully detect the obstacles at closer ranges. In the case of the monocular detector, it was necessary to discard some of the collected data because, at short distances, the obstacle was often only partially visible; our monocular algorithm then failed to correctly estimate its distance, because it fit a known size to a partially visible object. This issue became less pronounced as the distance to the camera increased, and beyond 1 m, it did not substantially affect the detector’s performance. Conversely, as expected, the stereo configuration was more precise at low ranges: The farther the object from the cameras, the higher the uncertainty in the triangulation and, therefore, the larger the error.
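The growth of the stereo error with distance follows from triangulation: for depth z = f·b/d, the uncertainty grows roughly as dz ≈ z²/(f·b)·Δd. A small sketch (the focal length, baseline, and disparity noise below are hypothetical values, not our calibration):

```python
def depth_error(z, f_px, baseline_m, disparity_err_px=0.5):
    """Approximate stereo depth uncertainty: dz ~ z^2 / (f * b) * delta_d.

    Shows why triangulation error grows quadratically with distance.
    """
    return z ** 2 / (f_px * baseline_m) * disparity_err_px

# hypothetical 200-px focal length and 10-cm baseline
for z in (0.5, 1.0, 2.0):
    print(f"z = {z} m -> depth error ~ {depth_error(z, 200.0, 0.10):.3f} m")
```

Doubling the distance quadruples the expected depth error, which matches the qualitative trend in Table 1 for the stereo setup.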

Independently of the configuration, however, the data in Table 1 show that our algorithm, although not tailored toward accuracy but rather optimized for low latency, provides measurements that are sufficiently accurate to allow a quadrotor to perceive its surroundings and detect moving obstacles effectively. The position error is, on average, smaller than the vehicle size, whereas the SD is slightly higher but still reasonable up to 1 m. Among the factors that contribute to the error in estimating the obstacles’ position, the low resolution of the camera certainly plays a key role.

Another important aspect of our detection algorithm is its success rate. For the entire framework to be effective, not only does it have to guarantee low latency, but it also has to be robust in terms of success rate. To assess this robustness, we performed an evaluation with objects of different sizes and shapes. Such objects were thrown through the field of view of the sensors, and using the intrinsic calibration of the camera and ground truth data from a motion capture system, we obtained information about when they were supposed to be visible. If an object was in the field of view of the cameras but the system did not report any detection, we considered it a failure of the detection algorithm. This allowed us to analyze the success rate of our algorithm in detecting moving objects, reported in Table 2. We used objects of size up to 30 cm and grouped them into three categories: smaller than 10 cm, smaller than 20 cm, and up to 30 cm. For each category, we provide the success rate for objects moving at different distances from the camera, grouping the detections according to that distance. As the table shows, our algorithm provided high success rates in different scenarios, with small and large objects. Given the limited fields of view of the cameras, large objects have a lower detection rate at short distances, mostly because of their different appearance in the two cameras of the stereo setup, which makes it hard for the algorithm to match the individual detections. Similarly, small objects are harder to detect at large distances because of the limited angular resolution of the sensor used for the experiments. We did not notice any substantial impact of the objects’ speed or incoming angle on the success rate.

Table 2 Success rate of the event-based detector.

Each column reports the success rate for objects moving at a certain distance range from the camera. Each row shows the success rate of detecting objects smaller than a certain size. The results are obtained on a dataset comprising 100 throws of objects belonging to each size.


### Computational cost

We quantified the computational cost of our detection algorithm by throwing objects within the field of view of the event camera while simultaneously rotating it and measuring the time necessary to process all the events fired within the last 10-ms time window. Table 3 shows the results of our evaluation, highlighting how each step of the algorithm contributes to the overall computation time. The evaluation was performed on an NVIDIA Jetson TX2 board, with the algorithm running exclusively on the central processing unit (i.e., the graphics processing unit available on the same board was not used at all). The numbers reported in Table 3 refer to the time required to run the detection algorithm with one camera; however, running multiple instances of the same algorithm for multiple cameras (as, for example, in the stereo case) did not substantially affect performance, because the individual parts can be computed in parallel.

Table 3 The mean, μ, and SD, σ, of the computation time of the obstacle detection algorithm.


The most expensive part of the algorithm is given by the ego-motion compensation, which, on average, requires 1.31 ms (36.80% of the overall time), with an SD of 0.35 ms. The time necessary for this step depends on the number of events that need to be processed, and fig. S4 shows a linear dependence between the two. To understand how many events are typically generated in real-world scenarios during our experiments, we collected some statistics about the number of events that the algorithm needs to process. The collected data indicated that, on average, both indoors and outdoors, the number of events belonging to a time window of 10 ms spanned between 2000 and 6000.

Another step that depends on the relative motion between the camera and the scene is the clustering of the events belonging to the dynamic obstacles. This step is necessary to understand how many objects are in the scene and to associate each event with them. Clustering the events usually required 0.69 ms (19.39% of the overall time), with an SD of 0.20 ms. The actual processing time to cluster the events depends on the number of pixels where events belonging to dynamic obstacles fired, and fig. S5 shows how long our clustering algorithm takes as a function of the number of pixels to process.

Last, thresholding the mean time stamp image and applying morphological operations to the thresholded image do not depend on the number of events to be processed (as shown by their very low SDs), because the entire image has to be processed; they required, on average, 0.98 ms (27.52% of the overall time) and 0.58 ms (16.29%), respectively.

Note that, in our evaluation of the algorithm’s computational time, we neglected the estimation of the 3D position of the obstacle. This step requires very simple calculations, which are independent of the number of events generated and, on average, take on the order of a few microseconds. Therefore, their impact on the overall computational time is negligible.

### Different types of obstacles

The main reason we adopted a stereo configuration is the need to correctly detect moving obstacles independently of their shape and size. This is not possible with a single camera unless the size of the obstacle is known in advance. Figure S5 shows that the algorithm we propose to detect moving obstacles using two event cameras can handle different kinds of obstacles. In that figure, obstacles with completely different geometries are detected: a small ball, a box, a whiteboard marker, a frisbee, a quadrotor, and a bowling pin. The first column reports a frame grabbed from the SEES1 camera, where the object was often not visible because of motion blur (we manually highlighted the region where the object is in the frame with a red circle). The remaining columns depict the previously described steps of our detection algorithm, with the same color code used in Fig. 4.
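The scale ambiguity of the monocular case follows from the pinhole model: the apparent image size fixes only the ratio of true size to depth. A small sketch (the focal length is a hypothetical value):

```python
def monocular_depth_known_size(real_size_m, size_px, f_px):
    """Depth from apparent size under a pinhole model: z = f * S / s.

    The same image size s is consistent with any (S, z) pair of equal
    ratio, which is why the monocular setup needs the size in advance.
    """
    return f_px * real_size_m / size_px

# with a hypothetical 200-px focal length, a 20-px blob is ambiguous:
print(monocular_depth_known_size(0.1, 20.0, 200.0))  # 0.1-m ball -> 1.0 m away
print(monocular_depth_known_size(0.2, 20.0, 200.0))  # 0.2-m ball -> 2.0 m away
```

Both hypotheses produce the identical 20-pixel observation, so a second camera (or a size prior) is required to resolve the depth.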

### Detection of multiple, simultaneous obstacles

Our pipeline is able to deal with multiple obstacles moving in the scene simultaneously thanks to the clustering and measurement association steps described in Materials and Methods. Figure S6 shows an example where the proposed algorithm correctly detects and clusters the events belonging to three different moving obstacles. In this case, three small balls (manually highlighted by red circles to facilitate the reader’s understanding) were thrown in front of the camera, and the algorithm successfully associated each event with the respective object.

The main limitation of our approach is that, when two or more obstacles are very close to each other in the frame containing the events belonging to dynamic objects, it is very hard, if not impossible, to disambiguate them. This is because no prior information about the obstacles (e.g., shape or, in the stereo case, size) is used, and no intensity information (i.e., frames from the onboard camera) is exploited to tell apart objects that cannot be segmented using only events. In our experimental evaluation, this turned out not to be a real issue for the system, because as soon as the overlapping obstacles moved away from each other, the system promptly detected them and treated them as separate entities.

### Obstacle avoidance framework validation

To validate our obstacle avoidance framework, we conducted a large set of experiments in real-world scenarios. The experiments were executed in two different scenarios: one indoors and the other one outdoors. The indoor experiments were conducted within a motion capture system, in the same setup we used in our previous work (8), and the aim was twofold: (i) collecting ground truth data to verify the effectiveness of the framework in situations where a collision with the obstacle would have happened (which was checked in postprocessing thanks to the data from the motion capture) and (ii) validating the overall framework in an easier setup before moving to more complex scenarios. We used the same quadrotor platform we presented in (8) for the indoor experiments, equipped with a monocular setup. Conversely, the outdoor experiments were conducted using a different vehicle, equipped with a stereo setup.

#### Indoor experiments

The main goal of the indoor experiments was to determine the effectiveness of our framework to avoid dynamic obstacles by determining whether a collision was actually prevented by analyzing the data coming from a motion capture system. The indoor experiments were realized using the same platform described in (8), in the monocular setup. We repeatedly threw a ball of known size toward the quadrotor, which used the event camera to detect and avoid it. Using the ground truth measurements coming from the OptiTrack motion capture system, we could intersect the trajectory of the ball with the position where the vehicle was hovering to determine whether, without the execution of the escape maneuver, the ball would have hit the vehicle or not. The outcome of this analysis is that our algorithm was capable of preventing actual collisions between a flying robot and dynamic obstacles, at relative speeds up to 10 m/s, as confirmed by the ground truth data about the trajectory of the object provided by the motion capture system.

Figure 5 shows one of the indoor experiments, reporting four snapshots recorded with a static camera. The ball took about 0.25 s to reach the vehicle from the moment it was thrown (Fig. 5A). At that time, as shown in Fig. 5D, the quadrotor had already moved to the side to prevent the collision, showing that the algorithm successfully detected the ball and planned an evasive maneuver with very low latency.

#### Outdoor experiments

After evaluating the performance of our framework in an indoor setup, we performed outdoor experiments using the quadrotor platform equipped with two Insightness SEES1 cameras in a stereo setup. We executed two types of experiments, namely, in a static scenario, where the vehicle hovers at the desired position, and in a dynamic scenario, where the robot flies toward a target location. In both cases, we threw different kinds of objects toward the quadrotor, which only relied on the two event cameras to detect them and avoid them.

We tested the performance of our algorithm in static scenarios with different types of objects, with multiple obstacles moving toward the vehicle at the same time, as well as with objects thrown consecutively one after the other to benchmark the responsiveness of the overall approach. The vehicle successfully managed to detect and avoid them most of the time, although in some cases, the detection was not successful and led to a collision between the robot and the obstacles. Later, we discuss the major failure causes of our algorithm; nevertheless, in outdoor experiments, the algorithm successfully detected and avoided the obstacles thrown toward it more than 90% of the time.

Figure 6 shows four snapshots captured from a sequence recorded in a dynamic scenario. The robot moved toward a target position, from left to right in the pictures, at a linear speed of 1.5 m/s. While reaching its destination, the robot detected the yellow ball thrown toward it (shown on the right side of Fig. 6A). The vehicle decided to execute an evasive maneuver upward while keeping its travel speed toward the desired position constant. This resulted in a maneuver that simultaneously allowed the vehicle to proceed with its task and avoid a collision. Additional experiments in a dynamic situation are shown in movie S1.

### Limitations of event sensors

As we have previously shown, event cameras allow fast, low-latency detection of moving obstacles, and we discussed the advantages of these bioinspired neuromorphic sensors over standard cameras. However, as of today, they are mostly research-oriented sensors and thus still require substantial engineering effort to address their main issues.

One of the problems with current event cameras is their weight. Most of the event cameras available nowadays are larger and heavier than state-of-the-art standard cameras for robotic applications, which are typically below 50 g. The Insightness SEES1 is, to the best of our knowledge, the smallest event camera that also provides frames (which is particularly convenient for calibrating the intrinsic and extrinsic parameters of the sensor) and can easily be mounted on a quadrotor (its size is 3.5 cm by 3.5 cm, and it weighs 15 g). However, its resolution [320 pixels by 240 pixels, QVGA (Quarter Video Graphics Array)] is particularly low compared with standard cameras. This makes it necessary to find the right trade-off between field of view and angular resolution: The larger the former, the smaller the latter, which reduces the range at which objects can be detected reliably (8). A small field of view, however, has a negative impact on the detection of obstacles entering the sensing range of the vehicle from the side, as, for example, in our outdoor dynamic experiments: The larger the field of view, the earlier the vehicle can detect and avoid obstacles moving toward it from the sides.
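The field-of-view versus angular-resolution trade-off can be made concrete with a pinhole-model estimate of the smallest reliably detectable object (the 5-pixel detectability threshold and the lens choices below are assumptions for illustration only):

```python
import math

def min_detectable_size(z, fov_deg, width_px, min_px=5):
    """Smallest object (in meters) that spans `min_px` pixels at distance z.

    Pinhole model: meters-per-pixel at depth z is 2*z*tan(fov/2)/width_px,
    so a wider lens on the same sensor raises the minimum detectable size.
    """
    m_per_px = 2.0 * z * math.tan(math.radians(fov_deg) / 2.0) / width_px
    return min_px * m_per_px

# QVGA sensor (320 px across) at 1.5 m, for two hypothetical lenses
for fov in (90.0, 120.0):
    print(f"FOV {fov:.0f} deg -> min size ~ {min_detectable_size(1.5, fov, 320):.3f} m")
```

With the same 320-pixel sensor, widening the lens to see obstacles arriving from the side nearly doubles the smallest object that still covers enough pixels to be detected.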

Another problem is the noise characteristics of these sensors: They show higher noise than standard cameras, which often has a negative impact on the performance of event-based vision algorithms. In our approach, for example, to obtain reliable detections and eliminate false positives caused by sensor noise, we had to substantially increase the threshold used to separate events generated by the static part of the scene from those caused by moving objects. This resulted in an obstacle detection algorithm less reactive to small relative motion, especially at large distances. For this reason, we discarded all detections reporting distances between the camera and the obstacle beyond 1.5 m.

The aforementioned issues represent the main failure causes of our approach. In most cases when our quadrotor was not able to avoid an object thrown toward it, the object was detected too late, either because it entered the field of view of the camera at too short a distance (so that the vehicle could not complete the evasive maneuver in time) or because its motion did not generate enough events for our algorithm to detect it.

## CONCLUSIONS

We presented a framework that enables a quadrotor to dodge fast-moving obstacles using only onboard sensing and computing. Different from conventional cameras, we used event cameras—neuromorphic sensors with reaction times of microseconds. Each pixel of an event camera reacts to changes in intensity, making this sensor a perfect fit for detecting and avoiding dynamic obstacles. Event cameras can overcome the physical limitations of standard cameras in terms of latency but require unconventional algorithms to process the asynchronous stream of events they generate. We investigated the exploitation of the temporal statistics of the event stream to tell apart the dynamic part of a scene, showing that it is possible to detect moving objects with a perception latency of 3.5 ms. We showed that our algorithm is capable of accurately and reliably detecting multiple simultaneous objects with different shapes and sizes. We combined our event-based detection algorithm with a fast strategy to generate commands that allow the vehicle to dodge incoming objects. We validated our approach with extensive experiments on a real quadrotor platform, both indoors and outdoors, demonstrating the effectiveness of the method at relative speeds up to 10 m/s.

## MATERIALS AND METHODS

### Obstacle detection

This section describes how our event-based algorithm to detect moving obstacles works. An additional explanation of the working principle of this algorithm is provided in movie S2.

#### Ego-motion compensation of the events

An event camera generates events when intensity changes occur in the image. This can happen because of either moving objects or the ego-motion of the sensor. Because we are only interested in avoiding moving objects, the first step is to remove all data generated by the quadrotor’s ego-motion.

One way of removing ego-motion from an event stream is described by Mitrokhin et al. (23). This approach, however, uses an optimization routine to estimate the ego-motion, which is computationally demanding and, therefore, introduces latency in the perception system. In this work, we replaced the optimization step with a simpler and computationally efficient ego-motion compensation algorithm. To do this, we used the IMU's angular velocity, averaged over the time window where the events were accumulated, to estimate the ego-rotation, and we use this rotation to warp the events in the image. Our approach does not consider the translational motion of the camera but rather assumes that the events are generated mostly by rotational motion. To compensate for the translational motion, it would be necessary to estimate the depth of the points generating each event, which would increase the computational complexity too much to be practical. As long as the distance to stationary objects is large enough, our system is not substantially affected by this assumption. In addition, an analysis of the impact of neglecting the linear component of the robot's ego-motion is provided in the Supplementary Materials. This choice allows our pipeline to be fast enough to guarantee real-time performance but comes at the cost of a potentially higher number of noncompensated events. To cope with this, we tuned the parameters of our algorithm, whose working principle is described below, so that it filters out most of the events generated by the static part of the scene.

The first step of our algorithm requires the collection of a batch of events and IMU data over a specified time window δt. In our experiments, we used a time window of length δt = 10 ms, because we found this value to be a good compromise between sensitivity and real-time performance. A very short time window renders the entire algorithm insufficiently sensitive, because the events collected do not contain enough information to perform reliable detection. On the other hand, increasing the time window too much leads to a very high number of events to process, making the entire algorithm slower, and does not provide much added value, because the additional events are generated by the past history of the obstacle motion. Next, we average the IMU's angular velocity measurements over δt to obtain ω̄. We then apply the Rodrigues rotation formula to build the rotation matrix from ω̄δt (42). Each event e_i of the batch is then warped in the image plane by the rotation ω̄(t_i − t_0), where t_0 is the time stamp of the first event of the batch and t_i is the time stamp of event e_i. This warping is described by a field ϕ : ℝ³ → ℝ³ that maps each event (x, y, t) to its motion-compensated position: ϕ(x, y, t − t_0) : (x, y, t) → (x′, y′, t). The set of motion-compensated events is denoted by

C′ = Π{ϕ(C)} = Π{ϕ(x, y, t − t_0)} = {x′, y′, t_0} ∀ {x, y, t} ∈ C   (1)
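As an illustration, the batch warping can be sketched in a few lines of numpy. The function names, the pinhole back-projection through the intrinsic matrix, and the sign convention of the warp are our assumptions for this sketch, not the authors' implementation:

```python
import numpy as np

def rodrigues(rvec):
    """Rotation matrix from a rotation vector (Rodrigues' formula)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def motion_compensate(events, omega_samples, K_cam):
    """Warp a batch of events (x, y, t) by the average IMU angular velocity.

    events: (N, 3) array of pixel coordinates and timestamps (seconds).
    omega_samples: (M, 3) angular velocities [rad/s] measured during the window.
    K_cam: 3x3 camera intrinsic matrix.
    Returns the warped (N, 3) events (x', y', t).
    """
    omega_bar = omega_samples.mean(axis=0)     # average over the window
    t0 = events[:, 2].min()                    # time stamp of the first event
    K_inv = np.linalg.inv(K_cam)
    warped = events.copy()
    for i, (x, y, t) in enumerate(events):
        R = rodrigues(-omega_bar * (t - t0))   # undo rotation accrued since t0
        ray = K_inv @ np.array([x, y, 1.0])    # back-project to a bearing ray
        p = K_cam @ (R @ ray)                  # rotate and re-project
        warped[i, :2] = p[:2] / p[2]
    return warped
```

With zero angular velocity the warp is the identity, and the first event of the batch (at t_0) is never displaced.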

The original event position (x, y) lies on a discretized image plane in ℕ², whereas (x′, y′) lies in ℝ². From the warped events, we construct the event-count image I, where each pixel records the total number of events mapped to it. The set of warped events falling into pixel (i, j) is

ξ_ij = {(x′, y′, t) : (x′, y′, t_0) ∈ C′, i = round(x′), j = round(y′)}   (2)

Here, (i, j) ∈ ℕ² denotes the integer pixel coordinates of the discretization bin for (x′, y′) ∈ ℝ². From this, we construct the event-count pixel I_ij as I_ij = ∣ξ_ij∣, with ∣A∣ being the cardinality of the set A. Next, we construct the time image T, also defined on the discretized plane ℕ². Each of its pixels contains the average time stamp of the warped events:

T_ij = (1/I_ij) Σ_{t ∈ ξ_ij} t   (3)

To determine whether a pixel belongs to a moving object or to the background, each pixel is given a score ρ(i, j) ∈ [−1, 1], for (i, j) ∈ T:

ρ(i, j) = (T_ij − mean(T)) / δt   (4)

These scores form the so-called normalized mean time stamp image ρ. If ρ(i, j) ≥ τ_threshold, with τ_threshold being a specified threshold, the pixel belongs to a moving object; otherwise, it belongs to the background.
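The count image, time image, and thresholding steps can be sketched compactly. This is an illustrative numpy implementation under our own assumptions: coordinates are rounded to the nearest bin, the mean of T is taken over active pixels only, and the helper name and default threshold are ours:

```python
import numpy as np

def moving_object_mask(warped_events, shape, dt, tau=0.3):
    """Build the time image T and normalized mean-timestamp image rho,
    then threshold rho to mask pixels that belong to moving objects.

    warped_events: (N, 3) motion-compensated events (x', y', t), with t
                   relative to the start of the window.
    shape: (H, W) of the image plane; dt: window length in seconds.
    """
    H, W = shape
    count = np.zeros((H, W))                          # event-count image I
    tsum = np.zeros((H, W))                           # accumulated timestamps
    ij = np.round(warped_events[:, :2]).astype(int)   # discretization bins
    for (x, y), t in zip(ij, warped_events[:, 2]):
        if 0 <= y < H and 0 <= x < W:
            count[y, x] += 1
            tsum[y, x] += t
    T = np.divide(tsum, count, out=np.zeros_like(tsum), where=count > 0)
    rho = (T - T[count > 0].mean()) / dt              # normalized mean timestamp
    return (rho >= tau) & (count > 0)                 # moving-object pixels
```

Pixels whose events arrived late in the window (a moving object sweeping through) score high, whereas background pixels, whose events are spread uniformly over the window, score near zero.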

Whereas the original approach (23) uses a fixed threshold to distinguish between ego-motion–generated events and those generated by a moving object, we instead use a linear function of the angular velocity's magnitude, i.e., τ_threshold(ω) = a ⋅ ‖ω‖ + b. Here, a and b are design parameters: b sets the threshold when the camera is static, and a increases it as the angular velocity's magnitude grows. This makes it easier to detect moving objects while the quadrotor is static, while still rejecting the increased noise generated by faster rotational velocities. After thresholding, some events belonging to the static part of the scene may not be filtered out, generating salt-and-pepper noise that we remove using morphological operations.

It is important to note that, because our algorithm relies only on the IMU to perform ego-motion compensation, it is less computationally demanding than the approach in (23) but at the same time more sensitive to false-positive detections generated by the ego-motion of the vehicle. To cope with this, we adopted fairly large values for the threshold parameters to avoid false positives. This came at the cost of a less sensitive detection algorithm (i.e., it discards more detections than it would with lower thresholds), and therefore, we had to find a good compromise between sensitivity and reliability.

Figure 4 shows our algorithm in action. All the events generated in the last time window (Fig. 4B) are motion-compensated using the IMU, and, for each pixel, we compute the normalized mean time stamp (Fig. 4D), which is then thresholded (Fig. 4E) to obtain a frame containing only events belonging to moving obstacles (Fig. 4F).

The same algorithm running across different consecutive time windows is shown in fig. S8. Each column corresponds to a different time, with the first row reporting the frame captured by the onboard camera, the second row showing the events collected in the window, and the third row presenting the same events after the ego-motion compensation and thresholding of the normalized mean time stamp.

#### Obstacle segmentation

After performing the ego-motion compensation of the events that fired in the last time window, we obtain a frame containing the location of the events belonging to the dynamic part of the scene (Fig. 4F). It is important to note that at this point, our algorithm has already discarded all the static parts of the scene, at very little computational cost. To do the same with a standard camera, one has to receive at least two frames to distinguish between static and dynamic objects, and each frame needs to be processed entirely. The output of an event camera, instead, is much sparser, allowing us to process only the pixels where at least one event fired.

#### Clustering

The thresholded image created by the ego-motion compensation described in “Ego-motion compensation of the events” can include multiple moving obstacles, as well as noise. Therefore, the next step is to separate the image points of the individual objects, as well as the noise.

The goal for the system is to be capable of handling an arbitrary number of obstacles, as well as being robust against noise. In addition, because of the low-latency requirement, the clustering has to be performed in the shortest time possible. With these requirements, we evaluated different algorithms to decide on the best fitting one for our system.

Our experimental evaluation highlighted that the density-based spatial clustering of applications with noise (DBSCAN) algorithm (43) has all the required characteristics for low-latency detection of moving objects. It detects clusters without previous knowledge about their shape or their number. In addition, it can handle noise by assigning it to a separate category. It has an average time complexity of O(n log n) and a worst case of O(n²), but without the need for an iterative solution, which makes it comparatively fast. Another advantage is that its cost function can be arbitrarily chosen and therefore optimized for our system. Besides the cost function, it has only two design parameters: the minimum number of data points within a cluster and the maximum cost ϵ for choosing whether a data point belongs to a given cluster. A detailed description of it is found in (43).
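A minimal DBSCAN with a pluggable cost function can be sketched as follows. This is an illustrative implementation of the textbook algorithm (43), not the authors' code; the function and parameter names are ours:

```python
import numpy as np

def dbscan(points, cost, eps, min_pts):
    """Minimal DBSCAN with an arbitrary pairwise cost function.

    points: list of feature vectors; cost: pairwise cost function;
    eps: maximum cost for neighborhood membership; min_pts: minimum
    neighborhood size for a core point. Returns one label per point (-1 = noise).
    """
    n = len(points)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = [j for j in range(n) if cost(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            continue                      # noise (may be claimed by a cluster later)
        labels[i] = cluster
        queue = list(neighbors)
        while queue:                      # expand the cluster from core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                nb = [k for k in range(n) if cost(points[j], points[k]) <= eps]
                if len(nb) >= min_pts:
                    queue.extend(nb)
        cluster += 1
    return labels
```

Because the cost function is a free parameter, the same routine works for raw image points (Eq. 5) and for precomputed cluster features (Eq. 6).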

#### Optical flow

The density of image points and their distance in the image plane depend on the objects' velocity, distance to the sensor, and overall size. Having only the mean time stamp and image position resulting from the ego-motion compensation, as described in “Ego-motion compensation of the events,” makes it impossible to effectively cluster the image points of objects with different velocities and distances from the dynamic vision sensor (DVS). Therefore, we require additional features. One possibility is to calculate the image points' optical flow and thereby obtain an estimate of their image plane velocity. An added advantage is that two objects that generate image point clusters in close proximity but move in different directions become easier to distinguish. Ideally, one would calculate the optical flow directly from the event data, but existing algorithms for this either produce only the velocity's magnitude or its direction, or are extremely computationally expensive while having low accuracy, as evaluated in (44). Instead, we decided to use a conventional optical flow algorithm on the unthresholded normalized mean time stamp image produced by the ego-motion compensation. The high temporal resolution of the DVS and the high update frequency of our system allow us to assume that the displacement between two frames is small and approximately constant in a region around an image point. Therefore, we used the Lucas-Kanade algorithm (45), which is less sensitive to noise than pointwise methods and, by combining the information of several nearby points, better at handling the ambiguity of the optical flow equations. To increase the robustness of the optical flow, we applied an averaging filter both to the input images and to the resulting velocity field.
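The Lucas-Kanade solve itself is a small least-squares problem per point. The sketch below is a bare-bones single-window version under the stated small-displacement assumption (no pyramid, no averaging filter; the helper name and window parameter are ours):

```python
import numpy as np

def lucas_kanade(img0, img1, x, y, win=5):
    """Estimate the (u, v) image velocity at pixel (x, y) between img0 and
    img1, assuming a small, locally constant displacement (Lucas-Kanade)."""
    h = win // 2
    # spatial gradients of img0 and the temporal difference over the window
    Ix = np.gradient(img0, axis=1)[y - h:y + h + 1, x - h:x + h + 1].ravel()
    Iy = np.gradient(img0, axis=0)[y - h:y + h + 1, x - h:x + h + 1].ravel()
    It = (img1 - img0)[y - h:y + h + 1, x - h:x + h + 1].ravel()
    A = np.stack([Ix, Iy], axis=1)
    # least-squares solution of the brightness-constancy system A [u v]^T = -It
    uv, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return uv
```

Pooling gradients over a window is what makes the system overdetermined and resolves the aperture ambiguity that a single pixel cannot.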

#### Combined clustering algorithm

To maximize the accuracy of the clustering, we use all the available information: the image position p, the normalized mean time stamp value ρ, and the velocity v estimated through optical flow. With these quantities, we constructed the DBSCAN cost function as

C_i,j(p, v, ρ) = w_p ‖p_i − p_j‖ + w_v ‖v_i − v_j‖ + w_ρ ∣ρ_i − ρ_j∣   (5)

Here, w = [wp, wv, wρ]T is a weight vector for the influence of the individual parts.

Even though the DBSCAN algorithm is quite efficient, with a worst-case scaling of O(n²) in the data size, its computation time still grows with the amount of data. Especially for fast-moving objects, or ones that move close to the sensor, the density of the generated events and, therefore, the overall data size to be clustered increase, leading to far greater computation times. To overcome this, we performed a preclustering step of the image points using an eight-way connected components clustering algorithm. For this, we assume that two image points located directly next to each other in the image plane always belong to the same object. We then calculate the mean velocity of the image points belonging to each cluster, as well as the mean normalized mean time stamp, and fit a rotated rectangle around the points. The DBSCAN cost function is adapted to the new features: instead of the individual points' velocity and normalized mean time stamp, we use the corresponding mean values, and the difference in position is substituted by the minimal distance between the corresponding rectangles:

C_i,j = w_p dist_min(r_i, r_j) + w_v ‖v_mean,i − v_mean,j‖ + w_ρ ∣ρ_mean,i − ρ_mean,j∣   (6)

If two corresponding rectangles overlap, their distance is set to zero. Instead of rectangles, ellipses could have been used, but finding the minimal distance between two ellipses requires finding the roots of a fourth-order polynomial, which needs an iterative solution and takes drastically more time. Because the connected components algorithm has a time complexity of O(n) and reduces the DBSCAN data size by orders of magnitude, the overall clustering computation time decreased, on average, by a factor of 1000.
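The eight-way connected-components preclustering amounts to a flood fill over the binary moving-object mask. A minimal illustrative sketch (helper name and return convention are ours):

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """Eight-way connected-component labeling of a binary mask.

    Neighboring active pixels are assumed to belong to the same object.
    Returns (labels, n_components); labels is 0 for background, 1..n otherwise.
    """
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        current += 1
        q = deque([(sy, sx)])
        labels[sy, sx] = current
        while q:                              # breadth-first flood fill
            y, x = q.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                            and mask[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = current
                        q.append((ny, nx))
    return labels, current
```

Each resulting component is then summarized by its mean velocity, mean normalized time stamp, and a fitted rotated rectangle, so that DBSCAN operates on a handful of components instead of thousands of pixels.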

#### 3D position estimation

After receiving a set of cluster points, we first fit a rotated rectangle around them to reduce the data dimensionality. From this, we get the four corner points, as well as the center position in the image plane.

For the next step, the estimation of the obstacle’s depth toward the image plane, we have to distinguish between the monocular and stereo case.

##### Monocular case

Because we are not able to calculate the depth of an image point from a single monocular image, we instead limit our system to objects of known size. With the added size information, we can then estimate the depth of an object in the camera's frame of reference as

Ẑ_C = f ⋅ ω_real / ω̂   (7)

where f is the focal length, ω_real is the width of the object, and ω̂ is the measured side length of the fitted rectangle.
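Numerically, Eq. 7 and its rearrangement (used later for size estimation in the stereo case, Eq. 9) are one line each. A small worked sketch with illustrative numbers of our choosing:

```python
def depth_from_known_size(f_px, w_real, w_px):
    """Monocular depth from a known object width (Eq. 7): Z = f * w_real / w_px.
    f_px: focal length in pixels; w_real: true width [m]; w_px: measured width [px]."""
    return f_px * w_real / w_px

def size_from_depth(f_px, depth, w_px):
    """Object size from an estimated depth (the rearrangement used in Eq. 9)."""
    return depth * w_px / f_px
```

For example, a 0.2-m-wide ball that appears 40 pixels wide under a 200-pixel focal length lies at a depth of 1 m.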

##### Stereo case

For the stereo case, we use the disparity between two corresponding clusters of the stereo image pair for the depth estimation. This allows the algorithm to handle objects of unknown size. To determine cluster correspondences, we use a matching scheme minimizing the cost

C = w_p ‖x_c,top − x_c,bottom‖ + w_a max(A_top/A_bottom, A_bottom/A_top) + w_n max(n_top/n_bottom, n_bottom/n_top)²   (8)

with w = (w_p, w_a, w_n) being weights, x_c the cluster center's position in the image plane, A the fitted rectangle's area, and n the number of cluster points. Next, we use the cluster's disparity to calculate the depth as described in (46). To increase robustness, we use the cluster centers to estimate the depth instead of directly projecting the corner points into 3D space. Having estimated the obstacle's depth, we approximate its size by rearranging the monocular formulation as

ω_est = Ẑ_C ⋅ ω̂ / f   (9)

#### Image to world projection

With the obtained obstacle depth and size, we now project the cluster's corner and center points into 3D space using the perspective projection model in homogeneous coordinates

λ_i [u_i, v_i, 1]ᵀ = K ᶜX_i   (10)

with K being the intrinsic camera matrix, λ_i the scale factor, ᶜX_i the Cartesian coordinates of each cluster point in the camera frame, and u_i and v_i the pixel coordinates of its projection in the image. The points ᶜX_i are then transformed into the world frame by applying

[ᵂX_i, 1]ᵀ = T_WB T_BC [ᶜX_i, 1]ᵀ   (11)

where T_WB and T_BC are transformation matrices representing the pose (rotation and translation) of the body frame with respect to the world frame and of the camera with respect to the body frame, respectively. Here, the center point's depth is additionally both increased and decreased by the obstacle's estimated size as

Ẑ_C,± = Ẑ_C ± ω_est   (12)
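Equations 10 to 12 chain together as a back-projection followed by two rigid transforms. A sketch for the center point under our own naming conventions (4x4 homogeneous transforms; the corner points would follow the same path):

```python
import numpy as np

def cluster_to_world(center_px, depth, size, K, T_WB, T_BC):
    """Back-project the cluster center (Eq. 10), transform it to the world
    frame (Eq. 11), and add the two depth-shifted points of Eq. 12.

    center_px: (u, v) pixel coordinates; depth, size: meters;
    K: 3x3 intrinsics; T_WB, T_BC: 4x4 homogeneous transforms
    (body-in-world and camera-in-body). Returns a (3, 3) array of points.
    """
    u, v = center_px
    pts = []
    for Z in (depth, depth - size, depth + size):
        Xc = Z * np.linalg.inv(K) @ np.array([u, v, 1.0])   # lambda * pixel ray
        Xw = (T_WB @ T_BC @ np.append(Xc, 1.0))[:3]          # camera -> world
        pts.append(Xw)
    return np.array(pts)
```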

This gives us a total of six points ᵂX_1:6 representing the obstacle.

#### Obstacle correspondence

To estimate an obstacle's velocity, we first have to determine whether a newly detected obstacle corresponds to a previously seen one and, if so, to which. This is done by searching, within a sphere around the newly detected obstacle's center, for the closest match among the predicted positions of the saved obstacles' centers.

#### Obstacle velocity estimation

Once the 3D position of an obstacle has been estimated, our algorithm requires some further processing to provide valuable information to the planning stage, for two reasons: (i) the event-based detections are sometimes noisy, especially at large distances; (ii) it is necessary to estimate the obstacle's velocity, which is used to determine the avoidance direction, as well as a scaling factor for the repulsive potential field (“Obstacle avoidance”). To do so, we use a Kalman filter (47), with the obstacle's position estimate as input for the measurement update. This introduces some time lag (typically below 0.3 ms) because the Kalman filter behaves as a low-pass filter, but the increased accuracy is, in this case, preferable. We assume a constant velocity model with the obstacle's position and velocity as state:

x_k = x_{k−1} + ẋ_{k−1} Δt   (13)
ẋ_k = ẋ_{k−1}   (14)
Δt = t_k − t_{k−1}   (15)

With this, we can formulate the linear motion model as

x̂ = [x, ẋ]ᵀ   (16)
x̂_k = A_k x̂_{k−1} + v_k   (17)
z_k = H x̂_k + w_k   (18)
A_k = [[I_3×3, Δt I_3×3], [0_3×3, I_3×3]]   (19)
H = [I_3×3, 0_3×3]   (20)

where v ∼ N(0, Q) is the normally distributed process noise and w ∼ N(0, R) is the normally distributed measurement noise. Using this, we construct the Kalman filter as follows (47):

x̂_p,k = A_k x̂_m,k−1   (21)
P_p,k = A_k P_m,k−1 A_kᵀ + Q   (22)
K_k = P_p,k Hᵀ (H P_p,k Hᵀ + R)⁻¹   (23)
x̂_m,k = x̂_p,k + K_k (z_k − H x̂_p,k)   (24)
P_m,k = (I − K_k H) P_p,k (I − K_k H)ᵀ + K_k R K_kᵀ   (25)

This has the added advantage that we receive a filtered estimate of the obstacle’s velocity without any further computations.
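One predict/update cycle of this constant-velocity filter can be sketched as follows. The structure mirrors Eqs. 16 to 25; the scalar noise covariances q and r are illustrative placeholders, not the paper's tuning:

```python
import numpy as np

def kalman_cv_step(x, P, z, dt, q=1e-2, r=1e-2):
    """One predict/update step of a constant-velocity Kalman filter.

    State x = [position(3), velocity(3)]; measurement z = position(3).
    Returns the updated state and covariance.
    """
    I3 = np.eye(3)
    A = np.block([[I3, dt * I3], [np.zeros((3, 3)), I3]])   # Eq. 19
    H = np.hstack([I3, np.zeros((3, 3))])                   # Eq. 20
    Q, R = q * np.eye(6), r * np.eye(3)
    # predict (Eqs. 21-22)
    x_p = A @ x
    P_p = A @ P @ A.T + Q
    # update (Eqs. 23-25; Joseph form keeps the covariance symmetric)
    Kk = P_p @ H.T @ np.linalg.inv(H @ P_p @ H.T + R)
    x_m = x_p + Kk @ (z - H @ x_p)
    IKH = np.eye(6) - Kk @ H
    P_m = IKH @ P_p @ IKH.T + Kk @ R @ Kk.T
    return x_m, P_m
```

Feeding a sequence of positions of an obstacle moving at constant speed, the velocity half of the state converges to that speed, which is exactly the quantity the avoidance stage consumes.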

### Obstacle avoidance

The primary objective of our avoidance framework is to guarantee low latency between sensing and actuation. The low latency on the perception side is guaranteed by the previously described event-based obstacle detection pipeline. For the overall system to be effective, however, it is also necessary to reduce the latency of the decision-making system responsible for driving the robot away from the detected obstacles. On the basis of this consideration, any optimization-based avoidance technique is not suited for our purpose, because numerical optimization would introduce latency due to non-negligible computation times. Rapid methods to compute motion primitives for aerial vehicles exist in the literature (48), but they present a number of drawbacks. First, it is necessary to sample both space and time to find a safe position for the robot and a suitable duration of the trajectory. In addition, continuity in the control inputs is not always guaranteed. Last, including this kind of method within existing motion generation frameworks is not always trivial, for multiple reasons: it is necessary to continuously switch between the main navigation algorithm, driving the robot toward its goal, and the avoidance algorithm, steering it away from obstacles; and it is not always easy to obtain a behavior that allows the robot to keep executing its mission (e.g., reach its goal) while simultaneously avoiding moving obstacles.

The artificial potential field method is a natural and simple solution to all the aforementioned issues. Given a closed-form expression of the attractive and repulsive fields, it is particularly simple to compute their gradients within negligible computation time to generate the resulting force responsible for letting the robot move. Considering an obstacle as the source of a repulsive field also allows us to not require any sampling in space and time because the resulting potential decides in which direction the robot should move at each moment in time. Last, the resulting potential can be used at different levels of abstraction to integrate the command derived from its gradient into existing motion generation algorithms, for example, as velocity or acceleration commands.

Using potential fields for pathfinding and obstacle avoidance has been extensively researched. This approach is, however, mostly used in static scenarios, whereas our system is designed for dynamic obstacles. The typical approach is to build a discretized map, where each element represents the potential combined from the attractive and repulsive parts. This map-building approach is feasible in 2D, but the map size and the computational power required to build and analyze it drastically increase in 3D, growing from O(n²) to O(n³). Instead of building a map, we represent the obstacles as a struct of features, resulting in a sparse and minimal data representation. The obstacles are represented as ellipsoids, with a potential that decays over time. We use the estimated obstacles' position and velocity to calculate their repulsive forces at each time step. In addition, given a reference target position, we compute the attractive force toward it. From the combined force, we then produce a velocity command that is sent to the controller. When no obstacles are present, the system's behavior is similar to the one generated by a high-level trajectory planner driving the robot toward the desired goal location.

#### Obstacle representation

We chose to represent the obstacles as ellipsoids, because they are a good representation of the expected Gaussian error of both position and size. In addition, they allow us to generate a continuous repulsive force when an obstacle is detected. Using the six coordinate points X̂1:6W in Eq. 12, we fit a minimal volume ellipsoid around them using the approach described in (49) and illustrated in fig. S9.

#### Repulsive potential field

Each obstacle produces a potential field U_r,i, from which we get the repulsive force F_r,i by calculating its gradient as F_r,i = −∇U_r,i. One formulation of the potential field was proposed by Khosla and Volpe (50), in turn a modification of the original artificial potential field definition by Khatib (40):

U_r,i(η_i) = { k_r,i ((η_0 − η_i)/η_0)^γ if 0 ≤ η_i ≤ η_0; 0 if η_i > η_0 }   (26)

with the resulting force

F_r,i = −∇U_r,i = { k_r,i (γ/η_0) ((η_0 − η_i)/η_0)^(γ−1) ∇η_i if 0 ≤ η_i ≤ η_0; 0 if η_i > η_0 }   (27)

where k_r, γ, and η_0 are design parameters and η_i is the distance to obstacle i. This kind of field, however, produces a gradient whose magnitude increases slowly as the distance to the obstacle decreases, as shown in fig. S10A. As a result, the repulsive force acting on the quadrotor reaches substantial values only when the obstacle is close, unless a high repulsive gain k_r is chosen, which might lead to unstable, aggressive behavior.

Therefore, we propose a new formulation of the repulsive force as

F_r,i = { k_r,i (1 − (1 − e^(−γη_i))/(1 − e^(−γη_0))) if 0 ≤ η_i ≤ η_0; 0 if η_i > η_0 }   (28)

as shown in fig. S10B. Here, η_i is the minimal distance to the ellipsoid surface of obstacle i. Through this formulation, the force's magnitude is bounded by a specified value k_r and increases much faster. This is desirable when evading fast-moving obstacles, as compared with the static ones for which the fields described in other works were developed, because the quadrotor's dynamics require it to start evading before an obstacle comes too close, as discussed in (38).
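The key property of Eq. 28, a magnitude that saturates at k_r at contact and vanishes at the influence radius η_0, is easy to verify numerically (function name is ours):

```python
import numpy as np

def repulsive_magnitude(eta, k_r, gamma, eta0):
    """Repulsive force magnitude of Eq. 28: bounded by k_r, decaying to zero
    at the influence distance eta0, and rising quickly as eta shrinks."""
    if eta > eta0:
        return 0.0
    return k_r * (1.0 - (1 - np.exp(-gamma * eta)) / (1 - np.exp(-gamma * eta0)))
```

At η = 0 the magnitude equals k_r exactly; at η = η_0 it is zero, so the force is continuous at the boundary of the influence region.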

Conventionally, the gradient of the distance toward the obstacle, ∇η_i, determines the direction of the repulsive force F_r,i. It points along the direction in which the obstacle's distance grows fastest, i.e., from the closest point on the obstacle's ellipsoid surface toward the quadrotor's center, so that an obstacle pushes the quadrotor directly away from it. We, however, apply a different avoidance strategy: we use the obstacle's predicted velocity ẋ_i together with the distance gradient ∇η_i and calculate the normalized cross product

θ_i = (∇η_i × ẋ_i) / ‖∇η_i × ẋ_i‖   (29)

With the new avoidance direction θ_i,n, the repulsive force F_r,i becomes

F_r,i = −∇U_r,i = { k_r,i (1 − (1 − e^(−γη_i))/(1 − e^(−γη_0))) θ_i,n if 0 ≤ η_i ≤ η_0; 0 if η_i > η_0 }   (31)

This formulation of the potential field yields a behavior such that, if the quadrotor is moving toward the goal location, it flies around any detected obstacle if the goal position is behind it, whereas if it is hovering, it moves in a direction orthogonal to the obstacle's velocity. Last, we include the magnitude of the obstacle's estimated velocity ẋ_i in the repulsive force F_r,i as

F_r,i = −∇U_r,i = { ‖ẋ_i‖ k_r,i (1 − (1 − e^(−γη_i))/(1 − e^(−γη_0))) θ_i,n if 0 ≤ η_i ≤ η_0; 0 if η_i > η_0 }   (32)

By doing so, faster obstacles produce a larger repulsive force, and the quadrotor will therefore perform a more aggressive avoidance maneuver. This is desirable because the faster an obstacle, the lower the avoidance time, which therefore implies the necessity for a quick evasive maneuver.

In addition, we ensure that the z component of the repulsive force is always positive, namely, Fr,i,z = ∣ Fr,i,z∣, as quadrotors with sufficiently large thrust-to-weight ratios are typically capable of producing larger accelerations upward than downward.
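Putting the direction (Eq. 29), the velocity scaling (Eq. 32), and the upward z rule together gives a per-obstacle force of the following shape; this is an illustrative sketch with our own function name, and it assumes ∇η_i and ẋ_i are not parallel (the degenerate case would need separate handling):

```python
import numpy as np

def avoidance_force(grad_eta, obs_vel, eta, k_r, gamma, eta0):
    """Repulsive force on the quadrotor from one obstacle (Eqs. 29 and 32).

    grad_eta: gradient of the distance to the obstacle (3-vector);
    obs_vel: obstacle's estimated velocity; eta: distance to its ellipsoid.
    """
    if eta > eta0:
        return np.zeros(3)
    theta = np.cross(grad_eta, obs_vel)      # avoidance direction (Eq. 29)
    theta = theta / np.linalg.norm(theta)    # normalized; assumes non-parallel inputs
    mag = (np.linalg.norm(obs_vel) * k_r     # velocity-scaled magnitude (Eq. 32)
           * (1 - (1 - np.exp(-gamma * eta)) / (1 - np.exp(-gamma * eta0))))
    F = mag * theta
    F[2] = abs(F[2])                         # always push upward
    return F
```

The cross product steers the evasive motion orthogonally to both the line of approach and the obstacle's velocity, and faster obstacles automatically produce stronger, hence more aggressive, reactions.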

The repulsive constant k_r,i is, in our case, dynamic and decays with time as

k_r,i(t) = k_r,0 e^(−λ_decay (t − t_detection,i))   (33)

where k_r,0 is the initial repulsive constant, λ_decay is a factor regulating the decay rate, t is the current time, and t_detection,i is the last time the specific obstacle was detected. Through this decay, obstacles are kept in case of a temporary occlusion or when they leave the camera's field of view; their effect on the quadrotor, however, decreases as the time since their last detection increases. If k_r,i falls below a given threshold, the obstacle is removed.

Last, the parameter η_i represents the minimal distance between the quadrotor's center and the obstacle's ellipsoid surface, minus the quadrotor's radius. The computation of the minimal distance between a point and an ellipsoid's surface is described in (51). The total repulsive force is then the sum over all individual obstacles:

F_r,total = Σ_i F_r,i   (34)

#### Attractive potential field

The goal of the attractive potential field is to allow the vehicle to reach the desired target position and hover there until the user provides a new reference. In this work, we provide a simple formulation for the attractive potential that assumes that no static obstacles are present in the scene, i.e., the straight-line path between the robot and the goal is collision free. However, one can easily replace this component of our avoidance scheme with more sophisticated methods to generate commands that drive the vehicle toward its goal. These can be based, for example, on potential field-based techniques dealing with static obstacles and local minima, which is out of the scope of this work, or completely different methods able to generate velocity or acceleration commands [for example, see (52)].

For the attractive potential, we want the system not only to produce the same velocity toward a goal as a high-level planner would if no obstacle were present but also to produce stable dynamics close to the goal. Therefore, we chose the hybrid approach of a conical and polynomial potential field (53):

U_a = { (k_a / ((γ_a + 1) e_0^γ_a)) ‖e‖^(γ_a+1) if ‖e‖ < e_0; k_a ‖e‖ if ‖e‖ ≥ e_0 }   (35)

This function is differentiable at e_0, the crossover distance between the two potential fields, with e being the error between the goal's and quadrotor's positions, k_a the attractive constant, and γ_a a design parameter. Taking its gradient, we get the attractive force

F_a = −∇U_a = { −k_a (e/‖e‖) (‖e‖/e_0)^γ_a if ‖e‖ < e_0; −k_a e/‖e‖ if ‖e‖ ≥ e_0 }   (36)

which is continuous in e. The constant k_a regulates the output velocity ẋ (see “Output velocity”): by setting k_a = ‖v_des‖, the quadrotor's velocity magnitude is ‖ẋ‖ = ‖v_des‖ while ‖e‖ ≥ e_0 and no obstacles are present.
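Equation 36 can be sketched directly; the function name is ours, and the zero-error branch is our own guard against division by zero:

```python
import numpy as np

def attractive_force(e, k_a, gamma_a, e0):
    """Attractive force of Eq. 36: constant magnitude k_a far from the goal
    (conical part), polynomially decaying magnitude inside the radius e0."""
    d = np.linalg.norm(e)
    if d == 0:
        return np.zeros_like(e)           # at the goal: no force
    if d >= e0:
        return -k_a * e / d               # conical: magnitude k_a
    return -k_a * (e / d) * (d / e0) ** gamma_a   # polynomial near the goal
```

At ‖e‖ = e_0 both branches give magnitude k_a, which is the continuity property the hybrid formulation is chosen for.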

If we instead relied solely on the conical potential field, the quadrotor would oscillate around its goal position, because the resulting force's magnitude would be k_a regardless of the error. The attractive force's magnitude is shown in fig. S11. If γ_a = 0, ‖F_a‖ is identical to that of the conical part, producing a constant-magnitude attractive force, whereas for γ_a = 1, the magnitude goes linearly to 0. With increasing γ_a, the magnitude drops faster, with an increasingly large region around ‖e‖ = 0 where it is close to 0.

#### Output velocity

The velocity is the output of our system and is given to the controller to derive the required total thrust and body rates. From the total repulsive force F_r,total and attractive force F_a, we get the total virtual force acting on the quadrotor as F_total = F_r,total + F_a. With this force, we have three possible design choices to calculate the quadrotor's desired velocity ẋ:

ẍ = F_total / m   (37)
ẍ = F_total   (38)
ẋ = F_total   (39)

where m denotes the quadrotor's mass.

Both Eqs. 37 and 38 produce first-order dynamics, whereas Eq. 39 directly produces the velocity output. Introducing further dynamics into the system results in additional delays, which is undesirable because we want our system to be as responsive as possible. We therefore chose Eq. 39, because it produces the fastest response.

### Experimental platform

#### Hardware

To validate our approach with real-world experiments, we designed a custom quadrotor platform, shown in fig. S3. The main frame is a 6″ Lumenier QAV-RXL, and, at the end of each arm, we mounted a Cobra CM2208-2000 brushless motor equipped with a 6″, three-bladed propeller. The vehicle is equipped with two onboard computers: (i) a Qualcomm Snapdragon Flight, used for monocular, vision-based state estimation using the provided Machine Vision SDK; and (ii) an NVIDIA Jetson TX2, accompanied by an AUVIDEA J90 carrier board, running all the rest of our software stack. In this regard, the output of our framework is a low-level control command comprising the desired collective thrust and angular rates the vehicle should achieve to fly. These commands are sent to a Lumenier F4 AIO Flight Controller, which then produces single-rotor commands that are fed to DYS Aria 35a motor controllers.

The quadcopter is equipped with two front-facing Insightness SEES1 cameras, in a vertical stereo setup, connected via USB to the Jetson TX2. The SEES1 sensor provides both frames and events and has a QVGA resolution (320 pixels by 240 pixels). To have a sufficiently high angular resolution, each camera has a lens providing a horizontal field of view of about 80°. Such a field of view is relatively small for tasks such as obstacle avoidance, where a large field of view is preferable to increase the area that the robot can sense. The choice of adopting a vertical stereo setup rather than a more common horizontal setup was driven by the necessity of maximizing the overlap between the fields of view of the two cameras while guaranteeing a sufficiently large baseline (in our case, 15 cm).

In addition to the previous sensing suite, we mounted a TeraRanger EVO 60-m distance sensor looking downward. The goal of this additional sensor is to constantly monitor the height of the vehicle to detect whether there is any drift in the state estimate provided by the VIO pipeline running on the Snapdragon Flight. Whenever we detect a discrepancy beyond a manually defined threshold, the quadrotor automatically executes an emergency landing maneuver.

#### Software

We developed the software stack running on our quadrotor in C++ using Robot Operating System (ROS) for communication among different modules. To reduce latency, we implemented the obstacle detection and avoidance algorithms within the same ROS module, so that no message exchange is necessary between the camera drivers and the code responsible for detecting moving obstacles, as well as between the latter and the planning stage. The output of this module is a velocity command, which is then fed to the position controller proposed in (54) and available as open source (http://rpg.ifi.uzh.ch/rpg_quadrotor_control.html). The low-level controller, responsible for tracking desired body rates and collective thrust, is the default one provided by the Lumenier F4 AIO Flight Controller, which then communicates with the electronic speed controllers (ESCs) to generate the single rotor thrusts.

In our outdoor experiments, the state of the vehicle is estimated using the VIO pipeline provided by the Qualcomm Machine Vision SDK (https://developer.qualcomm.com/software/machine-vision-sdk), which, however, only provides new estimates at camera rate (up to 30 Hz). This is not sufficient to control our vehicle with low latency and would represent a bottleneck in the entire pipeline. To obtain a higher-rate state estimate, we feed the output of the VIO into an extended Kalman filter (55), together with IMU measurements, to obtain information about the position, orientation, and velocity of the vehicle at 250 Hz.

## SUPPLEMENTARY MATERIALS

robotics.sciencemag.org/cgi/content/full/5/40/eaaz9712/DC1

Fig. S1. Monodimensional example to explain the working principle of event-based detection of moving obstacles.

Fig. S2. Time statistics of the events belonging to static and dynamic regions.

Fig. S3. The quadrotor platform we used in our outdoor experiments.

Fig. S4. Ego-motion compensation computation time as function of the number of events.

Fig. S5. Clustering computation time as function of the pixels count.

Fig. S6. Detection of objects having different sizes and shapes.

Fig. S7. Detection of multiple objects simultaneously.

Fig. S8. Sequence of detection.

Fig. S9. Obstacle ellipsoid.

Fig. S10. Repulsive potential.

Fig. S11. Attractive potential.

Movie S1. Outdoor dynamic experiments.

Movie S2. Explanation of the working principle of the event-based detection algorithm.

References (56, 57)

## REFERENCES AND NOTES

Acknowledgments: We thank H. Rebecq and G. Gallego for helpful discussions. Funding: This work was supported by the SNSF-ERC Starting Grant and the Swiss National Science Foundation through the National Center of Competence in Research (NCCR) Robotics. Author contributions: The project ideas were conceived by D.F., K.K., and D.S. The experiments were designed and performed by D.F. and K.K. The paper was written by D.F., K.K., and D.S. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper or the Supplementary Materials.