Reliable autonomy isn’t about finding the single best sensor. Cameras, LiDAR, and radar each see the world differently, and each has its strengths and weaknesses. The real challenge is getting these different sensors to agree in real time, fusing what they see into one coherent picture. That’s how we achieve safety and productivity.
Ian MacIsaac is Global Technology Manager at Volvo Autonomous Solutions.
We humans have at least five senses: sight, hearing, smell, taste, and touch. Through these we understand the environment around us, and each adds unique details to the picture we paint in our minds. There are two aspects of this that are particularly interesting, and which prove that we aren’t all that different from an autonomous truck.
First, our brain combines sensory information from these sources into one picture. For example, walking down a busy sidewalk, you not only see people, but you hear the murmur, you smell the exhaust from a passing city bus, you feel the cold wind on your face, and you probably still taste your morning coffee. In essence, through multiple senses you can paint a richer picture of the environment you are moving through.
Second, having multiple sources of information is better. Few people would walk down the street blindfolded, but many wear headphones, canceling out one key sense while still navigating the busy sidewalk effectively. Similarly, when approaching a crossing, you might not see an oncoming bus, but you’ll most likely hear it and stop. In essence, sometimes we rely on one sense over another.
And it’s the same for autonomous trucks. They are fitted with multiple sensors to paint as rich a picture as possible of the environment around them, and they rely on different sensors in different moments. Our autonomous trucks rely on three sensors for obstacle perception: cameras, LiDAR, and radar. Just like our senses, they each contribute in different ways.
Cameras contribute rich color, texture and semantic classification, such as distinguishing between a truck and a wheel loader. But they struggle in darkness, in glare and in low-contrast situations. LiDARs measure 3D shape and depth directly and work well at night, but can be affected by dust, fog or snow. Radars see through weather and “optical pollution” and measure motion well, though with lower spatial resolution and occasional ghost reflections in metallic surroundings. Combining the data from these sensors allows us to overcome each sensor’s weaknesses. The magic behind it all is called sensor fusion.
Sensor fusion is the process of combining data from multiple sensors to create a single, coherent understanding of the environment. The result is a more robust view of the world than any single sensor can offer.
Successful sensor fusion combines data that refers to the same point in time and space. That starts by aligning the data.
Let’s begin with space. Each sensor measures the world from its own location on the truck. A LiDAR on the roof, a radar on the bumper, and a camera at the front – all report the position of objects relative to themselves. So, a tree that is 10 meters away from the front camera might be 11 meters from the roof-mounted LiDAR, at a different angle. Although both are correct in their own sensor reference frames, we need to combine them into one common vehicle reference frame.
Through space alignment we convert every measurement into one shared vehicle coordinate frame, so that every sensor on the truck refers to the same reference point on the truck. As a result, even if one sensor is further away from the tree than another, both report the tree’s distance from that same point on the actual truck. This allows us to combine the information from all the different sensors into one accurate picture of where things are relative to the truck.
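To make this concrete, here is a minimal sketch in Python of what space alignment can look like: each sensor’s measurement is passed through a rigid-body transform, built from that sensor’s mounting position and orientation, into one shared vehicle frame. The mounting poses, coordinates and names below are illustrative assumptions for this article, not real calibration values or our production code.

```python
import numpy as np

def make_extrinsic(yaw_deg: float, x: float, y: float, z: float) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a sensor frame to the vehicle
    frame from the sensor's yaw and mounting position (roll and pitch are
    omitted to keep the sketch short)."""
    yaw = np.radians(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(yaw), -np.sin(yaw), 0.0],
                 [np.sin(yaw),  np.cos(yaw), 0.0],
                 [0.0,          0.0,         1.0]]
    T[:3, 3] = [x, y, z]
    return T

# Illustrative mounting poses, not real calibration: a camera 1.5 m forward and
# 1.8 m up, a roof LiDAR 3.2 m up, both relative to the vehicle's reference point.
T_vehicle_from_camera = make_extrinsic(yaw_deg=0.0, x=1.5, y=0.0, z=1.8)
T_vehicle_from_lidar = make_extrinsic(yaw_deg=0.0, x=0.0, y=0.0, z=3.2)

def to_vehicle_frame(point_in_sensor: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Convert one [x, y, z] point from a sensor frame into the vehicle frame."""
    p = np.append(point_in_sensor, 1.0)   # homogeneous coordinates
    return (T @ p)[:3]

# The same tree, reported by each sensor in its own frame...
tree_seen_by_camera = np.array([10.0, 0.0, -1.3])
tree_seen_by_lidar = np.array([11.5, 0.0, -2.7])

# ...lands on the same point once both are expressed in the vehicle frame.
print(to_vehicle_frame(tree_seen_by_camera, T_vehicle_from_camera))  # [11.5  0.   0.5]
print(to_vehicle_frame(tree_seen_by_lidar, T_vehicle_from_lidar))    # [11.5  0.   0.5]
```

In a real system the transform includes full 3D rotation and comes from careful calibration of every sensor, but the principle is the same: every measurement ends up expressed relative to one agreed point on the truck.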
Once space has been aligned, we need to make sure that time is aligned as well. Fusion only works if measurements refer to the same instant. Sensors do not always capture or scan at the same time. Cameras, LiDAR and radar run at different frame rates and the data will always arrive a little bit after the moment of capture. In essence, there is a timing difference between observations that we need to account for before combining the data.
If you fuse a camera frame from 12:00:00.050 with a LiDAR scan from 12:00:00.120, you’re mixing “now” with 70 milliseconds later. This might seem like an insignificant difference, but it can have a significant effect. The truck and obstacles are moving as the sensors collect data, so a 70-millisecond mismatch can shift an object’s apparent position by tens of centimeters (at a closing speed of 10 meters per second, those 70 milliseconds correspond to 0.7 meters). And the faster the truck drives, the bigger the difference. To account for this, we align sensor timestamps, down to the millisecond, before fusion.
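As a simple illustration of the idea (not our production code), the sketch below propagates a detection to a common reference time under a constant-velocity assumption. The timestamps and the 10 m/s closing speed are the illustrative numbers from above.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    t: float    # timestamp in seconds (relative to 12:00:00)
    x: float    # position along the direction of travel, in meters
    vx: float   # estimated velocity along that direction, in m/s

def align_to(det: Detection, t_ref: float) -> Detection:
    """Propagate a detection forward (or backward) to a common reference time
    using a constant-velocity assumption, so measurements taken at slightly
    different instants can be fused as if they were simultaneous."""
    dt = t_ref - det.t
    return Detection(t=t_ref, x=det.x + det.vx * dt, vx=det.vx)

# A camera frame at 12:00:00.050 and a LiDAR scan at 12:00:00.120,
# both observing an obstacle closing at 10 m/s.
camera_det = Detection(t=0.050, x=42.0, vx=-10.0)
lidar_det = Detection(t=0.120, x=41.3, vx=-10.0)

# Align the camera detection to the LiDAR timestamp before fusing.
camera_aligned = align_to(camera_det, t_ref=0.120)
print(camera_aligned.x, lidar_det.x)   # both ~41.3 m: the 0.7 m gap is gone
```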
By aligning the data in time and space to common reference points, we have created a single consistent view across the sensors, allowing for accurate fusion.
Once time and space are aligned, we move from raw signals to meaning. The computer identifies features in images and point clouds, groups them into objects, and tracks those objects across frames. This last part is critical. Real autonomy does not react to single snapshots. It builds confidence over time. At 10 to 20 frames per second, the virtual driver treats any new signal as a possible obstacle. But a single blip on one frame is not enough for the truck to slam on the brakes. Instead, detections that persist over several frames – and across multiple sensors – become what the virtual driver acts upon. This prevents a single camera glare or a dust particle on the LiDAR from affecting the truck’s decision.
By building confidence, the virtual driver can navigate safely even if sensors disagree. A LiDAR can light up a dust cloud; a radar may see through it. The fusion logic recognizes these patterns. When sensors disagree, the system looks for what stays true over several moments and what different sensors confirm. A one-frame blip gets down-weighted, while signals that persist across frames get more trust. Hence, the goal isn’t to crown one sensor king; it’s to combine confidence over time so brief false signals don’t drive decisions.
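A toy version of this logic, in Python, purely to illustrate the idea (the class, gains and thresholds are assumptions for this article, not the virtual driver’s actual parameters): confidence grows when detections repeat across frames and when more sensors agree, and it decays when support disappears, so a one-frame blip never reaches the action threshold.

```python
class TrackConfidence:
    """Toy confidence accumulator: evidence that repeats across frames (and
    across sensors) builds trust; a one-frame blip fades away before it ever
    crosses the action threshold."""

    def __init__(self, gain: float = 0.3, decay: float = 0.6, threshold: float = 0.8):
        self.confidence = 0.0
        self.gain = gain            # how much one supporting detection adds
        self.decay = decay          # how quickly unsupported confidence fades
        self.threshold = threshold  # confidence required before the truck reacts

    def update(self, sensors_detecting: int) -> bool:
        """Feed in how many sensors saw the object this frame; return whether
        the object is now trusted enough to act on."""
        if sensors_detecting > 0:
            # Each corroborating sensor pushes confidence toward 1.0.
            boost = self.gain * sensors_detecting * (1.0 - self.confidence)
            self.confidence = min(1.0, self.confidence + boost)
        else:
            # No support this frame: let the confidence fade.
            self.confidence *= self.decay
        return self.confidence >= self.threshold

# A one-frame camera glare never becomes actionable...
glare = TrackConfidence()
print([glare.update(n) for n in [1, 0, 0, 0]])     # [False, False, False, False]

# ...while an object confirmed by two sensors frame after frame does.
vehicle = TrackConfidence()
print([vehicle.update(n) for n in [2, 2, 2, 2]])   # [False, True, True, True]
```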
Before we move on, let’s summarize a bit. We have now made sure that all sensors report objects’ distances relative to the same reference point on the truck. The time of each observation is continuously matched between sensors to make sure they all refer to the same instant. And the virtual driver processes enormous volumes of sensor data every second to ensure that only trusted information is acted upon.
There are two practical ways of fusing the sensory data: early and late fusion. Think of it as preparing a dish. Early fusion mixes the raw, time- and space-aligned data from the sensors into one shared view before it decides what’s in it; it blends the ingredients before cooking. Late fusion lets each sensor do its own recognition first and then combines those higher-level results into a single picture, weighing them by confidence and conditions; it cooks each ingredient separately and then plates the dish together.
These two ways of fusing data have their own strengths and weaknesses. Early fusion preserves the richest raw sensor data for maximum accuracy but demands high computational power and precise synchronization, while late fusion simplifies processing and reduces compute load, though it sacrifices some detail and flexibility. In practice, an autonomous driving system can do both: for example, early-fuse the truck’s LiDARs into one point cloud, then late-fuse that LiDAR result with the camera and radar object lists. This cycle runs many times per second so that the truck acts on one agreed picture.
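The sketch below illustrates the distinction with assumed data structures and confidence values (again, an illustration rather than our production pipeline): early fusion simply merges raw, aligned point clouds before any detection runs, while late fusion merges per-sensor object estimates weighted by their confidence.

```python
import numpy as np

def early_fuse_lidars(clouds_in_vehicle_frame):
    """Early fusion: concatenate several LiDARs' points (already time- and
    space-aligned into the vehicle frame) into one cloud before detection."""
    return np.vstack(clouds_in_vehicle_frame)

def late_fuse_objects(detections):
    """Late fusion: merge per-sensor detections of the same object into one
    estimate, weighting each reported position by that sensor's confidence."""
    weights = np.array([d["confidence"] for d in detections])
    positions = np.array([d["position"] for d in detections])
    fused = (weights[:, None] * positions).sum(axis=0) / weights.sum()
    return {"position": fused, "confidence": float(weights.max())}

# Hybrid pipeline: early-fuse the LiDARs, then late-fuse with camera and radar.
merged_cloud = early_fuse_lidars([np.random.rand(100, 3), np.random.rand(80, 3)])

lidar_object = {"position": [25.1, 0.3, 0.0], "confidence": 0.9}   # from merged_cloud
camera_object = {"position": [24.8, 0.4, 0.0], "confidence": 0.7}
radar_object = {"position": [25.3, 0.2, 0.0], "confidence": 0.6}

print(late_fuse_objects([lidar_object, camera_object, radar_object]))
```

The detection step itself (finding objects in the merged cloud) is left out here; the point is only where in the pipeline the data gets combined.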
Hopefully, you now see why it’s useful, and safety-critical, to have multiple sensors working together to give our virtual driver the best and richest picture of the world around it. Even if we had the best and most sophisticated LiDAR in the world, relying solely on one type of sensor could never account for everything. As the classic saying goes, “teamwork makes the dream work”, and sensor fusion is the glue that keeps it all together.
The great news is that sensors continue to improve, becoming better at handling their weaknesses. Cameras are improving in low light and glare. LiDAR is gaining resolution and, in some cases, color and velocity information. Radar is moving to 4D imaging, with output that can look almost LiDAR-like. But although the sensors are converging in capability, it’s their distinct physics that creates the advantage. Different failure modes mean genuine redundancy and a stronger safety case.
Smarter fusion, not simply more sensors, is how reliable autonomy is built. Our approach has served us well in some of the harshest operating environments in the world. Through multiple sensor types and the fusing of their data, we turn multiple imperfect views into one dependable picture the truck can act on.