Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles

1Fraunhofer IVI, 2Technical University of Munich, 3Technische Hochschule Ingolstadt

Collaborative semantic occupancy prediction leverages collaborative perception and learning for 3D occupancy prediction and semantic segmentation. By sharing features among connected automated vehicles (CAVs), this approach enables a deeper understanding of the 3D road environment than the ground truth (GT) captured by the multi-camera system of the ego vehicle alone.

Abstract

Collaborative perception in automated vehicles leverages the exchange of information between agents to improve perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. However, these approaches fall short of offering a comprehensive 3D environmental prediction. To bridge this gap, we introduce the first method for collaborative 3D semantic occupancy prediction. Specifically, it improves local 3D semantic occupancy predictions through the hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles. Additionally, since no collaborative perception dataset designed for semantic occupancy prediction exists, we augment a current collaborative perception dataset with 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions surpass the results of single vehicles by over 30%, and (ii) models based on semantic occupancy outperform state-of-the-art collaborative 3D detection techniques in subsequent perception applications, showing greater accuracy and richer semantic awareness of road environments.
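As a concrete illustration of the label-augmentation step mentioned above, below is a minimal Python sketch of how labeled points can be voxelized into a dense semantic occupancy grid. The function name, grid extents, voxel size, and majority-vote rule are illustrative assumptions, not the actual Semantic-OPV2V generation pipeline.

    # Hedged sketch: build voxel-level semantic occupancy labels from a
    # labeled point cloud. All parameters below are assumptions.
    import numpy as np

    def voxelize_semantic_labels(points, labels, voxel_size=0.4,
                                 pc_range=(-40.0, -40.0, -3.0, 40.0, 40.0, 1.0),
                                 free_label=0):
        # points: (N, 3) xyz coordinates; labels: (N,) integer class ids.
        # Each occupied voxel receives the majority class of its points.
        x_min, y_min, z_min, x_max, y_max, z_max = pc_range
        extent = np.array([x_max - x_min, y_max - y_min, z_max - z_min])
        dims = np.round(extent / voxel_size).astype(int)
        grid = np.full(dims, free_label, dtype=np.int64)

        # Keep only points inside the grid extents.
        mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
                (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
                (points[:, 2] >= z_min) & (points[:, 2] < z_max))
        idx = ((points[mask] - np.array([x_min, y_min, z_min]))
               / voxel_size).astype(int)

        # Majority vote per voxel via flattened indices.
        flat = np.ravel_multi_index(idx.T, dims)
        kept_labels = labels[mask]
        for f in np.unique(flat):
            grid.flat[f] = np.bincount(kept_labels[flat == f]).argmax()
        return grid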

Framework

The CoHFF framework consists of four key modules: (1) an Occupancy Prediction Task Net for occupancy feature extraction; (2) a Semantic Segmentation Task Net that creates semantic plane-based embeddings; (3) V2X Feature Fusion, which merges CAV features via deformable self-attention; and (4) Task Feature Fusion, which unites all task features to enhance semantic occupancy prediction.
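To make the data flow between these four modules concrete, here is a minimal PyTorch sketch of the pipeline. It is a sketch under assumptions, not the released implementation: the task nets are reduced to placeholder layers, standard multi-head attention stands in for the deformable self-attention used in V2X Feature Fusion, and all names, dimensions, and tensor shapes are illustrative.

    import torch
    import torch.nn as nn

    class V2XFeatureFusion(nn.Module):
        # Fuses ego features with features shared by other CAVs. The paper
        # uses deformable self-attention over compressed features; plain
        # multi-head attention is substituted here to stay self-contained.
        def __init__(self, dim=128, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, ego_feat, cav_feats):
            # ego_feat: (B, N, C) ego tokens; cav_feats: (B, M, C) shared tokens
            fused, _ = self.attn(ego_feat, cav_feats, cav_feats)
            return ego_feat + fused  # residual keeps local ego information

    class CoHFFSketch(nn.Module):
        def __init__(self, dim=128, num_classes=12):
            super().__init__()
            self.occ_net = nn.Linear(dim, dim)     # (1) Occupancy Prediction Task Net (placeholder)
            self.sem_net = nn.Linear(dim, dim)     # (2) Semantic Segmentation Task Net (placeholder)
            self.v2x_fusion = V2XFeatureFusion(dim)    # (3) V2X Feature Fusion across CAVs
            self.task_fusion = nn.Linear(2 * dim, dim) # (4) Task Feature Fusion
            self.head = nn.Linear(dim, num_classes)

        def forward(self, ego_tokens, cav_tokens):
            occ = self.occ_net(ego_tokens)  # occupancy task features
            sem = self.v2x_fusion(self.sem_net(ego_tokens), cav_tokens)  # shared semantics
            fused = self.task_fusion(torch.cat([occ, sem], dim=-1))      # hybrid task fusion
            return self.head(fused)         # per-voxel semantic occupancy logits

Keeping occupancy and semantics in separate task branches and merging them only after the V2X exchange mirrors the hybrid fusion idea: geometry is refined locally, while semantics benefit from cross-vehicle context.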

Visualization


The video demonstrates our method, CoHFF (Collaborative Hybrid Feature Fusion), applied to Semantic-OPV2V for vision-based collaborative semantic occupancy prediction. It features a side-by-side comparison between the CoHFF-generated predictions, the ground truth within the field of view (FoV) of the ego vehicle, and the ground truth across the collaborative FoV of the CAVs. The input images, captured by four cameras mounted on the ego vehicle, are shown at the bottom.

Visual Analysis

Illustration of collaborative semantic occupancy prediction from multiple perspectives, compared to the ground truth in the ego vehicle's FoV and the collaborative FoV across CAVs. The visualization highlights the improved detection of objects that are occluded in the ego vehicle's FoV in collaborative settings, such as the vehicle with ID 6.

Acknowledgements

Our project makes extensive use of the toolchains in the OpenCDA ecosystem, including OpenCOOD and the OpenCDA simulation tools, for developing the Semantic-OPV2V dataset.

Our project draws inspiration from many excellent prior works in collaborative perception, also known as cooperative perception, e.g., DiscoNet (NeurIPS 2021), Where2comm (NeurIPS 2022), V2X-ViT (ECCV 2022), CoBEVT (CoRL 2022), CoCa3D (CVPR 2023), and many others.

Additionally, our project benefits from many insightful prior works in vision-based 3D semantic occupancy prediction, also known as semantic scene completion, e.g., MonoScene (CVPR 2022), TPVFormer (CVPR 2023), VoxFormer (CVPR 2023), FB-OCC (CVPR 2023), and many others.

BibTeX

@article{song2024collaborative,
  title={Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles},
  author={Song, Rui and Liang, Chenwei and Cao, Hu and Yan, Zhiran and Zimmer, Walter and Gross, Markus and Festag, Andreas and Knoll, Alois},
  journal={arXiv preprint arXiv:2402.07635},
  year={2024}
}