College of Intelligence Science and Technology, National University of Defense Technology
*denotes corresponding author
Many existing approaches classify moving objects on a point-by-point basis, which often results in incomplete segmentation of moving objects. We therefore propose a method that operates at a higher, instance level and determines which instances in the scene are currently moving.
Identifying moving objects is a crucial capability for autonomous navigation, consistent map generation, and future trajectory prediction of objects. In this paper, we propose a novel network that addresses the challenge of segmenting moving objects in 3D LiDAR scans. Our approach not only predicts point-wise moving labels but also detects instance information of main traffic participants. Such a design helps determine which instances are actually moving and which ones are temporarily static in the current scene. Our method takes a sequence of point clouds as input and quantizes them into 4D voxels. We use 4D sparse convolutions to extract motion features from the 4D voxels and inject them into the current scan. Then, we extract spatio-temporal features from the current scan for instance detection and feature fusion. Finally, we design an upsample fusion module that outputs point-wise labels by fusing the spatio-temporal features with the predicted instance information. We evaluated our approach on the LiDAR-MOS benchmark based on SemanticKITTI and achieved better moving object segmentation performance than state-of-the-art methods, demonstrating the effectiveness of integrating instance information for moving object segmentation. Furthermore, our method shows superior performance on the Apollo dataset with a model pre-trained on SemanticKITTI, indicating that it generalizes well across different scenes.
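To illustrate the 4D quantization step, the following is a minimal sketch in NumPy; the voxel size and the mean-pooling aggregation of per-voxel point features are illustrative assumptions, not our exact settings.

    import numpy as np

    def voxelize_4d(scans, voxel_size=0.1):
        """Quantize a sequence of LiDAR scans into sparse 4D voxels.

        scans: list of (N_i, 3) point arrays, ordered from past to current.
        Returns unique (x, y, z, t) voxel coordinates and per-voxel mean
        point features (a simple aggregation choice, assumed here).
        """
        coords, feats = [], []
        for t, pts in enumerate(scans):
            ijk = np.floor(pts / voxel_size).astype(np.int32)  # spatial index
            ijkt = np.concatenate(
                [ijk, np.full((len(pts), 1), t, dtype=np.int32)], axis=1)
            coords.append(ijkt)
            feats.append(pts)
        coords = np.concatenate(coords)
        feats = np.concatenate(feats)
        # Deduplicate voxels and average the points falling into each one.
        uniq, inv = np.unique(coords, axis=0, return_inverse=True)
        mean = np.zeros((len(uniq), 3))
        np.add.at(mean, inv, feats)
        counts = np.bincount(inv, minlength=len(uniq))[:, None]
        return uniq, mean / counts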
InsMOS Network Architecture. Our network is composed of three main components: MotionNet, an instance detection module, and an upsample fusion module. MotionNet extracts motion features from the input 4D voxels. We then concatenate the motion features with the original point features and use the instance detection module to extract spatio-temporal features from the current scan for instance detection and feature fusion. Finally, the upsample fusion module achieves point-wise MOS by integrating the spatio-temporal and instance information.
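The overall data flow can be summarized in a schematic PyTorch-style sketch; the module names follow the description above, but their internals are placeholders and do not reproduce the released implementation.

    import torch
    import torch.nn as nn

    class InsMOSSketch(nn.Module):
        """Schematic of the three-stage pipeline described above."""
        def __init__(self, motion_net, instance_head, upsample_fusion):
            super().__init__()
            self.motion_net = motion_net        # 4D sparse convs over voxels
            self.instance_head = instance_head  # spatio-temporal backbone + detector
            self.upsample_fusion = upsample_fusion

        def forward(self, voxels_4d, current_scan_feat):
            motion_feat = self.motion_net(voxels_4d)
            # Inject motion features into the current scan's point features.
            point_feat = torch.cat([current_scan_feat, motion_feat], dim=-1)
            st_feat, instances = self.instance_head(point_feat)
            # Fuse spatio-temporal features with predicted instances to
            # produce point-wise moving/static labels.
            return self.upsample_fusion(st_feat, instances)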
We train our model on both the SemanticKITTI-MOS dataset and the KITTI-road dataset, and evaluate it on the SemanticKITTI-MOS benchmark. Besides, to overcome the imbalanced distribution of moving and static objects in the SemanticKITTI-MOS dataset, we follow MotionSeg3D and adopt its training and validation splits of KITTI-road, with sequences 30-34 and 40 for training and 35-39 and 41 for validation.
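For concreteness, the sequence splits can be written as a small configuration sketch; the KITTI-road sequences follow the MotionSeg3D split stated above, the SemanticKITTI sequences follow the dataset's standard split (00-07 and 09-10 for training, 08 for validation), and the dictionary layout itself is an assumption.

    # Sequence splits: standard SemanticKITTI split plus the
    # KITTI-road split adopted from MotionSeg3D.
    SPLITS = {
        "train": [f"{i:02d}" for i in range(11) if i != 8]        # SemanticKITTI
                 + [f"{i:02d}" for i in (30, 31, 32, 33, 34, 40)],  # KITTI-road
        "valid": ["08"] + [f"{i:02d}" for i in (35, 36, 37, 38, 39, 41)],
    }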
The quantitative comparison is shown in Tab. I. The results show that our approach achieves state-of-the-art MOS performance with 75.6% IoU. LMNet, 4DMOS, and RVMOS are trained on the SemanticKITTI dataset, and RVMOS achieves a large performance improvement, up to 74.7% IoU, mainly because it incorporates semantic information into its network. AutoMOS, MotionSeg3D, and our network are trained on SemanticKITTI plus the additionally labeled KITTI-road dataset to reduce the impact of the unbalanced data distribution.
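For reference, the reported IoU is the standard intersection-over-union of the moving class; a minimal sketch of the computation follows (the label encoding is an assumption).

    import numpy as np

    def moving_iou(pred, gt, moving_label=1):
        """IoU of the moving class: TP / (TP + FP + FN)."""
        tp = np.sum((pred == moving_label) & (gt == moving_label))
        fp = np.sum((pred == moving_label) & (gt != moving_label))
        fn = np.sum((pred != moving_label) & (gt == moving_label))
        return tp / (tp + fp + fn)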
Qualitative comparisons of our method with LMNet and 4DMOS on the SemanticKITTI validation set. LMNet is the first work on LiDAR MOS exploiting range images; it is fast but produces many wrong predictions. 4DMOS performs well on fast-moving objects but less well on slow-moving ones. Since 4DMOS cannot capture the instance information of the moving points, only some points of a moving instance are correctly predicted. In contrast, our approach segments moving objects completely and can detect slow-moving instances by integrating past observations in the instance-based refinement algorithm. These qualitative results support our first claim and further demonstrate that instance information is highly valuable for the MOS task.
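The core idea of instance-level completion can be sketched as a per-instance vote; the voting threshold and the majority rule here are assumptions for illustration, not our exact refinement algorithm.

    import numpy as np

    def refine_by_instance(moving_prob, instance_ids, threshold=0.5):
        """If enough points of an instance are predicted as moving,
        label the whole instance as moving (illustrative rule)."""
        labels = np.zeros(len(moving_prob), dtype=bool)
        for inst in np.unique(instance_ids):
            if inst < 0:  # skip background / un-instanced points
                continue
            mask = instance_ids == inst
            if moving_prob[mask].mean() > threshold:
                labels[mask] = True
        return labels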
To test the generalization of InsMOS, we conduct experiments on the Apollo dataset. We compare our method with three open-source baselines: MotionSeg3D, LMNet, and 4DMOS. All methods are trained only on the SemanticKITTI training set and evaluated on the Apollo dataset without modifying any settings or fine-tuning any parameters. Besides, we also report the result of fine-tuned LMNet combined with AutoMOS, denoted as LMNet+AutoMOS+Fine-Tuned. The results are shown in the table. The two range-image-based methods, MotionSeg3D and LMNet, perform poorly in the generalization test, while 4DMOS and our method maintain good segmentation ability in unknown environments. A possible reason is that projection-based approaches overfit to the sensor setup and to specific patterns in the training environments, which degrades their performance in new scenes, whereas point cloud-based approaches are not affected. Benefiting from the learned instance information, our method still outperforms 4DMOS.
To further demonstrate the ability of our method to detect instances, we evaluate it on the KITTI tracking dataset with the model pre-trained on SemanticKITTI. The KITTI tracking dataset contains more instances than the SemanticKITTI dataset. The results show that our method accurately predicts the instance category and 3D bounding box of the main traffic participants, even in complex environments.