Improving 3D Vulnerable Road User Detection With Point Augmentation

Point clouds are a popular representation of 3D environments for autonomous driving applications. Despite providing accurate depth information, the sparsity of points makes it difficult to extract sufficient features from small, vulnerable objects. One solution is to leverage self-attention networks to build long-range connections between similar objects. Another is to use generative models to estimate the complete shape of objects. Both approaches introduce large memory consumption and extra model complexity, while the geometric characteristics of objects are overlooked. To overcome this problem, this paper proposes Point Augmentation (PA)-RCNN, which focuses on small object detection by generating efficient complementary features without trainable parameters. Specifically, 3D points are sampled under the guidance of object proposals and encoded through grid-based 3D feature aggregation to produce localised voxel properties. These voxel attributes are fed to the pooling module together with fictional points, which are transformed from the sampled points by exploiting geometric symmetry. Experimental results on the Waymo Open Dataset and the KITTI dataset show a clear advantage in detecting distant and small objects compared with existing state-of-the-art methods.

3D object detection aims to locate and classify vehicles, pedestrians and other road users in 3D environments. It provides a fundamental understanding of the surroundings of intelligent systems and facilitates the subsequent tasks in the perception workflow of autonomous driving [1], [2]. With greater weather-proof ability than camera systems, light detection and ranging (LiDAR) sensors are widely deployed to acquire accurate depth measurements and to extract geometric information from point clouds. Recent developments in deep neural networks have further boosted the use of LiDAR sensors.
Different from images, point cloud processing is less straightforward because of its sparsity and irregularity [1]. To address this issue, researchers extract high-dimensional features from point clouds in two main formats: point-based and voxel-based. Point-based methods encode point coordinates with a symmetric function and store the information at point locations [3], [4], [5], while voxel-based methods discretise the 3D scene and perform feature learning on regular grids [6], [7], [8]. Voxelisation simplifies the nearest neighbour query by directly selecting adjacent indexes on the grid map, increasing sampling efficiency in the receptive field. However, locating features at fictional voxel centres harms the accuracy of voxel-based encoders. In contrast, by inheriting accurate point locations throughout the information flow of the network, point-based methods can locate rich features precisely in the scene. However, since searching for nearest neighbours among unordered points is time-consuming, a poor point sampling scheme may also limit the efficiency of point-based encoders, such as Set Abstraction in [5]. Although remarkable performance has been achieved in 3D car detection using point clouds, researchers tend to overlook the deficiency in detecting more vulnerable targets, such as pedestrians and cyclists. Such small or distant objects often attract fewer laser beams due to the sensor's nature (i.e., point density shrinks as distance increases). It is therefore crucial to consider raw point features for the detection network.
Down-sampling points is inevitable to increase the receptive field and maintain the input size in point-based methods; convolutions are then performed around the selected key points. The commonly used schemes are farthest point sampling (FPS) and random point sampling (RPS). FPS ensures the best coverage of the scene, while RPS may avoid overfitting. However, neither can guarantee that object points survive to the next stage of the network. Many irrelevant background points are included due to the imbalanced number of foreground and background points. This leads to potential information loss, especially for distant and small objects, such as pedestrians and cyclists. Increasing the receptive field is relatively easy with a grid-like structure, usually by decreasing the voxel resolution, but it is often difficult to balance the trade-off between memory usage (number of voxels) and performance, as larger voxels often neglect small objects.
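The two sampling schemes can be contrasted with a short NumPy sketch (illustrative only; the function names are ours, not from the paper's codebase):

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from the already
    selected set, maximising coverage of the scene."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = [int(rng.integers(n))]          # arbitrary first point
    dist = np.full(n, np.inf)                  # distance to nearest selected
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[selected[-1]], axis=1)
        dist = np.minimum(dist, d)
        selected.append(int(np.argmax(dist)))  # farthest remaining point
    return np.asarray(selected)

def random_point_sampling(points, k, seed=0):
    """RPS: uniform sampling; cheaper, but coverage is not guaranteed."""
    rng = np.random.default_rng(seed)
    return rng.choice(points.shape[0], size=k, replace=False)
```

Neither scheme inspects semantics, which is why foreground points from a small, distant object can be dropped entirely.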
Vision transformer (ViT), as a counterpart to convolutions, has played a noticeable part in current 2D object detection tasks [9], [10], [11]. The self-attention mechanism provides long-range connections between pixels that are close in high-dimensional space, which aids the discovery of small objects in the image. Researchers aim to transfer the success of ViT from 2D tasks to 3D tasks [7], [12]. Although latency and performance can be improved, transplanting the transformer can be expensive and brings extra complexity to detection networks.
Another approach to improve detection on occluded, distant or small instances is to reconstruct the points of missing shapes. With the aid of point completion algorithms, a generative module is trained to predict the full shape of objects from incomplete point sets in a self-contained manner or through external datasets [13], [14], [15], [16]. The instances with incomplete geometry are replaced or augmented with the generated shapes to increase the confidence of predictions for small and occluded objects. As the predicted parts of objects are usually unseen or overlooked by the sensors, the point completion results may not always reconstruct the correct object surfaces, especially when testing outside the training domain. These methods tend to induce large memory usage and high computational cost due to the extra inference module, as well as a longer latency.
To solve the aforementioned issues, a new object detection framework with Point Augmentation, namely PA-RCNN, is proposed to facilitate efficient detection of small and distant objects by integrating a proposal-guided sampling scheme and a simple yet effective object point augmentation module into a two-stage detection architecture. The main contributions of this paper can be summarised as follows:
• To ensure accurate proposals are generated, a lightweight Attention-based Semantic Mining (ASM) module is adopted to yield the 2D feature map, considering both geometric and semantic information. Gradient degradation can be mitigated by fusing the relatively shallow geometric information. Compared to a 3D transformer [7], the 2D attention algorithm consumes less memory while achieving favourable results for detection proposals.
• To sample as many informative foreground points as possible for the second stage of the detector, key point sampling is guided by detection proposals. This effectively reduces background noise for region of interest (RoI) pooling and bounding box refinement.
• To benefit from complete object shapes, the RoI refinement stage comprises a Point Augmentation Module (PAM) and a local Grid-based Voxel-to-Point Feature Aggregation (GVPFA) module. The PAM has no trainable parameters and extracts all proposed object points and their associated features from the raw input. To realise point cloud completion without trainable parameters, the generic geometric characteristic of symmetry is exploited, as shown in Fig. 1. Thorough experiments show that the proposed method exceeds the current state-of-the-art on LEVEL_2 of the Waymo Open Dataset and achieves the best results on the KITTI cyclist category among methods without generative modules.

II. RELATED WORK

A. Convolution-Based 3D Detection
Most existing 3D detectors heavily rely on the advancement of convolutional neural networks [1], [2], [6]. Some detectors perform convolutions directly on raw points using, for instance, the Set Abstraction module from PointNet [1]. F-PointNet [17] crops the point cloud scene based on proposals generated by a 2D detector from RGB images; the cropped point cloud reduces the number of background points for bounding box refinement. Point-RCNN [5] improves proposal quality by adopting a 3D backbone that encodes the entire scene to provide 3D proposals. 3DSSD [3] replaces the costly feature propagation layers with an advanced sampling technique to achieve single-stage anchor-free detection.
By discretising the 3D space into voxels, VoxelNet [18] can deploy 3D convolution directly on the regular grids. SECOND [19] improves the 3D CNN with sub-manifold and sparse convolutions, considering the sparsity of the point cloud. PointPillars [20] merges vertical voxels into pillars to form a pseudo-image, in an attempt to reduce the computational burden. Voxel-RCNN [6] integrates a bounding box refinement stage into the SECOND backbone. PV-RCNN [2] associates voxel features with key point locations via the voxel set abstraction module. Li et al. [21] solve the IoU-misalignment issue with a redesigned box refinement module. However, such recent detectors mainly focus on the detection of targets such as cars or vehicles, while smaller vulnerable instances are often overlooked, due to the lack of both rich features and long-range connections to similar objects in the scene.

B. Attention-Based 3D Detection
While being able to provide long-range connections, self-attention-based modules usually act as feature enhancers in early 3D detectors. DVFENet [22] enhances the features by adding a graph-attention-based branch in parallel with the baseline sparse convolution backbone. S-AT GCN [23] adds a spatial-attention module to PointPillars [20] to reduce partition effects. Pyramid-RCNN [8] improves the second-stage module by introducing a pyramid RoI head with conventional attention- and graph-based operators. VoTr [7] rebuilds the detector's backbone with a 3D voxel transformer at large memory cost. CT3D [24] consists of a channel-wise transformer that operates on raw 3D points. VoxSet [12] detects 3D objects with set-to-set translation, reducing memory usage and runtime. However, transformers are usually introduced into voxel-only networks. Increasing the usability of transformers on 3D raw points is not trivial, due to the unordered nature of point clouds.

C. Generative Methods for 3D Detection
To solve the inconsistent point density, PC-RGNN [25] predicts the complete shapes of objects with a point cloud completion module, which renders additional points to the proposals with a multi-resolution graph encoder and a point pyramid decoder. Associate-3Ddet [26] and AGO-Net [26] mimic the bio-model by learning to map incompletely perceived features of objects to more complete features of corresponding class-wise conceptual models. Such a generative feature enhancement scheme greatly improves the detection accuracy on distant objects with fewer points. SIENet [27] predicts the spatial shapes of foreground points in proposals, where the prediction module is trained with external data. Semantic point generation (SPG) [14] closes the domain gap by adopting an SPG module to recover the foreground points overlooked by the sensors. BtcDet [13] predicts the occupancy map and estimates the complete shapes of occluded objects with prior learned information. SFD [15] generates pseudo point clouds by estimating depth on RGB images and extracts rich contextual and spatial features through attentive fusion with raw point clouds. Generative modules provide conceptual information that is not perceived by the sensors. Considering the advantages of generative modules, a simpler approach to estimating the complete object shapes is further investigated.

III. PA-RCNN: POINT AUGMENTATION FOR VULNERABLE OBJECT DETECTION
This section introduces the proposed PA-RCNN detector. Based on PV-RCNN [2], a two-stage detection framework, we explore improvements in bird's eye view (BEV) encoders, point sampling strategy, deformable point-voxel feature aggregation and point augmentation. Fig. 2 shows the layout of the network. The first stage of PA-RCNN encodes voxel features with a 3D backbone of sparse and sub-manifold convolutions, and BEV features with a 2D backbone comprising an attention-based semantic mining module. The BEV features are used to generate detection proposals, which are then refined by the second stage through point augmentation and RoI-grid pooling to produce the final predictions.

A. Attention-Based Semantic Mining Module
2D BEV features are crucial in the two-stage detection pipeline, as the detection proposals are generated solely from the BEV feature map, and the quality of proposals directly affects the final results. In recent methods, flattened 2D BEV feature maps are processed with the widely used 2D backbone, consisting of a group of basic convolution layers for encoding and decoding. The features are less sensitive to small objects due to partition effects, where small objects may be neglected or truncated through pooling. To enhance feature richness in the 2D backbone, CIA-SSD [28] builds a dual-branch encoding scheme and SMF-SSD [29] uses multi-scale 3D features. Inspired by [30], we consider both the depth and width of the features. The deeper features focus on the high-dimensional semantic information of the scene, while the shallower features emphasise the intra- and inter-instance geometric relations. Similar to [28], a dual-branch feature encoder is constructed to avoid the shallower features being washed out in a deep neural network. On the short path, the feature map resolution and the number of channels remain unchanged through fully connected layers φ. While [28] uses a single semantic branch in the 2D backbone, ASM further exploits the high-dimensional semantics by adopting multiple information paths in the semantic branch. On this long path, strided convolutions ψ are used to aggregate high-dimensional semantics. Different from SMF-SSD [29], only features from the last layer of the 3D backbone are fed to ASM. The structure of ASM is shown in Fig. 2(b). Given the flattened 3D feature map F_flat, the process can be summarised as:

F_bev,g = φ(τ(F_flat)),   F_bev,s,i = ψ_i(F_bev,s,i−1),   F_bev,s,0 = τ(F_flat),

where τ is a shared bottom-up convolution layer, and F_bev,g and F_bev,s,i are the features from the geometric and semantic branches respectively.
Unlike dense connections [31], where features from different layers are stacked through concatenation, we follow SKNet [32] and use branch-wise attention for its advantage in filtering meaningful discriminative information from a sparse feature map. To match the feature map sizes of the branches, extra deconvolution layers follow the strided convolutions to obtain F_bev,s. We compute the intermediate features z as the channel-wise addition of F_bev,g and all F_bev,s,i. The attention weights for each path in both branches can be denoted as Ω = {ω_g, ω_s,1, ..., ω_s,l}. The weights are given as:

Ω = Softmax(A_g(z), A_s,1(z), ..., A_s,l(z)),

where A = {A_g, A_s,1, ..., A_s,l} are the attention embeddings for F_bev,g and each F_bev,s,i respectively. The aggregated BEV features can then be given as:

F_bev = ω_g · F_bev,g + Σ_{i=1}^{l} ω_s,i · F_bev,s,i,

where ω_g + Σ_{i=1}^{l} ω_s,i = 1 and l is the number of layers on the semantic branch. The attention mechanism naturally gathers related information from both the geometric and semantic feature maps.
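The branch-wise fusion can be sketched as follows (NumPy; `embed_weights` is a hypothetical stand-in for the small FC attention embeddings, not the paper's exact parameterisation):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def branch_attention_fuse(branches, embed_weights):
    """SKNet-style selective fusion over feature branches.

    branches      : list of B arrays, each (C, H, W) -- geometric and
                    semantic BEV feature maps with matched shapes.
    embed_weights : (B, C, C) per-branch embedding matrices A_b.
    """
    z = np.sum(branches, axis=0)                       # channel-wise addition
    s = z.mean(axis=(1, 2))                            # global pooling -> (C,)
    logits = np.stack([A @ s for A in embed_weights])  # per-branch logits (B, C)
    omega = softmax(logits, axis=0)                    # weights sum to 1 over B
    fused = sum(w[:, None, None] * f for w, f in zip(omega, branches))
    return fused, omega
```

The softmax across branches enforces the constraint that the branch weights sum to one per channel.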

B. RoI Refinement With Auxiliary Points

1) Object-Guided Point Sampling:
The quality of point sampling greatly affects the efficiency of the refinement stage. Given the voxel features F_voxel and sampled key points q = {q_i | i = 1, ..., N}, the feature encoder aggregates the features of neighbouring voxels around each key point. To ensure the aggregated features are relevant to the target objects, it is essential to select as many foreground points as possible, so that the distraction from the background is minimised.
FPS is a popular approach in recent methods, which aims to cover the scene evenly by selecting the most distant points. FPS works well for detecting cars and vans, since larger objects have more points; however, many background points are also selected. Semantic-assisted FPS [4] introduces semantic weights to the distance between points, where foreground points have higher weights. This improves the sampling efficiency, but at a high computational cost. Moreover, small objects are more vulnerable to FPS and are often overlooked. While aggregating voxel features to the key points, features related to a smaller object can be assigned to its nearest background point that has survived the sampling process. This leads to a mislocation of the features, whose impact is more significant for smaller objects.
Proposals generated by the first stage provide guidance on the approximate locations and sizes of the target objects. By selecting the points within the proposals, one can increase the share of points from small objects in the sampled point set. The procedure is shown in Algorithm 1 (Object-guided Point Sampling for N points and M proposal boxes). The proposal boxes are also enlarged to accommodate proposal imperfections and to include the background points around the boxes, which carry important information for distinguishing object edges. Different from whole-scene sampling, the proposed method also creates a point set for each proposal. This helps separate points from different objects and facilitates point augmentation in the second stage.

Fig. 3. Illustration of the grid-based voxel-to-point feature aggregation. With the grid drawn over the neighbourhood of a key point, local grid features are first aggregated to the local grid points. The local grid features are then processed with a grouped MLP layer to produce C_in channels for each grid. The C_in features are passed to a multi-head attention layer, followed by a normalisation layer and an MLP. The grid-distinct features with C_in channels, the attention features with C_out,att channels and the number of points are concatenated to give a C_out-channel output, where C_out = C_in + C_out,att + 1.
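Omitting box headings for brevity, the per-proposal selection of Algorithm 1 can be sketched as follows (NumPy, axis-aligned boxes only; the full method also rotates points into each box frame):

```python
import numpy as np

def object_guided_sampling(points, boxes, margin=0.2):
    """Keep only points inside (enlarged) axis-aligned proposal boxes,
    returning one point set per proposal."""
    per_proposal = []
    for (cx, cy, cz, dx, dy, dz) in boxes:
        half = np.array([dx, dy, dz]) / 2 + margin   # enlarged half-extents
        inside = np.all(np.abs(points - [cx, cy, cz]) <= half, axis=1)
        per_proposal.append(points[inside])
    return per_proposal
```

Keeping a separate point set per proposal is what later allows the symmetry-based augmentation to operate on each object individually.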
2) Grid-Based Voxel-to-Point Feature Aggregation: The voxel representation is favoured for its regularity, which simplifies the neighbour querying process. Voxel neighbours around a key point can easily be found by indexing, while distance calculation and sorting are required to find point neighbours. Voxel centres can be calculated from the voxel indexes (i, j, k) by:

c_{i,j,k} = ((i + 0.5) L / N_x, (j + 0.5) W / N_y, (k + 0.5) H / N_z),

where (L, W, H) is the scene size and (N_x, N_y, N_z) are the numbers of voxels in each dimension. The voxelisation process assigns the encoded features to the voxel centres. This leads to the loss of fine-grained point details, since the precise point positions measured by the LiDAR sensor are not used. The actual precision of feature locations greatly depends on the degree of voxelisation, i.e., the voxel grid size. Smaller voxel grids produce more accurate locations for feature aggregation, and remarkable car detection results have been achieved with voxels alone [6]. However, cyclist and pedestrian targets are more prone to failures caused by inaccurate feature locations, since their bounding boxes are significantly smaller than those of cars and vans.
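Assuming the scene spans [0, L] × [0, W] × [0, H] from the origin, the index-to-centre mapping can be written as:

```python
import numpy as np

def voxel_centres(indices, scene_size, grid_dims):
    """Map integer voxel indexes (i, j, k) to metric voxel centres,
    assuming the scene spans [0, L] x [0, W] x [0, H]."""
    edge = np.asarray(scene_size, dtype=float) / np.asarray(grid_dims)
    return (np.asarray(indices, dtype=float) + 0.5) * edge
```

Every point in a voxel is represented by this single centre, which is exactly the quantisation error the GVPFA module tries to compensate for.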
To mitigate this inefficiency, a Grid-based Voxel-to-Point Feature Aggregation (GVPFA) module is proposed, illustrated in Fig. 3. Positional information is implanted by adding the relative coordinates of the neighbouring points. In addition, point density information is inserted by adding the number of points in the vicinity. Specifically, the space around a sampled key point q_i is divided into local grids G_l. Contrary to the commonly used set abstraction [2], features are first aggregated within each local grid before being summarised to the key points. The features of local grid G_{l,i} can be expressed as:

F_{G_{l,i}} = max_{V_i ∈ N(G_{l,i})} F_{V_i},

where V_i is one of the neighbouring voxels around the local grid centre. Inspired by [33], the features of a key point q_i can be generated by a grouped MLP:

F_{q_i} = ⊕_{l=1}^{n} ω_{l,i} · F_{G_{l,i}},

where ω_{l,i} is the respective kernel filter weight of the MLP and n is the number of local grids around a key point. A grouped MLP limits the influence between different groups by isolating feature interaction, allowing the module to produce position-specific semantics. Memory consumption is also reduced by removing unused links. However, objects can appear at any rotational angle on the ground surface. A complete detachment of features on different grids around the key point is insufficient to address this nature, so connections between the grids have to be built to realise rotational invariance of the features. Inspired by [9], a lightweight self-attention module is deployed over the local grids to enable feature communication. Since the sparsity of points causes a more severe deficiency in detecting small objects, the number of points in the local neighbourhood is added to the feature map. The feature output can be summarised as:

F_{out,i} = [SA({F_{G_{l,i}}}); F_{q_i}; n_p],

where n_p is the number of points in the neighbourhood of key point q_i.
The output of the self-attention (SA) module is concatenated with the point count and the key point features, followed by a fully connected (FC) layer to give the output feature dimension for subsequent processing. The same strategy is applied to the RoI grid pooling module, where key point features are aggregated to the box grid points instead of local grids. The self-attention layer also provides interdependence over individual bounding boxes.
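The grouped aggregation step can be sketched as follows (NumPy; the self-attention layer over grids is elided, and `group_weights` is a hypothetical stand-in for the grouped MLP kernels):

```python
import numpy as np

def gvpfa_sketch(grid_feats, group_weights, n_points):
    """Grouped aggregation over local grids around one key point.

    grid_feats    : (G, C_v) pooled voxel features, one row per local grid.
    group_weights : (G, C_v, C_in) per-grid filters -- the 'grouped MLP':
                    each grid owns its kernel, so features stay
                    position-specific and groups do not interact.
    n_points      : point count in the neighbourhood, appended so the
                    density signal survives into the output.
    """
    per_grid = np.stack([f @ w for f, w in zip(grid_feats, group_weights)])
    # a self-attention layer over `per_grid` would restore communication
    # between grids; omitted here for brevity
    return np.concatenate([per_grid.reshape(-1), [float(n_points)]])
```

Because each grid multiplies only its own kernel, no cross-grid links exist at this stage, which is precisely why the paper adds the attention layer on top.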
3) Point Augmentation: Detection on downsampled points alone results in lower accuracy [3]. This is caused by insufficient information and is further deteriorated by ground truth ambiguities, occlusions and missing elements in the ground truth. A potential solution is to utilise generative modules to predict the missing signals and provide the omitted semantic information [13], [14], [16]. However, the extra complexity and issues with domain adaptation are often overlooked.
A simple and effective point augmentation module is built to recover the approximate shape of an object based on pure geometric relations. By assuming approximate symmetry of the detection targets, the key points and the associated features of each proposal sampled by the object-guided point sampling module are processed with an operation T. The augmented points and features can be generated by:

[p̃_x, p̃_y, p̃_z] = T([p_x, p_y, p_z], B_j),   F̃_j = F_j,

where F_j is the features aggregated inside the proposal bounding box B_j and [p_x, p_y, p_z] ∈ R^{N×3} are the coordinates of the N points inside B_j. The operation T can mirror or rotate the points with reference to the bounding box. In our case, the points are mirrored and the features are duplicated for the new points. The enhanced features, as well as the original features, are fed to the RoI grid pooling module, where features are gathered to the proposal box grid points accordingly. In addition, the coordinates of all raw points in the proposal boxes are processed with the GVPFA module to provide shallow and complete geometric information. The point sets for different proposals are given by the sampling layer.
With the help of the approximated object shapes and structure-sensitive features, the bounding box refinement layer can regress accurate bounding boxes based on enriched semantics from both perceptual and conceptual information, especially for small and vulnerable objects like cyclists and pedestrians.
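A minimal sketch of the mirroring operation T (NumPy; assumes a box parameterised by centre and heading angle, which is how KITTI/Waymo boxes are commonly encoded):

```python
import numpy as np

def mirror_about_box_plane(points, box_centre, heading):
    """Reflect points about the box's longitudinal (xz) centre plane.
    Features of the mirrored points would simply be duplicated."""
    c, s = np.cos(heading), np.sin(heading)
    Rm = np.array([[c, -s], [s, c]])              # R(heading)
    # right-multiplying row vectors by R(heading) rotates them by -heading,
    # taking world coordinates into the box frame
    local = (points[:, :2] - box_centre[:2]) @ Rm
    local[:, 1] *= -1                             # mirror across the xz plane
    world_xy = local @ Rm.T + box_centre[:2]      # rotate back and translate
    return np.column_stack([world_xy, points[:, 2]])
```

The transform has no trainable parameters: the only inputs are the proposal box and the raw points, matching the paper's claim of parameter-free augmentation.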

IV. EXPERIMENTS
This section presents results from thorough experiments and is organised to provide: 1) a brief introduction to the datasets and implementation details; 2) a comparison with other state-of-the-art methods; and 3) an analysis of the effectiveness of each component of the architecture.

A. Dataset
The proposed method is evaluated on the commonly used KITTI dataset [34] and Waymo Open dataset (WOD) [35].
1) Waymo Open Dataset: WOD is a significantly larger dataset with 798 training and 202 validation sequences, containing around 160k and 40k point cloud samples respectively. The evaluation metrics are the mean Average Precision (mAP) and the mean Average Precision weighted by Heading (mAPH). The 3D intersection-over-union (IoU) thresholds for the bounding boxes are (0.7, 0.5, 0.5) for the Car, Pedestrian and Cyclist categories. Depending on how the testing samples are split, the results can be grouped by difficulty level or detection range. By difficulty level, ground truth targets are divided into LEVEL_1 and LEVEL_2, which guarantee that at least 5 and 1 laser points, respectively, are reflected from the objects. By detection range, the ground truth targets are assigned to the groups of 0-30 m, 30-50 m and >50 m from the sensor.
2) KITTI Dataset: The KITTI 3D object detection dataset contains 7481 and 7518 samples for training and testing respectively. The training set is further divided into train/val splits with 3712 training and 3769 validation samples. The official evaluation metric is the mAP calculated by the official evaluation tool with 40 points from the precision-recall curve on three difficulty levels. The 3D IoU thresholds are (0.7, 0.5) for the Car and Cyclist categories.
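The 40-point metric averages interpolated precision at 40 evenly spaced recall positions; a minimal sketch (this omits the difficulty filtering and box-matching logic of the official tool):

```python
import numpy as np

def average_precision_40(recall, precision):
    """KITTI-style AP: average interpolated precision sampled at 40
    equally spaced recall positions on the PR curve."""
    samples = np.linspace(1 / 40, 1.0, 40)
    ap = 0.0
    for r in samples:
        mask = recall >= r
        # interpolated precision: best precision at any recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 40
```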

B. Implementation
The proposed method is built based on the widely used OpenPCDet [40] codebase. Particularly, the 2D backbone of PV-RCNN [2] is replaced with our ASM module. The RoI refinement stage is also extended with our grid-based feature aggregation and point augmentation module, while keeping the rest of the network untouched.
For the ASM module, a 2-layer semantics mining sub-module is used, which consists of a 3 × 3 convolution with a feature dimension of 128 for each layer. A deconvolution layer is added to each of the layers on the semantic branch to match the feature map shape of the geometric branch. The geometric branch comprises a single 3 × 3 convolution layer with 128 output channels. The features of different branches are summarised with an attention fusion module with 256 output channels.
For the second stage of RoI refinement, raw points that lie in the enlarged proposal boxes are sampled. The proposal boxes are enlarged by 0.2 m in each axis. For aggregating features from the BEV and 3D backbones, each local grid has 32 output channels. The local grid features are processed with a transformer layer. The point augmentation is configured to mirror the raw points around the proposals to obtain the estimated shapes. For aggregating the shallow and complete geometric information from raw and augmented points, we use a two-scale approach with local grid sizes of (2, 2, 2) and (3, 3, 3) for the Waymo Open dataset, and (3, 3, 3) and (4, 4, 4) for the KITTI dataset.
The model is trained with the ADAM optimiser on 4 RTX 2080 Ti GPUs. With respect to the KITTI dataset, the model is trained with a batch size of 8 and a learning rate of 0.007 for 80 epochs, and the results on the val set are obtained with models trained on the train split. With respect to the Waymo Open dataset, the model is trained with a batch size of 8 and a learning rate of 0.007 for 40 epochs. A 20% train-split training option is also provided, where training scenes are sampled uniformly and evaluation is performed on the full validation set. Following OpenPCDet [40], the cosine annealing learning rate decay strategy is adopted and the same data augmentation scheme is used.
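The cosine annealing decay can be sketched as follows (a simplified form without the warm-up phase that OpenPCDet configurations typically include; `lr_max = 0.007` matches the setting above):

```python
import math

def cosine_annealed_lr(epoch, total_epochs, lr_max=0.007, lr_min=0.0):
    """Cosine annealing: decay smoothly from lr_max to lr_min over training."""
    t = min(epoch, total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```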

C. Waymo Results
The main results on Waymo Open dataset are shown in Table I, where the comparison is made between the proposed method and the recent studies across two difficulty levels and three object categories. PA-RCNN achieves state-of-the-art performance on all columns except for Vehicle LEVEL_1, on which competitive accuracy is also obtained. Note that our method outperforms BtcDet [13], which consists of generative modules, on mAPH by 0.23% and 0.39% on both levels of the Vehicle category. It is also worth mentioning that higher improvements can be seen from vulnerable targets, which are more difficult to detect. The proposed method raises the best mAPH by 1.93% and 2.29% on Pedestrian LEVEL_1 and LEVEL_2 respectively. There is also an increase of 0.8% and 0.9% in mAPH on both levels of Cyclist. The results of the model trained with 20% of the training split are also presented to compare with VoTr-TSD [7] and Pyramid-PV [8].
Tables II, III, and IV show comparisons pertaining to detection range. Close-range targets are easier to identify as the point density is higher, while distant targets are more difficult to recognise as the point density decreases with increasing distance. While achieving competitive results for close-range vehicles, a larger advantage of the proposed method can be observed on distant objects. Although BtcDet [13] with generative modules dominates 0-30 m vehicle detection, PA-RCNN obtains better results for vehicles further than 30 m. This may be explained by the greater effectiveness of point augmentation and object-guided sampling on instances with fewer points. The improvement in close-range performance is limited, as objects in this range are usually comparatively well covered by points.

D. KITTI Results

Table V illustrates the results on the KITTI val split. Our method improves Car mAP by +0.94% and +0.48% compared with its baseline (PV-RCNN [2]) and the second best model (PDV [39]) respectively on the moderate difficulty. In both the Pedestrian and Cyclist classes, the proposed method also shows competitive performance against some of the multi-modality methods. We also include the performance of PA-RCNN(P), where only the Point Augmentation module is adapted to the PointRCNN [5] framework. Since PointRCNN is a point-only detector, we can observe the improvement solely induced by PA; however, the improvement on vulnerable objects is limited.

Table VI presents the results on the KITTI testing server. While achieving an improvement on the cyclist class over the multi-class detectors (PV-RCNN [2], PV-RCNN++ [36] and PDV [39]), the proposed model obtains the best overall cyclist detection among methods without generative modules. However, the performance enhancement in the Car category is limited compared to the state-of-the-art. This can be explained by the smaller voxel size on the KITTI dataset.
Note that WOD has a voxel size of (0.1, 0.1, 0.15) m, while KITTI has a voxel size of (0.05, 0.05, 0.1) m. Compared to WOD, the finer-grained voxelisation on the KITTI dataset, permitted by the smaller detection range, allows the baseline detectors to extract information from denser feature maps. This limits the improvement provided by the proposed point augmentation and GVPFA modules, which aim to compensate for the information loss due to partition effects. The degradation in the Pedestrian category can also be explained by the finer input of KITTI. Furthermore, by examining the prediction results on the val set, some false positives in the Pedestrian class are actual pedestrians visually observable in the RGB images. A sample is shown on the left in Fig. 7(a), where the unlabelled pedestrians are correctly detected. It can be hypothesised that a similar case would be observed on the more difficult test set. Such results are less frequent on WOD. Figs. 4, 5, and 6 show the mAP by distance on the KITTI val set. PA-RCNN outperforms the baseline in all ranges, except for cars between 10 and 30 metres.

E. Ablation Study
This section analyses the effectiveness of each component and variations of the network.
Effect of network components: Table VII shows the quantitative improvement in LEVEL_2 mAPH contributed by each component on a 10% training set, where Config. 1 is the baseline PV-RCNN [2] network re-implemented in the OpenPCDet codebase [40]. Introducing the attention-based semantic mining module to the BEV feature map (Config. 2) brings improvements of 0.68% and 1.14%, among others.

TABLE VII: Component analysis on 10% Waymo Open Dataset. SM, OG, GB and PA represent the semantic mining BEV encoder, object-guided point sampling, grid-based voxel-to-point feature aggregation and point augmentation respectively.

Effect of point augmentation schemes: Table VIII shows a comparison of different schemes. By simply including all raw points around the proposal boxes (Config. 4-a), the largest improvement of 0.98% is seen on the Pedestrian LEVEL_2 mAPH. Minor improvements are observed for rotating the points around the box centres and mirroring the points about the transversal centre (yz) plane (Configs. 4-b and 4-c). It is hypothesised that the orientation information can be corrupted by these transformations, leading to sub-optimal results on the heading-weighted performance. However, the improvement provided by the increased point density outweighs the deficiency of a sub-optimal transformation, and the more accurate localisation of bounding boxes compensates for the corrupted heading estimation. This can be explained by the visualisations in Fig. 1, where all three transformations provide rich geometric information for locating the target accurately, while directional information is degraded when the object is rotated or mirrored longitudinally.
By mirroring the points about the longitudinal centre (xz) plane (Config. 5), the symmetric characteristics can be fully utilised. Noticeable gains are achieved in all classes. In particular, further boosts of 0.93% and 0.87% are observed in the vulnerable Pedestrian and Cyclist classes. Visualisations of the point augmentation are shown in Fig. 7. It can be observed that the fictional points generated by the point augmentation module provide extra information on distant objects by estimating their complete shapes. The RGB image in Fig. 7(a) shows that PA-RCNN can successfully detect distant pedestrians and cyclists, which are clearly visible but have no ground-truth labels. Fig. 7 includes two samples from the KITTI val set. A dense point set can be seen with the augmented points generated by the PA module, achieving accurate detection on distant and vulnerable targets. Fig. 8 shows a visualisation of the detection results on a sample instance from the KITTI val set. Table IX summarises the numbers of false positives on the KITTI dataset. With more distinctive details added by the point augmentation module, it is noticeable that the number of false positives is drastically reduced. In addition, the inclusion of point density information provides another confidence measure that helps reduce false positives, especially on vulnerable targets.
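A minimal sketch of the xz-plane mirroring of Config. 5: transform the sampled points into the proposal's box frame, negate the lateral coordinate, and transform back. The function name and the (centre, size, heading) box parameterisation are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def mirror_points_xz(points, box):
    """Generate fictional points by mirroring about the box's longitudinal
    centre (xz) plane, exploiting the left-right symmetry of road users.

    points: (N, 3) sampled points; box: (cx, cy, cz, dx, dy, dz, heading).
    """
    cx, cy, cz, _, _, _, yaw = box
    c, s = np.cos(yaw), np.sin(yaw)
    centre = np.array([cx, cy, cz])
    # World -> box frame: translate to the centre, then rotate by -yaw about z.
    local = (np.asarray(points, dtype=np.float64) - centre) @ np.array(
        [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Reflect across the xz plane: negate the lateral (y) coordinate.
    local[:, 1] *= -1.0
    # Box frame -> world.
    return local @ np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]) + centre
```

The augmented set would then be the union of the original and mirrored points, roughly doubling the point density on a side-symmetric object without any trainable parameters.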

G. Comparison With Other Augmentation Methods
The point augmentation module in the proposed method aims to estimate more complete shapes of the targets. The same task can be achieved by the point completion (PC) module in PC-RGNN [25], where a multi-resolution graph encoder and a point pyramid decoder are used. The PC module is applied to the 3D proposals and trained with a completion loss and an adversarial loss, with the discriminator aiming to distinguish fictitious points from the real point cloud. While more trainable parameters need to be considered in PC-RGNN, the proposed PA-RCNN requires no additional optimisation targets for shape estimation. SIENet [27] builds a Spatial Information Enhancement (SIE) module on top of PV-RCNN. The SIE module is tasked to complete the shapes of the proposals from the first-stage RPN. The spatial shape prediction module in SIENet consists of a PointNet-based encoder-decoder, which maps an N × 3 incomplete shape to a dense, complete shape with 1024 points. The SIE module is pre-trained with samples from the external ShapeNet [44] dataset for the Car category. For the Pedestrian and Cyclist categories, training samples are taken from the KITTI dataset, owing to the lack of corresponding external data sources. The semantic point generation (SPG) module [14] generates augmented points based on the voxel features in the proposal regions; its point generation module is trained to map the voxel or pillar features to voxel centroids and mean point features. Similar to SIENet, BtcDet [13] estimates the complete shapes of objects by leveraging the more complete objects in the KITTI dataset. While the more complete objects are used as training targets in SIENet, BtcDet finds the best match from a collection of labelled objects according to a heuristic function, and the points of the best match are then added to the proposal bounding box. Note that an extra database of labelled objects with complete point sets is required before training.
Moreover, because this database is generated only from the KITTI dataset, performance is limited when the point distribution differs in an unseen dataset. While most of the above methods require the design of additional training objectives, the proposed PA-RCNN model is a pure end-to-end network, where no extra trainable parameters are added for shape completion. SFD [15] augments the detection workflow by performing depth completion on RGB images; the generated pseudo point clouds with RGB information realise depth-based data augmentation. Although SFD achieves remarkable performance on single-class Car detection on the KITTI test set, PA-RCNN explores the improvement with point cloud inputs only and provides competitive multi-class detection results.
Experiments have also been conducted on feature-level augmentation. However, it requires additional memory for operation and, in our investigations, provides only a limited improvement in the final results. Based on the study of related works, some existing methods have explored feature-level augmentation, such as AGO-Net [26], BtcDet [13] and SFD [15]. AGO-Net [26] uses a conceptual-perceptual approach and is trained in a self-contained manner. The conceptual network is trained with a fully augmented dataset, where incomplete objects are replaced by their closest pairs with appropriate transformations. The perceptual network is trained with the original dataset, with an additional loss for feature adaptation; its parameters are adjusted in accordance with the conceptual features, which are generated by the conceptual network from the same training samples with full augmentation. BtcDet [13] creates a database of the occluded regions, which is used to train the model for estimating occupancy probability; the occupancy probability map is used as additional features to boost the main detection performance. SFD [15] performs feature-level augmentation by generating pseudo point clouds from additional image inputs, which are encoded by the Colour Point Convolution (CPConv) [15] to obtain pseudo RoI features. Feature-level augmentation provides significant improvements in detection accuracy with the help of additional inputs or hand-crafted optimisation targets. In contrast, neither model pre-training nor an extra database is required to train PA-RCNN end-to-end.

V. CONCLUSION
This article presents a two-stage detection network for intelligent vehicles, incorporating an enhanced feature encoding and aggregation scheme that focuses on detecting pedestrians and cyclists. A shape estimation module with no trainable parameters is introduced to remedy the point sparsity and signal loss on vulnerable road users. Promising results on the KITTI and Waymo Open datasets show the effectiveness of each component of the architecture. The compatibility of the current model allows us to adapt the proposed method to more 3D detection backbones in the future. However, the improvement is limited when the voxelisation is finer: 1) the error between voxel centres and actual point locations is smaller; and 2) there is less discrepancy in point density within each voxel. Furthermore, while the proposed method incurs considerably less computational burden than generative models, we plan to further optimise the point augmentation module for better memory usage and inference time. Reducing the computational resources used by the new modules would facilitate the deployment of the algorithm on the onboard computers of intelligent vehicles. The scalability of the framework should also be investigated to incorporate a larger backbone for more complex scenes.

A. More Qualitative Results
In this section, more visualisations are provided. Fig. 9(a) shows that PA-RCNN has the capability of detecting distant objects. Effective pedestrian detection can be seen in Fig. 9(b). More complicated scenarios can be observed in Fig. 10 for the Waymo Open Dataset. It is noticeable that PA-RCNN is able to accurately locate distant pedestrians and vehicles with occlusions and incomplete shapes in crowded urban scenes. Fig. 11 depicts the point augmentation on the Pedestrian class.

[Figure caption: Bottom: image with the annotations of ground-truth bounding boxes in blue and predicted bounding boxes in green. Arrows indicate the correctly detected targets, which are not labelled in the dataset. Orange circles highlight the ambiguous targets associated with the wrong detections shown in Fig. 14. The blue circle indicates the target whose front-facing direction is incorrectly predicted with the Mirror (xz-plane) transformation shown in Fig. 12(b).]

Fig. 14 shows the visualisations of proposals generated by PA-RCNN for a complex sample with a number of occlusions and overlaps in the KITTI val split. Fig. 13 shows the proposals produced with the three respective transformation schemes for the same example. For clarity, only 50 proposals are displayed in green. Fig. 14 shows that the sub-optimal transformation schemes lead to false positive detections on several ambiguous targets (i.e., humans sitting in the background, highlighted in orange) in a complicated scene. Fig. 12 shows the comparisons of the final detection results. It can be seen that the optimal transformation