MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition
Xiaodong Chen1,2
Xinchen Liu2
Wu Liu2
Yongdong Zhang1
Jungong Han3
Tao Mei2
1University Of Science And Technology Of China
2AI Research of JD.com
3Aberystwyth University
ACM Multimedia (ACM MM) 2022, Poster Presentation







Download

Videos at ACM MM'2022

Analysis

GitHub Repo



Abstract

Recognizing human actions from point cloud videos has attracted tremendous attention from both academia and industry due to its wide range of applications, such as autonomous driving, robotics, and human-computer interaction. However, current methods for point cloud action recognition usually require a huge amount of manually annotated data and a complex backbone network with high computation cost, which makes them impractical for real-world applications. This paper therefore considers the task of semi-supervised point cloud action recognition with an efficient model. To this end, we propose a Masked Pseudo-Labeling autoEncoder (MAPLE) framework that learns effective representations for point cloud action recognition with far fewer annotations. In particular, we design a novel and efficient Decoupled spatial-temporal TransFormer (DestFormer) as the backbone of MAPLE. In DestFormer, the spatial and temporal dimensions of the 4D point cloud videos are decoupled to achieve efficient self-attention for learning both long-term and short-term features. Moreover, to learn discriminative features from fewer annotations, we design a masked pseudo-labeling autoencoder structure that guides the DestFormer to reconstruct the features of masked frames from the visible frames. More importantly, for unlabeled data, we exploit the pseudo-labels from the classification head as the supervision signal for reconstructing the features of masked frames. Finally, comprehensive experiments demonstrate that MAPLE achieves superior results on three public benchmarks and outperforms the state-of-the-art method by 8.08% in accuracy on the MSR-Action3D dataset.
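For orientation, here is a minimal sketch of how the two training signals described above could be combined in a semi-supervised loop. The loss weight `lam` and the batch composition are our assumptions for illustration, not the paper's exact recipe; the masked pseudo-labeling loss itself is sketched under the MAPLE architecture below.

```python
import torch.nn.functional as F

def maple_total_loss(logits_labeled, labels, recon_loss, lam=1.0):
    """Combine the two training signals described above.

    logits_labeled: class logits for the labeled clips.
    recon_loss:     masked pseudo-labeling loss on unlabeled clips
                    (sketched under the MAPLE architecture below).
    lam:            assumed loss weight; the paper's schedule may differ.
    """
    ce = F.cross_entropy(logits_labeled, labels)  # supervised cross-entropy
    return ce + lam * recon_loss                  # joint semi-supervised objective
```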



5-Minute presentation video (TBD)




Architecture of DestFormer


Details of DestFormer. (a) Data Preparation: we construct local areas (e.g., ``a'') on adjacent frames (e.g., ``t1'', ``t2'') from the input `x_i`, as P4Conv does. (b) Spatial Extractor: we adopt P4Conv to model short-term local information and feed its output `s_i` frame by frame into a spatial transformer that extracts the merged local feature `m_i`. (c) Temporal Aggregator: we generate the short-term global feature `g_i` with a pooling layer and aggregate long-term global information with the temporal encoder. (d) Prediction Head: we project the global feature `v_i` into the label space via the classification head.
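To make the decoupling concrete, below is a minimal PyTorch sketch of the spatial-extractor/temporal-aggregator split, starting from the P4Conv output `s_i`. The P4Conv module itself, the mean pooling, and all layer sizes are illustrative assumptions, not the released implementation. The payoff of decoupling is that self-attention over T frames with N local areas each costs on the order of T·N² + T² instead of (T·N)².

```python
import torch
import torch.nn as nn

class DestFormerSketch(nn.Module):
    """Minimal sketch of the decoupled spatial-temporal design.

    Input: per-frame local features s_i of shape (B, T, N, C), e.g. the
    P4Conv output over N local areas in each of T frames. Layer sizes and
    mean pooling are illustrative choices.
    """

    def __init__(self, dim=256, heads=8, num_classes=20):
        super().__init__()
        # Spatial transformer: attends over the N local areas of one frame.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2)
        # Temporal encoder: attends over the T short-term global features.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dim, num_classes)  # classification head

    def forward(self, s):                            # s: (B, T, N, C)
        B, T, N, C = s.shape
        m = self.spatial(s.reshape(B * T, N, C))     # merged local features m_i
        g = m.mean(dim=1).reshape(B, T, C)           # pooled short-term global g_i
        v = self.temporal(g).mean(dim=1)             # long-term global feature v_i
        return self.head(v)                          # logits over action classes
```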



Architecture of MAPLE


Details of MAPLE. (a) Supervised training: we adopt our DestFormer backbone with the cross-entropy loss. (b) The complete training process of MAPLE: (1) The spatial extractor encodes the input video into the short-term global features `g_i`. (2) After randomly masking a subset of the short-term global features `g_i`, the temporal encoder projects the visible subset into the latent representation `z_i`. (3) The temporal decoder reconstructs `r_i` from the latent representation `z_i` and the mask tokens `M`. (4) The classification head generates the pseudo-labels `P_i` and `\hat{P}_i` as our reconstruction targets. Note that modules with the same color share weights.
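Below is a hedged sketch of step (b) for one unlabeled batch, following the MAE-style random masking recipe [1]. The module names are placeholders for the corresponding MAPLE components, and the KL divergence between pooled pseudo-label distributions is an illustrative stand-in for the paper's exact reconstruction objective.

```python
import torch
import torch.nn.functional as F

def maple_unlabeled_step(g, temporal_encoder, temporal_decoder, cls_head,
                         mask_token, mask_ratio=0.75):
    """Sketch of one masked pseudo-labeling step on an unlabeled batch.

    g: short-term global features (B, T, C) from the spatial extractor.
    mask_token: a learnable (1, 1, C) parameter playing the role of M.
    """
    B, T, C = g.shape
    num_visible = max(1, int(T * (1 - mask_ratio)))

    # (2) Randomly keep a visible subset of the T frame features.
    noise = torch.rand(B, T, device=g.device)
    ids_shuffle = noise.argsort(dim=1)               # random permutation per clip
    ids_keep = ids_shuffle[:, :num_visible]
    g_visible = torch.gather(g, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
    z = temporal_encoder(g_visible)                  # latent representation z_i

    # (3) Append mask tokens M, restore the original frame order,
    #     and reconstruct with the temporal decoder.
    tokens = torch.cat([z, mask_token.expand(B, T - num_visible, C)], dim=1)
    ids_restore = ids_shuffle.argsort(dim=1)
    tokens = torch.gather(tokens, 1, ids_restore.unsqueeze(-1).expand(-1, -1, C))
    r = temporal_decoder(tokens)                     # reconstructed features r_i

    # (4) Pseudo-labels P_i from the full features supervise the predictions
    #     \hat{P}_i obtained from the reconstruction (no gradient to P_i).
    with torch.no_grad():
        pseudo = cls_head(g.mean(dim=1)).softmax(dim=-1)
    pred = cls_head(r.mean(dim=1)).log_softmax(dim=-1)
    return F.kl_div(pred, pseudo, reduction='batchmean')
```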



Evaluation and Visualization

The results of MAPLE

The semi-supervised results of MAPLE on MSR-Action3D and NTU RGB+D 60

MSR-Action3D

NTU RGB+D 60



Ablation Study

The influence of the masking ratio and the depth of the temporal decoder.

Classification accuracy on the 5%-labeled NTU RGB+D 60 dataset with different masking ratios.

Classification accuracy on the 5%-labeled NTU RGB+D 60 dataset with different depths of the temporal decoder.



Download

The semi-supervised splits of each dataset



MSR-Action3D
NTU RGB+D 60
NTU RGB+D 120

Updates

[30/06/2022] We added new subsections to track updates and answer FAQs.

FAQs


Q1: Explain the trend in Figure 6.

A1: As discussed in Section 4.5, a 75% masking ratio achieves the peak classification performance. The trend of the plot in Figure 6 can be explained in terms of information density [1]. As with Masked AutoEncoders (MAE) for image classification and video recognition [1][2][3], point cloud videos tend to have severe spatial-temporal redundancy. Thanks to this redundancy, the masked short-term features can be reconstructed from the small set of visible features given a high-level understanding of human actions.
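As a concrete illustration of the redundancy argument (the clip length T = 24 is an assumed value, not from the paper), the snippet below shows how the masking ratio trades off the number of visible tokens against the encoder's quadratic self-attention cost:

```python
# Illustrative only: visible-token count and relative self-attention cost
# for a hypothetical clip of T = 24 short-term features.
T = 24
for ratio in (0.25, 0.50, 0.75, 0.90):
    visible = int(T * (1 - ratio))
    rel_cost = (visible / T) ** 2  # self-attention scales quadratically
    print(f"mask {ratio:.0%}: {visible} visible tokens, "
          f"~{rel_cost:.1%} of the full attention cost")
```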

Q2: Why do the figures shown in the appendix use the L2 norm instead of the optimization function used in the paper? Does the exploding and vanishing problem only occur when using the L2 norm?

A2:
(1) The figure in the appendix is meant to explore the stability of network training under different training strategies. The exploding and vanishing problem is not the same as gradient explosion and gradient vanishing: it refers to the difference in feature magnitude under each training strategy.
(2) Whether the optimization function or the norm is used, the exploding and vanishing problem always occurs and often leads to a decrease in classification accuracy. Compared to the optimization function, the norm reflects the training process more intuitively and effectively.
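A simple way to observe this feature-magnitude behaviour in practice is to log the L2 norm of intermediate features during training. The hook below is a generic PyTorch sketch (the module name `temporal_encoder` is hypothetical), not code from the MAPLE repository:

```python
def log_feature_norm(name):
    """Forward hook: print the mean L2 norm of a module's output so that
    feature-magnitude explosion or vanishing becomes visible over training."""
    def hook(module, inputs, output):
        norm = output.detach().flatten(1).norm(p=2, dim=1).mean()
        print(f"{name}: mean feature L2 norm = {norm.item():.4f}")
    return hook

# Usage (module name is hypothetical): watch the latent representation z_i.
# model.temporal_encoder.register_forward_hook(log_feature_norm("z_i"))
```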



References:
[1] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross B. Girshick: Masked Autoencoders Are Scalable Vision Learners. CoRR abs/2111.06377 (2021)
[2] Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He: Masked Autoencoders As Spatiotemporal Learners. CoRR abs/2205.09113 (2022)
[3] Zhan Tong, Yibing Song, Jue Wang, Limin Wang: VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. CoRR abs/2203.12602 (2022)



Paper

Chen, Liu, Liu, Zhang, Han, Mei.
MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition
In ACM MM, 2022 (Poster).
(arXiv)



Cite

@inproceedings{chen2022MAPLE,
  title={MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition},
  author={Chen, Xiaodong and Liu, Xinchen and Liu, Wu and Zhang, Yongdong and Han, Jungong and Mei, Tao},
  booktitle={ACM Multimedia (ACM MM)},
  year={2022}
}
				



Acknowledgements

This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.
This work was done when Xiaodong Chen was an intern at JD AI Research.



Contact

For further questions and suggestions, please contact Xiaodong Chen (cxd1230@mail.ustc.edu.cn).