Murdoch University Research Repository

Welcome to the Murdoch University Research Repository

The Murdoch University Research Repository is an open access digital collection of research
created by Murdoch University staff, researchers and postgraduate students.

Learn more

A Multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling

Asif, U., Bennamoun, M. and Sohel, F. (2017) A Multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40 (9). pp. 2051-2065.

Link to Published Version:
*Subscription may be required


While deep convolutional neural networks have shown a remarkable success in image classification, the problems of inter-class similarities, intra-class variances, the effective combination of multimodal data, and the spatial variability in images of objects remain to be major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multimodal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance - this is achieved by combining a spatial transformer network with a deep convolutional neural network to learn features which are invariant to spatial translations, rotations, and scale changes, 2) high discriminative capability - this is achieved by introducing Fisher encoding within the CNN architecture to learn features which have small inter-class similarities and large intra-class compactness, and 3) multimodal hierarchical fusion - this is achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., imageand pixel-levels), and fused into a Conditional Random Field (CRF)- based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and live video streams (acquired from Kinect) show that our framework produces superior object and scene classification results compared to the state-of-the-art methods.

Item Type: Journal Article
Murdoch Affiliation(s): School of Engineering and Information Technology
Publisher: IEEE
Copyright: © 2017 IEEE
Item Control Page Item Control Page