
Section: New Results

Recognizing Pedestrians using Cross-Modal Convolutional Networks

Participants : Danut-Ovidiu Pop, Fawzi Nashashibi.

Pedestrian detection and recognition are of great importance for autonomous vehicles. A pedestrian detection system depends on: 1) the sensors used to capture the visual data, 2) the features extracted from the acquired images and 3) the classification process. Working with existing image datasets (Daimler, Caltech and KITTI), we have focused only on the last two points. Our question is whether one modality can be used exclusively (standpoint one) to train the classification model that recognizes pedestrians in another modality, or only partially (standpoint two) to improve the training of the classification model in another modality. If the system is trained on multi-modal data, can it still work when the data from one of the domains is missing? How much information is redundant across the domains (can we regenerate data in one domain from observations in the other domain)? How can a multi-modal system be trained when data in one of the modalities is scarce (e.g. many more images in the visual spectrum than depth images)? To our knowledge, these questions have not yet been answered for the pedestrian recognition task. Our work addresses them through various experiments based on the Daimler stereo vision dataset. This year, we performed the following experimental studies (more details can be found in [32], [33], [34]):

  1. Three different image modalities (intensity, depth and optical flow) are considered for improving the classification component. The classical training and cross training methods are analyzed. In the cross training method, the CNN is trained and validated on different image modalities, in contrast to the classical training method, in which each CNN is trained and validated on the same image modality.

  2. In [33], [34] we study how learning representations from one modality enables prediction for other modalities, which we refer to as cross-modality. Several approaches are proposed:

    a) A correlated model, where a single CNN is trained with intensity, depth and optical flow images for each frame.

    b) An incremental model, where a first CNN is trained on image frames of the first modality; a second CNN, initialized by transfer learning from the first one, is then trained on image frames of the second modality; and finally a third CNN, initialized from the second one, is trained on image frames of the last modality.

    c) A particular cross-modality model, where each CNN is trained on one modality, but tested on a different one.

  3. In [32] two different fusion schemes are studied:

    a) The early fusion model is built by concatenating the three image modalities (intensity, depth and optical flow) to feed a single CNN.

    b) The late fusion model fuses the output scores (the class probability estimates) of three independent CNNs, trained on intensity, depth and optical flow images respectively, by means of a classifier system.
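
The training protocols of study 1 and the two fusion schemes of study 3 can be illustrated at the level of data flow. The following is only a minimal sketch under stated assumptions, not the implementation used in [32], [33], [34]: the names `MODALITIES`, `training_pairs`, `early_fusion` and `late_fusion` are hypothetical, images are represented as plain nested lists, no CNN is actually trained, and the weighted average in `late_fusion` is a simple stand-in for the classifier system that fuses the scores.

```python
from itertools import product

# Hypothetical labels for the three image modalities discussed above.
MODALITIES = ("intensity", "depth", "flow")


def training_pairs(cross):
    """Return (training modality, validation modality) pairs.

    cross=False gives the classical setting (same modality for both);
    cross=True gives the cross training setting (different modalities).
    """
    return [(t, v) for t, v in product(MODALITIES, repeat=2)
            if (t != v) == cross]


def early_fusion(intensity, depth, flow):
    """Concatenate three single-channel H x W images into one
    H x W x 3 input that a single CNN could consume."""
    return [
        [[i, d, f] for i, d, f in zip(row_i, row_d, row_f)]
        for row_i, row_d, row_f in zip(intensity, depth, flow)
    ]


def late_fusion(scores, weights=None):
    """Fuse per-modality pedestrian probabilities into one score.

    Assumption: a weighted average replaces the classifier system
    mentioned in the text; weights default to uniform.
    """
    if weights is None:
        weights = {m: 1.0 for m in scores}
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total


if __name__ == "__main__":
    print(training_pairs(cross=False))
    print(training_pairs(cross=True))
    print(early_fusion([[1, 2]], [[3, 4]], [[5, 6]]))
    print(late_fusion({"intensity": 0.9, "depth": 0.6, "flow": 0.3}))
```

In this sketch, early fusion changes only the input of a single network, whereas late fusion keeps three independent networks and combines their outputs. In practice, the averaging step would be replaced by a classifier trained on the three scores.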