Executive Summary – 1b. Exploring Modalities

Key Concepts

  - Modality-specific encoders for RGB images, LiDAR depth maps and text
  - Contrastive alignment into a shared multimodal embedding space
  - Zero-shot cross-modal retrieval

Workflow

  1. Inspect & visualise RGB images, LiDAR depth maps and textual captions.
  2. Pre-process each modality (normalise pixels, quantise point-clouds, tokenise text).
  3. Train stand-alone encoders to obtain image, depth and text embeddings.
  4. Apply contrastive alignment to create a shared multimodal space.
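Step 4 can be sketched as a CLIP-style symmetric InfoNCE objective over a batch of paired embeddings. This is an illustrative numpy implementation under that assumption, not the notebook's actual training code; the `temperature` value is a common default, not one taken from the notebook.

```python
import numpy as np

def l2_normalise(x):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matching (image, text) pairs.

    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pushes each embedding towards its partner and away from all others.
    """
    img = l2_normalise(img_emb)
    txt = l2_normalise(txt_emb)
    logits = img @ txt.T / temperature        # (B, B) cosine similarities
    labels = np.arange(len(logits))           # true pairs are on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image→text and text→image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The same loss form extends to the depth modality by adding image↔depth and text↔depth terms over the corresponding embedding pairs.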

Results

  Embedding Pair    Recall@1
  Image → Text      77 %
  Depth → Image     68 %
  Text → Depth      63 %
Aligned embeddings enable zero-shot cross-modal retrieval of scene descriptions.
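Zero-shot cross-modal retrieval in the aligned space amounts to nearest-neighbour search by cosine similarity; Recall@1 is the fraction of queries whose top match is the true pair. A minimal sketch (the helper name `recall_at_1` is illustrative, not from the notebook):

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """Recall@1 for paired embeddings: row i of the query set is the true
    match for row i of the gallery set."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)        # index of top cosine match
    return float((nearest == np.arange(len(q))).mean())
```

The same function covers all three table rows by swapping in the relevant pair of embedding matrices (image→text, depth→image, text→depth).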

Practical Insights


The notebook contains interactive widgets to browse retrieved captions and visualise retrieved images side-by-side.