Executive Summary – 1b. Exploring Modalities
Key Concepts
- Modality Landscape: Beyond vision, modern AI leverages text, audio, depth and more. Each modality has its own sampling rate, dimensionality and noise profile.
- Representation Learning: Raw sensor streams are transformed into latent embeddings via modality-specific encoders (CNNs for images, spectrogram CNNs for audio, transformers for text).
- Alignment Objective: A contrastive loss (e.g. CLIP-style) pulls related modality pairs together and pushes mismatches apart.
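The alignment objective above can be sketched as a symmetric InfoNCE loss over a batch of matched pairs. This is a minimal numpy illustration of the CLIP-style idea, not the notebook's actual training code; function and parameter names (`clip_contrastive_loss`, `temperature`) are placeholders:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays where row i of each is a matched
    pair. A real pipeline would use a framework's autograd version of this.
    """
    # L2-normalise so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matched pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # cross-entropy in both directions: image -> text and text -> image
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this loss pulls each matched pair together (high diagonal similarity) while pushing every mismatched pair in the batch apart.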
Workflow
- Inspect & visualise RGB images, LiDAR depth maps and textual captions.
- Pre-process each modality (normalise pixels, quantise point-clouds, tokenise text).
- Train stand-alone encoders to obtain image, depth and text embeddings.
- Apply contrastive alignment to create a shared multimodal space.
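The per-modality pre-processing step in the workflow above might look like the following sketch. The helper names and the specific choices (divide-by-255, voxel size, whitespace tokenisation) are illustrative assumptions, not the notebook's exact routines:

```python
import numpy as np

def normalise_pixels(img):
    """Scale uint8 RGB values into [0, 1] floats for the image encoder."""
    return img.astype(np.float32) / 255.0

def quantise_points(points, voxel_size=0.1):
    """Snap (N, 3) LiDAR points onto a voxel grid and drop duplicates,
    giving the depth encoder a fixed-resolution input."""
    voxels = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(voxels, axis=0)

def tokenise(text, vocab):
    """Whitespace tokenisation mapping out-of-vocabulary words to id 0."""
    return [vocab.get(tok, 0) for tok in text.lower().split()]
```

Each helper feeds its modality's stand-alone encoder in the next workflow step.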
Results
| Embedding Pair | Recall@1 |
|---|---|
| Image → Text | 77% |
| Depth → Image | 68% |
| Text → Depth | 63% |
Aligned embeddings enable zero-shot cross-modal retrieval of scene descriptions.
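The Recall@1 numbers in the table can be computed with a simple nearest-neighbour check over the aligned embeddings. A minimal sketch, assuming row i of the query and gallery arrays form a matched pair:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """Fraction of queries whose nearest gallery item under cosine
    similarity is the correct match (same row index)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)       # index of most similar gallery row
    return float((nearest == np.arange(len(q))).mean())
```

The same function evaluates any direction of the table (image→text, depth→image, text→depth) by swapping which modality's embeddings act as queries versus gallery.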
Practical Insights
- Audio & depth benefit from log-scale normalisation prior to CNN encoding.
- A shared projection head (vs per-modality heads) simplifies deployment but may slightly hurt specialised performance.
- Evaluate with cross-modal retrieval tasks, not just classification accuracy.
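The log-scale normalisation recommended for audio and depth in the first insight might be implemented as below. This is an illustrative sketch (`log_normalise` and `eps` are assumed names, not the notebook's API); `log1p` compresses the heavy-tailed range of raw magnitudes before standardisation:

```python
import numpy as np

def log_normalise(x, eps=1e-6):
    """Compress heavy-tailed depth/audio magnitudes with log1p, then
    standardise to zero mean and unit variance for the CNN encoder.
    eps guards against division by zero on constant inputs."""
    y = np.log1p(np.abs(x))
    return (y - y.mean()) / (y.std() + eps)
```

Without the log step, a few large depth readings (or loud audio samples) dominate the dynamic range and the standardised values of everything else collapse toward zero.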
The notebook contains interactive widgets to play audio clips and visualise retrieved images side-by-side.