Executive Summary – 1b. Exploring Modalities
Key Concepts
- Modality Landscape: Beyond vision, modern AI leverages text, audio, depth and more. Each modality has its own sampling rate, dimensionality and noise profile.
- Representation Learning: Raw sensor streams are transformed into latent embeddings via modality-specific encoders (CNNs for images, spectrogram CNNs for audio, transformers for text).
- Alignment Objective: A contrastive loss (e.g. CLIP-style) pulls related modality pairs together and pushes mismatches apart.
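The alignment objective above can be sketched as a symmetric InfoNCE loss over a batch of matched pairs. This is a minimal numpy illustration of the CLIP-style idea, not the notebook's actual training code; function and parameter names (`clip_contrastive_loss`, `temperature`) are placeholders:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays where row i of each is a matched
    pair. A real pipeline would use a framework's autograd version of this.
    """
    # L2-normalise so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matched pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # cross-entropy in both directions: image -> text and text -> image
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this loss pulls each matched pair together (high diagonal similarity) while pushing every mismatched pair in the batch apart.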
Workflow
- Inspect & visualise RGB images, LiDAR depth maps and textual captions.
- Pre-process each modality (normalise pixels, quantise point-clouds, tokenise text).
- Train stand-alone encoders to obtain image, depth and text embeddings.
- Apply contrastive alignment to create a shared multimodal space.
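The per-modality pre-processing step in the workflow above might look like the following sketch. The helper names and the specific choices (divide-by-255, voxel size, whitespace tokenisation) are illustrative assumptions, not the notebook's exact routines:

```python
import numpy as np

def normalise_pixels(img):
    """Scale uint8 RGB values into [0, 1] floats for the image encoder."""
    return img.astype(np.float32) / 255.0

def quantise_points(points, voxel_size=0.1):
    """Snap (N, 3) LiDAR points onto a voxel grid and drop duplicates,
    giving the depth encoder a fixed-resolution input."""
    voxels = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(voxels, axis=0)

def tokenise(text, vocab):
    """Whitespace tokenisation mapping out-of-vocabulary words to id 0."""
    return [vocab.get(tok, 0) for tok in text.lower().split()]
```

Each helper feeds its modality's stand-alone encoder in the next workflow step.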
Results
| Embedding Pair | Recall@1 |
|---|---|
| Image → Text | 77% |
| Depth → Image | 68% |
| Text → Depth | 63% |
Aligned embeddings enable zero-shot cross-modal retrieval of scene descriptions.
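The Recall@1 numbers in the table can be computed with a simple nearest-neighbour check over the aligned embeddings. A minimal sketch, assuming row i of the query and gallery arrays form a matched pair:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """Fraction of queries whose nearest gallery item under cosine
    similarity is the correct match (same row index)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)       # index of most similar gallery row
    return float((nearest == np.arange(len(q))).mean())
```

The same function evaluates any direction of the table (image→text, depth→image, text→depth) by swapping which modality's embeddings act as queries versus gallery.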
Practical Insights
- Audio & depth benefit from log-scale normalisation prior to CNN encoding.
- A shared projection head (vs per-modality heads) simplifies deployment but may slightly hurt specialised performance.
- Evaluate with cross-modal retrieval tasks, not just classification accuracy.
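The log-scale normalisation recommended for audio and depth in the first insight might be implemented as below. This is an illustrative sketch (`log_normalise` and `eps` are assumed names, not the notebook's API); `log1p` compresses the heavy-tailed range of raw magnitudes before standardisation:

```python
import numpy as np

def log_normalise(x, eps=1e-6):
    """Compress heavy-tailed depth/audio magnitudes with log1p, then
    standardise to zero mean and unit variance for the CNN encoder.
    eps guards against division by zero on constant inputs."""
    y = np.log1p(np.abs(x))
    return (y - y.mean()) / (y.std() + eps)
```

Without the log step, a few large depth readings (or loud audio samples) dominate the dynamic range and the standardised values of everything else collapse toward zero.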
The notebook contains interactive widgets to play audio clips and visualise retrieved images side-by-side.