Executive Summary – 1a. Early and Late Fusion
Key Concepts
- Multimodal learning combines complementary sensors (RGB images + LiDAR point clouds) so a model can reason about both colour/texture and precise geometry.
- Fusion Strategy determines where the modalities meet inside the network.
Strategy | Where They Merge | Pros | Cons
---|---|---|---
Early Fusion | Before feature extraction | Captures cross-modal cues → top accuracy | Requires tight spatial alignment & heavier model
Late Fusion | Near classifier head | Modular; sensors can fail independently | Risks missing joint correlations
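The two strategies in the table can be sketched as minimal PyTorch modules. This is an illustrative sketch, not the notebook's actual architecture: it assumes the LiDAR point cloud has already been projected into a 2-channel image-aligned map (e.g. depth + intensity), and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate raw modalities channel-wise BEFORE any feature extraction.
    Assumes LiDAR is pre-projected to a 2-channel map aligned with the RGB grid
    (a hypothetical preprocessing step, not specified in the summary)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(          # one shared encoder over the fused input
            nn.Conv2d(3 + 2, 32, 3, padding=1), # 3 RGB channels + 2 LiDAR channels
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, rgb, lidar_map):
        x = torch.cat([rgb, lidar_map], dim=1)  # merge at the input
        return self.head(self.backbone(x).flatten(1))

class LateFusion(nn.Module):
    """Independent encoder per modality; features merge near the classifier head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lidar_enc = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, rgb, lidar_map):
        feats = torch.cat([self.rgb_enc(rgb), self.lidar_enc(lidar_map)], dim=1)
        return self.head(feats)
```

Note how late fusion keeps the two encoders independent: if one sensor drops out, its branch can be replaced or zeroed without retraining the other.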
Workflow
- Load & augment RGB + LiDAR data.
- Train single-modal baselines (RGB-ResNet, LiDAR-PointNet).
- Implement fusion networks (early & late).
- Compare accuracy, convergence speed & saliency maps.
Results
Model | Test Accuracy
---|---
RGB baseline | 81 %
LiDAR baseline | 74 %
Late Fusion | 86 %
Early Fusion | 88 %
Early fusion wins on accuracy but costs ≈ 15 % more FLOPs.
Practical Insights
- Early fusion demands pixel-accurate calibration; even 1-2 px shifts hurt.
- Late fusion keeps working if a sensor drops out, great for field robotics.
- Consider intermediate (mid-level) fusion or transformer cross-attention for a balance of accuracy and compute.
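The cross-attention option in the last bullet can be sketched with PyTorch's built-in `nn.MultiheadAttention`: RGB feature tokens query LiDAR feature tokens, so each modality keeps its own encoder (as in late fusion) while still learning joint correlations (as in early fusion). Token counts and the embedding dimension below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Mid-level fusion sketch: RGB tokens attend to LiDAR tokens.
    `dim` and the token counts are placeholder assumptions."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, lidar_tokens):
        # Query = RGB features; Key/Value = LiDAR features.
        fused, _ = self.attn(rgb_tokens, lidar_tokens, lidar_tokens)
        # Residual connection keeps the RGB path intact if LiDAR is uninformative.
        return self.norm(rgb_tokens + fused)
```

Because the residual path passes RGB features through unchanged, a degraded LiDAR stream degrades gracefully rather than corrupting the fused representation, which addresses the sensor-dropout concern above.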
The notebook includes an animated GIF that spins a LiDAR point-cloud, colour-coded by model confidence, to visualise attention.