Executive Summary – 1a. Early and Late Fusion
Key Concepts
- Multimodal learning combines complementary sensors (RGB images + LiDAR point clouds) so a model can reason about both colour/texture and precise geometry.
- Fusion Strategy determines where the modalities meet inside the network.
Strategy | Where They Merge | Pros | Cons
---|---|---|---
Early Fusion | Before feature extraction | Captures cross-modal cues → top accuracy | Requires tight spatial alignment & heavier model
Late Fusion | Near classifier head | Modular; sensors can fail independently | Risks missing joint correlations
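The two strategies in the table can be sketched as minimal PyTorch modules. This is an illustrative sketch, not the notebook's actual architecture: it assumes the LiDAR point cloud has already been projected into a 2-channel image-aligned map (e.g. depth + intensity), and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate raw modalities channel-wise BEFORE any feature extraction.
    Assumes LiDAR is pre-projected to a 2-channel map aligned with the RGB grid
    (a hypothetical preprocessing step, not specified in the summary)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(          # one shared encoder over the fused input
            nn.Conv2d(3 + 2, 32, 3, padding=1), # 3 RGB channels + 2 LiDAR channels
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, rgb, lidar_map):
        x = torch.cat([rgb, lidar_map], dim=1)  # merge at the input
        return self.head(self.backbone(x).flatten(1))

class LateFusion(nn.Module):
    """Independent encoder per modality; features merge near the classifier head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lidar_enc = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, rgb, lidar_map):
        feats = torch.cat([self.rgb_enc(rgb), self.lidar_enc(lidar_map)], dim=1)
        return self.head(feats)
```

Note how late fusion keeps the two encoders independent: if one sensor drops out, its branch can be replaced or zeroed without retraining the other.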
Workflow
- Load & augment RGB + LiDAR data.
- Train single-modal baselines (RGB-ResNet, LiDAR-PointNet).
- Implement fusion networks (early & late).
- Compare accuracy, convergence speed & saliency maps.
Results
Model | Test Accuracy
---|---
RGB baseline | 81 %
LiDAR baseline | 74 %
Late Fusion | 86 %
Early Fusion | 88 %
Early fusion wins on accuracy but costs ≈ 15 % more FLOPs.
Practical Insights
- Early fusion demands pixel-accurate calibration; even 1-2 px shifts hurt.
- Late fusion keeps working if a sensor drops out, great for field robotics.
- Consider intermediate (mid-level) fusion or transformer cross-attention for a balance of accuracy and compute.
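The cross-attention option in the last bullet can be sketched with PyTorch's built-in `nn.MultiheadAttention`: RGB feature tokens query LiDAR feature tokens, so each modality keeps its own encoder (as in late fusion) while still learning joint correlations (as in early fusion). Token counts and the embedding dimension below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Mid-level fusion sketch: RGB tokens attend to LiDAR tokens.
    `dim` and the token counts are placeholder assumptions."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, lidar_tokens):
        # Query = RGB features; Key/Value = LiDAR features.
        fused, _ = self.attn(rgb_tokens, lidar_tokens, lidar_tokens)
        # Residual connection keeps the RGB path intact if LiDAR is uninformative.
        return self.norm(rgb_tokens + fused)
```

Because the residual path passes RGB features through unchanged, a degraded LiDAR stream degrades gracefully rather than corrupting the fused representation, which addresses the sensor-dropout concern above.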
The notebook includes an animated GIF that spins a LiDAR point-cloud, colour-coded by model confidence, to visualise attention.