NVIDIA DLI Course

Building AI Agents with Multimodal Models

A comprehensive showcase of advanced multimodal techniques for
combining vision, language, and structured data

Course Notebooks

1a. Early and Late Fusion

Multimodal models are simple in concept but surprisingly complex in practice. This notebook compares different data types and demonstrates early and late fusion techniques using a robotics use case.

Learning Objectives

  • Explore the properties of LiDAR data
  • Construct and compare single-modal RGB and LiDAR models
  • Build a late-fusion multimodal model
  • Build an early-fusion multimodal model
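The difference between the two fusion styles in the objectives above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the course's notebook code: it assumes the LiDAR data arrives as a single-channel, image-like tensor with the same height and width as the RGB frame, and the helper make_encoder, the layer sizes, and the two-value output are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


def make_encoder(in_channels: int) -> nn.Sequential:
    """Tiny CNN mapping an image-like tensor to a 32-dim feature vector (illustrative sizes)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # (N, 16, 1, 1)
        nn.Flatten(),             # (N, 16)
        nn.Linear(16, 32),
        nn.ReLU(),
    )


class EarlyFusion(nn.Module):
    """Concatenate the raw inputs along the channel axis, then run one shared network."""

    def __init__(self, rgb_channels=3, lidar_channels=1, num_outputs=2):
        super().__init__()
        self.encoder = make_encoder(rgb_channels + lidar_channels)
        self.head = nn.Linear(32, num_outputs)

    def forward(self, rgb, lidar):
        x = torch.cat([rgb, lidar], dim=1)  # (N, 4, H, W)
        return self.head(self.encoder(x))


class LateFusion(nn.Module):
    """Encode each modality with its own network, then merge the feature vectors."""

    def __init__(self, rgb_channels=3, lidar_channels=1, num_outputs=2):
        super().__init__()
        self.rgb_encoder = make_encoder(rgb_channels)
        self.lidar_encoder = make_encoder(lidar_channels)
        self.head = nn.Linear(32 + 32, num_outputs)

    def forward(self, rgb, lidar):
        feats = torch.cat(
            [self.rgb_encoder(rgb), self.lidar_encoder(lidar)], dim=1
        )  # (N, 64)
        return self.head(feats)


# Random stand-in data: a batch of 4 RGB frames and matching LiDAR maps.
rgb = torch.randn(4, 3, 64, 64)
lidar = torch.randn(4, 1, 64, 64)
early_out = EarlyFusion()(rgb, lidar)
late_out = LateFusion()(rgb, lidar)
```

Both models produce an output of the same shape; what changes is where the modalities meet. Early fusion lets the first convolution see RGB and LiDAR together, while late fusion keeps the encoders independent until the final layer.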

1b. Exploring Modalities

Investigate individual sensor streams (RGB, depth and text) and learn how to encode and align them into a shared embedding space.

Learning Objectives

  • Visualise and pre-process RGB, depth and text data
  • Train modality-specific encoders
  • Align embeddings with a contrastive loss
  • Evaluate cross-modal retrieval performance
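The alignment and retrieval objectives above can be illustrated with a symmetric, CLIP-style contrastive loss. This is a sketch under stated assumptions rather than the notebook's actual loss: the function names, the temperature of 0.07, and the 32-dimensional embeddings are hypothetical, and random tensors stand in for the outputs of trained encoders. Row i of each batch is assumed to describe the same sample in two modalities.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric cross-entropy over pairwise cosine similarities.

    Row i of emb_a and row i of emb_b are treated as a matching pair;
    every other row in the batch acts as a negative.
    """
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_ab = F.cross_entropy(logits, targets)      # modality A -> B
    loss_ba = F.cross_entropy(logits.t(), targets)  # modality B -> A
    return (loss_ab + loss_ba) / 2


def top1_retrieval_accuracy(emb_a, emb_b):
    """Fraction of rows in emb_a whose nearest neighbour in emb_b is the true match."""
    sims = F.normalize(emb_a, dim=1) @ F.normalize(emb_b, dim=1).t()
    predicted = sims.argmax(dim=1)
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return (predicted == targets).float().mean().item()


torch.manual_seed(0)
rgb_emb = torch.randn(8, 32)             # stand-in for RGB encoder output
depth_emb = rgb_emb.clone()              # perfectly aligned partner embeddings
shuffled_emb = rgb_emb.roll(1, dims=0)   # every pair deliberately mismatched

aligned_loss = contrastive_loss(rgb_emb, depth_emb)
shuffled_loss = contrastive_loss(rgb_emb, shuffled_emb)
aligned_acc = top1_retrieval_accuracy(rgb_emb, depth_emb)
```

Minimising this loss pulls matching pairs together and pushes mismatched pairs apart, which is why the aligned batch scores a lower loss than the shuffled one. The same similarity matrix doubles as a retrieval score, so top-1 accuracy is a natural first evaluation metric.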

Course Overview

This NVIDIA Deep Learning Institute course explores how AI systems combine multiple data modalities. Through hands-on notebooks and practical exercises, I've gained expertise in building systems that can process, understand, and generate content across text, images, audio, and other data types.

The course progresses from foundational fusion techniques to advanced vector search systems with graph-based retrieval-augmented generation (Graph RAG), providing both the theoretical understanding and the practical implementation skills needed to build sophisticated multimodal AI agents.

Course Progression

  1. Fundamentals of Multimodal Fusion - Early and late fusion techniques
  2. Modality Exploration - Understanding different data types
  3. Advanced Integration - Intermediate fusion and contrastive learning
  4. Specialized Techniques - Projection methods and OCR pipelines
  5. Production Systems - Vector search and Graph RAG implementation

Course Structure

8 comprehensive notebooks spanning theory and practical implementation

Technologies

PyTorch, TensorRT, NVIDIA NeMo, Triton Inference Server

Key Skills

Multimodal fusion, LLM integration, vector search, Graph RAG

Certification

NVIDIA Deep Learning Institute certified - June 2025

Skills Acquired

Multimodal Fusion Techniques

Early, Late, and Intermediate approaches

Contrastive Learning

Cross-modal representation alignment

Vector Search Systems

Fast similarity search for multimodal data

Graph RAG Integration

Knowledge graphs combined with retrieval-augmented generation

OCR Pipelines

Text extraction from visual data

Projection Methods

Dimensionality reduction for multimodal data
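The vector search skill listed above comes down to ranking stored embeddings by their similarity to a query. The sketch below uses brute-force cosine similarity in NumPy; it is an assumption-laden illustration, not the course's implementation. The function names build_index and search are hypothetical, the embeddings are random stand-ins, and production systems typically replace the exhaustive scan with an approximate nearest-neighbour index.

```python
import numpy as np


def build_index(vectors):
    """L2-normalise the stored embeddings so a dot product equals cosine similarity."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def search(index, query, k=3):
    """Return the ids and scores of the k stored vectors most similar to the query."""
    query = np.asarray(query, dtype=np.float32)
    query = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = index @ query            # cosine similarity against every stored item
    top = np.argsort(-scores)[:k]     # highest scores first
    return top, scores[top]


# Random stand-ins for 100 stored embeddings of dimension 16.
rng = np.random.default_rng(0)
items = rng.normal(size=(100, 16))
index = build_index(items)

# Querying with a stored item should return that item as the best match.
ids, scores = search(index, items[42], k=3)
```

Because every modality is embedded into the same space, the same index can serve text-to-image, image-to-text, or any other cross-modal query without changes to the search code.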

NVIDIA DLI Certification

Completed the NVIDIA Deep Learning Institute certification for "Building AI Agents with Multimodal Models", with hands-on expertise in multimodal fusion, vector search systems, and graph-based RAG.
