NVIDIA DLI Course

Building AI Agents with Multimodal Models

A comprehensive showcase of advanced multimodal techniques for
combining vision, language, and structured data

Course Notebooks

1a. Early and Late Fusion

Multimodal models are simple in concept but surprisingly complex in practice. This notebook compares different data types and demonstrates early and late fusion on a robotics use case.

Learning Objectives

Explore the properties of LiDAR data
Construct and compare single-modal RGB and LiDAR models
Build a late-fusion multimodal model
Build an early-fusion multimodal model

View Notebook Findings & Insights
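To make the early-versus-late distinction from notebook 1a concrete, here is a minimal PyTorch sketch. It is not the course's code: the layer sizes, the class names `EarlyFusion` and `LateFusion`, the helper `make_encoder`, and the assumption that the LiDAR data arrives as a single-channel depth image aligned with the RGB frame are all illustrative choices.

```python
import torch
import torch.nn as nn


def make_encoder(in_channels: int, out_dim: int = 32) -> nn.Sequential:
    """Tiny convolutional encoder; the layer sizes are illustrative only."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # collapse the spatial dimensions to 1x1
        nn.Flatten(),             # shape (batch, 8)
        nn.Linear(8, out_dim),
    )


class EarlyFusion(nn.Module):
    """Concatenate the raw inputs along the channel axis, then use one encoder."""

    def __init__(self, rgb_channels=3, lidar_channels=1, num_outputs=3):
        super().__init__()
        self.encoder = make_encoder(rgb_channels + lidar_channels)
        self.head = nn.Linear(32, num_outputs)

    def forward(self, rgb, lidar):
        x = torch.cat([rgb, lidar], dim=1)  # fuse before any learned layer
        return self.head(self.encoder(x))


class LateFusion(nn.Module):
    """Encode each modality separately, then concatenate the feature vectors."""

    def __init__(self, rgb_channels=3, lidar_channels=1, num_outputs=3):
        super().__init__()
        self.rgb_encoder = make_encoder(rgb_channels)
        self.lidar_encoder = make_encoder(lidar_channels)
        self.head = nn.Linear(32 + 32, num_outputs)

    def forward(self, rgb, lidar):
        feats = torch.cat(
            [self.rgb_encoder(rgb), self.lidar_encoder(lidar)], dim=1
        )  # fuse after each modality has its own representation
        return self.head(feats)


# Hypothetical batch: 4 RGB frames and 4 single-channel LiDAR depth maps.
rgb = torch.randn(4, 3, 64, 64)
lidar = torch.randn(4, 1, 64, 64)
early_out = EarlyFusion()(rgb, lidar)
late_out = LateFusion()(rgb, lidar)
```

Both models accept the same inputs and produce outputs of the same shape. The only difference is where the modalities are joined: at the input for early fusion, at the feature level for late fusion.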

1b. Exploring Modalities

Investigate individual sensor streams (RGB, depth and text) and learn how to encode and align them into a shared embedding space.

Learning Objectives

Visualise and pre-process RGB, depth and text data
Train modality-specific encoders
Align embeddings with a contrastive loss
Evaluate cross-modal retrieval performance

View Notebook Findings & Insights
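The alignment step in notebook 1b can be sketched with a symmetric, CLIP-style contrastive loss. This is one common formulation and may differ from the one the notebook uses; the function name `contrastive_loss` and the temperature value are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Row i of emb_a and row i of emb_b are assumed to describe the same
    sample (for example an RGB frame and its depth map). The temperature
    is an illustrative default, not a value taken from the course.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    loss_a = F.cross_entropy(logits, targets)      # match each a to its b
    loss_b = F.cross_entropy(logits.t(), targets)  # match each b to its a
    return (loss_a + loss_b) / 2


torch.manual_seed(0)
x = torch.randn(8, 16)
aligned = contrastive_loss(x, x)                      # pairs agree perfectly
misaligned = contrastive_loss(x, x.roll(1, dims=0))   # pairs shifted by one
```

The loss is low when each embedding is most similar to its own partner and high when the pairs are shuffled, which is the signal that pulls the modality-specific encoders into a shared space.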

Course Overview

This NVIDIA Deep Learning Institute course explores how AI systems combine multiple data modalities. Through hands-on notebooks and practical exercises, I've gained expertise in building systems that can process, understand, and generate content across text, images, audio, and other data types.

The course progresses from foundational fusion techniques to advanced vector search systems with graph-based retrieval-augmented generation, providing both theoretical understanding and practical implementation skills for building sophisticated multimodal AI agents.

Course Progression

Fundamentals of Multimodal Fusion - Early and late fusion techniques
Modality Exploration - Understanding different data types
Advanced Integration - Intermediate fusion and contrastive learning
Specialized Techniques - Projection methods and OCR pipelines
Production Systems - Vector search and Graph RAG implementation

Course Structure

8 comprehensive notebooks spanning theory and practical implementation

Technologies

PyTorch, TensorRT, NVIDIA NeMo, Triton Inference Server

Key Skills

Multimodal fusion, LLM integration, vector search, Graph RAG

Certification

NVIDIA Deep Learning Institute certified - June 2025

Skills Acquired

Multimodal Fusion Techniques

Early, Late, and Intermediate approaches

Contrastive Learning

Cross-modal representation alignment

Vector Search Systems

Fast similarity search for multimodal data

Graph RAG Integration

Knowledge graphs with retrieval augmentation

OCR Pipelines

Text extraction from visual data

Projection Methods

Dimensionality reduction for multimodal data
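As a small illustration of the vector search skill listed above, the following NumPy sketch does brute-force cosine-similarity search over a set of embeddings. It is an assumption-laden toy, not the course's implementation: the helpers `build_index` and `search` are hypothetical names, and a production system would typically use an approximate nearest-neighbour index or a vector database instead.

```python
import numpy as np


def build_index(embeddings):
    """L2-normalise the embeddings so a dot product equals cosine similarity."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)


def search(index, query, k=3):
    """Return the indices and scores of the k most similar stored vectors."""
    query = np.asarray(query, dtype=np.float32)
    query = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = index @ query            # cosine similarity to every stored vector
    top = np.argsort(-scores)[:k]     # highest scores first
    return top, scores[top]


# Hypothetical data: 100 random 16-dimensional embeddings.
rng = np.random.default_rng(0)
index = build_index(rng.normal(size=(100, 16)))
ids, scores = search(index, index[42], k=3)
```

Because the embeddings from any modality live in the same vector space once aligned, the same search routine can serve text-to-image, image-to-text, or any other cross-modal query.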

NVIDIA DLI Certification

Successfully completed NVIDIA's Deep Learning Institute certification in "Building AI Agents with Multimodal Models" with hands-on expertise in multimodal fusion, vector search systems, and graph-based RAG.

View Certificate