
Description: Larger Deep Neural Networks (DNNs) are typically more powerful, but training models across multiple GPUs or multiple nodes is not trivial and requires an understanding of both AI and high-performance computing (HPC). In this workshop we will give an overview of techniques for overcoming the memory-footprint challenges of large models, including activation checkpointing, gradient accumulation, and various forms of data and model parallelism, and walk through some examples.
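
As a small taste of the material, below is a minimal sketch of one of the techniques covered, gradient accumulation, in PyTorch. The model, data, and hyperparameters are placeholders chosen for illustration, not workshop code.

```python
# Minimal gradient-accumulation sketch in PyTorch.
# The model, batch sizes, and data here are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4                                 # micro-batches per optimizer step

optimizer.zero_grad()
for step in range(accumulation_steps):
    x = torch.randn(8, 512, device="cuda")             # dummy micro-batch of inputs
    y = torch.randint(0, 10, (8,), device="cuda")      # dummy labels
    loss = loss_fn(model(x), y) / accumulation_steps   # scale so gradients average over the effective batch
    loss.backward()                                     # gradients accumulate in parameter .grad buffers
optimizer.step()                                        # single update for the whole effective batch
```

The effect is the same gradient as one large batch of 32, while only ever holding activations for a micro-batch of 8 in GPU memory.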
Teacher: Jonathan Dursi (NVIDIA)
Level: Intermediate/Advanced
Format: Lecture + Demo
Certificate: Attendance
Prerequisites:
- Familiarity with training models in PyTorch on a single GPU will be assumed.