Before running any AI workloads on a GPU, you must make sure the operating system recognizes the GPU and that the software stack can talk to it properly.
That starts with:
Installing NVIDIA GPU drivers
Setting up a container runtime
Testing it using nvidia-smi or Docker containers
This is the foundation. If this part fails, nothing else will work—including Kubernetes or Slurm.
Before installing anything, make sure you have:
An NVIDIA GPU (Ampere, Volta, or newer preferred for AI)
A 64-bit Linux OS (Ubuntu 20.04/22.04 is most common)
At least 8–16 GB RAM and sufficient disk space
Network connectivity (for package installation and updates)
To check if the system sees your GPU, run:
lspci | grep -i nvidia
This should show a line like:
01:00.0 VGA compatible controller: NVIDIA Corporation GA100 [A100]
This driver enables Linux to communicate with the GPU hardware.
Without it, your system won’t know a GPU even exists.
Open a terminal and run:
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot
nvidia-driver-535 is a common version used with CUDA 12
You can change the version depending on your CUDA/toolkit compatibility
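If you are unsure which driver version to pick, Ubuntu's driver utility can suggest one (a quick check; assumes the ubuntu-drivers-common package is available on your distribution):
sudo apt install -y ubuntu-drivers-common
ubuntu-drivers devices    # lists detected GPUs and the recommended nvidia-driver package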
After reboot, test it with:
nvidia-smi
You should see output like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 A100-SXM4-40GB On | 00000000:3D:00.0 Off | 0 |
+-----------------------------------------------------------------------------+
If you see an error or command not found, the driver wasn’t installed correctly.
If your AI workloads use CUDA directly (TensorFlow, PyTorch, or custom C++/CUDNN apps), you may also need the CUDA Toolkit.
There are three common ways to install it:
Runfile installer from the NVIDIA site (flexible, but manual)
Package managers (DEB/RPM via apt or yum; see the example after this list)
Containers (most modern AI workflows use containers with CUDA pre-installed)
For container workflows, the CUDA Toolkit is often not required on the host, only in the container image.
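If you do need a host-level CUDA Toolkit, here is one package-manager sketch for Ubuntu 22.04 (repository setup mirrors NVIDIA's published CUDA download instructions; verify the exact keyring and package names for your OS and CUDA version):
# Add NVIDIA's CUDA repository keyring (Ubuntu 22.04 x86_64 shown; adjust for your distribution)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install the toolkit only (no driver), e.g. for CUDA 12.2
sudo apt install -y cuda-toolkit-12-2
# Confirm the compiler is reachable (you may need to add /usr/local/cuda/bin to PATH)
nvcc --version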
The NVIDIA Container Toolkit allows Docker to run containers that can access the GPU.
Without this, your AI containers won’t be able to see the GPU—even if nvidia-smi works on the host.
Run these step by step:
# Detect distribution
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
# Add NVIDIA's package repo
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install the toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Restart Docker
sudo systemctl restart docker
This is the final validation.
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
You should see the same GPU stats output as you saw earlier.
If you do, congratulations—your GPU host is ready for container-based AI workloads.
| Task | Command or Step |
|---|---|
| Verify GPU hardware | lspci \| grep -i nvidia |
| Install GPU driver | sudo apt install nvidia-driver-535, then verify with nvidia-smi |
| Install NVIDIA container toolkit | Add NVIDIA repo (curl), apt install nvidia-container-toolkit, restart Docker |
| Validate with test container | docker run --gpus all nvidia/cuda:12.0-base nvidia-smi |
Containers are lightweight environments that package:
An application (like your AI model)
Its dependencies (Python, libraries)
Runtime tools (CUDA, cuDNN)
They allow you to:
Run AI workloads consistently across machines
Avoid "it works on my laptop but not on the server"
Deploy jobs at scale in Kubernetes, Slurm, or Fleet Command
By default, Docker containers can’t access GPUs unless you:
Install the NVIDIA Container Toolkit
Configure Docker to use the NVIDIA runtime
This setup allows containers to run code on the GPU just like the host system.
| Tool | Notes |
|---|---|
| Docker | Most commonly used; works well with NVIDIA toolkit |
| Podman | Docker alternative; rootless by default; compatible with NVIDIA |
For simplicity, Docker is recommended for beginners and is what most AI workloads still use.
Already covered earlier, but here’s the logic behind each step:
NVIDIA packages are OS-specific, so we detect which version of Ubuntu/Debian we’re using:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
This tells your system where to find and download the container tools:
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
nvidia-container-toolkit includes:
NVIDIA runtime
NVIDIA hooks to connect containers with GPUs
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 A100-SXM4-40GB On | 00000000:17:00.0 Off | 0 |
+-----------------------------------------------------------------------------+
If it works, you now have:
Docker running
NVIDIA runtime active
A working GPU inside a container
To avoid having to specify --gpus all each time, modify Docker’s config:
sudo nano /etc/docker/daemon.json
Add or update this section:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Then restart Docker:
sudo systemctl restart docker
Now all containers will assume GPU access unless otherwise specified.
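Two quick checks (a sketch using the same CUDA base image as earlier) that the default runtime took effect:
docker info | grep -i 'default runtime'           # should report: Default Runtime: nvidia
docker run --rm nvidia/cuda:12.0-base nvidia-smi  # note: no --gpus flag needed now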
Use official NVIDIA containers from NGC (NVIDIA GPU Cloud) or Docker Hub:
| Image | Use Case |
|---|---|
| nvidia/cuda:12.0-base | Basic CUDA environment |
| nvcr.io/nvidia/pytorch:24.02-py3 | PyTorch with CUDA preinstalled |
| nvcr.io/nvidia/tensorflow:24.01-tf2-py3 | TensorFlow 2.x + CUDA |
| nvcr.io/nvidia/tritonserver:24.02-py3 | Inference with Triton |
You’ll need to register for NGC access to pull from nvcr.io.
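After creating an NGC account and generating an API key, log Docker in to the registry; the username is the literal string $oauthtoken and the password is your API key:
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>
docker pull nvcr.io/nvidia/pytorch:24.02-py3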
| Task | Command / Tool |
|---|---|
| Run container with GPU | docker run --gpus all nvidia/cuda:12.0-base nvidia-smi |
| Set default GPU runtime | Edit /etc/docker/daemon.json |
| Pull CUDA container | docker pull nvidia/cuda:12.0-base |
| Build custom AI image | Dockerfile + docker build -t mymodel . |
| Debug GPU access issues | Use nvidia-smi in host and inside the container |
Kubernetes is the standard platform for running containerized applications at scale, including:
AI model training
Inference services
Scheduled batch jobs
Multi-user GPU clusters
But Kubernetes does not support GPUs natively. You need to:
Install GPU drivers
Expose GPU hardware to Kubernetes
Set up monitoring tools
That’s where the NVIDIA GPU Operator and supporting components come in.
You can use any of the following to set up a basic cluster:
| Method | Use Case |
|---|---|
| kubeadm | Production or large-scale setups |
| microk8s | Lightweight, great for testing |
| minikube | Local testing only |
| RKE or k3s | Lightweight Kubernetes distros |
For multi-node production environments, kubeadm is recommended.
Example (single-node setup with kubeadm; assumes the Kubernetes apt repository is already configured):
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Apply a CNI (like Flannel or Calico), and then your cluster is ready.
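For example, Flannel can be applied from its published manifest, and a single-node cluster needs the control-plane taint removed so pods can schedule on it (URL and taint key as documented upstream; verify against current releases):
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
# Single-node only: allow workloads on the control-plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-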
The GPU Operator automates:
GPU driver installation
DCGM setup
Device plugin deployment
GPU monitoring tools (exporters, collectors)
You can install it using Helm or kubectl.
Prerequisites:
Helm installed
NVIDIA Container Toolkit installed on nodes
nvidia-smi working on nodes
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install --wait \
  --generate-name \
  -n gpu-operator \
  --create-namespace \
  nvidia/gpu-operator
This creates a new namespace gpu-operator and deploys:
| Component | Purpose |
|---|---|
| Driver container | Installs GPU drivers inside a container |
| Device plugin | Exposes GPUs to Kubernetes |
| DCGM exporter | Sends GPU metrics to Prometheus |
| Validator | Verifies if the node is fully GPU-ready |
Check the nvidia.com/gpu resource is available:
kubectl get nodes -o json | jq '.items[].status.allocatable'
Or:
kubectl describe node <node-name>
You should see:
Allocatable:
  nvidia.com/gpu: 1
This means your GPU is now visible to Kubernetes!
Example pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-container
      image: nvidia/cuda:12.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Deploy it with:
kubectl apply -f gpu-test.yaml
Check output with:
kubectl logs gpu-test
You should see the same nvidia-smi output from your host system.
| Component | Role |
|---|---|
| nvidia-device-plugin | Reports GPU hardware to the scheduler |
| nvidia-container-runtime | Allows GPU access in Docker-based containers |
| dcgm-exporter | Sends GPU telemetry to Prometheus |
| NVIDIA driver container | Installs the correct GPU driver version automatically |
All of these are installed automatically by the GPU Operator.
| Problem | Fix |
|---|---|
| Pod stuck in Pending | Node may not have available GPUs or device plugin not running |
| nvidia-smi fails inside container | Check if driver is correctly installed by operator |
| No GPU resource on node | Restart operator pods or revalidate node readiness |
| Metrics not appearing in Prometheus | Ensure dcgm-exporter is running and connected |
| Task | Command / File |
|---|---|
| Install GPU Operator | helm install nvidia/gpu-operator |
| Deploy GPU pod | Use resources.limits.nvidia.com/gpu in YAML |
| Check GPU availability | kubectl describe node |
| View container logs | kubectl logs pod-name |
| Watch GPU plugin status | kubectl get pods -n gpu-operator |
BCM is NVIDIA’s enterprise tool for managing AI clusters equipped with NVIDIA GPUs.
It provides a centralized way to:
Configure and monitor GPU nodes
Submit, track, and manage AI workloads
Integrate with Slurm for job scheduling
Visualize GPU health, utilization, and job status via CLI or Web UI
BCM is ideal for on-premises GPU clusters in:
Research centers
Data centers
Enterprise AI labs
| Feature | Description |
|---|---|
| Node registration | Add and manage GPU-enabled servers in the cluster |
| Slurm integration | Use Slurm to queue and dispatch AI jobs |
| GPU monitoring | Track health, utilization, and errors (via DCGM) |
| Access control | Assign users, roles, and resource quotas |
| Web UI & CLI support | Full management experience from either interface |
Before you install BCM, you must:
Use a supported OS: Ubuntu 20.04+, RHEL 8+, or CentOS 8+
Ensure NVIDIA GPU drivers are installed
Confirm nvidia-smi works
Install container runtime (Docker)
Example:
sudo apt update
sudo apt install -y nvidia-driver-535 docker.io
nvidia-smi
The BCM agent is installed on every node (head + worker).
You get the agent package from NVIDIA via:
Official ISO or installation script
Enterprise portal or DGX system image (if you have a DGX server)
Run the installer on each node.
On the head node:
bcminit
This command:
Configures BCM services
Registers the host in the BCM controller
Sets up directories and default configs
On worker nodes:
bcminit --join --controller-ip <head-node-ip>
This joins the node to the cluster.
By default, BCM runs a web interface on port 8443 or 443.
Access it at:
https://<head-node-ip>:8443
You can:
View all GPU nodes
Check usage and health
Submit jobs (through Slurm)
Add/remove users
Slurm is included with BCM and is the primary scheduler.
Check Slurm status:
sinfo
squeue
You can now submit AI jobs using:
sbatch train_model.sh
Where train_model.sh is your job script, like:
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --gres=gpu:1
python train.py
To check if everything is working:
bcmsystem status # Checks BCM services
sinfo # Shows node availability
squeue # Shows job queue
nvidia-smi # Confirms GPU visibility
Also, open the Web UI and confirm:
Nodes are listed
GPUs are shown with health data
Slurm jobs appear in the interface
BCM supports:
MIG-based GPU slicing
You can assign specific MIG instances per job or user (see the nvidia-smi sketch after this list).
DCGM integration
GPU telemetry is shown in the UI:
Temperatures
ECC errors
Utilization
Power draw
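As a reference for the MIG point above, partitioning itself is done on the node with nvidia-smi (a sketch; profile IDs differ by GPU model, and enabling MIG mode may require a GPU reset or reboot):
sudo nvidia-smi -i 0 -mig 1           # enable MIG mode on GPU 0
sudo nvidia-smi mig -i 0 -cgi 9,9 -C  # create two GPU instances (profile 9 = 3g.20gb on A100-40GB) with compute instances
nvidia-smi -L                         # list GPUs and the resulting MIG devices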
In multi-user environments:
Admins manage infrastructure and configs
Operators manage jobs and monitor resources
Users submit jobs, view logs, and access datasets
All permissions can be set via:
Web UI → User Settings
bcmusers CLI tools
| Skill | Description |
|---|---|
| Install BCM and register nodes | Use bcminit to configure both head and worker nodes |
| Access and use the Web UI | Manage cluster health, users, jobs visually |
| Use Slurm for job scheduling | Submit jobs using sbatch, monitor with squeue, sacct |
| Monitor GPU performance | View GPU stats via UI (DCGM), CLI (nvidia-smi), and Slurm logs |
| Manage users and roles | Create and configure roles for multi-tenant GPU clusters |
DOCA is NVIDIA’s software framework designed to run on BlueField DPUs (Data Processing Units). It provides:
High-performance networking acceleration
Security enforcement (Zero Trust security at the infrastructure level)
Data path offloading from the CPU
DOCA is not about running AI models directly, but about optimizing and securing AI infrastructure, especially when deploying across high-performance edge or data center environments.
| Component | Description |
|---|---|
| DOCA SDK | For developers to build custom apps that run on DPUs |
| DOCA Services | Pre-built containerized apps for networking, security, and storage offload |
| DOCA Runtime | Runtime engine for deploying and running DOCA applications |
| BlueField DPU | The hardware platform that runs DOCA (like a smart NIC or infrastructure CPU) |
| Use Case | How DOCA Helps |
|---|---|
| Secure inference at the edge | Enforce network isolation and deep packet inspection |
| Large-scale AI model deployment | Offload storage/network processing from host CPUs |
| Data pipeline acceleration (e.g. NVMe, PCIe) | Boost performance of data movement to/from GPU compute nodes |
| Regulatory compliance (Zero Trust) | Ensure infrastructure-level trust, encryption, and access control |
Imagine a hospital using AI models for image classification on patient data.
DOCA can:
Inspect network packets on the DPU before they reach the CPU or GPU
Run lightweight inference at the DPU level (e.g., for preliminary triage)
Ensure only encrypted traffic is allowed
This reduces latency, improves privacy, and offloads work from central nodes.
In data centers, AI workloads often pull data from:
NVMe over Fabrics
RDMA-enabled storage
Streaming video from remote sensors
DOCA can:
Accelerate this I/O pipeline directly on the DPU
Filter data packets to reduce CPU/GPU load
Preprocess or cache data closer to the GPU node
Most DOCA services are packaged as containers and run directly on the BlueField DPU, separate from the host CPU.
You interact with DOCA via command-line tools used to manage DPU configuration and services:
doca_app_manager list
doca_service_control start <service_name>
Many DOCA containers are hosted on NGC.
To deploy:
Use Fleet Command or container management system
Start services with systemctl or docker on BlueField OS
Connect to host applications or monitoring tools (like DCGM)
| Term | Meaning |
|---|---|
| SmartNIC | Network Interface Card with built-in compute (i.e., BlueField) |
| Zero Trust Security | All connections must be authenticated, authorized, and encrypted |
| PCIe Filtering | Block or redirect data moving through PCIe bus using policy |
| Telemetry Offload | Stream system monitoring data from DPU directly to remote observability tools |
| Task | How to Do It |
|---|---|
| Access DOCA runtime on BlueField | SSH into DPU OS or serial console |
| List DOCA containers or services | Use docker ps or doca_app_manager list |
| Enable a DOCA service | systemctl start doca-telemetry.service or equivalent |
| Monitor performance | Use NVIDIA NIM or Fleet Command monitoring |
| Integrate with host GPU monitoring | Stream DPU telemetry alongside DCGM metrics |
| Role in AI Infrastructure | Value It Brings |
|---|---|
| Data security at the edge | Keeps inference data private and compliant |
| Network acceleration | Offloads TCP/IP stack, reduces CPU/GPU contention |
| PCIe and storage path optimization | Speeds up data loading to GPUs for training |
| Telemetry collection and isolation | Improves observability and system resilience |
Fleet Command is NVIDIA’s cloud-based management platform for deploying and operating AI applications at the edge.
It allows you to:
Manage fleets of remote edge devices (e.g., Jetson, A100, BlueField systems)
Remotely deploy AI containers (from NGC or custom registries)
Monitor telemetry, health, and logs
Perform secure updates and troubleshooting
Ideal for industries like retail, healthcare, manufacturing, and logistics that need remote AI processing close to data sources.
Fleet Command consists of two key layers:
| Layer | Function |
|---|---|
| Cloud Control Plane | Hosted by NVIDIA. Manages devices, applications, and monitoring |
| Edge Nodes | The devices (e.g. Jetson, GPU servers) running the workloads |
Edge nodes pull workloads from the cloud and push logs and telemetry back.
| Device/Server Type | Example Models |
|---|---|
| NVIDIA Jetson | Xavier NX, AGX Orin |
| NVIDIA-Certified Servers | With A100, L40, H100, etc. |
| BlueField DPUs | Combined with DOCA for secure inference |
| OEM Edge Devices | Integrated with NVIDIA GPUs |
| Industry | Example Use Case |
|---|---|
| Retail | Smart checkout, customer analytics |
| Healthcare | Medical image inference at clinics |
| Logistics | Real-time video analytics for safety and compliance |
| Smart Cities | Edge traffic monitoring and public safety systems |
Log into the Fleet Command portal on NVIDIA LaunchPad or enterprise dashboard.
Register a new device by:
Generating a secure registration token
Downloading the Edge Node Installer
On the edge system:
sudo bash edge-node-installer.sh --token <your-token>
This:
Installs core services
Configures networking and security keys
Connects device to the Fleet Command control plane
You can:
Use prebuilt containers from NGC
Upload your own container images (from private registries)
Define deployment parameters, such as:
Resource limits
Volume mounts
Environment variables
In the Web UI:
Select the device
Choose the app/container
Click Deploy
Fleet Command pulls the container image, launches it on the device, and:
Starts the container with NVIDIA GPU access
Monitors logs and performance
Reports back to the UI
You can stop, update, or redeploy at any time.
You can view:
System logs (journal logs, container logs)
Application logs
Device health: CPU, memory, GPU, temperature
Connectivity status
If a node fails or loses connection:
You get an alert in the dashboard
You can re-register or reset remotely
| Security Mechanism | Purpose |
|---|---|
| Secure bootstrapping | TLS-authenticated registration process |
| Remote software updates | Signatures and rollback support |
| Isolated container runtime | Applications run in secure containers |
| Role-Based Access Control (RBAC) | Define who can deploy, monitor, and access logs |
Fleet Command is designed with Zero Trust principles—critical for edge deployments in regulated industries.
| Task | How to Do It |
|---|---|
| Register an edge device | Use edge-node-installer.sh with secure token |
| Deploy an AI container | Select image, define parameters, deploy via Web UI |
| View logs | Go to Logs tab for the device or application |
| Monitor health | View resource usage and status in the dashboard |
| Stop or restart applications | Click “Stop” or “Restart” in the App control panel |
| Feature | Value |
|---|---|
| Centralized control | Manage all edge AI nodes from one dashboard |
| Remote deployment | No physical access required to update or manage systems |
| Real-time monitoring | Know immediately if a workload or device has an issue |
| Scalability | Manage tens, hundreds, or thousands of edge nodes |
| Security and compliance | Protect sensitive data and models in untrusted environments |
Magnum IO is a suite of software libraries, tools, and drivers from NVIDIA designed to optimize I/O (input/output) operations for:
Multi-GPU workloads
Multi-node training clusters
High-performance AI and HPC applications
It enables GPUs to communicate faster with each other and with storage and networking systems—eliminating bottlenecks in distributed training or inference.
| Challenge | How Magnum IO Helps |
|---|---|
| Multi-GPU communication latency | Uses NVLink/NVSwitch with NCCL, UCX |
| Poor I/O performance in large clusters | Accelerates data movement via GPUDirect |
| CPU bottlenecks in communication stack | Offloads I/O using DPU + GPUDirect RDMA |
| Inefficient distributed training | Ensures synchronized model updates via NCCL |
| Component | Description |
|---|---|
| NCCL (NVIDIA Collective Communication Library) | Handles GPU-to-GPU data movement across nodes |
| UCX (Unified Communication X) | Framework that abstracts different transport methods |
| UCC (Unified Collective Communication) | Layer that sits above UCX for collective ops |
| GPUDirect RDMA | Enables network cards to communicate directly with GPU memory |
| GPUDirect Storage | Enables storage devices to read/write from GPU memory directly |
Let’s say you're training a ResNet model on 16 A100 GPUs spread across 4 servers.
Without Magnum IO:
Each GPU sends gradients through the CPU and NIC to other GPUs
Data sync is slow and CPU-bound
With Magnum IO:
Gradients are sent directly GPU-to-GPU using NVLink + NCCL
GPUDirect RDMA handles networking, bypassing the CPU
Result: Faster convergence and lower training time
[GPU 1]───NVLink───[GPU 2]
│ │
└─GPUDirect──NIC───┘
│
Ethernet/IB
│
[Other Node]
GPUs communicate over NVLink/NVSwitch inside the node
Data goes through NICs directly using GPUDirect
NCCL handles collective operations like AllReduce, Broadcast
Use NVIDIA NGC Containers (e.g., PyTorch with NCCL preinstalled):
docker run --gpus all nvcr.io/nvidia/pytorch:24.02-py3
Enable GPUDirect RDMA in the kernel and driver
Kernel modules like nvidia_peermem must be loaded
NIC (e.g., Mellanox) must support RDMA
Use the NCCL backend in training code (a fuller PyTorch sketch follows after this list):
PyTorch:
dist.init_process_group(backend='nccl')
TensorFlow:
Uses horovod or tf.distribute under the hood with NCCL
Ensure correct topology:
Use nvidia-smi topo -m to see the NVLink/NVSwitch layout
Monitor performance:
Use nvprof, nsys, or DCGM metrics
Watch NCCL logs for communication efficiency
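Here is a minimal PyTorch sketch of the NCCL initialization step (assumes launching with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE; the model is a placeholder and the data pipeline is omitted):
import os

import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it launches
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap the model so gradient AllReduce runs over NCCL (NVLink/GPUDirect paths when available)
model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
Launched with, for example, torchrun --nproc_per_node=8 train.py on each node.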
| Practice | Description |
|---|---|
| Use NVSwitch/NVLink where available | Enables full bandwidth between GPUs |
| Use Mellanox NICs with RDMA | Required for GPUDirect RDMA |
| Align process placement with topology | Use tools like mpirun --map-by ppr:... to colocate processes |
| Use NGC containers with NCCL | Preconfigured and optimized for distributed training |
| Enable nv_peer_mem on all nodes | Required for GPUDirect to work correctly |
| Benefit | Result |
|---|---|
| Fast multi-GPU communication | Speeds up gradient exchange and model updates |
| CPU offloading | More resources available for preprocessing or other tasks |
| Optimized I/O paths | Reduces training time and improves throughput |
| Compatibility with AI frameworks | Works with PyTorch, TensorFlow, Horovod, etc. |
| Scalable to hundreds of nodes | Suitable for supercomputers and hyperscale AI infrastructure |
AI workloads are data-intensive. Poor storage architecture can cause:
Slow training speeds
Data loading bottlenecks
Underutilized GPUs
To avoid this, you must:
Use high-throughput file systems
Optimize I/O paths
Enable RDMA where possible
| File System | Notes |
|---|---|
| Lustre | High-performance, widely used in HPC |
| BeeGFS | Easy to scale, optimized for mixed workloads |
| IBM Spectrum Scale (GPFS) | Very scalable, enterprise-ready |
These systems:
Split large files across multiple storage servers
Support many concurrent readers/writers (perfect for multi-GPU training)
RDMA allows storage/network adapters to:
Transfer data directly between memory regions
Bypass CPU
Lower latency and increase bandwidth
Especially helpful when using GPUDirect Storage in large AI training clusters
Use NVMe SSDs for local caching or high-speed scratch space
Preload frequently used datasets onto fast local disks
Use data prefetching techniques (e.g., PyTorch DataLoader with prefetch_factor)
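A small sketch of prefetching with PyTorch's DataLoader (dataset is a placeholder for your own Dataset object; worker and prefetch counts are illustrative):
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,              # placeholder: your Dataset object
    batch_size=64,
    num_workers=8,        # parallel worker processes decode/augment in the background
    pin_memory=True,      # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,    # batches each worker keeps ready ahead of the training loop
)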
Kubernetes pods can access:
Persistent Volumes (PVs) mounted via:
NFS
iSCSI
CSI drivers
Object storage (e.g., MinIO, S3) via SDKs or FUSE
Shared network storage (e.g., Lustre via hostPath or CSI plugin)
Example (Persistent Volume Claim for NFS):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
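A pod can then mount that claim (a sketch; the image, command, and mount path are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  restartPolicy: OnFailure
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.02-py3
      command: ["python", "/data/train.py"]   # illustrative entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: dataset
          mountPath: /data
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: ai-data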
Manual deployment is error-prone. In real AI environments, use automation tools to ensure:
Consistency across clusters
Reproducibility
Scalability
Infrastructure-as-Code (IaC) for provisioning:
VMs
Storage
Networking
Common for cloud-based AI infrastructure (AWS, Azure, GCP)
Example:
resource "aws_instance" "gpu_worker" {
ami = "ami-12345678"
instance_type = "p4d.24xlarge"
}
Automates software setup:
GPU driver installation
Docker/NVIDIA runtime setup
BCM agent installation
Slurm configuration
Example playbook:
- hosts: gpu_nodes
  become: true
  tasks:
    - name: Install NVIDIA driver
      apt:
        name: nvidia-driver-535
        state: present
Kubernetes package manager
Use for:
Deploying GPU Operator
Installing monitoring stacks (Prometheus, Grafana)
Managing custom apps with config values
Example:
helm install gpu-operator nvidia/gpu-operator -n gpu-system
Store Kubernetes manifests in Git
Automatically sync deployments on change
Enables version control for infrastructure (see the Argo CD example after this list)
Use this for:
Production-grade MLOps
Secure deployment pipelines
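For instance, with Argo CD an Application resource points the cluster at a Git path and keeps it synced (repository URL and paths here are placeholders):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/infra-manifests   # placeholder repository
    targetRevision: main
    path: gpu-operator
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true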
| Task | Tools/Skills Required |
|---|---|
| Set up fast AI storage | Lustre, BeeGFS, NFS, GPFS, local NVMe |
| Optimize data access | Use prefetching, RDMA, local caching |
| Automate GPU node deployment | Ansible, Terraform |
| Automate Kubernetes apps | Helm, GitOps (ArgoCD/Flux) |
| Integrate with storage in K8s | PVCs, CSI drivers, object store SDKs |
| Validate end-to-end performance | Monitor training speed, GPU utilization, I/O stats |
| Subtopic | Key Tools / Commands |
|---|---|
| GPU Driver Installation | nvidia-smi, apt, CUDA toolkit |
| Container Runtime | Docker, NVIDIA Container Toolkit |
| Kubernetes GPU Setup | GPU Operator, kubectl, nvidia.com/gpu |
| BCM Cluster Management | bcminit, Slurm, BCM Web UI |
| DOCA & SmartNICs | BlueField, DOCA SDK, Secure Edge Inference |
| Fleet Command Deployment | Secure tokens, remote deployment, telemetry logs |
| Magnum IO | NCCL, GPUDirect RDMA, multi-node communication |
| Storage for AI | Lustre, RDMA, PVCs, NVMe |
| Deployment Automation | Terraform, Ansible, Helm, GitOps |
To ensure GPU containers or Pods can access external networks, internal services, and cluster components correctly.
| Mode | Description | Use Case |
|---|---|---|
| bridge | Default mode, NAT-based network with separate IP | Good for isolation; requires port mapping |
| host | Shares host’s network namespace | Better performance; limited isolation |
In bridge mode, the container gets a virtual IP and communicates via the host. In host mode, the container sees the same IP as the host, which is suitable for high-performance GPU networking (e.g., RDMA).
| Plugin | Feature Highlights |
|---|---|
| Flannel | Simple, stable; uses VXLAN or host-gw |
| Calico | Supports network policy, IP-in-IP, BGP |
| Cilium | eBPF-based, supports L7 security & observability |
Calico is the most widely used in GPU clusters due to its rich policy support and performance.
ifconfig # Check container or node network interface
ip a # View all interface and IP assignments
kubectl get pods -o wide # Show Pod IPs and node assignments
kubectl exec pod -- ping google.com # Verify external access
To prevent over-allocation and support fair GPU sharing in multi-user environments.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"
This restricts the entire namespace to use no more than 4 GPUs total.
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-defaults
  namespace: ai-team
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: 1
      max:
        nvidia.com/gpu: 2
This ensures each container in the namespace will use 1 GPU by default, with a maximum of 2.
Use separate Namespaces for each team or user group.
Combine with RBAC policies to restrict access (an example RoleBinding follows this list).
Enforce quotas with LimitRange + ResourceQuota.
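A minimal RoleBinding sketch for the RBAC point above (the group name is a placeholder; edit is a built-in ClusterRole):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-team-editors
  namespace: ai-team
subjects:
  - kind: Group
    name: ai-team            # placeholder group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                 # built-in ClusterRole: manage workloads, no RBAC changes
  apiGroup: rbac.authorization.k8s.io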
Quickly verify that a fresh cluster deployment is healthy and ready for GPU workloads.
#!/bin/bash
echo "== Checking GPU =="
nvidia-smi || echo "GPU driver not working"
echo "== Checking Docker & GPU Toolkit =="
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi || echo "GPU container failed"
echo "== Checking Kubernetes =="
kubectl get nodes
kubectl describe node $(hostname)
echo "== Checking BCM (if installed) =="
bcmsystem status || echo "BCM not installed or not initialized"
This script validates GPU drivers, container runtime, Kubernetes node readiness, and optional BCM agent status.
| Issue | Root Cause | Solution |
|---|---|---|
| nvidia-smi shows no GPU | Driver not installed or kernel mismatch | Run dmesg, check lsmod \| grep nvidia, reinstall the driver |
| nvidia-smi fails inside container | Missing NVIDIA Container Toolkit or --gpus not set | Install toolkit, start container with --gpus all |
| Pod stuck in Pending | No GPU node or device plugin not running | Check kubectl get pods -n gpu-operator |
| BCM node not recognized | Wrong IP or missing join command | Check network, rerun bcminit --join |
Diagnostic Tools:
journalctl -u kubelet
docker logs <container-id>
kubectl describe pod <name>
kubectl get events
Build efficient and portable GPU images for training or inference.
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt update && apt install -y python3 python3-pip
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . /app
WORKDIR /app
CMD ["python3", "main.py"]
| Technique | Description |
|---|---|
| Clean intermediate layers | apt clean && rm -rf /var/lib/apt/lists/* |
| Multi-stage builds | Compile in one stage, copy only the runtime into the final image (see the sketch below) |
| Use docker buildx | Enables cross-platform builds (e.g., ARM64) |
Example:
docker buildx build --platform=linux/amd64,linux/arm64 -t myimage:gpu .
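To illustrate the multi-stage row from the table above, here is a sketch (image tags mirror the Dockerfile shown earlier; copying the virtual environment works because both stages use the same Ubuntu and Python versions):
# Build stage: heavier devel image with compilers for pip packages that build extensions
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt update && apt install -y python3 python3-pip python3-venv && \
    apt clean && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN python3 -m venv /opt/venv && /opt/venv/bin/pip install -r requirements.txt

# Runtime stage: slimmer runtime image; only the prebuilt virtual environment is copied in
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt update && apt install -y python3 && apt clean && rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
COPY . /app
WORKDIR /app
CMD ["python3", "main.py"]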
| CUDA Version | Required Driver Version | Supported Architectures |
|---|---|---|
| CUDA 12.2 | ≥ 535.x | Hopper, Ada, Ampere |
| CUDA 11.8 | ≥ 510.x | Ampere, Volta, Turing |
| CUDA 10.2 | ≥ 440.x | Volta, Turing |
Best Practices:
Always match CUDA version with driver version.
Use nvidia-smi to verify driver/CUDA compatibility.
Avoid mixing too old drivers with new CUDA versions.
Why must administrators verify compatibility between NVIDIA GPU drivers and CUDA Toolkit versions before deployment?
Because CUDA applications require driver versions that support the specific CUDA runtime used by the framework.
CUDA frameworks rely on features implemented in compatible driver versions. If the GPU driver is older than the required CUDA runtime, applications may fail to start or produce runtime errors. Administrators must consult compatibility matrices provided by NVIDIA to ensure that installed drivers support the intended CUDA version. This is particularly important when deploying machine learning frameworks such as PyTorch or TensorFlow, which bundle specific CUDA dependencies. A common mistake is upgrading CUDA without updating the GPU driver, resulting in initialization failures when the framework attempts to access GPU capabilities not supported by the installed driver.
Demand Score: 84
Exam Relevance Score: 90
What command can administrators use to verify that NVIDIA drivers are properly installed and that GPUs are recognized by the system?
Administrators typically run the nvidia-smi command.
nvidia-smi queries the installed NVIDIA driver and displays information about detected GPUs, including device model, driver version, memory usage, and active processes. When drivers are correctly installed, the command returns a detailed table showing GPU devices and operational status. If the command fails or returns an error, it often indicates a driver installation issue, missing kernel modules, or unsupported hardware configuration. Administrators frequently use this command immediately after installation to confirm that GPUs are accessible before deploying CUDA or containerized AI workloads.
Demand Score: 78
Exam Relevance Score: 86
Why is containerization commonly used when deploying AI workloads on GPU infrastructure?
Containerization provides consistent runtime environments for AI frameworks and GPU dependencies.
AI workloads often depend on complex combinations of libraries, CUDA versions, and deep learning frameworks. Containers package these dependencies into reproducible environments that can run consistently across different servers. Using container technologies such as Docker allows administrators to deploy workloads without manually configuring each node's software stack. GPU-enabled containers can access host GPUs through specialized runtimes, enabling scalable AI deployments. Without containerization, dependency conflicts may arise when multiple frameworks require different CUDA versions or library configurations.
Demand Score: 74
Exam Relevance Score: 83
What deployment issue occurs if kernel modules required by the NVIDIA driver are not loaded?
The operating system will not detect or properly interface with the GPU hardware.
NVIDIA drivers rely on kernel modules that enable communication between the operating system and GPU hardware. If these modules fail to load during installation or system boot, the GPU becomes inaccessible to applications. Administrators may observe errors when running GPU utilities or attempting to execute CUDA workloads. This issue can occur due to kernel version mismatches, incomplete driver installations, or secure boot restrictions. Ensuring kernel modules are loaded correctly is a critical validation step after deploying GPU drivers.
Demand Score: 69
Exam Relevance Score: 82
Why should administrators test GPU functionality after deployment before onboarding AI workloads?
Testing ensures the infrastructure is correctly configured and capable of executing GPU workloads.
Initial testing confirms that drivers, CUDA libraries, and runtime components function correctly before production workloads are deployed. Administrators typically run diagnostic tools or simple CUDA sample applications to verify GPU computation capability. Without testing, misconfigurations such as incorrect drivers, missing libraries, or runtime conflicts may only appear after critical workloads are scheduled. Early validation prevents failures during model training or inference deployments and ensures that GPU resources are ready for operational workloads.
Demand Score: 65
Exam Relevance Score: 80