Before running any AI workloads on a GPU, you must make sure the operating system recognizes the GPU and that the software stack can talk to it properly.
That starts with:
Installing NVIDIA GPU drivers
Setting up a container runtime
Testing it using nvidia-smi or Docker containers
This is the foundation. If this part fails, nothing else will work—including Kubernetes or Slurm.
Before installing anything, make sure you have:
An NVIDIA GPU (Ampere, Volta, or newer preferred for AI)
A 64-bit Linux OS (Ubuntu 20.04/22.04 is most common)
At least 8–16 GB RAM and sufficient disk space
Network connectivity (for package installation and updates)
To check if the system sees your GPU, run:
lspci | grep -i nvidia
This should show a line like:
01:00.0 VGA compatible controller: NVIDIA Corporation GA100 [A100]
This driver enables Linux to communicate with the GPU hardware.
Without it, your system won’t know a GPU even exists.
Open a terminal and run:
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot
nvidia-driver-535 is a common version used with CUDA 12
You can change the version depending on your CUDA/toolkit compatibility
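If you are unsure which driver version to pick, Ubuntu's driver utility can suggest one (a quick check; assumes the ubuntu-drivers-common package is available on your distribution):
sudo apt install -y ubuntu-drivers-common
ubuntu-drivers devices    # lists detected GPUs and the recommended nvidia-driver package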
After reboot, test it with:
nvidia-smi
You should see output like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 A100-SXM4-40GB On | 00000000:3D:00.0 Off | 0 |
+-----------------------------------------------------------------------------+
If you see an error or command not found, the driver wasn’t installed correctly.
If your AI workloads use CUDA directly (TensorFlow, PyTorch, or custom C++/CUDNN apps), you may also need the CUDA Toolkit.
There are three common ways to install it:
Runfile installer from the NVIDIA site (flexible, but manual)
Package managers (DEB/RPM via apt or yum; see the example after this list)
Containers (most modern AI workflows use containers with CUDA pre-installed)
For container workflows, the CUDA Toolkit is often not required on the host, only in the container image.
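If you do need a host-level CUDA Toolkit, here is one package-manager sketch for Ubuntu 22.04 (repository setup mirrors NVIDIA's published CUDA download instructions; verify the exact keyring and package names for your OS and CUDA version):
# Add NVIDIA's CUDA repository keyring (Ubuntu 22.04 x86_64 shown; adjust for your distribution)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install the toolkit only (no driver), e.g. for CUDA 12.2
sudo apt install -y cuda-toolkit-12-2
# Confirm the compiler is reachable (you may need to add /usr/local/cuda/bin to PATH)
nvcc --version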
The NVIDIA Container Toolkit allows Docker to run containers that can access the GPU.
Without this, your AI containers won’t be able to see the GPU—even if nvidia-smi works on the host.
Run these step by step:
# Detect distribution
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
# Add NVIDIA's package repo
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install the toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Restart Docker
sudo systemctl restart docker
This is the final validation.
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
You should see the same GPU stats output as you saw earlier.
If you do, congratulations—your GPU host is ready for container-based AI workloads.
| Task | Command or Step |
|---|---|
| Verify GPU hardware | lspci \| grep -i nvidia |
| Install GPU driver | sudo apt install nvidia-driver-535, then verify with nvidia-smi |
| Install NVIDIA container toolkit | Add NVIDIA repo (curl), apt install nvidia-container-toolkit, restart Docker |
| Validate with test container | docker run --gpus all nvidia/cuda:12.0-base nvidia-smi |
Containers are lightweight environments that package:
An application (like your AI model)
Its dependencies (Python, libraries)
Runtime tools (CUDA, cuDNN)
They allow you to:
Run AI workloads consistently across machines
Avoid "it works on my laptop but not on the server"
Deploy jobs at scale in Kubernetes, Slurm, or Fleet Command
By default, Docker containers can’t access GPUs unless you:
Install the NVIDIA Container Toolkit
Configure Docker to use the NVIDIA runtime
This setup allows containers to run code on the GPU just like the host system.
| Tool | Notes |
|---|---|
| Docker | Most commonly used; works well with NVIDIA toolkit |
| Podman | Docker alternative; rootless by default; compatible with NVIDIA |
For simplicity, Docker is recommended for beginners and is what most AI workloads still use.
Already covered earlier, but here’s the logic behind each step:
NVIDIA packages are OS-specific, so we detect which version of Ubuntu/Debian we’re using:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
This tells your system where to find and download the container tools:
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
nvidia-container-toolkit includes:
NVIDIA runtime
NVIDIA hooks to connect containers with GPUs
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
Expected output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 A100-SXM4-40GB On | 00000000:17:00.0 Off | 0 |
+-----------------------------------------------------------------------------+
If it works, you now have:
Docker running
NVIDIA runtime active
A working GPU inside a container
To avoid having to specify --gpus all each time, modify Docker’s config:
sudo nano /etc/docker/daemon.json
Add or update this section:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Then restart Docker:
sudo systemctl restart docker
Now all containers will assume GPU access unless otherwise specified.
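Two quick checks (a sketch using the same CUDA base image as earlier) that the default runtime took effect:
docker info | grep -i 'default runtime'           # should report: Default Runtime: nvidia
docker run --rm nvidia/cuda:12.0-base nvidia-smi  # note: no --gpus flag needed now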
Use official NVIDIA containers from NGC (NVIDIA GPU Cloud) or Docker Hub:
| Image | Use Case |
|---|---|
| nvidia/cuda:12.0-base | Basic CUDA environment |
| nvcr.io/nvidia/pytorch:24.02-py3 | PyTorch with CUDA preinstalled |
| nvcr.io/nvidia/tensorflow:24.01-tf2-py3 | TensorFlow 2.x + CUDA |
| nvcr.io/nvidia/tritonserver:24.02-py3 | Inference with Triton |
You’ll need to register for NGC access to pull from nvcr.io.
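After creating an NGC account and generating an API key, log Docker in to the registry; the username is the literal string $oauthtoken and the password is your API key:
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>
docker pull nvcr.io/nvidia/pytorch:24.02-py3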
| Task | Command / Tool |
|---|---|
| Run container with GPU | docker run --gpus all nvidia/cuda:12.0-base nvidia-smi |
| Set default GPU runtime | Edit /etc/docker/daemon.json |
| Pull CUDA container | docker pull nvidia/cuda:12.0-base |
| Build custom AI image | Dockerfile + docker build -t mymodel . |
| Debug GPU access issues | Use nvidia-smi in host and inside the container |
Kubernetes is the standard platform for running containerized applications at scale, including:
AI model training
Inference services
Scheduled batch jobs
Multi-user GPU clusters
But Kubernetes does not support GPUs natively. You need to:
Install GPU drivers
Expose GPU hardware to Kubernetes
Set up monitoring tools
That’s where the NVIDIA GPU Operator and supporting components come in.
You can use any of the following to set up a basic cluster:
| Method | Use Case |
|---|---|
| kubeadm | Production or large-scale setups |
| microk8s | Lightweight, great for testing |
| minikube | Local testing only |
| RKE or k3s | Lightweight Kubernetes distros |
For multi-node production environments, kubeadm is recommended.
Example (single-node setup with kubeadm; assumes the Kubernetes apt repository is already configured):
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Apply a CNI (like Flannel or Calico), and then your cluster is ready.
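For example, Flannel can be applied from its published manifest, and a single-node cluster needs the control-plane taint removed so pods can schedule on it (URL and taint key as documented upstream; verify against current releases):
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
# Single-node only: allow workloads on the control-plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-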
The GPU Operator automates:
GPU driver installation
DCGM setup
Device plugin deployment
GPU monitoring tools (exporters, collectors)
You can install it using Helm or kubectl.
Prerequisites:
Helm installed
NVIDIA Container Toolkit installed on nodes
nvidia-smi working on nodes
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install --wait \
  --generate-name \
  -n gpu-operator \
  --create-namespace \
  nvidia/gpu-operator
This creates a new namespace gpu-operator and deploys:
| Component | Purpose |
|---|---|
| Driver container | Installs GPU drivers inside a container |
| Device plugin | Exposes GPUs to Kubernetes |
| DCGM exporter | Sends GPU metrics to Prometheus |
| Validator | Verifies if the node is fully GPU-ready |
Check the nvidia.com/gpu resource is available:
kubectl get nodes -o json | jq '.items[].status.allocatable'
Or:
kubectl describe node <node-name>
You should see:
Allocatable:
  nvidia.com/gpu: 1
This means your GPU is now visible to Kubernetes!
Example pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-container
      image: nvidia/cuda:12.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Deploy it with:
kubectl apply -f gpu-test.yaml
Check output with:
kubectl logs gpu-test
You should see the same nvidia-smi output from your host system.
| Component | Role |
|---|---|
| nvidia-device-plugin | Reports GPU hardware to the scheduler |
| nvidia-container-runtime | Allows GPU access in Docker-based containers |
| dcgm-exporter | Sends GPU telemetry to Prometheus |
| NVIDIA driver container | Installs the correct GPU driver version automatically |
All of these are installed automatically by the GPU Operator.
| Problem | Fix |
|---|---|
| Pod stuck in Pending | Node may not have available GPUs or device plugin not running |
| nvidia-smi fails inside container | Check if driver is correctly installed by operator |
| No GPU resource on node | Restart operator pods or revalidate node readiness |
| Metrics not appearing in Prometheus | Ensure dcgm-exporter is running and connected |
| Task | Command / File |
|---|---|
| Install GPU Operator | helm install nvidia/gpu-operator |
| Deploy GPU pod | Use resources.limits.nvidia.com/gpu in YAML |
| Check GPU availability | kubectl describe node |
| View container logs | kubectl logs pod-name |
| Watch GPU plugin status | kubectl get pods -n gpu-operator |
BCM is NVIDIA’s enterprise tool for managing AI clusters equipped with NVIDIA GPUs.
It provides a centralized way to:
Configure and monitor GPU nodes
Submit, track, and manage AI workloads
Integrate with Slurm for job scheduling
Visualize GPU health, utilization, and job status via CLI or Web UI
BCM is ideal for on-premises GPU clusters in:
Research centers
Data centers
Enterprise AI labs
| Feature | Description |
|---|---|
| Node registration | Add and manage GPU-enabled servers in the cluster |
| Slurm integration | Use Slurm to queue and dispatch AI jobs |
| GPU monitoring | Track health, utilization, and errors (via DCGM) |
| Access control | Assign users, roles, and resource quotas |
| Web UI & CLI support | Full management experience from either interface |
Before you install BCM, you must:
Use a supported OS: Ubuntu 20.04+, RHEL 8+, or CentOS 8+
Ensure NVIDIA GPU drivers are installed
Confirm nvidia-smi works
Install container runtime (Docker)
Example:
sudo apt update
sudo apt install -y nvidia-driver-535 docker.io
nvidia-smi
The BCM agent is installed on every node (head + worker).
You get the agent package from NVIDIA via:
Official ISO or installation script
Enterprise portal or DGX system image (if you have a DGX server)
Run the installer on each node.
On the head node:
bcminit
This command:
Configures BCM services
Registers the host in the BCM controller
Sets up directories and default configs
On worker nodes:
bcminit --join --controller-ip <head-node-ip>
This joins the node to the cluster.
By default, BCM runs a web interface on port 8443 or 443.
Access it at:
https://<head-node-ip>:8443
You can:
View all GPU nodes
Check usage and health
Submit jobs (through Slurm)
Add/remove users
Slurm is included with BCM and is the primary scheduler.
Check Slurm status:
sinfo
squeue
You can now submit AI jobs using:
sbatch train_model.sh
Where train_model.sh is your job script, like:
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --gres=gpu:1
python train.py
To check if everything is working:
bcmsystem status # Checks BCM services
sinfo # Shows node availability
squeue # Shows job queue
nvidia-smi # Confirms GPU visibility
Also, open the Web UI and confirm:
Nodes are listed
GPUs are shown with health data
Slurm jobs appear in the interface
BCM supports:
MIG-based GPU slicing
You can assign specific MIG instances per job or user (see the nvidia-smi sketch after this list).
DCGM integration
GPU telemetry is shown in the UI:
Temperatures
ECC errors
Utilization
Power draw
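As a reference for the MIG point above, partitioning itself is done on the node with nvidia-smi (a sketch; profile IDs differ by GPU model, and enabling MIG mode may require a GPU reset or reboot):
sudo nvidia-smi -i 0 -mig 1           # enable MIG mode on GPU 0
sudo nvidia-smi mig -i 0 -cgi 9,9 -C  # create two GPU instances (profile 9 = 3g.20gb on A100-40GB) with compute instances
nvidia-smi -L                         # list GPUs and the resulting MIG devices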
In multi-user environments:
Admins manage infrastructure and configs
Operators manage jobs and monitor resources
Users submit jobs, view logs, and access datasets
All permissions can be set via:
Web UI → User Settings
bcmusers CLI tools
| Skill | Description |
|---|---|
| Install BCM and register nodes | Use bcminit to configure both head and worker nodes |
| Access and use the Web UI | Manage cluster health, users, jobs visually |
| Use Slurm for job scheduling | Submit jobs using sbatch, monitor with squeue, sacct |
| Monitor GPU performance | View GPU stats via UI (DCGM), CLI (nvidia-smi), and Slurm logs |
| Manage users and roles | Create and configure roles for multi-tenant GPU clusters |
DOCA is NVIDIA’s software framework designed to run on BlueField DPUs (Data Processing Units). It provides:
High-performance networking acceleration
Security enforcement (Zero Trust security at the infrastructure level)
Data path offloading from the CPU
DOCA is not about running AI models directly, but about optimizing and securing AI infrastructure, especially when deploying across high-performance edge or data center environments.
| Component | Description |
|---|---|
| DOCA SDK | For developers to build custom apps that run on DPUs |
| DOCA Services | Pre-built containerized apps for networking, security, and storage offload |
| DOCA Runtime | Runtime engine for deploying and running DOCA applications |
| BlueField DPU | The hardware platform that runs DOCA (like a smart NIC or infrastructure CPU) |
| Use Case | How DOCA Helps |
|---|---|
| Secure inference at the edge | Enforce network isolation and deep packet inspection |
| Large-scale AI model deployment | Offload storage/network processing from host CPUs |
| Data pipeline acceleration (e.g. NVMe, PCIe) | Boost performance of data movement to/from GPU compute nodes |
| Regulatory compliance (Zero Trust) | Ensure infrastructure-level trust, encryption, and access control |
Imagine a hospital using AI models for image classification on patient data.
DOCA can:
Inspect network packets on the DPU before they reach the CPU or GPU
Run lightweight inference at the DPU level (e.g., for preliminary triage)
Ensure only encrypted traffic is allowed
This reduces latency, improves privacy, and offloads work from central nodes.
In data centers, AI workloads often pull data from:
NVMe over Fabrics
RDMA-enabled storage
Streaming video from remote sensors
DOCA can:
Accelerate this I/O pipeline directly on the DPU
Filter data packets to reduce CPU/GPU load
Preprocess or cache data closer to the GPU node
Most DOCA services are packaged as containers and run directly on the BlueField DPU, separate from the host CPU.
You interact with DOCA via command-line tools used to manage DPU configuration and services:
doca_app_manager list
doca_service_control start <service_name>
Many DOCA containers are hosted on NGC.
To deploy:
Use Fleet Command or container management system
Start services with systemctl or docker on BlueField OS
Connect to host applications or monitoring tools (like DCGM)
| Term | Meaning |
|---|---|
| SmartNIC | Network Interface Card with built-in compute (i.e., BlueField) |
| Zero Trust Security | All connections must be authenticated, authorized, and encrypted |
| PCIe Filtering | Block or redirect data moving through PCIe bus using policy |
| Telemetry Offload | Stream system monitoring data from DPU directly to remote observability tools |
| Task | How to Do It |
|---|---|
| Access DOCA runtime on BlueField | SSH into DPU OS or serial console |
| List DOCA containers or services | Use docker ps or doca_app_manager list |
| Enable a DOCA service | systemctl start doca-telemetry.service or equivalent |
| Monitor performance | Use NVIDIA NIM or Fleet Command monitoring |
| Integrate with host GPU monitoring | Stream DPU telemetry alongside DCGM metrics |
| Role in AI Infrastructure | Value It Brings |
|---|---|
| Data security at the edge | Keeps inference data private and compliant |
| Network acceleration | Offloads TCP/IP stack, reduces CPU/GPU contention |
| PCIe and storage path optimization | Speeds up data loading to GPUs for training |
| Telemetry collection and isolation | Improves observability and system resilience |
Fleet Command is NVIDIA’s cloud-based management platform for deploying and operating AI applications at the edge.
It allows you to:
Manage fleets of remote edge devices (e.g., Jetson, A100, BlueField systems)
Remotely deploy AI containers (from NGC or custom registries)
Monitor telemetry, health, and logs
Perform secure updates and troubleshooting
Ideal for industries like retail, healthcare, manufacturing, and logistics that need remote AI processing close to data sources.
Fleet Command consists of two key layers:
| Layer | Function |
|---|---|
| Cloud Control Plane | Hosted by NVIDIA. Manages devices, applications, and monitoring |
| Edge Nodes | The devices (e.g. Jetson, GPU servers) running the workloads |
Edge nodes pull workloads from the cloud and push logs and telemetry back.
| Device/Server Type | Example Models |
|---|---|
| NVIDIA Jetson | Xavier NX, AGX Orin |
| NVIDIA-Certified Servers | With A100, L40, H100, etc. |
| BlueField DPUs | Combined with DOCA for secure inference |
| OEM Edge Devices | Integrated with NVIDIA GPUs |
| Industry | Example Use Case |
|---|---|
| Retail | Smart checkout, customer analytics |
| Healthcare | Medical image inference at clinics |
| Logistics | Real-time video analytics for safety and compliance |
| Smart Cities | Edge traffic monitoring and public safety systems |
Log into the Fleet Command portal on NVIDIA LaunchPad or enterprise dashboard.
Register a new device by:
Generating a secure registration token
Downloading the Edge Node Installer
On the edge system:
sudo bash edge-node-installer.sh --token <your-token>
This:
Installs core services
Configures networking and security keys
Connects device to the Fleet Command control plane
You can:
Use prebuilt containers from NGC
Upload your own container images (from private registries)
Define deployment parameters, such as:
Resource limits
Volume mounts
Environment variables
In the Web UI:
Select the device
Choose the app/container
Click Deploy
Fleet Command pulls the container image, launches it on the device, and:
Starts the container with NVIDIA GPU access
Monitors logs and performance
Reports back to the UI
You can stop, update, or redeploy at any time.
You can view:
System logs (journal logs, container logs)
Application logs
Device health: CPU, memory, GPU, temperature
Connectivity status
If a node fails or loses connection:
You get an alert in the dashboard
You can re-register or reset remotely
| Security Mechanism | Purpose |
|---|---|
| Secure bootstrapping | TLS-authenticated registration process |
| Remote software updates | Signatures and rollback support |
| Isolated container runtime | Applications run in secure containers |
| Role-Based Access Control (RBAC) | Define who can deploy, monitor, and access logs |
Fleet Command is designed with Zero Trust principles—critical for edge deployments in regulated industries.
| Task | How to Do It |
|---|---|
| Register an edge device | Use edge-node-installer.sh with secure token |
| Deploy an AI container | Select image, define parameters, deploy via Web UI |
| View logs | Go to Logs tab for the device or application |
| Monitor health | View resource usage and status in the dashboard |
| Stop or restart applications | Click “Stop” or “Restart” in the App control panel |
| Feature | Value |
|---|---|
| Centralized control | Manage all edge AI nodes from one dashboard |
| Remote deployment | No physical access required to update or manage systems |
| Real-time monitoring | Know immediately if a workload or device has an issue |
| Scalability | Manage tens, hundreds, or thousands of edge nodes |
| Security and compliance | Protect sensitive data and models in untrusted environments |
Magnum IO is a suite of software libraries, tools, and drivers from NVIDIA designed to optimize I/O (input/output) operations for:
Multi-GPU workloads
Multi-node training clusters
High-performance AI and HPC applications
It enables GPUs to communicate faster with each other and with storage and networking systems—eliminating bottlenecks in distributed training or inference.
| Challenge | How Magnum IO Helps |
|---|---|
| Multi-GPU communication latency | Uses NVLink/NVSwitch with NCCL, UCX |
| Poor I/O performance in large clusters | Accelerates data movement via GPUDirect |
| CPU bottlenecks in communication stack | Offloads I/O using DPU + GPUDirect RDMA |
| Inefficient distributed training | Ensures synchronized model updates via NCCL |
| Component | Description |
|---|---|
| NCCL (NVIDIA Collective Communication Library) | Handles GPU-to-GPU data movement across nodes |
| UCX (Unified Communication X) | Framework that abstracts different transport methods |
| UCC (Unified Collective Communication) | Layer that sits above UCX for collective ops |
| GPUDirect RDMA | Enables network cards to communicate directly with GPU memory |
| GPUDirect Storage | Enables storage devices to read/write from GPU memory directly |
Let’s say you're training a ResNet model on 16 A100 GPUs spread across 4 servers.
Without Magnum IO:
Each GPU sends gradients through the CPU and NIC to other GPUs
Data sync is slow and CPU-bound
With Magnum IO:
Gradients are sent directly GPU-to-GPU using NVLink + NCCL
GPUDirect RDMA handles networking, bypassing the CPU
Result: Faster convergence and lower training time
[GPU 1]───NVLink───[GPU 2]
│ │
└─GPUDirect──NIC───┘
│
Ethernet/IB
│
[Other Node]
GPUs communicate over NVLink/NVSwitch inside the node
Data goes through NICs directly using GPUDirect
NCCL handles collective operations like AllReduce, Broadcast
Use NVIDIA NGC Containers (e.g., PyTorch with NCCL preinstalled):
docker run --gpus all nvcr.io/nvidia/pytorch:24.02-py3
Enable GPUDirect RDMA in the kernel and driver
Kernel modules like nvidia_peermem must be loaded
NIC (e.g., Mellanox) must support RDMA
Use the NCCL backend in training code (a fuller PyTorch sketch follows after this list):
PyTorch:
dist.init_process_group(backend='nccl')
TensorFlow:
Uses horovod or tf.distribute under the hood with NCCL
Ensure correct topology:
Use nvidia-smi topo -m to see the NVLink/NVSwitch layout
Monitor performance:
Use nvprof, nsys, or DCGM metrics
Watch NCCL logs for communication efficiency
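Here is a minimal PyTorch sketch of the NCCL initialization step (assumes launching with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE; the model is a placeholder and the data pipeline is omitted):
import os

import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it launches
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap the model so gradient AllReduce runs over NCCL (NVLink/GPUDirect paths when available)
model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
Launched with, for example, torchrun --nproc_per_node=8 train.py on each node.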
| Practice | Description |
|---|---|
| Use NVSwitch/NVLink where available | Enables full bandwidth between GPUs |
| Use Mellanox NICs with RDMA | Required for GPUDirect RDMA |
| Align process placement with topology | Use tools like mpirun --map-by ppr:... to colocate processes |
| Use NGC containers with NCCL | Preconfigured and optimized for distributed training |
| Enable nv_peer_mem on all nodes | Required for GPUDirect to work correctly |
| Benefit | Result |
|---|---|
| Fast multi-GPU communication | Speeds up gradient exchange and model updates |
| CPU offloading | More resources available for preprocessing or other tasks |
| Optimized I/O paths | Reduces training time and improves throughput |
| Compatibility with AI frameworks | Works with PyTorch, TensorFlow, Horovod, etc. |
| Scalable to hundreds of nodes | Suitable for supercomputers and hyperscale AI infrastructure |
AI workloads are data-intensive. Poor storage architecture can cause:
Slow training speeds
Data loading bottlenecks
Underutilized GPUs
To avoid this, you must:
Use high-throughput file systems
Optimize I/O paths
Enable RDMA where possible
| File System | Notes |
|---|---|
| Lustre | High-performance, widely used in HPC |
| BeeGFS | Easy to scale, optimized for mixed workloads |
| IBM Spectrum Scale (GPFS) | Very scalable, enterprise-ready |
These systems:
Split large files across multiple storage servers
Support many concurrent readers/writers (perfect for multi-GPU training)
RDMA allows storage/network adapters to:
Transfer data directly between memory regions
Bypass CPU
Lower latency and increase bandwidth
Especially helpful when using GPUDirect Storage in large AI training clusters
Use NVMe SSDs for local caching or high-speed scratch space
Preload frequently used datasets onto fast local disks
Use data prefetching techniques (e.g., PyTorch DataLoader with prefetch_factor)
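A small sketch of prefetching with PyTorch's DataLoader (dataset is a placeholder for your own Dataset object; worker and prefetch counts are illustrative):
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,              # placeholder: your Dataset object
    batch_size=64,
    num_workers=8,        # parallel worker processes decode/augment in the background
    pin_memory=True,      # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,    # batches each worker keeps ready ahead of the training loop
)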
Kubernetes pods can access:
Persistent Volumes (PVs) mounted via:
NFS
iSCSI
CSI drivers
Object storage (e.g., MinIO, S3) via SDKs or FUSE
Shared network storage (e.g., Lustre via hostPath or CSI plugin)
Example (Persistent Volume Claim for NFS):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
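A pod can then mount that claim (a sketch; the image, command, and mount path are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  restartPolicy: OnFailure
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.02-py3
      command: ["python", "/data/train.py"]   # illustrative entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: dataset
          mountPath: /data
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: ai-data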
Manual deployment is error-prone. In real AI environments, use automation tools to ensure:
Consistency across clusters
Reproducibility
Scalability
Infrastructure-as-Code (IaC) for provisioning:
VMs
Storage
Networking
Common for cloud-based AI infrastructure (AWS, Azure, GCP)
Example:
resource "aws_instance" "gpu_worker" {
ami = "ami-12345678"
instance_type = "p4d.24xlarge"
}
Automates software setup:
GPU driver installation
Docker/NVIDIA runtime setup
BCM agent installation
Slurm configuration
Example playbook:
- hosts: gpu_nodes
  become: true
  tasks:
    - name: Install NVIDIA driver
      apt:
        name: nvidia-driver-535
        state: present
Kubernetes package manager
Use for:
Deploying GPU Operator
Installing monitoring stacks (Prometheus, Grafana)
Managing custom apps with config values
Example:
helm install gpu-operator nvidia/gpu-operator -n gpu-system
Store Kubernetes manifests in Git
Automatically sync deployments on change
Enables version control for infrastructure (see the Argo CD example after this list)
Use this for:
Production-grade MLOps
Secure deployment pipelines
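For instance, with Argo CD an Application resource points the cluster at a Git path and keeps it synced (repository URL and paths here are placeholders):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/infra-manifests   # placeholder repository
    targetRevision: main
    path: gpu-operator
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true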
| Task | Tools/Skills Required |
|---|---|
| Set up fast AI storage | Lustre, BeeGFS, NFS, GPFS, local NVMe |
| Optimize data access | Use prefetching, RDMA, local caching |
| Automate GPU node deployment | Ansible, Terraform |
| Automate Kubernetes apps | Helm, GitOps (ArgoCD/Flux) |
| Integrate with storage in K8s | PVCs, CSI drivers, object store SDKs |
| Validate end-to-end performance | Monitor training speed, GPU utilization, I/O stats |
| Subtopic | Key Tools / Commands |
|---|---|
| GPU Driver Installation | nvidia-smi, apt, CUDA toolkit |
| Container Runtime | Docker, NVIDIA Container Toolkit |
| Kubernetes GPU Setup | GPU Operator, kubectl, nvidia.com/gpu |
| BCM Cluster Management | bcminit, Slurm, BCM Web UI |
| DOCA & SmartNICs | BlueField, DOCA SDK, Secure Edge Inference |
| Fleet Command Deployment | Secure tokens, remote deployment, telemetry logs |
| Magnum IO | NCCL, GPUDirect RDMA, multi-node communication |
| Storage for AI | Lustre, RDMA, PVCs, NVMe |
| Deployment Automation | Terraform, Ansible, Helm, GitOps |
To ensure GPU containers or Pods can access external networks, internal services, and cluster components correctly.
| Mode | Description | Use Case |
|---|---|---|
| bridge | Default mode, NAT-based network with separate IP | Good for isolation; requires port mapping |
| host | Shares host’s network namespace | Better performance; limited isolation |
In bridge mode, the container gets a virtual IP and communicates via the host. In host mode, the container sees the same IP as the host, which is suitable for high-performance GPU networking (e.g., RDMA).
| Plugin | Feature Highlights |
|---|---|
| Flannel | Simple, stable; uses VXLAN or host-gw |
| Calico | Supports network policy, IP-in-IP, BGP |
| Cilium | eBPF-based, supports L7 security & observability |
Calico is the most widely used in GPU clusters due to its rich policy support and performance.
ifconfig # Check container or node network interface
ip a # View all interface and IP assignments
kubectl get pods -o wide # Show Pod IPs and node assignments
kubectl exec pod -- ping google.com # Verify external access
To prevent over-allocation and support fair GPU sharing in multi-user environments.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-team
spec:
  hard:
    requests.nvidia.com/gpu: "4"
This restricts the entire namespace to use no more than 4 GPUs total.
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-defaults
  namespace: ai-team
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: 1
      max:
        nvidia.com/gpu: 2
This ensures each container in the namespace will use 1 GPU by default, with a maximum of 2.
Use separate Namespaces for each team or user group.
Combine with RBAC policies to restrict access (an example RoleBinding follows this list).
Enforce quotas with LimitRange + ResourceQuota.
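A minimal RoleBinding sketch for the RBAC point above (the group name is a placeholder; edit is a built-in ClusterRole):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-team-editors
  namespace: ai-team
subjects:
  - kind: Group
    name: ai-team            # placeholder group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                 # built-in ClusterRole: manage workloads, no RBAC changes
  apiGroup: rbac.authorization.k8s.io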
Quickly verify that a fresh cluster deployment is healthy and ready for GPU workloads.
#!/bin/bash
echo "== Checking GPU =="
nvidia-smi || echo "GPU driver not working"
echo "== Checking Docker & GPU Toolkit =="
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi || echo "GPU container failed"
echo "== Checking Kubernetes =="
kubectl get nodes
kubectl describe node $(hostname)
echo "== Checking BCM (if installed) =="
bcmsystem status || echo "BCM not installed or not initialized"
This script validates GPU drivers, container runtime, Kubernetes node readiness, and optional BCM agent status.
| Issue | Root Cause | Solution |
|---|---|---|
| nvidia-smi shows no GPU | Driver not installed or kernel mismatch | Run dmesg, check lsmod \| grep nvidia, reinstall the driver |
| nvidia-smi fails inside container | Missing NVIDIA Container Toolkit or --gpus not set | Install toolkit, start container with --gpus all |
| Pod stuck in Pending | No GPU node or device plugin not running | Check kubectl get pods -n gpu-operator |
| BCM node not recognized | Wrong IP or missing join command | Check network, rerun bcminit --join |
Diagnostic Tools:
journalctl -u kubelet
docker logs <container-id>
kubectl describe pod <name>
kubectl get events
Build efficient and portable GPU images for training or inference.
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt update && apt install -y python3 python3-pip
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . /app
WORKDIR /app
CMD ["python3", "main.py"]
| Technique | Description |
|---|---|
| Clean intermediate layers | apt clean && rm -rf /var/lib/apt/lists/* |
| Multi-stage builds | Compile in one stage, copy only the runtime into the final image (see the sketch below) |
| Use docker buildx | Enables cross-platform builds (e.g., ARM64) |
Example:
docker buildx build --platform=linux/amd64,linux/arm64 -t myimage:gpu .
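To illustrate the multi-stage row from the table above, here is a sketch (image tags mirror the Dockerfile shown earlier; copying the virtual environment works because both stages use the same Ubuntu and Python versions):
# Build stage: heavier devel image with compilers for pip packages that build extensions
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt update && apt install -y python3 python3-pip python3-venv && \
    apt clean && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN python3 -m venv /opt/venv && /opt/venv/bin/pip install -r requirements.txt

# Runtime stage: slimmer runtime image; only the prebuilt virtual environment is copied in
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt update && apt install -y python3 && apt clean && rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
COPY . /app
WORKDIR /app
CMD ["python3", "main.py"]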
| CUDA Version | Required Driver Version | Supported Architectures |
|---|---|---|
| CUDA 12.2 | ≥ 535.x | Hopper, Ada, Ampere |
| CUDA 11.8 | ≥ 510.x | Ampere, Volta, Turing |
| CUDA 10.2 | ≥ 440.x | Volta, Turing |
Best Practices:
Always match CUDA version with driver version.
Use nvidia-smi to verify driver/CUDA compatibility.
Avoid mixing too old drivers with new CUDA versions.
Why must administrators verify compatibility between NVIDIA GPU drivers and CUDA Toolkit versions before deployment?
Because CUDA applications require driver versions that support the specific CUDA runtime used by the framework.
CUDA frameworks rely on features implemented in compatible driver versions. If the GPU driver is older than the required CUDA runtime, applications may fail to start or produce runtime errors. Administrators must consult compatibility matrices provided by NVIDIA to ensure that installed drivers support the intended CUDA version. This is particularly important when deploying machine learning frameworks such as PyTorch or TensorFlow, which bundle specific CUDA dependencies. A common mistake is upgrading CUDA without updating the GPU driver, resulting in initialization failures when the framework attempts to access GPU capabilities not supported by the installed driver.
Demand Score: 84
Exam Relevance Score: 90
What command can administrators use to verify that NVIDIA drivers are properly installed and that GPUs are recognized by the system?
Administrators typically run the nvidia-smi command.
nvidia-smi queries the installed NVIDIA driver and displays information about detected GPUs, including device model, driver version, memory usage, and active processes. When drivers are correctly installed, the command returns a detailed table showing GPU devices and operational status. If the command fails or returns an error, it often indicates a driver installation issue, missing kernel modules, or unsupported hardware configuration. Administrators frequently use this command immediately after installation to confirm that GPUs are accessible before deploying CUDA or containerized AI workloads.
Demand Score: 78
Exam Relevance Score: 86
Why is containerization commonly used when deploying AI workloads on GPU infrastructure?
Containerization provides consistent runtime environments for AI frameworks and GPU dependencies.
AI workloads often depend on complex combinations of libraries, CUDA versions, and deep learning frameworks. Containers package these dependencies into reproducible environments that can run consistently across different servers. Using container technologies such as Docker allows administrators to deploy workloads without manually configuring each node's software stack. GPU-enabled containers can access host GPUs through specialized runtimes, enabling scalable AI deployments. Without containerization, dependency conflicts may arise when multiple frameworks require different CUDA versions or library configurations.
Demand Score: 74
Exam Relevance Score: 83
What deployment issue occurs if kernel modules required by the NVIDIA driver are not loaded?
The operating system will not detect or properly interface with the GPU hardware.
NVIDIA drivers rely on kernel modules that enable communication between the operating system and GPU hardware. If these modules fail to load during installation or system boot, the GPU becomes inaccessible to applications. Administrators may observe errors when running GPU utilities or attempting to execute CUDA workloads. This issue can occur due to kernel version mismatches, incomplete driver installations, or secure boot restrictions. Ensuring kernel modules are loaded correctly is a critical validation step after deploying GPU drivers.
Demand Score: 69
Exam Relevance Score: 82
Why should administrators test GPU functionality after deployment before onboarding AI workloads?
Testing ensures the infrastructure is correctly configured and capable of executing GPU workloads.
Initial testing confirms that drivers, CUDA libraries, and runtime components function correctly before production workloads are deployed. Administrators typically run diagnostic tools or simple CUDA sample applications to verify GPU computation capability. Without testing, misconfigurations such as incorrect drivers, missing libraries, or runtime conflicts may only appear after critical workloads are scheduled. Early validation prevents failures during model training or inference deployments and ensures that GPU resources are ready for operational workloads.
Demand Score: 65
Exam Relevance Score: 80