NCP-AIO Troubleshooting and Optimization

Troubleshooting and Optimization Detailed Explanation

1. GPU Monitoring and Diagnostics

Why This Matters

In high-performance AI environments, GPUs are your most valuable and expensive resources.
You must ensure they are:

  • Healthy

  • Efficiently utilized

  • Properly monitored

Misconfiguration, overheating, ECC errors, or memory issues can silently reduce performance or cause job failures.

Tool 1: DCGM (Data Center GPU Manager)

DCGM is a low-overhead GPU telemetry and diagnostics framework from NVIDIA. It can be run:

  • Directly on bare-metal

  • Inside Docker containers

  • In Kubernetes environments as part of the GPU Operator stack

Use Cases of DCGM:
Use Case Example
Monitor GPU temperature Alert if above 85°C
Track ECC errors Detect GPU memory instability
Enforce power or thermal limits Trigger alerts on breach
Monitor application behavior Collect metrics like memory, SM, PCIe usage
Export metrics to Prometheus Build dashboards with Grafana
Key DCGM Commands
  1. List Available GPUs:
dcgmi discovery -l
  2. Run Health Checks:
dcgmi health -c

This checks:

  • Temperature

  • PCIe connectivity

  • Power

  • ECC

  3. Get Performance Stats:
dcgmi stats --groupId 0

This gives:

  • GPU utilization

  • Memory usage

  • PCIe throughput

  • SM clock rate

  • Power draw

  4. Collect I/O and NVLink Counters:
    DCGM supports advanced custom metrics:
  • DCGM_FI_DEV_NVLINK_THROUGHPUT_TX

  • DCGM_FI_DEV_POWER_USAGE
These metrics can be exported to Prometheus via the DCGM exporter and scraped from there.
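
For a quick look at such fields from the CLI, dcgmi dmon can stream selected field IDs; a minimal sketch (the field IDs shown are the commonly documented ones for temperature and power, but may vary by DCGM version):

dcgmi dmon -e 150,155 -d 1000
# -e : comma-separated DCGM field IDs (150 = GPU temperature, 155 = power usage)
# -d : sampling interval in milliseconds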

Tool 2: nvidia-smi – Your GPU Swiss Army Knife

A command-line tool installed with the NVIDIA driver.

Basic Usage:
nvidia-smi

This displays:

  • Driver version

  • CUDA version

  • GPU name

  • Memory usage

  • Power draw

  • Temperature

  • Active processes

Advanced Commands
  1. View Topology and NVLink Connectivity:
nvidia-smi topo -m

Outputs a matrix showing GPU interconnect speeds and NVLink links:

        GPU0    GPU1    CPU Affinity
GPU0     X      NV#           0-19
GPU1    NV#      X            0-19
  2. Continuous Monitoring (like top for GPUs):
nvidia-smi dmon

Shows live output of:

  • GPU utilization

  • Memory usage

  • Encoder/decoder usage

  • PCIe stats

  3. Check MIG Configurations:
nvidia-smi mig -lgi

List GPU instances created via MIG.

Example Scenario

Symptom: GPU memory is full, job crashes.

Diagnosis:

  1. Use nvidia-smi to check memory usage in real-time.

  2. Use dcgmi stats to observe if memory pressure is rising over time.

  3. If running in Kubernetes: export metrics via DCGM exporter and observe in Grafana.

Action:

  • Reduce batch size

  • Switch to mixed precision (FP16)

  • Fix the data loader (e.g., an infinite loop that keeps accumulating data in memory)
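
To watch GPU memory during this kind of diagnosis, nvidia-smi can be queried on an interval; a small sketch using standard --query-gpu fields:

# Print used/total GPU memory and utilization every 5 seconds
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu --format=csv -l 5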

Summary of Skills for GPU Monitoring

Tool Purpose
nvidia-smi General GPU monitoring and topology
nvidia-smi dmon Real-time stats like GPU load & memory
DCGM CLI (dcgmi) Deeper health metrics and diagnostics
Prometheus + DCGM Dashboards for cluster-wide monitoring

2. Docker & Container Issues

Why Docker Integration Is Critical

Most AI workloads today run in containers to ensure:

  • Portability

  • Dependency isolation

  • Easy deployment

However, if Docker is not properly configured to interface with NVIDIA GPUs, containers may:

  • Fail to detect GPUs

  • Crash during runtime

  • Perform poorly (e.g., no GPU acceleration)

Common Docker GPU Failures

Symptom Possible Cause
CUDA device not found NVIDIA runtime not enabled
Container runs, but uses only CPU Started without --gpus all flag
nvidia-smi fails in container Missing drivers or runtime
MIG mode GPU not visible in container MIG instance not assigned to container

Fix 1: Always Start Containers With GPU Access

When running containers, explicitly request GPUs:

docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

You can limit the container to a single GPU with --gpus device=0, or assign a MIG instance directly using --gpus '"device=UUID"'.

Fix 2: Validate NVIDIA Container Runtime

Ensure Docker is configured to use nvidia runtime.

Check available runtimes:
docker info | grep -i runtime

You should see:

Runtimes: runc nvidia
Default Runtime: runc

If nvidia is missing, install the NVIDIA Container Toolkit.
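
On current toolkit versions, the runtime can be registered with the nvidia-ctk helper (a sketch; older setups edited /etc/docker/daemon.json by hand):

# Write the nvidia runtime entry into Docker's daemon.json, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker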

To test runtime:
docker run --rm --runtime=nvidia nvidia/cuda:12.0-base nvidia-smi

If that works, the runtime is installed correctly.

Fix 3: Check NVIDIA Container Toolkit Installation

Steps for Ubuntu:
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

Also verify:

  • Kernel modules loaded: nvidia, nvidia_uvm, nvidia_drm

  • Docker version ≥ 19.03
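
A quick check that the kernel modules listed above are actually loaded:

lsmod | grep nvidia
# Expect entries such as nvidia, nvidia_uvm, and nvidia_drm in the output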

Fix 4: Debugging MIG Containers

Problem: MIG instance not accessible in container

Fix:

  • Make sure MIG instance is created using nvidia-smi mig -cgi

  • Assign instance to container with exact GPU UUID:

docker run --gpus '"device=GPU-UUID"' ...

Use this command to list instances:

nvidia-smi -L
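
For reference, a hedged sketch of setting up MIG from scratch (profile names such as 1g.5gb depend on the GPU model, and enabling MIG mode may require draining workloads or resetting the GPU):

# Enable MIG mode on GPU 0, create a 1g.5gb GPU instance plus its compute instance,
# then list the resulting MIG device UUIDs to pass to --gpus
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C
nvidia-smi -L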

Real-World Debugging Example

Symptom: A container launches but shows no GPU available

Step-by-step:
  1. Inside container:
nvidia-smi

→ Fails

  2. On host:
nvidia-smi        # works
docker info       # check runtime
  3. Fix:
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker

Re-run:

docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

Summary of Skills for Container Troubleshooting

Skill Tool / Command
Confirm GPU access in container nvidia-smi inside container
Use GPU-enabled container docker run --gpus all ...
Check runtime support docker info | grep -i runtime
Assign MIG GPU instance Use --gpus "device=UUID" and nvidia-smi -L
Install NVIDIA runtime properly nvidia-container-toolkit, restart Docker

3. Kubernetes Troubleshooting

Why GPU Troubleshooting in Kubernetes Is Tricky

Kubernetes abstracts away hardware details. So when something goes wrong, such as:

  • Pods not seeing GPUs

  • Jobs stuck in Pending

  • GPU metrics not being collected

you’ll need to dig into multiple layers:

  • Node configuration

  • Pod YAML

  • NVIDIA plugins

  • Cluster resource scheduling

Common GPU-Related Kubernetes Issues

Symptom Likely Root Cause
Pod does not detect GPU nvidia-device-plugin not running or misconfigured
Pod stuck in Pending No node with available GPU or incorrect resources spec
Device plugin DaemonSet failed NodeSelector or taint mismatch, missing runtime support
Metrics not showing in Prometheus dcgm-exporter not running or not scraped properly

Fix 1: Check if GPU Plugin Is Running

NVIDIA Device Plugin is deployed as a DaemonSet. Check its status:

kubectl get daemonset -n gpu-operator

You should see:

nvidia-device-plugin-daemonset   DESIRED: X   READY: X

Check logs:

kubectl logs -n gpu-operator <device-plugin-pod-name>

If it fails to start:

  • Check nvidia-smi on the node

  • Check Docker runtime (nvidia must be available)

Fix 2: Verify Pod Resource Requests

Kubernetes won’t schedule a GPU pod unless you request the GPU correctly:

Example Pod YAML:
resources:
  limits:
    nvidia.com/gpu: 1

Missing this will result in the pod being scheduled without GPU access.
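
For context, a minimal complete pod spec with a GPU request might look like this (pod name and image tag are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1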

Also confirm:

  • The node has available GPUs

  • You are not exceeding resource quotas

Fix 3: Check Node Labels and Taints

Device Plugin only works on nodes that:

  • Have GPU hardware

  • Are correctly labeled (nvidia.com/gpu.present=true)

  • Do not have taints that block the DaemonSet (or have matching tolerations)

Check node label:
kubectl get nodes --show-labels
Check taints:
kubectl describe node <node-name> | grep Taint

If there's a taint like:

nvidia.com/gpu=present:NoSchedule

Make sure your pod has:

tolerations:
- key: "nvidia.com/gpu"
  operator: "Exists"
  effect: "NoSchedule"

Fix 4: Troubleshoot with kubectl describe and Logs

If a pod fails to start:

kubectl describe pod <pod-name>

Look for:

  • Failed scheduling

  • Events at the bottom

  • Misconfigured requests

To inspect logs:

kubectl logs <pod-name>

To debug the container runtime interface:

crictl ps -a

Real-World Debugging Scenario

Symptom: Pod is in Pending state with GPU request.

Diagnosis:

kubectl describe pod train-gpu-job

Output:

0/4 nodes are available: 4 Insufficient nvidia.com/gpu.

→ No node has free GPU

Fix:

  • Free GPU on a node

  • Add new GPU-enabled node

  • Reduce nvidia.com/gpu request

Summary of Kubernetes Troubleshooting Skills

Task Command / Method
Check GPU DaemonSet kubectl get daemonset -n gpu-operator
Inspect GPU plugin logs kubectl logs on plugin pod
Validate resource requests Pod YAML must include limits.nvidia.com/gpu: 1
Check node labels and taints kubectl describe node
Troubleshoot pod scheduling failures kubectl describe pod, review Events section
Confirm runtime is GPU-compatible docker info, nvidia-smi on the node

4. Performance Bottlenecks

Overview

Even when infrastructure is working correctly, AI workloads can suffer from:

  • Low GPU utilization

  • High job runtime

  • Unexpected memory errors

  • Slow I/O throughput

To fix this, you must profile all layers of the stack:

  • CPU

  • Memory

  • GPU

  • Storage

  • Interconnect

1. CPU Bottlenecks

Symptoms:
  • GPU is idle or lightly loaded

  • Data preprocessing is slow

  • CPU usage near 100% on one or few threads

Tools:
  • htop: View CPU cores, load balance

  • top: General process-level view

  • mpstat -P ALL: Per-core CPU usage

  • Prometheus + Grafana: Cluster-wide CPU visualization

Fixes:
  • Increase data loading workers (e.g., num_workers in PyTorch)

  • Move preprocessing to GPU (e.g., use NVIDIA DALI)

  • Balance threads across cores

2. Memory Bottlenecks

Symptoms:
  • Pod or container killed with OOMKilled

  • Frequent swapping or paging

  • dmesg shows out-of-memory (OOM) killer activity

Diagnostic Commands:
dmesg | grep -i oom
free -h
vmstat 1
Fixes:
  • Increase memory allocation in Kubernetes:
resources:
  requests:
    memory: "16Gi"
  • Optimize memory usage in code

  • Cache data properly instead of reading repeatedly

3. GPU Bottlenecks

Symptoms:
  • GPU is underutilized (< 30%) despite active training

  • Training job is slow or stuck on data loading

Tools:
  • nvidia-smi dmon: View real-time GPU stats

  • dcgmi stats: Monitor streaming multiprocessor (SM) activity

  • Nsight Systems: Timeline of kernel execution and memory copies

Common Causes:
  • Small batch sizes

  • Data loader too slow

  • Poor kernel performance (e.g., low occupancy)

Fixes:
  • Use larger batches (if memory allows)

  • Use mixed precision (FP16) for better throughput

  • Optimize data pipeline

  • Use faster augmentation tools (e.g., DALI)

4. I/O Bottlenecks

Symptoms:
  • GPU spends time waiting for data

  • Job is I/O-bound, not compute-bound

  • Logs show data not loading in time

Diagnostic Tools:
  • iostat, iotop: Disk read/write stats

  • nvidia-smi dmon: Look for spikes in memory throughput

  • DCGM custom metrics for I/O counters

  • Nsight Systems: Shows memory copy stalls
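
A quick disk-level check with the tools above (iostat comes from the sysstat package, iotop from the iotop package):

iostat -xz 1      # extended per-device stats; sustained high %util and await indicate saturation
sudo iotop -o     # show only processes currently performing I/O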

Fixes:
  • Use parallel file systems (Lustre, BeeGFS)

  • Enable RDMA if supported

  • Prefetch data in code

  • Use TFRecords, LMDB, or WebDataset for optimized storage formats

Real-World Scenario

Symptom: GPU utilization is only 20% during training

Diagnosis:

  1. nvidia-smi dmon → GPU not active

  2. top → Python process active, CPU at 100%

  3. Python log shows: “Loading batch…”

Cause: Data loading is the bottleneck

Fix:

  • Increase num_workers in DataLoader

  • Move preprocessing to GPU with NVIDIA DALI

Summary: Common Bottlenecks & Fixes

Bottleneck Type | Symptom | Tool(s) | Common Fixes
CPU | High wait time, low GPU usage | htop, mpstat, Prometheus | Tune data loader, use GPU-accelerated preprocessing
Memory | OOMKilled, swap, slow system | dmesg, free, vmstat | Allocate more memory, optimize memory usage
GPU | Low utilization, slow training | nvidia-smi, Nsight Systems | Increase batch size, use FP16, optimize kernels
I/O | Data stalls, slow dataset access | iostat, Nsight, DCGM | Use RDMA, faster storage, caching, better formats

5. Interconnect Analysis (NVLink & NVSwitch)

Why GPU Interconnects Matter

In multi-GPU systems (like DGX/HGX servers), GPUs need to exchange data at high speeds—especially during:

  • Distributed training (e.g., gradient sharing)

  • Model parallelism (splitting parts of models across GPUs)

  • Tensor or batch parallel computation

Slow or misconfigured interconnects lead to:

  • High latency

  • Communication bottlenecks

  • Uneven performance across GPUs

NVLink: High-Speed GPU-to-GPU Communication

What is NVLink?
  • A direct, high-bandwidth connection between NVIDIA GPUs

  • Faster than PCIe (up to 600 GB/s bidirectional on newer models)

  • Used in most modern HGX and DGX systems

Key Features:
Feature Description
Peer-to-peer access GPUs can directly access each other’s memory
Transparent to user No code changes needed if using NCCL or CUDA
Scalable Supports topologies up to 8 GPUs interconnected
View NVLink Topology

Use:

nvidia-smi topo -m

Example output:

        GPU0    GPU1    GPU2
GPU0     X      NV1     NV2
GPU1    NV1      X      SYS
GPU2    NV2     SYS      X

Legend:

  • NV1, NV2: Connected via NVLink

  • SYS: Communication via PCIe (slower)

Interpret This Output:
  • Direct NVLink between GPUs is best

  • If traffic goes over SYS, performance will degrade

NVSwitch: Full-Bandwidth All-to-All Fabric

What is NVSwitch?
  • A switching fabric that allows any GPU to talk to any other GPU at full NVLink speed

  • Used in systems like DGX A100, DGX H100, or HGX-8/16

  • Replaces the limited point-to-point NVLink architecture

Why it Matters:
  • Uniform communication bandwidth

  • Reduces “topology bottlenecks”

  • No need to worry about which GPU talks to which

NCCL Tests: Measure Interconnect Performance

NCCL (NVIDIA Collective Communication Library) is optimized for:

  • AllReduce, Broadcast, Gather, Scatter

  • Multi-GPU training frameworks like PyTorch, TensorFlow, Horovod

To Run Bandwidth/Latency Tests:

Clone and build:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1

Run a simple test:

mpirun -np 4 ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1

This will show:

  • Bandwidth (GB/s)

  • Latency per GPU pair

Tip:

Run tests between GPUs with and without NVLink to compare performance.
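
NCCL can also report which transport it selects at runtime; a hedged sketch using its standard debug environment variables:

# Print NCCL initialization and topology decisions (look for NVLink/P2P vs. SHM/NET paths)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH python train.py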

Real-World Debugging Example

Symptom: Multi-GPU training is slower on some nodes.

Steps:
  1. Run nvidia-smi topo -m
    → GPU2 communicates over SYS, others via NVLink

  2. Run NCCL tests:
    → Lower bandwidth between GPU2 and others

  3. Fix:

    • Rewire NVLink bridge (if physical)

    • Reassign workload to NVLink-connected GPUs only

Summary: Interconnect Troubleshooting

Tool / Method Use Case
nvidia-smi topo -m View GPU connectivity and NVLink layout
NCCL Tests Benchmark GPU-to-GPU bandwidth
Nsight Systems Identify communication stalls
DCGM Bandwidth Metrics Export NVLink/PCIe stats over time
Hardware Check Confirm NVLink bridges or NVSwitch status

Best Practices

Best Practice Why It Matters
Place jobs on NVLink-connected GPUs Ensures fast peer-to-peer communication
Monitor GPU interconnect bandwidth Avoid silent slowdowns during training
Use NVSwitch if available Guarantees full-bandwidth across all GPUs
Run NCCL benchmarks after cluster deployment Validate performance before real training begins

6. Nsight Systems & Nsight Compute

Why Use Profiling Tools?

Even if your code runs and uses GPUs, performance may still be poor due to:

  • Inefficient kernel launches

  • Poor memory access patterns

  • Synchronization overhead

  • I/O stalls or CPU bottlenecks

To identify these issues, NVIDIA offers two powerful tools:

Tool | Focus Level | What It Analyzes
Nsight Systems | System-wide | Timeline of CPU, GPU, memory, I/O, and kernels
Nsight Compute | Kernel-level (CUDA) | Thread occupancy, memory coalescing, warps

Nsight Systems

Purpose:
  • Provides a global view of your application

  • Helps spot issues across CPU, GPU, and system resources

Key Features:
  • Timeline view of:

    • Kernel launches

    • Memory copies

    • CPU thread activity

    • API calls (e.g., cuMemcpy, cudaLaunchKernel)

  • Zoom into specific regions to spot stalls

Launch via CLI:
nsys profile -o output_name ./your_script.sh

Or for Python:

nsys profile -o tf_model python train.py
Output:

Generates .qdrep and .nsys-rep files, viewable in:

  • Nsight Systems GUI

  • Or CLI summary:

nsys stats output_name.qdrep
Common Issues Nsight Systems Reveals:
Symptom Interpretation
Kernel launches delayed CPU bottleneck or sync blocking
Long memory copies Poor data locality or missing page pinning
GPU idle while CPU active Data not arriving fast enough
Lots of small kernel launches Launch overheads outweigh compute gains

Nsight Compute

Purpose:
  • Dive into individual kernel performance

  • Understand what limits your GPU execution

Key Metrics:
Metric What It Tells You
Occupancy GPU thread block utilization
Memory Coalescing Whether threads access memory efficiently
Warp Execution Efficiency Fraction of warps that are active
Shared Memory Utilization Helps tune for memory-bound kernels
Launch via CLI (ncu is the Nsight Compute command-line profiler; nv-nsight-cu-cli is its legacy name):
ncu -o report ./app

To analyze a single kernel by name:

ncu --kernel-name <kernel-name> -o report ./app

Or open the .ncu-rep in Nsight Compute GUI for visual profiling.

Common Issues Nsight Compute Detects:
Issue | Description | Suggested Fix
Low occupancy | Blocks too small, too few threads | Increase thread count or adjust block size
Poor memory coalescing | Unaligned memory access | Use contiguous memory layouts
Branch divergence | Threads follow different code paths | Refactor kernel logic
High register pressure | Too many registers per thread | Optimize variable usage or split kernels

Example Use Case: PyTorch Model is Slow

  1. Run:
nsys profile -o myrun python train.py
  2. Inspect Timeline:
  • GPU is idle most of the time

  • cudaMemcpyAsync takes 80% of time

→ Suggests data is not on GPU or input pipeline is too slow

  3. Fix:
  • Move preprocessing to GPU (e.g., DALI)

  • Use pinned memory for data transfers

  • Batch memory copies
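
A small PyTorch sketch of the pinned-memory fix (dataset and variable names are illustrative):

from torch.utils.data import DataLoader

# Pin host memory so host-to-device copies can run asynchronously
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)
for images, labels in loader:
    images = images.to("cuda", non_blocking=True)   # async copy from pinned memory
    labels = labels.to("cuda", non_blocking=True)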

Summary: When to Use Each Tool

Tool Use Case
Nsight Systems End-to-end profiling, timeline view
Nsight Compute In-depth kernel analysis, tuning CUDA performance

Practical Tips

Tip Why It Helps
Always profile after major code changes Prevent hidden regressions
Use Nsight in early development, not just final Catch architectural mistakes early
Pair with Prometheus + Grafana for correlation Relate kernel issues with system-level metrics
Use multiple runs to find consistent patterns Avoid noise from one-off spikes or jitter

7. Training Optimization Strategies

Why Optimization Matters

Training large AI models is expensive and time-consuming. Without optimization:

  • Training takes days instead of hours

  • GPU resources are underutilized

  • Scaling across GPUs becomes inefficient

By optimizing training, you can:

  • Reduce training time

  • Lower compute cost

  • Improve model convergence stability

1. Data Pipeline Optimization

Symptoms of Poor Data Pipeline:
  • GPU idle time is high

  • Training slow despite low CPU load

  • I/O waits visible in profiling tools (e.g., Nsight Systems)

Optimization Techniques:
Technique Benefit
Data prefetching Fetch next batch while current is training
Parallel data loading Use num_workers > 0 in PyTorch
Efficient formats Use TFRecords, WebDataset, or LMDB
GPU-based decoding Use NVIDIA DALI for image/video data
Data sharding Split data across nodes to reduce I/O
PyTorch Example:
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32, num_workers=8, prefetch_factor=2)

2. I/O Optimization

Slow data loading from storage (especially NFS) can bottleneck training.

Solutions:
  • Use parallel file systems (Lustre, BeeGFS, GPFS)

  • Use local NVMe SSDs for temporary caching

  • Enable RDMA for remote file access

  • Use GPUDirect Storage if supported

3. GPU Usage Optimization

Most modern GPUs support mixed precision training (e.g., FP16 or BF16), which improves performance.

Techniques:
Optimization Benefit
Mixed precision Use Tensor Cores, reduce memory
Larger batch sizes Better GPU utilization
Gradient accumulation Simulate large batches on small GPUs
PyTorch Example:
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
for data, target in loader:
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
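
Gradient accumulation (listed in the table above) composes with the same AMP loop; a hedged sketch where accumulation_steps is an illustrative value:

# Accumulate gradients over several small batches to emulate a larger effective batch
accumulation_steps = 4
optimizer.zero_grad()
for step, (data, target) in enumerate(loader):
    with autocast():
        loss = loss_fn(model(data), target) / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()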

4. Distributed Communication Optimization

When using multiple GPUs/nodes, communication can dominate training time.

Techniques:
Method | Tool / Library | Benefit
AllReduce optimization | NCCL | Faster gradient sync
Horovod Fusion Buffer | Horovod | Reduces network overhead
Overlapping compute & comm | Native PyTorch DDP or Horovod | Better GPU utilization
Use NVLink/NVSwitch | Hardware | Low-latency inter-GPU traffic
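
A sketch of overlapping compute and communication with native PyTorch DDP (assumes the script is launched with torchrun, which sets LOCAL_RANK, and that model is already defined):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # NCCL handles AllReduce over NVLink/IB
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.cuda(local_rank), device_ids=[local_rank])
# Gradient AllReduce now runs automatically and is overlapped with the backward pass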

5. Scheduling Strategy

If multiple jobs or users share a cluster, poor scheduling can result in:

  • Resource starvation

  • GPU fragmentation

  • Overhead from context switching

Best Practices:
Technique Benefit
Bind jobs to NVLink-connected GPUs Maximizes inter-GPU speed
Set job affinity (K8s, Slurm) Avoids placement on overloaded nodes
Use MIG (on A100/H100) Enables fair GPU partitioning
Prioritize critical jobs Use Slurm QOS or K8s priority class
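
A minimal Slurm batch script sketch for the GPU and affinity settings above (GRES name and counts are illustrative and depend on the cluster configuration):

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:2           # request 2 GPUs on one node
#SBATCH --cpus-per-task=16     # leave enough CPU for data loading
srun python train.py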

Summary: Training Optimization Techniques

Area | Method | Tool/Tech Used
Data | Prefetch, shard, parallel load | DALI, PyTorch DataLoader
I/O | Parallel FS, RDMA, caching | BeeGFS, GPUDirect Storage
GPU | FP16, batch size, accumulation | PyTorch AMP, Tensor Cores
Communication | AllReduce, NCCL, overlapping transfer | NCCL, Horovod, DDP
Scheduling | Topology-aware scheduling | NVLink, MIG, Kubernetes, Slurm

8. Logs, Debugging Tools & Common Error Types

1. Why Logs Are Essential

In production environments, logs are your first line of defense when something breaks.

They help answer:

  • Why did my job crash?

  • Was it an infrastructure issue or a code bug?

  • Did the system run out of memory, GPU, or file descriptors?

2. Key Logging and Debugging Tools

Tool Use Case
journalctl System-wide logs (including Docker and Slurm)
docker logs <id> Container-specific logs
kubectl logs <pod> Kubernetes container logs
Slurm logs Scheduler + compute node logs
dmesg Kernel-level logs (e.g., OOM errors)
System Logs
journalctl -xe

Use for checking driver issues, hardware failures, or runtime crashes.

Docker Logs
docker ps -a
docker logs <container-id>

Check for:

  • CUDA errors

  • Missing libraries

  • Runtime crashes

Kubernetes Logs

Check the logs of a pod:

kubectl logs <pod-name>

For multi-container pods:

kubectl logs <pod-name> -c <container-name>

Describe pod to see failure reasons:

kubectl describe pod <pod-name>
Slurm Logs

Slurm logs are usually found in:

/var/log/slurm/slurmctld.log
/var/log/slurm/slurmd.log

Useful when:

  • Jobs are not starting

  • Node goes offline

  • GPU resources not allocated correctly
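
Alongside the log files, these standard Slurm commands help narrow down where the failure sits:

squeue -u $USER                  # your pending/running jobs and their pending reasons
sinfo -N -l                      # per-node state (idle, alloc, drain, down)
scontrol show node <node-name>   # node details, including Gres and the drain reason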

3. Common Error Types You Should Recognize

Here’s a cheat sheet of the most frequent runtime errors you’ll see and how to interpret them.

GPU Not Found

Error Message:

CUDA device not available

Diagnosis:

  • NVIDIA driver not loaded

  • Docker container lacks --gpus flag

  • MIG instance not assigned

Fix:

  • Run nvidia-smi on host and in container

  • Check runtime settings

  • Use --gpus all or --gpus "device=UUID"

Out-of-Memory (OOM)

Error Message:

Killed: Out of memory

Diagnosis:

  • Container memory request too small

  • Job exceeded available RAM or GPU memory

Fix:

  • Increase resources.requests.memory in K8s

  • Tune batch size or model size

  • Use mixed precision (FP16)

GPU Device Busy

Error Message:

GPU already in use

Diagnosis:

  • Another process is using GPU

  • Container trying to access unavailable MIG partition

Fix:

  • Use nvidia-smi to inspect usage

  • Kill unused processes

  • Reschedule job to another node

Job Hangs / No Output

Symptoms:

  • Job appears to run but no logs or model output

Diagnosis:

  • Inter-GPU communication bottleneck

  • Infinite loop or deadlock in code

  • Blocked I/O

Fix:

  • Profile with Nsight Systems

  • Add debug-level logging

  • Test components in isolation

Final Summary: Must-Have Debugging Skills

Issue Type Tool / Method
Crash Diagnosis journalctl, docker logs, kubectl logs
GPU Access nvidia-smi, container runtime settings
Memory Issues dmesg, Prometheus memory metrics
Slurm Failures slurmctld.log, squeue, sinfo
Stuck Jobs Profiling tools + log inspection

Troubleshooting and Optimization (Additional Content)

1. GPU Monitoring & DCGM Integration

DCGM + Prometheus + Grafana Visualization Architecture

To understand how GPU metrics flow from GPU hardware to visualization dashboards:

Data Flow Overview:

[NVIDIA GPU Hardware]
        ↓
[DCGM Exporter DaemonSet]
        ↓
[Prometheus Scraper]
        ↓
[Grafana Dashboard]
  • DCGM Exporter collects GPU metrics via the DCGM library (the same data exposed by nvidia-smi and dcgmi).

  • Prometheus scrapes metrics via dcgm-exporter:9400/metrics.

  • Grafana queries Prometheus and displays GPU utilization, power, temperature, and ECC stats.
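
Once the metrics reach Prometheus, Grafana panels typically query dcgm-exporter metric names such as these (the first gives per-GPU utilization, the second per-GPU temperature, the third framebuffer memory in use; exact labels and enabled fields depend on the exporter configuration):

avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)
max by (gpu) (DCGM_FI_DEV_GPU_TEMP)
sum by (gpu) (DCGM_FI_DEV_FB_USED)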

Helm-Based DCGM Exporter Installation

To deploy DCGM Exporter via Helm in a Kubernetes GPU cluster:

Sample values.yaml:

daemonset:
  enabled: true

serviceMonitor:
  enabled: true

resources:
  limits:
    nvidia.com/gpu: 1

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9400"

Then run:

helm repo add nvidia https://nvidia.github.io/k8s-dcgm
helm install dcgm-exporter nvidia/dcgm-exporter -f values.yaml

2. Kubernetes GPU Optimization Enhancements

Topology-Aware Scheduling (v1.24+)

This feature ensures Pods are scheduled based on:

  • GPU NUMA locality

  • NVLink connectivity

  • CPU-GPU binding consistency

Benefits:

  • Reduces cross-NUMA latency

  • Enhances inter-GPU bandwidth

  • Improves overall GPU pipeline efficiency

Usage:

  • Use TopologyManager and CPUManager in Kubelet.

  • Combine with NVIDIA Device Plugin using topologyPolicy: best-effort.
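
A kubelet configuration sketch for enabling these managers (standard KubeletConfiguration fields; the policy values shown are illustrative choices):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static            # pin exclusive CPUs for Guaranteed pods
topologyManagerPolicy: best-effort  # align CPU/GPU NUMA placement where possible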

NVIDIA GPU Operator Structure Diagram (Simplified)

Component Role
nvidia-device-plugin Reports GPU devices to kubelet
dcgm-exporter Exposes GPU metrics to Prometheus
nvidia-driver-daemonset Ensures host has correct GPU driver
mig-manager Manages MIG profiles if enabled
validator Validates installation and readiness

All components are deployed via a Helm chart and run as DaemonSets or Pods on GPU nodes.

3. Training Optimization: Memory and Interconnects

Memory Fragmentation in MIG Scenarios

Problem: Fragmented GPU memory may block container startup or reduce performance.

Solutions:

  • Use fixed batch sizes to stabilize allocation patterns.

  • Pre-allocate MIG profiles with known shapes (e.g., 1g.5gb).

  • Use Nsight Compute to analyze memory heatmap and pinpoint fragmentation sources.

GPUDirect Storage & RDMA

  • GDS enables direct data movement from NVMe SSD to GPU without CPU copy.

  • RDMA enables direct memory access between nodes for high-speed networking.

Validation Tools:

  • fio + nvidia-fs for GDS performance

  • ib_write_bw for RDMA throughput

Command Example:

fio --name=gds-test --filename=/mnt/data/testfile --ioengine=nvidia-fs --rw=read --size=4G
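
For the RDMA side, ib_write_bw from the perftest suite is typically run as a server/client pair (the hostname below is a placeholder):

ib_write_bw                  # on the server node
ib_write_bw <server-host>    # on the client node; reports RDMA write bandwidth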

4. Nsight Tools Usage and Profiling Diagnosis

Nsight Systems .qdrep Report Structure

GUI Layout Overview:

  • Top timeline: GPU kernels, memory copies

  • Middle pane: CPU threads and sync points

  • Bottom: API trace and durations

Diagnostic Use Cases:

Symptom Interpretation
Long cudaMemcpyAsync Bottlenecked I/O
Gaps between kernel launches CPU-side delay or sync issue
Repetitive short kernels Poor kernel fusion, launch overhead

Nsight Compute Kernel Report Interpretation

Metric | Meaning | Optimization
Occupancy < 40% | Underutilized GPU | Increase threads per block
High warp divergence | Control flow inefficiency | Restructure kernel logic
Shared memory underused | Memory bandwidth waste | Refactor memory access

5. Multi-Tenant GPU Scheduling Strategy

NVLink-Aware Job Placement

To avoid performance imbalance in multi-GPU nodes:

User GPU Allocation
A GPU0–GPU3 (full NVLink interconnect)
B GPU4–GPU7

How to Configure:

  • Kubernetes: Use nodeSelector or custom scheduler extender.

  • Slurm: Set GRES + topology-aware constraints.

MIG-Based Multi-User Resource Planning

Diagram Example (A100 40GB):

MIG Instance | User | Assigned Profile
GPU0:0 | User A | 1g.5gb
GPU0:1 | User B | 2g.10gb
GPU1:0 | User C | 3g.20gb

Key Principles:

  • Match workloads to MIG profiles

  • Use nvidia-smi -L to map UUIDs

  • Docker: assign a specific instance using --gpus "device=UUID"; in Kubernetes, request MIG devices through the device plugin (e.g., nvidia.com/mig-1g.5gb with the mixed MIG strategy)

Frequently Asked Questions

What is the most common reason for CUDA “out of memory” errors during model training?

Answer:

The model or batch size exceeds the available GPU memory capacity.

Explanation:

Deep learning training workloads allocate GPU memory for model parameters, intermediate tensors, and batch data. When the required memory exceeds the available GPU memory, CUDA throws an out-of-memory error. Administrators and developers often resolve this by reducing batch size, enabling gradient accumulation, or using mixed precision training. Another contributing factor may be memory fragmentation or other processes occupying GPU memory. Monitoring tools help identify memory usage patterns before workloads fail.

Demand Score: 88

Exam Relevance Score: 92

How can administrators identify whether GPU performance issues are caused by hardware bottlenecks or software configuration problems?

Answer:

They analyze GPU utilization metrics alongside system telemetry such as CPU, memory, and I/O usage.

Explanation:

Performance bottlenecks may originate from multiple system components. If GPU utilization remains low while CPU or disk usage is high, the issue likely stems from input pipelines or storage limitations rather than GPU capability. Conversely, consistently high GPU utilization combined with slow training progress may indicate inefficient model configurations or driver-related issues. By correlating metrics across system layers, administrators can determine whether performance problems arise from hardware limitations, workload design, or infrastructure configuration.

Demand Score: 82

Exam Relevance Score: 88

Why might GPU utilization remain low during deep learning training even when GPUs are available?

Answer:

Low utilization often occurs when data pipelines or CPU preprocessing cannot feed data to the GPU fast enough.

Explanation:

Deep learning pipelines depend on continuous data delivery to GPUs. If CPU preprocessing tasks such as data augmentation or dataset loading become bottlenecks, GPUs remain idle while waiting for input. Storage latency or insufficient parallel data loading can also reduce utilization. Administrators typically analyze system metrics to identify these bottlenecks and optimize the pipeline by increasing data loader workers, improving storage performance, or restructuring preprocessing tasks.

Demand Score: 79

Exam Relevance Score: 86

What troubleshooting step should administrators take if GPU devices disappear after a system update?

Answer:

Verify that NVIDIA drivers remain compatible with the updated kernel version.

Explanation:

Operating system updates often modify the kernel, which can break compatibility with previously installed GPU drivers. If drivers were compiled for an older kernel, GPU modules may fail to load after the update. Administrators should reinstall or rebuild the NVIDIA driver so it matches the current kernel environment. Failure to perform this step may result in GPUs not appearing in system utilities or compute frameworks.

Demand Score: 75

Exam Relevance Score: 87

Why can running multiple AI workloads simultaneously reduce GPU efficiency?

Answer:

Competing workloads may cause resource contention and inefficient GPU scheduling.

Explanation:

When multiple jobs share GPU resources without proper scheduling controls, they compete for memory bandwidth, compute units, and memory allocation. This contention may cause context switching overhead and inefficient hardware utilization. AI operations teams typically implement scheduling policies or GPU partitioning mechanisms to balance workloads. Without such controls, performance variability increases and some workloads may fail due to insufficient resources.

Demand Score: 72

Exam Relevance Score: 84

What optimization technique can improve training performance without increasing GPU hardware capacity?

Answer:

Mixed precision training can improve performance and reduce GPU memory consumption.

Explanation:

Mixed precision training uses lower precision numerical formats such as FP16 for certain computations while maintaining FP32 precision where necessary. This approach reduces memory consumption and increases throughput because GPUs can process lower precision operations more efficiently. Many modern deep learning frameworks support automated mixed precision to simplify implementation. Administrators and engineers often enable this feature when optimizing training performance on existing GPU infrastructure.

Demand Score: 70

Exam Relevance Score: 83
