
NCP-AIO Workload Management

Workload Management Detailed Explanation

1. Kubernetes for AI Workload Management

Why Use Kubernetes for AI Workloads?

Kubernetes (often shortened to K8s) is a powerful open-source container orchestration system.

In AI environments, it helps:

  • Deploy and manage containerized ML/AI applications

  • Scale workloads automatically

  • Ensure high availability and efficient use of GPUs

  • Run jobs on multi-node clusters

Kubernetes is especially valuable for MLOps pipelines, large-scale training, and inference systems.

Containers + Kubernetes = Scalable AI

Before Kubernetes, people ran scripts manually or used tools like Docker Compose. But in large environments, you need to manage:

  • Dozens of models

  • Thousands of jobs

  • Dynamic GPU availability

  • Failures and restarts

Kubernetes automates all of this.

Key Kubernetes Concepts for AI

1. Pods & Nodes
  • Pod: The smallest unit in Kubernetes; usually runs one container (e.g., a training job or inference service).

  • Node: A physical or virtual machine (e.g., a GPU server) that runs Pods. Nodes can be GPU-capable or CPU-only.

Example: A pod might contain your train.py running in a container. Kubernetes places this pod onto a node that has available GPU(s).

2. DaemonSets

A DaemonSet is a special controller that runs a specific pod on every node.

Used for:

  • GPU monitoring tools like DCGM exporter

  • Logging agents

  • Node-level background processes

Think of it as “auto-install this tool on every machine in the cluster.”
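As a minimal sketch (image tag, labels, and port are illustrative, not the official manifest), a DaemonSet that runs the DCGM exporter on every node might look like:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          # illustrative tag; assumes the NVIDIA container runtime is available on the node
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
          ports:
            - containerPort: 9400   # the exporter's default metrics port

Because it is a DaemonSet, Kubernetes automatically starts one copy on every node that joins the cluster.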

3. NVIDIA GPU Operator

Instead of installing GPU drivers and tools manually, NVIDIA provides a Kubernetes Operator to automate everything.

The NVIDIA GPU Operator installs:

  • GPU drivers

  • DCGM for GPU health monitoring

  • NVIDIA device plugin (exposes GPU to pods)

  • Container Runtime (e.g., nvidia-container-runtime)

It makes Kubernetes GPU-ready without manual setup.
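Installation is typically done with Helm; a hedged sketch using NVIDIA's public chart repository (the release name and namespace are a matter of choice):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace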

4. CRI-O / containerd

These are lightweight container runtimes, alternatives to Docker.

In modern Kubernetes clusters:

  • Docker has largely been replaced by containerd or CRI-O (Kubernetes removed its built-in Docker support, dockershim, in v1.24)

  • NVIDIA’s GPU integration works with all of them

You don’t need deep details; just know that Kubernetes uses these runtimes to launch containers.

NVIDIA Device Plugin

This plugin connects NVIDIA GPUs to Kubernetes.

What It Does:
  • Detects GPU hardware

  • Tells Kubernetes how many GPUs are available

  • Allows users to request 1 or more GPUs per pod

Example: Request 1 GPU in Pod YAML
resources:
  limits:
    nvidia.com/gpu: 1

This ensures Kubernetes schedules the pod on a node with at least 1 available GPU.

GPU Scheduling Strategy in Kubernetes

Since AI workloads are resource-hungry, Kubernetes lets you control exactly where and how GPU jobs are scheduled.

Key Techniques:
  • Node Selectors: Only run on nodes with a GPU label (e.g., gpu=true)

  • Taints and Tolerations: Prevent regular jobs from using GPU nodes unless allowed

  • Affinity Rules: Co-locate jobs together (or spread them out)

  • Resource Requests & Limits: Ensure fair usage of CPU, RAM, and GPUs

Example: Pod Spec with Node Selector and GPU Limit
apiVersion: v1
kind: Pod
metadata:
  name: ai-job
spec:
  nodeSelector:
    accelerator: nvidia
  containers:
    - name: trainer
      image: myrepo/train:latest
      resources:
        limits:
          nvidia.com/gpu: 1
  • This pod must run on a node labeled with accelerator=nvidia

  • It will be assigned exactly 1 GPU
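
The taints-and-tolerations technique from the table above can be sketched as follows (node name and taint key are illustrative). First taint the GPU node, then add a matching toleration to the pod spec:

kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule

tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule

Pods without this toleration will never be scheduled onto the tainted GPU node.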

Beginner Tasks to Practice with Kubernetes
  • Deploy a pod with GPU: create a YAML file with nvidia.com/gpu: 1 under resources.limits

  • Label a node for GPU jobs: kubectl label node my-node accelerator=nvidia

  • View node resources: kubectl describe node <name>

  • Monitor pod status: kubectl get pods, kubectl describe pod <name>

  • Install the GPU Operator (using Helm): helm install nvidia-gpu-operator nvidia/gpu-operator

2. Slurm-Based Workload Management

Why Use Slurm?

Slurm (Simple Linux Utility for Resource Management) is a job scheduler used in High-Performance Computing (HPC) and on-premises AI clusters.

It is especially common in environments where:

  • Fine-grained control over GPU/CPU resources is needed

  • Batch jobs (e.g., long AI training runs) are common

  • You want predictable scheduling, even without containers

While Kubernetes excels at flexibility and containerization, Slurm is often preferred when you need deterministic control, especially in academic labs, national data centers, and tightly managed enterprise clusters.

How Slurm Works (Simple Explanation)

Imagine a queue at a coffee shop:

  • People (jobs) arrive and line up

  • The barista (Slurm) takes the next person in line

  • If the espresso machine (GPU) is busy, the next person waits

That’s how Slurm queues jobs and assigns hardware.

Slurm Job Lifecycle (Step-by-Step)

1. Write a Job Script

Create a Bash script to define:

  • How many GPUs and CPUs you want

  • How long your job can run

  • What commands to run

Example: train_model.sh

#!/bin/bash
#SBATCH --job-name=training
#SBATCH --output=output.log
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=04:00:00

python train.py

2. Submit the Job

sbatch train_model.sh

Your job is added to the queue and waits until resources are available.

3. Monitor the Queue
squeue

See who is running, who is waiting, and which node each job is on.

4. Check Job Logs

Slurm writes stdout and stderr to the file you specify with --output=....

cat output.log

Important Slurm Commands

  • sbatch: submit a job script

  • squeue: view jobs in the queue

  • scancel: cancel a job

  • sinfo: show node and partition status

  • sacct: view historical job info (after jobs have run)

Advanced Features in Slurm

1. QoS (Quality of Service)
  • Controls job priority, time limits, and resource limits

  • For example, "student jobs" may have lower priority than "production jobs"
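
As a sketch (QOS names and limits are illustrative), QOS entries are created with sacctmgr and then requested in job scripts:

sacctmgr add qos student Priority=10 MaxWall=02:00:00
sacctmgr add qos production Priority=100

#SBATCH --qos=student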

2. Partitions
  • Logical grouping of nodes (like “queues”)

  • You can define:

    • A “GPU” partition for nodes with GPUs

    • A “debug” partition for short test jobs

  • Each partition can have its own limits and policies

Example:

#SBATCH --partition=gpu

3. Job Preemption
  • High-priority jobs can interrupt lower-priority ones

  • Useful when urgent tasks need to run immediately
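
A hedged sketch of the slurm.conf settings involved (partition names and values are illustrative):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=prod  Nodes=node[01-08] PriorityTier=100
PartitionName=spare Nodes=node[01-08] PriorityTier=10

Here, jobs in the lower-tier spare partition can be requeued when prod jobs need the nodes.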

SlurmDBD (Slurm Database Daemon)

  • Used to record job history, usage stats, and billing data

  • Helps track:

    • Which users use the most GPUs

    • How much compute time each project consumes
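
With accounting enabled, usage can be summarized with sreport (dates and options are illustrative; exact output varies by Slurm version):

sreport cluster AccountUtilizationByUser Start=2024-01-01 End=2024-02-01 -t Hours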

Real-World Example: Scheduling AI Jobs with Slurm

A data science team wants to train 3 models:

  • Model A needs 2 GPUs for 4 hours

  • Model B needs 1 GPU for 8 hours

  • Model C is just a test and needs 10 minutes

With Slurm:

  1. Each team member writes a sbatch script

  2. They submit jobs to the gpu partition

  3. Slurm queues them and runs them as resources become available

  4. Logs are saved and usage is recorded in SlurmDBD

Beginner Tasks to Practice with Slurm

  • Submit a GPU job: sbatch train.sh

  • Request specific resources: use --gres=gpu:N, --mem=XXG, --cpus-per-task

  • Choose a partition: --partition=gpu

  • View jobs: squeue, sacct

  • Cancel a job: scancel <job_id>

Summary: Kubernetes vs. Slurm

  • Container support: native and excellent in Kubernetes; Slurm requires an external runtime such as Docker or Singularity/Apptainer

  • Job scheduling model: dynamic and on-demand in Kubernetes; batch and queue-based in Slurm

  • GPU awareness: via the device plugin in Kubernetes; built into Slurm with --gres=gpu:N

  • Multi-user clusters: supported by both

  • Popular in: cloud, DevOps, and MLOps (Kubernetes); HPC, research labs, and on-prem AI (Slurm)

3. Multi-Tenancy and Isolation

What is Multi-Tenancy?

Multi-tenancy means that multiple users or teams share the same infrastructure—like GPU clusters, Kubernetes clusters, or Slurm environments.

It’s common in:

  • AI research groups

  • Universities

  • Enterprise AI platforms

  • Public or hybrid cloud environments

Each “tenant” is like a separate user, team, or project.
Good infrastructure ensures they can co-exist without interfering with each other.

Why Isolation Is Important

Without isolation, you might face:

  • One user takes all the GPUs: slows down everyone else’s jobs

  • User A crashes User B’s job: shared memory or job conflicts

  • Sensitive data is leaked: no secure boundaries between containers or users

  • Logs and metrics are mixed up: hard to debug who caused what

Techniques for Isolation in AI Infrastructure

1. MIG (Multi-Instance GPU)

Already covered in the Administration section, but here’s a recap:

  • MIG divides a single physical GPU into multiple isolated instances

  • Each instance has dedicated memory, cache, compute cores

  • Multiple users can run AI jobs safely on the same GPU

Example:

  • 1 user gets a 1g.5gb MIG slice

  • Another user gets a 2g.10gb slice

  • Each runs separately, no interference

2. Kubernetes Namespaces

A namespace in Kubernetes is like a virtual environment or “folder” for resources.

Each team or project can have:

  • Its own pods

  • Its own config

  • Its own resource quotas

  • Its own logs and metrics

Benefits:

  • Separation of users and workloads

  • Better access control (via RBAC)

  • Quota enforcement (e.g., max 5 GPUs)

Create a namespace:

kubectl create namespace team-a

Run a pod inside that namespace:

kubectl apply -f pod.yaml -n team-a

3. Slurm Partitions

In Slurm, a partition is like a “queue” or “group of nodes.”

You can assign specific nodes to specific users or jobs:

  • Partition gpu-research: 4 A100 nodes for research team

  • Partition gpu-production: 8 RTX 6000 nodes for production team

Control access with partition access lists in slurm.conf (partitions restrict access by group or account rather than by individual user):

PartitionName=gpu-research Nodes=node[01-04] AllowGroups=research

Submit a job to that partition:

sbatch --partition=gpu-research script.sh

4. Fair-Share Scheduling (Slurm)

Used when many users submit jobs, but some are more active than others.

Slurm keeps track of how much compute time each user has consumed and adjusts job priorities accordingly.

  • Alice: 0 GPU hours used → HIGH job priority

  • Bob: 100 GPU hours used → LOW job priority

Controlled by:

  • PriorityWeightFairshare

  • FairShare value in Slurm configuration
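
A minimal sketch of the relevant slurm.conf lines (the weights are illustrative and must be tuned per site):

PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityDecayHalfLife=7-0   # past usage decays with a 7-day half-life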

5. Node Affinity and Taints (Kubernetes)

You can isolate workloads based on node labels.

Example:

  • Label nodes for different teams:

kubectl label node gpu-node-1 team=mlops

  • Schedule pods only on “mlops” nodes:

nodeSelector:
  team: mlops

Also:

  • Taints prevent jobs from running on certain nodes unless they "tolerate" them

  • Used for dedicated GPU nodes, security boundaries, or debug workloads

Real-World Use Case

Scenario:
Three teams (A, B, and C) share a Kubernetes cluster with 8 GPUs.

  • Team A can use 4 GPUs: create namespace “team-a” with a GPU quota

  • Team B runs on separate nodes: use nodeSelector + taints

  • Team C shares GPUs via MIG: assign MIG slices to their pods

Result:
Each team runs independently without affecting each other.

Skills You Should Learn for Multi-Tenant Isolation

  • Use MIG: allocate safe, isolated GPU slices to users

  • Create Kubernetes namespaces: separate workloads and configs by team or project

  • Set resource quotas (K8s): prevent any user or team from consuming too many GPUs

  • Configure Slurm partitions: assign GPU nodes to specific users or groups

  • Enable fair-share scheduling (Slurm): automatically adjust priorities based on usage history

  • Use node selectors and taints: pin workloads to specific GPU nodes or prevent job misplacement

4. Monitoring & Logging

Why Monitoring and Logging Matter

In AI infrastructure, it's not enough to just run jobs. You also need to:

  • Track performance: Are GPUs being fully used?

  • Catch failures early: Are jobs crashing due to memory limits?

  • Optimize resource usage: Are some nodes overloaded?

  • Audit and debug: Who ran what, and when?

Without proper monitoring and logging, you’re flying blind—especially in large, multi-user environments.

Monitoring Tools (GPU-Focused)

1. Prometheus
  • An open-source monitoring system

  • It scrapes metrics from services like DCGM (Data Center GPU Manager)

  • Stores time-series data (metrics over time)

  • Works with alert systems and Grafana dashboards

What It Monitors:

  • GPU utilization

  • Memory usage

  • Temperature

  • Power draw

  • Error events (e.g., ECC errors)

How It Works:

  • Each GPU node runs a DCGM exporter

  • Prometheus queries those exporters at regular intervals (e.g., every 15s)
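
A hedged sketch of the corresponding Prometheus scrape configuration (job name and targets are illustrative; 9400 is the exporter's default port):

scrape_configs:
  - job_name: 'dcgm'
    scrape_interval: 15s
    static_configs:
      - targets: ['gpu-node-1:9400', 'gpu-node-2:9400']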

2. Grafana
  • A dashboard tool that connects to Prometheus

  • Displays data visually: graphs, gauges, heatmaps, alerts

Common Dashboards:

  • Per-GPU usage over time

  • Job-by-job memory usage

  • Temperature heatmaps for racks of GPU servers

  • Alerts for failed jobs or hardware issues

Grafana helps operators, admins, and even users see what’s happening at a glance.

3. Kubectl Logs (Kubernetes)
  • Shows stdout/stderr logs from containerized AI jobs

  • Use it to debug crashes, errors, and training logs

Basic Command:

kubectl logs pod-name -n namespace

Use Cases:

  • Check why a training job failed

  • See accuracy/loss progress

  • Capture Python traceback for failed jobs

4. Slurm Logs
  • Every Slurm job has an output log file, defined with --output=...

  • By default, logs include:

    • stdout

    • stderr

    • Job status (completed, failed, cancelled)

Advanced Logs:

  • /var/log/slurm/slurmctld.log: Scheduler events

  • /var/log/slurm/slurmd.log: Node-level execution details

Command to query job history:

sacct -j <job_id>

Key Metrics to Monitor (GPU + Job-Level)

  • GPU utilization: low usage may mean your model is bottlenecked on CPU or I/O

  • Memory usage: frequent out-of-memory errors suggest reducing batch size or adjusting memory requests

  • Temperature: overheating GPUs may throttle performance or even shut down

  • Power draw: unusually high power usage may signal runaway jobs or hardware failure

  • Job duration: track how long each job runs and compare against expected durations

  • Container start/stop times: slow starts could signal image pull issues or node pressure

  • Node pressure: track CPU, RAM, and disk usage; important for co-scheduled AI jobs

Best Practices for AI Monitoring

  • Deploy Prometheus + Grafana cluster-wide: unified monitoring for all GPU nodes

  • Set GPU temperature/power alerts: detect early signs of overheating or stress

  • Collect job-level logs: debug model errors or resource issues

  • Export metrics to a centralized dashboard: help admins and users understand performance

  • Automate alerts: email/Slack/Teams notifications on failures or critical events

Real-World Use Case

Scenario: A training job is crashing after 2 hours with no clear error message.

Monitoring Workflow:

  1. Use kubectl logs (or Slurm job logs) to check for memory errors

  2. Check Prometheus → see if GPU memory hit 100%

  3. Grafana shows temperature spike before crash

  4. Action: Reduce batch size + improve airflow in server rack

Skills You Should Learn for Monitoring & Logging

  • Use kubectl logs: view logs for AI training/inference jobs

  • Set up the DCGM exporter: export GPU metrics for Prometheus

  • Configure Prometheus scraping: collect GPU stats cluster-wide

  • Build Grafana dashboards: visualize GPU usage, job health, and trends

  • Use Slurm’s logging and sacct: analyze completed job history and debug failed runs

  • Set up alerting: define conditions and notifications for GPU issues or job failures

5. Resource Quotas and Limits (Kubernetes)

Why Are Resource Quotas Important?

In shared AI infrastructure, it’s common for multiple teams or users to:

  • Submit jobs at the same time

  • Request GPUs, CPUs, memory, etc.

  • Accidentally overuse shared resources

Without quotas, one team could:

  • Use all the GPUs

  • Cause denial of service to others

  • Slow down or crash the system

Resource quotas help you ensure fairness and enforce usage limits per namespace or team.

Key Concepts

1. Namespace

Namespaces divide the cluster into logical sections, often used to group resources by:

  • Team (e.g., team-a, team-b)

  • Project

  • Environment (e.g., dev, prod)

Each namespace can have its own quota.

2. ResourceQuota Object

This is a Kubernetes object that sets a hard upper limit on what a namespace can use.

You can limit:

  • Number of pods

  • Total CPU or memory

  • Number of GPUs (nvidia.com/gpu)

Example: Limit a Namespace to 10 GPUs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "10"

This means:

  • Pods in the team-a namespace can collectively request up to 10 GPUs

  • If one pod requests 4 GPUs and another requests 7, the second is denied (4 + 7 exceeds 10)

Example: Limit CPU, Memory, and GPU
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-limits
  namespace: mlops-team
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "200Gi"
    requests.nvidia.com/gpu: "8"

This enforces:

  • Max 40 CPUs requested across all pods

  • Max 200 GiB of memory requested

  • Max 8 GPUs requested

How Kubernetes Enforces These Limits

Kubernetes tracks total resources requested in a namespace, not actual usage.

So if you have this pod spec:

resources:
  requests:
    nvidia.com/gpu: 2

Kubernetes:

  • Adds 2 GPUs to the total for the namespace

  • Rejects future pods if quota is exceeded

That’s why it’s important to always define resource requests in your pod specs.

If You Don’t Set Resource Requests…
  • The quota won’t be enforced properly

  • Kubernetes might allow pods to overload nodes

  • You lose fine-grained control

Always write this in your YAML:

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

Real-World Example

Scenario: You have 3 teams sharing a Kubernetes cluster:

  • Team A: Needs up to 8 GPUs

  • Team B: Needs up to 4 GPUs

  • Team C: Development only, 2 GPUs max

You:

  1. Create 3 namespaces: team-a, team-b, team-c

  2. Apply appropriate ResourceQuotas to each

Now the cluster can safely host jobs without overloading any team.
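
Sketched as kubectl commands for one of the teams (namespace and file names are illustrative):

kubectl create namespace team-a
kubectl apply -f team-a-quota.yaml -n team-a
kubectl describe quota -n team-a   # audit current usage against the limits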

Skills You Should Learn for Resource Quotas

  • Create namespaces: group workloads by team or project

  • Apply ResourceQuota YAML: enforce limits on GPU, CPU, and memory

  • Define resource requests in pods: ensure pods count toward the quota system

  • Audit resource usage: use kubectl describe quota to check current usage

  • Adjust quotas as needed: increase or decrease based on project or team needs

6. Job Lifecycle Automation

Why Automate AI Jobs?

Running AI jobs manually (e.g., sbatch, kubectl apply) works for one-off experiments.
But in production or at scale, you need to:

  • Run the same job daily, hourly, or on demand

  • Chain steps: preprocessing → training → evaluation → deployment

  • Handle failures, retries, logging, and monitoring

  • Track inputs, outputs, and metrics

This is where workflow automation tools come in.
They help you build repeatable, traceable, and scalable pipelines.

Popular Job Automation Tools for AI Workloads

1. Kubeflow Pipelines

Kubeflow is a Kubernetes-native MLOps platform.
Its Pipelines module lets you build end-to-end machine learning workflows.

  • Step-based workflows: break jobs into modular components

  • Reusability: build once, run many times

  • GPU scheduling: native integration with Kubernetes and the GPU device plugin

  • Visualization: view pipeline DAGs and metrics in a UI

Example Workflow:

  1. Data Ingestion →

  2. Feature Engineering →

  3. Model Training →

  4. Evaluation →

  5. Deployment

Each step is containerized and managed by Kubernetes.
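
A minimal sketch using the v1 kfp SDK (image names and commands are hypothetical; the API differs in kfp v2):

import kfp
from kfp import dsl

@dsl.pipeline(name='daily-train', description='Toy two-step training pipeline')
def daily_train():
    prep = dsl.ContainerOp(
        name='preprocess',
        image='myrepo/preprocess:latest',   # hypothetical image
        command=['python', 'prep.py'],
    )
    train = dsl.ContainerOp(
        name='train',
        image='myrepo/train:latest',        # hypothetical image
        command=['python', 'train.py'],
    )
    train.set_gpu_limit(1)   # maps to nvidia.com/gpu: 1 on the step's pod
    train.after(prep)        # run only after preprocessing finishes

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(daily_train, 'daily_train.yaml')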

2. Apache Airflow

Originally built for data engineering, Airflow is also widely used for:

  • Scheduling training jobs

  • Managing ETL pipelines

  • Tracking long-running AI workflows

  • DAGs (Directed Acyclic Graphs): define tasks with dependencies

  • Rich scheduling: cron-like control (e.g., every night at 2am)

  • Kubernetes integration: launch tasks as Kubernetes pods

Use Case:

  • Train a model every hour

  • Trigger retraining when new data is uploaded to S3

  • Send an alert if the job fails

3. Argo Workflows

Argo is a Kubernetes-native workflow engine.

  • Container-native: each step runs in its own Kubernetes pod

  • GPU-aware: fully supports nvidia.com/gpu resources

  • Scalable and lightweight: ideal for microservices or model inference

  • YAML-based workflow definitions: easy to store in Git for CI/CD workflows

Example Use Case:

  • Run distributed training on multiple pods

  • Automatically clean up intermediate files

  • Record run metadata and metrics
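
A minimal single-step Workflow manifest, sketched with illustrative names:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-
spec:
  entrypoint: train
  templates:
    - name: train
      container:
        image: myrepo/train:latest        # hypothetical image
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1             # the step runs only on a GPU node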

Core Concepts in Workflow Automation

  • Workflow DAG: a directed graph defining task order and dependencies

  • Step container: each step in a workflow is a container that runs a defined script or command

  • Retry policy: automatically retry a failed step N times before failing the workflow

  • Parameters: pass dynamic values (e.g., batch size, dataset path) to jobs

  • Artifacts: intermediate files (e.g., trained models, logs) saved and passed between steps

Benefits of Automating AI Workflows

  • Reproducibility: always run the same code with the same config

  • Scalability: handle many jobs concurrently across multiple GPUs

  • Resilience: retry failed steps automatically

  • Traceability: track every step, log, metric, and artifact

  • Version control + CI/CD: integrate with Git to enable production-grade MLOps pipelines

Real-World Use Case

Scenario: A company trains a new NLP model every day at midnight using new data collected during the day.

Workflow:

  1. Step 1: Download new data from S3

  2. Step 2: Clean and tokenize data

  3. Step 3: Train Transformer model (requires 2 GPUs)

  4. Step 4: Evaluate performance

  5. Step 5: If accuracy > 90%, deploy to production

Tools:

  • Airflow or Argo to automate scheduling and step execution

  • Kubernetes + GPU plugin to manage GPU allocation

Skills You Should Learn for Job Automation

  • Write workflow YAML (Argo/Kubeflow): define steps, containers, resources, and dependencies

  • Schedule jobs (Airflow): use DAGs and cron triggers to run workflows periodically

  • Use nvidia.com/gpu in automated pods: ensure GPU resources are allocated properly

  • Monitor automated jobs: use built-in dashboards and Prometheus/Grafana integration

  • Handle failures gracefully: add retries, alerting, and logging to each step

7. Real-World Scenarios You Must Understand

These scenarios simulate the types of problems and tasks you’ll face in real AI operations roles and on the NCP-AIO exam.

You should understand not just what to do, but why it works and how it interacts with Kubernetes, Slurm, GPU scheduling, monitoring, etc.

Scenario 1: Run AI Training on Kubernetes with GPUs

Problem: You need to deploy a containerized model training job that uses GPUs.

Solution:
  • Create a pod YAML with a GPU limit:

resources:
  limits:
    nvidia.com/gpu: 2

  • Apply a node selector to target GPU nodes:

nodeSelector:
  accelerator: nvidia

  • Submit using kubectl apply -f job.yaml

Concepts involved:

  • Kubernetes Pod spec

  • nvidia.com/gpu limits

  • GPU plugin

  • Node selectors

Scenario 2: Monitor GPU Jobs Using Prometheus & Grafana

Problem: You want real-time visibility into GPU usage per job.

Solution:
  • Deploy DCGM Exporter on each GPU node

  • Set up Prometheus to scrape metrics

  • Use Grafana to visualize:

    • GPU temperature

    • Memory usage

    • Power draw

    • ECC error counts

Concepts involved:

  • DCGM

  • Prometheus scraping

  • Grafana dashboard setup

  • Node metrics + per-job statistics

Scenario 3: Isolate Workloads from Multiple Teams

Problem: Team A and Team B need to run jobs on the same cluster but must not interfere with each other.

Kubernetes-based Solution:
  • Create separate namespaces: team-a, team-b

  • Assign ResourceQuotas for each namespace

  • Use RBAC to restrict access

  • Optionally assign MIG slices to teams

Slurm-based Solution:
  • Define partitions: team-a-partition, team-b-partition

  • Restrict access to each partition (e.g., with AllowGroups= or AllowAccounts=)

Concepts involved:

  • Namespaces

  • Resource quotas

  • MIG

  • Slurm partitions and access control

Scenario 4: Limit One Team to 5 GPUs in Kubernetes

Problem: You need to enforce a GPU usage limit for one team.

Solution:

Create a ResourceQuota in their namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-limit
  namespace: team-x
spec:
  hard:
    requests.nvidia.com/gpu: "5"

Concepts involved:

  • ResourceQuota

  • Namespace scoping

  • GPU plugin resource requests

Scenario 5: Handle Failed Job Due to Out-of-Memory (OOM)

Problem: A training job fails after running for some time with a memory error.

Solution Steps:
  1. Check logs:

    • kubectl logs or Slurm log output
  2. Identify error:

    • "Killed: Out of Memory"
  3. Fix:

    • Reduce batch size

    • Increase memory request:

resources:
  requests:
    memory: "32Gi"

Concepts involved:

  • Pod logs

  • Memory limits

  • Container lifecycle

  • Model optimization techniques

Additional Practical Challenges to Be Familiar With

  • Job stuck in Pending state (K8s): debug with kubectl describe pod and check GPU availability

  • Slurm job runs but uses no GPU: request GPUs correctly with --gres=gpu:1

  • Monitoring dashboard shows 0% GPU use: trace bottlenecks in CPU or I/O

  • Kubernetes job fails silently: know where to find pod logs and error status

  • One user using too many GPUs: apply quotas or fair-share limits

Final Summary: Skills You Must Master in Workload Management

  • Kubernetes + GPU plugin: write GPU-aware pod specs, use node selectors, apply limits

  • Slurm: submit jobs, manage partitions, enforce GPU usage policies

  • Monitoring: install DCGM exporters, configure Prometheus/Grafana, build dashboards

  • Isolation techniques: use MIG, Kubernetes namespaces, and Slurm partitions to separate workloads

  • Scaling for multi-tenant clusters: set up quotas, fair-share scheduling, and RBAC rules

  • Automation tools: build workflows with Kubeflow, Argo, or Airflow for repeatable pipelines

Workload Management (Additional Content)

1. Kubernetes – GPU Workload Management

nvidia-device-plugin YAML Structure

The NVIDIA device plugin is typically deployed as a DaemonSet. Key YAML components include:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:latest
        args: ["--mig-strategy=single"]
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Key fields:

  • --mig-strategy: controls MIG handling (none, single, mixed)

  • FAIL_ON_INIT_ERROR: allows graceful startup on nodes without GPUs

  • volumeMounts: the path through which the kubelet reaches the plugin’s device socket

Multi-Container Pod GPU Behavior

In Kubernetes, GPU resources are allocated at the Pod level:

  • All containers in the same Pod share the same GPU allocation.

  • There is no fine-grained control to split 1 GPU across containers in the same Pod (unless using MIG).

  • Default behavior: GPU is made visible to all containers via environment variable NVIDIA_VISIBLE_DEVICES.

Jobs vs. Pods

  • Pod: a one-off unit; no restart on failure

  • Job: handles retries and status tracking; suitable for training or evaluation tasks

Best practice: Use Job for batch inference or training workloads; only use Pod for debugging or non-repeatable jobs.
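
A sketch of such a Job (names are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: train-job
spec:
  backoffLimit: 3            # retry the pod up to 3 times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: myrepo/train:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1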

Fractional GPU / MIG Support in Kubernetes

  • MIG support is available via the NVIDIA device plugin with --mig-strategy.

  • Time-slicing (sharing one GPU across several pods in alternating time slices) is supported by newer device plugin and NVIDIA driver versions, though with weaker isolation than MIG (see the sketch after this list).

  • Still no native fractional GPU (e.g., 0.5 GPU) support in Kubernetes without MIG.
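
A hedged sketch of the device plugin's time-slicing configuration (the replica count is illustrative; the ConfigMap is passed to the plugin via its config-file option):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable units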

GPU Request and Limit Behavior

  • resources.requests.nvidia.com/gpu: used for scheduling

  • resources.limits.nvidia.com/gpu: used for runtime enforcement

Rules:

  • If both are set, they must match exactly (e.g., 1 and 1), or the pod is rejected.

  • No overcommit is allowed on GPUs.

  • Omitting requests but setting limits → OK (the request defaults to the limit).

  • Setting requests without limits, or requests ≠ limits → invalid for extended resources such as nvidia.com/gpu.

2. Slurm Workload Management

Job Arrays in Slurm

Used to submit multiple similar jobs with a single sbatch command:

sbatch --array=0-9 run_train.sh

Inside the script, access the job index with $SLURM_ARRAY_TASK_ID.

Use cases: Hyperparameter sweep, different seeds, model variant evaluation.
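
A sketch of an array script where each task picks its own hyperparameter (paths and values are illustrative):

#!/bin/bash
#SBATCH --array=0-9
#SBATCH --gres=gpu:1
#SBATCH --output=train_%A_%a.log   # %A = array job ID, %a = task index

LEARNING_RATES=(0.1 0.05 0.01 0.005 0.001 0.0005 0.0001 0.00005 0.00001 0.000005)
python train.py --lr "${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}"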

srun vs sbatch

  • sbatch: submit a batch job that executes in the background

  • srun: run interactively, or launch steps inside an sbatch job (e.g., for MPI/multi-node)

Interactive example:

srun --gres=gpu:1 --pty bash

scontrol for Job Management

Use scontrol to modify queued or running jobs:

  • Suspend: scontrol suspend <jobid>

  • Resume: scontrol resume <jobid>

  • Change time limit: scontrol update jobid=<id> TimeLimit=01:00:00

Partition Strategy Optimization

  • Priority: determines job execution order

  • PreemptMode: allows interruption of low-priority jobs

  • TimeLimit: controls maximum job length per partition

Best practice:

  • Use short partitions for testing

  • Use preemption to enable high-priority job insertion

DCGM + Prometheus in Slurm

  • Deploy dcgm-exporter on each node.

  • Collect metrics via Prometheus scrape.

  • Map GPU usage to job/user via Slurm accounting + hostname mapping.

3. Multi-Tenancy and GPU Isolation

Slurm + MIG Binding

You can assign specific MIG slices to users via GRES and Job Constraints:

# gres.conf
Name=gpu Type=MIG-GPU-0 UUID=GPU-abc123 Count=1

# slurm.conf
NodeName=nodex Gres=gpu:MIG-GPU-0:1

# job script
#SBATCH --gres=gpu:MIG-GPU-0:1

GPU Usage Auditing

Use SlurmDBD to collect per-user job GPU metrics:

sacct -u <user> --format=JobID,Elapsed,ReqGRES,AllocGRES,MaxRSS

(On newer Slurm releases, the ReqGRES/AllocGRES fields are replaced by ReqTRES/AllocTRES.)

Integrate with Grafana, Elastic, or custom scripts for department-level billing.

Kubernetes LimitRange for GPU Control

Use LimitRange to control min/max GPU per Pod:

apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limit-range
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: 2
    min:
      nvidia.com/gpu: 1

Pair with ResourceQuota to limit total GPU per Namespace.

4. Monitoring & Logging

DCGM Exporter Metrics (Prometheus)

  • DCGM_FI_DEV_POWER_USAGE: instantaneous power draw in watts

  • DCGM_FI_DEV_MEMORY_TEMP: memory temperature (°C)

  • DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DCGM_FI_DEV_ECC_DBE_VOL_TOTAL: accumulated ECC errors (corrected / uncorrected)

Centralized Pod Log Collection

Use tools like:

  • Fluentd, Fluent Bit → forward logs from container stdout

  • Logstash → log parsing

  • ELK/EFK stack → storage and visualization

Deploy as DaemonSet + ConfigMap.

Node Problem Detector Integration

Add hardware monitoring hooks via node-problem-detector:

  • Source: /var/log/syslog, dmesg

  • Reports GPU errors, thermal events as Kubernetes Node Conditions

Alertmanager GPU Alerts

Sample config:

groups:
- name: GPU overheating
  rules:
  - alert: GPUTemperatureTooHigh
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Too High"
      description: "Node {{ $labels.instance }} GPU exceeds 85°C"

5. Resource Quota and Scheduling

LimitRange vs ResourceQuota

  • LimitRange (scope: per Pod/Container): sets min/max per resource type

  • ResourceQuota (scope: per Namespace): sets the total aggregate limit (e.g., 8 GPUs)

Use both to enforce fair resource usage.

Namespace Hierarchy and Subquotas

Kubernetes supports hierarchical namespaces and subquotas through add-ons such as the Hierarchical Namespace Controller (HNC):

  • Allow subteam quotas: teamA-dev, teamA-prod

  • Schedulers can allocate based on sub-quota priorities

Scheduler Extenders

Advanced feature to inject custom scheduling logic (e.g., for GPU memory size or NUMA topology):

  • Write HTTP extender server

  • Modify kube-scheduler config to include:

extenders:
- urlPrefix: "http://localhost:12300/scheduler"
  filterVerb: "filter"
  prioritizeVerb: "prioritize"

Used by platforms like Volcano or Alibaba AI scheduler.

6. Workflow Automation Tools

Argo Workflows with GPU

Example:

containers:
- name: train
  image: pytorch/pytorch:2.0
  resources:
    limits:
      nvidia.com/gpu: 1

Argo will schedule the step only on GPU-capable nodes.

Katib for Hyperparameter Tuning

Example config for tuning learning rate:

objective:
  type: maximize
  goal: 0.9
  objectiveMetricName: accuracy
parameters:
- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.01"
    max: "0.1"

Runs trials using Kubernetes jobs with GPUs.

Airflow KubernetesPodOperator

DAG snippet (a sketch using the cncf.kubernetes provider; older Airflow versions used a flat resources dict with keys such as limit_gpu instead):

from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    namespace='ai-pipeline',
    image='my-model:latest',
    cmds=["python", "train.py"],
    name="train-task",
    task_id="train-model",
    container_resources=k8s.V1ResourceRequirements(
        limits={"memory": "4Gi", "nvidia.com/gpu": "1"}
    ),
)

Airflow dynamically creates GPU-enabled Pods.

Retries & Scheduling

In Argo/Kubeflow:

retryStrategy:
  limit: 3
  retryPolicy: "Always"

Supports backoff, retry delay, onFailure handlers for production reliability.

Frequently Asked Questions

How does Kubernetes schedule workloads that require NVIDIA GPUs?

Answer:

Kubernetes schedules GPU workloads using device plugins that advertise GPU resources to the scheduler.

Explanation:

In GPU-enabled Kubernetes clusters, the NVIDIA device plugin exposes GPUs as schedulable resources. Nodes report available GPUs to the Kubernetes control plane, and workloads request them through resource limits in pod specifications. When a pod declares a GPU requirement, the scheduler places it on a node with sufficient GPU capacity. This prevents oversubscription and ensures GPU workloads run only where hardware is available. Administrators often combine device plugins with cluster monitoring to track GPU allocation and utilization. A common mistake is deploying containers without declaring GPU resource requests, which results in workloads running without GPU acceleration even though GPUs exist on the node.

Demand Score: 82

Exam Relevance Score: 87

Why might a containerized AI workload fail to detect GPUs even when the host machine has available GPUs?

Answer:

This typically occurs when the container runtime lacks GPU support or the NVIDIA container toolkit is not properly configured.

Explanation:

GPU devices are not automatically accessible inside containers. Administrators must install and configure GPU-aware container runtimes so that GPU device files and driver libraries are exposed to containers. Without this configuration, frameworks such as TensorFlow or PyTorch cannot detect GPUs and will fall back to CPU execution. This problem frequently occurs when nodes are newly added to clusters without installing the necessary container toolkit or when runtime configuration files are missing. Verifying GPU visibility inside containers using diagnostic commands ensures workloads can properly access GPU resources.

Demand Score: 78

Exam Relevance Score: 88

Why is explicit GPU resource allocation required when scheduling AI workloads in orchestration platforms?

Answer:

Explicit resource allocation prevents resource contention and ensures workloads receive dedicated GPU capacity.

Explanation:

In shared AI infrastructure, multiple workloads compete for GPU resources. Orchestration platforms rely on declared resource requests to determine how workloads are placed across nodes. When GPU requirements are explicitly specified, the scheduler ensures that workloads run only on nodes with available GPU capacity and prevents multiple workloads from unintentionally occupying the same GPU. Without these constraints, scheduling systems cannot enforce resource isolation, which may result in unpredictable performance or job failures. Proper resource specification is therefore critical for maintaining stable GPU utilization in multi-tenant environments.

Demand Score: 74

Exam Relevance Score: 83

What operational challenge occurs when multiple AI workloads attempt to use the same GPU simultaneously?

Answer:

Resource contention can degrade performance and lead to memory allocation conflicts.

Explanation:

GPUs are designed to execute parallel workloads efficiently, but uncontrolled sharing may cause conflicts in compute scheduling and memory allocation. When multiple jobs attempt to allocate large memory regions simultaneously, workloads may fail or experience reduced performance. This issue is common in shared research clusters or ML development environments. Administrators often mitigate this risk by implementing scheduling policies, workload isolation mechanisms, or GPU partitioning strategies that ensure predictable resource allocation.

Demand Score: 76

Exam Relevance Score: 85

Why do AI operations teams monitor GPU workload execution after deployment?

Answer:

Monitoring helps verify that workloads utilize GPU resources efficiently and detect abnormal execution behavior.

Explanation:

Once workloads are scheduled and running, administrators must observe runtime metrics to ensure GPUs are being used effectively. Metrics such as GPU utilization, memory usage, and job duration reveal whether workloads are performing as expected. Low utilization may indicate data pipeline bottlenecks, configuration issues, or misconfigured containers. Monitoring also helps identify workloads that monopolize GPU resources or consume excessive memory. By continuously analyzing these metrics, operations teams can adjust scheduling strategies and optimize overall cluster performance.

Demand Score: 71

Exam Relevance Score: 82
