Kubernetes (often shortened to K8s) is a powerful open-source container orchestration system.
In AI environments, it helps:
Deploy and manage containerized ML/AI applications
Scale workloads automatically
Ensure high availability and efficient use of GPUs
Run jobs on multi-node clusters
Kubernetes is especially valuable for MLOps pipelines, large-scale training, and inference systems.
Before Kubernetes, people ran scripts manually or used tools like Docker Compose. But in large environments, you need to manage:
Dozens of models
Thousands of jobs
Dynamic GPU availability
Failures and restarts
Kubernetes automates all of this.
| Term | Description |
|---|---|
| Pod | The smallest unit in Kubernetes; usually runs one container (e.g., a training job or inference service). |
| Node | A physical or virtual machine (e.g., a GPU server) that runs Pods. Nodes can be GPU-capable or CPU-only. |
Example: A pod might contain your train.py running in a Docker container. Kubernetes places this pod onto a node that has available GPU(s).
A DaemonSet is a special controller that runs a specific pod on every node.
Used for:
GPU monitoring tools like DCGM exporter
Logging agents
Node-level background processes
Think of it as “auto-install this tool on every machine in the cluster.”
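As a rough sketch (the image name here is a placeholder, not a specific tool), a DaemonSet spec looks like the following; Kubernetes then keeps one copy of this pod running on every node:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent                    # must match the selector above
    spec:
      containers:
      - name: agent
        image: myrepo/node-agent:latest    # placeholder image for a node-level agent
```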
Instead of installing GPU drivers and tools manually, NVIDIA provides a Kubernetes Operator to automate everything.
The NVIDIA GPU Operator installs:
GPU drivers
DCGM for GPU health monitoring
NVIDIA device plugin (exposes GPU to pods)
Container Runtime (e.g., nvidia-container-runtime)
It makes Kubernetes GPU-ready without manual setup.
containerd and CRI-O are lightweight container runtimes that serve as alternatives to Docker.
In modern Kubernetes clusters:
Docker is being replaced by containerd or CRI-O
NVIDIA’s GPU integration works with all of them
You don’t need to know the deep details; just know that Kubernetes uses these runtimes to launch containers.
The NVIDIA device plugin connects NVIDIA GPUs to Kubernetes. It:
Detects GPU hardware
Tells Kubernetes how many GPUs are available
Allows users to request 1 or more GPUs per pod
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
This ensures Kubernetes schedules the pod on a node with at least 1 available GPU.
Since AI workloads are resource-hungry, Kubernetes lets you control exactly where and how GPU jobs are scheduled.
| Technique | What It Controls |
|---|---|
| Node Selectors | Only run on nodes with a GPU label (e.g., gpu=true) |
| Taints and Tolerations | Prevent regular jobs from using GPU nodes unless allowed |
| Affinity Rules | Co-locate jobs together (or spread them out) |
| Resource Requests & Limits | Ensure fair usage of CPU, RAM, and GPUs |
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-job
spec:
  nodeSelector:
    accelerator: nvidia
  containers:
  - name: trainer
    image: myrepo/train:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```
This pod must run on a node labeled with accelerator=nvidia
It will be assigned exactly 1 GPU
| Task | How to Do It |
|---|---|
| Deploy a pod with GPU | Create a YAML file with nvidia.com/gpu: 1 in resources.limits |
| Label a node for GPU jobs | kubectl label node my-node accelerator=nvidia |
| View node resources | kubectl describe node <name> |
| Monitor pod status | kubectl get pods, kubectl describe pod <name> |
| Install GPU Operator (using Helm) | helm install nvidia-gpu-operator nvidia/gpu-operator |
Slurm (Simple Linux Utility for Resource Management) is a job scheduler used in High-Performance Computing (HPC) and on-premises AI clusters.
It is especially common in environments where:
Fine-grained control over GPU/CPU resources is needed
Batch jobs (e.g., long AI training runs) are common
You want predictable scheduling, even without containers
While Kubernetes excels at flexibility and containerization, Slurm is often preferred when you need deterministic control, especially in academic labs, national data centers, and tightly managed enterprise clusters.
Imagine a queue at a coffee shop:
People (jobs) arrive and line up
The barista (Slurm) takes the next person in line
If the espresso machine (GPU) is busy, the next person waits
That’s how Slurm queues jobs and assigns hardware.
Create a Bash script to define:
How many GPUs and CPUs you want
How long your job can run
What commands to run
Example: train_model.sh
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --output=output.log
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=04:00:00
python train.py
sbatch train_model.sh
Your job is added to the queue and waits until resources are available.
squeue
See who is running, who is waiting, and which node each job is on.
Slurm writes stdout and stderr to the file you specify with --output=....
cat output.log
| Command | Description |
|---|---|
| `sbatch` | Submit a job script |
| `squeue` | View jobs in the queue |
| `scancel` | Cancel a job |
| `sinfo` | Show node and partition status |
| `sacct` | View historical job info (after jobs have run) |
**QoS (Quality of Service)**
Controls job priority, time limits, and resource limits.
For example, "student jobs" may have lower priority than "production jobs".
**Partitions**
Logical grouping of nodes (like “queues”).
You can define:
A “GPU” partition for nodes with GPUs
A “debug” partition for short test jobs
Each partition can have its own limits and policies
Example:
#SBATCH --partition=gpu
**Preemption**
High-priority jobs can interrupt lower-priority ones.
Useful when urgent tasks need to run immediately.
**Accounting (SlurmDBD)**
Used to record job history, usage stats, and billing data.
Helps track:
Which users use the most GPUs
How much compute time each project consumes
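For example, a hedged sketch of an accounting query (dates and the field list are illustrative; on newer Slurm releases AllocTRES includes the gres/gpu count):

```bash
# Summarize jobs for all users over a month, including allocated TRES (CPU/memory/GPU)
sacct --starttime=2024-01-01 --endtime=2024-01-31 \
      --allusers --format=User,JobID,Elapsed,AllocTRES%40
```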
A data science team wants to train 3 models:
Model A needs 2 GPUs for 4 hours
Model B needs 1 GPU for 8 hours
Model C is just a test and needs 10 minutes
With Slurm:
Each team member writes an sbatch script
They submit jobs to the gpu partition
Slurm queues them and runs them as resources become available
Logs are saved and usage is recorded in SlurmDBD
| Task | Command or Action |
|---|---|
| Submit a GPU job | sbatch train.sh |
| Request specific resources | Use --gres=gpu:N, --mem=XXG, --cpus-per-task |
| Choose partition | --partition=gpu |
| View jobs | squeue, sacct |
| Cancel a job | scancel <job_id> |
| Feature | Kubernetes | Slurm |
|---|---|---|
| Container Support | Excellent (native) | Requires Docker/Singularity |
| Job Scheduling Model | Dynamic (on-demand) | Batch (queued) |
| GPU Awareness | Via device plugin | Built-in with --gres=gpu:N |
| Multi-user Clusters | Supported | Supported |
| Popular In | Cloud, DevOps, MLOps | HPC, research labs, on-prem AI |
Multi-tenancy means that multiple users or teams share the same infrastructure—like GPU clusters, Kubernetes clusters, or Slurm environments.
It’s common in:
AI research groups
Universities
Enterprise AI platforms
Public or hybrid cloud environments
Each “tenant” is like a separate user, team, or project.
Good infrastructure ensures they can co-exist without interfering with each other.
Without isolation, you might face:
| Problem | Example |
|---|---|
| One user uses all the GPUs | Slows down other users’ jobs |
| User A crashes User B’s job | Shared memory or job conflicts |
| Sensitive data is leaked | No secure boundaries between containers or users |
| Logs and metrics are mixed up | Hard to debug who caused what |
Already covered in the Administration section, but here’s a recap:
MIG divides a single physical GPU into multiple isolated instances
Each instance has dedicated memory, cache, compute cores
Multiple users can run AI jobs safely on the same GPU
Example:
One user gets a 1g.5gb MIG slice
Another user gets a 2g.10gb slice
Each runs separately, with no interference
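On Kubernetes, if the device plugin runs with the mixed MIG strategy, a pod can request a specific slice size instead of a whole GPU (a minimal sketch):

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice rather than a full GPU
```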
A namespace in Kubernetes is like a virtual environment or “folder” for resources.
Each team or project can have:
Its own pods
Its own config
Its own resource quotas
Its own logs and metrics
Benefits:
Separation of users and workloads
Better access control (via RBAC)
Quota enforcement (e.g., max 5 GPUs)
Create a namespace:
kubectl create namespace team-a
Run a pod inside that namespace:
kubectl apply -f pod.yaml -n team-a
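For the RBAC piece, a minimal sketch (group name, resources, and verbs are placeholders) that limits a team to its own namespace might look like:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-editor
  namespace: team-a
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "deployments", "jobs"]
  verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-editor-binding
  namespace: team-a
subjects:
- kind: Group
  name: team-a                          # placeholder group name from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-editor
  apiGroup: rbac.authorization.k8s.io
```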
In Slurm, a partition is like a “queue” or “group of nodes.”
You can assign specific nodes to specific users or jobs:
Partition gpu-research: 4 A100 nodes for research team
Partition gpu-production: 8 RTX 6000 nodes for production team
Control access using ACLs (access control lists):
PartitionName=gpu-research Nodes=node[01-04] AllowUsers=alice,bob
Submit a job to that partition:
sbatch --partition=gpu-research script.sh
Used when many users submit jobs, but some are more active than others.
Slurm keeps track of how much compute time each user has consumed and adjusts job priorities accordingly.
| User | Used GPU Hours | Job Priority |
|---|---|---|
| Alice | 0 | HIGH |
| Bob | 100 | LOW |
Controlled by:
PriorityWeightFairshare
FairShare value in Slurm configuration
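A hedged sketch of the relevant slurm.conf settings (values are illustrative, not recommendations):

```bash
# slurm.conf (illustrative values)
PriorityType=priority/multifactor     # enable multifactor priority, which includes fair-share
PriorityWeightFairshare=10000         # how strongly fair-share affects job priority
PriorityDecayHalfLife=7-0             # past usage decays over roughly a week
# Per-account/user fair-share targets are then set in the accounting database with sacctmgr.
```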
You can isolate workloads based on node labels.
Example:
kubectl label node gpu-node-1 team=mlops
nodeSelector:
team: mlops
Also:
Taints prevent jobs from running on certain nodes unless they "tolerate" them
Used for dedicated GPU nodes, security boundaries, or debug workloads
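For example (node name, taint key, and value are placeholders), a GPU node can be tainted so only pods that tolerate the taint are scheduled there:

```bash
kubectl taint node gpu-node-1 dedicated=mlops:NoSchedule
```

A pod that should run there then adds a matching toleration:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "mlops"
  effect: "NoSchedule"
```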
Scenario:
Three teams (A, B, and C) share a Kubernetes cluster with 8 GPUs.
| Goal | Strategy |
|---|---|
| Team A can use 4 GPUs | Create namespace “team-a” + GPU quota |
| Team B runs on separate nodes | Use nodeSelector + taints |
| Team C shares GPUs via MIG | Assign MIG slices to their pods |
Result:
Each team runs independently without affecting each other.
| Skill | Description |
|---|---|
| Use MIG | Allocate safe, isolated GPU slices to users |
| Create Kubernetes namespaces | Separate workloads and configs by team or project |
| Set resource quotas (K8s) | Prevent any user/team from consuming too many GPUs |
| Configure Slurm partitions | Assign GPU nodes to specific users or groups |
| Enable fair-share scheduling (Slurm) | Automatically adjust priorities based on usage history |
| Use node selectors and taints | Pin workloads to specific GPU nodes or prevent job misplacement |
In AI infrastructure, it's not enough to just run jobs. You also need to:
Track performance: Are GPUs being fully used?
Catch failures early: Are jobs crashing due to memory limits?
Optimize resource usage: Are some nodes overloaded?
Audit and debug: Who ran what, and when?
Without proper monitoring and logging, you’re flying blind—especially in large, multi-user environments.
Prometheus is an open-source monitoring system
It scrapes metrics from services like DCGM (Data Center GPU Manager)
Stores time-series data (metrics over time)
Works with alert systems and Grafana dashboards
What It Monitors:
GPU utilization
Memory usage
Temperature
Power draw
Error events (e.g., ECC errors)
How It Works:
Each GPU node runs a DCGM exporter
Prometheus queries those exporters at regular intervals (e.g., every 15s)
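A hedged sketch of a static scrape job for DCGM exporters (node names are placeholders; 9400 is the usual dcgm-exporter port, and real clusters typically use Kubernetes service discovery instead of static targets):

```yaml
scrape_configs:
- job_name: dcgm
  scrape_interval: 15s
  static_configs:
  - targets:
    - gpu-node-1:9400   # placeholder node:port
    - gpu-node-2:9400
```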
Grafana is a dashboard tool that connects to Prometheus
Displays data visually: graphs, gauges, heatmaps, alerts
Common Dashboards:
Per-GPU usage over time
Job-by-job memory usage
Temperature heatmaps for racks of GPU servers
Alerts for failed jobs or hardware issues
Grafana helps operators, admins, and even users see what’s happening at a glance.
`kubectl logs` shows stdout/stderr logs from containerized AI jobs
Use it to debug crashes, errors, and training logs
Basic Command:
kubectl logs pod-name -n namespace
Use Cases:
Check why a training job failed
See accuracy/loss progress
Capture Python traceback for failed jobs
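Two standard kubectl variants that often help here:

```bash
kubectl logs -f pod-name -n namespace          # stream logs live while the job runs
kubectl logs pod-name -n namespace --previous  # logs from the previous (crashed) container instance
```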
Every Slurm job has an output log file, defined with --output=...
By default, logs include:
stdout
stderr
Job status (completed, failed, cancelled)
Advanced Logs:
/var/log/slurm/slurmctld.log: Scheduler events
/var/log/slurm/slurmd.log: Node-level execution details
Command to query job history:
sacct -j <job_id>
| Metric | Why It Matters |
|---|---|
| GPU Utilization | Low usage? Maybe your model is bottlenecked on CPU or I/O |
| Memory Usage | Frequent out-of-memory errors? Reduce batch size or adjust memory requests |
| Temperature | Overheating GPUs may throttle performance or even shut down |
| Power Draw | Unusually high power usage may signal runaway jobs or hardware failure |
| Job Duration | Track how long each job runs—compare against expected durations |
| Container Start/Stop Times | Slow starts could signal image pull issues or node pressure |
| Node Pressure | Track CPU, RAM, disk usage—important for co-scheduled AI jobs |
| Practice | Benefit |
|---|---|
| Deploy Prometheus + Grafana cluster-wide | Unified monitoring for all GPU nodes |
| Set GPU temperature/power alerts | Detect early signs of overheating or stress |
| Collect job-level logs | Debug model errors or resource issues |
| Export metrics to centralized dashboard | Help admins and users understand performance |
| Automate alerts | Email/Slack/Teams notifications on failure or critical events |
Scenario: A training job is crashing after 2 hours with no clear error message.
Monitoring Workflow:
Use kubectl logs (or Slurm job logs) to check for memory errors
Check Prometheus → see if GPU memory hit 100%
Grafana shows temperature spike before crash
Action: Reduce batch size + improve airflow in server rack
| Skill | Description |
|---|---|
| Use `kubectl logs` | View logs for AI training/inference jobs |
| Set up DCGM exporter | Export GPU metrics for Prometheus |
| Configure Prometheus scraping | Collect GPU stats cluster-wide |
| Build Grafana dashboards | Visualize GPU usage, job health, and trends |
| Use Slurm’s logging and `sacct` | Analyze completed job history and debug failed runs |
| Alerting setup | Define conditions and notifications for GPU issues or job failures |
In shared AI infrastructure, it’s common for multiple teams or users to:
Submit jobs at the same time
Request GPUs, CPUs, memory, etc.
Accidentally overuse shared resources
Without quotas, one team could:
Use all the GPUs
Cause denial of service to others
Slow down or crash the system
Resource quotas help you ensure fairness and enforce usage limits per namespace or team.
Namespaces divide the cluster into logical sections, often used to group resources by:
Team (e.g., team-a, team-b)
Project
Environment (e.g., dev, prod)
Each namespace can have its own quota.
This is a Kubernetes object that sets a hard upper limit on what a namespace can use.
You can limit:
Number of pods
Total CPU or memory
Number of GPUs (nvidia.com/gpu)
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "10"
```
This means:
Pods in the team-a namespace can collectively request up to 10 GPUs
If one pod requests 4 GPUs and another requests 7, the second is denied (4 + 7 exceeds the quota of 10)
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-limits
  namespace: mlops-team
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "200Gi"
    requests.nvidia.com/gpu: "8"
```
This enforces:
Max 40 CPUs requested across all pods
Max 200 GiB of memory requested
Max 8 GPUs requested
Kubernetes tracks total resources requested in a namespace, not actual usage.
So if you have this pod spec:
```yaml
resources:
  requests:
    nvidia.com/gpu: 2
```
Kubernetes:
Adds 2 GPUs to the total for the namespace
Rejects future pods if quota is exceeded
That’s why it’s important to always define resource requests in your pod specs.
If requests are missing:
The quota won’t be enforced properly
Kubernetes might allow pods to overload nodes
You lose fine-grained control
Always write this in your YAML:
```yaml
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
```
Scenario: You have 3 teams sharing a Kubernetes cluster:
Team A: Needs up to 8 GPUs
Team B: Needs up to 4 GPUs
Team C: Development only, 2 GPUs max
You:
Create 3 namespaces: team-a, team-b, team-c
Apply appropriate ResourceQuotas to each
Now the cluster can safely host jobs without overloading any team.
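A minimal sketch of the setup for one team (the quota file name is a placeholder; its contents would be a ResourceQuota like the examples above):

```bash
kubectl create namespace team-a
kubectl apply -f team-a-quota.yaml -n team-a   # ResourceQuota with requests.nvidia.com/gpu: "8"
kubectl describe quota -n team-a               # verify current usage against the limit
```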
| Skill | Description |
|---|---|
| Create namespaces | Group workloads by team/project |
| Apply ResourceQuota YAML | Enforce limits on GPU, CPU, memory |
| Define resource requests in pods | Ensure pods count toward the quota system |
| Audit resource usage | Use kubectl describe quota to check current usage |
| Adjust quotas as needed | Increase or decrease based on project or team needs |
Running AI jobs manually (e.g., sbatch, kubectl apply) works for one-off experiments.
But in production or at scale, you need to:
Run the same job daily, hourly, or on demand
Chain steps: preprocessing → training → evaluation → deployment
Handle failures, retries, logging, and monitoring
Track inputs, outputs, and metrics
This is where workflow automation tools come in.
They help you build repeatable, traceable, and scalable pipelines.
Kubeflow is a Kubernetes-native MLOps platform.
Its Pipelines module lets you build end-to-end machine learning workflows.
| Feature | Benefit |
|---|---|
| Step-based workflows | Break down jobs into modular components |
| Reusability | Build once, run many times |
| GPU scheduling | Native integration with Kubernetes + GPU plugin |
| Visualization | View pipeline DAGs and metrics in a UI |
Example Workflow:
Data Ingestion →
Feature Engineering →
Model Training →
Evaluation →
Deployment
Each step is containerized and managed by Kubernetes.
Originally built for data engineering, Airflow is also widely used for:
Scheduling training jobs
Managing ETL pipelines
Tracking long-running AI workflows
| Feature | Benefit |
|---|---|
| DAGs (Directed Acyclic Graphs) | Define tasks with dependencies |
| Rich scheduling | Cron-like control (e.g., every night at 2am) |
| Kubernetes integration | Launch tasks as Kubernetes pods |
Use Case:
Train a model every hour
Trigger retraining when new data is uploaded to S3
Send an alert if the job fails
Argo is a Kubernetes-native workflow engine.
| Feature | Benefit |
|---|---|
| Container-native | Each step runs in its own Kubernetes pod |
| GPU-aware | Fully supports nvidia.com/gpu resources |
| Scalable and lightweight | Ideal for microservices or model inference |
| YAML-based workflow definitions | Easy to store in Git for CI/CD workflows |
Example Use Case:
Run distributed training on multiple pods
Automatically clean up intermediate files
Record run metadata and metrics
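A minimal sketch of an Argo Workflow with a single GPU training step (image and command are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-
spec:
  entrypoint: train
  templates:
  - name: train
    container:
      image: myrepo/train:latest        # placeholder training image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1             # step runs only on a node with a free GPU
```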
| Concept | Description |
|---|---|
| Workflow DAG | A directed graph defining task order and dependencies |
| Step Container | Each step in a workflow is a container that runs a defined script or command |
| Retry Policy | Automatically retry a failed step X times before failing the workflow |
| Parameters | Pass dynamic values (e.g., batch size, dataset path) to jobs |
| Artifacts | Intermediate files (e.g., trained models, logs) saved and passed between steps |
| Benefit | Explanation |
|---|---|
| Reproducibility | Always run the same code, same config |
| Scalability | Can handle many jobs concurrently across multiple GPUs |
| Resilience | Retry failed steps automatically |
| Traceability | Track every step, log, metric, and artifact |
| Version Control + CI/CD | Integrate with Git to enable production-grade MLOps pipelines |
Scenario: A company trains a new NLP model every day at midnight using new data collected during the day.
Workflow:
Step 1: Download new data from S3
Step 2: Clean and tokenize data
Step 3: Train Transformer model (requires 2 GPUs)
Step 4: Evaluate performance
Step 5: If accuracy > 90%, deploy to production
Tools:
Airflow or Argo to automate scheduling and step execution
Kubernetes + GPU plugin to manage GPU allocation
| Skill | Description |
|---|---|
| Write workflow YAML (Argo/Kubeflow) | Define steps, containers, resources, dependencies |
| Schedule jobs (Airflow) | Use DAGs and cron triggers to run workflows periodically |
| Use `nvidia.com/gpu` in automated pods | Ensure GPU resources are allocated properly |
| Monitor automated jobs | Use built-in dashboards and Prometheus/Grafana integration |
| Handle failures gracefully | Add retries, alerting, logging to each step |
These scenarios simulate the types of problems and tasks you’ll face in real AI operations roles and on the NCP-AIO exam.
You should understand not just what to do, but why it works and how it interacts with Kubernetes, Slurm, GPU scheduling, monitoring, etc.
Problem: You need to deploy a containerized model training job that uses GPUs.
Include GPU limits and a node selector in the pod spec:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2
nodeSelector:
  accelerator: nvidia
```

Apply it:

```bash
kubectl apply -f job.yaml
```

Concepts involved:
Kubernetes Pod spec
nvidia.com/gpu limits
GPU plugin
Node selectors
Problem: You want real-time visibility into GPU usage per job.
Deploy DCGM Exporter on each GPU node
Set up Prometheus to scrape metrics
Use Grafana to visualize:
GPU temperature
Memory usage
Power draw
ECC error counts
Concepts involved:
DCGM
Prometheus scraping
Grafana dashboard setup
Node metrics + per-job statistics
Problem: Team A and Team B need to run jobs on the same cluster but must not interfere with each other.
In Kubernetes:
Create separate namespaces: team-a, team-b
Assign ResourceQuotas to each namespace
Use RBAC to restrict access
Optionally assign MIG slices to teams
In Slurm:
Define partitions: team-a-partition, team-b-partition
Assign users to each partition using AllowUsers=
Concepts involved:
Namespaces
Resource quotas
MIG
Slurm partitions and access control
Problem: You need to enforce a GPU usage limit for one team.
Create a ResourceQuota in their namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-limit
  namespace: team-x
spec:
  hard:
    requests.nvidia.com/gpu: "5"
```
Concepts involved:
ResourceQuota
Namespace scoping
GPU plugin resource requests
Problem: A training job fails after running for some time with a memory error.
Check logs:
kubectl logs (Kubernetes) or the Slurm log output
Identify the error (e.g., an out-of-memory message in the traceback)
Fix:
Reduce batch size
Increase memory request:

```yaml
resources:
  requests:
    memory: "32Gi"
```
Concepts involved:
Pod logs
Memory limits
Container lifecycle
Model optimization techniques
| Situation | You Should Know... |
|---|---|
| Job stuck in Pending state (K8s) | How to debug using kubectl describe pod, check GPU availability |
| Slurm job runs but uses no GPU | How to request GPUs correctly with --gres=gpu:1 |
| Monitoring dashboard shows 0% GPU use | How to trace bottlenecks in CPU or I/O |
| Kubernetes job fails silently | Where to find pod logs and error status |
| One user using too many GPUs | How to apply quotas or fair-share limits |
| Topic | Practical Skills |
|---|---|
| Kubernetes + GPU Plugin | Write GPU-aware pod specs, use node selectors, apply limits |
| Slurm | Submit jobs, manage partitions, enforce GPU usage policies |
| Monitoring | Install DCGM exporters, configure Prometheus/Grafana, build dashboards |
| Isolation Techniques | Use MIG, Kubernetes namespaces, and Slurm partitions to separate workloads |
| Scaling for Multi-Tenant | Set up quotas, fair-share scheduling, and RBAC rules |
| Automation Tools | Build workflows with Kubeflow, Argo, or Airflow for repeatable pipelines |
**nvidia-device-plugin YAML Structure**
The NVIDIA device plugin is typically deployed as a DaemonSet. Key YAML components include:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  template:
    spec:
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:latest
        args: ["--mig-strategy=single"]
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
```
Key fields:
| Field | Purpose |
|---|---|
| `--mig-strategy` | Controls MIG handling (`none`, `single`, `mixed`) |
| `FAIL_ON_INIT_ERROR` | Allows a graceful start if no GPU exists |
| `volumeMounts` | Mount path Kubernetes uses to access the plugin socket |
In Kubernetes, GPU resources are requested per container within a Pod:
Allocated GPUs are dedicated; containers and Pods do not share a GPU.
There is no fine-grained way to split one GPU across containers in the same Pod (unless using MIG or time-slicing).
Default behavior: the assigned GPU is exposed to the requesting container via the environment variable NVIDIA_VISIBLE_DEVICES.
| Type | Behavior |
|---|---|
| Pod | One-off unit, no restart on failure |
| Job | Handles retries, status tracking, suitable for training or evaluation tasks |
Best practice: Use Job for batch inference or training workloads; only use Pod for debugging or non-repeatable jobs.
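A minimal sketch of a GPU training Job (the image name is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job
spec:
  backoffLimit: 2                    # retry a failed pod up to 2 times
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: myrepo/train:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```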
MIG support is available via the NVIDIA device plugin with --mig-strategy.
Time-slicing (e.g., A10 GPU) is experimental with newer Kubernetes + NVIDIA driver versions.
Still no native fractional GPU (e.g., 0.5 GPU) support in Kubernetes without MIG.
| Field | Effect |
|---|---|
| `resources.requests` for `nvidia.com/gpu` | Used for scheduling |
| `resources.limits` for `nvidia.com/gpu` | Used for runtime enforcement |
Rules:
Requests and limits for nvidia.com/gpu must match exactly (e.g., 1 and 1), or the pod is rejected.
No overcommit is allowed on GPUs.
Setting only limits is fine: the request defaults to the limit.
Setting requests without limits (or requests ≠ limits) is invalid for extended resources such as nvidia.com/gpu.
Used to submit multiple similar jobs with a single sbatch command:
sbatch --array=0-9 run_train.sh
Inside the script, access the job index with $SLURM_ARRAY_TASK_ID.
Use cases: Hyperparameter sweep, different seeds, model variant evaluation.
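A hedged sketch of an array script that sweeps a hyperparameter (the learning-rate values are illustrative, and train.py is assumed to accept a --lr flag):

```bash
#!/bin/bash
#SBATCH --job-name=lr-sweep
#SBATCH --array=0-3
#SBATCH --gres=gpu:1
#SBATCH --output=sweep_%A_%a.log   # %A = array job ID, %a = task index

LEARNING_RATES=(0.1 0.01 0.001 0.0001)
python train.py --lr "${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}"
```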
**srun vs sbatch**

| Command | Usage |
|---|---|
| `sbatch` | Submit a batch job; executes in the background |
| `srun` | Run interactively, or within sbatch for MPI/multi-node steps |
Interactive example:
srun --gres=gpu:1 --pty bash
**scontrol for Job Management**
Use scontrol to modify queued or running jobs:
Suspend: scontrol suspend <jobid>
Resume: scontrol resume <jobid>
Change time limit: scontrol update jobid=<id> TimeLimit=01:00:00
| Partition Attribute | Behavior |
|---|---|
| `Priority` | Determines job execution order |
| `PreemptMode` | Allows interruption of low-priority jobs |
| `TimeLimit` | Controls maximum job length per partition |
Best practice:
Use short partitions for testing
Use preemption to enable high-priority job insertion
Deploy dcgm-exporter on each node.
Collect metrics via Prometheus scrape.
Map GPU usage to job/user via Slurm accounting + hostname mapping.
You can assign specific MIG slices to users via GRES and Job Constraints:
```bash
# gres.conf
Name=gpu Type=MIG-GPU-0 UUID=GPU-abc123 Count=1

# slurm.conf
NodeName=nodex Gres=gpu:MIG-GPU-0:1

# job script
#SBATCH --gres=gpu:MIG-GPU-0:1
```
Use SlurmDBD to collect per-user job GPU metrics:
sacct -u <user> --format=JobID,Elapsed,ReqGRES,AllocGRES,MaxRSS
Integrate with Grafana, Elastic, or custom scripts for department-level billing.
**LimitRange for GPU Control**
Use LimitRange to control min/max GPU per Pod:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limit-range   # example name
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: 2
    min:
      nvidia.com/gpu: 1
```
Pair with ResourceQuota to limit total GPU per Namespace.
| Metric | Meaning |
|---|---|
| `DCGM_FI_DEV_POWER_USAGE` | Instant power draw in watts |
| `DCGM_FI_DEV_MEMORY_TEMP` | Memory temperature (°C) |
| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` / `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | Accumulated ECC errors (corrected/uncorrected) |
Use tools like:
Fluentd, Fluent Bit → forward logs from container stdout
Logstash → log parsing
ELK/EFK stack → storage and visualization
Deploy as DaemonSet + ConfigMap.
Add hardware monitoring hooks via node-problem-detector:
Source: /var/log/syslog, dmesg
Reports GPU errors, thermal events as Kubernetes Node Conditions
Sample config:
```yaml
groups:
- name: GPU overheating
  rules:
  - alert: GPUTemperatureTooHigh
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Too High"
      description: "Node {{ $labels.instance }} GPU exceeds 85°C"
```
**LimitRange vs ResourceQuota**

| Resource | Scope | Function |
|---|---|---|
| `LimitRange` | Per Pod/Container | Set min/max per resource type |
| `ResourceQuota` | Per Namespace | Set total aggregate limit (e.g., 8 GPUs) |
Use both to enforce fair resource usage.
Kubernetes supports hierarchical quotas via add-ons such as the Hierarchical Namespace Controller (HNC):
Allow subteam quotas: teamA-dev, teamA-prod
Schedulers can allocate based on sub-quota priorities
Advanced feature to inject custom scheduling logic (e.g., for GPU memory size or NUMA topology):
Write HTTP extender server
Modify kube-scheduler config to include:
```yaml
extenders:
- urlPrefix: "http://localhost:12300/scheduler"
  filterVerb: "filter"
  prioritizeVerb: "prioritize"
```
Used by platforms like Volcano or Alibaba AI scheduler.
Example of an Argo workflow step requesting a GPU:
```yaml
containers:
- name: train
  image: pytorch/pytorch:2.0
  resources:
    limits:
      nvidia.com/gpu: 1
```
Argo will schedule the step only on GPU-capable nodes.
Example config for tuning learning rate:
```yaml
objective:
  type: maximize
  goal: 0.9
  objectiveMetricName: accuracy
parameters:
- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.01"
    max: "0.1"
```
Runs trials using Kubernetes jobs with GPUs.
**KubernetesPodOperator**
DAG snippet:
```python
KubernetesPodOperator(
    namespace='ai-pipeline',
    image='my-model:latest',
    cmds=["python", "train.py"],
    name="train-task",
    task_id="train-model",
    resources={"limit_memory": "4Gi", "limit_gpu": "1"}
)
```
Airflow dynamically creates GPU-enabled Pods.
In Argo/Kubeflow:
```yaml
retryStrategy:
  limit: 3
  retryPolicy: "Always"
```
Supports backoff, retry delay, onFailure handlers for production reliability.
How does Kubernetes schedule workloads that require NVIDIA GPUs?
Kubernetes schedules GPU workloads using device plugins that advertise GPU resources to the scheduler.
In GPU-enabled Kubernetes clusters, the NVIDIA device plugin exposes GPUs as schedulable resources. Nodes report available GPUs to the Kubernetes control plane, and workloads request them through resource limits in pod specifications. When a pod declares a GPU requirement, the scheduler places it on a node with sufficient GPU capacity. This prevents oversubscription and ensures GPU workloads run only where hardware is available. Administrators often combine device plugins with cluster monitoring to track GPU allocation and utilization. A common mistake is deploying containers without declaring GPU resource requests, which results in workloads running without GPU acceleration even though GPUs exist on the node.
Demand Score: 82
Exam Relevance Score: 87
Why might a containerized AI workload fail to detect GPUs even when the host machine has available GPUs?
This typically occurs when the container runtime lacks GPU support or the NVIDIA container toolkit is not properly configured.
GPU devices are not automatically accessible inside containers. Administrators must install and configure GPU-aware container runtimes so that GPU device files and driver libraries are exposed to containers. Without this configuration, frameworks such as TensorFlow or PyTorch cannot detect GPUs and will fall back to CPU execution. This problem frequently occurs when nodes are newly added to clusters without installing the necessary container toolkit or when runtime configuration files are missing. Verifying GPU visibility inside containers using diagnostic commands ensures workloads can properly access GPU resources.
Demand Score: 78
Exam Relevance Score: 88
Why is explicit GPU resource allocation required when scheduling AI workloads in orchestration platforms?
Explicit resource allocation prevents resource contention and ensures workloads receive dedicated GPU capacity.
In shared AI infrastructure, multiple workloads compete for GPU resources. Orchestration platforms rely on declared resource requests to determine how workloads are placed across nodes. When GPU requirements are explicitly specified, the scheduler ensures that workloads run only on nodes with available GPU capacity and prevents multiple workloads from unintentionally occupying the same GPU. Without these constraints, scheduling systems cannot enforce resource isolation, which may result in unpredictable performance or job failures. Proper resource specification is therefore critical for maintaining stable GPU utilization in multi-tenant environments.
Demand Score: 74
Exam Relevance Score: 83
What operational challenge occurs when multiple AI workloads attempt to use the same GPU simultaneously?
Resource contention can degrade performance and lead to memory allocation conflicts.
GPUs are designed to execute parallel workloads efficiently, but uncontrolled sharing may cause conflicts in compute scheduling and memory allocation. When multiple jobs attempt to allocate large memory regions simultaneously, workloads may fail or experience reduced performance. This issue is common in shared research clusters or ML development environments. Administrators often mitigate this risk by implementing scheduling policies, workload isolation mechanisms, or GPU partitioning strategies that ensure predictable resource allocation.
Demand Score: 76
Exam Relevance Score: 85
Why do AI operations teams monitor GPU workload execution after deployment?
Monitoring helps verify that workloads utilize GPU resources efficiently and detect abnormal execution behavior.
Once workloads are scheduled and running, administrators must observe runtime metrics to ensure GPUs are being used effectively. Metrics such as GPU utilization, memory usage, and job duration reveal whether workloads are performing as expected. Low utilization may indicate data pipeline bottlenecks, configuration issues, or misconfigured containers. Monitoring also helps identify workloads that monopolize GPU resources or consume excessive memory. By continuously analyzing these metrics, operations teams can adjust scheduling strategies and optimize overall cluster performance.
Demand Score: 71
Exam Relevance Score: 82