Kubernetes (often shortened to K8s) is a powerful open-source container orchestration system.
In AI environments, it helps:
Deploy and manage containerized ML/AI applications
Scale workloads automatically
Ensure high availability and efficient use of GPUs
Run jobs on multi-node clusters
Kubernetes is especially valuable for MLOps pipelines, large-scale training, and inference systems.
Before Kubernetes, people ran scripts manually or used tools like Docker Compose. But in large environments, you need to manage:
Dozens of models
Thousands of jobs
Dynamic GPU availability
Failures and restarts
Kubernetes automates all of this.
| Term | Description |
|---|---|
| Pod | The smallest unit in Kubernetes; usually runs one container (e.g., a training job or inference service). |
| Node | A physical or virtual machine (e.g., a GPU server) that runs Pods. Nodes can be GPU-capable or CPU-only. |
Example: A pod might contain your train.py running in a Docker container. Kubernetes places this pod onto a node that has available GPU(s).
A DaemonSet is a special controller that runs a specific pod on every node.
Used for:
GPU monitoring tools like DCGM exporter
Logging agents
Node-level background processes
Think of it as “auto-install this tool on every machine in the cluster.”
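As a rough sketch (the image name here is a placeholder, not a specific tool), a DaemonSet spec looks like the following; Kubernetes then keeps one copy of this pod running on every node:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent                    # must match the selector above
    spec:
      containers:
      - name: agent
        image: myrepo/node-agent:latest    # placeholder image for a node-level agent
```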
Instead of installing GPU drivers and tools manually, NVIDIA provides a Kubernetes Operator to automate everything.
The NVIDIA GPU Operator installs:
GPU drivers
DCGM for GPU health monitoring
NVIDIA device plugin (exposes GPU to pods)
Container Runtime (e.g., nvidia-container-runtime)
It makes Kubernetes GPU-ready without manual setup.
containerd and CRI-O are lightweight container runtimes that serve as alternatives to Docker.
In modern Kubernetes clusters:
Docker is being replaced by containerd or CRI-O
NVIDIA’s GPU integration works with all of them
You don’t need to know the deep details; just know that Kubernetes uses these runtimes to launch containers.
The NVIDIA device plugin connects NVIDIA GPUs to Kubernetes. It:
Detects GPU hardware
Tells Kubernetes how many GPUs are available
Allows users to request 1 or more GPUs per pod
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
This ensures Kubernetes schedules the pod on a node with at least 1 available GPU.
Since AI workloads are resource-hungry, Kubernetes lets you control exactly where and how GPU jobs are scheduled.
| Technique | What It Controls |
|---|---|
| Node Selectors | Only run on nodes with a GPU label (e.g., gpu=true) |
| Taints and Tolerations | Prevent regular jobs from using GPU nodes unless allowed |
| Affinity Rules | Co-locate jobs together (or spread them out) |
| Resource Requests & Limits | Ensure fair usage of CPU, RAM, and GPUs |
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-job
spec:
  nodeSelector:
    accelerator: nvidia
  containers:
  - name: trainer
    image: myrepo/train:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```
This pod must run on a node labeled with accelerator=nvidia
It will be assigned exactly 1 GPU
| Task | How to Do It |
|---|---|
| Deploy a pod with GPU | Create a YAML file with nvidia.com/gpu: 1 in resources.limits |
| Label a node for GPU jobs | kubectl label node my-node accelerator=nvidia |
| View node resources | kubectl describe node <name> |
| Monitor pod status | kubectl get pods, kubectl describe pod <name> |
| Install GPU Operator (using Helm) | helm install nvidia-gpu-operator nvidia/gpu-operator |
Slurm (Simple Linux Utility for Resource Management) is a job scheduler used in High-Performance Computing (HPC) and on-premises AI clusters.
It is especially common in environments where:
Fine-grained control over GPU/CPU resources is needed
Batch jobs (e.g., long AI training runs) are common
You want predictable scheduling, even without containers
While Kubernetes excels at flexibility and containerization, Slurm is often preferred when you need deterministic control, especially in academic labs, national data centers, and tightly managed enterprise clusters.
Imagine a queue at a coffee shop:
People (jobs) arrive and line up
The barista (Slurm) takes the next person in line
If the espresso machine (GPU) is busy, the next person waits
That’s how Slurm queues jobs and assigns hardware.
Create a Bash script to define:
How many GPUs and CPUs you want
How long your job can run
What commands to run
Example: train_model.sh
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --output=output.log
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=04:00:00
python train.py
sbatch train_model.sh
Your job is added to the queue and waits until resources are available.
squeue
See who is running, who is waiting, and which node each job is on.
Slurm writes stdout and stderr to the file you specify with --output=....
cat output.log
| Command | Description |
|---|---|
| `sbatch` | Submit a job script |
| `squeue` | View jobs in the queue |
| `scancel` | Cancel a job |
| `sinfo` | Show node and partition status |
| `sacct` | View historical job info (after jobs have run) |
**QoS (Quality of Service)**
Controls job priority, time limits, and resource limits.
For example, "student jobs" may have lower priority than "production jobs".
**Partitions**
Logical grouping of nodes (like “queues”).
You can define:
A “GPU” partition for nodes with GPUs
A “debug” partition for short test jobs
Each partition can have its own limits and policies
Example:
#SBATCH --partition=gpu
**Preemption**
High-priority jobs can interrupt lower-priority ones.
Useful when urgent tasks need to run immediately.
**Accounting (SlurmDBD)**
Used to record job history, usage stats, and billing data.
Helps track:
Which users use the most GPUs
How much compute time each project consumes
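For example, a hedged sketch of an accounting query (dates and the field list are illustrative; on newer Slurm releases AllocTRES includes the gres/gpu count):

```bash
# Summarize jobs for all users over a month, including allocated TRES (CPU/memory/GPU)
sacct --starttime=2024-01-01 --endtime=2024-01-31 \
      --allusers --format=User,JobID,Elapsed,AllocTRES%40
```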
A data science team wants to train 3 models:
Model A needs 2 GPUs for 4 hours
Model B needs 1 GPU for 8 hours
Model C is just a test and needs 10 minutes
With Slurm:
Each team member writes an sbatch script
They submit jobs to the gpu partition
Slurm queues them and runs them as resources become available
Logs are saved and usage is recorded in SlurmDBD
| Task | Command or Action |
|---|---|
| Submit a GPU job | sbatch train.sh |
| Request specific resources | Use --gres=gpu:N, --mem=XXG, --cpus-per-task |
| Choose partition | --partition=gpu |
| View jobs | squeue, sacct |
| Cancel a job | scancel <job_id> |
| Feature | Kubernetes | Slurm |
|---|---|---|
| Container Support | Excellent (native) | Requires Docker/Singularity |
| Job Scheduling Model | Dynamic (on-demand) | Batch (queued) |
| GPU Awareness | Via device plugin | Built-in with --gres=gpu:N |
| Multi-user Clusters | Supported | Supported |
| Popular In | Cloud, DevOps, MLOps | HPC, research labs, on-prem AI |
Multi-tenancy means that multiple users or teams share the same infrastructure—like GPU clusters, Kubernetes clusters, or Slurm environments.
It’s common in:
AI research groups
Universities
Enterprise AI platforms
Public or hybrid cloud environments
Each “tenant” is like a separate user, team, or project.
Good infrastructure ensures they can co-exist without interfering with each other.
Without isolation, you might face:
| Problem | Example |
|---|---|
| One user uses all the GPUs | Slows down other users’ jobs |
| User A crashes User B’s job | Shared memory or job conflicts |
| Sensitive data is leaked | No secure boundaries between containers or users |
| Logs and metrics are mixed up | Hard to debug who caused what |
Already covered in the Administration section, but here’s a recap:
MIG divides a single physical GPU into multiple isolated instances
Each instance has dedicated memory, cache, compute cores
Multiple users can run AI jobs safely on the same GPU
Example:
One user gets a 1g.5gb MIG slice
Another user gets a 2g.10gb slice
Each runs separately, with no interference
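On Kubernetes, if the device plugin runs with the mixed MIG strategy, a pod can request a specific slice size instead of a whole GPU (a minimal sketch):

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice rather than a full GPU
```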
A namespace in Kubernetes is like a virtual environment or “folder” for resources.
Each team or project can have:
Its own pods
Its own config
Its own resource quotas
Its own logs and metrics
Benefits:
Separation of users and workloads
Better access control (via RBAC)
Quota enforcement (e.g., max 5 GPUs)
Create a namespace:
kubectl create namespace team-a
Run a pod inside that namespace:
kubectl apply -f pod.yaml -n team-a
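For the RBAC piece, a minimal sketch (group name, resources, and verbs are placeholders) that limits a team to its own namespace might look like:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-editor
  namespace: team-a
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "deployments", "jobs"]
  verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-editor-binding
  namespace: team-a
subjects:
- kind: Group
  name: team-a                          # placeholder group name from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-editor
  apiGroup: rbac.authorization.k8s.io
```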
In Slurm, a partition is like a “queue” or “group of nodes.”
You can assign specific nodes to specific users or jobs:
Partition gpu-research: 4 A100 nodes for research team
Partition gpu-production: 8 RTX 6000 nodes for production team
Control access using ACLs (access control lists):
PartitionName=gpu-research Nodes=node[01-04] AllowUsers=alice,bob
Submit a job to that partition:
sbatch --partition=gpu-research script.sh
Used when many users submit jobs, but some are more active than others.
Slurm keeps track of how much compute time each user has consumed and adjusts job priorities accordingly.
| User | Used GPU Hours | Job Priority |
|---|---|---|
| Alice | 0 | HIGH |
| Bob | 100 | LOW |
Controlled by:
PriorityWeightFairshare
FairShare value in Slurm configuration
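A hedged sketch of the relevant slurm.conf settings (values are illustrative, not recommendations):

```bash
# slurm.conf (illustrative values)
PriorityType=priority/multifactor     # enable multifactor priority, which includes fair-share
PriorityWeightFairshare=10000         # how strongly fair-share affects job priority
PriorityDecayHalfLife=7-0             # past usage decays over roughly a week
# Per-account/user fair-share targets are then set in the accounting database with sacctmgr.
```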
You can isolate workloads based on node labels.
Example:
kubectl label node gpu-node-1 team=mlops
nodeSelector:
team: mlops
Also:
Taints prevent jobs from running on certain nodes unless they "tolerate" them
Used for dedicated GPU nodes, security boundaries, or debug workloads
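For example (node name, taint key, and value are placeholders), a GPU node can be tainted so only pods that tolerate the taint are scheduled there:

```bash
kubectl taint node gpu-node-1 dedicated=mlops:NoSchedule
```

A pod that should run there then adds a matching toleration:

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "mlops"
  effect: "NoSchedule"
```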
Scenario:
Three teams (A, B, and C) share a Kubernetes cluster with 8 GPUs.
| Goal | Strategy |
|---|---|
| Team A can use 4 GPUs | Create namespace “team-a” + GPU quota |
| Team B runs on separate nodes | Use nodeSelector + taints |
| Team C shares GPUs via MIG | Assign MIG slices to their pods |
Result:
Each team runs independently without affecting each other.
| Skill | Description |
|---|---|
| Use MIG | Allocate safe, isolated GPU slices to users |
| Create Kubernetes namespaces | Separate workloads and configs by team or project |
| Set resource quotas (K8s) | Prevent any user/team from consuming too many GPUs |
| Configure Slurm partitions | Assign GPU nodes to specific users or groups |
| Enable fair-share scheduling (Slurm) | Automatically adjust priorities based on usage history |
| Use node selectors and taints | Pin workloads to specific GPU nodes or prevent job misplacement |
In AI infrastructure, it's not enough to just run jobs. You also need to:
Track performance: Are GPUs being fully used?
Catch failures early: Are jobs crashing due to memory limits?
Optimize resource usage: Are some nodes overloaded?
Audit and debug: Who ran what, and when?
Without proper monitoring and logging, you’re flying blind—especially in large, multi-user environments.
Prometheus is an open-source monitoring system
It scrapes metrics from services like DCGM (Data Center GPU Manager)
Stores time-series data (metrics over time)
Works with alert systems and Grafana dashboards
What It Monitors:
GPU utilization
Memory usage
Temperature
Power draw
Error events (e.g., ECC errors)
How It Works:
Each GPU node runs a DCGM exporter
Prometheus queries those exporters at regular intervals (e.g., every 15s)
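A hedged sketch of a static scrape job for DCGM exporters (node names are placeholders; 9400 is the usual dcgm-exporter port, and real clusters typically use Kubernetes service discovery instead of static targets):

```yaml
scrape_configs:
- job_name: dcgm
  scrape_interval: 15s
  static_configs:
  - targets:
    - gpu-node-1:9400   # placeholder node:port
    - gpu-node-2:9400
```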
Grafana is a dashboard tool that connects to Prometheus
Displays data visually: graphs, gauges, heatmaps, alerts
Common Dashboards:
Per-GPU usage over time
Job-by-job memory usage
Temperature heatmaps for racks of GPU servers
Alerts for failed jobs or hardware issues
Grafana helps operators, admins, and even users see what’s happening at a glance.
`kubectl logs` shows stdout/stderr logs from containerized AI jobs
Use it to debug crashes, errors, and training logs
Basic Command:
kubectl logs pod-name -n namespace
Use Cases:
Check why a training job failed
See accuracy/loss progress
Capture Python traceback for failed jobs
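Two standard kubectl variants that often help here:

```bash
kubectl logs -f pod-name -n namespace          # stream logs live while the job runs
kubectl logs pod-name -n namespace --previous  # logs from the previous (crashed) container instance
```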
Every Slurm job has an output log file, defined with --output=...
By default, logs include:
stdout
stderr
Job status (completed, failed, cancelled)
Advanced Logs:
/var/log/slurm/slurmctld.log: Scheduler events
/var/log/slurm/slurmd.log: Node-level execution details
Command to query job history:
sacct -j <job_id>
| Metric | Why It Matters |
|---|---|
| GPU Utilization | Low usage? Maybe your model is bottlenecked on CPU or I/O |
| Memory Usage | Frequent out-of-memory errors? Reduce batch size or adjust memory requests |
| Temperature | Overheating GPUs may throttle performance or even shut down |
| Power Draw | Unusually high power usage may signal runaway jobs or hardware failure |
| Job Duration | Track how long each job runs—compare against expected durations |
| Container Start/Stop Times | Slow starts could signal image pull issues or node pressure |
| Node Pressure | Track CPU, RAM, disk usage—important for co-scheduled AI jobs |
| Practice | Benefit |
|---|---|
| Deploy Prometheus + Grafana cluster-wide | Unified monitoring for all GPU nodes |
| Set GPU temperature/power alerts | Detect early signs of overheating or stress |
| Collect job-level logs | Debug model errors or resource issues |
| Export metrics to centralized dashboard | Help admins and users understand performance |
| Automate alerts | Email/Slack/Teams notifications on failure or critical events |
Scenario: A training job is crashing after 2 hours with no clear error message.
Monitoring Workflow:
Use kubectl logs (or Slurm job logs) to check for memory errors
Check Prometheus → see if GPU memory hit 100%
Grafana shows temperature spike before crash
Action: Reduce batch size + improve airflow in server rack
| Skill | Description |
|---|---|
| Use `kubectl logs` | View logs for AI training/inference jobs |
| Set up DCGM exporter | Export GPU metrics for Prometheus |
| Configure Prometheus scraping | Collect GPU stats cluster-wide |
| Build Grafana dashboards | Visualize GPU usage, job health, and trends |
| Use Slurm’s logging and `sacct` | Analyze completed job history and debug failed runs |
| Alerting setup | Define conditions and notifications for GPU issues or job failures |
In shared AI infrastructure, it’s common for multiple teams or users to:
Submit jobs at the same time
Request GPUs, CPUs, memory, etc.
Accidentally overuse shared resources
Without quotas, one team could:
Use all the GPUs
Cause denial of service to others
Slow down or crash the system
Resource quotas help you ensure fairness and enforce usage limits per namespace or team.
Namespaces divide the cluster into logical sections, often used to group resources by:
Team (e.g., team-a, team-b)
Project
Environment (e.g., dev, prod)
Each namespace can have its own quota.
This is a Kubernetes object that sets a hard upper limit on what a namespace can use.
You can limit:
Number of pods
Total CPU or memory
Number of GPUs (nvidia.com/gpu)
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "10"
```
This means:
Pods in the team-a namespace can collectively request up to 10 GPUs
If one pod requests 4 GPUs and another requests 7, the second is denied (4 + 7 exceeds the quota of 10)
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-limits
  namespace: mlops-team
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "200Gi"
    requests.nvidia.com/gpu: "8"
```
This enforces:
Max 40 CPUs requested across all pods
Max 200 GiB of memory requested
Max 8 GPUs requested
Kubernetes tracks total resources requested in a namespace, not actual usage.
So if you have this pod spec:
```yaml
resources:
  requests:
    nvidia.com/gpu: 2
```
Kubernetes:
Adds 2 GPUs to the total for the namespace
Rejects future pods if quota is exceeded
That’s why it’s important to always define resource requests in your pod specs.
If requests are missing:
The quota won’t be enforced properly
Kubernetes might allow pods to overload nodes
You lose fine-grained control
Always write this in your YAML:
```yaml
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
```
Scenario: You have 3 teams sharing a Kubernetes cluster:
Team A: Needs up to 8 GPUs
Team B: Needs up to 4 GPUs
Team C: Development only, 2 GPUs max
You:
Create 3 namespaces: team-a, team-b, team-c
Apply appropriate ResourceQuotas to each
Now the cluster can safely host jobs without overloading any team.
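A minimal sketch of the setup for one team (the quota file name is a placeholder; its contents would be a ResourceQuota like the examples above):

```bash
kubectl create namespace team-a
kubectl apply -f team-a-quota.yaml -n team-a   # ResourceQuota with requests.nvidia.com/gpu: "8"
kubectl describe quota -n team-a               # verify current usage against the limit
```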
| Skill | Description |
|---|---|
| Create namespaces | Group workloads by team/project |
| Apply ResourceQuota YAML | Enforce limits on GPU, CPU, memory |
| Define resource requests in pods | Ensure pods count toward the quota system |
| Audit resource usage | Use kubectl describe quota to check current usage |
| Adjust quotas as needed | Increase or decrease based on project or team needs |
Running AI jobs manually (e.g., sbatch, kubectl apply) works for one-off experiments.
But in production or at scale, you need to:
Run the same job daily, hourly, or on demand
Chain steps: preprocessing → training → evaluation → deployment
Handle failures, retries, logging, and monitoring
Track inputs, outputs, and metrics
This is where workflow automation tools come in.
They help you build repeatable, traceable, and scalable pipelines.
Kubeflow is a Kubernetes-native MLOps platform.
Its Pipelines module lets you build end-to-end machine learning workflows.
| Feature | Benefit |
|---|---|
| Step-based workflows | Break down jobs into modular components |
| Reusability | Build once, run many times |
| GPU scheduling | Native integration with Kubernetes + GPU plugin |
| Visualization | View pipeline DAGs and metrics in a UI |
Example Workflow:
Data Ingestion →
Feature Engineering →
Model Training →
Evaluation →
Deployment
Each step is containerized and managed by Kubernetes.
Originally built for data engineering, Airflow is also widely used for:
Scheduling training jobs
Managing ETL pipelines
Tracking long-running AI workflows
| Feature | Benefit |
|---|---|
| DAGs (Directed Acyclic Graphs) | Define tasks with dependencies |
| Rich scheduling | Cron-like control (e.g., every night at 2am) |
| Kubernetes integration | Launch tasks as Kubernetes pods |
Use Case:
Train a model every hour
Trigger retraining when new data is uploaded to S3
Send an alert if the job fails
Argo is a Kubernetes-native workflow engine.
| Feature | Benefit |
|---|---|
| Container-native | Each step runs in its own Kubernetes pod |
| GPU-aware | Fully supports nvidia.com/gpu resources |
| Scalable and lightweight | Ideal for microservices or model inference |
| YAML-based workflow definitions | Easy to store in Git for CI/CD workflows |
Example Use Case:
Run distributed training on multiple pods
Automatically clean up intermediate files
Record run metadata and metrics
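A minimal sketch of an Argo Workflow with a single GPU training step (image and command are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-
spec:
  entrypoint: train
  templates:
  - name: train
    container:
      image: myrepo/train:latest        # placeholder training image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1             # step runs only on a node with a free GPU
```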
| Concept | Description |
|---|---|
| Workflow DAG | A directed graph defining task order and dependencies |
| Step Container | Each step in a workflow is a container that runs a defined script or command |
| Retry Policy | Automatically retry a failed step X times before failing the workflow |
| Parameters | Pass dynamic values (e.g., batch size, dataset path) to jobs |
| Artifacts | Intermediate files (e.g., trained models, logs) saved and passed between steps |
| Benefit | Explanation |
|---|---|
| Reproducibility | Always run the same code, same config |
| Scalability | Can handle many jobs concurrently across multiple GPUs |
| Resilience | Retry failed steps automatically |
| Traceability | Track every step, log, metric, and artifact |
| Version Control + CI/CD | Integrate with Git to enable production-grade MLOps pipelines |
Scenario: A company trains a new NLP model every day at midnight using new data collected during the day.
Workflow:
Step 1: Download new data from S3
Step 2: Clean and tokenize data
Step 3: Train Transformer model (requires 2 GPUs)
Step 4: Evaluate performance
Step 5: If accuracy > 90%, deploy to production
Tools:
Airflow or Argo to automate scheduling and step execution
Kubernetes + GPU plugin to manage GPU allocation
| Skill | Description |
|---|---|
| Write workflow YAML (Argo/Kubeflow) | Define steps, containers, resources, dependencies |
| Schedule jobs (Airflow) | Use DAGs and cron triggers to run workflows periodically |
| Use `nvidia.com/gpu` in automated pods | Ensure GPU resources are allocated properly |
| Monitor automated jobs | Use built-in dashboards and Prometheus/Grafana integration |
| Handle failures gracefully | Add retries, alerting, logging to each step |
These scenarios simulate the types of problems and tasks you’ll face in real AI operations roles and on the NCP-AIO exam.
You should understand not just what to do, but why it works and how it interacts with Kubernetes, Slurm, GPU scheduling, monitoring, etc.
Problem: You need to deploy a containerized model training job that uses GPUs.
Include GPU limits and a node selector in the pod spec:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2
nodeSelector:
  accelerator: nvidia
```

Apply it:

```bash
kubectl apply -f job.yaml
```

Concepts involved:
Kubernetes Pod spec
nvidia.com/gpu limits
GPU plugin
Node selectors
Problem: You want real-time visibility into GPU usage per job.
Deploy DCGM Exporter on each GPU node
Set up Prometheus to scrape metrics
Use Grafana to visualize:
GPU temperature
Memory usage
Power draw
ECC error counts
Concepts involved:
DCGM
Prometheus scraping
Grafana dashboard setup
Node metrics + per-job statistics
Problem: Team A and Team B need to run jobs on the same cluster but must not interfere with each other.
In Kubernetes:
Create separate namespaces: team-a, team-b
Assign ResourceQuotas to each namespace
Use RBAC to restrict access
Optionally assign MIG slices to teams
In Slurm:
Define partitions: team-a-partition, team-b-partition
Assign users to each partition using AllowUsers=
Concepts involved:
Namespaces
Resource quotas
MIG
Slurm partitions and access control
Problem: You need to enforce a GPU usage limit for one team.
Create a ResourceQuota in their namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-limit
  namespace: team-x
spec:
  hard:
    requests.nvidia.com/gpu: "5"
```
Concepts involved:
ResourceQuota
Namespace scoping
GPU plugin resource requests
Problem: A training job fails after running for some time with a memory error.
Check logs:
kubectl logs (Kubernetes) or the Slurm log output
Identify the error (e.g., an out-of-memory message in the traceback)
Fix:
Reduce batch size
Increase memory request:

```yaml
resources:
  requests:
    memory: "32Gi"
```
Concepts involved:
Pod logs
Memory limits
Container lifecycle
Model optimization techniques
| Situation | You Should Know... |
|---|---|
| Job stuck in Pending state (K8s) | How to debug using kubectl describe pod, check GPU availability |
| Slurm job runs but uses no GPU | How to request GPUs correctly with --gres=gpu:1 |
| Monitoring dashboard shows 0% GPU use | How to trace bottlenecks in CPU or I/O |
| Kubernetes job fails silently | Where to find pod logs and error status |
| One user using too many GPUs | How to apply quotas or fair-share limits |
| Topic | Practical Skills |
|---|---|
| Kubernetes + GPU Plugin | Write GPU-aware pod specs, use node selectors, apply limits |
| Slurm | Submit jobs, manage partitions, enforce GPU usage policies |
| Monitoring | Install DCGM exporters, configure Prometheus/Grafana, build dashboards |
| Isolation Techniques | Use MIG, Kubernetes namespaces, and Slurm partitions to separate workloads |
| Scaling for Multi-Tenant | Set up quotas, fair-share scheduling, and RBAC rules |
| Automation Tools | Build workflows with Kubeflow, Argo, or Airflow for repeatable pipelines |
**nvidia-device-plugin YAML Structure**
The NVIDIA device plugin is typically deployed as a DaemonSet. Key YAML components include:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  template:
    spec:
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:latest
        args: ["--mig-strategy=single"]
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
```
Key fields:
| Field | Purpose |
|---|---|
| `--mig-strategy` | Controls MIG handling (`none`, `single`, `mixed`) |
| `FAIL_ON_INIT_ERROR` | Allows a graceful start if no GPU exists |
| `volumeMounts` | Mount path Kubernetes uses to access the plugin socket |
In Kubernetes, GPU resources are requested per container within a Pod:
Allocated GPUs are dedicated; containers and Pods do not share a GPU.
There is no fine-grained way to split one GPU across containers in the same Pod (unless using MIG or time-slicing).
Default behavior: the assigned GPU is exposed to the requesting container via the environment variable NVIDIA_VISIBLE_DEVICES.
| Type | Behavior |
|---|---|
| Pod | One-off unit, no restart on failure |
| Job | Handles retries, status tracking, suitable for training or evaluation tasks |
Best practice: Use Job for batch inference or training workloads; only use Pod for debugging or non-repeatable jobs.
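A minimal sketch of a GPU training Job (the image name is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job
spec:
  backoffLimit: 2                    # retry a failed pod up to 2 times
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: myrepo/train:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```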
MIG support is available via the NVIDIA device plugin with --mig-strategy.
Time-slicing (e.g., A10 GPU) is experimental with newer Kubernetes + NVIDIA driver versions.
Still no native fractional GPU (e.g., 0.5 GPU) support in Kubernetes without MIG.
| Field | Effect |
|---|---|
| `resources.requests` for `nvidia.com/gpu` | Used for scheduling |
| `resources.limits` for `nvidia.com/gpu` | Used for runtime enforcement |
Rules:
Requests and limits for nvidia.com/gpu must match exactly (e.g., 1 and 1), or the pod is rejected.
No overcommit is allowed on GPUs.
Setting only limits is fine: the request defaults to the limit.
Setting requests without limits (or requests ≠ limits) is invalid for extended resources such as nvidia.com/gpu.
Used to submit multiple similar jobs with a single sbatch command:
sbatch --array=0-9 run_train.sh
Inside the script, access the job index with $SLURM_ARRAY_TASK_ID.
Use cases: Hyperparameter sweep, different seeds, model variant evaluation.
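A hedged sketch of an array script that sweeps a hyperparameter (the learning-rate values are illustrative, and train.py is assumed to accept a --lr flag):

```bash
#!/bin/bash
#SBATCH --job-name=lr-sweep
#SBATCH --array=0-3
#SBATCH --gres=gpu:1
#SBATCH --output=sweep_%A_%a.log   # %A = array job ID, %a = task index

LEARNING_RATES=(0.1 0.01 0.001 0.0001)
python train.py --lr "${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}"
```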
**srun vs sbatch**

| Command | Usage |
|---|---|
| `sbatch` | Submit a batch job; executes in the background |
| `srun` | Run interactively, or within sbatch for MPI/multi-node steps |
Interactive example:
srun --gres=gpu:1 --pty bash
**scontrol for Job Management**
Use scontrol to modify queued or running jobs:
Suspend: scontrol suspend <jobid>
Resume: scontrol resume <jobid>
Change time limit: scontrol update jobid=<id> TimeLimit=01:00:00
| Partition Attribute | Behavior |
|---|---|
| `Priority` | Determines job execution order |
| `PreemptMode` | Allows interruption of low-priority jobs |
| `TimeLimit` | Controls maximum job length per partition |
Best practice:
Use short partitions for testing
Use preemption to enable high-priority job insertion
Deploy dcgm-exporter on each node.
Collect metrics via Prometheus scrape.
Map GPU usage to job/user via Slurm accounting + hostname mapping.
You can assign specific MIG slices to users via GRES and Job Constraints:
```bash
# gres.conf
Name=gpu Type=MIG-GPU-0 UUID=GPU-abc123 Count=1

# slurm.conf
NodeName=nodex Gres=gpu:MIG-GPU-0:1

# job script
#SBATCH --gres=gpu:MIG-GPU-0:1
```
Use SlurmDBD to collect per-user job GPU metrics:
sacct -u <user> --format=JobID,Elapsed,ReqGRES,AllocGRES,MaxRSS
Integrate with Grafana, Elastic, or custom scripts for department-level billing.
**LimitRange for GPU Control**
Use LimitRange to control min/max GPU per Pod:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limit-range   # example name
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: 2
    min:
      nvidia.com/gpu: 1
```
Pair with ResourceQuota to limit total GPU per Namespace.
| Metric | Meaning |
|---|---|
| `DCGM_FI_DEV_POWER_USAGE` | Instant power draw in watts |
| `DCGM_FI_DEV_MEMORY_TEMP` | Memory temperature (°C) |
| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` / `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | Accumulated ECC errors (corrected/uncorrected) |
Use tools like:
Fluentd, Fluent Bit → forward logs from container stdout
Logstash → log parsing
ELK/EFK stack → storage and visualization
Deploy as DaemonSet + ConfigMap.
Add hardware monitoring hooks via node-problem-detector:
Source: /var/log/syslog, dmesg
Reports GPU errors, thermal events as Kubernetes Node Conditions
Sample config:
```yaml
groups:
- name: GPU overheating
  rules:
  - alert: GPUTemperatureTooHigh
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Too High"
      description: "Node {{ $labels.instance }} GPU exceeds 85°C"
```
**LimitRange vs ResourceQuota**

| Resource | Scope | Function |
|---|---|---|
| `LimitRange` | Per Pod/Container | Set min/max per resource type |
| `ResourceQuota` | Per Namespace | Set total aggregate limit (e.g., 8 GPUs) |
Use both to enforce fair resource usage.
Kubernetes supports hierarchical quotas via add-ons such as the Hierarchical Namespace Controller (HNC):
Allow subteam quotas: teamA-dev, teamA-prod
Schedulers can allocate based on sub-quota priorities
Advanced feature to inject custom scheduling logic (e.g., for GPU memory size or NUMA topology):
Write HTTP extender server
Modify kube-scheduler config to include:
```yaml
extenders:
- urlPrefix: "http://localhost:12300/scheduler"
  filterVerb: "filter"
  prioritizeVerb: "prioritize"
```
Used by platforms like Volcano or Alibaba AI scheduler.
Example of an Argo workflow step requesting a GPU:
```yaml
containers:
- name: train
  image: pytorch/pytorch:2.0
  resources:
    limits:
      nvidia.com/gpu: 1
```
Argo will schedule the step only on GPU-capable nodes.
Example config for tuning learning rate:
```yaml
objective:
  type: maximize
  goal: 0.9
  objectiveMetricName: accuracy
parameters:
- name: learning_rate
  parameterType: double
  feasibleSpace:
    min: "0.01"
    max: "0.1"
```
Runs trials using Kubernetes jobs with GPUs.
**KubernetesPodOperator**
DAG snippet:
```python
KubernetesPodOperator(
    namespace='ai-pipeline',
    image='my-model:latest',
    cmds=["python", "train.py"],
    name="train-task",
    task_id="train-model",
    resources={"limit_memory": "4Gi", "limit_gpu": "1"}
)
```
Airflow dynamically creates GPU-enabled Pods.
In Argo/Kubeflow:
```yaml
retryStrategy:
  limit: 3
  retryPolicy: "Always"
```
Supports backoff, retry delay, onFailure handlers for production reliability.
How does Kubernetes schedule workloads that require NVIDIA GPUs?
Kubernetes schedules GPU workloads using device plugins that advertise GPU resources to the scheduler.
In GPU-enabled Kubernetes clusters, the NVIDIA device plugin exposes GPUs as schedulable resources. Nodes report available GPUs to the Kubernetes control plane, and workloads request them through resource limits in pod specifications. When a pod declares a GPU requirement, the scheduler places it on a node with sufficient GPU capacity. This prevents oversubscription and ensures GPU workloads run only where hardware is available. Administrators often combine device plugins with cluster monitoring to track GPU allocation and utilization. A common mistake is deploying containers without declaring GPU resource requests, which results in workloads running without GPU acceleration even though GPUs exist on the node.
Demand Score: 82
Exam Relevance Score: 87
Why might a containerized AI workload fail to detect GPUs even when the host machine has available GPUs?
This typically occurs when the container runtime lacks GPU support or the NVIDIA container toolkit is not properly configured.
GPU devices are not automatically accessible inside containers. Administrators must install and configure GPU-aware container runtimes so that GPU device files and driver libraries are exposed to containers. Without this configuration, frameworks such as TensorFlow or PyTorch cannot detect GPUs and will fall back to CPU execution. This problem frequently occurs when nodes are newly added to clusters without installing the necessary container toolkit or when runtime configuration files are missing. Verifying GPU visibility inside containers using diagnostic commands ensures workloads can properly access GPU resources.
Demand Score: 78
Exam Relevance Score: 88
Why is explicit GPU resource allocation required when scheduling AI workloads in orchestration platforms?
Explicit resource allocation prevents resource contention and ensures workloads receive dedicated GPU capacity.
In shared AI infrastructure, multiple workloads compete for GPU resources. Orchestration platforms rely on declared resource requests to determine how workloads are placed across nodes. When GPU requirements are explicitly specified, the scheduler ensures that workloads run only on nodes with available GPU capacity and prevents multiple workloads from unintentionally occupying the same GPU. Without these constraints, scheduling systems cannot enforce resource isolation, which may result in unpredictable performance or job failures. Proper resource specification is therefore critical for maintaining stable GPU utilization in multi-tenant environments.
Demand Score: 74
Exam Relevance Score: 83
What operational challenge occurs when multiple AI workloads attempt to use the same GPU simultaneously?
Resource contention can degrade performance and lead to memory allocation conflicts.
GPUs are designed to execute parallel workloads efficiently, but uncontrolled sharing may cause conflicts in compute scheduling and memory allocation. When multiple jobs attempt to allocate large memory regions simultaneously, workloads may fail or experience reduced performance. This issue is common in shared research clusters or ML development environments. Administrators often mitigate this risk by implementing scheduling policies, workload isolation mechanisms, or GPU partitioning strategies that ensure predictable resource allocation.
Demand Score: 76
Exam Relevance Score: 85
Why do AI operations teams monitor GPU workload execution after deployment?
Monitoring helps verify that workloads utilize GPU resources efficiently and detect abnormal execution behavior.
Once workloads are scheduled and running, administrators must observe runtime metrics to ensure GPUs are being used effectively. Metrics such as GPU utilization, memory usage, and job duration reveal whether workloads are performing as expected. Low utilization may indicate data pipeline bottlenecks, configuration issues, or misconfigured containers. Monitoring also helps identify workloads that monopolize GPU resources or consume excessive memory. By continuously analyzing these metrics, operations teams can adjust scheduling strategies and optimize overall cluster performance.
Demand Score: 71
Exam Relevance Score: 82