NCP-AIO Troubleshooting and Optimization

Troubleshooting and Optimization Detailed Explanation

1. GPU Monitoring and Diagnostics

Why This Matters

In high-performance AI environments, GPUs are your most valuable and expensive resources.
You must ensure they are:

  • Healthy

  • Efficiently utilized

  • Properly monitored

Misconfiguration, overheating, ECC errors, or memory issues can silently reduce performance or cause job failures.

Tool 1: DCGM (Data Center GPU Manager)

DCGM is a low-overhead GPU telemetry and diagnostics framework from NVIDIA. It can be run:

  • Directly on bare-metal

  • Inside Docker containers

  • In Kubernetes environments as part of the GPU Operator stack

Use Cases of DCGM:
Use Case Example
Monitor GPU temperature Alert if above 85°C
Track ECC errors Detect GPU memory instability
Enforce power or thermal limits Trigger alerts on breach
Monitor application behavior Collect metrics like memory, SM, PCIe usage
Export metrics to Prometheus Build dashboards with Grafana
Key DCGM Commands
  1. List Available GPUs:
dcgmi discovery -l
  2. Run Health Checks:
dcgmi health -c

This checks:

  • Temperature

  • PCIe connectivity

  • Power

  • ECC

  3. Get Performance Stats:
dcgmi stats --groupId 0

This gives:

  • GPU utilization

  • Memory usage

  • PCIe throughput

  • SM clock rate

  • Power draw

  4. Collect I/O and NVLink Counters:
    DCGM supports advanced custom metrics:
  • DCGM_FI_DEV_NVLINK_THROUGHPUT_TX

  • DCGM_FI_DEV_POWER_USAGE
These metrics can be exported to Prometheus via the DCGM exporter and scraped from there.
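
For a quick look at such fields from the CLI, dcgmi dmon can stream selected field IDs; a minimal sketch (the field IDs shown are the commonly documented ones for temperature and power, but may vary by DCGM version):

dcgmi dmon -e 150,155 -d 1000
# -e : comma-separated DCGM field IDs (150 = GPU temperature, 155 = power usage)
# -d : sampling interval in milliseconds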

Tool 2: nvidia-smi – Your GPU Swiss Army Knife

A command-line tool installed with the NVIDIA driver.

Basic Usage:
nvidia-smi

This displays:

  • Driver version

  • CUDA version

  • GPU name

  • Memory usage

  • Power draw

  • Temperature

  • Active processes

Advanced Commands
  1. View Topology and NVLink Connectivity:
nvidia-smi topo -m

Outputs a matrix showing GPU interconnect speeds and NVLink links:

        GPU0    GPU1    CPU Affinity
GPU0     X      NV#           0-19
GPU1    NV#      X            0-19
  2. Continuous Monitoring (like top for GPUs):
nvidia-smi dmon

Shows live output of:

  • GPU utilization

  • Memory usage

  • Encoder/decoder usage

  • PCIe stats

  3. Check MIG Configurations:
nvidia-smi mig -lgi

List GPU instances created via MIG.

Example Scenario

Symptom: GPU memory is full, job crashes.

Diagnosis:

  1. Use nvidia-smi to check memory usage in real-time.

  2. Use dcgmi stats to observe if memory pressure is rising over time.

  3. If running in Kubernetes: export metrics via DCGM exporter and observe in Grafana.

Action:

  • Reduce batch size

  • Switch to mixed precision (FP16)

  • Fix the data loader (e.g., an infinite loop that keeps accumulating data in memory)
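
To watch GPU memory during this kind of diagnosis, nvidia-smi can be queried on an interval; a small sketch using standard --query-gpu fields:

# Print used/total GPU memory and utilization every 5 seconds
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu --format=csv -l 5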

Summary of Skills for GPU Monitoring

Tool Purpose
nvidia-smi General GPU monitoring and topology
nvidia-smi dmon Real-time stats like GPU load & memory
DCGM CLI (dcgmi) Deeper health metrics and diagnostics
Prometheus + DCGM Dashboards for cluster-wide monitoring

2. Docker & Container Issues

Why Docker Integration Is Critical

Most AI workloads today run in containers to ensure:

  • Portability

  • Dependency isolation

  • Easy deployment

However, if Docker is not properly configured to interface with NVIDIA GPUs, containers may:

  • Fail to detect GPUs

  • Crash during runtime

  • Perform poorly (e.g., no GPU acceleration)

Common Docker GPU Failures

Symptom Possible Cause
CUDA device not found NVIDIA runtime not enabled
Container runs, but uses only CPU Started without --gpus all flag
nvidia-smi fails in container Missing drivers or runtime
MIG mode GPU not visible in container MIG instance not assigned to container

Fix 1: Always Start Containers With GPU Access

When running containers, explicitly request GPUs:

docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

You can limit the container to a single GPU with --gpus device=0, or assign a MIG instance directly using --gpus '"device=UUID"'.

Fix 2: Validate NVIDIA Container Runtime

Ensure Docker is configured to use nvidia runtime.

Check available runtimes:
docker info | grep -i runtime

You should see:

Runtimes: runc nvidia
Default Runtime: runc

If nvidia is missing, install the NVIDIA Container Toolkit.
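
On current toolkit versions, the runtime can be registered with the nvidia-ctk helper (a sketch; older setups edited /etc/docker/daemon.json by hand):

# Write the nvidia runtime entry into Docker's daemon.json, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker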

To test runtime:
docker run --rm --runtime=nvidia nvidia/cuda:12.0-base nvidia-smi

If that works, the runtime is installed correctly.

Fix 3: Check NVIDIA Container Toolkit Installation

Steps for Ubuntu:
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

Also verify:

  • Kernel modules loaded: nvidia, nvidia_uvm, nvidia_drm

  • Docker version ≥ 19.03
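
A quick check that the kernel modules listed above are actually loaded:

lsmod | grep nvidia
# Expect entries such as nvidia, nvidia_uvm, and nvidia_drm in the output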

Fix 4: Debugging MIG Containers

Problem: MIG instance not accessible in container

Fix:

  • Make sure MIG instance is created using nvidia-smi mig -cgi

  • Assign instance to container with exact GPU UUID:

docker run --gpus '"device=GPU-UUID"' ...

Use this command to list instances:

nvidia-smi -L
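
For reference, a hedged sketch of setting up MIG from scratch (profile names such as 1g.5gb depend on the GPU model, and enabling MIG mode may require draining workloads or resetting the GPU):

# Enable MIG mode on GPU 0, create a 1g.5gb GPU instance plus its compute instance,
# then list the resulting MIG device UUIDs to pass to --gpus
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C
nvidia-smi -L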

Real-World Debugging Example

Symptom: A container launches but shows no GPU available

Step-by-step:
  1. Inside container:
nvidia-smi

→ Fails

  2. On host:
nvidia-smi        # works
docker info       # check runtime
  3. Fix:
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker

Re-run:

docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi

Summary of Skills for Container Troubleshooting

Skill Tool / Command
Confirm GPU access in container nvidia-smi inside container
Use GPU-enabled container docker run --gpus all ...
Check runtime support docker info | grep -i runtime
Assign MIG GPU instance Use --gpus "device=UUID" and nvidia-smi -L
Install NVIDIA runtime properly nvidia-container-toolkit, restart Docker

3. Kubernetes Troubleshooting

Why GPU Troubleshooting in Kubernetes Is Tricky

Kubernetes abstracts away hardware details. So when something goes wrong, such as:

  • Pods not seeing GPUs

  • Jobs stuck in Pending

  • GPU metrics not being collected

you’ll need to dig into multiple layers:

  • Node configuration

  • Pod YAML

  • NVIDIA plugins

  • Cluster resource scheduling

Common GPU-Related Kubernetes Issues

Symptom Likely Root Cause
Pod does not detect GPU nvidia-device-plugin not running or misconfigured
Pod stuck in Pending No node with available GPU or incorrect resources spec
Device plugin DaemonSet failed NodeSelector or taint mismatch, missing runtime support
Metrics not showing in Prometheus dcgm-exporter not running or not scraped properly

Fix 1: Check if GPU Plugin Is Running

NVIDIA Device Plugin is deployed as a DaemonSet. Check its status:

kubectl get daemonset -n gpu-operator

You should see:

nvidia-device-plugin-daemonset   DESIRED: X   READY: X

Check logs:

kubectl logs -n gpu-operator <device-plugin-pod-name>

If it fails to start:

  • Check nvidia-smi on the node

  • Check Docker runtime (nvidia must be available)

Fix 2: Verify Pod Resource Requests

Kubernetes won’t schedule a GPU pod unless you request the GPU correctly:

Example Pod YAML:
resources:
  limits:
    nvidia.com/gpu: 1

Missing this will result in the pod being scheduled without GPU access.
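
For context, a minimal complete pod spec with a GPU request might look like this (pod name and image tag are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1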

Also confirm:

  • The node has available GPUs

  • You are not exceeding resource quotas

Fix 3: Check Node Labels and Taints

Device Plugin only works on nodes that:

  • Have GPU hardware

  • Are correctly labeled (nvidia.com/gpu.present=true)

  • Do not have taints that block the DaemonSet (or have matching tolerations)

Check node label:
kubectl get nodes --show-labels
Check taints:
kubectl describe node <node-name> | grep Taint

If there's a taint like:

nvidia.com/gpu=present:NoSchedule

Make sure your pod has:

tolerations:
- key: "nvidia.com/gpu"
  operator: "Exists"
  effect: "NoSchedule"

Fix 4: Troubleshoot with kubectl describe and Logs

If a pod fails to start:

kubectl describe pod <pod-name>

Look for:

  • Failed scheduling

  • Events at the bottom

  • Misconfigured requests

To inspect logs:

kubectl logs <pod-name>

To debug the container runtime interface:

crictl ps -a

Real-World Debugging Scenario

Symptom: Pod is in Pending state with GPU request.

Diagnosis:

kubectl describe pod train-gpu-job

Output:

0/4 nodes are available: 4 Insufficient nvidia.com/gpu.

→ No node has free GPU

Fix:

  • Free GPU on a node

  • Add new GPU-enabled node

  • Reduce nvidia.com/gpu request

Summary of Kubernetes Troubleshooting Skills

Task Command / Method
Check GPU DaemonSet kubectl get daemonset -n gpu-operator
Inspect GPU plugin logs kubectl logs on plugin pod
Validate resource requests Pod YAML must include limits.nvidia.com/gpu: 1
Check node labels and taints kubectl describe node
Troubleshoot pod scheduling failures kubectl describe pod, review Events section
Confirm runtime is GPU-compatible docker info, nvidia-smi on the node

4. Performance Bottlenecks

Overview

Even when infrastructure is working correctly, AI workloads can suffer from:

  • Low GPU utilization

  • High job runtime

  • Unexpected memory errors

  • Slow I/O throughput

To fix this, you must profile all layers of the stack:

  • CPU

  • Memory

  • GPU

  • Storage

  • Interconnect

1. CPU Bottlenecks

Symptoms:
  • GPU is idle or lightly loaded

  • Data preprocessing is slow

  • CPU usage near 100% on one or few threads

Tools:
  • htop: View CPU cores, load balance

  • top: General process-level view

  • mpstat -P ALL: Per-core CPU usage

  • Prometheus + Grafana: Cluster-wide CPU visualization

Fixes:
  • Increase data loading workers (e.g., num_workers in PyTorch)

  • Move preprocessing to GPU (e.g., use NVIDIA DALI)

  • Balance threads across cores

2. Memory Bottlenecks

Symptoms:
  • Pod or container killed with OOMKilled

  • Frequent swapping or paging

  • dmesg shows out-of-memory (OOM) killer activity

Diagnostic Commands:
dmesg | grep -i oom
free -h
vmstat 1
Fixes:
  • Increase memory allocation in Kubernetes:
resources:
  requests:
    memory: "16Gi"
  • Optimize memory usage in code

  • Cache data properly instead of reading repeatedly

3. GPU Bottlenecks

Symptoms:
  • GPU is underutilized (< 30%) despite active training

  • Training job is slow or stuck on data loading

Tools:
  • nvidia-smi dmon: View real-time GPU stats

  • dcgmi stats: Monitor streaming multiprocessor (SM) activity

  • Nsight Systems: Timeline of kernel execution and memory copies

Common Causes:
  • Small batch sizes

  • Data loader too slow

  • Poor kernel performance (e.g., low occupancy)

Fixes:
  • Use larger batches (if memory allows)

  • Use mixed precision (FP16) for better throughput

  • Optimize data pipeline

  • Use faster augmentation tools (e.g., DALI)

4. I/O Bottlenecks

Symptoms:
  • GPU spends time waiting for data

  • Job is I/O-bound, not compute-bound

  • Logs show data not loading in time

Diagnostic Tools:
  • iostat, iotop: Disk read/write stats

  • nvidia-smi dmon: Look for spikes in memory throughput

  • DCGM custom metrics for I/O counters

  • Nsight Systems: Shows memory copy stalls
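
A quick disk-level check with the tools above (iostat comes from the sysstat package, iotop from the iotop package):

iostat -xz 1      # extended per-device stats; sustained high %util and await indicate saturation
sudo iotop -o     # show only processes currently performing I/O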

Fixes:
  • Use parallel file systems (Lustre, BeeGFS)

  • Enable RDMA if supported

  • Prefetch data in code

  • Use TFRecords, LMDB, or WebDataset for optimized storage formats

Real-World Scenario

Symptom: GPU utilization is only 20% during training

Diagnosis:

  1. nvidia-smi dmon → GPU not active

  2. top → Python process active, CPU at 100%

  3. Python log shows: “Loading batch…”

Cause: Data loading is the bottleneck

Fix:

  • Increase num_workers in DataLoader

  • Move preprocessing to GPU with NVIDIA DALI

Summary: Common Bottlenecks & Fixes

Bottleneck Type | Symptom | Tool(s) | Common Fixes
CPU | High wait time, low GPU usage | htop, mpstat, Prometheus | Tune data loader, use GPU-accelerated preprocessing
Memory | OOMKilled, swap, slow system | dmesg, free, vmstat | Allocate more memory, optimize memory usage
GPU | Low utilization, slow training | nvidia-smi, Nsight Systems | Increase batch size, use FP16, optimize kernels
I/O | Data stalls, slow dataset access | iostat, Nsight, DCGM | Use RDMA, faster storage, caching, better formats

5. Interconnect Analysis (NVLink & NVSwitch)

Why GPU Interconnects Matter

In multi-GPU systems (like DGX/HGX servers), GPUs need to exchange data at high speeds—especially during:

  • Distributed training (e.g., gradient sharing)

  • Model parallelism (splitting parts of models across GPUs)

  • Tensor or batch parallel computation

Slow or misconfigured interconnects lead to:

  • High latency

  • Communication bottlenecks

  • Uneven performance across GPUs

NVLink: High-Speed GPU-to-GPU Communication

What is NVLink?
  • A direct, high-bandwidth connection between NVIDIA GPUs

  • Faster than PCIe (up to 600 GB/s bidirectional on newer models)

  • Used in most modern HGX and DGX systems

Key Features:
Feature Description
Peer-to-peer access GPUs can directly access each other’s memory
Transparent to user No code changes needed if using NCCL or CUDA
Scalable Supports topologies up to 8 GPUs interconnected
View NVLink Topology

Use:

nvidia-smi topo -m

Example output:

        GPU0    GPU1    GPU2
GPU0     X      NV1     NV2
GPU1    NV1      X      SYS
GPU2    NV2     SYS      X

Legend:

  • NV1, NV2: Connected via NVLink

  • SYS: Communication via PCIe (slower)

Interpret This Output:
  • Direct NVLink between GPUs is best

  • If traffic goes over SYS, performance will degrade

NVSwitch: Full-Bandwidth All-to-All Fabric

What is NVSwitch?
  • A switching fabric that allows any GPU to talk to any other GPU at full NVLink speed

  • Used in systems like DGX A100, DGX H100, or HGX-8/16

  • Replaces the limited point-to-point NVLink architecture

Why it Matters:
  • Uniform communication bandwidth

  • Reduces “topology bottlenecks”

  • No need to worry about which GPU talks to which

NCCL Tests: Measure Interconnect Performance

NCCL (NVIDIA Collective Communication Library) is optimized for:

  • AllReduce, Broadcast, Gather, Scatter

  • Multi-GPU training frameworks like PyTorch, TensorFlow, Horovod

To Run Bandwidth/Latency Tests:

Clone and build:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1

Run a simple test:

mpirun -np 4 ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1

This will show:

  • Bandwidth (GB/s)

  • Latency per GPU pair

Tip:

Run tests between GPUs with and without NVLink to compare performance.
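
NCCL can also report which transport it selects at runtime; a hedged sketch using its standard debug environment variables:

# Print NCCL initialization and topology decisions (look for NVLink/P2P vs. SHM/NET paths)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH python train.py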

Real-World Debugging Example

Symptom: Multi-GPU training is slower on some nodes.

Steps:
  1. Run nvidia-smi topo -m
    → GPU2 communicates over SYS, others via NVLink

  2. Run NCCL tests:
    → Lower bandwidth between GPU2 and others

  3. Fix:

    • Rewire NVLink bridge (if physical)

    • Reassign workload to NVLink-connected GPUs only

Summary: Interconnect Troubleshooting

Tool / Method Use Case
nvidia-smi topo -m View GPU connectivity and NVLink layout
NCCL Tests Benchmark GPU-to-GPU bandwidth
Nsight Systems Identify communication stalls
DCGM Bandwidth Metrics Export NVLink/PCIe stats over time
Hardware Check Confirm NVLink bridges or NVSwitch status

Best Practices

Best Practice Why It Matters
Place jobs on NVLink-connected GPUs Ensures fast peer-to-peer communication
Monitor GPU interconnect bandwidth Avoid silent slowdowns during training
Use NVSwitch if available Guarantees full-bandwidth across all GPUs
Run NCCL benchmarks after cluster deployment Validate performance before real training begins

6. Nsight Systems & Nsight Compute

Why Use Profiling Tools?

Even if your code runs and uses GPUs, performance may still be poor due to:

  • Inefficient kernel launches

  • Poor memory access patterns

  • Synchronization overhead

  • I/O stalls or CPU bottlenecks

To identify these issues, NVIDIA offers two powerful tools:

Tool | Focus Level | What It Analyzes
Nsight Systems | System-wide | Timeline of CPU, GPU, memory, I/O, and kernels
Nsight Compute | Kernel-level (CUDA) | Thread occupancy, memory coalescing, warps

Nsight Systems

Purpose:
  • Provides a global view of your application

  • Helps spot issues across CPU, GPU, and system resources

Key Features:
  • Timeline view of:

    • Kernel launches

    • Memory copies

    • CPU thread activity

    • API calls (e.g., cuMemcpy, cudaLaunchKernel)

  • Zoom into specific regions to spot stalls

Launch via CLI:
nsys profile -o output_name ./your_script.sh

Or for Python:

nsys profile -o tf_model python train.py
Output:

Generates .qdrep and .nsys-rep files, viewable in:

  • Nsight Systems GUI

  • Or CLI summary:

nsys stats output_name.qdrep
Common Issues Nsight Systems Reveals:
Symptom Interpretation
Kernel launches delayed CPU bottleneck or sync blocking
Long memory copies Poor data locality or missing page pinning
GPU idle while CPU active Data not arriving fast enough
Lots of small kernel launches Launch overheads outweigh compute gains

Nsight Compute

Purpose:
  • Dive into individual kernel performance

  • Understand what limits your GPU execution

Key Metrics:
Metric What It Tells You
Occupancy GPU thread block utilization
Memory Coalescing Whether threads access memory efficiently
Warp Execution Efficiency Fraction of warps that are active
Shared Memory Utilization Helps tune for memory-bound kernels
Launch via CLI (ncu is the Nsight Compute command-line profiler; nv-nsight-cu-cli is its legacy name):
ncu -o report ./app

To analyze a single kernel by name:

ncu --kernel-name <kernel-name> -o report ./app

Or open the .ncu-rep in Nsight Compute GUI for visual profiling.

Common Issues Nsight Compute Detects:
Issue | Description | Suggested Fix
Low occupancy | Blocks too small, too few threads | Increase thread count or adjust block size
Poor memory coalescing | Unaligned memory access | Use contiguous memory layouts
Branch divergence | Threads follow different code paths | Refactor kernel logic
High register pressure | Too many registers per thread | Optimize variable usage or split kernels

Example Use Case: PyTorch Model is Slow

  1. Run:
nsys profile -o myrun python train.py
  2. Inspect Timeline:
  • GPU is idle most of the time

  • cudaMemcpyAsync takes 80% of time

→ Suggests data is not on GPU or input pipeline is too slow

  3. Fix:
  • Move preprocessing to GPU (e.g., DALI)

  • Use pinned memory for data transfers

  • Batch memory copies
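
A small PyTorch sketch of the pinned-memory fix (dataset and variable names are illustrative):

from torch.utils.data import DataLoader

# Pin host memory so host-to-device copies can run asynchronously
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)
for images, labels in loader:
    images = images.to("cuda", non_blocking=True)   # async copy from pinned memory
    labels = labels.to("cuda", non_blocking=True)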

Summary: When to Use Each Tool

Tool Use Case
Nsight Systems End-to-end profiling, timeline view
Nsight Compute In-depth kernel analysis, tuning CUDA performance

Practical Tips

Tip Why It Helps
Always profile after major code changes Prevent hidden regressions
Use Nsight in early development, not just final Catch architectural mistakes early
Pair with Prometheus + Grafana for correlation Relate kernel issues with system-level metrics
Use multiple runs to find consistent patterns Avoid noise from one-off spikes or jitter

7. Training Optimization Strategies

Why Optimization Matters

Training large AI models is expensive and time-consuming. Without optimization:

  • Training takes days instead of hours

  • GPU resources are underutilized

  • Scaling across GPUs becomes inefficient

By optimizing training, you can:

  • Reduce training time

  • Lower compute cost

  • Improve model convergence stability

1. Data Pipeline Optimization

Symptoms of Poor Data Pipeline:
  • GPU idle time is high

  • Training slow despite low CPU load

  • I/O waits visible in profiling tools (e.g., Nsight Systems)

Optimization Techniques:
Technique Benefit
Data prefetching Fetch next batch while current is training
Parallel data loading Use num_workers > 0 in PyTorch
Efficient formats Use TFRecords, WebDataset, or LMDB
GPU-based decoding Use NVIDIA DALI for image/video data
Data sharding Split data across nodes to reduce I/O
PyTorch Example:
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32, num_workers=8, prefetch_factor=2)

2. I/O Optimization

Slow data loading from storage (especially NFS) can bottleneck training.

Solutions:
  • Use parallel file systems (Lustre, BeeGFS, GPFS)

  • Use local NVMe SSDs for temporary caching

  • Enable RDMA for remote file access

  • Use GPUDirect Storage if supported

3. GPU Usage Optimization

Most modern GPUs support mixed precision training (e.g., FP16 or BF16), which improves performance.

Techniques:
Optimization Benefit
Mixed precision Use Tensor Cores, reduce memory
Larger batch sizes Better GPU utilization
Gradient accumulation Simulate large batches on small GPUs
PyTorch Example:
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
for data, target in loader:
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
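
Gradient accumulation (listed in the table above) composes with the same AMP loop; a hedged sketch where accumulation_steps is an illustrative value:

# Accumulate gradients over several small batches to emulate a larger effective batch
accumulation_steps = 4
optimizer.zero_grad()
for step, (data, target) in enumerate(loader):
    with autocast():
        loss = loss_fn(model(data), target) / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()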

4. Distributed Communication Optimization

When using multiple GPUs/nodes, communication can dominate training time.

Techniques:
Method | Tool / Library | Benefit
AllReduce optimization | NCCL | Faster gradient sync
Horovod Fusion Buffer | Horovod | Reduces network overhead
Overlapping compute & comm | Native PyTorch DDP or Horovod | Better GPU utilization
Use NVLink/NVSwitch | Hardware | Low-latency inter-GPU traffic
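
A sketch of overlapping compute and communication with native PyTorch DDP (assumes the script is launched with torchrun, which sets LOCAL_RANK, and that model is already defined):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # NCCL handles AllReduce over NVLink/IB
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.cuda(local_rank), device_ids=[local_rank])
# Gradient AllReduce now runs automatically and is overlapped with the backward pass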

5. Scheduling Strategy

If multiple jobs or users share a cluster, poor scheduling can result in:

  • Resource starvation

  • GPU fragmentation

  • Overhead from context switching

Best Practices:
Technique Benefit
Bind jobs to NVLink-connected GPUs Maximizes inter-GPU speed
Set job affinity (K8s, Slurm) Avoids placement on overloaded nodes
Use MIG (on A100/H100) Enables fair GPU partitioning
Prioritize critical jobs Use Slurm QOS or K8s priority class
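
A minimal Slurm batch script sketch for the GPU and affinity settings above (GRES name and counts are illustrative and depend on the cluster configuration):

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:2           # request 2 GPUs on one node
#SBATCH --cpus-per-task=16     # leave enough CPU for data loading
srun python train.py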

Summary: Training Optimization Techniques

Area | Method | Tool/Tech Used
Data | Prefetch, shard, parallel load | DALI, PyTorch DataLoader
I/O | Parallel FS, RDMA, caching | BeeGFS, GPUDirect Storage
GPU | FP16, batch size, accumulation | PyTorch AMP, Tensor Cores
Communication | AllReduce, NCCL, overlapping transfer | NCCL, Horovod, DDP
Scheduling | Topology-aware scheduling | NVLink, MIG, Kubernetes, Slurm

8. Logs, Debugging Tools & Common Error Types

1. Why Logs Are Essential

In production environments, logs are your first line of defense when something breaks.

They help answer:

  • Why did my job crash?

  • Was it an infrastructure issue or a code bug?

  • Did the system run out of memory, GPU, or file descriptors?

2. Key Logging and Debugging Tools

Tool Use Case
journalctl System-wide logs (including Docker and Slurm)
docker logs <id> Container-specific logs
kubectl logs <pod> Kubernetes container logs
Slurm logs Scheduler + compute node logs
dmesg Kernel-level logs (e.g., OOM errors)
System Logs
journalctl -xe

Use for checking driver issues, hardware failures, or runtime crashes.

Docker Logs
docker ps -a
docker logs <container-id>

Check for:

  • CUDA errors

  • Missing libraries

  • Runtime crashes

Kubernetes Logs

Check the logs of a pod:

kubectl logs <pod-name>

For multi-container pods:

kubectl logs <pod-name> -c <container-name>

Describe pod to see failure reasons:

kubectl describe pod <pod-name>
Slurm Logs

Slurm logs are usually found in:

/var/log/slurm/slurmctld.log
/var/log/slurm/slurmd.log

Useful when:

  • Jobs are not starting

  • Node goes offline

  • GPU resources not allocated correctly
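
Alongside the log files, these standard Slurm commands help narrow down where the failure sits:

squeue -u $USER                  # your pending/running jobs and their pending reasons
sinfo -N -l                      # per-node state (idle, alloc, drain, down)
scontrol show node <node-name>   # node details, including Gres and the drain reason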

3. Common Error Types You Should Recognize

Here’s a cheat sheet of the most frequent runtime errors you’ll see and how to interpret them.

GPU Not Found

Error Message:

CUDA device not available

Diagnosis:

  • NVIDIA driver not loaded

  • Docker container lacks --gpus flag

  • MIG instance not assigned

Fix:

  • Run nvidia-smi on host and in container

  • Check runtime settings

  • Use --gpus all or --gpus "device=UUID"

Out-of-Memory (OOM)

Error Message:

Killed: Out of memory

Diagnosis:

  • Container memory request too small

  • Job exceeded available RAM or GPU memory

Fix:

  • Increase resources.requests.memory in K8s

  • Tune batch size or model size

  • Use mixed precision (FP16)

GPU Device Busy

Error Message:

GPU already in use

Diagnosis:

  • Another process is using GPU

  • Container trying to access unavailable MIG partition

Fix:

  • Use nvidia-smi to inspect usage

  • Kill unused processes

  • Reschedule job to another node

Job Hangs / No Output

Symptoms:

  • Job appears to run but no logs or model output

Diagnosis:

  • Inter-GPU communication bottleneck

  • Infinite loop or deadlock in code

  • Blocked I/O

Fix:

  • Profile with Nsight Systems

  • Add debug-level logging

  • Test components in isolation

Final Summary: Must-Have Debugging Skills

Issue Type Tool / Method
Crash Diagnosis journalctl, docker logs, kubectl logs
GPU Access nvidia-smi, container runtime settings
Memory Issues dmesg, Prometheus memory metrics
Slurm Failures slurmctld.log, squeue, sinfo
Stuck Jobs Profiling tools + log inspection

Troubleshooting and Optimization (Additional Content)

1. GPU Monitoring & DCGM Integration

DCGM + Prometheus + Grafana Visualization Architecture

To understand how GPU metrics flow from GPU hardware to visualization dashboards:

Data Flow Overview:

[NVIDIA GPU Hardware]
        ↓
[DCGM Exporter DaemonSet]
        ↓
[Prometheus Scraper]
        ↓
[Grafana Dashboard]
  • DCGM Exporter collects GPU metrics via the DCGM library (the same data exposed by nvidia-smi and dcgmi).

  • Prometheus scrapes metrics via dcgm-exporter:9400/metrics.

  • Grafana queries Prometheus and displays GPU utilization, power, temperature, and ECC stats.
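
Once the metrics reach Prometheus, Grafana panels typically query dcgm-exporter metric names such as these (the first gives per-GPU utilization, the second per-GPU temperature, the third framebuffer memory in use; exact labels and enabled fields depend on the exporter configuration):

avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)
max by (gpu) (DCGM_FI_DEV_GPU_TEMP)
sum by (gpu) (DCGM_FI_DEV_FB_USED)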

Helm-Based DCGM Exporter Installation

To deploy DCGM Exporter via Helm in a Kubernetes GPU cluster:

Sample values.yaml:

daemonset:
  enabled: true

serviceMonitor:
  enabled: true

resources:
  limits:
    nvidia.com/gpu: 1

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9400"

Then run:

helm repo add nvidia https://nvidia.github.io/k8s-dcgm
helm install dcgm-exporter nvidia/dcgm-exporter -f values.yaml

2. Kubernetes GPU Optimization Enhancements

Topology-Aware Scheduling (v1.24+)

This feature ensures Pods are scheduled based on:

  • GPU NUMA locality

  • NVLink connectivity

  • CPU-GPU binding consistency

Benefits:

  • Reduces cross-NUMA latency

  • Enhances inter-GPU bandwidth

  • Improves overall GPU pipeline efficiency

Usage:

  • Use TopologyManager and CPUManager in Kubelet.

  • Combine with NVIDIA Device Plugin using topologyPolicy: best-effort.
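
A kubelet configuration sketch for enabling these managers (standard KubeletConfiguration fields; the policy values shown are illustrative choices):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static            # pin exclusive CPUs for Guaranteed pods
topologyManagerPolicy: best-effort  # align CPU/GPU NUMA placement where possible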

NVIDIA GPU Operator Structure Diagram (Simplified)

Component Role
nvidia-device-plugin Reports GPU devices to kubelet
dcgm-exporter Exposes GPU metrics to Prometheus
nvidia-driver-daemonset Ensures host has correct GPU driver
mig-manager Manages MIG profiles if enabled
validator Validates installation and readiness

All components are deployed via a Helm chart and run as DaemonSets or Pods on GPU nodes.

3. Training Optimization: Memory and Interconnects

Memory Fragmentation in MIG Scenarios

Problem: Fragmented GPU memory may block container startup or reduce performance.

Solutions:

  • Use fixed batch sizes to stabilize allocation patterns.

  • Pre-allocate MIG profiles with known shapes (e.g., 1g.5gb).

  • Use Nsight Compute to analyze memory heatmap and pinpoint fragmentation sources.

GPUDirect Storage & RDMA

  • GDS enables direct data movement from NVMe SSD to GPU without CPU copy.

  • RDMA enables direct memory access between nodes for high-speed networking.

Validation Tools:

  • fio + nvidia-fs for GDS performance

  • ib_write_bw for RDMA throughput

Command Example:

fio --name=gds-test --filename=/mnt/data/testfile --ioengine=nvidia-fs --rw=read --size=4G
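
For the RDMA side, ib_write_bw from the perftest suite is typically run as a server/client pair (the hostname below is a placeholder):

ib_write_bw                  # on the server node
ib_write_bw <server-host>    # on the client node; reports RDMA write bandwidth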

4. Nsight Tools Usage and Profiling Diagnosis

Nsight Systems .qdrep Report Structure

GUI Layout Overview:

  • Top timeline: GPU kernels, memory copies

  • Middle pane: CPU threads and sync points

  • Bottom: API trace and durations

Diagnostic Use Cases:

Symptom Interpretation
Long cudaMemcpyAsync Bottlenecked I/O
Gaps between kernel launches CPU-side delay or sync issue
Repetitive short kernels Poor kernel fusion, launch overhead

Nsight Compute Kernel Report Interpretation

Metric | Meaning | Optimization
Occupancy < 40% | Underutilized GPU | Increase threads per block
High warp divergence | Control flow inefficiency | Restructure kernel logic
Shared memory underused | Memory bandwidth waste | Refactor memory access

5. Multi-Tenant GPU Scheduling Strategy

NVLink-Aware Job Placement

To avoid performance imbalance in multi-GPU nodes:

User GPU Allocation
A GPU0–GPU3 (full NVLink interconnect)
B GPU4–GPU7

How to Configure:

  • Kubernetes: Use nodeSelector or custom scheduler extender.

  • Slurm: Set GRES + topology-aware constraints.

MIG-Based Multi-User Resource Planning

Diagram Example (A100 40GB):

MIG Instance | User | Assigned Profile
GPU0:0 | User A | 1g.5gb
GPU0:1 | User B | 2g.10gb
GPU1:0 | User C | 3g.20gb

Key Principles:

  • Match workloads to MIG profiles

  • Use nvidia-smi -L to map UUIDs

  • Docker: assign a specific instance using --gpus "device=UUID"; in Kubernetes, request MIG devices through the device plugin (e.g., nvidia.com/mig-1g.5gb with the mixed MIG strategy)

Frequently Asked Questions

What is the most common reason for CUDA “out of memory” errors during model training?

Answer:

The model or batch size exceeds the available GPU memory capacity.

Explanation:

Deep learning training workloads allocate GPU memory for model parameters, intermediate tensors, and batch data. When the required memory exceeds the available GPU memory, CUDA throws an out-of-memory error. Administrators and developers often resolve this by reducing batch size, enabling gradient accumulation, or using mixed precision training. Another contributing factor may be memory fragmentation or other processes occupying GPU memory. Monitoring tools help identify memory usage patterns before workloads fail.

Demand Score: 88

Exam Relevance Score: 92

How can administrators identify whether GPU performance issues are caused by hardware bottlenecks or software configuration problems?

Answer:

They analyze GPU utilization metrics alongside system telemetry such as CPU, memory, and I/O usage.

Explanation:

Performance bottlenecks may originate from multiple system components. If GPU utilization remains low while CPU or disk usage is high, the issue likely stems from input pipelines or storage limitations rather than GPU capability. Conversely, consistently high GPU utilization combined with slow training progress may indicate inefficient model configurations or driver-related issues. By correlating metrics across system layers, administrators can determine whether performance problems arise from hardware limitations, workload design, or infrastructure configuration.

Demand Score: 82

Exam Relevance Score: 88

Why might GPU utilization remain low during deep learning training even when GPUs are available?

Answer:

Low utilization often occurs when data pipelines or CPU preprocessing cannot feed data to the GPU fast enough.

Explanation:

Deep learning pipelines depend on continuous data delivery to GPUs. If CPU preprocessing tasks such as data augmentation or dataset loading become bottlenecks, GPUs remain idle while waiting for input. Storage latency or insufficient parallel data loading can also reduce utilization. Administrators typically analyze system metrics to identify these bottlenecks and optimize the pipeline by increasing data loader workers, improving storage performance, or restructuring preprocessing tasks.

Demand Score: 79

Exam Relevance Score: 86

What troubleshooting step should administrators take if GPU devices disappear after a system update?

Answer:

Verify that NVIDIA drivers remain compatible with the updated kernel version.

Explanation:

Operating system updates often modify the kernel, which can break compatibility with previously installed GPU drivers. If drivers were compiled for an older kernel, GPU modules may fail to load after the update. Administrators should reinstall or rebuild the NVIDIA driver so it matches the current kernel environment. Failure to perform this step may result in GPUs not appearing in system utilities or compute frameworks.

Demand Score: 75

Exam Relevance Score: 87

Why can running multiple AI workloads simultaneously reduce GPU efficiency?

Answer:

Competing workloads may cause resource contention and inefficient GPU scheduling.

Explanation:

When multiple jobs share GPU resources without proper scheduling controls, they compete for memory bandwidth, compute units, and memory allocation. This contention may cause context switching overhead and inefficient hardware utilization. AI operations teams typically implement scheduling policies or GPU partitioning mechanisms to balance workloads. Without such controls, performance variability increases and some workloads may fail due to insufficient resources.

Demand Score: 72

Exam Relevance Score: 84

What optimization technique can improve training performance without increasing GPU hardware capacity?

Answer:

Mixed precision training can improve performance and reduce GPU memory consumption.

Explanation:

Mixed precision training uses lower precision numerical formats such as FP16 for certain computations while maintaining FP32 precision where necessary. This approach reduces memory consumption and increases throughput because GPUs can process lower precision operations more efficiently. Many modern deep learning frameworks support automated mixed precision to simplify implementation. Administrators and engineers often enable this feature when optimizing training performance on existing GPU infrastructure.

Demand Score: 70

Exam Relevance Score: 83
