Base Command Manager (BCM) is a tool developed by NVIDIA that helps manage GPU clusters.
Think of it like the “control tower” for a team of GPU-powered machines. It helps system administrators:
Add new machines (nodes) into the cluster
Monitor GPU usage and health
Control who can access what (user roles)
Coordinate with other tools like Slurm to run jobs on GPUs
If you have many servers with GPUs, BCM helps you manage them easily and efficiently.
A cluster is a group of connected computers that work together. A GPU cluster is a cluster whose machines have powerful NVIDIA GPUs installed.
Why clusters?
AI workloads—especially deep learning—need massive computation power, and that often means multiple GPUs across multiple machines.
| Feature | What It Means in Practice |
|---|---|
| Node registration | You can add new GPU machines to your cluster and track their health. |
| GPU usage monitoring | You can see how busy each GPU is (how much memory is used, how hot it is, etc.). |
| User/group access control | You decide which users are allowed to do what (e.g., submit jobs, view logs). |
| Built-in tools | BCM comes with other helpful tools like Slurm (for job scheduling) and DCGM (for GPU health monitoring). |
You can use two main interfaces:
BCM CLI (Command-Line Interface)
Type commands in a terminal
Example: bcmsystem status
Useful for scripting and automation
BCM Web UI
A webpage where you can see your cluster status visually
Easier for beginners
bcminit: This command is used to initialize a new node (server) and connect it to BCM. It sets up the necessary software and prepares the machine for management.
BCM doesn’t run AI jobs directly—it passes them to Slurm, which acts like the job manager.
Think of it like this:
BCM is the system manager.
Slurm is the task manager.
BCM makes sure Slurm knows where and how to run tasks.
BCM works with DCGM (Data Center GPU Manager) to make sure all GPUs are:
Running properly
Not overheating
Not reporting memory errors
Let’s imagine your company has 5 GPU servers.
You install BCM on one server as the “controller,” then:
Use bcminit to connect the other 4 GPU servers
Now all 5 servers are visible in the BCM Web UI
You can check GPU usage in real time
You create user accounts: Alice (Admin), Bob (Data Scientist)
You install Slurm so Bob can submit AI training jobs to the GPU servers
Congratulations! You now have a fully functioning GPU cluster.
AI research labs managing training jobs across 20+ GPU nodes
Data centers optimizing GPU resource usage
MLOps engineers scheduling AI workloads in production
| Skill | Description |
|---|---|
| Install BCM | Set up BCM on a server and connect GPU nodes |
| Use bcminit | Add new GPU nodes into the cluster |
| Access the Web UI | View GPU status, memory, and node health visually |
| Integrate with Slurm | Schedule jobs using Slurm after setting up the cluster |
| Monitor with DCGM | Ensure GPU health (temperature, memory errors, etc.) |
| Create user roles | Define permissions using BCM’s access control features |
Slurm stands for Simple Linux Utility for Resource Management.
It is an open-source job scheduler used to manage and allocate compute resources (like GPUs and CPUs) in a cluster environment.
Think of Slurm as the “job dispatcher” in your GPU cluster.
When a user wants to run an AI training job, Slurm decides where and when that job should run.
When you have multiple users and limited resources (say, 10 GPUs shared across a team), you need a smart system to:
Decide which job runs first
Make sure resources are used efficiently
Avoid conflicts (e.g., two users trying to use the same GPU)
Keep a history of all submitted jobs
This is exactly what Slurm does.
| Use Case | Description |
|---|---|
| AI/ML job scheduling | Users can submit training/inference tasks to run on GPU nodes |
| Resource allocation | Slurm can assign a specific number of GPUs and CPUs and a set amount of memory per job |
| Job queue management | Handles jobs waiting for available resources |
| Time and resource limits | Prevents one user from using all resources forever |
Let’s break down some of the most important parts of Slurm:
slurmctld: the central brain of Slurm. It manages job queues, resource allocation, and scheduling decisions.
slurmd: runs on every compute node. It receives instructions from slurmctld and executes jobs.
slurm.conf: the main configuration file. It defines all nodes, partitions, limits, and user settings.
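For orientation, here is a minimal sketch of what a slurm.conf fragment for a small GPU partition could look like (hostnames, CPU/memory counts, and the partition name are hypothetical; each node would also need a matching gres.conf entry for its GPUs):

```
# Hypothetical slurm.conf fragment for a four-node GPU partition
ClusterName=gpu-cluster
SlurmctldHost=head-node              # node running slurmctld
GresTypes=gpu                        # make GPUs a schedulable resource
NodeName=gpu[01-04] Gres=gpu:4 CPUs=64 RealMemory=512000 State=UNKNOWN
PartitionName=gpu Nodes=gpu[01-04] Default=YES MaxTime=24:00:00 State=UP
```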
Commonly used commands:
sbatch train_model.sh (submit a batch job script)
squeue (show the job queue)
sinfo (show node and partition status)
When you install Base Command Manager (BCM), it includes Slurm as the default scheduler.
Here’s the interaction:
BCM handles hardware, GPU drivers, and monitoring.
Slurm manages jobs and decides where and when to run them.
So, for a complete AI infrastructure:
BCM = manages the machines
Slurm = manages the jobs
Let’s walk through a basic example of how a user runs a job using Slurm.
Create a file named train_model.sh:
#!/bin/bash
#SBATCH --job-name=ai_training
#SBATCH --output=output.log
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --mem=16G
python train.py
Explanation:
--job-name: Name of the job
--output: Where to write the output logs
--gres=gpu:1: Request 1 GPU
--time: Maximum time the job can run
--mem: How much RAM to request
Submit the job with sbatch train_model.sh, then check its position and state in the queue with squeue.
When the job starts, logs will be written to output.log.
| Task | Command or Tool |
|---|---|
| Submit a job | sbatch script.sh |
| Cancel a job | scancel <job_id> |
| View the queue | squeue |
| View node status | sinfo |
| Debug a failed job | Check output log or use sacct for job info |
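For the last row in the table, a typical sacct query looks like this (the job ID is a placeholder; the fields are standard Slurm accounting fields):

```bash
# Show the state, exit code, runtime, and peak memory of job 12345
sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
```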
| Option | Use Case |
|---|---|
| --gres=gpu:N | Request N GPUs |
| --mem=XG | Request memory (in GB) |
| --cpus-per-task=N | Assign CPU cores for data preprocessing |
| --partition=NAME | Choose a specific partition (queue) |
| --nodelist=HOSTNAME | Run the job on a specific machine |
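Several of these options can be combined in one job script. A sketch (the partition name and the training script are hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=multi_gpu_training
#SBATCH --output=train_%j.log       # %j expands to the job ID
#SBATCH --partition=gpu             # hypothetical partition name
#SBATCH --gres=gpu:2                # request 2 GPUs
#SBATCH --cpus-per-task=8           # CPU cores for data loading
#SBATCH --mem=64G
#SBATCH --time=08:00:00

python train.py
```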
| Skill | Description |
|---|---|
| Install and configure Slurm | Set up slurmctld, slurmd, and slurm.conf |
| Submit jobs | Use sbatch with proper job scripts |
| Monitor jobs | Use squeue, sinfo, sacct to track and analyze job states |
| Manage GPU resources | Set --gres=gpu:N and other limits to share GPUs efficiently |
| Cancel or requeue jobs | Use scancel, scontrol requeue as needed |
Fleet Command is NVIDIA’s cloud-based platform designed to deploy, manage, and monitor AI applications at the edge.
“Edge” means computing done close to the data source—for example:
A smart camera in a store
A robot in a factory
A medical device in a hospital
An AI server in a delivery truck
Instead of sending all data to a central data center, edge devices process data locally for faster decisions.
Fleet Command helps you:
Register remote edge devices (like Jetson, A100 servers, etc.)
Deploy AI software (like object detection models)
Update applications securely
Monitor system health and performance remotely
Troubleshoot without going on-site
Think of Fleet Command as a remote control system for managing a “fleet” of AI devices spread across different locations.
| Without Fleet Command | With Fleet Command |
|---|---|
| Manually configure each device | Centralized dashboard to manage everything |
| Travel to troubleshoot | Remote troubleshooting and logging |
| Inconsistent deployments | Standardized container-based deployments |
| Hard to monitor performance | Real-time monitoring of GPU health and usage |
Here’s how a typical edge deployment works:
1. Register the device: install a Fleet Command agent on the remote device; it connects to the Fleet Command dashboard via the cloud.
2. Define the application using containers: you can use NVIDIA’s NGC containers, or build your own with Docker (see the Dockerfile sketch after this list).
3. Deploy: from the Fleet Command web interface, you select the device(s) you want to deploy to and the container to run; Fleet Command remotely installs and starts the app.
4. Monitor and manage: GPU temperature, usage, health, and logs are visible on the dashboard, and you can restart or stop applications anytime.
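Step 2 mentions building your own container. As a rough sketch, a Dockerfile for a small inference app might look like this (the base image tag and the file names are placeholders, not part of any NVIDIA documentation):

```dockerfile
# Package an inference script and a model file into a container image
# The base image is an example NGC PyTorch release tag
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /app
COPY detect.py model.onnx ./
CMD ["python", "detect.py"]
```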
Let’s say a retail company wants to use AI cameras in 100 stores to count customers and monitor stock.
With Fleet Command:
All 100 stores register their Jetson devices via a secure token.
The AI model is packaged into a Docker container.
The company deploys the container to all devices in one click.
If any camera fails, the support team can see logs remotely and restart the app—without visiting the store.
| Feature | What It Means |
|---|---|
| Secure provisioning | Devices are added using secure keys and authentication |
| Remote deployment | You can install/update software on devices from anywhere |
| Container support | Applications run in Docker containers |
| Monitoring & logging | GPU metrics, system status, logs—all in one dashboard |
| RBAC support | Different users (admins, operators) have different permissions |
Edge systems: physical devices (Jetson, EGX, etc.) registered into Fleet Command. These are the “workers” that run AI applications.
Applications: deployed in containers. You can start, stop, restart, and update containers anytime.
User roles:
Admins: Full access to all resources
Operators: Can deploy and monitor, but not change configs
Viewers: Read-only access to dashboards
| Skill | Description |
|---|---|
| Register edge systems | Add new devices securely into Fleet Command |
| Build containerized applications | Use Docker to package AI models and inference services |
| Deploy containers remotely | Use the dashboard to launch apps on remote edge nodes |
| Monitor GPU and system status | View health, usage, and errors in real-time |
| Use RBAC | Define roles and responsibilities for different team members |
| Troubleshoot remotely | Read logs, restart containers, fix issues without physical access |
MIG stands for Multi-Instance GPU.
It is a hardware-based virtualization feature available on NVIDIA Ampere (A100) and newer GPUs that allows a single physical GPU to be split into multiple isolated GPU instances, called MIG instances.
Without MIG:
One user might use the whole GPU, even if their task only needs a small portion.
Multiple users sharing the same GPU might interfere with each other’s workloads.
With MIG:
A single GPU can be divided into smaller "slices".
Each slice is isolated, with its own memory, cores, and cache.
Different jobs (from different users or containers) can run at the same time without conflict.
| Benefit | Description |
|---|---|
| Isolation | Each MIG instance runs independently, like its own mini-GPU |
| Resource Efficiency | Allows small or medium jobs to run without wasting full-GPU resources |
| Multi-Tenancy | Supports multiple users or apps on a single GPU |
| Improved Security | Processes cannot access each other’s memory or compute |
MIG profiles define how a GPU is split.
For example, on an NVIDIA A100 40GB GPU, you can choose:
| MIG Profile | Compute Cores | Memory | Description |
|---|---|---|---|
| 1g.5gb | 1/7 of GPU | ~5 GB | Lightweight jobs or inference |
| 2g.10gb | 2/7 of GPU | ~10 GB | Medium-sized training |
| 3g.20gb | 3/7 of GPU | ~20 GB | Large inference or training |
| 7g.40gb | 7/7 of GPU (100%) | 40 GB | Full GPU exposed as a single instance (not split) |
You choose a profile based on the job’s size and memory needs.
These steps are usually done on the command line using nvidia-smi.
List the GPUs (and any existing MIG devices) to check the current state:
nvidia-smi -L
To enable MIG:
sudo nvidia-smi -mig 1
sudo reboot
Use nvidia-smi mig to list and create instances.
Example: create three GPU instances using the 1g.5gb profile (with matching compute instances):
sudo nvidia-smi mig -cgi 1g.5gb,1g.5gb,1g.5gb -C
-cgi means “create GPU instances”; each comma-separated entry is a profile name or profile ID
-C also creates a compute instance inside each new GPU instance
nvidia-smi
You will see multiple MIG instances listed.
Each one behaves like a separate GPU device.
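To inspect what exists on the GPU, these nvidia-smi mig queries are handy (exact output varies with the driver version):

```bash
# List the GPU instance profiles this GPU supports (with their profile IDs)
nvidia-smi mig -lgip

# List the GPU instances and compute instances currently created
nvidia-smi mig -lgi
nvidia-smi mig -lci

# MIG devices also appear in the normal device listing
nvidia-smi -L
```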
In Docker, pass a MIG device (as listed by nvidia-smi -L, with a UUID starting with MIG-) instead of a whole-GPU index, for example:
docker run --gpus '"device=MIG-<uuid>"' nvidia/cuda:12.0-base nvidia-smi
In Kubernetes:
MIG devices are discovered by the NVIDIA device plugin
You can schedule pods onto MIG instances using device IDs
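As a sketch, when the NVIDIA device plugin is configured with the mixed MIG strategy, a pod can request a slice as an extended resource (the pod name and image tag below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice
```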
DCGM (Data Center GPU Manager) is fully compatible with MIG.
You can:
Track individual MIG instance usage
Monitor temperature, memory, and errors
Set alerts and usage limits
This is helpful when many users share one GPU.
| Use Case | Example |
|---|---|
| Multi-user notebook platform | Give each user a 1g.5gb slice of the A100 GPU |
| AI inference farm | Run multiple AI models on the same GPU simultaneously |
| Cloud GPU service provider | Sell smaller "GPU units" to different customers securely |
| Developer test environments | Let multiple developers test models on their own virtual GPU slice |
| Skill | Description |
|---|---|
| Enable MIG mode | Use nvidia-smi to switch the GPU to MIG-capable mode |
| Create GPU and compute instances | Use nvidia-smi mig -cgi and profile numbers |
| Assign MIG instances | Assign specific MIG devices to Docker containers or Kubernetes pods |
| Monitor MIG performance | Use DCGM to track usage, errors, and health of each instance |
| Match workloads to profiles | Choose correct MIG profile based on job memory and compute needs |
DCGM stands for Data Center GPU Manager.
It’s a set of tools, libraries, and APIs provided by NVIDIA that helps system administrators:
Monitor GPU performance and health
Detect hardware issues
Diagnose problems early
Track usage over time
Think of DCGM as a “GPU health monitoring system” for data centers and large-scale GPU deployments.
When managing dozens or hundreds of GPUs, it’s hard to know:
Which GPUs are running hot?
Are any GPUs producing memory errors?
Is any job using more power than allowed?
Which GPUs are underused?
DCGM solves this by collecting real-time telemetry and providing alerting and logging.
| Metric Type | Example Metrics |
|---|---|
| Health | Overall GPU health status, ECC memory errors, clock throttling |
| Performance | GPU utilization, memory usage, power draw, SM occupancy |
| Temperature & Power | GPU temperature, power limits, voltage |
| PCIe/NVLink | Communication speed and error counters |
| MIG Instances | Each instance’s performance, health, and error rate |
+-------------------------+ +-----------------------------+
| GPU Hardware Layer | ---> | DCGM Engine (on each node) |
+-------------------------+ +-----------------------------+
|
v
+---------------------------+
| CLI / API / Exporters |
+---------------------------+
The DCGM Engine runs on each GPU node
You interact with it using:
Command-line tools
API calls (C/C++/Python bindings)
Prometheus exporters for dashboards
dcgmi: This is the command-line interface that allows administrators to interact with DCGM easily.
dcgmi discovery -l (list the GPUs DCGM can see)
dcgmi health -c (check device health)
dcgmi stats --show (display collected statistics)
dcgmi diag (run diagnostics)
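For example, dcgmi diag accepts a run level so you can trade thoroughness for time (1 is the quickest; higher levels run longer, more thorough tests):

```bash
# Quick sanity check of the GPUs
dcgmi diag -r 1
```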
DCGM can automatically detect issues like:
Memory errors (ECC errors)
Power/thermal throttling
Driver issues or crashes
MIG misconfigurations
It assigns a Health Status:
OK: All good
Warning: Possible issues
Critical: Hardware likely failing or unstable
You can configure DCGM to trigger alerts when:
A GPU’s temperature exceeds 85°C
Power usage spikes
ECC errors happen more than once
These alerts can be:
Logged locally
Sent to external systems (like Grafana dashboards)
Used to automatically pause or migrate jobs
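If the metrics are exported to Prometheus (see the telemetry section later), the 85°C example could be written as an alerting rule roughly like this sketch; note it uses the simplified metric name from this guide, while the real dcgm-exporter names follow the DCGM_FI_* convention:

```yaml
groups:
- name: gpu-alerts
  rules:
  - alert: GPUOverTemperature
    expr: dcgm_temperature > 85      # simplified metric name used in this guide
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU temperature above 85°C for 5 minutes"
```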
DCGM integrates with Prometheus and Grafana to visualize GPU status on dashboards and track historical usage.
DCGM Exporter runs on each node
It exposes metrics to Prometheus in the correct format
Grafana queries Prometheus and displays visual dashboards
Example exported metrics:
dcgm_gpu_utilization
dcgm_memory_used
dcgm_temperature
dcgm_power_usage
Typical dashboard views:
Per-node GPU usage
Cluster-wide temperature trends
Alerts when any GPU reaches its thermal limit
DCGM works well in environments like:
Kubernetes: With the NVIDIA GPU Operator, DCGM runs as a DaemonSet
Slurm: You can pull DCGM stats before scheduling jobs or after completion
| Skill | Description |
|---|---|
| Install and run DCGM | Set up DCGM Engine on GPU nodes |
| Use dcgmi | Run diagnostics, check health, view metrics |
| Configure alerts | Set thresholds for temperature, ECC, and other hardware faults |
| Integrate with Grafana/Prometheus | Use exporters to visualize GPU metrics |
| Monitor MIG instances | Track performance and errors per MIG slice |
| Use with Kubernetes/Slurm | Integrate with modern cluster schedulers for unified GPU health tracking |
Role-Based Access Control (RBAC) is a security system that lets you control who can do what inside your infrastructure.
Instead of giving full access to everyone, RBAC lets you:
Create roles (e.g., Admin, Operator, Viewer)
Assign permissions to each role
Assign users to those roles
This protects your infrastructure from accidental or malicious actions.
| Without RBAC | With RBAC |
|---|---|
| Anyone could shut down GPU jobs | Only Admins can stop/restart jobs |
| Sensitive logs could be exposed | Viewers can’t access sensitive data |
| Unsafe code could be deployed | Only Operators/Admins can deploy apps |
RBAC is critical for:
Multi-user systems
Edge environments (e.g., Fleet Command)
Kubernetes clusters
Cloud GPU infrastructure
| Role | Permissions |
|---|---|
| Admin | Full access: add/remove nodes, assign roles, restart services |
| Operator | Deploy applications, monitor systems, but can’t change settings |
| Viewer | Read-only access to dashboards, logs, GPU status |
In BCM and Fleet Command, the Web UI provides role-based login
Each user is assigned one of the 3 roles
Role permissions are configured during user account setup
You define:
Who can run jobs
Who can access node logs
Who can change Slurm or MIG settings
RBAC is native to Kubernetes.
Example: A role that lets users read pods, but not delete them.
```yaml
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: gpu-jobs
  name: viewer-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
```
You can then assign that role using a RoleBinding.
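A matching RoleBinding could look like this sketch (the user name is a placeholder):

```yaml
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: gpu-jobs
  name: viewer-binding
subjects:
- kind: User
  name: bob                      # placeholder user name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: viewer-role
  apiGroup: rbac.authorization.k8s.io
```

Apply both with kubectl apply -f, and that user can list and watch pods in the gpu-jobs namespace but cannot delete them.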
| Skill | Description |
|---|---|
| Assign user roles in BCM or Fleet | Create user accounts and set their permissions |
| Configure Kubernetes RBAC | Write and apply roles and bindings for GPU workload access |
| Restrict sensitive actions | Allow only Admins to stop GPUs or change MIG setup |
| Combine with monitoring tools | Limit who can see logs or usage metrics |
Telemetry means collecting real-time data from your system so you can:
Monitor performance
Detect problems
Make decisions based on usage patterns
For GPUs, telemetry includes:
Utilization
Temperature
Power consumption
Memory usage
Error states (like ECC)
| Problem | What Monitoring Reveals |
|---|---|
| GPUs overheating | See temperature rise and take action early |
| Low GPU usage | Spot underused resources and optimize workloads |
| Repeated ECC errors | Detect hardware issues and schedule replacements |
| Power spikes | Identify power-hungry workloads |
Here’s how they work together:
+----------------+ +----------------+ +------------------+
| DCGM Exporter| ---> | Prometheus | ---> | Grafana |
+----------------+ +----------------+ +------------------+
(GPU data) (stores metrics) (visual dashboard)
DCGM Exporter: installed on each GPU node; collects GPU stats and exposes them to Prometheus.
Prometheus: periodically scrapes GPU metrics from the exporters, stores the time-series data, and can trigger alerts.
Grafana: connects to Prometheus, builds dashboards with charts, tables, and gauges, and supports alert rules and notifications (email, Slack, etc.).
| Metric Name | Meaning |
|---|---|
| dcgm_gpu_utilization | Percentage of GPU in use |
| dcgm_memory_used | GPU memory currently allocated |
| dcgm_temperature | Current GPU temperature |
| dcgm_ecc_errors | Memory error count |
| dcgm_power_usage | Current power draw in watts |
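On a node running the DCGM exporter you can check that metrics are being served before pointing Prometheus at it; 9400 is the exporter’s default port, and the real metric names use the DCGM_FI_ prefix rather than the simplified names in the table above:

```bash
# Fetch the raw metrics endpoint exposed by dcgm-exporter
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```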
If you're using Kubernetes with NVIDIA GPUs:
Use the NVIDIA GPU Operator to install:
DCGM exporter
Device plugin
Monitoring agents
You can also deploy the kube-prometheus-stack to get:
Pre-built dashboards
Node metrics
Cluster health alerts
| Skill | Description |
|---|---|
| Install DCGM exporter | Expose GPU metrics on each node |
| Deploy Prometheus and Grafana | Collect and visualize telemetry |
| Create dashboards | Show temperature, usage, and errors in real-time |
| Set alert rules | Get notified when GPUs overheat or show critical errors |
| Use with Kubernetes or Slurm | Extend monitoring to job-level visibility |
| Tool/Concept | Key Skills |
|---|---|
| BCM | Cluster management, node registration, Slurm integration |
| Slurm | GPU job scheduling, partitions, queue management |
| Fleet Command | Edge AI deployment, remote monitoring, app updates |
| MIG | GPU slicing, isolation, configuration, and assignment |
| DCGM | GPU health checks, metrics, diagnostics |
| RBAC | User role control for safe and secure multi-user systems |
| Telemetry Tools | Prometheus, Grafana, alerts, historical tracking |
The "Administration" domain typically tests foundational GPU system skills using standard exam formats. Here's what you can expect:
| Format Type | Description | Example |
|---|---|---|
| Multiple Choice | Select the best command or explanation | "Which command lists all GPUs?" |
| Command Identification | Identify what a command does | "What does dcgmi health -c do?" |
| Scenario-Based | Real-world situation, pick the right response | "A node has ECC errors, what tool do you use?" |
| Concept Definition | Explain a service or system behavior | "What is the persistence daemon used for?" |
Recognizing the output and use of nvidia-smi, dcgmi, systemctl
Verifying NVIDIA services like nvidia-persistenced
Diagnosing node availability using GPU health/status tools
Navigating GPU environment variables or configuration files
| Command | Function | Memory Aid |
|---|---|---|
| dcgmi discovery -l | Lists all visible GPUs | “D for discovery, L for list” |
| dcgmi health -c | Health check of all devices | “C = Check” |
| nvidia-smi | View driver, GPU usage | Think “System Management Interface” |
| nvidia-smi topo -m | Shows inter-GPU topology | “topo = topology”, output = matrix |
| systemctl status nvidia-persistenced | Shows daemon status | Always use systemctl status for services |
Use flashcards with the command on one side and the purpose plus a memory hook on the other:
Front: dcgmi stats --groupId 0
Back:
Purpose: Display performance metrics (utilization, PCIe, memory)
Mnemonic: Stats for Group 0 = Node-wide statistics
Tools like Anki make digital flashcards efficient for spaced repetition.
Organize your study by function to retain better:
| Category | Key Commands |
|---|---|
| GPU Discovery & Health | dcgmi discovery -l, dcgmi health -c |
| Monitoring | nvidia-smi, nvidia-smi dmon |
| Topology | nvidia-smi topo -m |
| Service Management | systemctl status nvidia-persistenced |
How can administrators monitor which processes are consuming GPU resources on an NVIDIA AI cluster?
Administrators typically use the nvidia-smi utility to monitor GPU utilization and running processes.
The nvidia-smi command provides real-time information about GPU usage, memory consumption, temperature, and running compute processes. Administrators can run it locally or remotely to inspect GPUs on a server. The process list shows the PID, memory usage, and compute context of applications using the GPU. For clusters with multiple nodes, monitoring systems often integrate nvidia-smi metrics into centralized dashboards such as Prometheus or Grafana. A common mistake is assuming CPU monitoring tools reveal GPU usage; GPU workloads must be inspected with NVIDIA-specific utilities. Continuous monitoring helps detect runaway jobs, resource contention, or misconfigured workloads that monopolize GPU memory.
Demand Score: 76
Exam Relevance Score: 84
What is the recommended method for identifying which user launched a GPU-intensive process in a shared AI environment?
Use nvidia-smi combined with standard Linux process inspection tools such as ps or top.
nvidia-smi lists the PID of each GPU process but does not directly display the owning user. Administrators correlate the PID with Linux process data using commands like ps -fp <PID> or top -p <PID>. This reveals the username, command path, and execution details. In shared environments such as research clusters or ML platforms, this approach allows administrators to identify users responsible for excessive GPU consumption. Without correlating system processes, administrators may misattribute workloads or fail to enforce usage policies. Some organizations automate this mapping within cluster monitoring pipelines so administrators can quickly identify resource ownership.
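A minimal sketch of this correlation on a single node (the PID comes from the first command’s output):

```bash
# List GPU compute processes and their PIDs
nvidia-smi --query-compute-apps=pid,process_name --format=csv

# Look up who owns one of those PIDs
ps -o user,pid,etime,cmd -p <PID>
```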
Demand Score: 68
Exam Relevance Score: 81
Why do AI operations teams often centralize GPU metrics instead of relying only on node-level monitoring tools?
Centralized monitoring enables cluster-wide visibility of GPU utilization and workload behavior.
AI infrastructure typically consists of multiple GPU nodes running distributed training or inference workloads. Node-level tools such as nvidia-smi only provide local insights, which makes it difficult to detect cluster-wide bottlenecks or scheduling inefficiencies. By exporting GPU telemetry to monitoring systems such as Prometheus, administrators can track metrics like GPU utilization, memory usage, power draw, and temperature across all nodes. This centralized view allows operators to detect underutilized GPUs, identify overloaded nodes, and adjust scheduling policies. A common mistake is relying solely on individual server monitoring, which prevents teams from understanding overall infrastructure efficiency.
Demand Score: 64
Exam Relevance Score: 79
What administrative risks arise when GPU monitoring is not implemented in AI infrastructure?
Lack of monitoring can lead to resource contention, undetected failures, and inefficient GPU utilization.
GPU resources are expensive and often shared across many AI workloads. Without monitoring tools, administrators cannot detect scenarios such as idle GPUs, runaway processes, memory exhaustion, or thermal throttling. These issues may degrade model training performance or cause job failures. Monitoring also helps identify abnormal usage patterns that may indicate misconfiguration or infrastructure instability. For example, a training job repeatedly restarting due to GPU memory errors could go unnoticed without metrics and logs. Implementing monitoring pipelines ensures administrators maintain visibility into operational health and can proactively respond to anomalies before they impact production workloads.
Demand Score: 60
Exam Relevance Score: 76