NCP-AIO Administration

Detailed list of NCP-AIO knowledge points

Administration Detailed Explanation

1. Introduction to Base Command Manager (BCM)

What is BCM?

Base Command Manager (BCM) is a tool developed by NVIDIA that helps manage GPU clusters.

Think of it like the “control tower” for a team of GPU-powered machines. It helps system administrators:

  • Add new machines (nodes) into the cluster

  • Monitor GPU usage and health

  • Control who can access what (user roles)

  • Coordinate with other tools like Slurm to run jobs on GPUs

If you have many servers with GPUs, BCM helps you manage them easily and efficiently.

What is a GPU Cluster?

A cluster is a group of connected computers that work together. A GPU cluster is when those computers have powerful NVIDIA GPUs installed.

Why clusters?

AI workloads—especially deep learning—need massive computation power, and that often means multiple GPUs across multiple machines.

Key Features of BCM (Explained Simply)

  • Node registration: add new GPU machines to your cluster and track their health.
  • GPU usage monitoring: see how busy each GPU is (how much memory is used, how hot it is, etc.).
  • User/group access control: decide which users are allowed to do what (e.g., submit jobs, view logs).
  • Built-in tools: BCM ships with other helpful tools such as Slurm (for job scheduling) and DCGM (for GPU health monitoring).

How You Interact With BCM

You can use two main interfaces:

  1. BCM CLI (Command-Line Interface)

    • Type commands in a terminal

    • Example: cmsh, BCM's cluster management shell

    • Useful for scripting and automation

  2. BCM Web UI

    • A webpage where you can see your cluster status visually

    • Easier for beginners

Important Concepts

1. bcminit

This command is used to initialize a new node (server) and connect it to BCM. It sets up the necessary software and prepares the machine for management.

2. Slurm Integration

BCM doesn’t run AI jobs directly—it passes them to Slurm, which acts like the job manager.

Think of it like this:

BCM is the system manager.

Slurm is the task manager.

BCM makes sure Slurm knows where and how to run tasks.

3. Health Checks

BCM works with DCGM (Data Center GPU Manager) to make sure all GPUs are:

  • Running properly

  • Not overheating

  • Not reporting memory errors

A Simple Example Scenario

Let’s imagine your company has 5 GPU servers.

You install BCM on one server as the “controller,” then:

  • Use bcminit to connect the other 4 GPU servers

  • Now all 5 servers are visible in the BCM Web UI

  • You can check GPU usage in real time

  • You create user accounts: Alice (Admin), Bob (Data Scientist)

  • You install Slurm so Bob can submit AI training jobs to the GPU servers

Congratulations! You now have a fully functioning GPU cluster.

Real-World Use Cases of BCM

  • AI research labs managing training jobs across 20+ GPU nodes

  • Data centers optimizing GPU resource usage

  • MLOps engineers scheduling AI workloads in production

Skills You Should Aim to Learn for BCM

  • Install BCM: set up BCM on a server and connect GPU nodes.
  • Use bcminit: add new GPU nodes into the cluster.
  • Access the Web UI: view GPU status, memory, and node health visually.
  • Integrate with Slurm: schedule jobs using Slurm after setting up the cluster.
  • Monitor with DCGM: ensure GPU health (temperature, memory errors, etc.).
  • Create user roles: define permissions using BCM's access control features.

2. Introduction to Slurm Workload Manager

What is Slurm?

Slurm stands for Simple Linux Utility for Resource Management.

It is an open-source job scheduler used to manage and allocate compute resources (like GPUs and CPUs) in a cluster environment.

Think of Slurm as the “job dispatcher” in your GPU cluster.
When a user wants to run an AI training job, Slurm decides where and when that job should run.

Why Do We Need Slurm?

When you have multiple users and limited resources (say, 10 GPUs shared across a team), you need a smart system to:

  • Decide which job runs first

  • Make sure resources are used efficiently

  • Avoid conflicts (e.g., two users trying to use the same GPU)

  • Keep a history of all submitted jobs

This is exactly what Slurm does.

Key Use Cases of Slurm in AI Workloads

  • AI/ML job scheduling: users submit training and inference tasks to run on GPU nodes.
  • Resource allocation: Slurm assigns a specific number of GPUs, CPUs, and amount of memory per job.
  • Job queue management: handles jobs waiting for available resources.
  • Time and resource limits: prevents one user from monopolizing resources indefinitely.

Important Components & Terminology

Let’s break down some of the most important parts of Slurm:

1. slurmctld (Slurm Controller Daemon)
  • This is the central brain of Slurm.

  • It manages job queues, resource allocation, and scheduling decisions.

2. slurmd (Slurm Daemon)
  • Runs on every compute node.

  • It receives instructions from slurmctld and executes jobs.

3. slurm.conf
  • The main configuration file.

  • It defines all nodes, partitions, limits, and user settings.

4. sbatch
  • Command to submit a job script (usually a shell script with job instructions).
sbatch train_model.sh
5. squeue
  • Shows all jobs in the queue, both running and waiting.
squeue
6. sinfo
  • Displays information about available compute nodes.
sinfo

How Slurm Works with BCM

When you install Base Command Manager (BCM), it includes Slurm as the default scheduler.

Here’s the interaction:

  • BCM handles hardware, GPU drivers, and monitoring.

  • Slurm manages jobs and decides where and when to run them.

So, for a complete AI infrastructure:

  • BCM = manages the machines

  • Slurm = manages the jobs

A Beginner-Friendly Job Submission Workflow

Let’s walk through a basic example of how a user runs a job using Slurm.

Step 1: Create a Job Script

Create a file named train_model.sh:

#!/bin/bash
#SBATCH --job-name=ai_training
#SBATCH --output=output.log
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --mem=16G

python train.py

Explanation:

  • --job-name: Name of the job

  • --output: Where to write the output logs

  • --gres=gpu:1: Request 1 GPU

  • --time: Maximum time the job can run

  • --mem: How much RAM to request

Step 2: Submit the Job
sbatch train_model.sh
Step 3: Check the Queue
squeue
Step 4: Monitor Output

When the job starts, logs will be written to output.log.
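
If you want more detail than the log file alone, here is a short sketch of follow-up commands. It assumes sbatch reported job ID 12345 (a placeholder; use the ID printed when you submit) and that you are inspecting your own jobs:

squeue -u $USER                 # show only your own pending and running jobs
scontrol show job 12345         # detailed state: allocated nodes, GRES, pending reason
tail -f output.log              # follow the log file named by --output
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS   # accounting data after the job ends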

Tips for Beginners

  • Submit a job: sbatch script.sh
  • Cancel a job: scancel <job_id>
  • View the queue: squeue
  • View node status: sinfo
  • Debug a failed job: check the output log or use sacct for job info

Useful Job Options for AI Workloads

  • --gres=gpu:N requests N GPUs
  • --mem=XG requests X GB of memory
  • --cpus-per-task=N assigns N CPU cores (useful for data preprocessing)
  • --partition=NAME sends the job to a specific partition (queue)
  • --nodelist=HOSTNAME runs the job on a specific machine
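
To see how these options combine, here is a rough sketch of a job script that requests several GPUs, CPU cores for data loading, and a specific partition. The partition name gpu, the log file pattern, and train.py are placeholders rather than values defined elsewhere in this guide:

#!/bin/bash
#SBATCH --job-name=multi_gpu_training
#SBATCH --partition=gpu              # placeholder partition name; list real ones with sinfo
#SBATCH --nodes=1
#SBATCH --gres=gpu:4                 # request 4 GPUs
#SBATCH --cpus-per-task=16           # CPU cores for data loading/preprocessing
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=train_%j.log        # %j expands to the job ID

python train.py                      # placeholder training command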

Skills You Should Aim to Learn for Slurm

  • Install and configure Slurm: set up slurmctld, slurmd, and slurm.conf.
  • Submit jobs: use sbatch with proper job scripts.
  • Monitor jobs: use squeue, sinfo, and sacct to track and analyze job states.
  • Manage GPU resources: set --gres=gpu:N and other limits to share GPUs efficiently.
  • Cancel or requeue jobs: use scancel and scontrol requeue as needed.

3. Introduction to Fleet Command

What is Fleet Command?

Fleet Command is NVIDIA’s cloud-based platform designed to deploy, manage, and monitor AI applications at the edge.

What is the Edge?

“Edge” means computing done close to the data source—for example:

  • A smart camera in a store

  • A robot in a factory

  • A medical device in a hospital

  • An AI server in a delivery truck

Instead of sending all data to a central data center, edge devices process data locally for faster decisions.

What Does Fleet Command Do?

Fleet Command helps you:

  • Register remote edge devices (like Jetson, A100 servers, etc.)

  • Deploy AI software (like object detection models)

  • Update applications securely

  • Monitor system health and performance remotely

  • Troubleshoot without going on-site

Think of Fleet Command as a remote control system for managing a “fleet” of AI devices spread across different locations.

Why Use Fleet Command?

  • Configuration: instead of manually configuring each device, you manage everything from a centralized dashboard.
  • Troubleshooting: instead of traveling on-site, you troubleshoot and collect logs remotely.
  • Deployments: instead of inconsistent setups, you get standardized, container-based deployments.
  • Monitoring: instead of performance being hard to see, you get real-time monitoring of GPU health and usage.

Fleet Command Workflow (Simplified)

Here’s how a typical edge deployment works:

1. Register an Edge System
  • You install a Fleet Command agent on the remote device.

  • It connects to the Fleet Command dashboard via the cloud.

2. Create an Application
  • You define an application using containers.

  • You can use NVIDIA’s NGC containers, or build your own (using Docker).

3. Deploy the Application
  • From the Fleet Command web interface, you select:

    • The device(s) you want to deploy to

    • The container to run

  • Fleet Command remotely installs and starts the app.

4. Monitor the System
  • GPU temperature, usage, health, and logs are visible on the dashboard.

  • You can restart or stop applications anytime.

Example: Deploying a Smart Camera AI System

Let’s say a retail company wants to use AI cameras in 100 stores to count customers and monitor stock.

With Fleet Command:

  • All 100 stores register their Jetson devices via a secure token.

  • The AI model is packaged into a Docker container.

  • The company deploys the container to all devices in one click.

  • If any camera fails, the support team can see logs remotely and restart the app—without visiting the store.

Key Features

  • Secure provisioning: devices are added using secure keys and authentication.
  • Remote deployment: install or update software on devices from anywhere.
  • Container support: applications run in Docker containers.
  • Monitoring & logging: GPU metrics, system status, and logs, all in one dashboard.
  • RBAC support: different users (admins, operators) have different permissions.

Important Concepts

1. Edge Nodes
  • A physical device (Jetson, EGX, etc.) registered into Fleet Command

  • These are the “workers” that run AI applications

2. Container Lifecycle
  • Applications are deployed in containers

  • You can start, stop, restart, and update containers anytime

3. RBAC (Role-Based Access Control)
  • Admins: Full access to all resources

  • Operators: Can deploy and monitor, but not change configs

  • Viewers: Read-only access to dashboards

Skills You Should Learn for Fleet Command

  • Register edge systems: add new devices securely into Fleet Command.
  • Build containerized applications: use Docker to package AI models and inference services.
  • Deploy containers remotely: use the dashboard to launch apps on remote edge nodes.
  • Monitor GPU and system status: view health, usage, and errors in real time.
  • Use RBAC: define roles and responsibilities for different team members.
  • Troubleshoot remotely: read logs, restart containers, and fix issues without physical access.

4. Introduction to MIG (Multi-Instance GPU)

What is MIG?

MIG stands for Multi-Instance GPU.

It is a hardware-based virtualization feature available on NVIDIA Ampere (A100) and newer GPUs that allows a single physical GPU to be split into multiple isolated GPU instances, called MIG instances.

Why Use MIG?

Without MIG:

  • One user might use the whole GPU, even if their task only needs a small portion.

  • Multiple users sharing the same GPU might interfere with each other’s workloads.

With MIG:

  • A single GPU can be divided into smaller "slices".

  • Each slice is isolated, with its own memory, cores, and cache.

  • Different jobs (from different users or containers) can run at the same time without conflict.

Key Benefits of MIG

  • Isolation: each MIG instance runs independently, like its own mini-GPU.
  • Resource efficiency: small or medium jobs can run without wasting a full GPU.
  • Multi-tenancy: multiple users or apps can share a single GPU.
  • Improved security: processes cannot access each other’s memory or compute.

MIG Profiles (GPU "Slices")

MIG profiles define how a GPU is split.

For example, on an NVIDIA A100 40GB GPU, you can choose:

  • 1g.5gb: 1/7 of the GPU’s compute, ~5 GB memory; lightweight jobs or inference.
  • 2g.10gb: 2/7 of the compute, ~10 GB memory; medium-sized training.
  • 3g.20gb: 3/7 of the compute, ~20 GB memory; large inference or training.
  • 7g.40gb: all of the compute, 40 GB memory; a single instance spanning the full GPU (effectively unsplit).

You choose a profile based on the job’s size and memory needs.

MIG Setup Workflow (Step-by-Step)

These steps are usually done on the command line using nvidia-smi.

Step 1: Enable MIG Mode

Check whether MIG mode is currently enabled:

nvidia-smi --query-gpu=index,mig.mode.current --format=csv

(nvidia-smi -L lists the physical GPUs and, once instances exist, each MIG device.)

To enable MIG:

sudo nvidia-smi -mig 1
sudo reboot
Step 2: Create MIG Instances

Use nvidia-smi mig to list and create instances.

Example: Create 3 instances of the 1g.5gb profile:

sudo nvidia-smi mig -cgi 1g.5gb,1g.5gb,1g.5gb -C
  • -cgi means: “create GPU instances”

  • Each entry is a profile name or numeric ID (on an A100 40GB, ID 19 corresponds to 1g.5gb, so -cgi 19,19,19 is equivalent)

  • -C additionally creates the matching compute instances inside each new GPU instance

Step 3: View Active Instances
nvidia-smi

You will see multiple MIG instances listed.

Each one behaves like a separate GPU device.

Step 4: Assign MIG Devices to Containers or Jobs

In Docker, pass the MIG device UUID reported by nvidia-smi -L instead of a plain GPU index:

docker run --rm --gpus '"device=MIG-<uuid>"' nvidia/cuda:12.0-base nvidia-smi

In Kubernetes:

  • MIG devices are discovered by the NVIDIA device plugin

  • You can schedule pods onto MIG instances using device IDs
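
As a quick, hedged check of the points above, the commands below list the MIG devices on a node and look for the extended resources the NVIDIA device plugin advertises. They assume the plugin is configured with the "mixed" MIG strategy, and gpu-node-01 is a placeholder node name:

nvidia-smi -L                                               # lists GPUs and their MIG device UUIDs
kubectl describe node gpu-node-01 | grep -i 'nvidia.com/'   # shows resources such as nvidia.com/mig-1g.5gb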

Monitoring MIG with DCGM

DCGM (Data Center GPU Manager) is fully compatible with MIG.

You can:

  • Track individual MIG instance usage

  • Monitor temperature, memory, and errors

  • Set alerts and usage limits

This is helpful when many users share one GPU.

Common MIG Use Cases

  • Multi-user notebook platform: give each user a 1g.5gb slice of the A100 GPU.
  • AI inference farm: run multiple AI models on the same GPU simultaneously.
  • Cloud GPU service provider: sell smaller "GPU units" to different customers securely.
  • Developer test environments: let multiple developers test models on their own virtual GPU slice.

Skills You Should Learn for MIG

  • Enable MIG mode: use nvidia-smi to switch the GPU into MIG mode.
  • Create GPU and compute instances: use nvidia-smi mig -cgi with profile names or IDs, plus -C.
  • Assign MIG instances: attach specific MIG devices to Docker containers or Kubernetes pods.
  • Monitor MIG performance: use DCGM to track usage, errors, and health of each instance.
  • Match workloads to profiles: choose the correct MIG profile based on a job’s memory and compute needs.

5. Introduction to DCGM (Data Center GPU Manager)

What is DCGM?

DCGM stands for Data Center GPU Manager.

It’s a set of tools, libraries, and APIs provided by NVIDIA that helps system administrators:

  • Monitor GPU performance and health

  • Detect hardware issues

  • Diagnose problems early

  • Track usage over time

Think of DCGM as a “GPU health monitoring system” for data centers and large-scale GPU deployments.

Why Use DCGM?

When managing dozens or hundreds of GPUs, it’s hard to know:

  • Which GPUs are running hot?

  • Are any GPUs producing memory errors?

  • Is any job using more power than allowed?

  • Which GPUs are underused?

DCGM solves this by collecting real-time telemetry and providing alerting and logging.

What Can DCGM Monitor?

  • Health: overall GPU health status, ECC memory errors, clock throttling.
  • Performance: GPU utilization, memory usage, power draw, SM occupancy.
  • Temperature & power: GPU temperature, power limits, voltage.
  • PCIe/NVLink: communication speed and error counters.
  • MIG instances: each instance’s performance, health, and error rate.

DCGM Architecture

+-------------------------+      +-----------------------------+
|   GPU Hardware Layer    | ---> |  DCGM Engine (on each node) |
+-------------------------+      +-----------------------------+
                                        |
                                        v
                            +---------------------------+
                            |   CLI / API / Exporters   |
                            +---------------------------+
  • The DCGM Engine runs on each GPU node

  • You interact with it using:

    • Command-line tools

    • API calls (C/C++/Python bindings)

    • Prometheus exporters for dashboards

DCGM CLI Tool: dcgmi

This is the command-line interface that allows administrators to interact with DCGM easily.

Common Commands
dcgmi discovery -l
  • Lists GPUs on the system
dcgmi health -c
  • Runs a health diagnostic test on all GPUs
dcgmi dmon
  • Streams real-time stats such as GPU utilization, memory use, and temperature
dcgmi diag
  • Performs in-depth diagnostics to check hardware integrity
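
dcgmi diag supports run levels that trade speed for depth. A brief sketch (confirm the levels available on your version with dcgmi diag --help):

dcgmi diag -r 1        # quick software/configuration sanity check
dcgmi diag -r 3        # long-running, in-depth hardware diagnostics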

Health Monitoring with DCGM

DCGM can automatically detect issues like:

  • Memory errors (ECC errors)

  • Power/thermal throttling

  • Driver issues or crashes

  • MIG misconfigurations

It assigns a Health Status:

  • OK: All good

  • Warning: Possible issues

  • Critical: Hardware likely failing or unstable

Custom Alerts and Thresholds

You can configure DCGM to trigger alerts when:

  • A GPU’s temperature exceeds 85°C

  • Power usage spikes

  • ECC errors happen more than once

These alerts can be:

  • Logged locally

  • Sent to external systems (like Grafana dashboards)

  • Used to automatically pause or migrate jobs

Integration with Prometheus and Grafana

Why integrate?

To visualize GPU status on dashboards and track historical usage.

How it works:
  1. DCGM Exporter runs on each node

  2. It exposes metrics to Prometheus in the correct format

  3. Grafana queries Prometheus and displays visual dashboards

Example Metrics:
  • DCGM_FI_DEV_GPU_UTIL (GPU utilization)

  • DCGM_FI_DEV_FB_USED (framebuffer memory used)

  • DCGM_FI_DEV_GPU_TEMP (GPU temperature)

  • DCGM_FI_DEV_POWER_USAGE (power draw)
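
A simple way to confirm the exporter is publishing these metrics is to scrape its endpoint directly. This sketch assumes the common dcgm-exporter default port of 9400; adjust if your deployment uses a different one:

curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP|FB_USED|POWER_USAGE)'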

Example Grafana Dashboard:
  • Per-node GPU usage

  • Cluster-wide temperature trends

  • Alerts when any GPU reaches thermal limit

DCGM with Kubernetes or Slurm

DCGM works well in environments like:

  • Kubernetes: With the NVIDIA GPU Operator, DCGM runs as a DaemonSet

  • Slurm: You can pull DCGM stats before scheduling jobs or after completion
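
In a Kubernetes cluster you can verify that the DCGM exporter is actually running on every GPU node. A hedged sketch; the namespace depends on how the GPU Operator was installed (often gpu-operator):

kubectl get daemonsets -n gpu-operator | grep -i dcgm        # the exporter runs as a DaemonSet
kubectl get pods -n gpu-operator -o wide | grep -i dcgm      # expect one exporter pod per GPU node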

Skills You Should Learn for DCGM

  • Install and run DCGM: set up the DCGM engine on GPU nodes.
  • Use dcgmi: run diagnostics, check health, view metrics.
  • Configure alerts: set thresholds for temperature, ECC errors, and other hardware faults.
  • Integrate with Grafana/Prometheus: use exporters to visualize GPU metrics.
  • Monitor MIG instances: track performance and errors per MIG slice.
  • Use with Kubernetes/Slurm: integrate with modern cluster schedulers for unified GPU health tracking.

6. Role-Based Access Control (RBAC)

What is RBAC?

Role-Based Access Control (RBAC) is a security system that lets you control who can do what inside your infrastructure.

Instead of giving full access to everyone, RBAC lets you:

  • Create roles (e.g., Admin, Operator, Viewer)

  • Assign permissions to each role

  • Assign users to those roles

This protects your infrastructure from accidental or malicious actions.

Why Is RBAC Important?

  • Without RBAC, anyone could shut down GPU jobs; with RBAC, only Admins can stop or restart jobs.
  • Without RBAC, sensitive logs could be exposed; with RBAC, Viewers can’t access sensitive data.
  • Without RBAC, unsafe code could be deployed; with RBAC, only Operators and Admins can deploy apps.

RBAC is critical for:

  • Multi-user systems

  • Edge environments (e.g., Fleet Command)

  • Kubernetes clusters

  • Cloud GPU infrastructure

Typical Roles in GPU/AI Environments

  • Admin: full access, including adding/removing nodes, assigning roles, and restarting services.
  • Operator: can deploy applications and monitor systems, but can’t change settings.
  • Viewer: read-only access to dashboards, logs, and GPU status.

RBAC in Different Platforms

In Fleet Command:
  • Web UI provides role-based login

  • Each user is assigned one of the 3 roles

In BCM:
  • Role permissions are configured during user account setup

  • You define:

    • Who can run jobs

    • Who can access node logs

    • Who can change Slurm or MIG settings

In Kubernetes:

RBAC is native to Kubernetes.

Example: A role that lets users read pods, but not delete them.

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: gpu-jobs
  name: viewer-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

You can then assign that role using a RoleBinding.
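
A minimal sketch of that binding using kubectl; the username bob and the binding name viewer-binding are placeholders:

# Bind the viewer-role defined above to a user within the gpu-jobs namespace
kubectl create rolebinding viewer-binding \
  --role=viewer-role \
  --user=bob \
  --namespace=gpu-jobs

# Verify the effective permissions
kubectl auth can-i list pods --as=bob --namespace=gpu-jobs    # should answer "yes"
kubectl auth can-i delete pods --as=bob --namespace=gpu-jobs  # should answer "no"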

Skills You Should Learn for RBAC

  • Assign user roles in BCM or Fleet Command: create user accounts and set their permissions.
  • Configure Kubernetes RBAC: write and apply roles and bindings for GPU workload access.
  • Restrict sensitive actions: allow only Admins to stop GPUs or change the MIG setup.
  • Combine with monitoring tools: limit who can see logs or usage metrics.

7. GPU Telemetry & Monitoring Integration

What is GPU Telemetry?

Telemetry means collecting real-time data from your system so you can:

  • Monitor performance

  • Detect problems

  • Make decisions based on usage patterns

For GPUs, telemetry includes:

  • Utilization

  • Temperature

  • Power consumption

  • Memory usage

  • Error states (like ECC)

Why GPU Monitoring Matters

  • GPUs overheating: see temperatures rise and take action early.
  • Low GPU usage: spot underused resources and optimize workloads.
  • Repeated ECC errors: detect hardware issues and schedule replacements.
  • Power spikes: identify power-hungry workloads.

Monitoring Stack: DCGM + Prometheus + Grafana

Here’s how they work together:

+----------------+      +----------------+      +------------------+
|   DCGM Exporter| ---> |   Prometheus   | ---> |     Grafana      |
+----------------+      +----------------+      +------------------+
     (GPU data)            (stores metrics)         (visual dashboard)
1. DCGM Exporter
  • Installed on each GPU node

  • Collects GPU stats and exposes them to Prometheus

2. Prometheus
  • Periodically scrapes GPU metrics from exporters

  • Stores time-series data

  • Can trigger alerts

3. Grafana
  • Connects to Prometheus

  • Builds dashboards with charts, tables, gauges

  • Supports alert rules and notifications (email, Slack, etc.)

Sample Metrics to Track

  • DCGM_FI_DEV_GPU_UTIL: percentage of the GPU in use
  • DCGM_FI_DEV_FB_USED: GPU (framebuffer) memory currently allocated
  • DCGM_FI_DEV_GPU_TEMP: current GPU temperature
  • DCGM_FI_DEV_ECC_SBE_VOL_TOTAL: ECC memory error count
  • DCGM_FI_DEV_POWER_USAGE: current power draw in watts

Kubernetes Integration

If you're using Kubernetes with NVIDIA GPUs:

  • Use the NVIDIA GPU Operator to install:

    • DCGM exporter

    • Device plugin

    • Monitoring agents

  • You can also deploy the kube-prometheus-stack to get:

    • Pre-built dashboards

    • Node metrics

    • Cluster health alerts
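
A hedged sketch of installing both stacks with Helm; the chart names come from the public NVIDIA and prometheus-community repositories, and the release and namespace names below are placeholders:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# GPU Operator: deploys the device plugin, DCGM exporter, and related agents
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# kube-prometheus-stack: Prometheus, Grafana, and pre-built dashboards/alerts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace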

Skills You Should Learn for GPU Monitoring

  • Install the DCGM exporter: expose GPU metrics on each node.
  • Deploy Prometheus and Grafana: collect and visualize telemetry.
  • Create dashboards: show temperature, usage, and errors in real time.
  • Set alert rules: get notified when GPUs overheat or show critical errors.
  • Use with Kubernetes or Slurm: extend monitoring to job-level visibility.

Final Summary: What You’ve Learned in "Administration"

  • BCM: cluster management, node registration, Slurm integration.
  • Slurm: GPU job scheduling, partitions, queue management.
  • Fleet Command: edge AI deployment, remote monitoring, app updates.
  • MIG: GPU slicing, isolation, configuration, and assignment.
  • DCGM: GPU health checks, metrics, diagnostics.
  • RBAC: user role control for safe and secure multi-user systems.
  • Telemetry tools: Prometheus, Grafana, alerts, historical tracking.

Administration (Additional Content)

1. What to Expect in the Exam (Question Style Overview)

The "Administration" domain typically tests foundational GPU system skills using standard exam formats. Here's what you can expect:

Typical Question Formats

  • Multiple choice: select the best command or explanation (e.g., "Which command lists all GPUs?").
  • Command identification: identify what a command does (e.g., "What does dcgmi health -c do?").
  • Scenario-based: given a real-world situation, pick the right response (e.g., "A node has ECC errors; what tool do you use?").
  • Concept definition: explain a service or system behavior (e.g., "What is the persistence daemon used for?").

Commonly Tested Skills

  • Recognizing the output and use of nvidia-smi, dcgmi, systemctl

  • Verifying NVIDIA services like nvidia-persistenced

  • Diagnosing node availability using GPU health/status tools

  • Navigating GPU environment variables or configuration files

2. Command Mnemonics and Memorization Techniques

Command Mnemonics

  • dcgmi discovery -l: lists all visible GPUs (memory aid: "D for discovery, L for list").
  • dcgmi health -c: health check of all devices (memory aid: "C = Check").
  • nvidia-smi: view driver info and GPU usage (memory aid: SMI = "System Management Interface").
  • nvidia-smi topo -m: shows inter-GPU topology ("topo = topology"; the output is a matrix).
  • systemctl status nvidia-persistenced: shows the daemon's status (always use systemctl status for services).

Flashcard Style Learning

Use flashcards with command on one side, and purpose + memory hook on the other:

  • Front:
    dcgmi stats --groupId 0

  • Back:
    Purpose: Display performance metrics (utilization, PCIe, memory)
    Mnemonic: Stats for Group 0 = Node-wide statistics

Tools like Anki make digital flashcards efficient for spaced repetition.

Categorized Command Sets

Organize your study by function to retain better:

  • GPU discovery & health: dcgmi discovery -l, dcgmi health -c
  • Monitoring: nvidia-smi, nvidia-smi dmon
  • Topology: nvidia-smi topo -m
  • Service management: systemctl status nvidia-persistenced

Frequently Asked Questions

How can administrators monitor which processes are consuming GPU resources on an NVIDIA AI cluster?

Answer:

Administrators typically use the nvidia-smi utility to monitor GPU utilization and running processes.

Explanation:

The nvidia-smi command provides real-time information about GPU usage, memory consumption, temperature, and running compute processes. Administrators can run it locally or remotely to inspect GPUs on a server. The process list shows the PID, memory usage, and compute context of applications using the GPU. For clusters with multiple nodes, monitoring systems often integrate nvidia-smi metrics into centralized dashboards such as Prometheus or Grafana. A common mistake is assuming CPU monitoring tools reveal GPU usage; GPU workloads must be inspected with NVIDIA-specific utilities. Continuous monitoring helps detect runaway jobs, resource contention, or misconfigured workloads that monopolize GPU memory.
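
As a sketch of the commands involved (using field names from nvidia-smi's query interface):

nvidia-smi                                                                    # summary view, including the process table
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv    # per-process GPU memory, machine-readable
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu --format=csv -l 5   # device metrics every 5 seconds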

Demand Score: 76

Exam Relevance Score: 84

What is the recommended method for identifying which user launched a GPU-intensive process in a shared AI environment?

Answer:

Use nvidia-smi combined with standard Linux process inspection tools such as ps or top.

Explanation:

nvidia-smi lists the PID of each GPU process but does not directly display the owning user. Administrators correlate the PID with Linux process data using commands like ps -fp <PID> or top -p <PID>. This reveals the username, command path, and execution details. In shared environments such as research clusters or ML platforms, this approach allows administrators to identify users responsible for excessive GPU consumption. Without correlating system processes, administrators may misattribute workloads or fail to enforce usage policies. Some organizations automate this mapping within cluster monitoring pipelines so administrators can quickly identify resource ownership.
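
A short sketch of that correlation; the PID 24817 is a placeholder taken from the nvidia-smi process table:

ps -fp 24817                          # shows the owning user, start time, and full command line

# Or map every GPU compute process to its owner in one pass
for pid in $(nvidia-smi --query-compute-apps=pid --format=csv,noheader); do
  ps -o user=,pid=,cmd= -p "$pid"
done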

Demand Score: 68

Exam Relevance Score: 81

Why do AI operations teams often centralize GPU metrics instead of relying only on node-level monitoring tools?

Answer:

Centralized monitoring enables cluster-wide visibility of GPU utilization and workload behavior.

Explanation:

AI infrastructure typically consists of multiple GPU nodes running distributed training or inference workloads. Node-level tools such as nvidia-smi only provide local insights, which makes it difficult to detect cluster-wide bottlenecks or scheduling inefficiencies. By exporting GPU telemetry to monitoring systems such as Prometheus, administrators can track metrics like GPU utilization, memory usage, power draw, and temperature across all nodes. This centralized view allows operators to detect underutilized GPUs, identify overloaded nodes, and adjust scheduling policies. A common mistake is relying solely on individual server monitoring, which prevents teams from understanding overall infrastructure efficiency.

Demand Score: 64

Exam Relevance Score: 79

What administrative risks arise when GPU monitoring is not implemented in AI infrastructure?

Answer:

Lack of monitoring can lead to resource contention, undetected failures, and inefficient GPU utilization.

Explanation:

GPU resources are expensive and often shared across many AI workloads. Without monitoring tools, administrators cannot detect scenarios such as idle GPUs, runaway processes, memory exhaustion, or thermal throttling. These issues may degrade model training performance or cause job failures. Monitoring also helps identify abnormal usage patterns that may indicate misconfiguration or infrastructure instability. For example, a training job repeatedly restarting due to GPU memory errors could go unnoticed without metrics and logs. Implementing monitoring pipelines ensures administrators maintain visibility into operational health and can proactively respond to anomalies before they impact production workloads.

Demand Score: 60

Exam Relevance Score: 76
