NCP-MCI-6.5 Manage Cluster, Nodes, and Features

Detailed list of NCP-MCI-6.5 knowledge points

1.1 Initial Cluster Deployment and Configuration

What is a Nutanix Cluster?

A Nutanix cluster is a group of servers (also called nodes) connected together. These nodes share their compute, storage, and networking resources to run workloads (like Virtual Machines). Nutanix ensures these resources are managed efficiently as a single unit.

Nutanix Cluster Deployment

Step 1: Use the Foundation Tool
  • The Foundation tool is Nutanix's installation software. It simplifies the process of setting up a Nutanix cluster.
  • The tool helps you install the operating system (AOS - Acropolis Operating System) and configure the basic components.
  • You can use it via a web interface or a standalone application.
Step 2: Tasks Performed During Cluster Deployment

Here’s a detailed explanation of the tasks:

  1. Configuring CVMs (Controller Virtual Machines):

    • What are CVMs?
      CVMs are special virtual machines that run Nutanix services (like data storage, management, and monitoring).
      • One CVM per node: Every node in the cluster has one CVM running.
    • During deployment, the Foundation tool automatically installs and configures the CVMs.
  2. Assigning IP Addresses:

    • Why are IP addresses needed?
      IP addresses ensure that nodes and CVMs can communicate with each other.
    • Tasks include:
      • Assigning an IP address to each CVM.
      • Assigning an IP address to each host hypervisor (the underlying software like AHV, ESXi, or Hyper-V).
      • Assigning cluster service IPs for management traffic.
    • Example:
      Imagine you have a cluster with 3 nodes. You might assign:
      • CVM 1 → 10.0.0.11
      • CVM 2 → 10.0.0.12
      • CVM 3 → 10.0.0.13
      • Management IP → 10.0.0.100
  3. Configuring DNS and NTP Services:

    • DNS (Domain Name System):
      • DNS resolves names to IP addresses, so you can access the cluster using names like cluster1.nutanix.com instead of typing an IP address.
    • NTP (Network Time Protocol):
      • NTP synchronizes the time across all nodes in the cluster.
      • Consistent time is critical for data synchronization, logs, and troubleshooting.
    • Tasks:
      • Add the IP addresses of DNS servers and NTP servers during deployment.
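The addressing, DNS, and NTP tasks above can be sketched in code. The following Python snippet is purely illustrative: the subnet, address offsets, and server values mirror the example in this section, and nothing here is a Foundation API.

```python
import ipaddress

def plan_cluster_ips(subnet: str, node_count: int):
    """Lay out CVM, hypervisor host, and cluster-management IPs from a
    single subnet, mirroring the addressing example in the text."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    return {
        # one CVM IP per node, starting at .11 as in the example
        "cvm_ips": [str(hosts[10 + i]) for i in range(node_count)],
        # one hypervisor host IP per node, offset further into the range
        "host_ips": [str(hosts[20 + i]) for i in range(node_count)],
        # a single virtual IP for cluster management traffic
        "cluster_vip": str(hosts[99]),
        "dns_servers": ["8.8.8.8"],          # from the example
        "ntp_servers": ["time.google.com"],  # from the example
    }

plan = plan_cluster_ips("10.0.0.0/24", 3)
print(plan["cvm_ips"])      # ['10.0.0.11', '10.0.0.12', '10.0.0.13']
print(plan["cluster_vip"])  # 10.0.0.100
```

Planning the full address map up front like this helps avoid collisions before entering the values into the Foundation tool.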
Step 3: Cluster Initialization

Once the Foundation tool sets up the nodes and CVMs, you proceed with Cluster Initialization:

  1. Access Prism Element:

    • Prism Element is Nutanix’s web-based interface for managing a single cluster.
    • How to Access: Open a browser and type the Cluster Virtual IP (e.g., https://10.0.0.100).
  2. Verify Cluster Health:

    • Check if all nodes are up and running.
    • Verify CVMs are communicating with each other.
  3. Create Storage Pools:

    • What are Storage Pools?
      • A storage pool combines the physical disks (SSD/HDD) of all nodes into a single logical storage unit.
    • Steps:
      • Navigate to Prism → Storage → Storage Pool → Create Pool.
      • Add available disks to the storage pool.
  4. Configure Containers:

    • What are Containers?
      • Containers are logical units within a storage pool that hold the data for virtual machines.
    • Steps:
      • In Prism, go to Storage → Containers → Create Container.
      • Enable features like compression, deduplication, or erasure coding (optional).
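The storage-pool/container hierarchy described above can be modeled in a few lines. This Python sketch is a conceptual illustration only; the class names and disk sizes are hypothetical, not Nutanix objects.

```python
from dataclasses import dataclass, field

@dataclass
class Container:
    """Logical unit inside a storage pool; feature flags are optional."""
    name: str
    compression: bool = False
    deduplication: bool = False
    erasure_coding: bool = False

@dataclass
class StoragePool:
    """Aggregates the physical disks of all nodes into one logical unit."""
    disks_gb: list                              # capacities of every disk in the pool
    containers: list = field(default_factory=list)

    @property
    def raw_capacity_gb(self):
        return sum(self.disks_gb)

# Three nodes, each contributing one SSD and one HDD (sizes hypothetical)
pool = StoragePool(disks_gb=[1920, 8000] * 3)
pool.containers.append(Container("VM_Storage", compression=True))
print(pool.raw_capacity_gb)  # 29760
```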

Beginner-Friendly Example

Imagine you are setting up a 3-node cluster. Here’s what happens step by step:

  1. Foundation Tool:

    • You connect to the nodes (servers) using the Foundation tool.
    • It installs AOS, configures CVMs, and assigns IP addresses.
  2. Configuration Tasks:

    • CVM IPs:
      • Node 1 → CVM IP: 10.0.0.11
      • Node 2 → CVM IP: 10.0.0.12
      • Node 3 → CVM IP: 10.0.0.13
    • Cluster Virtual IP: 10.0.0.100
    • DNS: Use DNS server 8.8.8.8 (Google DNS).
    • NTP: Use NTP server time.google.com.
  3. Cluster Initialization:

    • Access Prism Element at https://10.0.0.100.
    • Verify nodes are healthy.
    • Create a storage pool and add disks.
    • Configure a container named VM_Storage to hold the VMs.

Summary of Initial Cluster Deployment

  • The Foundation tool simplifies deployment.
  • Tasks include:
    • Configuring CVMs.
    • Assigning IPs.
    • Setting DNS and NTP.
  • Initialization involves verifying health, creating storage pools, and setting up containers.

1.2 Cluster Nodes Management

In this section, we’ll explain the detailed processes of managing nodes within a Nutanix cluster. Managing nodes involves adding nodes to expand your cluster, safely removing nodes, monitoring node health, and performing upgrades to ensure your cluster remains healthy and up to date.

1.2.1 Node Addition

Adding nodes is essential for expanding the capacity (compute, storage, and networking) of a Nutanix cluster. Nodes can be added without downtime, and Nutanix ensures the cluster automatically redistributes workloads and data.

Steps to Add a Node:
  1. Prepare the New Node:

    • Ensure the new server (node) meets hardware requirements.
    • Physically connect the node to the same network as the existing cluster.
  2. Node Discovery:

    • Go to Prism Element or Prism Central → Hardware → Nodes.
    • Nutanix automatically detects new nodes available in the network.
    • Nodes will appear in a “discovered” state.
  3. Configure the Node:

    • Install Hypervisor: Choose the hypervisor to install on the new node:
      • AHV (Nutanix Hypervisor).
      • ESXi (VMware).
      • Hyper-V (Microsoft).
    • Assign IP Addresses:
      • IP for the CVM.
      • IP for the host hypervisor.
    • Verify Time Synchronization: Configure NTP for the new node.
  4. Join the Node to the Cluster:

    • From Prism, select the discovered node and click Add to Cluster.
    • Nutanix will automatically:
      • Install necessary software on the new node.
      • Rebalance data and workloads across all nodes in the cluster.
  5. Verify the Node Status:

    • Check the node’s health and status in Prism under Hardware → Nodes.
    • Confirm the following:
      • The node is in an active state.
      • The node is contributing storage, compute, and network resources.
Key Benefits of Adding Nodes:
  • Scalability: Expand storage and compute capacity linearly.
  • No Downtime: New nodes are added seamlessly without interrupting existing workloads.
  • Automatic Balancing: Nutanix distributes data and workloads across the expanded cluster automatically.

1.2.2 Node Removal

Nodes may need to be removed for various reasons, such as hardware maintenance, decommissioning, or upgrading to new hardware. Safe removal ensures data is protected and workloads are not disrupted.

Steps to Remove a Node:
  1. Prepare for Node Removal:

    • Ensure that the node is not running critical workloads.
    • Verify there are sufficient resources (compute and storage) in the remaining nodes to handle the workloads.
  2. Evacuate Data:

    • Nutanix automatically migrates the node’s data to other nodes in the cluster.
    • This ensures data resilience and prevents data loss.
  3. Put the Node into Maintenance Mode:

    • In Prism, go to Hardware → Nodes.
    • Select the node and click Enter Maintenance Mode.
    • Maintenance mode isolates the node while its data is migrated to the rest of the cluster.
  4. Remove the Node:

    • Once data migration completes, select Remove Node from Prism.
    • The node will be safely decommissioned from the cluster.
  5. Verify the Cluster Health:

    • Check the cluster’s health under Prism → Health.
    • Ensure there are no warnings or alerts.
Best Practices for Node Removal:
  • Always ensure sufficient resources are available in other nodes before removal.
  • Allow time for Nutanix to migrate data (data evacuation depends on the data size).
  • Verify that VMs hosted on the node have been migrated to other nodes.
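The first best practice — checking that the remaining nodes can absorb the evacuated data — can be sketched as a simple capacity check. This is illustrative Python with hypothetical node names; a real cluster also weighs CPU, memory, and replication overhead.

```python
def can_remove_node(node_usage_gb, node_capacity_gb, node_to_remove):
    """Check whether the remaining nodes have enough free storage to
    absorb the data evacuated from the node being removed."""
    evacuated = node_usage_gb[node_to_remove]
    free_elsewhere = sum(
        node_capacity_gb[n] - node_usage_gb[n]
        for n in node_usage_gb if n != node_to_remove
    )
    return free_elsewhere >= evacuated

usage = {"node1": 4000, "node2": 3000, "node3": 2500}
capacity = {"node1": 10000, "node2": 10000, "node3": 10000}
print(can_remove_node(usage, capacity, "node1"))  # True: 14500 GB free elsewhere
```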

1.2.3 Cluster Health and Diagnostics

Nutanix provides robust tools to monitor the health of your cluster, nodes, and workloads.

Monitoring Node Health
  • Prism Dashboard:
    • View a summary of node health, including CPU, memory, storage, and network usage.
    • Access alerts and diagnostics for any failures.
  • Heatmap View:
    • Visualize resource usage across nodes (e.g., CPU hotspots, high memory utilization).
  • Alerts:
    • Review any warnings or critical alerts regarding node failures, resource contention, or hardware issues.
Performing Health Checks

Use Nutanix’s NCC (Nutanix Cluster Check) tool to perform automated health checks.

  • NCC Tool: Runs comprehensive diagnostics on:

    • Hardware (disks, NICs, etc.).
    • Storage performance and availability.
    • Cluster and node health.
  • How to Run NCC:

    • Log into the CVM using SSH.

    • Run the command:

      ncc health_checks run_all
      
    • Review the output to identify and resolve issues.

1.2.4 Node Upgrades

To ensure the cluster remains secure, performant, and compatible, it’s critical to perform upgrades on nodes regularly.

Types of Node Upgrades:
  1. Nutanix AOS (Acropolis Operating System): Upgrades the core software that runs the cluster.
  2. Firmware Upgrades: Updates hardware firmware (like disk controllers, NICs, and BIOS).
  3. Hypervisor Upgrades: Updates the hypervisor running on the nodes (e.g., AHV, ESXi, or Hyper-V).
Lifecycle Management (LCM) for Node Upgrades

LCM automates the upgrade process for both software and hardware.

Steps to Perform Node Upgrades:
  1. Run Inventory Check:

    • In Prism, navigate to LCM (Lifecycle Management).
    • Perform an inventory check to identify the current versions of firmware, hypervisors, and AOS.
  2. Pre-Upgrade Checks:

    • LCM validates hardware compatibility and checks if all nodes are ready for the upgrade.
  3. Perform the Upgrade:

    • Start the upgrade process from LCM.
    • LCM automates the upgrade node by node.
  4. Verify the Upgrade:

    • Once completed, check the Health Dashboard to confirm the cluster’s health.
Benefits of LCM for Node Upgrades:
  • Fully automated process, minimizing manual intervention.
  • Zero downtime as upgrades happen one node at a time.
  • Ensures cluster components are updated consistently.
Summary of Cluster Node Management
  • Node Addition expands cluster capacity seamlessly with automatic workload balancing.
  • Node Removal safely evacuates data before decommissioning nodes.
  • Cluster Health monitoring tools ensure nodes and workloads run optimally.
  • Upgrades are automated using LCM, keeping the system up to date.

1.3 Lifecycle Management (LCM)

What is Lifecycle Management (LCM)?

Lifecycle Management (LCM) is a Nutanix framework that automates the process of upgrading software and firmware components across your Nutanix cluster. This tool simplifies upgrades while minimizing disruption, ensuring your environment remains up-to-date, secure, and efficient.

1.3.1 Purpose of LCM

  • Automates Upgrades: Streamlines updates for:
    • AOS (Acropolis Operating System).
    • Hardware Firmware (BIOS, disks, NICs, etc.).
    • Hypervisors (AHV, ESXi, Hyper-V).
  • Minimizes Downtime: Updates happen node-by-node, ensuring continuous cluster operation.
  • Ensures Compatibility: Pre-upgrade checks validate hardware and software compatibility.
  • Simplifies Management: Provides a single interface (Prism) for all upgrade tasks.

1.3.2 LCM Workflow

The LCM process follows a structured and automated workflow:

Step 1: Run Inventory Check
  • Purpose: LCM performs a scan to identify:

    • Current versions of installed software and firmware.
    • Components that require updates.
  • How to Run Inventory Check:

    • Log into Prism Element or Prism Central.
    • Navigate to LCM → Inventory.
    • Click "Perform Inventory".
  • Output:

    • A list of components (software/firmware) with their current versions and upgrade availability.
Step 2: Pre-Upgrade Compatibility Checks
  • Purpose: LCM validates the following before starting the upgrade:
    • Hardware compatibility.
    • Software interdependencies (e.g., hypervisor versions compatible with AOS).
    • Cluster health status.
  • What is Checked?
    • Disk health.
    • Node performance and availability.
    • CVM health and cluster services.
Step 3: Execute the Upgrade

Once the inventory and compatibility checks are complete, you can start the upgrade.

  • Upgrade Process:
    • LCM updates one node at a time to ensure minimal disruption.
    • For each node:
      1. Put the Node in Maintenance Mode: Ensures no active workloads are running.
      2. Perform the Upgrade: Installs the update.
      3. Rebalance the Cluster: Redistributes data and workloads to ensure balance.
      4. Exit Maintenance Mode: Node returns to the cluster.
    • This process repeats for all nodes in the cluster.
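The node-by-node loop above can be sketched as a toy Python simulation. This is not LCM itself; the rebalance step is omitted for brevity, and the point is simply that at most one node is ever out of service.

```python
def rolling_upgrade(nodes, target_version):
    """Upgrade one node at a time, mirroring the per-node loop above:
    enter maintenance mode, install the update, exit maintenance mode."""
    order = []
    for node in nodes:
        node["maintenance"] = True          # enter maintenance mode
        # invariant: only one node is ever down at a time
        assert sum(n["maintenance"] for n in nodes) == 1
        node["version"] = target_version    # perform the upgrade
        node["maintenance"] = False         # exit maintenance mode
        order.append(node["name"])
    return order

nodes = [{"name": f"node{i}", "version": "6.5", "maintenance": False}
         for i in range(1, 4)]
order = rolling_upgrade(nodes, "6.5.1")
print(order)                          # ['node1', 'node2', 'node3']
print({n["version"] for n in nodes})  # {'6.5.1'}
```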
Step 4: Verify the Upgrade
  • Once the upgrade completes, verify:
    • All nodes are back online and contributing to the cluster.
    • No critical alerts or warnings are reported in Prism.
  • How to Check:
    • Go to Prism → LCM → Upgrade History.
    • Review the status of the upgrade for each component.

1.3.3 LCM Component Upgrades

1. AOS (Acropolis Operating System)
  • AOS is the core software that powers the Nutanix cluster.
  • Upgrading AOS ensures you benefit from the latest features, performance improvements, and security fixes.
  • Upgrade Steps:
    • Run LCM to check for available AOS upgrades.
    • Download and install the latest version.
2. Hardware Firmware
  • Firmware refers to low-level software running on hardware components (e.g., disks, NICs, BIOS).
  • LCM automates firmware upgrades for components like:
    • Disk Controllers: Ensure storage performance and reliability.
    • BIOS: Optimize server hardware functionality.
    • NIC Firmware: Improve network connectivity and throughput.
3. Hypervisor Upgrades
  • The hypervisor is the virtualization layer running on the Nutanix nodes. Supported hypervisors include:
    • AHV (Acropolis Hypervisor – Nutanix’s native hypervisor).
    • ESXi (VMware).
    • Hyper-V (Microsoft).
  • LCM simplifies hypervisor upgrades by ensuring compatibility with the Nutanix platform.

1.3.4 Benefits of Using LCM

  1. Fully Automated: No need for manual interventions during upgrades.
  2. Minimal Downtime: Nodes are updated sequentially, ensuring workloads remain available.
  3. Compatibility Validation: Prevents errors by checking interdependencies and hardware/software compatibility.
  4. Cluster Health Assurance: Ensures nodes are in a healthy state before proceeding with updates.
  5. Centralized Management: All upgrades (software and hardware) are managed from Prism.

1.3.5 Common Issues and Troubleshooting

1. Inventory Check Failure
  • Issue: LCM cannot fetch inventory details for components.
  • Resolution:
    • Verify the CVM can access the internet for updates.
    • Ensure nodes have correct DNS and NTP configurations.
2. Pre-Upgrade Check Warnings
  • Issue: LCM reports issues during pre-upgrade checks.
  • Resolution:
    • Review cluster health to address problems (e.g., disk failures, network errors).
    • Use NCC (Nutanix Cluster Check) to identify and resolve issues.
3. Upgrade Failure
  • Issue: The upgrade process fails for one or more nodes.
  • Resolution:
    • Check the LCM logs for specific error messages.
    • Retry the upgrade process for the affected node.

Summary of Lifecycle Management (LCM)

  • LCM automates upgrades for AOS, hardware firmware, and hypervisors.
  • It ensures upgrades are performed node-by-node without downtime.
  • Pre-upgrade checks validate hardware/software compatibility, reducing errors.
  • Post-upgrade verification ensures the cluster returns to a healthy state.

1.4 Cluster Features Management

In this section, we will dive deep into the features of a Nutanix cluster, specifically focusing on High Availability (HA), Data Resilience and Protection, and Cluster Maintenance. Each feature plays a critical role in ensuring the reliability, availability, and durability of your Nutanix environment.

1.4.1 High Availability (HA)

What is High Availability?

High Availability (HA) ensures that Virtual Machines (VMs) are automatically restarted on healthy nodes in the event of node or component failure. This guarantees business continuity and minimizes downtime.

How HA Works
  • Each VM in the cluster is protected by HA. If a node fails, Nutanix HA automatically:

    1. Detects the node failure.
    2. Identifies which VMs were running on the failed node.
    3. Restarts those VMs on the remaining healthy nodes in the cluster.
  • Key Concept: Nutanix uses CVM heartbeats and hypervisor heartbeats to monitor node health.

Steps to Configure HA
  1. Enable HA in Prism Element:

    • Go to Prism → Settings → Manage VM High Availability (on AHV clusters).
    • Enable HA reservation and configure failover settings.
  2. Verify Resource Capacity:

    • Ensure the cluster has sufficient resources (CPU, memory, and storage) to host VMs after a failover.
    • This is known as the failover capacity.
  3. Check VM Placement:

    • Ensure VMs are distributed evenly across the nodes to avoid resource contention after failover.
Prerequisites for HA
  • Adequate Cluster Resources:
    • The cluster must have enough resources to restart VMs on remaining nodes.
    • Example: If Node 1 fails, Nodes 2 and 3 must have enough capacity to take over the VMs.
  • Proper VM Placement:
    • Avoid placing all critical VMs on a single node (use Affinity Rules to balance workloads).
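The failover-capacity prerequisite can be expressed as an N+1 sizing check. This Python sketch compares aggregate capacity only, which is cruder than Prism's actual HA admission control; the node names and load values are hypothetical.

```python
def has_failover_capacity(node_capacities, vm_loads_per_node):
    """Verify that if any single node fails, the remaining nodes can
    host its VMs (N+1 sizing)."""
    for failed in node_capacities:
        displaced = sum(vm_loads_per_node.get(failed, []))
        spare = sum(
            node_capacities[n] - sum(vm_loads_per_node.get(n, []))
            for n in node_capacities if n != failed
        )
        if spare < displaced:
            return False
    return True

capacities = {"node1": 100, "node2": 100, "node3": 100}  # e.g. GB of RAM
vms = {"node1": [40, 20], "node2": [30], "node3": [25, 25]}
print(has_failover_capacity(capacities, vms))  # True
```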
Benefits of HA
  • Automatic Recovery: VMs are restarted without manual intervention.
  • Minimized Downtime: Reduces business disruptions caused by node or hardware failures.
  • Reliability: Ensures continuous operation of critical workloads.

1.4.2 Data Resilience and Protection

Why is Data Resilience Important?

Data resilience ensures that your data remains available and protected even in case of hardware (disk or node) failures. Nutanix achieves this through Replication Factor (RF) and advanced data protection mechanisms.

Replication Factor (RF)
What is RF?

Replication Factor determines how many copies of data are stored across the cluster:

  • RF-2: Two copies of data are stored. Ensures data availability if a single node or disk fails.
  • RF-3: Three copies of data are stored. Ensures data availability even if two nodes or disks fail.
How RF Works
  • Nutanix automatically distributes copies of data across nodes in the cluster.
  • Example:
    • Node 1 stores a piece of data.
    • A replica of that data is stored on Node 2 (for RF-2) or on Nodes 2 and 3 (for RF-3).
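The RF placement rule — every copy of a block on a distinct node — can be illustrated with a toy round-robin placer. This hypothetical Python is far simpler than Stargate's real placement logic; it only demonstrates the RF guarantee.

```python
import itertools

def place_replicas(block_ids, nodes, rf):
    """Distribute RF copies of each data block across distinct nodes,
    round-robin style."""
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in block_ids:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)] for i in range(rf)]
    return placement

nodes = ["node1", "node2", "node3"]
placement = place_replicas(["blockA", "blockB"], nodes, rf=2)
for block, replicas in placement.items():
    # every block has exactly RF copies, all on different nodes
    assert len(set(replicas)) == 2
print(placement["blockA"])  # ['node1', 'node2']
```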
Steps to Configure RF
  1. Default Configuration:

    • By default, Nutanix uses RF-2 for most workloads.
  2. Change Replication Factor:

    • Access Prism → Storage → Storage Containers.
    • Select the container and set the desired Replication Factor (RF-2 or RF-3).
  3. Verify Resilience:

    • Use the Prism Health Dashboard to check data resilience and replication status.
Self-Healing and Data Availability
  • Automatic Rebuilds:

    • If a node or disk fails, Nutanix automatically detects the failure and recreates missing data replicas on healthy nodes.
  • Data Path Redundancy:

    • Ensures that multiple paths are available for reading and writing data, preventing bottlenecks.
Benefits of Data Resilience
  • Protection Against Failures: Ensures no data loss even if hardware fails.
  • Automatic Recovery: Data replicas are rebuilt automatically.
  • Flexible Policies: Choose RF-2 or RF-3 based on your workload’s criticality.

1.4.3 Cluster Maintenance

Cluster maintenance ensures the Nutanix environment remains healthy and operational during upgrades, repairs, or troubleshooting.

1. Maintenance Mode
What is Maintenance Mode?

Maintenance Mode is used to safely isolate a node for tasks such as:

  • Hardware replacements.
  • Firmware updates.
  • AOS or hypervisor upgrades.
Steps to Enable Maintenance Mode
  1. Log into Prism:
    • Go to Prism → Hardware → Nodes.
  2. Enter Maintenance Mode:
    • Select the node and click Enter Maintenance Mode.
  3. Evacuate Data:
    • Nutanix automatically migrates data and workloads to other nodes.
  4. Perform Maintenance:
    • Perform required upgrades, repairs, or hardware replacements.
  5. Exit Maintenance Mode:
    • Once maintenance is complete, exit Maintenance Mode.
  6. Verify Cluster Health:
    • Check Prism’s Health Dashboard to confirm all nodes are operational.
2. Data Rebalancing
  • Nutanix uses automatic data rebalancing to redistribute data and workloads across nodes after:

    • Adding a new node.
    • Removing a node.
    • Exiting Maintenance Mode.
  • Benefits of Automatic Rebalancing:

    • Prevents data skew (uneven data distribution).
    • Ensures optimal resource utilization across the cluster.
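The rebalancing idea can be sketched as a greedy loop that moves data from the most- to the least-loaded node, a toy model of Curator's background behavior; the tolerance value and node names are arbitrary.

```python
def rebalance(node_load_gb, tolerance_gb=100):
    """Greedily move data from the most- to the least-loaded node until
    the spread is within tolerance."""
    load = dict(node_load_gb)
    moves = []
    while max(load.values()) - min(load.values()) > tolerance_gb:
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        amount = (load[src] - load[dst]) // 2
        load[src] -= amount
        load[dst] += amount
        moves.append((src, dst, amount))
    return load, moves

# A freshly added node4 starts empty and receives data from the others
load, moves = rebalance({"node1": 6000, "node2": 6000, "node3": 6000, "node4": 0})
print(load)  # all nodes converge to 4500 GB
```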
3. Monitoring During Maintenance
  • Monitor the following metrics in Prism during cluster maintenance:
    • Data Migration Progress: How much data has been migrated to other nodes.
    • Resource Usage: Ensure CPU, memory, and storage usage remain within safe limits.
    • Health Status: Verify there are no alerts or warnings.

Summary of Cluster Features Management

  • High Availability (HA) ensures automatic VM recovery during node failures.
  • Data Resilience (RF-2 and RF-3) protects data by creating multiple copies and rebuilding lost data automatically.
  • Cluster Maintenance allows you to safely isolate nodes, perform upgrades, and rebalance data without impacting workloads.

1.5 User Management and Role-Based Access Control (RBAC)

In this section, we will explore User Management and RBAC (Role-Based Access Control), two crucial components for securely managing access and permissions in a Nutanix cluster. Understanding these topics ensures that administrators can control who has access to the system and what actions they are allowed to perform.

1.5.1 Understanding Role-Based Access Control (RBAC)

What is RBAC?
  • RBAC is a security model that allows you to assign specific roles to users or groups.
  • Each role grants a set of permissions that determine what actions a user can perform within the Nutanix environment.
Key Benefits of RBAC
  1. Security:
    • Prevent unauthorized access to critical systems and data.
  2. Granular Control:
    • Assign fine-grained permissions based on job responsibilities.
  3. Simplified Management:
    • Easily manage roles for individual users or groups of users.
  4. Auditability:
    • Track user activities to meet compliance and security requirements.

1.5.2 Default User Roles in Nutanix

Nutanix provides a set of predefined roles that simplify access management. Below are the default roles and their associated permissions:

  • Admin: Full access to all features in the cluster (create, modify, delete).
  • Cluster Admin: Manages the cluster but has no access to Prism Central or RBAC settings.
  • Storage Admin: Manages storage settings, including storage pools, containers, and policies.
  • Network Admin: Manages network settings (VLANs, NICs, security policies, etc.).
  • Viewer: Read-only access to view the cluster’s configuration and status.
  • Self-Service Admin: Manages their own VMs and resources without interfering with other users.
  • Custom Role: Roles created by administrators to meet specific requirements.
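The role model above boils down to mapping each role to a permission set and granting access if any of a user's roles carries the requested permission. A minimal Python sketch follows; the permission strings are hypothetical, not Prism's actual permission names.

```python
# Hypothetical permission sets approximating the predefined roles above
ROLES = {
    "Admin":         {"cluster.manage", "storage.manage", "network.manage", "view"},
    "Storage Admin": {"storage.manage", "view"},
    "Network Admin": {"network.manage", "view"},
    "Viewer":        {"view"},
}

def is_allowed(user_roles, permission):
    """A user may hold several roles; access is granted if any role
    carries the requested permission (standard RBAC semantics)."""
    return any(permission in ROLES[r] for r in user_roles)

print(is_allowed(["Storage Admin"], "storage.manage"))  # True
print(is_allowed(["Viewer"], "network.manage"))         # False
```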

1.5.3 Adding and Managing Users

To effectively manage users, you can create local accounts or integrate with external directory services like Active Directory (AD) or LDAP.

Step 1: Add Local Users

Local users are managed directly within Prism.

  1. Access User Management:

    • Log into Prism Element or Prism Central.
    • Navigate to Settings → User Management.
  2. Create a New User:

    • Click Add User.
    • Enter the user details:
      • Username
      • Password
      • Email (optional, for notifications).
    • Assign a role from the predefined list (Admin, Viewer, etc.).
  3. Verify the User Account:

    • Confirm the user can log in with the correct permissions.
Step 2: Integrate with Active Directory (AD) or LDAP

Integration with an external directory service allows centralized user management, reducing the need to manually create and manage user accounts.

Steps to Configure AD/LDAP Integration:
  1. Enable Directory Services:

    • Go to Prism → Settings → Directory Services.
    • Select Active Directory or LDAP.
  2. Provide Configuration Details:

    • Server IP/Hostname: Address of the AD/LDAP server.
    • Base DN (Distinguished Name): The starting point for user queries (e.g., dc=example,dc=com).
    • Service Account: A user account with read permissions for AD/LDAP.
  3. Test Connectivity:

    • Verify the cluster can connect to the directory service.
  4. Map Roles to AD/LDAP Groups:

    • Assign Nutanix roles (e.g., Admin, Storage Admin) to existing AD/LDAP groups.
    • Example: AD Group IT_Admins → Nutanix Role Admin.
  5. Verify User Access:

    • Log in as an AD/LDAP user to ensure the permissions are applied correctly.

1.5.4 Assigning Roles and Privileges

Steps to Assign Roles:
  1. Navigate to User Management:

    • Go to Prism → User Management.
  2. Select the User or Group:

    • Choose an existing user account or group.
  3. Assign Roles:

    • Select a role from the predefined list (e.g., Viewer, Storage Admin, Network Admin).
    • For custom roles, create a role with specific permissions tailored to business needs.
  4. Save and Verify:

    • Confirm the user’s permissions are reflected accurately.

1.5.5 Creating Custom Roles

Custom roles allow administrators to define specific permissions for users, beyond the default roles.

Steps to Create a Custom Role
  1. Access Role Management:

    • Go to Prism → Settings → Roles.
  2. Create a New Role:

    • Click Create Custom Role.
    • Provide a name and a description for the role.
  3. Select Permissions:

    • Choose the exact permissions for the role:
      • Cluster Operations: Start, stop, or modify the cluster.
      • Storage Management: Configure storage pools and policies.
      • Network Management: Modify VLANs and network settings.
      • Read Access: Grant view-only access for monitoring.
  4. Assign the Role:

    • Assign the custom role to users or groups.
  5. Verify the Role:

    • Test the user’s access to confirm the custom permissions are enforced.

1.5.6 Auditing User Activities

Nutanix provides auditing tools to track user activities and changes made to the cluster.

  • Audit Logs:
    • Access Prism → Events.
    • Filter logs by User Activities to see:
      • Who logged in.
      • Actions performed (e.g., storage changes, VM modifications).
  • Export Logs:
    • Export logs for compliance or troubleshooting purposes.

Best Practices for User Management and RBAC

  1. Follow the Principle of Least Privilege:
    • Assign only the necessary permissions to users.
  2. Use Groups for Role Management:
    • Map AD/LDAP groups to Nutanix roles for efficient management.
  3. Audit User Activities Regularly:
    • Review audit logs to identify any unauthorized changes.
  4. Avoid Sharing Admin Accounts:
    • Use individual user accounts for accountability.
  5. Enable Multi-Factor Authentication (MFA):
    • For enhanced security, configure MFA for user accounts.

Summary of User Management and RBAC

  • RBAC ensures secure and granular access control in Nutanix.
  • Predefined roles simplify management, while custom roles offer flexibility.
  • Integrating with AD/LDAP enables centralized user management.
  • Audit logs provide visibility into user activities for compliance and security.

Manage Cluster, Nodes, and Features (Additional Content)

This section provides an enhanced explanation of Nutanix cluster management, node administration, lifecycle management, cluster features, and user access control. It covers Phoenix recovery, CVM self-healing, node failure handling, data rebalancing, upgrade troubleshooting, VM affinity rules, Nutanix Move, MFA security, and the RBAC differences between Prism Central and Prism Element.

1. Initial Cluster Deployment and Configuration

1.1 Phoenix - Nutanix Cluster Recovery Tool

Phoenix is Nutanix's low-level recovery tool used to reinstall, recover, or re-image cluster nodes. It is helpful when nodes become unresponsive due to corruption, failed upgrades, or software issues.

Use Cases for Phoenix
  • Reinstall a node if the operating system is corrupted.
  • Recover a failed cluster that is unbootable.
  • Manually re-image a node for troubleshooting.
Steps to Use Phoenix for Node Recovery
  1. Boot into Phoenix Mode
  • Connect to the node using IPMI (Intelligent Platform Management Interface).
  • Select Phoenix recovery from the boot options.
  • Boot from the Nutanix Phoenix ISO (downloadable from Nutanix support).
  2. Reinstall Nutanix Software
  • Follow the on-screen wizard to reinstall AHV and AOS.
  • Reconfigure networking (IP, DNS, NTP settings).
  • Rejoin the node to the cluster using Foundation or manual configuration.
  3. Verify Cluster Health
  • Run ncc health_checks run_all to confirm the node is functioning correctly.
  • Check the Prism Health Dashboard for warnings or missing configurations.

1.2 CVM Self-Healing (Controller Virtual Machine Recovery)

Nutanix CVM Self-Healing ensures that if a Controller VM (CVM) crashes, it will automatically restart. If the failure is severe, the cluster fails over to another CVM.

CVM Self-Healing Process
  1. Automatic Restart:
  • If a CVM crashes, Nutanix automatically restarts it.

  • Admins can also manually restart it using:

    cvm_shutdown -r now

  2. Failover Mechanism:
  • If a CVM is permanently down, the cluster shifts its responsibilities to another CVM.
  • Data and metadata remain intact, ensuring no downtime for workloads.
  3. Manual CVM Restart (If Automatic Recovery Fails)
  • Log in to another CVM in the cluster via SSH.

  • Identify the failed CVM using:

    cluster status

  • Restart the affected CVM:

    cvm_shutdown -r now

2. Cluster Node Management

2.1 Handling Node Failures

When a Nutanix node fails, it is automatically detected by Prism and can be either recovered or removed.

Steps to Handle Node Failures
  1. Verify the Failure
  • Check Prism Dashboard → Hardware → Nodes to confirm a node failure.

  • Run NCC health checks:

    ncc health_checks run_all

  2. Attempt Node Recovery
  • Reboot the node and verify that it rejoins the cluster.
  • If it does not respond, access the node via the IPMI console and check the logs.
  3. Manually Evacuate the Node
  • If a node cannot be recovered, migrate its workloads:

    ncli host evacuate id=<NODE_UUID>

  4. Remove the Node from the Cluster (If Necessary)
  • If a failed node needs removal, first evacuate its data:

    ncli cluster remove-node id=<NODE_UUID>

2.2 Data Rebalancing

Nutanix automatically rebalances data when nodes are added or removed.

How Data Rebalancing Works
  • Curator: Periodically scans the cluster and redistributes data.
  • Stargate: Handles real-time I/O operations and data placement.
Manually Trigger a Data Rebalance
  1. Run the following command on a CVM:

    ncli cluster rebalance start

  2. Check the data distribution:

    ncli cluster get-replication-factor

3. Lifecycle Management (LCM) Enhancements

3.1 Hypervisor and AOS Compatibility Checks

Before upgrading, Nutanix checks whether hypervisor and AOS versions are compatible.

Run Compatibility Checks

    ncli cluster get-upgrade-status

3.2 Handling Upgrade Failures (Rollback & Troubleshooting)

If an upgrade fails, follow these steps:

  1. Check LCM Logs in Prism
  • Go to Prism Central → LCM → View Logs for failure reasons.
  2. Manually Retry the Upgrade

    ncli cluster upgrade start

  3. Roll Back to the Previous Version (If Required)

    ncli cluster rollback-upgrade

4. Cluster Features Management

4.1 VM Affinity & Anti-Affinity Rules

Nutanix supports Affinity Rules to group or separate VMs.

Use Cases
  • Affinity Rule: Keep related VMs together (e.g., an application and its database).
  • Anti-Affinity Rule: Ensure critical VMs run on separate nodes to improve availability.
Configure Affinity in Prism
  1. Go to Prism Central → VM Management → Affinity Rules.
  2. Create a new rule and define:
  • VMs to keep together (Affinity).
  • VMs to separate (Anti-Affinity).
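
The placement logic behind an anti-affinity rule can be sketched as follows: spread each rule's VMs across distinct nodes, otherwise pick the least-loaded node. The names are hypothetical and real placement is done by the AHV scheduler; this only illustrates the rule's intent.

```python
# Illustrative sketch of anti-affinity placement, not the AHV scheduler.

def place(vms, nodes, anti_affinity_groups):
    placement = {}
    load = {n: 0 for n in nodes}
    for vm in vms:
        group = next((g for g in anti_affinity_groups if vm in g), None)
        if group:
            # Avoid nodes already hosting a peer from the same rule.
            used = {placement[peer] for peer in group if peer in placement}
            candidates = [n for n in nodes if n not in used] or nodes
        else:
            candidates = nodes
        chosen = min(candidates, key=lambda n: load[n])
        placement[vm] = chosen
        load[chosen] += 1
    return placement

# Two database VMs under one anti-affinity rule land on different nodes.
result = place(["db1", "db2", "web1"], ["nodeA", "nodeB"], [{"db1", "db2"}])
```

With fewer nodes than rule members, the sketch falls back to sharing a node, which is why anti-affinity rules need enough hosts to be honored.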

4.2 Nutanix Move - VM Migration

Nutanix Move enables cross-hypervisor VM migrations.

Supported Migrations
  • VMware ESXi → AHV
  • Hyper-V → AHV
  • AWS/Azure → Nutanix Cloud
Steps to Use Nutanix Move
  1. Install Nutanix Move.
  2. Add Source and Destination Clusters.
  3. Select VMs to migrate and start the process.

5. User Management and RBAC (Role-Based Access Control)

5.1 Implementing MFA (Multi-Factor Authentication)

MFA enhances security by requiring a second authentication factor.

Enable MFA in Nutanix
  1. Go to Prism Central → Authentication Settings.
  2. Enable MFA with Google Authenticator or Duo Security.
  3. Configure user roles and MFA enforcement policies.
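
The second factor supplied by an authenticator app such as Google Authenticator is a time-based one-time password (TOTP, RFC 6238). The sketch below shows how such a code is derived from the shared secret; it illustrates the mechanism only and is not a Nutanix API.

```python
import base64
import hmac
import struct
import time

# Standard TOTP derivation (RFC 6238 / RFC 4226): HMAC over a 30-second
# time counter, dynamically truncated to a 6-digit code.

def totp(secret_b32, for_time=None, step=30, digits=6):
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((for_time if for_time is not None else time.time()) // step)
    msg = struct.pack(">Q", counter)                      # 8-byte big-endian
    digest = hmac.new(key, msg, "sha1").digest()
    offset = digest[-1] & 0x0F                            # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)
```

Because the code depends on the current 30-second window, both sides must agree on the time, which is another reason NTP configuration matters on the cluster.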

5.2 RBAC Differences: Prism Central vs. Prism Element

Feature comparison:
  • Scope: Prism Central manages multiple clusters; Prism Element manages a single cluster.
  • User Roles: Prism Central provides granular permissions across clusters; Prism Element uses role-based permissions per cluster.
  • Integration: Prism Central integrates with enterprise-wide AD/LDAP; Prism Element supports local and AD-based authentication.
Best Practices for Role Assignment
  • Use Prism Central for large-scale environments.
  • Assign RBAC permissions based on user responsibility.
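
Least-privilege role assignment boils down to granting each role only the permissions its duties require. The role names and permission strings below are hypothetical, not Prism's built-in roles; the sketch just shows the check.

```python
# Illustrative least-privilege role model; names are hypothetical,
# not Prism's built-in roles or permission identifiers.

ROLES = {
    "viewer":        {"vm.view"},
    "vm_operator":   {"vm.view", "vm.power", "vm.snapshot"},
    "cluster_admin": {"vm.view", "vm.power", "vm.snapshot", "cluster.modify"},
}

def allowed(role, permission):
    """Deny by default: unknown roles get no permissions."""
    return permission in ROLES.get(role, set())
```

A VM operator can power-cycle VMs but cannot modify the cluster, which is exactly the separation the best practices above recommend.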

Final Summary

Knowledge Area → Key Topics:
  • Cluster Deployment: Phoenix Recovery and CVM Self-Healing
  • Node Management: Node Failure Handling and Data Rebalancing
  • Lifecycle Management: Upgrade Compatibility and Rollback
  • Cluster Features: Affinity Rules and Nutanix Move
  • User Management: MFA Setup and RBAC Differences

Frequently Asked Questions

A cluster upgrade using Life Cycle Manager fails during the pre-check stage, even though all nodes appear online. What is the most likely cause that administrators should verify first?

Answer:

Administrators should first verify that all required cluster services are healthy and that the cluster passes NCC (Nutanix Cluster Check) validations before performing the upgrade.

Explanation:

LCM pre-checks depend heavily on service health and NCC validation results. Even if nodes appear operational, hidden service issues—such as failing metadata services, storage services, or CVM connectivity problems—can cause the pre-check to fail. Running NCC checks ensures the cluster configuration, storage health, and networking dependencies meet upgrade requirements. Administrators should review NCC output, resolve any warnings or failures, and confirm CVMs communicate properly before retrying the upgrade. A frequent mistake is relying only on the Prism UI health status without validating detailed NCC results.

Demand Score: 82

Exam Relevance Score: 90

After adding a new node to an existing Nutanix cluster, what critical validation should be performed to confirm the cluster recognizes and integrates the node correctly?

Answer:

Administrators should verify that the node successfully joins the cluster and that storage and services rebalance across the cluster.

Explanation:

When a node is added, Nutanix automatically integrates it into the distributed storage fabric. Administrators must confirm that the new node appears in Prism, that its CVM is active, and that storage services participate in the cluster. Additionally, background processes rebalance data and metadata across nodes. Monitoring the storage container health and ensuring no rebalance errors occur confirms successful integration. A common oversight is assuming node addition is complete once the node appears in Prism, while failing to verify data distribution and service status.

Demand Score: 70

Exam Relevance Score: 86

Why must administrators ensure that cluster time synchronization is properly configured before performing maintenance operations such as upgrades?

Answer:

Proper time synchronization prevents service authentication errors and cluster communication issues during maintenance operations.

Explanation:

Nutanix clusters rely on coordinated service communication between CVMs and nodes. If time drift occurs between nodes, authentication tokens, certificates, or scheduled operations may fail, especially during processes such as upgrades or service restarts. NTP synchronization ensures that all cluster components maintain consistent timestamps. Without this consistency, maintenance workflows—like LCM operations—can fail or produce misleading service errors. A typical mistake is assuming time drift is insignificant, but even small offsets can cause distributed service validation failures.
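
The drift problem described above can be made concrete with a small sketch: compare each node's clock against the cluster median and flag anything outside a tolerance. The names and threshold are illustrative assumptions, not a Nutanix check.

```python
# Illustrative clock-drift check; node names and the 5-second tolerance
# are assumptions for the example, not Nutanix defaults.

def drift_outliers(node_times, tolerance_s=5.0):
    """node_times: dict of node -> Unix timestamp. Returns drifting nodes."""
    times = sorted(node_times.values())
    median = times[len(times) // 2]
    return sorted(n for n, t in node_times.items()
                  if abs(t - median) > tolerance_s)

# One CVM has drifted roughly 46 seconds from its peers.
sample = {"cvm1": 1000.0, "cvm2": 1001.2, "cvm3": 1047.0}
```

In a real cluster, NTP keeps all nodes inside such a tolerance so that token and certificate validation never hits this failure mode.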

Demand Score: 63

Exam Relevance Score: 80

When configuring role-based access control in Prism, why should administrators avoid assigning full administrative privileges to all operational users?

Answer:

Because excessive privileges increase security risk and reduce accountability in cluster management.

Explanation:

Role-based access control allows administrators to limit permissions based on operational responsibility. Assigning full administrative rights to all users undermines this control and increases the risk of accidental configuration changes or security breaches. Instead, administrators should create roles that grant only necessary permissions—for example, VM management without cluster-level modification rights. This principle of least privilege ensures safer operations while maintaining traceability for configuration changes. A frequent mistake is granting full access for convenience, which compromises operational governance.

Demand Score: 58

Exam Relevance Score: 78
