Exam Radar
Core Priority: Examining the underlying task engine and Java-based service logs when automated workflows fail.
High Frequency: Inspecting the vcf-operations-manager.log to identify specific stage error codes.
Confusion Alert: Differentiating between an in-progress task that needs a manual database cleanup versus a restartable workflow.
Scenario Logic: Utilizing psql utilities to query the SDDC Manager inventory for mismatched states.
Version Delta: Using the SDDC Manager API to clear stuck tasks and return the system to a known good state.
Atomic Deconstruction
Actionable: Navigate to /var/log/vmware/vcf/sddc-manager/vcf-operations-manager.log to trace the execution history of a failed host commissioning or domain creation task.
Parametric: Query the SDDC Manager inventory database using psql to identify resources stuck in a "PENDING" or "IN_PROGRESS" state.
Causal: Identifying the specific stage of failure within a workflow allows for targeted remediation, preventing the need for a full system rollback.
SKILLS.md Matrix
| Target | Log/Tool | Troubleshooting Focus |
|---|---|---|
| Workflows | vcf-operations-manager.log | Stage-specific error codes |
| Inventory | psql Utility | Mismatched resource states |
| API | SDDC Manager API | Task cleanup and restart logic |
Exam Radar
Core Priority: Resolving failures during the initial integration of physical hosts into the SDDC.
High Frequency: Troubleshooting SSH connectivity failures due to expired certificates or public key mismatches.
Scenario Logic: Using esxcli and ping with the "Do Not Fragment" bit to detect VLAN or MTU (Jumbo Frame) mismatches.
Version Delta: Validating host prerequisites such as clean partition tables and compatible hypervisor versions via Cloud Builder logs.
Atomic Deconstruction
Actionable: Run esxcli network ip netstack to verify that the management and vMotion stacks are correctly configured on the target host.
Parametric: Perform an MTU test using ping -d -s 8972 to confirm the network can support the Jumbo Frames required for vSAN and Overlay traffic.
Causal: Mismatched MTU settings across the physical switch and virtual switch lead to silent packet loss, which frequently causes host commissioning to hang or fail.
SKILLS.md Matrix
| Element | Logic | Atomic Detail |
|---|---|---|
| Connectivity | SSH/Public Key | Management plane access verification |
| Network | MTU/VLAN | Jumbo Frame and tagging validation |
| Prerequisites | Partition Table | Disk cleanup and hypervisor alignment |
Exam Radar
Core Priority: Deconstructing health service alerts to prevent data loss or unavailability in hyper-converged clusters.
High Frequency: Identifying "Reduced Availability" states using esxcli vsan debug object list.
Scenario Logic: Resolving hardware compatibility or disk balance alerts within the vSphere Client.
Version Delta: Procedures for replacing failed NVMe or SSD drives while maintaining active data redundancy.
Atomic Deconstruction
Actionable: Use the vSphere Client Health Service to identify specific disk groups or physical disks reporting "Network Congestion" or "Hardware Failure".
Parametric: Initiate a manual resynchronization for vSAN objects that have fallen out of compliance with their assigned storage policy.
Causal: Prompt replacement of failed physical disks prevents a single-node failure from escalating into a cluster-wide data unavailability event.
SKILLS.md Matrix
| Category | Atomic Requirement | Operational Detail |
|---|---|---|
| Health Alert | vSAN Health Service | Disk balance and compatibility checks |
| Debugging | esxcli vsan debug | Object-level repair and status |
| Recovery | Disk Replacement | Maintaining redundancy during hardware swap |
Exam Radar
Core Priority: Isolating the Geneve encapsulation layer from the logical routing layer to resolve "silent packet loss".
High Frequency: Verifying Tunnel Endpoint (TEP) connectivity using specific diagnostic commands.
Confusion Alert: Distinguishing between a Distributed Firewall (DFW) drop and a physical switch MTU configuration error.
Scenario Logic: Utilizing Traceflow to visually map packet paths through logical switches and Tier-0/Tier-1 gateways.
Version Delta: Troubleshooting BGP peering failures with physical Top-of-Rack (ToR) switches, focusing on AS numbers and MD5 mismatches.
Atomic Deconstruction
Actionable: Run esxcli network diag ping to verify VTEP-to-VTEP connectivity and identify potential MTU or VLAN tagging issues.
Parametric: Execute a Traceflow analysis from the NSX Manager to identify the exact logical component or firewall rule dropping a packet.
Causal: Failure in TEP communication typically stems from a physical network that does not support the 1600+ byte MTU required for overlay encapsulation.
SKILLS.md Matrix
| Element | Tool | Troubleshooting Logic |
|---|---|---|
| Overlay | TEP Diagnostics | VTEP-to-VTEP ping and MTU check |
| Logical Path | Traceflow | Identifying drops in routing/firewall |
| BGP Peering | Routing Table Audit | AS number and MD5 password verification |
Exam Radar
Core Priority: Ensuring the health of Edge nodes for North-South traffic and centralized services like NAT or Load Balancing.
High Frequency: Using get functional-state on the Edge CLI to verify routing and management service status.
Scenario Logic: Analyzing syslog entries to determine triggers for Active/Standby failover, such as keepalive timeouts.
Version Delta: Identifying packet drops caused by datapath buffer exhaustion during sudden traffic spikes.
Atomic Deconstruction
Actionable: Log in to the Edge CLI and execute get dataplane stats to check for high CPU utilization on specific cores or packet drops.
Parametric: Review /var/log/syslog on the Edge node to investigate physical link flaps or heartbeat failures that triggered a service migration.
Causal: A crash in the datapath process on an Edge node will immediately halt North-South traffic, necessitating a failover to the Standby node.
SKILLS.md Matrix
| Target | Command | Diagnostic Goal |
|---|---|---|
| Service State | get functional-state | Management and Routing service health |
| Data Plane | get dataplane stats | CPU core usage and buffer drops |
| Logs | /var/log/syslog | Root cause of failover events |
Exam Radar
Core Priority: Maintaining the management heartbeat between the Avi Controller and vCenter/NSX-T.
High Frequency: Identifying "Yellow" or "Red" cloud states in the Avi UI caused by API timeouts or authentication failures.
Confusion Alert: Differentiating between an SE failing to reach the Controller versus a Cloud Connector failing to reach the infrastructure API.
Scenario Logic: Troubleshooting SE discovery issues where new Service Engines do not appear in the inventory.
Version Delta: Investigating Java exceptions in the portal-webapp.log for management plane errors.
Atomic Deconstruction
Actionable: Generate a tech-support bundle from the Controller CLI to analyze the cloud-connector.log for infrastructure communication errors.
Parametric: Verify that Service Engines can reach the Controller’s management IP over ports 443 and 8443 to complete the discovery process.
Causal: An invalid SSL certificate or expired credentials for the vCenter/NSX service account will break the Cloud Connector, preventing any new SE scaling.
SKILLS.md Matrix
| Component | Logic | Troubleshooting Detail |
|---|---|---|
| Cloud Link | Cloud Status Tab | API connectivity and auth health |
| Logs | cloud-connector.log | Deep dive into infrastructure API errors |
| Discovery | Ports 443/8443 | SE-to-Controller communication path |
Exam Radar
Core Priority: Resolving issues where the Service Engine is active but failing to pass application traffic.
High Frequency: Utilizing built-in packet capture utilities to identify TCP handshake failures or RST packets.
Confusion Alert: Distinguishing between a backend server reset and a Service Engine dropping packets due to resource exhaustion.
Scenario Logic: Troubleshooting MAC address and ARP resolution issues in complex VLAN or Gratuitous ARP environments.
Version Delta: Monitoring internal SE resource usage for "CPU Hog" processes that cause packet drops.
Atomic Deconstruction
Actionable: Perform a real-time packet capture through the Avi UI for a specific Virtual Service to observe the traffic flow between the client, SE, and backend.
Parametric: Verify that the Service Engine has correctly resolved the ARP entries for its default gateway and all backend pool members.
Causal: High CPU utilization on an SE's dispatcher core often leads to packet drops, indicating a need for horizontal scale-out or resource reservation adjustments.
SKILLS.md Matrix
| Target | Tool | Troubleshooting Goal |
|---|---|---|
| Packet Flow | UI Packet Capture | Identifying TCP/SSL handshake resets |
| Layer 2 | ARP Table Audit | Verification of gateway and member MACs |
| SE Health | Resource Monitor | Detecting CPU hogging or memory leaks |
Exam Radar
Core Priority: Deconstructing the Virtual Service health score to find the specific point of failure.
High Frequency: Debugging backend servers marked "Down" by checking for HTTP 4xx/5xx errors or incorrect "Expect" strings.
Scenario Logic: Identifying pool member "flapping" caused by aggressive monitor timers or network congestion.
Version Delta: Troubleshooting connection multiplexing failures where the backend limits connections from the SE's SNIP.
Atomic Deconstruction
Actionable: Inspect the health monitor logs for a "Down" pool member to see the exact response received from the application (e.g., status code or timeout).
Parametric: Adjust health monitor intervals and thresholds to prevent false positives during periods of transient network latency.
Causal: If a backend server limits concurrent connections from the SE IP, the application may experience queuing and increased latency even if the server is "Up".
SKILLS.md Matrix
| Element | Logic | Atomic Detail |
|---|---|---|
| Health Score | 4 Pillars Analysis | Performance, Resource, Anomaly, Security |
| Monitoring | Monitor Debugging | Expect string and status code validation |
| Performance | Connection Queuing | Backend limit vs. SE SNIP concurrency |
Exam Radar
Core Priority: Using the Avi analytics engine for post-mortem analysis of failed requests.
High Frequency: Using end-to-end timing metrics to determine if latency is in the network, the SE, or the server.
Confusion Alert: Identifying if a 503 error was generated by the Avi SE (no servers) or passed through from the backend.
Scenario Logic: Analyzing WAF logs to identify and remediate false positives that are blocking legitimate traffic.
Version Delta: Utilizing Significant Log forensics to filter for specific response codes during troubleshooting windows.
Atomic Deconstruction
Actionable: Review the "Client-to-Avi" and "Avi-to-Server" timing breakdown to isolate whether latency is occurring on the WAN or the internal LAN.
Parametric: Filter logs for 503 Service Unavailable errors and check the "Significance" field to see if any pool members were available during the request.
Causal: WAF false positives can be remediated by creating specific exclusion rules for security signatures that are incorrectly flagging valid application traffic.
SKILLS.md Matrix
| Category | Atomic Requirement | Operational Detail |
|---|---|---|
| Forensics | End-to-End Timing | Network vs. App server latency split |
| Error Analysis | Log Filtering | Distinguishing local vs. remote errors |
| Security | WAF Log Review | Signature exclusion and rule tuning |
Exam Radar
Core Priority: Restoring functionality when the SDDC Manager's internal services or database become inconsistent.
High Frequency: Restarting the "vcf-operations-manager" service to resolve unresponsive UI or API issues.
Confusion Alert: Distinguishing between a service-level failure and a Postgres database corruption.
Scenario Logic: Manually updating the database state of a decommissioned host that remains in the inventory.
Version Delta: Utilizing the sos (Supportability and Operations Strategy) tool to gather diagnostic data for VMware support.
Atomic Deconstruction
Actionable: Use the systemctl restart command on the SDDC Manager appliance to recycle the operations manager and inventory services.
Parametric: Check the health of the PostgreSQL database by verifying the service status and disk space on the /storage/db partition.
Causal: Inconsistent database records regarding host status will prevent new workload domain creation, requiring manual SQL entry correction to match physical reality.
SKILLS.md Matrix
| Target | Action | Tool/Command |
|---|---|---|
| Service Health | Restart UI/API | systemctl restart vcf-operations-manager |
| DB Integrity | Query Status | psql -h localhost -U vcf |
| Support | Log Collection | sos --log-collect |
Exam Radar
Core Priority: Recovering the management plane from catastrophic failure using file-based backups.
High Frequency: Restoring the Avi Controller configuration JSON to a fresh cluster to recover virtual services.
Scenario Logic: Identifying the correct sequence to restore vCenter and NSX Managers to ensure environment synchronization.
Version Delta: Using SDDC Manager to restore the inventory state from an external SFTP/FTP backup repository.
Atomic Deconstruction
Actionable: Deploy a new Avi Controller node and use the "Restore Configuration" feature to upload the latest JSON backup file.
Parametric: Verify that the SFTP backup server for SDDC Manager is reachable and contains valid daily configuration exports.
Causal: Restoring a stale backup of the NSX Manager can result in a mismatch with the actual state of the Edge nodes, requiring manual data plane synchronization.
SKILLS.md Matrix
| Component | Recovery Method | Key Artifact |
|---|---|---|
| Avi Controller | JSON Import | Config Backup File |
| SDDC Manager | SFTP Restore | Inventory DB Backup |
| NSX-T | Manager Restore | Management Plane Snapshot |
Exam Radar
Core Priority: Confirming that remediation actions have successfully returned the system to an operational state.
High Frequency: Running the "Global Pre-check" in SDDC Manager post-repair to ensure no residual errors remain.
Scenario Logic: Verifying that Service Engines have re-established heartbeats with the Controller after a management network repair.
Version Delta: Re-syncing the Cloud Connector to ensure the infrastructure inventory (vCenter/NSX) is up to date.
Atomic Deconstruction
Actionable: Trigger a manual inventory refresh in the Avi Controller to confirm that all segments, T1s, and pools are correctly discovered post-repair.
Parametric: Verify the "Elastic HA" status of Service Engine Groups to ensure the system can once again scale out in response to traffic.
Causal: A successful "Green" health score in the Avi UI indicates that the data plane, management plane, and backend health monitors are all synchronized.
SKILLS.md Matrix
| Category | Atomic Requirement | Operational Detail |
|---|---|---|
| Verification | SDDC Pre-check | System-wide health certification |
| Resiliency | HA Status Check | Confirming scale-out capability |
| Integration | Cloud Sync | Refreshing infrastructure object discovery |
What is the first component to check if a Virtual Service is not responding?
Verify that the Service Engine hosting the Virtual Service is operational.
Virtual Services run on Service Engines. If the associated Service Engine is down or unreachable, the Virtual Service will not accept connections.
Administrators should check:
Service Engine health status
network connectivity
resource utilization
If the Service Engine fails, the Controller may redeploy or migrate the service to another engine.
Demand Score: 93
Exam Relevance Score: 94
Why might backend servers appear as DOWN in Avi?
Because health monitor checks are failing.
Avi uses health monitors to verify backend server availability.
If a server fails the health check, it is marked DOWN and removed from the load balancing pool.
Common causes include:
incorrect health monitor configuration
firewall blocking health check traffic
application service not responding
Demand Score: 88
Exam Relevance Score: 91
What tool within Avi helps diagnose application performance problems?
The Avi analytics dashboard.
The analytics dashboard provides detailed metrics including:
request latency
server response times
error rates
connection statistics
Administrators can use these metrics to quickly identify application bottlenecks.
Demand Score: 79
Exam Relevance Score: 88
What could cause a Service Engine to fail deployment?
Infrastructure integration problems such as vCenter permissions or resource limitations.
The Controller relies on infrastructure APIs to deploy Service Engines.
If permissions or resource availability are insufficient, deployment fails.
Administrators should verify:
vCenter credentials
datastore availability
network mappings
Demand Score: 82
Exam Relevance Score: 90
Why would clients experience intermittent connectivity to a Virtual Service?
Because of Service Engine resource exhaustion or network instability.
If a Service Engine reaches CPU or memory limits, connection handling may degrade.
Monitoring resource utilization can reveal the issue.
Deploying additional Service Engines often resolves the problem.
Demand Score: 75
Exam Relevance Score: 87
What diagnostic step should be taken if health monitors fail unexpectedly?
Verify monitor configuration and application availability.
Health monitors rely on correct protocol, port, and path configuration.
If the monitor does not match the application response, servers will incorrectly appear DOWN.
Administrators should confirm:
protocol settings
port numbers
response codes
Demand Score: 84
Exam Relevance Score: 91