This section covers the operational expertise required to diagnose issues, restore service, and fine-tune a VMware Cloud Foundation (VCF) platform running vSphere Kubernetes Service (VKS, formerly vSphere with Tanzu).
Effective troubleshooting requires a structured and repeatable workflow. VMware environments are highly integrated, so a systematic method ensures faster resolution and reduces risk.
Begin by clearly identifying the issue:
What exactly is broken?
Who or what is impacted?
When did the issue start?
What are the symptoms?
Is the issue isolated or widespread?
A well-defined problem statement reduces unnecessary investigation paths.
Most issues originate from changes such as:
Patches or upgrades
Network modifications
Configuration updates
Deployment of new clusters, workloads, or policies
Always review the environment’s recent activity, including tasks and events in vCenter and NSX.
Follow a bottom-up or top-down layered model:
Physical layer – servers, NICs, switches, cabling, power
Virtualization layer – ESXi hosts, VMs, vSAN storage
OS/Node layer – guest OS or Kubernetes node VMs
Platform layer – Supervisor Cluster, TKC components, NSX
Application layer – workloads, services, pods
This approach helps isolate root causes quickly and prevents overlooking underlying issues.
Ask: “What is working vs. what is broken?”
For example:
If one TKC fails but others succeed → likely a configuration or resource issue
If only certain pods fail → may indicate storage, quota, or RBAC issues
Comparison is one of the simplest and most effective troubleshooting tools.
Troubleshooting requires the ability to gather accurate, detailed information from the platform.
vSphere Client
ESXi Host Client & ESXi Shell
Local troubleshooting when vCenter is unavailable
Useful for log inspection and host-level diagnostics
vCenter Tasks & Events
Show the sequence of operations
Reveal failures in DRS, HA restarts, provisioning, storage operations
Help correlate issues with user or automated actions
NSX Manager UI
Traceflow
Port mirroring and packet captures when needed
kubectl logs <pod> – application logs
kubectl describe <pod/node> – detailed object information
kubectl get events – recent cluster events
These commands provide insight into pod scheduling issues, container crashes, and node failures.
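For example, a minimal triage pass against a misbehaving workload might look like this (namespace, pod, and node names are placeholders):
# Recent events in time order, to spot scheduling or image-pull failures
kubectl get events -n demo-ns --sort-by=.metadata.creationTimestamp
# Events, conditions, and container statuses for a failing pod
kubectl describe pod web-0 -n demo-ns
# Logs from the previous container instance after a crash loop
kubectl logs web-0 -n demo-ns --previous
# Node conditions such as Ready, MemoryPressure, and DiskPressure
kubectl describe node worker-node-01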
VMware Aria Operations & Aria Operations for Logs
Dashboards for performance, capacity, and anomalies
Unified log search across vSphere, NSX, and Kubernetes
Alerting for early detection of problems
Observability is essential for fast root-cause resolution.
Compute failures affect VMs and Kubernetes nodes, potentially impacting entire clusters.
Investigate:
Root cause of host crash
HA behavior and whether VMs restarted successfully
Admission control settings and capacity for failover scenarios
If HA fails to restart certain VMs, check:
VM restart priority
Resource reservations
Placement rules
Symptoms of resource contention include:
High CPU Ready Time
Memory ballooning or swapping
Low application throughput
For resolution:
Evaluate resource pools and limits
Remove overly restrictive reservations
Add hosts or rebalance clusters
Review overcommit levels
Resource contention is one of the most common performance problems in vSphere.
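As a quick spot check, the relevant counters can be captured from an affected ESXi host with esxtop; this sketch assumes shell access to the host, and the interval and sample counts are arbitrary:
# Record 10 samples at 5-second intervals to a CSV for offline analysis
esxtop -b -d 5 -n 10 > /tmp/esxtop-contention.csv
# In interactive esxtop, the CPU view (c) shows %RDY per VM, and the
# memory view (m) shows MCTLSZ (balloon size) and SWCUR (swapped memory)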
Storage problems can affect both VM performance and Kubernetes workloads.
Typical issues include:
Disk failures
Network connectivity issues between vSAN nodes
Resync or rebuild operations consuming bandwidth
Inconsistent firmware/driver versions
Use vSAN Health Service to analyze warnings and identify root causes.
Common indicators of capacity pressure:
Datastore running out of space
Objects stuck in “reduced availability”
Failed policy compliance
Mitigation:
Enable thin provisioning
Reclaim unused space
Perform cluster rebalance
Review and adjust storage policies
Storage shortages can bring down entire clusters—capacity planning is essential.
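From any host in the cluster, esxcli gives a quick read on vSAN health and object state (a sketch; command output varies by vSAN release):
# Summarize the vSAN health checks as seen by this host
esxcli vsan health cluster list
# Overall object health, including objects with reduced availability
esxcli vsan debug object health summary get
# Disks and capacity devices claimed by vSAN on this host
esxcli vsan storage list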
Networking is a multi-layer system in VCF, so failures can manifest in many ways.
Common causes at the physical and host networking layer:
MTU mismatch between physical and virtual networks
Incorrect VLAN tags
Switch trunk configuration errors
Misconfigured NIC teaming
Symptoms may include:
vMotion failures
Host isolation
vSAN object resync delays
Potential NSX overlay and Edge issues:
Edge node failures causing routing interruptions
Incorrect Tier-0/Tier-1 route advertisement
BGP misconfigurations
GENEVE encapsulation MTU problems
Troubleshooting NSX often requires a combination of API queries, traceflow, and controller diagnostics.
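As one example, transport node state can be pulled straight from the NSX Manager API; the hostname, credentials, and UUID below are placeholders:
# Aggregate state of all transport nodes (hosts and Edge nodes)
curl -k -u admin 'https://nsx-mgr.example.com/api/v1/transport-nodes/state'
# State of a single transport node by its UUID
curl -k -u admin 'https://nsx-mgr.example.com/api/v1/transport-nodes/<node-uuid>/state'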
Supervisor Cluster symptoms and checks:
kubectl cannot connect
Validate control plane VIP
Check certificates and authentication
Verify NSX routing and firewall rules
Control plane node NotReady
Inspect Supervisor VMs on ESXi hosts
Validate storage for etcd and control plane volumes
Check host health and vSAN connectivity
Common guest cluster (TKC) issues:
Nodes in NotReady or Unknown
Underlying VM failure
Host networking issues
Cloud Provider integration problems
Pods stuck in Pending
Causes include:
Not enough CPU/memory
StorageClass not available
PVC provisioning failures
Namespace quota limits
Guest cluster troubleshooting often overlaps vSphere and Kubernetes problem spaces.
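A few kubectl checks usually narrow down why pods stay Pending (the namespace and object names below are placeholders):
# The Events section explains why the scheduler cannot place the pod
kubectl describe pod app-0 -n team-ns
# Compare requested resources against what the Namespace quota still allows
kubectl describe resourcequota -n team-ns
# Confirm the StorageClass referenced by the PVC exists and the claim is binding
kubectl get storageclass
kubectl describe pvc data-app-0 -n team-ns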
If quotas are exceeded:
Deployments fail with scheduling or admission errors
Pods remain pending or evicted
PersistentVolumeClaims fail to bind
Admins must adjust Namespace resources or guide teams to optimize usage.
Common RBAC and permission symptoms:
“Access Denied” errors for developers
Service accounts lacking required permissions
Inability to deploy workloads into a Namespace
Resolving RBAC problems often requires coordination between vSphere and Kubernetes administrators.
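kubectl auth can-i is a fast way to confirm what a given identity may do; the namespace and service account below are placeholders:
# Check your own effective permissions in the Namespace
kubectl auth can-i create deployments -n team-ns
# Impersonate a service account to verify its permissions
kubectl auth can-i create pods -n team-ns --as system:serviceaccount:team-ns:build-sa
# List the RoleBindings granting access in the Namespace
kubectl get rolebindings -n team-ns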
Application connectivity failures may result from:
Strict Kubernetes NetworkPolicies
NSX Distributed Firewall blocking pod-to-pod or pod-to-VM paths
Traceflow and packet captures help identify policy-related drops.
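Listing the policies that apply to the affected Namespace is a sensible first step before moving to the NSX Distributed Firewall and Traceflow (names below are placeholders):
# Which NetworkPolicies exist in the Namespace?
kubectl get networkpolicy -n team-ns
# Which pods does a policy select, and what ingress/egress does it allow?
kubectl describe networkpolicy restrict-db-access -n team-ns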
Right-size VMs and K8s nodes to prevent over-allocation
Align large VMs with NUMA boundaries
Use DRS automation to balance workloads
Avoid excessive CPU/vCPU ratios for critical workloads
Proper compute tuning significantly improves throughput and stability.
Tune vSAN policies based on application needs
Optimize cache usage for high-IOPS workloads
Consider separate vSAN policies for different performance tiers
Monitor rebuild impact and disk balancing
Well-adjusted storage designs improve resilience and reduce latency.
Set appropriate resource requests and limits to prevent node overcommit
Use Horizontal Pod Autoscaler (HPA) for dynamic scaling
Enable cluster autoscaling (if supported) for capacity elasticity
Cloud-native optimization requires observability and iterative tuning.
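For instance, an HPA can be attached to a deployment imperatively; the deployment name and thresholds below are illustrative:
# Scale 'web' between 2 and 10 replicas, targeting roughly 70% average CPU utilization
kubectl autoscale deployment web --min=2 --max=10 --cpu-percent=70 -n team-ns
# Confirm current and desired replica counts and the observed metric values
kubectl get hpa -n team-ns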
Use VMware Aria Operations to analyze:
Current utilization
Projected growth
“What-if” scenarios for hardware changes
Impact of adding or removing clusters
Capacity management ensures sustained performance and cost efficiency.
Regularly remove:
Unused VMs
Orphaned disks
Zombie PVs
Failed or abandoned TKCs
Consolidate underutilized clusters when possible—while maintaining:
HA host failure tolerance
vSAN storage availability
Workload separation requirements
Balancing efficiency with resilience ensures sustainable long-term operations.
VCF Lifecycle Management revolves around SDDC Manager orchestrating upgrades, patches, and domain operations. Troubleshooting in this area is about understanding how VCF expects things to look and what happens when reality does not match.
After Cloud Builder has completed initial bring-up, additional failures can appear when:
Registering SDDC Manager with vCenter or NSX
Adding or configuring management components
Running the first lifecycle operations
Typical troubleshooting approach:
Confirm DNS, NTP, and certificate settings are correct for all components
Check SDDC Manager logs for failed API calls to vCenter or NSX
Validate that management VMs (vCenter, NSX Manager, SDDC Manager) are up, healthy, and not resource-constrained
Ensure Cloud Builder’s configuration (IP ranges, hostnames, VLANs) matches the actual environment
Post-bring-up failures often point to underlying configuration inconsistencies.
Before applying a lifecycle bundle, SDDC Manager runs prechecks:
Verifies versions of existing components
Confirms cluster health (vSAN, HA, DRS)
Checks for NSX and vCenter connectivity
Validates Bill of Materials (BOM) compatibility
Common precheck failures include:
vSAN issues (degraded objects, resync in progress)
Unresponsive or disconnected hosts
NSX Manager or vCenter not at expected version
Troubleshooting means clearing these issues first, then re-running prechecks.
Bundles may fail if:
A required earlier bundle was not applied
Domains are on mismatched versions relative to the management domain
Component versions drifted due to manual updates
You troubleshoot by:
Reviewing the VCF BOM and the bundle documentation
Ensuring you are following the prescribed sequence (for example Management Domain before Workload Domains)
Verifying no component has been manually upgraded outside SDDC Manager
Workload Domain creation can fail at steps such as:
Host validation
vCenter deployment
NSX configuration
vSAN cluster creation
Key checks:
All hosts to be added must be commissioned and compliant with the vLCM image
Network configuration (VLANs, MTU, routing) must match the domain design
vSAN disks must be properly claimed and not in an unexpected state
Logs in SDDC Manager and vCenter will show which phase failed.
Commissioning can fail due to:
Unsupported firmware or driver versions
Incorrect network configuration (VLANs, MTU, NIC mapping)
HCL or image incompatibility
Decommissioning can fail if:
VMs or services still run on the host
vSAN cannot evacuate data based on storage policies
NSX configuration still references the host
Troubleshooting means checking:
Hardware compatibility
vSAN evacuation status
NSX transport node state
Whether host is still part of any cluster or service
Drift occurs when:
A host is patched outside SDDC Manager or vLCM
Someone changes configuration manually (for example, installing a different driver)
A component fails to upgrade while others succeed
You troubleshoot drift by:
Using SDDC Manager compliance checks and vLCM reports to see which components are out of sync
Identifying whether the drift is intentional or accidental
Planning remediation windows where vLCM can remediate the host back to the defined image
Designing predictable operations means reducing manual changes and relying on automated tools.
vLCM focuses on keeping cluster hosts aligned to a defined image. Troubleshooting is largely about understanding why a host does not or cannot match that image.
Mismatches appear when:
A host’s driver or firmware differs from the cluster image
Vendor tools or manual updates changed firmware outside vLCM
You use:
vLCM compliance reports to see which components differ
Vendor documentation to confirm supported driver/firmware combinations
Fixing it usually involves letting vLCM remediate the host back to the image or updating the image to include the new firmware in a controlled way.
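On an individual host, the installed components can be listed and compared against the cluster image from the ESXi Shell (a sketch; the driver name shown is only an example):
# Every VIB installed on the host, with version and vendor
esxcli software vib list
# The base image profile the host reports it is running
esxcli software profile get
# Details for one specific driver VIB (name is illustrative)
esxcli software vib get -n nmlx5-core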
Remediation may fail because:
Host cannot enter maintenance mode (insufficient cluster capacity)
vSAN cannot evacuate data due to lack of space or incompatible storage policies
Reboots or package installs fail due to hardware issues
Troubleshooting steps:
Check vSphere tasks and host logs for maintenance mode and vSAN evacuation failures
Confirm sufficient free capacity and N+1/N+2 design
Review vLCM logs for specific error codes during remediation
When converting from baselines to images:
Some hosts may have extra VIBs not accounted for in the image
Hardware may not support the new image components
You troubleshoot by:
Identifying non-standard VIBs and either removing them or adding them as vendor add-ons in the image
Validating HCL again for any newly enforced constraints
The goal is to converge on a consistent, supported image for all hosts.
At the cluster level:
Desired state is the defined image
Actual state is what each host is really running
Drift analysis means:
Comparing each host’s ESXi version, driver set, firmware, and add-ons
Identifying patterns (for example, all hosts of a certain hardware model have a particular mismatch)
Deciding whether to change the image or remediate hosts
If the depot is not synchronized:
New patches or images may not appear
Bundles may show as corrupt or incompatible
Troubleshooting involves:
Checking connectivity to the online depot (for online mode)
Verifying the integrity and source of offline bundles
Re-importing or updating depots as needed
If remediation fails partway:
A host might be in an intermediate state
It may boot with a partial update or fail to boot at all
Recovery approaches:
Use hardware console to check host status
Boot to a known good ESXi image if necessary
Re-run vLCM remediation with corrected image or packages
Documenting rollback procedures is essential for safe operations.
Edge nodes, routing, and load balancers are critical for north–south traffic and Kubernetes ingress.
TEPs (Tunnel Endpoints) carry the GENEVE overlay traffic between hosts and Edge nodes.
Common problems:
Wrong TEP VLAN
Incorrect IP pools
MTU mismatch between Edges and hosts
Troubleshooting:
Use ping and trace tools between TEPs
Verify transport zone and uplink profile assignments
Check that the underlay network routes and MTU support overlay traffic end to end (see the vmkping sketch below)
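The sketch below assumes ESXi Shell access; the interface name, payload size, and peer TEP address are placeholders:
# Send a large packet with "do not fragment" set across the TEP network
# (a 1572-byte payload exercises a 1600-byte uplink MTU once GENEVE overhead is added)
vmkping ++netstack=vxlan -I vmk10 -d -s 1572 192.0.2.21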
Discrepancies show as:
Routes missing in upstream routers
Internal segments not reachable from outside
Overlapping or incorrect route advertisements
You troubleshoot by:
Inspecting Tier-0/Tier-1 route tables
Checking BGP configuration and route filters
Confirming which networks are set to be advertised
BGP/BFD issues present as:
Unstable neighbor relationships
Frequent flapping
Missing routes
Checks include:
IP and ASN configuration on both sides
Interface and MTU status
BFD timers and any timer misalignment between peers
Logging on NSX and the physical routers (see the Edge CLI sketch below)
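The commands below are run on the NSX Edge node CLI; exact command names can vary slightly between NSX releases, and the VRF ID is illustrative:
get logical-routers – identify the VRF ID of the Tier-0 service router
vrf 1 – enter that VRF context
get bgp neighbor summary – neighbor state, uptime, and prefixes received
get bfd-sessions – BFD session state toward the physical routers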
Symptoms:
VIP not reachable
Application unreachable despite pods or VMs being healthy
Troubleshooting steps:
Verify VIP is bound to the correct interface and advertised upstream
Check health monitors for pool members
Confirm firewall rules allow traffic to VIP and pool member networks
Ensure application responds to health probes as expected
Misconfigured NAT rules can cause:
Asymmetric routing
Unexpected source IPs
Broken return flows
To troubleshoot:
Review NAT configuration at Tier-0 and Tier-1
Use traceflow to visualize path and NAT translations
Confirm that address pools do not overlap with internal or external networks
If NCP (the NSX Container Plugin) fails:
Pod networks and services may not be created
Kubernetes events may show CNI-related errors
Diagnostics:
Check NCP logs on NSX Manager or integration nodes
Verify API connectivity between NSX and Kubernetes
Confirm that required NSX objects (segments, routers, firewall rules) are being created
Supervisor issues can impact all VKS workloads, since the Supervisor is the Kubernetes control plane integrated with vSphere.
Spherelet is responsible for:
Communicating with the Supervisor control plane
Managing PodVMs on ESXi
Problems appear as:
PodVMs stuck in pending or failed states
Nodes marked NotReady in the Supervisor view
Troubleshooting:
Check spherelet logs on ESXi
Verify NSX connectivity between ESXi and Supervisor control plane
Confirm that required certificates and tokens are valid (a quick host-level check is sketched below)
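Assuming ESXi Shell access (the log path shown is the commonly seen location and may differ between releases):
# Confirm the spherelet process is running on the host
ps | grep spherelet
# Recent spherelet activity, including PodVM lifecycle and registration messages
tail -n 100 /var/log/spherelet.log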
When etcd or API servers fail:
kubectl commands stop working or hang
Cluster state may become inconsistent
You investigate:
VM status and logs for control plane VMs
Storage health for etcd data (for example vSAN objects)
Network paths to the control plane VIP
If control plane VMs:
All end up on a single host or rack without anti-affinity
Use a storage policy that is now non-compliant
Outages may occur when that host or rack fails.
Troubleshooting:
Ensure anti-affinity rules are in place for control plane VMs
Check vSAN compliance and resync status for control plane disks
The Workload Control Plane (WCP) service on vCenter coordinates Supervisor features.
Key logs:
WCP service logs on vCenter or related appliances
Errors creating Namespaces, TKCs, or PodVMs
You use these logs to see high-level Kubernetes operations from the vSphere side.
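On the vCenter Server Appliance, the WCP service log is the usual starting point (the path below is the commonly used location; verify it for your version):
# Follow the Workload Control Plane service log on the vCenter appliance
tail -f /var/log/vmware/wcp/wcpsvc.log
# Verify the wcp service itself is running
service-control --status wcp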
Failure signals include:
Authentication errors when accessing Supervisor API
Token validation failures in logs
Expired or invalid certificates on control plane endpoints
Troubleshooting steps:
Check certificate validity and chains
Validate OIDC configuration and token issuer data
Renew or rotate certificates and tokens as required
During upgrades:
Incompatible versions can cause partial failures
Rollback might be needed if control plane becomes unstable
You must:
Follow the documented sequencing strictly
Validate each step before advancing
Have a tested rollback strategy for Supervisor packages
TKCs are guest clusters; problems there often look like “normal” Kubernetes issues but have VCF-specific causes.
At bootstrap time:
Control plane VMs use cloud-init or ignition to configure Kubernetes components
Failures result in nodes stuck in NotReady or initial setup loops
Troubleshooting:
Inspect cloud-init logs inside the control plane VMs
Check that the correct TKC templates and versions are available in the content library
Verify network, DNS, and certificate settings for the TKC API endpoint
Worker nodes can fail to:
Provision
Join the cluster
Be replaced after a failure
You check:
Machine and MachineSet objects in Cluster API
Node logs for kubelet or network errors
Namespace quota or resource limits that might block VM creation (see the Cluster API checks below)
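These checks are run from the Supervisor Namespace context; namespace and object names are placeholders:
# Machines and their phases (Provisioning, Running, Failed, ...)
kubectl get machines -n team-ns
# Events and conditions for a machine that is stuck
kubectl describe machine tkc-1-workers-abcde -n team-ns
# MachineDeployments and MachineSets controlling the worker pools
kubectl get machinedeployments,machinesets -n team-ns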
If PVs fail to bind:
CSI may not be correctly configured
StorageClasses may reference invalid Storage Policies
CNS may not be able to create or attach disks
Troubleshooting:
Check Kubernetes events on PVCs and PVs
Verify StorageClass parameters match vSphere policies
Look at CNS and vSphere logs for disk creation or attachment failures (a few kubectl checks are sketched below)
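Object names in the sketch below are placeholders:
# The Events section of the claim shows provisioner errors
kubectl describe pvc data-db-0 -n team-ns
# Does the StorageClass point at a valid vSphere storage policy?
kubectl get storageclass vsan-default -o yaml
# Events raised against PVCs in the Namespace, such as provisioning or attach failures
kubectl get events -n team-ns --field-selector involvedObject.kind=PersistentVolumeClaim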
ClusterClass errors include:
Incorrect references to machine templates
Invalid Kubernetes versions
Incompatible configuration options
You inspect:
Cluster and ClusterClass manifests
Cluster API logs
Validation output from any pre-deployment tooling
MachineHealthCheck:
Monitors nodes for health conditions
Triggers remediation (delete and recreate) on failure
Troubleshooting involves:
Checking MachineHealthCheck objects and conditions
Reviewing which nodes were remediated and why
Ensuring remediation does not conflict with maintenance or planned operations
Upgrade issues often come from:
Unsupported upgrade paths
Missing node images for the target version
Incompatible control plane and worker versions
You troubleshoot by:
Reviewing documented upgrade paths and compatibility
Ensuring content library has images for the target version
Checking upgrade logs from Cluster API controllers
vSAN Express Storage Architecture (ESA) differs substantially from the Original Storage Architecture (OSA), so troubleshooting must adapt.
Fault domains protect against rack or chassis failures.
Troubleshooting:
Verify that hosts are assigned to correct fault domains
Check whether vSAN objects have components placed across domains as expected
Correct any misalignment that could lead to correlated failures
Prechecks might flag:
Unsupported NVMe devices
Inconsistent controller firmware
A disk layout that does not meet ESA requirements (ESA uses storage pools rather than OSA-style disk groups)
You must:
Compare hardware against VMware’s compatibility guidance for ESA
Update firmware or reconfigure hardware as required
Symptoms:
High latency for reads or writes
Decreased IOPS compared to design expectations
Troubleshooting:
Check vSAN performance dashboards for congestion or contention
Ensure network bandwidth and MTU are configured correctly
Validate that parallelism and queue depths on storage devices are within supported ranges
After failures:
vSAN resyncs and rebuilds data to maintain the configured failures to tolerate (FTT)
ESA’s internal architecture shapes how and where data is rebuilt
Troubleshooting:
Monitor resync traffic and progress
Ensure there is enough spare capacity to complete rebuilds
Confirm that resyncs are not chronically stuck due to repeated failures or resource limits (see the esxcli sketch below)
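The sketch below can be run from any host in the cluster; output detail varies by release:
# Bytes left to resync and the number of objects currently resyncing
esxcli vsan debug resync summary get
# Per-object resync detail when a specific object appears stuck
esxcli vsan debug resync list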
Capacity imbalance may show as:
Some hosts or fault domains being nearly full
Objects in “reduced redundancy” or “non-compliant” states
Troubleshooting actions:
Trigger or monitor automatic rebalancing
Adjust policies if they are too strict for available hardware
Plan capacity additions to restore balance
Modern VCF + VKS systems require correlating logs across multiple layers.
You typically gather:
Supervisor control plane logs (API server, controller, etcd)
WCP service logs on vCenter or Supervisor components
TKC cluster logs (API, controllers, etcd)
Spherelet logs on ESXi for PodVM operations
Knowing where these logs reside is the first step in meaningful troubleshooting.
For NSX and NCP:
NCP logs show CNI and Kubernetes integration status
NSX Manager logs show control-plane operations and errors
Edge and transport node logs show routing and datapath events
These logs help explain why network elements did or did not get created.
vmkernel logs can reveal:
Storage timeouts for PodVM disks
Network driver issues impacting overlay tunnels
Resource constraints on ESXi hosts
Recognizing recurring patterns (for example, repeated path failovers or driver resets) is key.
Kubernetes control components log:
Scheduling decisions
Controller actions (such as creating pods or PVs)
API errors and authentication failures
These logs explain why pods are pending, unschedulable, or repeatedly recreated.
Complex issues often involve:
A Kubernetes symptom (pod cannot reach service)
An NSX misconfiguration (missing route or blocked firewall rule)
A vSphere-level problem (host or vSAN issue)
Troubleshooting means:
Starting from the symptom
Following the path down through logs at each layer
Identifying the first point where behavior deviates from expectations
Optimization ensures the platform not only works but works efficiently and predictably.
Good MTU design:
Ensures overlay packets (with GENEVE headers) do not get fragmented
Uses consistent MTU across host NICs and physical switches
Is validated with end-to-end tests, not just configuration assumptions
Fragmentation wastes CPU and reduces throughput.
You can:
Use ECMP with multiple edges for parallel forwarding
Tune BGP timers and settings for faster convergence
Ensure route tables remain clean and free from unnecessary prefixes
Poor routing design leads to slow failover and unpredictable application reachability.
Tuning includes:
Right-sizing Edge nodes for CPU and memory
Avoiding overload by distributing VIPs across multiple nodes
Using appropriate health checks to avoid flapping
A poorly sized load balancer can become a central bottleneck.
To keep the network manageable:
Plan CIDR ranges in advance with enough space for growth
Avoid small, disjoint ranges that are hard to summarize in routing
Use consistent design patterns across clusters
If CIDRs are badly fragmented, routing and firewall rules become complex and error-prone.
You can optimize:
Placement of services (for example, co-locating chatty microservices)
Network path length (minimizing unnecessary hops)
Overlay designs so that traffic stays local when possible
In Kubernetes-heavy environments, most traffic is east–west. Optimizing it directly impacts user-facing performance.
What is a common reason for Supervisor Cluster deployment failure?
Misconfigured networking or missing prerequisites in the vSphere cluster.
Supervisor Cluster deployment requires specific infrastructure prerequisites including compatible ESXi hosts, configured networking, and supported storage policies. If networking components such as NSX or distributed switches are not properly configured, the deployment process may fail or become stuck. Administrators should verify cluster compatibility, networking configuration, and resource availability before enabling Workload Management.
Demand Score: 79
Exam Relevance Score: 88
Why might Kubernetes pods be unable to communicate with services in a Tanzu cluster?
Network policies or NSX configuration issues may be blocking traffic.
Kubernetes networking relies on correct configuration of network overlays, service routing, and firewall policies. If NSX distributed firewall rules or Kubernetes network policies are misconfigured, traffic between pods or services may be blocked. Administrators should inspect network policies, firewall rules, and service configurations to ensure that required communication paths are allowed.
Demand Score: 73
Exam Relevance Score: 84
What does a “Node Not Ready” status usually indicate in Kubernetes clusters on vSphere?
The node cannot communicate with the Kubernetes control plane or required services.
When a node reports a “Not Ready” status, it typically means that the kubelet or networking components cannot properly communicate with the control plane. Causes may include networking issues, resource exhaustion, misconfigured certificates, or failed system services. Administrators should inspect node logs and verify connectivity to the API server to diagnose the problem.
Demand Score: 70
Exam Relevance Score: 82
What tool can help diagnose Kubernetes cluster issues on vSphere?
kubectl diagnostic commands such as kubectl describe and kubectl logs.
kubectl provides diagnostic commands that help administrators inspect cluster objects and troubleshoot issues. Commands such as kubectl describe pod reveal detailed status information including events and configuration, while kubectl logs shows application logs from containers. These tools are commonly used to identify networking, scheduling, or application errors within Kubernetes clusters.
Demand Score: 67
Exam Relevance Score: 80