When problems arise in your Avi environment, you need a structured approach. Without structure, you’ll waste time and may misdiagnose the issue.
Define the problem:
Is the issue about access (users can’t reach the app)?
Is it about performance (app is slow)?
Is it about availability (service is down)?
Or is it a misconfiguration (unexpected behavior)?
Scope the impact:
Is it affecting just one Virtual Service (VS)?
One Service Engine (SE)?
Or the entire deployment (global issue)?
Use telemetry and logs to isolate cause:
Check metrics: latency, error rates, CPU/Memory usage.
Review logs: access logs, system events, health monitor alerts.
Use built‑in tools like FlightPath and logs:
The Avi Controller has tools for tracing requests and seeing where things broke.
Use logs to reconstruct what happened around the time of failure.
Here are questions to ask early in a troubleshooting session:
Is traffic reaching the Virtual Service (VS) at all?
Are backend pool members healthy? Are they up or marked down?
Are DNS names and IPs resolving correctly?
Is SSL/TLS termination happening correctly (if HTTPS)?
These questions help you quickly eliminate large domains of potential issues.
This section outlines the most frequent problems you’ll encounter in production environments and how to systematically resolve them.
Virtual Service Not Reachable
Symptoms:
Application appears down or not responding.
No traffic observed on the backend servers.
Troubleshooting Steps:
Check VIP assignment:
Was the IP assigned from IPAM?
Is the IP reachable (ping from client/another host)?
Verify routing/firewall:
Can the client route to the SE data IP?
Are any firewall rules blocking access?
Validate health monitor status:
Common Fixes:
Correct IP allocation errors or reassign VIP.
Add required routes to SE subnet.
Allow inbound ports (80, 443, etc.) on firewalls.
Backend Servers Marked Down
Symptoms:
VS is up, but no real traffic flows to backend.
Health monitor shows pool members as down.
Troubleshooting Steps:
Review pool status in Avi UI.
Check health monitor logs for failure reason.
Try curl or telnet from SE to backend IP:Port to test reachability.
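The curl/telnet reachability check can also be scripted when you have many pool members to test; a minimal sketch in Python (host and port are placeholders):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check a hypothetical backend member from the SE
# tcp_reachable("10.0.0.10", 8080)
```

Running this once per pool member separates network reachability from application-level failures reported by the health monitor.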
Common Fixes:
Fix application (backend server crashed).
Update IPs/ports in pool if misconfigured.
Adjust health monitor interval/timeout settings.
SSL/TLS Certificate Problems
Symptoms:
Browsers show "insecure connection", "certificate error", or handshake failures.
Clients unable to connect via HTTPS.
Troubleshooting Steps:
Check certificate validity (expired? mismatched domain name?).
Verify SSL profile assigned to the VS.
Confirm backend supports expected TLS version/ciphers.
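Certificate expiry can also be checked programmatically; this sketch uses Python's ssl module (the fetch helper needs network access to the VS, so treat it as illustrative):

```python
import socket
import ssl
from datetime import datetime, timezone

def fetch_not_after(host, port=443):
    """Fetch the notAfter timestamp from the certificate a VS presents.
    Note: verification failures (expired cert, wrong name, broken chain)
    raise during the handshake, which already identifies the failure mode."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def days_until_expiry(not_after, now=None):
    """Days until a notAfter string (e.g. 'Jan 11 00:00:00 2024 GMT') expires."""
    expires = ssl.cert_time_to_seconds(not_after)
    now_ts = (now or datetime.now(timezone.utc)).timestamp()
    return (expires - now_ts) / 86400
```

A negative result from `days_until_expiry` means the certificate has already expired.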
Common Fixes:
Replace expired cert or upload full certificate chain.
Enable correct SSL version (e.g., TLS 1.2 or 1.3).
Modify cipher suite in the SSL profile.
Load Balancing Not Working as Expected
Symptoms:
Some users always hit the same backend.
Uneven traffic distribution.
Troubleshooting Steps:
Check load balancing algorithm in pool:
Verify session persistence (sticky sessions) settings.
Inspect pool member weights.
Common Fixes:
Adjust or disable persistence if not needed.
Rebalance member weights.
Switch to a different load balancing method if more suitable.
Slow Performance / High Latency
Symptoms:
Pages load slowly.
Timeouts or delays in API responses.
Troubleshooting Steps:
Check backend server performance (CPU, memory, app response).
Measure TLS handshake time in logs or FlightPath.
Look for misconfigured compression or excessive caching.
Monitor SE resource usage and bandwidth.
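To separate network latency from application latency, compare the raw TCP connect time with the full-request time; a minimal timing helper (illustrative, not an Avi tool):

```python
import socket
import time

def connect_time_ms(host, port, timeout=5.0):
    """Time a bare TCP connect to host:port in milliseconds.

    If connect time is low but full HTTP responses are slow, the delay is
    likely in the backend application or TLS, not the network path.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0
```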
Common Fixes:
Tune backend app or scale out pool.
Upgrade SE CPU for better SSL offload.
Optimize caching/compression settings.
| Issue Type | First Checks | Common Fixes |
|---|---|---|
| VS Not Reachable | VIP assignment, routes, firewall | Fix IP config, add routes, open ports |
| Backend Down | Health monitors, server status | Restart app, fix IP/port, tune monitor |
| SSL Problems | Cert status, TLS versions | Replace certs, adjust cipher profiles |
| LB Algorithm Broken | Pool config, persistence | Change LB method, reweight members |
| Slow/High Latency | TLS timing, server RTT, SE CPU | Upgrade SE, tune TCP/SSL, scale backend |
VMware Avi includes powerful diagnostics tools, both graphical (UI) and command-line (CLI), that can help you pinpoint problems quickly.
What is it?
FlightPath is Avi’s built-in end-to-end request tracing tool.
How to Use:
Go to the Virtual Service page in the Controller UI.
Click “FlightPath”.
Input:
Source IP address
Protocol (e.g., HTTP, TCP)
Destination host or VS IP
What it Shows:
The complete traffic flow:
Each processing stage:
DNS lookup
SSL handshake
Request headers
Server response
Decision points like policy matches or errors
Use Case Examples:
Find out why traffic is not reaching the backend
Identify which server returned a 5xx error
See which rule or redirect was triggered
Best Practice:
Use FlightPath before changing any configs. It visualizes what’s actually happening under the hood.
Avi provides a comprehensive CLI via SSH or the Controller terminal.
Basic Useful Commands:
> show virtualservice <VS-Name>
> debug virtualservice <VS-Name>
> show logs
Advanced Options:
You can drill into pools, servers, SEs, and logs using tab-completion in CLI.
Filter logs by keyword or time to isolate problems.
When to Use:
When the UI isn’t available
During complex issues where FlightPath doesn’t show enough detail
When following a step-by-step diagnostic script
Avi captures several types of logs:
| Log Type | Description |
|---|---|
| Access Logs | Logs every HTTP(S) request processed by a VS |
| System Events | Cluster status, configuration changes, alerts |
| Health Monitor | Tracks when pool members go up/down, response times |
Viewing Logs:
Use the UI: Operations > Events / Logs
Use CLI: show logs
Export to external tools like:
Syslog
Kafka
Elasticsearch
vRealize Log Insight
Best Practices:
Use log filters to isolate by Virtual Service, timestamp, or error code.
Cross-reference logs with FlightPath results for full context.
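The same filtering idea applies offline to exported logs; a sketch that filters access-log lines by VS name and response-code class (the three-field line format here is an assumption, not Avi's actual export format):

```python
def filter_logs(lines, vs_name=None, status_prefix=None):
    """Filter log lines of the assumed form 'VS STATUS PATH' by Virtual
    Service name and/or response-code class (e.g. '5' for 5xx errors)."""
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip lines that don't match the assumed format
        vs, status = fields[0], fields[1]
        if vs_name and vs != vs_name:
            continue
        if status_prefix and not status.startswith(status_prefix):
            continue
        out.append(line)
    return out
```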
| Tool | What It Helps With | How to Access |
|---|---|---|
| FlightPath | Visual trace of client → backend | Controller UI > Virtual Service |
| CLI Debug | Status, logs, enable tracing | SSH or console CLI |
| Log Analysis | Understand errors, request details, failures | UI / CLI / Syslog integrations |
This section focuses on diagnosing and repairing infrastructure-related issues — particularly those involving Service Engines and Controller nodes.
Service Engine (SE) Failures
Common Symptoms:
SE shows as down or unreachable in the UI.
VS traffic is not processed (especially in Active/Standby mode).
High latency or packet loss.
Troubleshooting Steps:
Check infrastructure status (vCenter, AWS, Azure, etc.):
Is the VM running?
Is it responding to ping?
Check SE health from the Controller:
UI: Infrastructure > Service Engines
CLI: show serviceengine <SE-Name>
Verify system resources:
Is CPU/memory usage too high?
Are disk or vNICs overloaded?
Check networking:
Can the Controller reach the SE?
Can the SE reach VIP and pool IPs?
Fixes:
Restart SE from the UI or CLI.
Redeploy SE if it’s corrupted or stuck.
Review cloud integration (credentials, APIs, placement settings).
Controller Node Failures
Common Symptoms:
One or more Controller nodes are unreachable.
Cluster shows sync errors or HA issues.
UI/API becomes unresponsive.
Troubleshooting Steps:
Ping each node from within the network.
Use CLI:
> show cluster status
> show cluster runtime
Check system resources:
Disk space: df -h
Memory/CPU load: top or vmstat
Inspect logs for HA sync issues or node communication problems.
Typical Causes:
DNS/NTP mismatch between nodes
Disk full on a Controller VM
IP address change without updating cluster settings
Fixes:
Restart the failed node (gracefully).
Rejoin the node using the rejoin cluster command (advanced).
Ensure consistent DNS and NTP across all nodes.
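Once you have collected each node's settings, DNS/NTP consistency can be verified mechanically; a small comparison sketch (the input structure is assumed for illustration):

```python
def find_mismatches(node_settings):
    """Given {node: {"dns": [...], "ntp": [...]}}, return the keys whose
    values differ between any two Controller nodes."""
    mismatched = []
    for key in ("dns", "ntp"):
        values = {tuple(sorted(cfg.get(key, []))) for cfg in node_settings.values()}
        if len(values) > 1:
            mismatched.append(key)
    return mismatched
```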
| Component | Best Practice |
|---|---|
| SE | Use fault domains to spread across hosts/AZs |
| Controller | Use 3-node clusters with quorum monitoring |
| Both | Monitor resource usage and alerts regularly |
| Problem Type | Key Tools | Suggested Fixes |
|---|---|---|
| SE Down | Ping, UI/CLI, log review | Restart, redeploy, check vCenter/AWS logs |
| Controller Node Issues | show cluster, syslogs | Free disk, check NTP, fix HA sync, restart |
| Resource Bottlenecks | UI metrics, CLI top, logs | Increase vCPU/RAM, redistribute workloads |
Upgrades can fail for several reasons, including environment issues, version mismatches, or resource limitations. This section explains how to troubleshoot common upgrade failures.
Controller Upgrade Failures
Symptoms:
Upgrade stuck on one Controller node.
Cluster loses quorum or becomes unstable.
Web UI/API unavailable during upgrade.
Troubleshooting Steps:
Check Compatibility Matrix:
Confirm that your current version supports upgrade to the target version.
Look up VMware Avi Release Notes and Product Interoperability Matrix.
Ensure Backups Exist:
Backup config from UI: Administration > System > Backup.
Snapshot Controller VMs if possible (especially before major versions).
Check for Active SE Tasks:
Disk/Resource Validation:
Confirm each Controller node has sufficient:
CPU/RAM
Free disk space
Network access to the other nodes
Logs to Check:
/var/lib/avi/log/upgrade_mgr.log
show cluster status
show cluster runtime
Fixes:
Free up disk space or memory.
Resolve in-progress SE tasks before retrying.
Revert to snapshot if upgrade failed early.
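The disk-space portion of the pre-upgrade validation can be scripted; a sketch using Python's shutil.disk_usage (the 10 GB threshold is an assumption, not an official Avi requirement — check the release notes for your target version):

```python
import shutil

def has_free_disk(path="/", min_free_gb=10.0):
    """Return True if `path` has at least `min_free_gb` gigabytes free."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= min_free_gb
```

Run it on each Controller node (e.g. against the partition holding /var/lib/avi) before starting the upgrade.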
Service Engine (SE) Upgrade Failures
Symptoms:
SE upgrade fails or times out.
Virtual Services stay down after SE upgrade.
Incompatible SE version warning.
Troubleshooting Steps:
Version Compatibility:
SEs must not be newer than Controllers.
Controller must upgrade first.
Resource Availability:
SE host (e.g., ESXi, AWS) must have enough:
vCPU and RAM
Disk space
IPs for new SEs if redeployment is triggered
Deployment Errors:
If using "No Access Cloud," SEs are upgraded manually.
For automated clouds (vCenter, AWS), ensure API credentials are valid.
Logs to Check:
/var/lib/avi/log/ on SE
UI under Infrastructure > Service Engine Group > Events
Fixes:
Reattempt upgrade after resolving system errors.
Manually replace failed SE via the Replace SE feature.
If SE fails to boot post-upgrade, redeploy a new SE.
| Issue | Cause | Fix |
|---|---|---|
| Controller upgrade fails | Incompatibility, SE tasks, disk | Backup, check logs, free resources |
| SE upgrade fails | Host limits, API failure | Verify access, fix cloud config, redeploy |
| Post-upgrade instability | Mixed versions, HA failure | Rollback, reboot nodes, ensure quorum |
These scenarios are less common but often more difficult to diagnose. They usually involve external integrations, automation, or platform-level behavior.
DNS Resolution Failures
Symptoms:
Virtual Services that rely on DNS fail to reach backend servers.
Custom health monitors cannot resolve DNS names.
FQDN-based pool members remain down.
Troubleshooting Steps:
Verify DNS profile in the Avi Controller:
Confirm the external DNS server is reachable from both Controller and SEs.
Use nslookup or dig from the Controller CLI:
> shell
$ dig myapp.example.com
Check if DNS resolution works inside health monitor scripts (for HTTP monitors using hostnames).
Fixes:
Update DNS profile with correct IPs.
Ensure network/firewall rules allow DNS (UDP/53).
If using split DNS, validate internal zones are correctly routed.
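The dig/nslookup check can be automated across all FQDN pool members; a minimal sketch using Python's resolver (note that unlike dig, this also consults /etc/hosts, so compare both when results differ):

```python
import socket

def resolve(fqdn):
    """Resolve an FQDN via the OS resolver; return sorted unique addresses,
    or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(fqdn, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})
```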
API Errors and Automation Failures
Symptoms:
Automation tools (Postman, Ansible, Terraform) return error codes (403, 401, 429).
Integration scripts break after version upgrade.
Troubleshooting Steps:
Check authentication token validity.
Look at rate limiting headers or response codes.
Review Avi Controller logs for detailed error output:
/var/lib/avi/log/portal.log
/var/lib/avi/log/api.log
Use Avi's Postman Collection:
VMware provides a ready-made Postman collection to test REST APIs.
Use it to validate credentials, headers, and endpoints manually.
Fixes:
Regenerate expired tokens.
Switch to API keys or OAuth for long-running scripts.
Implement retries and backoff if hitting rate limits.
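The retry-with-backoff fix can be implemented generically around any API call; a sketch (the retried status codes mirror the 429 throttling responses mentioned above, plus 503 as an assumption):

```python
import time

def call_with_backoff(call, retries=4, base_delay=0.5, retry_on=(429, 503)):
    """Invoke call() -> (status, body); retry on throttling statuses,
    doubling the delay each attempt (exponential backoff)."""
    status, body = call()
    for attempt in range(retries):
        if status not in retry_on:
            break
        time.sleep(base_delay * (2 ** attempt))
        status, body = call()
    return status, body
```

Wrapping Terraform/Ansible-adjacent scripts this way avoids hard failures when the Controller briefly rate-limits automation traffic.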
Health Monitor Flapping
Symptoms:
Pool members repeatedly show up and down in the UI.
Logs show frequent toggling in health status.
Troubleshooting Steps:
View monitor settings: frequency, timeout, and retries.
Check server logs for signs of overload or slow response.
Use FlightPath to test backend manually and compare responses.
Root Causes:
Monitor interval is too aggressive.
Backend server has occasional errors or latency spikes.
Health check uses incorrect port or path.
Fixes:
Increase monitor interval and timeout.
Adjust retry/fail threshold (e.g., fail only after 3 consecutive misses).
Use passive health monitoring if suitable.
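The effect of a "fail only after 3 consecutive misses" threshold can be illustrated with a small state machine (this models the logic only; it is not Avi's implementation):

```python
class FailCounter:
    """Mark a member DOWN only after `fail_threshold` consecutive failed
    checks, and UP again after `rise_threshold` consecutive successes."""

    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.up = True
        self._fails = 0
        self._successes = 0

    def record(self, check_ok):
        """Feed one health-check result; return the member's current state."""
        if check_ok:
            self._fails = 0
            self._successes += 1
            if not self.up and self._successes >= self.rise_threshold:
                self.up = True
        else:
            self._successes = 0
            self._fails += 1
            if self.up and self._fails >= self.fail_threshold:
                self.up = False
        return self.up
```

A single slow response no longer flips the member, which is exactly why raising the fail threshold dampens flapping.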
| Use Case | Symptom | Common Fix |
|---|---|---|
| DNS Resolution | FQDNs not resolving | Fix DNS profile, test with dig/nslookup |
| API Failure | 401/403/429 errors | Check token, enable retry/backoff logic |
| Health Monitor Flapping | Backend marked up/down rapidly | Tune monitor intervals and thresholds |
When issues go beyond quick fixes, you may need to reset parts of your configuration or recover from backups. This section teaches how to do that safely and effectively.
Why it matters:
Backups help you recover from catastrophic failures like data corruption, upgrade failures, or misconfigurations.
How to Create Backups:
Navigate to Administration > System > Backup in the UI.
Options:
Full system backup (includes VS, Pools, SE Groups, Policies, etc.)
Scheduled backups: Create automatic daily/weekly jobs.
Export backups: Download to local storage or external destination.
Restoring Backups:
UI: Upload the .tar backup file and click Restore.
API: POST /api/backuprestore with backup file.
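A restore call can be scripted against that endpoint; this sketch only builds the request so its shape is visible (the authentication header and content type are assumptions — Avi deployments typically use session or token auth, so consult the API guide for your version):

```python
import urllib.request

def build_restore_request(controller, token, backup_bytes):
    """Build (without sending) a POST to the Controller's backuprestore
    endpoint. Auth scheme and content type here are assumptions."""
    return urllib.request.Request(
        url=f"https://{controller}/api/backuprestore",
        data=backup_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",          # assumed scheme
            "Content-Type": "application/octet-stream",  # assumed type
        },
    )
```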
Best Practice:
Sometimes it's better to reset than debug endlessly, especially for:
Broken configurations
Partial deployments
Testing new setups
Types of Resets:
| Scope | Description | Example Use Case |
|---|---|---|
| Partial | Reset a specific object (e.g., Virtual Service) | VS misconfigured, backend pool changed |
| Full | Factory reset of the Controller or SE | Lab rebuild, corrupted database |
Reset Methods:
Delete and recreate objects from UI or API.
CLI Factory Reset (Advanced):
On Controller shell:
> shell
$ reset-config
Use only if you fully intend to wipe the system!
When to Redeploy an SE:
SE is unreachable, corrupted, or fails to start.
SE upgrade fails and cannot be recovered.
How to Redeploy:
UI: Navigate to the SE group, click Replace SE.
The Controller will:
Remove the bad SE.
Deploy a new SE with identical configuration.
Rebalance VS traffic automatically.
For “No Access” Clouds:
Manually deploy the new SE image (OVA/AMI).
Use the CLI or UI to associate it with the correct SE Group.
| Action | Description | Tool / Method |
|---|---|---|
| Backup & Restore | Export or import full system config | UI > System > Backup / API |
| Partial Reset | Reset or delete specific VS/Pool | UI or API |
| Full Reset | Factory reset of Controller or SE | CLI reset-config (use with caution) |
| SE Redeploy | Replace damaged or unreachable SE | UI “Replace SE” or manual deploy |
To prevent issues before they happen, Avi provides built-in alerting and integration with external monitoring platforms.
Alerts for:
VS or SE down
Pool member failure
High CPU/memory usage
You can customize thresholds in Analytics Profiles (e.g., trigger alert when CPU > 85%).
Supported tools:
vRealize Operations
Splunk
ELK Stack
Kafka / Webhooks
SNMP Traps
Daily/weekly health reports
Custom dashboards per:
Application
Virtual Service
Tenant
| Topic | Key Takeaways |
|---|---|
| Troubleshooting Approach | Start with scoping, isolate with FlightPath and logs |
| Common Fixes | Address VIP, DNS, SSL, and LB config issues |
| Built-in Tools | Use FlightPath, CLI debug, logs |
| Controller/SE Failures | Check system health, HA sync, NTP/DNS, restart or redeploy |
| Upgrade Problems | Always backup, check compatibility, logs for failed upgrades |
| Advanced Use Cases | Diagnose DNS, API, monitor flapping with specific tools |
| Recovery Procedures | Use backups, resets, and redeploy SEs to restore health quickly |
| Proactive Monitoring | Set alerts, thresholds, and use external log/metric systems |
Root cause analysis (RCA) enables engineers to determine the fundamental reason behind failures, not just the symptoms. Avi supports structured RCA workflows post-incident.
Evidence Collection:
Logs: Pull from /var/lib/avi/log (Controller) or /opt/avi/log/ (SE)
FlightPath Traces: Packet flow visualization with policy/action tracebacks
SE Metrics: View resource usage at time of failure
Event Exports: Use API or UI to extract system events for timeline analysis
Post-Incident Reporting:
Use Controller UI → Events → Generate reports
Include:
Timeline of events
Impacted VS/SE
Recovery duration
Preventive suggestions
External Workflow Integration:
Forward reports or logs to:
Jira: For follow-up ticketing
ServiceNow: For full RCA lifecycle
Automate incident report exports via REST API
Diagnose and resolve issues caused by misconfigured HTTP policies, DataScripts, rate limiting, or WAF rules.
Redirect Loops:
Header Rewrite Failures:
Blocked Requests:
WAF rejecting valid traffic due to aggressive signatures
Misapplied rate limit thresholds
FlightPath:
Shows request → policy match → action taken
Helps locate drop, redirect, or block decisions
WAF Inspection:
Analyze rule ID, signature, and thresholds from rejected requests
Tune sensitivity or whitelist based on false positive analysis
CLI Commands:
show virtualservice <VS>
show pool <pool>
show policy http <policy>
RBAC and Authentication Issues
403 Forbidden / Access Denied:
User lacks sufficient role permissions
LDAP group mapping is misaligned with Avi RBAC role
SAML / LDAP Auth Issues:
Incorrect Identity Provider metadata or attributes
Expired certificates or unreachable auth endpoint
Admin Lockout:
Check Role Assignments:
CLI:
show user
show role <role-name>
Verify LDAP Bind and Group Match:
Validate with ldaptest in CLI
Ensure correct group_dn, search_base, attribute_name
Recover Admin Access:
Console access or VM boot into rescue mode
Re-enable local admin user or reset via DB command line
Packet Capture and Analysis
Use packet captures to inspect real-time data path behavior between clients, SEs, and backend servers.
Controller or SE Packet Capture:
tcpdump -i eth0 host <ip> and port <port>
Capture Both Sides:
Client ↔ SE
SE ↔ Pool Member
PCAP Export:
Download .pcap file and analyze in Wireshark
Useful for:
SSL handshake errors
TCP resets
Retransmissions
High Traffic Filtering:
Use filters like:
tcpdump -i eth0 'tcp and port 443 and net 10.10.0.0/16'
Configuration Database Issues
Symptoms:
UI inaccessible
Config changes not saving
Sync errors across cluster nodes
Backup First:
backupconfiguration or use UI/REST
Graceful DB Repair:
Restart pgsql service
Clear temp tables or stuck transactions
Full Rebuild:
CLI Tools for Diagnosis:
show cluster runtime
df -h for disk usage
journalctl -xe for DB errors
Preventive Maintenance:
Schedule fsck disk checks
Avoid abrupt shutdowns
Use supported upgrade paths
Audit logs help attribute config changes to users, correlate outages, and meet compliance.
Each audit entry records:
Object modified
Who made the change
Before/after values
Timestamp
Viewing Audit Logs:
UI: Administration > Audit Logs
CLI:
show logs audit
External Integration:
Export to:
Syslog
Splunk
SIEM systems
Enable RBAC tag mapping in audit logs
Retain logs for minimum 90 days
Review before/after major changes (e.g., upgrades, policy edits)
Zero-Downtime Debugging
These techniques support live system debugging without taking down Virtual Services or SEs.
show virtualservice runtime:
Flow Table Inspection:
SE Interface Monitoring:
Elastic HA Failover:
Real-time SE redistribution
Session preservation (if configured)
Intermittent Issues:
Latency spikes
Asymmetric routing
Packet loss from upstream issues
| Area | Tool/Method | Notes |
|---|---|---|
| RCA | Logs, FlightPath, Event Reports | Use for timeline reconstruction |
| Security Rules | FlightPath, CLI, WAF Debug | Detect drops, misfires |
| RBAC | LDAP, CLI, Group Mapping | Recover from lockouts |
| Packet Debug | tcpdump, PCAP Export | Wireshark for deep inspection |
| DB Repair | Backup/Restore, DB tools | CLI and monitoring |
| Audit Trail | UI, API, Syslog | Compliance & traceability |
| Zero Downtime Debug | Flow stats, runtime monitor | HA-aware, non-intrusive |
What is the first component to check if a Virtual Service is not responding?
Verify that the Service Engine hosting the Virtual Service is operational.
Virtual Services run on Service Engines. If the associated Service Engine is down or unreachable, the Virtual Service will not accept connections.
Administrators should check:
Service Engine health status
network connectivity
resource utilization
If the Service Engine fails, the Controller may redeploy or migrate the service to another engine.
Why might backend servers appear as DOWN in Avi?
Because health monitor checks are failing.
Avi uses health monitors to verify backend server availability.
If a server fails the health check, it is marked DOWN and removed from the load balancing pool.
Common causes include:
incorrect health monitor configuration
firewall blocking health check traffic
application service not responding
What tool within Avi helps diagnose application performance problems?
The Avi analytics dashboard.
The analytics dashboard provides detailed metrics including:
request latency
server response times
error rates
connection statistics
Administrators can use these metrics to quickly identify application bottlenecks.
What could cause a Service Engine to fail deployment?
Infrastructure integration problems such as vCenter permissions or resource limitations.
The Controller relies on infrastructure APIs to deploy Service Engines.
If permissions or resource availability are insufficient, deployment fails.
Administrators should verify:
vCenter credentials
datastore availability
network mappings
Why would clients experience intermittent connectivity to a Virtual Service?
Because of Service Engine resource exhaustion or network instability.
If a Service Engine reaches CPU or memory limits, connection handling may degrade.
Monitoring resource utilization can reveal the issue.
Deploying additional Service Engines often resolves the problem.
What diagnostic step should be taken if health monitors fail unexpectedly?
Verify monitor configuration and application availability.
Health monitors rely on correct protocol, port, and path configuration.
If the monitor does not match the application response, servers will incorrectly appear DOWN.
Administrators should confirm:
protocol settings
port numbers
response codes