Troubleshooting and Repairing

1. Troubleshooting Approach

When problems arise in your Avi environment, you need a structured approach. Without structure, you’ll waste time and may misdiagnose the issue.

1.1 General Troubleshooting Flow

Define the problem:
- Is the issue about access (users can’t reach the app)?
- Is it about performance (app is slow)?
- Is it about availability (service is down)?
- Or is it a misconfiguration (unexpected behavior)?
Scope the impact:
- Is it affecting just one Virtual Service (VS)?
- One Service Engine (SE)?
- Or the entire deployment (global issue)?
Use telemetry and logs to isolate cause:
- Check metrics: latency, error rates, CPU/Memory usage.
- Review logs: access logs, system events, health monitor alerts.
Use built‑in tools like FlightPath and logs:
- The Avi Controller has tools for tracing requests and seeing where things broke.
- Use logs to reconstruct what happened around the time of the failure.

1.2 Key Diagnostic Questions

Here are questions to ask early in a troubleshooting session:

Is traffic reaching the Virtual Service (VS) at all?
Are backend pool members healthy? Are they up or marked down?
Are DNS names and IPs resolving correctly?
Is SSL/TLS termination happening correctly (if HTTPS)?

These questions help you quickly eliminate large domains of potential issues.
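
The questions above can be treated as an ordered checklist in which the first "no" answer points to the area to investigate. The following is a minimal Python sketch of that idea; the key names, hint texts, and the function `first_failing_area` are illustrative assumptions and not part of any Avi tooling.

```python
from typing import Dict, Optional

# Ordered checklist mirroring the four questions above.
# Keys and hints are hypothetical labels chosen for this sketch.
CHECKS = [
    ("traffic_reaches_vs", "Network path: VIP assignment, routing, firewall"),
    ("pool_members_healthy", "Backend: pool member state and health monitors"),
    ("dns_resolves", "DNS: names and IPs used by clients and pools"),
    ("tls_terminates", "SSL/TLS: certificate and SSL profile on the VS"),
]


def first_failing_area(answers: Dict[str, bool]) -> Optional[str]:
    """Return the hint for the first question answered False.

    Questions that are missing from `answers` are treated as
    "not checked yet" and skipped. Returns None if nothing failed.
    """
    for key, hint in CHECKS:
        if answers.get(key) is False:
            return hint
    return None
```

For example, passing `{"traffic_reaches_vs": True, "pool_members_healthy": False}` returns the backend hint, which corresponds to section 2.2.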

2. Common Issue Categories and Fixes

This section outlines the most frequent problems you’ll encounter in production environments and how to systematically resolve them.

2.1 Virtual Service Not Reachable

Symptoms:

Application appears down or not responding.
No traffic observed on the backend servers.

Troubleshooting Steps:

Check VIP assignment:
- Was the IP assigned from IPAM?
- Is the IP reachable (ping from client/another host)?
Verify routing/firewall:
- Can the client route to the SE data IP?
- Are any firewall rules blocking access?
Validate health monitor status:
- If the health monitor marks every pool member down, the SE has no healthy server to forward traffic to.

Common Fixes:

Correct IP allocation errors or reassign VIP.
Add required routes to SE subnet.
Allow inbound ports (80, 443, etc.) on firewalls.
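
A ping only proves ICMP reachability. To confirm that the VIP actually accepts connections on the service ports, a TCP connect test from a client host is more telling. The sketch below uses only the Python standard library; the function name `tcp_port_open` and the example address are assumptions for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Refused connections, timeouts, and resolution failures all raise
    subclasses of OSError, so they are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    vip = "203.0.113.10"  # placeholder VIP from the documentation address range
    for port in (80, 443):
        state = "open" if tcp_port_open(vip, port) else "closed or filtered"
        print(f"{vip}:{port} is {state}")
```

A port that is open from inside the SE subnet but closed from the client network suggests a routing or firewall problem rather than a VS problem.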

2.2 Backend Pool Members Down

Symptoms:

VS is up, but no real traffic flows to backend.
Health monitor shows pool members as down.

Troubleshooting Steps:

Review pool status in Avi UI.
Check health monitor logs for failure reason.
Try curl or telnet from SE to backend IP:Port to test reachability.

Common Fixes:

Fix application (backend server crashed).
Update IPs/ports in pool if misconfigured.
Adjust health monitor interval/timeout settings.
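
When a health monitor reports a member as down, it helps to reproduce the probe by hand with the same path and timeout. The following sketch sends a single HTTP GET and reports the outcome. It is not how Avi implements its monitors; the function name `http_probe`, the default timeout, and the example URL are assumptions.

```python
import urllib.error
import urllib.request
from typing import Tuple

# Bypass any proxy configured in the environment so the probe goes
# straight to the backend, as a health monitor would.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_probe(url: str, timeout: float = 4.0) -> Tuple[bool, str]:
    """Send one GET request and return (healthy, reason).

    A 2xx response counts as healthy. 4xx/5xx responses, connection
    errors, and timeouts count as unhealthy.
    """
    try:
        with _DIRECT.open(url, timeout=timeout) as resp:
            return (200 <= resp.status < 300, f"HTTP {resp.status}")
    except urllib.error.HTTPError as exc:  # must come before URLError
        return (False, f"HTTP {exc.code}")
    except (urllib.error.URLError, OSError) as exc:
        return (False, f"unreachable: {exc}")


if __name__ == "__main__":
    # Placeholder backend address and health path.
    print(http_probe("http://192.0.2.21:8080/health"))
```

If the manual probe succeeds with a longer timeout but fails with the monitor's timeout, the fix is likely on the monitor settings rather than on the server.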

2.3 SSL Errors

Symptoms:

Browsers show "insecure connection", "certificate error", or handshake failures.
Clients unable to connect via HTTPS.

Troubleshooting Steps:

Check certificate validity (expired? mismatched domain name?).
Verify SSL profile assigned to the VS.
Confirm backend supports expected TLS version/ciphers.

Common Fixes:

Replace expired cert or upload full certificate chain.
Enable the correct TLS version (e.g., TLS 1.2 or 1.3).
Modify cipher suite in the SSL profile.
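
Certificate expiry is the easiest of these checks to automate. The sketch below reads the certificate presented on a VIP and computes the days remaining. It relies only on the Python standard library; the function names and the 30-day warning threshold in the usage example are assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate 'notAfter' timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example 'Jan  5 09:34:43 2030 GMT'. Negative means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Complete a verified TLS handshake and return the 'notAfter' field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    host = "myapp.example.com"  # placeholder FQDN of the Virtual Service
    remaining = days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))
    print(f"{host}: {remaining:.0f} days left", "(renew soon)" if remaining < 30 else "")
```

Because the handshake is verified, an expired certificate, a hostname mismatch, or an incomplete chain raises `ssl.SSLCertVerificationError` in `fetch_not_after`, and the error message itself usually names the cause.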

2.4 Load Balancing Doesn’t Work Properly

Symptoms:

Some users always hit the same backend.
Uneven traffic distribution.

Troubleshooting Steps:

Check load balancing algorithm in pool:
- Round Robin? Least Connections?
Verify session persistence (sticky sessions) settings.
Inspect pool member weights.

Common Fixes:

Adjust or disable persistence if not needed.
Rebalance member weights.
Switch to a different load balancing method if more suitable.
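
To tell persistence effects apart from a genuinely skewed algorithm or weight setting, it can help to tally exported access-log records by server and by client. The sketch below assumes the records have already been reduced to `(client_ip, server)` pairs; that shape and both function names are assumptions, not an Avi log format.

```python
from collections import Counter, defaultdict


def server_share(records):
    """Return {server: fraction of all requests} for (client_ip, server) pairs."""
    counts = Counter(server for _, server in records)
    total = sum(counts.values())
    return {server: n / total for server, n in counts.items()} if total else {}


def pinned_clients(records, min_requests=5):
    """Return clients whose requests (at least min_requests) all hit one server.

    Many pinned clients usually point to session persistence rather
    than to the load balancing algorithm itself.
    """
    per_client = defaultdict(Counter)
    for client, server in records:
        per_client[client][server] += 1
    return sorted(
        client
        for client, servers in per_client.items()
        if len(servers) == 1 and sum(servers.values()) >= min_requests
    )
```

If the shares are uneven but no client is pinned, member weights or the algorithm are the more likely cause. If a few busy clients are pinned, persistence explains the skew.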

2.5 Slow Performance or High Latency

Symptoms:

Pages load slowly.
Timeouts or delays in API responses.

Troubleshooting Steps:

Check backend server performance (CPU, memory, app response).
Measure TLS handshake time in logs or FlightPath.
Look for misconfigured compression or excessive caching.
Monitor SE resource usage and bandwidth.

Common Fixes:

Tune backend app or scale out pool.
Allocate more vCPUs to the SE to increase SSL offload capacity.
Optimize caching/compression settings.
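
Averages hide the slow requests that users complain about, so it is worth comparing the median with the tail when judging latency. The sketch below summarizes a list of response-time samples in milliseconds; where the samples come from (exported logs, a load test) and the function name are assumptions.

```python
import statistics


def latency_summary(samples_ms):
    """Return the 50th, 95th, and 99th percentile of latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A p99 far above the p50 suggests intermittent stalls, such as an overloaded pool member or SE CPU saturation, while a uniformly high p50 points to a consistently slow backend or network path.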

Summary: Common Issue Fixes

Issue Type	First Checks	Common Fixes
VS Not Reachable	VIP assignment, routes, firewall	Fix IP config, add routes, open ports
Backend Down	Health monitors, server status	Restart app, fix IP/port, tune monitor
SSL Problems	Cert status, TLS versions	Replace certs, adjust cipher profiles
Uneven Load Balancing	Pool config, persistence	Change LB method, reweight members
Slow/High Latency	TLS timing, server RTT, SE CPU	Upgrade SE, tune TCP/SSL, scale backend

3. Tools for Troubleshooting

VMware Avi includes powerful diagnostic tools, both graphical (UI) and command-line (CLI), that can help you pinpoint problems quickly.

3.1 FlightPath

What is it?
FlightPath is Avi’s built-in end-to-end request tracing tool.

How to Use:

Go to the Virtual Service page in the Controller UI.
Click “FlightPath”.
Input:
- Source IP address
- Protocol (e.g., HTTP, TCP)
- Destination host or VS IP

What it Shows:

The complete traffic flow:
- Client → VIP (hosted on the SE) → Pool → Server
Each processing stage:
- DNS lookup
- SSL handshake
- Request headers
- Server response
Decision points like policy matches or errors

Use Case Examples:

Find out why traffic is not reaching the backend
Identify which server returned a 5xx error
See which rule or redirect was triggered

Best Practice:
Use FlightPath before changing any configs. It visualizes what’s actually happening under the hood.

3.2 CLI Debug Commands

Avi provides a comprehensive CLI via SSH or the Controller terminal.

Basic Useful Commands:

> show virtualservice <VS-Name>

Shows real-time status: up/down, IPs, pool association, metrics.

> debug virtualservice <VS-Name>

Enables debug logging for deeper inspection.

> show logs

Displays system logs: includes errors, health monitor results, config changes.

Advanced Options:

You can drill into pools, servers, SEs, and logs using tab-completion in CLI.
Filter logs by keyword or time to isolate problems.

When to Use:

When the UI isn’t available
During complex issues where FlightPath doesn’t show enough detail
When following a step-by-step diagnostic script

3.3 Log Analysis

Avi captures several types of logs:

Log Type	Description
Access Logs	Logs every HTTP(S) request processed by a VS
System Events	Cluster status, configuration changes, alerts
Health Monitor	Tracks when pool members go up/down, response times

Viewing Logs:

Use the UI: Operations > Events / Logs
Use CLI: show logs
Export to external tools like:
- Syslog
- Kafka
- Elasticsearch
- vRealize Log Insight

Best Practices:

Use log filters to isolate by Virtual Service, timestamp, or error code.
Cross-reference logs with FlightPath results for full context.
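
The same filtering can be applied to logs after they have been exported. The sketch below filters parsed log entries by Virtual Service, time window, and minimum status code. The dictionary keys (`time`, `vs`, `status`) are a hypothetical shape chosen for this example and would need to be mapped to the fields of the actual export.

```python
from datetime import datetime


def filter_logs(entries, vs_name=None, start=None, end=None, min_status=None):
    """Yield the entries that match every filter that is not None.

    Each entry is a dict with a datetime under 'time', the Virtual
    Service name under 'vs', and an integer HTTP status under 'status'.
    """
    for entry in entries:
        if vs_name is not None and entry["vs"] != vs_name:
            continue
        if start is not None and entry["time"] < start:
            continue
        if end is not None and entry["time"] > end:
            continue
        if min_status is not None and entry["status"] < min_status:
            continue
        yield entry


if __name__ == "__main__":
    logs = [
        {"time": datetime(2024, 1, 1, 12, 0), "vs": "web-vs", "status": 200},
        {"time": datetime(2024, 1, 1, 12, 5), "vs": "web-vs", "status": 503},
    ]
    for hit in filter_logs(logs, vs_name="web-vs", min_status=500):
        print(hit["time"].isoformat(), hit["status"])
```

Narrowing to one Virtual Service and a window of a few minutes around the reported failure keeps the result small enough to compare line by line with a request trace.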

Summary: Tools for Troubleshooting

Tool	What It Helps With	How to Access
FlightPath	Visual trace of client → backend	Controller UI > Virtual Service
CLI Debug	Status, logs, enable tracing	SSH or console CLI
Log Analysis	Understand errors, request details, failures	UI / CLI / Syslog integrations

4. SE and Controller Troubleshooting

This section focuses on diagnosing and repairing infrastructure-related issues — particularly those involving Service Engines and Controller nodes.

4.1 Service Engine (SE) Failures

Common Symptoms:

SE shows as down or unreachable in the UI.
VS traffic is not processed (especially in Active/Standby mode).
High latency or packet loss.

Troubleshooting Steps:

Check infrastructure status (vCenter, AWS, Azure, etc.):
- Is the VM running?
- Is it responding to ping?
Check SE health from the Controller:
- UI: Infrastructure > Service Engines
- CLI: show serviceengine <SE-Name>
Verify system resources:
- Is CPU/memory usage too high?
- Are disk or vNICs overloaded?
Check networking:
- Can the Controller reach the SE?
- Can the SE reach VIP and pool IPs?

Fixes:

Restart SE from the UI or CLI.
Redeploy SE if it’s corrupted or stuck.
Review cloud integration (credentials, APIs, placement settings).

4.2 Controller Node Issues

Common Symptoms:

One or more Controller nodes are unreachable.
Cluster shows sync errors or HA issues.
UI/API becomes unresponsive.

Troubleshooting Steps:

Ping each node from within the network.

Use CLI:

> show cluster status
> show cluster runtime

Check system resources:
- Disk space: df -h
- Memory/CPU load: top or vmstat
Inspect logs for HA sync issues or node communication problems.
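
The disk check lends itself to a small script that can run on a schedule. The following sketch reports how full a filesystem is using the Python standard library. The 90% threshold and the function names are assumptions; the figure can differ slightly from the `Use%` column of `df -h`, which excludes space reserved for root.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return used space on the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # pseudo-filesystems can report zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def disk_is_low(path: str = "/", threshold_percent: float = 90.0) -> bool:
    """Return True if usage is at or above the (assumed) threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    print(f"/ is {pct:.1f}% full", "(low on space)" if disk_is_low("/") else "")
```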

Typical Causes:

DNS/NTP mismatch between nodes
Disk full on a Controller VM
IP address change without updating cluster settings

Fixes:

Restart the failed node (gracefully).
Rejoin the node using rejoin cluster command (advanced).
Ensure consistent DNS and NTP across all nodes.

Best Practices for SE/Controller Health

Component	Best Practice
SE	Use fault domains to spread across hosts/AZs
Controller	Use 3-node clusters with quorum monitoring
Both	Monitor resource usage and alerts regularly

Summary: SE and Controller Troubleshooting

Problem Type	Key Tools	Suggested Fixes
SE Down	Ping, UI/CLI, log review	Restart, redeploy, check vCenter/AWS logs
Controller Node Issues	`show cluster`, syslogs	Free disk, check NTP, fix HA sync, restart
Resource Bottlenecks	UI metrics, CLI `top`, logs	Increase vCPU/RAM, redistribute workloads

5. Upgrade Troubleshooting

Upgrades can fail for several reasons, including environment issues, version mismatches, or resource limitations. This section explains how to troubleshoot common upgrade failures.

5.1 Controller Upgrade Fails

Symptoms:

Upgrade stuck on one Controller node.
Cluster loses quorum or becomes unstable.
Web UI/API unavailable during upgrade.

Troubleshooting Steps:

Check Compatibility Matrix:
- Confirm that your current version supports upgrade to the target version.
- Look up VMware Avi Release Notes and Product Interoperability Matrix.
Ensure Backups Exist:
- Back up the config from the UI: Administration > System > Backup.
- Snapshot Controller VMs if possible (especially before major versions).
Check for Active SE Tasks:
- Ongoing SE upgrades or scaling jobs may block Controller upgrade.
Disk/Resource Validation:
- Confirm each Controller node has sufficient:
  - CPU/RAM
  - Free disk space
  - Network access to the other nodes

Logs to Check:

/var/lib/avi/log/upgrade_mgr.log
show cluster status
show cluster runtime

Fixes:

Free up disk space or memory.
Resolve in-progress SE tasks before retrying.
Revert to snapshot if upgrade failed early.

5.2 Service Engine (SE) Upgrade Issues

Symptoms:

SE upgrade fails or times out.
Virtual Services stay down after SE upgrade.
Incompatible SE version warning.

Troubleshooting Steps:

Version Compatibility:
- SEs must not be newer than Controllers.
- Controller must upgrade first.
Resource Availability:
- SE host (e.g., ESXi, AWS) must have enough:
  - vCPU and RAM
  - Disk space
  - IPs for new SEs if redeployment is triggered
Deployment Errors:
- If using "No Access Cloud," SEs are upgraded manually.
- For automated clouds (vCenter, AWS), ensure API credentials are valid.
Logs to Check:
- /var/lib/avi/log/ on SE
- UI under Infrastructure > Service Engine Group > Events
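
The version rule above (an SE must not be newer than its Controller) is easy to get wrong when versions are compared as strings, because "22.1.10" sorts before "22.1.9" alphabetically. The sketch below compares versions numerically. It assumes plain dot-separated integers such as "22.1.3"; patch suffixes would need extra parsing, and the function names are illustrative.

```python
def parse_version(version: str):
    """Convert '22.1.3' to (22, 1, 3). Assumes dot-separated integers only."""
    return tuple(int(part) for part in version.split("."))


def se_version_ok(controller: str, se: str) -> bool:
    """Return True if the SE version is not newer than the Controller version."""
    return parse_version(se) <= parse_version(controller)


if __name__ == "__main__":
    # Placeholder version numbers for illustration.
    print(se_version_ok("22.1.10", "22.1.9"))  # SE older than Controller
    print(se_version_ok("21.1.6", "22.1.3"))   # SE newer than Controller
```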

Fixes:

Reattempt upgrade after resolving system errors.
Manually replace failed SE via the Replace SE feature.
If SE fails to boot post-upgrade, redeploy a new SE.

Summary: Upgrade Troubleshooting

Issue	Cause	Fix
Controller upgrade fails	Incompatibility, SE tasks, disk	Back up, check logs, free resources
SE upgrade fails	Host limits, API failure	Verify access, fix cloud config, redeploy
Post-upgrade instability	Mixed versions, HA failure	Rollback, reboot nodes, ensure quorum

6. Debugging Advanced Use Cases

These scenarios are less common but often more difficult to diagnose. They usually involve external integrations, automation, or platform-level behavior.

6.1 DNS Resolution Issues

Symptoms:

Virtual Services that rely on DNS fail to reach backend servers.
Custom health monitors cannot resolve DNS names.
FQDN-based pool members remain down.

Troubleshooting Steps:

Verify DNS profile in the Avi Controller:
- Go to Administration > DNS Profiles
Confirm the external DNS server is reachable from both Controller and SEs.
Use nslookup or dig from the Controller CLI:
```
> shell
$ dig myapp.example.com
```
Check if DNS resolution works inside health monitor scripts (for HTTP monitors using hostnames).

Fixes:

Update DNS profile with correct IPs.
Ensure network/firewall rules allow DNS (UDP and TCP port 53).
If using split DNS, validate internal zones are correctly routed.
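
Where `dig` or `nslookup` is not available, the same resolution test can be scripted. The sketch below resolves a name through the resolver of the host it runs on, so it only reflects what the Controller or SE would see if it is run from a host that uses the same DNS servers. The function name `resolve` is an assumption.

```python
import socket


def resolve(hostname: str):
    """Return the sorted unique IP addresses for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    name = "myapp.example.com"
    addresses = resolve(name)
    print(name, "->", addresses if addresses else "resolution failed")
```

With split DNS, running the check from both an internal and an external host and comparing the two address lists shows quickly whether each side sees the zone it should.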

6.2 API Failures

Symptoms:

Automation tools (Postman, Ansible, Terraform) return HTTP error codes such as 401 (not authenticated), 403 (authenticated but not permitted), or 429 (rate limited).
Integration scripts break after version upgrade.

Troubleshooting Steps:

Check authentication token validity.
Look at rate limiting headers or response codes.
Review Avi Controller logs for detailed error output:
- /var/lib/avi/log/portal.log
- /var/lib/avi/log/api.log

Use Avi's Postman Collection:

VMware provides a ready-made Postman collection to test REST APIs.
Use it to validate credentials, headers, and endpoints manually.

Fixes:

Regenerate expired tokens.
Switch to API keys or OAuth for long-running scripts.
Implement retries and backoff if hitting rate limits.
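
The retry-and-backoff fix can be sketched independently of any HTTP client. In the example below the caller supplies a function that performs the request and raises `RateLimited` when the response status is 429. Both `RateLimited` and `call_with_backoff` are hypothetical names for this sketch, and the delays are assumed defaults rather than values recommended by Avi.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's request function when it receives HTTP 429."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on RateLimited with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    RateLimited once max_attempts calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Only the rate-limit case is retried here. Authentication errors (401, 403) are left to propagate, because repeating the same request with the same credentials would not change the outcome.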

6.3 Health Monitor Flapping

Symptoms:

Pool members repeatedly show up and down in the UI.
Logs show frequent toggling in health status.

Troubleshooting Steps:

View monitor settings: frequency, timeout, and retries.
Check server logs for signs of overload or slow response.
Use FlightPath to test backend manually and compare responses.

Root Causes:

Monitor interval is too aggressive.
Backend server has occasional errors or latency spikes.
Health check uses incorrect port or path.

Fixes:

Increase monitor interval and timeout.
Adjust retry/fail threshold (e.g., fail only after 3 consecutive misses).
Use passive health monitoring if suitable.
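
The effect of the fail threshold can be seen by replaying a series of probe results against different settings. The following is a simplified model, not Avi's monitor logic; the parameter names `fail_threshold` and `success_threshold` are assumptions and do not correspond to Avi field names.

```python
def state_transitions(probe_results, fail_threshold=1, success_threshold=1):
    """Count up/down state changes for a sequence of probe results.

    probe_results is an iterable of booleans (True = probe succeeded).
    The member starts up, goes down after fail_threshold consecutive
    failures, and comes back up after success_threshold consecutive
    successes.
    """
    up = True
    fails = successes = transitions = 0
    for ok in probe_results:
        if ok:
            successes += 1
            fails = 0
            if not up and successes >= success_threshold:
                up = True
                transitions += 1
        else:
            fails += 1
            successes = 0
            if up and fails >= fail_threshold:
                up = False
                transitions += 1
    return transitions
```

For a server that misses an occasional probe, for example `[True, False, True, True, False, True]`, a threshold of 1 yields four state changes, while a threshold of 3 yields none. A server that is really down still gets marked down, only a few probe intervals later.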

Summary: Advanced Use Case Troubleshooting

Use Case	Symptom	Common Fix
DNS Resolution	FQDNs not resolving	Fix DNS profile, test with dig/nslookup
API Failure	401/403/429 errors	Check token, enable retry/backoff logic
Health Monitor Flapping	Backend marked up/down rapidly	Tune monitor intervals and thresholds

7. Recovery and Repair Procedures

When issues go beyond quick fixes, you may need to reset parts of your configuration or recover from backups. This section teaches how to do that safely and effectively.

7.1 Configuration Backups

Why it matters:
Backups help you recover from catastrophic failures like data corruption, upgrade failures, or misconfigurations.

How to Create Backups:

Navigate to Administration > System > Backup in the UI.
Options:
- Full system backup (includes VS, Pools, SE Groups, Policies, etc.)
- Scheduled backups: Create automatic daily/weekly jobs.
- Export backups: Download to local storage or external destination.

Restoring Backups:

UI: Upload the .tar backup file and click Restore.
API: POST /api/backuprestore with backup file.

Best Practice:

Schedule daily backups, and take an additional backup before major changes such as upgrades or SE redeployments.

7.2 Resetting Configurations

Sometimes it's better to reset than debug endlessly, especially for:

Broken configurations
Partial deployments
Testing new setups

Types of Resets:

Scope	Description	Example Use Case
Partial	Reset a specific object (e.g., Virtual Service)	VS misconfigured, backend pool changed
Full	Factory reset of the Controller or SE	Lab rebuild, corrupted database

Reset Methods:

Delete and recreate objects from UI or API.
CLI Factory Reset (Advanced):
On Controller shell:
```
> shell
$ reset-config
```
Use only if you fully intend to wipe the system!

7.3 SE Redeployment

When to Redeploy an SE:

SE is unreachable, corrupted, or fails to start.
SE upgrade fails and cannot be recovered.

How to Redeploy:

UI: Navigate to the SE group, click Replace SE.
The Controller will:
- Remove the bad SE.
- Deploy a new SE with identical configuration.
- Rebalance VS traffic automatically.

For “No Access” Clouds:

Manually deploy the new SE image (OVA/AMI).
Use the CLI or UI to associate it with the correct SE Group.

Summary: Recovery & Repair Actions

Action	Description	Tool / Method
Backup & Restore	Export or import full system config	UI > System > Backup / API
Partial Reset	Reset or delete specific VS/Pool	UI or API
Full Reset	Factory reset of Controller or SE	CLI `reset-config` (use with caution)
SE Redeploy	Replace damaged or unreachable SE	UI “Replace SE” or manual deploy

8. Proactive Monitoring and Alerting (BONUS)

To prevent issues before they happen, Avi provides built-in alerting and integration with external monitoring platforms.

8.1 Alerts and Thresholds

Alerts for:
- VS or SE down
- Pool member failure
- High CPU/memory usage
You can customize thresholds in Analytics Profiles (e.g., trigger alert when CPU > 85%).

8.2 Monitoring Integrations

Supported tools:
- vRealize Operations
- Splunk
- ELK Stack
- Kafka / Webhooks
- SNMP Traps

8.3 Reports and Dashboards

Daily/weekly health reports
Custom dashboards per:
- Application
- Virtual Service
- Tenant

Final Summary: Troubleshooting & Repair Module

Topic	Key Takeaways
Troubleshooting Approach	Start with scoping, isolate with FlightPath and logs
Common Fixes	Address VIP, DNS, SSL, and LB config issues
Built-in Tools	Use FlightPath, CLI debug, logs
Controller/SE Failures	Check system health, HA sync, NTP/DNS, restart or redeploy
Upgrade Problems	Always back up, check compatibility, review logs for failed upgrades
Advanced Use Cases	Diagnose DNS, API, monitor flapping with specific tools
Recovery Procedures	Use backups, resets, and redeploy SEs to restore health quickly
Proactive Monitoring	Set alerts, thresholds, and use external log/metric systems