Designing a VMware or VCF-based platform requires not only technical knowledge but also the ability to interpret business needs, translate them into technical requirements, and build an architecture that is scalable, secure, and operationally sustainable.
Functional requirements describe what the platform must do, from a capability perspective.
Examples:
“Support 200 virtual machines.”
“Run 5 business applications with strict uptime requirements.”
“Provide High Availability (HA) for all production workloads.”
“Enable VM-level encryption.”
“Support hybrid connectivity to public cloud.”
Functional requirements drive feature-level decisions, such as:
Whether to deploy vSAN, NSX, DRS, FT, or Tanzu.
Whether you need stretched clusters for cross-site HA.
Whether additional components (e.g., load balancers, VPNs) are needed.
Non-functional requirements describe how well the system must perform its functions.
Typical categories:
Availability
Required uptime (e.g., 99.9%, 99.99%); a downtime-budget sketch follows this list.
Determines redundancy levels, cluster count, HA configuration.
RPO/RTO (Recovery Point Objective / Recovery Time Objective)
RPO: how much data loss is acceptable.
RTO: how long recovery may take.
Performance
Throughput, latency, and response-time targets under normal and peak load.
Security
Encryption, access control, and compliance requirements.
Manageability
Operational effort needed to run, monitor, patch, and troubleshoot the platform.
Scalability
Expected growth rate.
Determines number of clusters, hosts, storage capacity.
Supportability
Vendor support coverage, HCL compliance, and supported software versions.
Non-functional requirements ultimately shape architecture quality and sustainability.
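To make the availability targets above concrete, here is a minimal sketch (illustrative figures only) that converts an uptime percentage into an annual downtime budget:

```python
# Convert an availability target into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (0.999, 0.9999):
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} uptime -> {downtime_min:.0f} min/year "
          f"({downtime_min / 60:.1f} h)")
# 99.90% uptime allows ~526 min/year (~8.8 h);
# 99.99% allows only ~53 min/year (~0.9 h).
```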
Constraints limit your design. They cannot be changed and must be respected.
Common constraints:
Budget
Limits hardware choices, licensing tiers, and redundancy options.
Existing hardware
Must reuse what is available; may impose compatibility or performance limits.
Datacenter location & space
Physical constraints like racks, power, cooling, room layout.
Licensing
Determines what VMware features you can enable (e.g., NSX editions, vSAN RAID-5 eligibility).
Recognizing constraints early prevents unrealistic or unsupported designs.
Assumptions fill gaps where information is missing.
They must be clearly documented because they affect design decisions.
Examples:
“Workload growth is assumed to be 15% per year.”
“Batch processing occurs only at night.”
“External load balancer is provided by the network team.”
Assumptions are not facts — they must be validated later.
A good architect proactively identifies risks such as:
Single points of failure
Vendor lock-in
Skills gaps
Aggressive timelines
Risks should be mitigated through design choices or operational planning.
Understanding workload behavior is essential:
Average usage → Long-term capacity planning.
Peak usage → Determines performance and cluster sizing.
A platform sized only for average usage may fail under spikes.
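A small sketch of the difference, using hypothetical hourly demand samples; the mean suggests far less capacity than the 95th-percentile peak:

```python
import statistics

# Hypothetical hourly CPU demand samples (GHz) for one application tier.
samples = [40, 42, 45, 44, 41, 90, 95, 43, 42, 88, 46, 44]

avg = statistics.mean(samples)
p95 = sorted(samples)[int(0.95 * (len(samples) - 1))]  # simple percentile pick

print(f"average demand: {avg:.1f} GHz")  # long-term capacity trend
print(f"95th percentile: {p95} GHz")     # sizing target for peaks
# Sizing to the 55 GHz average would saturate during the ~90 GHz bursts.
```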
Different workloads behave differently:
Batch workloads
Short bursts of high CPU/disk usage.
Affect storage and compute planning.
Steady workloads
Predictable resource usage.
Good candidates for reservations.
Accurate classification improves capacity modeling.
Compute sizing includes:
Determining vCPU-to-pCPU ratios.
Allocating enough RAM to prevent ballooning/swapping.
Ensuring CPU/memory align with NUMA boundaries.
More but smaller hosts
Better HA failover granularity.
Reduced resource “blast radius” if a host fails.
More network/management overhead.
Fewer but larger hosts
Higher consolidation efficiency.
Lower hardware cost per VM.
Larger failure impact; NUMA alignment critical.
Balance depends on workload type and budget.
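A minimal sizing sketch comparing the two strategies; the VM counts, overcommit ratio, and host specifications below are assumptions, not recommendations:

```python
import math

VM_COUNT = 200            # from the functional requirement above
VCPU_PER_VM, GB_PER_VM = 4, 16
VCPU_TO_PCPU = 4          # assumed overcommit ratio
HA_SPARE = 1              # keep one host's worth of capacity free (N+1)

def hosts_needed(cores_per_host: int, ram_gb_per_host: int) -> int:
    """Hosts required to satisfy CPU and RAM demand, plus N+1 spare."""
    by_cpu = math.ceil(VM_COUNT * VCPU_PER_VM / (cores_per_host * VCPU_TO_PCPU))
    by_ram = math.ceil(VM_COUNT * GB_PER_VM / ram_gb_per_host)
    return max(by_cpu, by_ram) + HA_SPARE

print("smaller hosts (32 cores, 512 GB):", hosts_needed(32, 512))   # 8 hosts
print("larger hosts  (64 cores, 1024 GB):", hosts_needed(64, 1024)) # 5 hosts
```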
Storage sizing considerations include:
Raw capacity needed
Effective capacity after RAID/FTT overhead
IOPS and throughput requirements
Latency sensitivity of applications
Examples:
RAID-1 (FTT=1): uses 2x capacity.
RAID-5: uses ~1.33x capacity.
RAID-6: uses ~1.5x capacity.
Higher efficiency → more CPU cost and higher minimum cluster size.
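The overhead factors translate directly into usable capacity; a short sketch mirroring the examples above:

```python
# Effective vSAN capacity after protection overhead.
RAID_OVERHEAD = {
    "RAID-1 (FTT=1)": 2.0,    # full mirror
    "RAID-5 (FTT=1)": 4 / 3,  # 3 data + 1 parity
    "RAID-6 (FTT=2)": 1.5,    # 4 data + 2 parity
}

raw_tb = 100  # hypothetical raw cluster capacity

for scheme, factor in RAID_OVERHEAD.items():
    print(f"{scheme}: {raw_tb / factor:.1f} TB usable from {raw_tb} TB raw")
# RAID-1: 50.0 TB, RAID-5: 75.0 TB, RAID-6: 66.7 TB
```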
Account for:
Yearly data growth
Log retention policies
Snapshot retention
Backup storage needs
Storage planning must also consider upgrade windows and rebuild capacity.
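A compound-growth sketch; the consumed capacity and slack buffer are assumed, and the growth rate reuses the "15% per year" example assumption from earlier:

```python
current_tb = 60      # usable capacity consumed today (assumed)
growth_rate = 0.15   # mirrors the "15% per year" growth assumption above
years = 5
buffer = 0.25        # slack for snapshots, logs, rebuilds, and upgrades

for year in range(1, years + 1):
    projected = current_tb * (1 + growth_rate) ** year
    print(f"year {year}: {projected:.1f} TB used, "
          f"plan for {projected * (1 + buffer):.1f} TB usable")
```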
Network sizing includes:
Number of NICs per host
NIC speed (10/25/40/100 Gbps)
NIC teaming and redundancy
Segregation of traffic types (vMotion, vSAN, management, VM)
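A rough per-host bandwidth budget with assumed traffic figures, checking whether a two-NIC team still covers all traffic types after a link failure:

```python
# Assumed steady-state bandwidth needs per traffic type (Gbps).
traffic = {"management": 1, "vMotion": 8, "vSAN": 10, "VM": 6}

NIC_SPEED_GBPS = 25
NICS_PER_HOST = 2  # active/active team

need = sum(traffic.values())
normal = NIC_SPEED_GBPS * NICS_PER_HOST
degraded = NIC_SPEED_GBPS * (NICS_PER_HOST - 1)  # one link failed

print(f"required: {need} Gbps")
print(f"normal capacity: {normal} Gbps (ok: {need <= normal})")
print(f"degraded capacity: {degraded} Gbps (ok: {need <= degraded})")
```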
East–west traffic
VM ↔ VM inside the datacenter
Grows with microservices and multi-tier apps
North–south traffic
Traffic entering and leaving the datacenter (users, internet, external services).
High east–west traffic favors:
Better underlay bandwidth
NSX distributed routing
Larger leaf-spine fabrics
Management cluster/domain
Hosts infrastructure services such as vCenter, NSX Managers, and (in VCF) SDDC Manager.
Workload clusters/domains
Run business applications.
Provide lifecycle independence.
Cluster considerations include:
Maximum hosts per cluster (64 with vSAN; up to 96 for vSphere-only clusters).
HA admission control settings to guarantee failover.
DRS automation level (manual, partially automated, fully automated).
Affinity/anti-affinity rules for app-level redundancy.
Typical physical network structure:
Core layer – high-speed routing backbone.
Aggregation layer – optional layer for scaling larger networks.
Access layer – Top-of-Rack switches connecting ESXi hosts.
ToR (Top-of-Rack) switches connect hosts to the network.
LAG (Link Aggregation Group) bundles links for bandwidth + redundancy.
MLAG/vPC links two physical switches into one logical entity → prevents single-switch failure.
Underlay provides IP connectivity with proper MTU.
NSX overlays create logical networks independent of physical layout.
Overlay routing (distributed routing) improves east–west performance.
Common storage options:
vSAN – hyperconverged, easy scaling.
SAN – traditional centralized block storage.
NAS – file-based, simple to manage.
Hybrid environments use multiple types depending on workload.
Fault domains reduce correlated failure risks.
Stretched clusters provide cross-site resilience.
Storage policies define:
Performance tier
Redundancy level
RAID format
Compliance reporting
Each workload can have tailored storage rules.
Access control (RBAC):
Assign minimal privileges (least privilege).
Separate operational teams: network, storage, security.
Common identity sources:
AD / LDAP
SAML / OIDC identity providers
MFA-enabled enterprise IdPs
Design security zones:
Management
Production
DMZ
Sensitive data zones
Policies enforced using NSX DFW and gateway firewalls.
Redundancy protects against failures:
N+1: one spare host or device
N+2: two simultaneous failures tolerated
Redundancy applies to:
Hosts
Switches
Storage
Power (dual PSUs, dual feeds)
HA restarts VMs on surviving hosts.
Datastores must be backed by redundant paths and devices so a storage failure does not cause an outage.
Plan to avoid:
CPU contention (high ready time)
Memory ballooning/swapping
Storage latency spikes
Oversubscribed bandwidth
Performance improves by:
Fitting VMs within NUMA boundaries
Using local storage paths
Ensuring low-latency networking
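A minimal NUMA-fit check against a hypothetical host topology:

```python
# Hypothetical host: 2 sockets, 32 cores and 512 GB RAM per NUMA node.
CORES_PER_NODE = 32
GB_PER_NODE = 512

def fits_numa(vm_vcpus: int, vm_gb: int) -> bool:
    """True if the VM fits entirely inside one NUMA node."""
    return vm_vcpus <= CORES_PER_NODE and vm_gb <= GB_PER_NODE

print(fits_numa(16, 256))  # True: memory access stays NUMA-local
print(fits_numa(48, 768))  # False: spans nodes, expect remote-memory latency
```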
Replication options:
Storage-based replication
VM-based replication (vSphere Replication)
App-level replication (e.g., database clustering)
Backup consistency levels:
Crash-consistent – like a sudden power cut.
Application-consistent – uses VSS or quiescing for safer recovery.
Document how to perform DR failover.
Runbooks include:
Switch DNS
Start order for VMs
Network reconfigurations
Testing instructions
Planning includes:
Bill of Materials (BOM)
Compatibility verification
Version selection and alignment
Datacenter preparation
Deployment includes:
Installing ESXi
Deploying vCenter, NSX, vSAN
Creating clusters and storage policies
Setting up monitoring and backup
Ongoing operations include:
Upgrades and patches
Capacity monitoring
Log and alert management
Optimization tasks
Design with:
Clear naming standards
Documented procedures
Dashboards for metrics and logs
Automated health checks
Use automation for:
VM deployment
Cluster configuration
Compliance enforcement
Lifecycle operations
For developer self-service, decide:
Which operations developers may perform
Which actions require admin approval
Whether to expose API-based provisioning portals
A good design is not just about the final architecture; it is about clearly documented decisions. A consistent decision framework makes the design reviewable, defensible, and easier to maintain.
Decision statement
This is a concise description of what you decided. It should be specific, technology-aware, and unambiguous.
Examples:
Use vSAN ESA for primary production storage in the management domain.
Use NSX overlay networking for all application tiers across workload domains.
Alternatives considered
List the realistic options you evaluated, not just the one you chose.
Examples:
vSAN ESA vs vSAN OSA vs external FC SAN.
NSX overlay vs VLAN-only network design.
This shows you compared viable options instead of picking a solution arbitrarily.
Justification
Explain why the chosen option is the best fit for the requirements and constraints. This should tie back to business and technical drivers such as availability, cost, manageability, and performance.
Examples:
vSAN ESA selected for higher throughput and better space efficiency on all-NVMe hardware.
NSX overlay chosen to simplify multi-tenant network segmentation and micro-segmentation.
Impact analysis
Document the consequences of the decision on the overall design.
This includes:
Operational impact (skills, processes, tooling).
Cost impact (licenses, hardware requirements).
Architectural impact (dependencies on other components).
Example: adopting NSX overlays requires NSX expertise, MTU configuration on the underlay, and changes to troubleshooting practices.
Risks and mitigations
Every decision introduces risk. Identify them and specify mitigations.
Examples:
Risk: team inexperience with NSX.
Mitigation: training, phased deployment, strong documentation.
Risk: tight hardware compatibility window for vSAN ESA.
Mitigation: strict HCL management and lab validation prior to production rollout.
VMware uses a consistent set of design quality attributes to evaluate and structure architectures. Every major design decision should be assessed against these attributes.
Availability
Measures how resilient the design is to failures and how much uptime it can provide.
Influenced by: HA clustering, redundancy (N+1/N+2), fault domains, stretched clusters, and DR design.
Manageability
Describes how easy it is to operate, monitor, upgrade, and troubleshoot the platform.
Influenced by: automation, standardization, naming conventions, tooling integration, and operational processes.
Performance
Reflects how well the platform meets workload SLAs under normal and peak load.
Influenced by: host sizing, storage design (IOPS/latency), network bandwidth, resource overcommit policies, and NUMA awareness.
Recoverability
Focuses on how quickly and how completely the platform can be recovered after a failure or disaster.
Influenced by: backup and restore strategy, replication, RPO/RTO alignment, runbooks, and DR test practices.
Security
Covers the protection of data, identities, workloads, and management planes.
Influenced by: encryption, RBAC, identity integration, segmentation, hardening baselines, and certificate management.
Admission control ensures that enough resources remain available in a cluster to restart VMs if one or more hosts fail.
Percentage-based admission control
Configures a percentage of cluster CPU and memory to be reserved for failover.
Dynamic and flexible: the cluster evaluates total capacity and reserved resources to decide if additional VMs can be powered on.
Slot-based admission control
Uses a “slot” concept derived from the largest CPU and memory reservations (or defaults if none).
Determines how many such slots the cluster can host and how many must be held in reserve for failover.
Can be overly conservative when a few large VMs exist.
Dedicated failover hosts
Reserves one or more hosts purely for failover.
No workloads should run on these hosts during normal operation.
Simple to understand but less efficient in terms of resource usage.
Impact on workload placement and cluster sizing
Admission control directly affects:
How many VMs you can place in a cluster.
How much spare capacity you must maintain.
How aggressively you can overcommit resources.
Designers must balance safety (enough failover capacity) against cost and utilization efficiency.
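A sketch of the percentage-based reservation math, including the N+2 case that the maintenance discussion below builds on (host counts are illustrative):

```python
def failover_reserve_pct(hosts: int, failures_tolerated: int = 1) -> float:
    """Percent of cluster CPU/memory to reserve so the cluster can
    absorb the given number of host failures."""
    return 100 * failures_tolerated / hosts

for hosts in (4, 8, 16):
    n1 = failover_reserve_pct(hosts)
    n2 = failover_reserve_pct(hosts, failures_tolerated=2)
    print(f"{hosts} hosts: reserve {n1:.0f}% (N+1) or {n2:.0f}% (N+2)")
# 4 hosts: 25% / 50%; 8 hosts: 13% / 25%; 16 hosts: 6% / 13%
```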
Designs must account not only for failures but also for planned activities such as upgrades and hardware maintenance.
N+1 for maintenance vs N+1 for HA
N+1 for HA ensures that the cluster can tolerate the loss of one host while keeping all VMs running.
N+1 for maintenance considers that you may simultaneously remove a host for planned work and still need to tolerate failures.
In some environments, you may effectively need N+2 capacity for both HA and upgrades.
Resource overhead during rolling upgrades
During rolling upgrades of ESXi, NSX, or vSAN, hosts enter maintenance mode and evacuate VMs or data.
The cluster must have enough spare CPU, memory, network, and storage capacity to:
Host evacuated VMs.
Handle vSAN resync operations.
Maintain acceptable performance for users.
Lifecycle Manager image-based upgrade constraints
Image-based management requires all hosts in a cluster to conform to a single desired image (ESXi version, drivers, firmware).
Constraints include:
Hardware compatibility with the desired image.
Vendor firmware bundles that must align with the VMware image.
The need to stage and remediate hosts without breaking vSAN or NSX connectivity.
A well-designed monitoring strategy is structured and intentional rather than reactive.
Metrics
Quantitative measures such as CPU utilization, memory consumption, IOPS, latency, packet loss, and throughput.
Metrics provide trending, capacity planning, and performance insight.
Logs
Detailed records of system events, warnings, and errors from vCenter, ESXi, NSX, vSAN, and applications.
Logs are key for root cause analysis, security investigations, and audit trails.
Events
Higher-level notifications that something significant occurred, such as VM power events, configuration changes, or HA failovers.
Events are often used for alerting and automation triggers.
Traces
End-to-end transaction or request paths, particularly relevant in microservices architectures.
Traces help identify where latency or failures occur across multiple services and tiers.
Alarm and dashboard design standards
Alarms should be:
Prioritized (critical, warning, info).
Actionable (clear description and remediation guidance).
Dashboards should:
Present health at different levels (global, cluster, application).
Highlight trends and anomalies.
A standardized observability design ensures consistent behavior across environments and teams.
Modern designs often span multiple clusters, sites, and tenants; the architecture must define clear boundaries and patterns.
Tenant isolation boundaries
Isolation can be implemented at several levels:
Folder or resource pool separation in vSphere.
Dedicated clusters or workload domains for strict isolation.
Separate NSX segments, gateways, and security policies.
The chosen boundary depends on security, compliance, and administrative needs.
Cluster expansion vs cluster segmentation strategies
Adding capacity to an existing cluster simplifies management but can increase blast radius and failure impact.
Creating additional clusters reduces blast radius and allows different policies or maintenance windows, but increases management overhead.
Decisions should be driven by tenant needs, SLAs, risk tolerance, and operational scale.
Active-Active vs Active-Passive designs
Active-Active: both sites run workloads concurrently, distributing load and providing fast failover with low RPO/RTO.
Active-Passive: one site runs production; the other is primarily for DR. Failover involves orchestration and may accept higher RTO.
Inter-site connectivity and failover patterns
Multi-site designs require:
Sufficient bandwidth and low enough latency for replication and control-plane traffic.
Consistent IP addressing or mechanisms for IP failover.
Clearly defined failover runbooks, including DNS changes, routing changes, and application startup order.
VCF adds specific constructs and operational patterns that must be considered in the design.
Workload Domain placement strategy
Deciding how to group workloads into workload domains based on:
Security and compliance boundaries.
Lifecycle independence (different upgrade schedules).
Application criticality and isolation requirements.
Over-fragmentation increases overhead; too few domains reduce isolation and flexibility.
Network Pool design considerations
Network Pools define the IP ranges and VLANs used for host and Edge connectivity.
Design must ensure sufficient address space, routability, MTU compatibility, and alignment with the underlay topology.
Bring-up prerequisites and validation
VCF bring-up requires strict pre-validation of:
DNS, NTP, and certificate requirements.
Underlay network readiness (routing, MTU, VLANs).
Hardware compatibility.
Validation failures during bring-up can be time-consuming to fix, so design must ensure readiness.
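As an illustration of pre-validation, a minimal forward/reverse DNS check; the FQDNs are placeholders for the actual bring-up inventory:

```python
import socket

# Placeholder FQDNs; substitute the real bring-up inventory.
hosts = ["sddc-manager.corp.local", "vcenter01.corp.local", "esxi01.corp.local"]

for fqdn in hosts:
    try:
        ip = socket.gethostbyname(fqdn)           # forward lookup
        reverse, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        ok = reverse.rstrip(".").lower() == fqdn.lower()
        print(f"{fqdn}: {ip}, reverse={reverse}, match={ok}")
    except socket.gaierror as exc:
        print(f"{fqdn}: forward lookup failed ({exc})")
    except socket.herror as exc:
        print(f"{fqdn}: reverse lookup failed ({exc})")
```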
Certificate and password rotation design impacts
VCF orchestrates rotation across multiple components.
The design must account for:
Scheduled windows to avoid impacting production.
Dependencies between SSO, vCenter, NSX, and SDDC Manager certificates.
Integration with enterprise PKI and password policies.
Lifecycle drift prevention
Over time, unmanaged changes can cause version and configuration drift.
VCF’s lifecycle management aims to maintain a consistent Bill of Materials across domains.
Design should minimize manual changes and rely on SDDC Manager workflows to maintain consistency.
A security blueprint ensures that cryptography, segmentation, and roles are consistently designed and enforced.
Key management (KMS and certificate hierarchy)
Centralized KMS systems manage encryption keys for VM and vSAN encryption.
A well-defined certificate hierarchy ensures trust between vCenter, ESXi, NSX, and integrated systems.
The design should define where KMS resides, how it is protected, and how failover is handled.
Encryption design (VM, vSAN, and in-transit)
VM Encryption protects individual virtual machines and their disks.
vSAN Encryption protects data at rest across the cluster.
In-transit encryption protects data moving between components or sites (for example, IPsec or TLS for management and replication traffic).
Design must consider performance impact, compliance requirements, and operational complexity.
Zero Trust segmentation principles
Assume no implicit trust based on network location.
Apply least privilege access via:
Micro-segmentation using NSX distributed firewall.
Strict management plane isolation.
Identity-aware rules where possible.
Zero Trust designs minimize lateral movement in case of compromise.
Role separation across platform, network, and security operations
Segregate duties so that no single team has unrestricted end-to-end control.
Examples:
Platform team manages vSphere and VCF operations.
Network team manages NSX underlay and overlay routing.
Security team manages firewalls, policies, and SIEM integration.
Clear separation reduces risk and better aligns with compliance frameworks.
When designing a VMware Cloud Foundation environment, when should administrators create multiple workload domains instead of expanding a single domain?
Multiple workload domains should be created when environments require operational isolation, independent lifecycle management, or different security/compliance requirements.
Workload domains act as operational boundaries in VMware Cloud Foundation. Each domain has its own vCenter instance and infrastructure stack. Organizations often create separate domains for production, development, regulated workloads, or tenant environments. This separation allows administrators to apply different patch schedules, resource policies, and security configurations without affecting other workloads. It also improves risk management because upgrades or configuration changes are isolated to a specific domain. Expanding a single domain may be appropriate when workloads share the same operational policies and lifecycle schedule. A common design mistake is creating too many domains, which increases operational overhead and management complexity.
Demand Score: 66
Exam Relevance Score: 86
What is the primary design objective when configuring alerts in VMware Aria Operations?
The primary objective is to surface actionable operational issues while minimizing alert noise and false positives.
Effective alert design in Aria Operations focuses on actionable signals rather than generating large volumes of warnings. Administrators should configure symptom definitions and thresholds that correlate with real operational risks such as storage latency, memory contention, or host failure risks. Alerts should also be prioritized based on severity levels so operations teams can focus on critical issues first. If alerts are overly sensitive, teams may experience alert fatigue and begin ignoring warnings. Conversely, thresholds that are too relaxed may hide performance problems until users are impacted. Best practice is to combine multiple symptoms into meaningful alert definitions and tune thresholds based on historical performance data.
Demand Score: 61
Exam Relevance Score: 81
Why is capacity planning important when designing a VMware Cloud Foundation environment?
Capacity planning ensures that infrastructure resources can support current workloads while allowing for future growth and operational headroom.
VMware Cloud Foundation environments must support dynamic workloads and changing resource demands. Capacity planning evaluates compute, memory, storage, and network utilization trends to ensure sufficient resources exist to support growth. Tools such as VMware Aria Operations analyze historical data and forecast resource consumption based on current workload patterns. This allows administrators to determine when additional hosts, storage capacity, or clusters are required. Without proper capacity planning, organizations may encounter resource contention, degraded performance, or unexpected service outages. A common mistake is designing infrastructure only for current workloads without considering expansion or failover scenarios such as host failures or maintenance operations.
Demand Score: 64
Exam Relevance Score: 84
What design principle helps ensure high availability of workloads within a VCF cluster?
Designing clusters with sufficient host redundancy and enabling vSphere High Availability (HA) ensures workloads remain available during host failures.
High availability in VMware Cloud Foundation relies primarily on vSphere cluster capabilities. vSphere HA monitors hosts and virtual machines within a cluster and automatically restarts VMs on surviving hosts if a host fails. To support this capability, clusters must have enough spare capacity to absorb workloads during failures. Administrators often use admission control policies to reserve resources for failover events. Additional resilience can be achieved through distributed resource scheduling (DRS), storage redundancy via vSAN, and network redundancy through NSX. Designing clusters without sufficient spare capacity may prevent HA from restarting workloads during failures, which defeats the purpose of high availability.
Demand Score: 60
Exam Relevance Score: 80