This section explains how to design a VMware Cloud Foundation (VCF) environment with vSphere Kubernetes Service (VKS, the evolution of vSphere with Tanzu), moving from business needs to actual implementation.
Design always begins with understanding what the system must accomplish. Requirements fall into two categories: functional (what the system does) and non-functional (how well it must perform).
Functional requirements describe the capabilities or features the solution must deliver. In a VCF + VKS design, examples include:
“Provide a Kubernetes platform supporting X clusters and Y namespaces.”
“Support both virtual machine (VM) and container workloads in the same platform.”
“Enable multi-tenant workload isolation using namespaces and Workload Domains.”
“Provide self-service provisioning for DevOps and application teams.”
Functional requirements directly shape the logical architecture, workload placement strategy, and operational model.
Non-functional requirements (NFRs) define quality expectations and operational characteristics. Common NFR categories include:
Scalability
Maximum number of clusters, nodes, pods, VMs
Ability to expand storage and compute resources without downtime
Performance
Latency thresholds
IOPS or throughput requirements
Application response time expectations
Availability
SLA commitments (e.g., 99.9% uptime)
Disaster recovery goals: RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
Security
Encryption (data at rest and in transit)
Compliance frameworks (CIS, FIPS, ISO)
Role-based access control and audit visibility
Manageability
Monitoring tools
Automation level (e.g., API-driven operations)
Logging and troubleshooting expectations
NFRs directly influence physical design, storage policies, NSX configuration, and lifecycle management strategy.
Constraints are limitations or rules imposed by the organization or environment. They restrict design choices.
Common examples:
Budget limitations
Fixed hardware vendor or model
Networking restrictions
Must use existing Layer 2 network domain
Must maintain legacy architectures (e.g., L2-only ToR switches)
Identity & security constraints
Must integrate with an existing identity provider (e.g., AD)
Must comply with specific encryption standards
Data center policies
Mandated site location
Required cloud provider for backup or DR
Constraints are not negotiable—they must be respected throughout the design.
Assumptions fill in gaps when information is not yet available. They must be explicit and documented to avoid misunderstandings.
Common examples:
“The networking team will provide required VLANs and BGP routing.”
“All clusters will use NTP and DNS servers reachable from both data centers.”
“Storage hardware will meet VMware’s compatibility requirements.”
“A DevOps team will manage Kubernetes deployments.”
Assumptions must be validated during design reviews.
Risks are potential events that could negatively impact the solution.
Common risk examples:
Single-site deployment
Under-sized hardware
Skill gaps in Kubernetes operations
Mitigation strategies must be provided, such as:
Implementing a second site with Site Recovery Manager (SRM)
Adding buffer capacity (N+2 hosts)
Training or hiring Kubernetes specialists
Risk management ensures the design remains robust and practical.
Conceptual design defines the big picture of the solution without diving into technical details.
Key conceptual elements include the platform vision, its consumers, and its high-level capabilities.
A typical vision statement might be:
“Provide a private cloud for VM and Kubernetes workloads with self-service provisioning, centralized management, observability, and disaster recovery.”
This establishes the platform’s purpose and high-level capabilities.
Identify who uses or operates the solution:
Infrastructure team
DevOps team
Application developers
Security and compliance teams
Business or management stakeholders
Understanding consumers helps clarify access control, automation needs, and operational workflows.
Common conceptual features in a VCF + VKS environment:
Multi-tenancy using Workload Domains and Kubernetes Namespaces
Central management via SDDC Manager and vCenter
Kubernetes-as-a-Service via Supervisor and Tanzu Kubernetes Clusters
Integrated security and policy enforcement through NSX and Kubernetes RBAC
This conceptual framework guides the creation of logical architecture.
Logical design translates conceptual requirements into a technical blueprint showing logical components and their interactions.
Logical domain design must specify:
Number of Management and Workload Domains
Segmentation of environments:
Production
Non-Production
UAT
Lab or development
Logical domains support workload isolation, governance, and lifecycle independence.
Designing the Supervisor includes:
Selecting which vSphere clusters will be enabled for Kubernetes
Allocating resources for the Supervisor control plane
Determining how workloads will consume resources
Designing the Namespace hierarchy:
Per application team
Per environment
Per project or business unit
Namespaces enforce resource limits and access policies.
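As a minimal sketch of how a Namespace-level limit might be expressed (the namespace name and values below are illustrative, not from a specific design):

```yaml
# Illustrative ResourceQuota for a team namespace; all names and values are examples.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a              # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"           # total CPU requests allowed in the namespace
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    persistentvolumeclaims: "20" # caps the number of PVCs the team can create
```

In vSphere Namespaces these limits are typically configured in vCenter, which translates them into Kubernetes objects like the one above.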
Tanzu Kubernetes Cluster (TKC) design considerations:
Size of control plane nodes
Number and size of worker nodes
K8s versioning strategy and how upgrades will be performed
Load-balancing architecture for:
Public ingress
East–west traffic
API endpoints
Logical design must align TKCs with organizational structure and workload patterns.
Logical network design includes:
Required segments for:
Management
vMotion
vSAN
Overlay networks
Workload and application traffic
NSX architecture:
Tier-0 gateways
Tier-1 gateways
Routing model (BGP, static routes)
Kubernetes CIDR selections:
Pod CIDR
Service CIDR
Avoiding overlap with corporate networks
Logical networking is crucial for scalability and connectivity.
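To make the CIDR decisions concrete, the abridged sketch below shows where pod and service ranges are declared in a TanzuKubernetesCluster manifest. The API version, class names, and ranges are examples that vary by release, and fields such as the Kubernetes release reference are omitted:

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha3   # API version varies by release
kind: TanzuKubernetesCluster
metadata:
  name: tkc-prod-01                  # hypothetical cluster name
  namespace: team-a                  # hypothetical vSphere Namespace
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: best-effort-medium    # example VM class
      storageClass: vsan-default     # example StorageClass
    nodePools:
      - name: workers
        replicas: 3
        vmClass: best-effort-large
        storageClass: vsan-default
  settings:
    network:
      pods:
        cidrBlocks: ["192.168.0.0/16"]  # pod CIDR; must not overlap corporate networks
      services:
        cidrBlocks: ["10.96.0.0/12"]    # service CIDR; internal to the cluster
```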
Physical compute design determines the amount and type of hardware needed.
Factors include:
CPU and RAM capacity
NUMA alignment for large VMs
Hardware capabilities for vSAN ESA (NVMe drives)
Required density of VMs and Kubernetes pods
Sizing must accommodate peak workload usage and growth projections.
Key rules:
Minimum of 4–5 hosts per cluster for vSAN + HA stability
Capacity for N+1 or N+2 host failures
Reserve capacity for maintenance operations
Align cluster sizes with workload segmentation for performance and governance
Compute design directly affects cost, resilience, and lifecycle operations.
Physical storage design defines how capacity, performance, and resilience will be delivered.
Depending on vSAN architecture:
OSA – disk groups with cache and capacity disks
ESA – tiered device architecture optimized for NVMe
Capacity sizing must account for:
Initial capacity requirements
1–3 year growth forecast
Overhead for failures-to-tolerate (FTT) policy
Space for rebuild operations
Policies should be tailored to workload types:
High-performance workloads may use:
Higher FTT
Higher stripe width
Capacity-optimized workloads may use:
Thin provisioning
Lower redundancy levels
For Kubernetes persistent storage:
StorageClasses correspond to vSphere Storage Policies
PVCs are backed by First-Class Disks (FCDs) in vSphere
Ensures consistent storage behavior across clusters
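A hedged sketch of that mapping: on the Supervisor, StorageClasses are typically generated automatically when a storage policy is assigned to a Namespace, but the explicit CSI form below shows how a class references a policy (the class and policy names are hypothetical):

```yaml
# Illustrative StorageClass backed by a vSphere storage policy.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gold-performance
provisioner: csi.vsphere.vmware.com            # vSphere CSI driver
parameters:
  storagepolicyname: "Gold-FTT1-Stripe2"       # must match an existing vSphere storage policy
allowVolumeExpansion: true
reclaimPolicy: Delete
```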
Network physical design ensures that logical networks are supported by robust physical infrastructure.
Key considerations:
Uplink count and bandwidth
VLAN mapping
Management
vMotion
vSAN
Overlay transport (NSX)
Kubernetes networks
Leaf–spine considerations
ECMP routing
MTU configuration (at least 1600 bytes for NSX overlay traffic; 9000-byte jumbo frames recommended for NSX and vSAN performance)
Redundancy and multipathing
LACP bundles
Dual ToR switches
Resilient edge connectivity
Network physical design ensures availability, throughput, and scalability.
Every architectural decision must be documented clearly. For each decision, include:
Decision
Justification
Impacts
Risks and Mitigations
For example, a decision to co-locate multiple teams on shared clusters might record:
Risk: Reduced isolation
Mitigation: Strict RBAC and Namespace governance
This ensures transparency and alignment across stakeholders.
Designing a VCF + VKS platform often requires balancing competing priorities.
Common trade-offs:
More domains vs simpler operations
More domains = stronger isolation
Fewer domains = easier lifecycle management
Larger clusters vs more small clusters
Large clusters maximize resource pooling
Small clusters improve fault isolation and upgrade flexibility
Strong isolation vs efficient pooling
Workload Domains provide physical isolation
Resource pools provide logical segmentation
Aggressive overcommit vs strict SLAs
Overcommit reduces hardware requirements
Strict SLAs require reserved capacity and optimized performance
A well-designed solution balances business needs with operational feasibility.
Lifecycle Management is about how you keep the platform healthy over time: patching, upgrading, and staying supported without breaking things. In VCF 9.x, this is handled mainly by SDDC Manager + vSphere Lifecycle Manager (vLCM).
Designing LCM properly means you think about upgrades and consistency from day one, not as an afterthought.
You cannot upgrade components in a random order. VMware publishes a Bill of Materials (BOM) and a supported upgrade path.
Typical sequence is conceptually:
First: management and underpinning components (for example NSX managers)
Then: vCenter Server
Then: ESXi hosts and vSAN bits
Finally: add-ons and dependent tools
In VCF, SDDC Manager orchestrates this order. From a design perspective, you must:
Plan maintenance windows large enough for full stack upgrades
Avoid out-of-band manual upgrades (for example upgrading NSX directly from its UI)
Ensure your design does not create circular dependencies (for example monitoring depending on a component that is down during every upgrade)
vLCM uses a “cluster image” instead of many separate patch baselines.
When you design a VCF cluster, you must:
Decide which ESXi version, firmware, and driver bundle will be the “image”
Ensure all hosts in the cluster can support that image
Plan how you will remediate hosts without violating availability (for example maintain N+1 capacity so at least one host can be in maintenance at a time)
Image-based management simplifies lifecycle, but it requires homogeneous hardware and discipline about not modifying hosts outside vLCM.
Each VCF release comes with a BOM that pins:
One specific vCenter version
One or more ESXi versions
A specific NSX version
A supported vSAN version
LCM design must respect those constraints:
You cannot arbitrarily “jump” to a newer NSX if VCF’s BOM is not ready
You may need to perform multiple step upgrades to reach a target version
You must keep management and workload domains within supported ranges
This affects project planning: sometimes you design for “where we are now + upgrade path,” not only for the final state.
Drift means a component no longer matches the desired version or configuration.
Examples:
A host where someone manually updated a NIC driver outside vLCM
A cluster where one host missed the last patch
NSX or vCenter at a version different from what SDDC Manager expects
In design, you assume:
Drift will happen
You must have clear workflows to detect and fix it
That means:
Enabling and regularly reviewing compliance/drift reports
Designing clusters with enough spare capacity for remediation runs
Avoiding “snowflake” hosts with special manual tweaks
In VCF, you do not just add a host directly into vCenter. You:
Commission a host into SDDC Manager (which checks hardware, firmware, image compatibility, networking)
Then assign it to a domain/cluster
Decommission is the reverse: safely evacuate workloads, remove from domain, and clean up config.
For design, this means:
Ensuring network and firmware standards are defined up front
Requiring new hardware to comply with vLCM images and VCF standards
Designing capacity so you can safely remove hosts if needed (for example for RMA or lifecycle)
In traditional environments, firmware was often updated using vendor tools, independently from ESXi. In VCF 9.x with vLCM:
Firmware and drivers are part of the cluster image
Updates happen together with ESXi remediation
Design considerations:
Choose hardware vendors that provide supported vLCM hardware support packages
Plan unified firmware + ESXi remediation windows
Avoid manual firmware updates that bypass vLCM and break image compliance
VCF can have multiple Workload Domains plus the Management Domain.
Design goals:
Keep domains within supported BOM versions
Avoid large version skews between domains that interact (for example management tools, backup, monitoring)
Define a lifecycle policy: which domains are upgraded first, and how often
This becomes especially important when you run mixed workloads (classic VMs and VKS clusters) across different domains.
When you enable vSphere with Tanzu, your network design must support both VM traffic and Kubernetes traffic. NSX is usually the CNI for pod networking in VCF-based designs.
The Supervisor Cluster:
Runs on a vSphere cluster
Exposes a Kubernetes API endpoint (usually via a virtual IP)
Uses NSX-backed networks for PodVMs and services
Design tasks:
Define management networks for Supervisor control plane VMs
Define workload networks for PodVMs and TKCs
Plan IP ranges and routing so developers can reach the API endpoint and services safely
As CNI, NSX:
Creates logical segments for pods (for example per Namespace or per TKC)
Handles overlay encapsulation (GENEVE) between hosts and edge nodes
Programs distributed firewall rules for pod-level isolation
Packet flow to understand:
Pod-to-pod in the same node: handled locally
Pod-to-pod between nodes: via overlay and distributed routing
Pod-to-external: via Tier-1 and Tier-0 gateways and potentially load balancers
Your design must ensure MTU, routing, and firewall policies all support these flows.
In VKS, “native pods” often run inside PodVMs:
Each PodVM is a small VM with its own vNICs
NSX treats them like other VMs at the networking layer
PodVMs may have separate interfaces for node communication vs workload traffic
Traffic separation design:
Management/control traffic (Supervisor, kube API)
Node infrastructure traffic
Application data traffic
You design segments, gateways, and firewall rules for each category.
Kubernetes needs:
Node CIDRs: IP ranges for nodes (VMs / PodVMs)
Pod CIDRs: IP ranges for pods themselves
Design principles:
Avoid overlap with existing corporate networks
Reserve enough space for future scaling
Keep CIDR planning simple and well-documented
Poor CIDR planning leads to routing conflicts that are hard to fix later.
Service CIDR:
An internal, virtual IP range used for ClusterIP services
Not routed outside the cluster
For design:
Choose a CIDR that does not overlap with real networks
Ensure DNS and CoreDNS can resolve service names correctly
Ingress network design:
Decide which subnets and IPs will front HTTP/HTTPS traffic
Configure NSX Load Balancer or NSX ALB to terminate or pass traffic
Ensure firewall rules allow traffic from clients to Ingress IPs
The NSX load balancer:
Provides LoadBalancer Service support for Kubernetes
Integrates with Ingress controllers
Routes external requests to pods via NodePort or direct pod endpoints
Design questions:
How many load balancer instances and edges do you need?
How will VIPs be allocated and announced (BGP, static routes)?
Do you need L7 features such as URL-based routing or TLS offload?
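For context, the developer-facing side of these decisions is an ordinary LoadBalancer Service; the NSX or ALB integration allocates and announces the VIP. A minimal example (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-frontend          # hypothetical service
spec:
  type: LoadBalancer          # VIP allocated by the NSX / ALB integration
  selector:
    app: web-frontend         # example pod label
  ports:
    - port: 443               # VIP port exposed to clients
      targetPort: 8443        # container port receiving the traffic
      protocol: TCP
```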
North–south means in/out of the data center or cluster.
You must:
Design Tier-0 connectivity to the physical network (BGP or static)
Ensure service IPs (load balancer VIPs, Ingress IPs) are reachable from users
Consider edge cluster placement and high availability
The routing design must be robust enough so that losing a single edge or link does not break service.
Namespaces can be isolated through:
Kubernetes NetworkPolicies (for example default-deny and explicit allows)
NSX DFW rules mapped to Namespace labels or pod labels
Design patterns:
Per-Namespace default-deny with explicit rules for allowed traffic
Separate logical segments per team or application
Combining Namespace isolation with Workload Domain separation for stronger isolation in sensitive environments
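A common starting point is the per-Namespace default-deny pattern with explicit allows, sketched below (namespace and label names are illustrative):

```yaml
# Default-deny: blocks all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a            # hypothetical namespace
spec:
  podSelector: {}              # empty selector matches every pod
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow: ingress to the web tier from the ingress controller namespace only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-web
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      tier: web                # example label on the web tier pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # example source namespace
```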
Security and identity design defines how users authenticate, what they can access, and how workloads are protected.
Identity Federation:
Allows vSphere to delegate authentication to external identity providers (for example SAML or OIDC providers)
Supports modern multi-factor authentication and central identity policies
Design aspects:
Selecting the identity provider
Designing trust relationships and token lifetimes
Ensuring high availability for the identity provider
Kubernetes clusters (Supervisor and TKCs) can:
Use OIDC to validate tokens from an identity provider
Map token claims (for example groups) to Kubernetes RBAC roles
Design tasks:
Define which groups map to which Kubernetes roles
Plan token expiry and refresh behavior
Ensure network connectivity between clusters and the identity provider
RBAC in Kubernetes:
Uses roles and role bindings at Namespace scope
Grants fine-grained permissions (for example create Pods, read ConfigMaps)
A good design:
Defines standard roles per team type (developer, ops, read-only)
Uses group-based bindings rather than individual user bindings
Keeps permissions minimal while supporting normal workflows
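A minimal sketch of a group-based binding, assuming group claims arrive from the identity provider (the group, namespace, and binding names are examples):

```yaml
# Members of an IdP group receive the built-in "edit" role in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developers-edit
  namespace: team-a
subjects:
  - kind: Group
    name: "team-a-developers"        # group claim from the OIDC / identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                         # built-in role; scoped to one namespace by this binding
  apiGroup: rbac.authorization.k8s.io
```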
NSX can use identity information to drive firewalls:
User identities or AD group membership
VM or pod identities based on tags
Kubernetes labels mapped to security groups
Design usage:
Allow only specific groups to access management endpoints
Restrict east–west traffic between application tiers or tenants
Combine RBAC and DFW to provide defense-in-depth
Certificates:
Secure API endpoints (vCenter, NSX, K8s API, Ingress)
Are required for mTLS between services
Design must cover:
Certificate authorities (internal PKI vs external CA)
Renewal and rotation schedules
Automatic vs manual certificate management for K8s clusters and NSX
Ignoring certificate lifecycle leads to outages when certificates expire.
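One common way to automate rotation for workload certificates is cert-manager; this sketch assumes it is deployed and backed by an internal CA issuer (all names are hypothetical):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ingress-tls
  namespace: team-a
spec:
  secretName: ingress-tls      # Secret the signed certificate is written into
  duration: 2160h              # 90-day certificate lifetime
  renewBefore: 360h            # rotate 15 days before expiry
  dnsNames:
    - app.example.com          # example service hostname
  issuerRef:
    name: corporate-ca         # hypothetical ClusterIssuer backed by internal PKI
    kind: ClusterIssuer
```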
Policy-as-code:
Uses tools like OPA/Gatekeeper to evaluate Kubernetes resources against rules
Prevents “bad” manifests from being applied
Examples:
Require labels on all namespaces and workloads
Block privileged containers
Enforce that all services use TLS
Design work includes:
Defining the policy set
Rolling out gradually (audit mode first, then enforce)
Integrating with CI/CD pipelines for pre-deployment checks
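As an illustration, assuming the k8srequiredlabels ConstraintTemplate from the Gatekeeper examples is installed, a constraint can start in audit-only mode and later be switched to enforcement (label and constraint names are examples):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  enforcementAction: dryrun    # audit-only first; switch to "deny" once violations are cleaned up
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]          # example required label
```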
Container images are a major attack surface.
Design includes:
Limiting where images can be pulled from (approved registries)
Enforcing authentication to registries
Integrating image scanners to detect vulnerabilities
Possibly blocking deployment of images with critical vulnerabilities or from unknown sources
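A sketch of registry restriction, assuming the allowed-repos ConstraintTemplate from the Gatekeeper policy library is installed (the registry prefix is hypothetical):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: approved-registries-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "registry.example.com/"   # hypothetical approved internal registry prefix
```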
Here you design how the platform behaves across sites or regions and how you recover from major failures.
Stretched clusters:
Spread a single vSphere/vSAN cluster across two sites
Use a witness for quorum decisions
Design considerations:
Latency between sites must be within vSAN stretched-cluster limits (typically no more than 5 ms round-trip)
Sufficient bandwidth for synchronous replication
Witness location and sizing
Clear understanding of failure behavior (which site remains active)
You can increase Supervisor availability by:
Placing control plane VMs across fault domains or AZs
Ensuring persistent storage is resilient across zones
Designing networking so losing an AZ does not cut off the API endpoint
This is similar to multi-AZ design for traditional applications, but applied to the Kubernetes control plane.
TKCs hold their state in etcd:
Backing up etcd periodically protects you from control plane failures and misconfigurations
Restoring etcd allows you to recover the cluster state at a specific point in time
Design decisions:
Where to store backups (on-site, off-site)
How often to run backups
How to test restore procedures without impacting production
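At the workload-cluster level, these decisions are often implemented with a scheduled backup tool such as Velero, which captures cluster resources and volume data rather than raw etcd snapshots but answers the same where/how-often questions. A sketch, assuming Velero is installed with an off-site storage location configured:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # cron: daily at 02:00
  template:
    includedNamespaces: ["*"]    # back up all namespaces
    storageLocation: offsite-s3  # hypothetical off-site BackupStorageLocation
    ttl: 720h                    # retain backups for 30 days
```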
For stateful workloads:
VM-level replication or vSAN replication will move data
PVCs must be reattached or recreated at the DR site
StorageClasses and policies must exist at the target location
Design must ensure a PV created in site A has a recognizable and compatible counterpart in site B after failover.
Possible models:
Rebuild clusters at DR site from code and attach replicated data
Maintain warm standby TKCs at DR site and promote them on failover
Run active–active clusters in multiple regions and use GSLB
Each has different cost, complexity, RTO, and RPO characteristics.
You must pay attention to:
IP addressing (can you reuse same CIDRs, or must you re-IP?)
Load balancers and DNS failover (GSLB)
Latency and bandwidth between regions
Network security and compliance for cross-border data flows
Bad network design can make a DR plan impossible to execute.
Platform-level DR:
Moves entire clusters or VMs (for example via SRM)
Focuses on infrastructure continuity
Application-level DR:
Recreates applications and data using app-aware methods (for example database replication, GitOps)
Focuses on the app’s own resilience mechanisms
Most real-world designs use a mix of both, depending on each application’s needs.
VCF imposes specific requirements that affect hardware selection and cluster layout.
You must design:
Enough hosts for vSAN to meet FTT and performance needs
Enough hosts for HA to tolerate failures (for example N+1, N+2)
Enough hosts to support maintenance mode without risking capacity exhaustion
VCF often recommends at least four hosts per vSAN cluster in production scenarios.
Choosing between vSAN ESA and OSA impacts:
Hardware requirements (NVMe requirement for ESA)
Performance characteristics
How cache and capacity are structured
Migration strategies from OSA to ESA later
You should prefer ESA in modern designs where hardware allows it, but you must confirm compatibility.
Your design should:
Provide sufficient bandwidth for vSAN, vMotion, management, and overlay traffic
Use at least 25 GbE in most production-grade VCF deployments
Reserve enough bandwidth for peaks and resync operations
Low bandwidth results in contention between storage and application traffic.
If you use GPUs or DPUs:
Ensure they are on the VMware HCL and supported by vLCM images
Plan which clusters will host GPU workloads
Consider NUMA and PCIe topology for performance
Check integration with Kubernetes (for example device plugins)
SmartNICs/DPUs may offload NSX or storage functions; this affects how you design host roles and firmware updates.
Edges must handle:
Aggregate north–south throughput
Load balancer traffic
Routing and NAT operations
Design tasks:
Decide how many edges per edge cluster
Place edges across racks or AZs for redundancy
Size CPU and memory to handle peak traffic and failover scenarios
Application Virtual Networks (AVNs) are NSX logical networks used by management components or services.
Design includes:
Which AVNs are required for specific domains or solutions
How AVNs connect to the physical network
IP ranges and routing for AVN segments
These decisions affect how easily you can expand the platform later.
After designing, you must prove that the design is correct, complete, and supportable.
You should map:
Each functional requirement to one or more design elements
Each non-functional requirement (availability, security, performance) to specific mechanisms
If something cannot be traced back to a requirement, question whether it belongs. If a requirement is not satisfied by any design element, the design is incomplete.
Check that:
Every logical component (cluster, Namespace, network, storage policy) has a realistic physical implementation
Capacity models match actual hardware counts and capabilities
Network diagrams match switch, VLAN, and routing designs
This prevents “paper designs” that cannot be implemented.
You should:
Estimate current and projected usage (VM count, pod count, storage)
Model growth over 1–3 years
Determine when you need to add hosts, disks, or clusters
Tools like Aria Operations can help, but the design must include high-level assumptions and thresholds.
Ask:
What happens if a host fails? A rack? A whole site?
Does the design meet the RTO/RPO and SLA commitments in each case?
Are there enough spare resources to restart workloads elsewhere?
This analysis might lead you to adjust cluster sizes, vSAN policies, or DR strategies.
Before finalizing the design:
Confirm hardware is on the VMware HCL
Confirm software versions match the VCF BOM
Check interoperability between VMware products and third-party tools
Skipping this step leads to upgrade blocks and unsupported configurations.
During design, you identify risks (budget, skill gaps, hardware constraints). In validation, you:
Re-check that each risk has an appropriate mitigation
Decide if the residual risk is acceptable to the business
Document risk ownership (who accepts it)
Finally, check if the organization is ready to run what you have designed:
Are there enough trained staff?
Are monitoring, backup, and DR runbooks in place and tested?
Are support processes and escalation paths defined?
A technically perfect design can fail if operations are not prepared to support it.
What is the minimum host requirement for enabling a Supervisor Cluster in vSphere Kubernetes Service?
A minimum of three ESXi hosts is required.
Supervisor Clusters require high availability for the Kubernetes control plane components that run as virtual machines on ESXi hosts. To ensure resilience and proper scheduling of these control plane nodes, VMware requires at least three ESXi hosts in the cluster. This design ensures that the control plane remains operational if one host fails. The cluster must also meet networking and storage requirements before Kubernetes can be enabled. In exam scenarios, host count requirements are frequently tested because they affect deployment feasibility and platform resilience.
Demand Score: 84
Exam Relevance Score: 90
Why are vSphere namespaces used in Kubernetes-enabled vSphere environments?
vSphere namespaces provide logical isolation and resource management for Kubernetes workloads.
Namespaces act as a boundary within the vSphere Kubernetes environment that controls access permissions, resource quotas, and policies for developers or teams. Administrators can assign CPU, memory, and storage limits while integrating identity and access management through vCenter. Namespaces also allow mapping Kubernetes resources to vSphere constructs such as storage policies and network segments. This design ensures that multiple teams can safely share the same infrastructure while maintaining security and resource fairness.
Demand Score: 79
Exam Relevance Score: 88
What factor is most important when sizing Tanzu Kubernetes Clusters?
Expected workload resource requirements such as CPU, memory, and storage consumption.
When designing Tanzu Kubernetes Clusters, administrators must evaluate the resource demands of the containerized applications that will run inside the cluster. This includes CPU usage, memory requirements, storage throughput, and network bandwidth. These requirements influence node sizes, worker node counts, and storage policies. Proper sizing ensures performance, scalability, and high availability for production workloads.
Demand Score: 73
Exam Relevance Score: 84
What is the purpose of resource quotas in vSphere namespaces?
Resource quotas limit the amount of compute and storage resources that a namespace can consume.
Resource quotas prevent a single team or application from consuming excessive infrastructure resources. Administrators define limits on CPU, memory, storage, and number of objects that can be deployed within the namespace. This ensures fair resource allocation among multiple development teams and improves infrastructure stability.
Demand Score: 70
Exam Relevance Score: 82
Why should storage policies be considered during Kubernetes platform design?
Storage policies determine performance, availability, and placement characteristics for Kubernetes persistent volumes.
vSphere storage policies define how storage resources are provisioned and managed for workloads. When Kubernetes persistent volumes are created, they map to vSphere storage policies that control attributes such as replication, performance tiers, and datastore placement. Selecting appropriate policies ensures that container workloads receive the correct storage performance and resilience. Misconfigured policies can lead to poor application performance or insufficient redundancy.
Demand Score: 68
Exam Relevance Score: 83