This section explains how to design a VMware Cloud Foundation (VCF) environment with vSphere Kubernetes Service (VKS, the evolution of vSphere with Tanzu), moving from business needs to actual implementation.
Design always begins with understanding what the system must accomplish. Requirements fall into two categories: functional (what the system does) and non-functional (how well it must perform).
Functional requirements describe the capabilities or features the solution must deliver. In a VCF + VKS design, examples include:
“Provide a Kubernetes platform supporting X clusters and Y namespaces.”
“Support both virtual machine (VM) and container workloads in the same platform.”
“Enable multi-tenant workload isolation using namespaces and Workload Domains.”
“Provide self-service provisioning for DevOps and application teams.”
Functional requirements directly shape the logical architecture, workload placement strategy, and operational model.
Non-functional requirements (NFRs) define quality expectations and operational characteristics. Common NFR categories include:
Scalability
Maximum number of clusters, nodes, pods, VMs
Ability to expand storage and compute resources without downtime
Performance
Latency thresholds
IOPS or throughput requirements
Application response time expectations
Availability
SLA commitments (e.g., 99.9% uptime)
Disaster recovery goals: RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
Security
Encryption (data at rest and in transit)
Compliance frameworks (CIS, FIPS, ISO)
Role-based access control and audit visibility
Manageability
Monitoring tools
Automation level (e.g., API-driven operations)
Logging and troubleshooting expectations
NFRs directly influence physical design, storage policies, NSX configuration, and lifecycle management strategy.
Constraints are limitations or rules imposed by the organization or environment. They restrict design choices.
Common examples:
Budget limitations
Fixed hardware vendor or model
Networking restrictions
Must use existing Layer 2 network domain
Must maintain legacy architectures (e.g., L2-only ToR switches)
Identity & security constraints
Must integrate with an existing identity provider (e.g., AD)
Must comply with specific encryption standards
Data center policies
Mandated site location
Required cloud provider for backup or DR
Constraints are not negotiable—they must be respected throughout the design.
Assumptions fill in gaps when information is not yet available. They must be explicit and documented to avoid misunderstandings.
Common examples:
“The networking team will provide required VLANs and BGP routing.”
“All clusters will use NTP and DNS servers reachable from both data centers.”
“Storage hardware will meet VMware’s compatibility requirements.”
“A DevOps team will manage Kubernetes deployments.”
Assumptions must be validated during design reviews.
Risks are potential events that could negatively impact the solution.
Common risk examples:
Single-site deployment
Under-sized hardware
Skill gaps in Kubernetes operations
Mitigation strategies must be provided, such as:
Implementing a second site with Site Recovery Manager (SRM)
Adding buffer capacity (N+2 hosts)
Training or hiring Kubernetes specialists
Risk management ensures the design remains robust and practical.
Conceptual design defines the big picture of the solution without diving into technical details.
Key conceptual elements include the platform vision, its consumers, and its high-level capabilities.
A typical vision statement might be:
“Provide a private cloud for VM and Kubernetes workloads with self-service provisioning, centralized management, observability, and disaster recovery.”
This establishes the platform’s purpose and high-level capabilities.
Identify who uses or operates the solution:
Infrastructure team
DevOps team
Application developers
Security and compliance teams
Business or management stakeholders
Understanding consumers helps clarify access control, automation needs, and operational workflows.
Common conceptual features in a VCF + VKS environment:
Multi-tenancy using Workload Domains and Kubernetes Namespaces
Central management via SDDC Manager and vCenter
Kubernetes-as-a-Service via Supervisor and Tanzu Kubernetes Clusters
Integrated security and policy enforcement through NSX and Kubernetes RBAC
This conceptual framework guides the creation of logical architecture.
Logical design translates conceptual requirements into a technical blueprint showing logical components and their interactions.
Logical domain design must specify:
Number of Management and Workload Domains
Segmentation of environments:
Production
Non-Production
UAT
Lab or development
Logical domains support workload isolation, governance, and lifecycle independence.
Designing the Supervisor includes:
Selecting which vSphere clusters will be enabled for Kubernetes
Allocating resources for the Supervisor control plane
Determining how workloads will consume resources
Designing the Namespace hierarchy:
Per application team
Per environment
Per project or business unit
Namespaces enforce resource limits and access policies.
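As a minimal sketch of how a Namespace-level limit might be expressed (the namespace name and values below are illustrative, not from a specific design):

```yaml
# Illustrative ResourceQuota for a team namespace; all names and values are examples.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a              # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"           # total CPU requests allowed in the namespace
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    persistentvolumeclaims: "20" # caps the number of PVCs the team can create
```

In vSphere Namespaces these limits are typically configured in vCenter, which translates them into Kubernetes objects like the one above.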
Tanzu Kubernetes Cluster (TKC) design considerations:
Size of control plane nodes
Number and size of worker nodes
K8s versioning strategy and how upgrades will be performed
Load-balancing architecture for:
Public ingress
East–west traffic
API endpoints
Logical design must align TKCs with organizational structure and workload patterns.
Logical network design includes:
Required segments for:
Management
vMotion
vSAN
Overlay networks
Workload and application traffic
NSX architecture:
Tier-0 gateways
Tier-1 gateways
Routing model (BGP, static routes)
Kubernetes CIDR selections:
Pod CIDR
Service CIDR
Avoiding overlap with corporate networks
Logical networking is crucial for scalability and connectivity.
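To make the CIDR decisions concrete, the abridged sketch below shows where pod and service ranges are declared in a TanzuKubernetesCluster manifest. The API version, class names, and ranges are examples that vary by release, and fields such as the Kubernetes release reference are omitted:

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha3   # API version varies by release
kind: TanzuKubernetesCluster
metadata:
  name: tkc-prod-01                  # hypothetical cluster name
  namespace: team-a                  # hypothetical vSphere Namespace
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: best-effort-medium    # example VM class
      storageClass: vsan-default     # example StorageClass
    nodePools:
      - name: workers
        replicas: 3
        vmClass: best-effort-large
        storageClass: vsan-default
  settings:
    network:
      pods:
        cidrBlocks: ["192.168.0.0/16"]  # pod CIDR; must not overlap corporate networks
      services:
        cidrBlocks: ["10.96.0.0/12"]    # service CIDR; internal to the cluster
```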
Physical compute design determines the amount and type of hardware needed.
Factors include:
CPU and RAM capacity
NUMA alignment for large VMs
Hardware capabilities for vSAN ESA (NVMe drives)
Required density of VMs and Kubernetes pods
Sizing must accommodate peak workload usage and growth projections.
Key rules:
Minimum of 4–5 hosts per cluster for vSAN + HA stability
Capacity for N+1 or N+2 host failures
Reserve capacity for maintenance operations
Align cluster sizes with workload segmentation for performance and governance
Compute design directly affects cost, resilience, and lifecycle operations.
Physical storage design defines how capacity, performance, and resilience will be delivered.
Depending on vSAN architecture:
OSA – disk groups with cache and capacity disks
ESA – tiered device architecture optimized for NVMe
Capacity sizing must account for:
Initial capacity requirements
1–3 year growth forecast
Overhead for failures-to-tolerate (FTT) policy
Space for rebuild operations
Policies should be tailored to workload types:
High-performance workloads may use:
Higher FTT
Higher stripe width
Capacity-optimized workloads may use:
Thin provisioning
Lower redundancy levels
For Kubernetes persistent storage:
StorageClasses correspond to vSphere Storage Policies
PVCs are backed by First-Class Disks (FCDs) in vSphere
Ensures consistent storage behavior across clusters
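A hedged sketch of that mapping: on the Supervisor, StorageClasses are typically generated automatically when a storage policy is assigned to a Namespace, but the explicit CSI form below shows how a class references a policy (the class and policy names are hypothetical):

```yaml
# Illustrative StorageClass backed by a vSphere storage policy.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gold-performance
provisioner: csi.vsphere.vmware.com            # vSphere CSI driver
parameters:
  storagepolicyname: "Gold-FTT1-Stripe2"       # must match an existing vSphere storage policy
allowVolumeExpansion: true
reclaimPolicy: Delete
```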
Network physical design ensures that logical networks are supported by robust physical infrastructure.
Key considerations:
Uplink count and bandwidth
VLAN mapping
Management
vMotion
vSAN
Overlay transport (NSX)
Kubernetes networks
Leaf–spine considerations
ECMP routing
MTU configuration (at least 1600 bytes for NSX overlay traffic; 9000-byte jumbo frames recommended for NSX and vSAN performance)
Redundancy and multipathing
LACP bundles
Dual ToR switches
Resilient edge connectivity
Network physical design ensures availability, throughput, and scalability.
Every architectural decision must be documented clearly. For each decision, include:
Decision
Justification
Impacts
Risks and Mitigations
For example, a decision to co-locate multiple teams on shared clusters might record:
Risk: Reduced isolation
Mitigation: Strict RBAC and Namespace governance
This ensures transparency and alignment across stakeholders.
Designing a VCF + VKS platform often requires balancing competing priorities.
Common trade-offs:
More domains vs simpler operations
More domains = stronger isolation
Fewer domains = easier lifecycle management
Larger clusters vs more small clusters
Large clusters maximize resource pooling
Small clusters improve fault isolation and upgrade flexibility
Strong isolation vs efficient pooling
Workload Domains provide physical isolation
Resource pools provide logical segmentation
Aggressive overcommit vs strict SLAs
Overcommit reduces hardware requirements
Strict SLAs require reserved capacity and optimized performance
A well-designed solution balances business needs with operational feasibility.
Lifecycle Management is about how you keep the platform healthy over time: patching, upgrading, and staying supported without breaking things. In VCF 9.x, this is handled mainly by SDDC Manager + vSphere Lifecycle Manager (vLCM).
Designing LCM properly means you think about upgrades and consistency from day one, not as an afterthought.
You cannot upgrade components in a random order. VMware publishes a Bill of Materials (BOM) and a supported upgrade path.
Typical sequence is conceptually:
First: management and underpinning components (for example NSX managers)
Then: vCenter Server
Then: ESXi hosts and vSAN bits
Finally: add-ons and dependent tools
In VCF, SDDC Manager orchestrates this order. From a design perspective, you must:
Plan maintenance windows large enough for full stack upgrades
Avoid out-of-band manual upgrades (for example upgrading NSX directly from its UI)
Ensure your design does not create circular dependencies (for example monitoring depending on a component that is down during every upgrade)
vLCM uses a “cluster image” instead of many separate patch baselines.
When you design a VCF cluster, you must:
Decide which ESXi version, firmware, and driver bundle will be the “image”
Ensure all hosts in the cluster can support that image
Plan how you will remediate hosts without violating availability (for example maintain N+1 capacity so at least one host can be in maintenance at a time)
Image-based management simplifies lifecycle, but it requires homogeneous hardware and discipline about not modifying hosts outside vLCM.
Each VCF release comes with a BOM that pins:
One specific vCenter version
One or more ESXi versions
A specific NSX version
A supported vSAN version
LCM design must respect those constraints:
You cannot arbitrarily “jump” to a newer NSX if VCF’s BOM is not ready
You may need to perform multiple step upgrades to reach a target version
You must keep management and workload domains within supported ranges
This affects project planning: sometimes you design for “where we are now + upgrade path,” not only for the final state.
Drift means a component no longer matches the desired version or configuration.
Examples:
A host where someone manually updated a NIC driver outside vLCM
A cluster where one host missed the last patch
NSX or vCenter at a version different from what SDDC Manager expects
In design, you assume:
Drift will happen
You must have clear workflows to detect and fix it
That means:
Enabling and regularly reviewing compliance/drift reports
Designing clusters with enough spare capacity for remediation runs
Avoiding “snowflake” hosts with special manual tweaks
In VCF, you do not just add a host directly into vCenter. You:
Commission a host into SDDC Manager (which checks hardware, firmware, image compatibility, networking)
Then assign it to a domain/cluster
Decommission is the reverse: safely evacuate workloads, remove from domain, and clean up config.
For design, this means:
Ensuring network and firmware standards are defined up front
Requiring new hardware to comply with vLCM images and VCF standards
Designing capacity so you can safely remove hosts if needed (for example for RMA or lifecycle)
In traditional environments, firmware was often updated using vendor tools, independently from ESXi. In VCF 9.x with vLCM:
Firmware and drivers are part of the cluster image
Updates happen together with ESXi remediation
Design considerations:
Choose hardware vendors that provide supported vLCM hardware support packages
Plan unified firmware + ESXi remediation windows
Avoid manual firmware updates that bypass vLCM and break image compliance
VCF can have multiple Workload Domains plus the Management Domain.
Design goals:
Keep domains within supported BOM versions
Avoid large version skews between domains that interact (for example management tools, backup, monitoring)
Define a lifecycle policy: which domains are upgraded first, and how often
This becomes especially important when you run mixed workloads (classic VMs and VKS clusters) across different domains.
When you enable vSphere with Tanzu, your network design must support both VM traffic and Kubernetes traffic. NSX is usually the CNI for pod networking in VCF-based designs.
The Supervisor Cluster:
Runs on a vSphere cluster
Exposes a Kubernetes API endpoint (usually via a virtual IP)
Uses NSX-backed networks for PodVMs and services
Design tasks:
Define management networks for Supervisor control plane VMs
Define workload networks for PodVMs and TKCs
Plan IP ranges and routing so developers can reach the API endpoint and services safely
As CNI, NSX:
Creates logical segments for pods (for example per Namespace or per TKC)
Handles overlay encapsulation (GENEVE) between hosts and edge nodes
Programs distributed firewall rules for pod-level isolation
Packet flow to understand:
Pod-to-pod in the same node: handled locally
Pod-to-pod between nodes: via overlay and distributed routing
Pod-to-external: via Tier-1 and Tier-0 gateways and potentially load balancers
Your design must ensure MTU, routing, and firewall policies all support these flows.
In VKS, “native pods” often run inside PodVMs:
Each PodVM is a small VM with its own vNICs
NSX treats them like other VMs at the networking layer
PodVMs may have separate interfaces for node communication vs workload traffic
Traffic separation design:
Management/control traffic (Supervisor, kube API)
Node infrastructure traffic
Application data traffic
You design segments, gateways, and firewall rules for each category.
Kubernetes needs:
Node CIDRs: IP ranges for nodes (VMs / PodVMs)
Pod CIDRs: IP ranges for pods themselves
Design principles:
Avoid overlap with existing corporate networks
Reserve enough space for future scaling
Keep CIDR planning simple and well-documented
Poor CIDR planning leads to routing conflicts that are hard to fix later.
Service CIDR:
An internal, virtual IP range used for ClusterIP services
Not routed outside the cluster
For design:
Choose a CIDR that does not overlap with real networks
Ensure DNS and CoreDNS can resolve service names correctly
Ingress network design:
Decide which subnets and IPs will front HTTP/HTTPS traffic
Configure NSX Load Balancer or NSX ALB to terminate or pass traffic
Ensure firewall rules allow traffic from clients to Ingress IPs
The NSX load balancer:
Provides LoadBalancer Service support for Kubernetes
Integrates with Ingress controllers
Routes external requests to pods via NodePort or direct pod endpoints
Design questions:
How many load balancer instances and edges do you need?
How will VIPs be allocated and announced (BGP, static routes)?
Do you need L7 features such as URL-based routing or TLS offload?
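For context, the developer-facing side of these decisions is an ordinary LoadBalancer Service; the NSX or ALB integration allocates and announces the VIP. A minimal example (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-frontend          # hypothetical service
spec:
  type: LoadBalancer          # VIP allocated by the NSX / ALB integration
  selector:
    app: web-frontend         # example pod label
  ports:
    - port: 443               # VIP port exposed to clients
      targetPort: 8443        # container port receiving the traffic
      protocol: TCP
```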
North–south means in/out of the data center or cluster.
You must:
Design Tier-0 connectivity to the physical network (BGP or static)
Ensure service IPs (load balancer VIPs, Ingress IPs) are reachable from users
Consider edge cluster placement and high availability
The routing design must be robust enough so that losing a single edge or link does not break service.
Namespaces can be isolated through:
Kubernetes NetworkPolicies (for example default-deny and explicit allows)
NSX DFW rules mapped to Namespace labels or pod labels
Design patterns:
Per-Namespace default-deny with explicit rules for allowed traffic
Separate logical segments per team or application
Combining Namespace isolation with Workload Domain separation for stronger isolation in sensitive environments
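A common starting point is the per-Namespace default-deny pattern with explicit allows, sketched below (namespace and label names are illustrative):

```yaml
# Default-deny: blocks all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a            # hypothetical namespace
spec:
  podSelector: {}              # empty selector matches every pod
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow: ingress to the web tier from the ingress controller namespace only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-web
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      tier: web                # example label on the web tier pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # example source namespace
```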
Security and identity design defines how users authenticate, what they can access, and how workloads are protected.
Identity Federation:
Allows vSphere to delegate authentication to external identity providers (for example SAML or OIDC providers)
Supports modern multi-factor authentication and central identity policies
Design aspects:
Selecting the identity provider
Designing trust relationships and token lifetimes
Ensuring high availability for the identity provider
Kubernetes clusters (Supervisor and TKCs) can:
Use OIDC to validate tokens from an identity provider
Map token claims (for example groups) to Kubernetes RBAC roles
Design tasks:
Define which groups map to which Kubernetes roles
Plan token expiry and refresh behavior
Ensure network connectivity between clusters and the identity provider
RBAC in Kubernetes:
Uses roles and role bindings at Namespace scope
Grants fine-grained permissions (for example create Pods, read ConfigMaps)
A good design:
Defines standard roles per team type (developer, ops, read-only)
Uses group-based bindings rather than individual user bindings
Keeps permissions minimal while supporting normal workflows
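A minimal sketch of a group-based binding, assuming group claims arrive from the identity provider (the group, namespace, and binding names are examples):

```yaml
# Members of an IdP group receive the built-in "edit" role in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developers-edit
  namespace: team-a
subjects:
  - kind: Group
    name: "team-a-developers"        # group claim from the OIDC / identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                         # built-in role; scoped to one namespace by this binding
  apiGroup: rbac.authorization.k8s.io
```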
NSX can use identity information to drive firewalls:
User identities or AD group membership
VM or pod identities based on tags
Kubernetes labels mapped to security groups
Design usage:
Allow only specific groups to access management endpoints
Restrict east–west traffic between application tiers or tenants
Combine RBAC and DFW to provide defense-in-depth
Certificates:
Secure API endpoints (vCenter, NSX, K8s API, Ingress)
Are required for mTLS between services
Design must cover:
Certificate authorities (internal PKI vs external CA)
Renewal and rotation schedules
Automatic vs manual certificate management for K8s clusters and NSX
Ignoring certificate lifecycle leads to outages when certificates expire.
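One common way to automate rotation for workload certificates is cert-manager; this sketch assumes it is deployed and backed by an internal CA issuer (all names are hypothetical):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ingress-tls
  namespace: team-a
spec:
  secretName: ingress-tls      # Secret the signed certificate is written into
  duration: 2160h              # 90-day certificate lifetime
  renewBefore: 360h            # rotate 15 days before expiry
  dnsNames:
    - app.example.com          # example service hostname
  issuerRef:
    name: corporate-ca         # hypothetical ClusterIssuer backed by internal PKI
    kind: ClusterIssuer
```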
Policy-as-code:
Uses tools like OPA/Gatekeeper to evaluate Kubernetes resources against rules
Prevents “bad” manifests from being applied
Examples:
Require labels on all namespaces and workloads
Block privileged containers
Enforce that all services use TLS
Design work includes:
Defining the policy set
Rolling out gradually (audit mode first, then enforce)
Integrating with CI/CD pipelines for pre-deployment checks
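As an illustration, assuming the k8srequiredlabels ConstraintTemplate from the Gatekeeper examples is installed, a constraint can start in audit-only mode and later be switched to enforcement (label and constraint names are examples):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  enforcementAction: dryrun    # audit-only first; switch to "deny" once violations are cleaned up
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]          # example required label
```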
Container images are a major attack surface.
Design includes:
Limiting where images can be pulled from (approved registries)
Enforcing authentication to registries
Integrating image scanners to detect vulnerabilities
Possibly blocking deployment of images with critical vulnerabilities or from unknown sources
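A sketch of registry restriction, assuming the allowed-repos ConstraintTemplate from the Gatekeeper policy library is installed (the registry prefix is hypothetical):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: approved-registries-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "registry.example.com/"   # hypothetical approved internal registry prefix
```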
Here you design how the platform behaves across sites or regions and how you recover from major failures.
Stretched clusters:
Spread a single vSphere/vSAN cluster across two sites
Use a witness for quorum decisions
Design considerations:
Latency between sites must be within vSAN stretched-cluster limits (typically no more than 5 ms round-trip)
Sufficient bandwidth for synchronous replication
Witness location and sizing
Clear understanding of failure behavior (which site remains active)
You can increase Supervisor availability by:
Placing control plane VMs across fault domains or AZs
Ensuring persistent storage is resilient across zones
Designing networking so losing an AZ does not cut off the API endpoint
This is similar to multi-AZ design for traditional applications, but applied to the Kubernetes control plane.
TKCs hold their state in etcd:
Backing up etcd periodically protects you from control plane failures and misconfigurations
Restoring etcd allows you to recover the cluster state at a specific point in time
Design decisions:
Where to store backups (on-site, off-site)
How often to run backups
How to test restore procedures without impacting production
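At the workload-cluster level, these decisions are often implemented with a scheduled backup tool such as Velero, which captures cluster resources and volume data rather than raw etcd snapshots but answers the same where/how-often questions. A sketch, assuming Velero is installed with an off-site storage location configured:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # cron: daily at 02:00
  template:
    includedNamespaces: ["*"]    # back up all namespaces
    storageLocation: offsite-s3  # hypothetical off-site BackupStorageLocation
    ttl: 720h                    # retain backups for 30 days
```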
For stateful workloads:
VM-level replication or vSAN replication will move data
PVCs must be reattached or recreated at the DR site
StorageClasses and policies must exist at the target location
Design must ensure a PV created in site A has a recognizable and compatible counterpart in site B after failover.
Possible models:
Rebuild clusters at DR site from code and attach replicated data
Maintain warm standby TKCs at DR site and promote them on failover
Run active–active clusters in multiple regions and use GSLB
Each has different cost, complexity, RTO, and RPO characteristics.
You must pay attention to:
IP addressing (can you reuse same CIDRs, or must you re-IP?)
Load balancers and DNS failover (GSLB)
Latency and bandwidth between regions
Network security and compliance for cross-border data flows
Bad network design can make a DR plan impossible to execute.
Platform-level DR:
Moves entire clusters or VMs (for example via SRM)
Focuses on infrastructure continuity
Application-level DR:
Recreates applications and data using app-aware methods (for example database replication, GitOps)
Focuses on the app’s own resilience mechanisms
Most real-world designs use a mix of both, depending on each application’s needs.
VCF imposes specific requirements that affect hardware selection and cluster layout.
You must design:
Enough hosts for vSAN to meet FTT and performance needs
Enough hosts for HA to tolerate failures (for example N+1, N+2)
Enough hosts to support maintenance mode without risking capacity exhaustion
VCF often recommends at least four hosts per vSAN cluster in production scenarios.
Choosing between vSAN ESA and OSA impacts:
Hardware requirements (NVMe requirement for ESA)
Performance characteristics
How cache and capacity are structured
Migration strategies from OSA to ESA later
You should prefer ESA in modern designs where hardware allows it, but you must confirm compatibility.
Your design should:
Provide sufficient bandwidth for vSAN, vMotion, management, and overlay traffic
Use at least 25 GbE in most production-grade VCF deployments
Reserve enough bandwidth for peaks and resync operations
Low bandwidth results in contention between storage and application traffic.
If you use GPUs or DPUs:
Ensure they are on the VMware HCL and supported by vLCM images
Plan which clusters will host GPU workloads
Consider NUMA and PCIe topology for performance
Check integration with Kubernetes (for example device plugins)
SmartNICs/DPUs may offload NSX or storage functions; this affects how you design host roles and firmware updates.
Edges must handle:
Aggregate north–south throughput
Load balancer traffic
Routing and NAT operations
Design tasks:
Decide how many edges per edge cluster
Place edges across racks or AZs for redundancy
Size CPU and memory to handle peak traffic and failover scenarios
Application Virtual Networks (AVNs) are NSX logical networks used by management components or services.
Design includes:
Which AVNs are required for specific domains or solutions
How AVNs connect to the physical network
IP ranges and routing for AVN segments
These decisions affect how easily you can expand the platform later.
After designing, you must prove that the design is correct, complete, and supportable.
You should map:
Each functional requirement to one or more design elements
Each non-functional requirement (availability, security, performance) to specific mechanisms
If something cannot be traced back to a requirement, question whether it belongs. If a requirement is not satisfied by any design element, the design is incomplete.
Check that:
Every logical component (cluster, Namespace, network, storage policy) has a realistic physical implementation
Capacity models match actual hardware counts and capabilities
Network diagrams match switch, VLAN, and routing designs
This prevents “paper designs” that cannot be implemented.
You should:
Estimate current and projected usage (VM count, pod count, storage)
Model growth over 1–3 years
Determine when you need to add hosts, disks, or clusters
Tools like Aria Operations can help, but the design must include high-level assumptions and thresholds.
Ask:
What happens if a host fails? A rack? A whole site?
Does the design meet the RTO/RPO and SLA commitments in each case?
Are there enough spare resources to restart workloads elsewhere?
This analysis might lead you to adjust cluster sizes, vSAN policies, or DR strategies.
Before finalizing the design:
Confirm hardware is on the VMware HCL
Confirm software versions match the VCF BOM
Check interoperability between VMware products and third-party tools
Skipping this step leads to upgrade blocks and unsupported configurations.
During design, you identify risks (budget, skill gaps, hardware constraints). In validation, you:
Re-check that each risk has an appropriate mitigation
Decide if the residual risk is acceptable to the business
Document risk ownership (who accepts it)
Finally, check if the organization is ready to run what you have designed:
Are there enough trained staff?
Are monitoring, backup, and DR runbooks in place and tested?
Are support processes and escalation paths defined?
A technically perfect design can fail if operations are not prepared to support it.
What is the minimum host requirement for enabling a Supervisor Cluster in vSphere Kubernetes Service?
A minimum of three ESXi hosts is required.
Supervisor Clusters require high availability for the Kubernetes control plane components that run as virtual machines on ESXi hosts. To ensure resilience and proper scheduling of these control plane nodes, VMware requires at least three ESXi hosts in the cluster. This design ensures that the control plane remains operational if one host fails. The cluster must also meet networking and storage requirements before Kubernetes can be enabled. In exam scenarios, host count requirements are frequently tested because they affect deployment feasibility and platform resilience.
Demand Score: 84
Exam Relevance Score: 90
Why are vSphere namespaces used in Kubernetes-enabled vSphere environments?
vSphere namespaces provide logical isolation and resource management for Kubernetes workloads.
Namespaces act as a boundary within the vSphere Kubernetes environment that controls access permissions, resource quotas, and policies for developers or teams. Administrators can assign CPU, memory, and storage limits while integrating identity and access management through vCenter. Namespaces also allow mapping Kubernetes resources to vSphere constructs such as storage policies and network segments. This design ensures that multiple teams can safely share the same infrastructure while maintaining security and resource fairness.
Demand Score: 79
Exam Relevance Score: 88
What factor is most important when sizing Tanzu Kubernetes Clusters?
Expected workload resource requirements such as CPU, memory, and storage consumption.
When designing Tanzu Kubernetes Clusters, administrators must evaluate the resource demands of the containerized applications that will run inside the cluster. This includes CPU usage, memory requirements, storage throughput, and network bandwidth. These requirements influence node sizes, worker node counts, and storage policies. Proper sizing ensures performance, scalability, and high availability for production workloads.
Demand Score: 73
Exam Relevance Score: 84
What is the purpose of resource quotas in vSphere namespaces?
Resource quotas limit the amount of compute and storage resources that a namespace can consume.
Resource quotas prevent a single team or application from consuming excessive infrastructure resources. Administrators define limits on CPU, memory, storage, and number of objects that can be deployed within the namespace. This ensures fair resource allocation among multiple development teams and improves infrastructure stability.
Demand Score: 70
Exam Relevance Score: 82
Why should storage policies be considered during Kubernetes platform design?
Storage policies determine performance, availability, and placement characteristics for Kubernetes persistent volumes.
vSphere storage policies define how storage resources are provisioned and managed for workloads. When Kubernetes persistent volumes are created, they map to vSphere storage policies that control attributes such as replication, performance tiers, and datastore placement. Selecting appropriate policies ensures that container workloads receive the correct storage performance and resilience. Misconfigured policies can lead to poor application performance or insufficient redundancy.
Demand Score: 68
Exam Relevance Score: 83