Design methodology is how you think as an architect.
If you get this thinking right, a lot of exam questions become easier.
These four words appear everywhere in design questions.
You must be able to recognize them in a scenario and use them in your decisions.
Requirements
Requirements are what the solution must deliver.
Business requirements
Come from the business, not from IT
Examples:
“The HR system must be available 99.9% of the time.”
“We must keep customer data for at least 7 years.”
“The system must support 2,000 concurrent users.”
Technical requirements
Derived from business requirements, but more specific and technical
Examples:
“The platform must support vSphere HA.”
“We must provide 4 TB of usable storage for the CRM database.”
“All management traffic must be separated from VM traffic.”
Regulatory / compliance requirements
Come from laws or standards:
“Data must be encrypted at rest.”
“Data must not leave the country.”
Functional vs Non-functional
Functional requirements → what the system does
Non-functional requirements (NFRs) → how well it does it
“Login response time must be < 2 seconds in 95% of cases.”
“The system must be available 99.99% of the time.”
In exam scenarios, requirements are non-negotiable: you must design to meet them.
Constraints
Constraints are limitations you must respect. You cannot change them (or only with great difficulty).
Examples:
Budget:
“The total hardware budget is fixed and cannot be increased.”
Timeline:
“The solution must be in production by a fixed go-live date.”
Technology:
“We must use vendor X storage because it’s already purchased.”
“We must use vSphere 8 because of support policy.”
Organization:
“The existing operations team must be able to manage the platform with current staffing.”
Constraints may restrict your design choices. For example:
Requirement: “We want 99.99% availability.”
Constraint: “We only have one data center.”
→ You cannot design active-active multi-site; you must find the best within that limitation.
Assumptions
Assumptions are things you believe to be true, but they’re not confirmed.
Examples:
“We assume that the network team will provide 10 GbE uplinks.”
“We assume that all workloads can be migrated to vSphere without changes.”
“We assume that backups will be handled by the backup team with their existing solution.”
Good practice:
Document assumptions clearly
Validate them with stakeholders as early as possible
If an assumption turns out to be wrong, the design may need to change
In exams, you may be shown an assumption and asked how to handle it (for example, validate it with stakeholders or track it as a risk).
Risks
Risks are potential bad things that might happen in the future and affect your solution.
Examples:
“There is a risk that the single storage array may fail, causing a major outage.”
“There is a risk that the network team cannot deliver 10 GbE on time.”
“There is a risk that the growth of data will be higher than expected.”
Each risk should have:
Probability → how likely it is (low/medium/high)
Impact → how bad it is if it happens (low/medium/high)
Response:
Mitigate → reduce probability or impact
Avoid → change design to remove the risk
Transfer → e.g., insurance, support contracts
Accept → consciously accept, usually documented and approved
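The probability/impact/response model above can be sketched as a tiny risk-scoring helper. This is an illustrative sketch, not a VMware tool; the function names, levels, and the threshold of 4 are assumptions chosen for the example.

```python
# Illustrative risk-register sketch: qualitative probability x impact matrix.
# LEVELS, risk_score, and the threshold are assumptions for this example.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def risk_score(probability: str, impact: str) -> int:
    """Combine qualitative probability and impact into a 1-9 score."""
    return LEVELS[probability] * LEVELS[impact]

def needs_attention(probability: str, impact: str, threshold: int = 4) -> bool:
    """Risks at or above the threshold should be mitigated or avoided,
    not simply accepted."""
    return risk_score(probability, impact) >= threshold

# Example: single storage array failure -- unlikely, but severe if it happens.
print(risk_score("low", "high"))          # 3
print(needs_attention("medium", "high"))  # True
```

In practice the same idea is usually expressed as a risk matrix in the design document rather than code; the point is that probability and impact together drive the chosen response.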
In the exam, you may see a list of statements and must classify them as Requirement / Constraint / Assumption / Risk or choose which risks need mitigation.
Architects usually think in three “layers” of design.
You start high-level and get more detailed as you go.
Conceptual design
Conceptual design answers:
“What is the solution going to do, for whom, without any vendor-specific details?”
Characteristics:
High-level view
Focus on:
Users
Major systems
Data flows
Business capabilities
Examples of conceptual statements:
“Provide a highly available virtual infrastructure for production workloads.”
“Separate management workloads from business workloads.”
“Allow remote branch offices to run local workloads with central management.”
No mention of:
vSphere version
CPU models
Storage vendors
VLAN IDs
These details come later.
Logical design
Logical design answers:
“How will the solution be structured in terms of components and relationships, still without exact hardware or product SKUs?”
Characteristics:
More detailed than conceptual
Still technology-agnostic in terms of specific part numbers, but you can mention technologies (vSphere, vSAN, etc.)
Focus on:
How many clusters
Host roles
Logical network layout
Logical storage layout
Example logical statements:
“We will have 3 vSphere clusters: Management, Production, DMZ.”
“Each cluster will contain 6 ESXi hosts.”
“A vSAN datastore will be used for management workloads.”
“Separate port groups will be used for management, vMotion, and VM traffic.”
No mention of:
“Dell PowerEdge R750 with 2 × Intel Xeon XYZ”
“VLAN ID 1234 for vMotion”
Those are physical design details.
Physical design
Physical design answers:
“Exactly what will be implemented and how? Which hardware, which settings, which IDs?”
Characteristics:
Concrete and implementable
Includes vendor and model details
Contains configuration values
Examples:
“Use 6 × Dell PowerEdge R750 servers, each with 2 × 16-core CPUs and 512 GB RAM.”
“Use vSphere 8.0 U2 with vCenter Server Appliance.”
“Use VLAN 10 for management, VLAN 20 for vMotion, VLAN 30 for vSAN.”
“Create a vSAN datastore with RAID-1 FTT=1 for management VMs.”
“Use RAID 10 on the storage array with 8 × 1.92 TB SSDs.”
Conceptual → Logical → Physical in practice (simple example)
Imagine the requirement: “Production workloads must be highly available; test workloads can tolerate downtime.”
A possible chain:
Conceptual:
“Provide a highly available platform for production workloads and a cost-effective platform for test workloads.”
Logical:
“Create two clusters: Production (HA enabled) and Test (HA optional).”
“Production cluster will host critical line-of-business VMs; Test cluster will host development/test VMs.”
Physical:
“Production cluster: 6 ESXi hosts, each with 2 × 12-core CPUs, 256 GB RAM.”
“Test cluster: 3 ESXi hosts, each with 2 × 8-core CPUs, 128 GB RAM.”
“vSAN all-flash for Production, NFS datastore for Test.”
In the exam, you may be asked to map items between these layers or identify what type a given statement belongs to.
Capacity planning & sizing answers the question:
“How big does the environment need to be to support all workloads now and in the future?”
You don’t want to design something too small (performance problems) or too big (waste of money).
Before sizing, you must understand the workloads.
Collect existing workload metrics
Typical metrics:
CPU usage
Average usage over time (MHz or %)
Peak usage during busy hours
Helps estimate vCPU requirements
Memory consumption
Active memory, not just configured memory
Some VMs allocate 16 GB but only use 4 GB; that matters
Storage IOPS and latency
IOPS = how many input/output operations per second
Latency = how long each operation takes
Throughput (MB/s) for large sequential workloads
Network throughput and patterns
Average and peak traffic
East–west (between VMs) vs north–south (VM ↔ external)
How to collect these in real life (conceptually):
Performance statistics from existing hypervisors
Monitoring tools (e.g., Aria Operations, other monitoring systems)
OS-level tools (Perfmon, top, sar)
Group workloads
You rarely treat all workloads the same. You group them by:
Criticality
Tier 1 (mission critical)
Tier 2 (important)
Tier 3 (non-critical)
Performance profile
CPU-heavy (analytics, compute jobs)
Memory-heavy (in-memory DBs)
Storage-heavy (databases, file servers)
Network-heavy (proxies, gateways)
Environment
Production
UAT
Development / Test
Why group? Because each group gets different sizing, availability, and placement decisions, so you avoid over-building for non-critical workloads.
Once you understand workloads, you estimate how many hosts and how much resource per host.
Decide: number of hosts per cluster
Factors:
Total CPU and memory needed (from workload analysis)
Overcommit ratios (how much sharing you accept)
HA goals (N+1, N+2)
Maintenance requirements
Example idea (simplified):
Total CPU needed after growth: X GHz
Each host provides: Y GHz
You need H hosts such that:
H × Y ≥ X, even after you lose 1 or 2 hosts
N+1 / N+2 policies
N+1 → enough spare capacity to tolerate the failure of 1 host
N+2 → enough spare capacity to tolerate the failure of 2 hosts
Example:
You decide you need 5 hosts worth of capacity to run all workloads.
For N+1, you design 6 hosts
For N+2, you design 7 hosts
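The host-count reasoning above reduces to simple arithmetic. A minimal sketch, using made-up GHz figures that reproduce the 5-host example:

```python
import math

def hosts_required(total_ghz_needed: float, ghz_per_host: float, n_plus: int = 1) -> int:
    """Hosts needed so the cluster still covers the workload after n_plus host failures."""
    base = math.ceil(total_ghz_needed / ghz_per_host)  # hosts' worth of capacity
    return base + n_plus                               # add failover headroom

# 5 hosts' worth of capacity needed (e.g. 300 GHz workload, 60 GHz per host):
print(hosts_required(300, 60, n_plus=1))  # 6  -> N+1
print(hosts_required(300, 60, n_plus=2))  # 7  -> N+2
```

Real sizing also factors in memory, overcommit targets, and per-host limits, but the N+1/N+2 structure is the same.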
Exam questions often give an SLA such as “the cluster must tolerate one host failure with no loss of performance” and expect you to translate it into an N+1 host count.
Maintenance overhead
Besides failure capacity, you must also support:
Patch windows
Hardware replacement
Upgrades
Often you design so that one host can be taken down for patching or maintenance without losing the failover capacity (which effectively pushes N+1 toward N+2).
Overcommitment ratios
This is about how you share resources in the cluster.
vCPU to pCPU ratio targets
vSphere allows CPU overcommit
Safe ratios depend on workload type
As a rough conceptual idea:
Light workloads: high ratio (e.g., 6:1 or more)
Heavy workloads: low ratio (e.g., 1:1 to 3:1)
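The ratio reasoning translates directly into a core count. A minimal sketch with assumed vCPU totals:

```python
import math

def physical_cores_needed(total_vcpus: int, ratio: float) -> int:
    """Physical cores needed for a target vCPU:pCPU overcommit ratio."""
    return math.ceil(total_vcpus / ratio)

# 600 vCPUs of light workloads at 6:1 vs heavy workloads at 2:1:
print(physical_cores_needed(600, 6.0))  # 100
print(physical_cores_needed(600, 2.0))  # 300
```

Note how the same vCPU demand triples the required hardware when the workload profile forces a conservative ratio; this is why grouping workloads before sizing matters.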
For design questions, you don’t usually calculate exact ratios but should know:
Too much overcommit → CPU contention (high ready time)
Right-sizing VMs is critical
Memory overcommit thresholds
vSphere can overcommit memory using TPS, ballooning, compression, swapping
Swapping must be avoided in design because it kills performance
Designers often target utilization levels like “70% memory usage at steady state” to provide headroom
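The “70% at steady state” headroom target can be applied as a simple sizing rule. A sketch with assumed numbers:

```python
import math

def hosts_for_memory(active_memory_gb: float, ram_per_host_gb: float,
                     target_utilization: float = 0.70) -> int:
    """Hosts needed so steady-state memory use stays at the target utilization."""
    usable_per_host = ram_per_host_gb * target_utilization  # e.g. 512 GB -> 358.4 GB
    return math.ceil(active_memory_gb / usable_per_host)

# 2000 GB of active memory across all VMs, 512 GB hosts, 70% target:
print(hosts_for_memory(2000, 512))  # 6
```

Sizing against active memory (not configured memory) with headroom is what keeps the design out of ballooning and swapping territory.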
Storage capacity and policies
You size datastores based on:
Raw capacity
Usable capacity after RAID, FTT, overhead
Growth expectations
For vSAN specifically:
Storage policies (e.g., FTT=1, RAID-1, RAID-5) increase required raw capacity
Example:
To deliver 1 TB of usable capacity with FTT=1 and RAID-1 (mirroring), you need roughly 2 TB of raw capacity.
Designers must always think:
“How much raw capacity do I need to meet usable capacity under given policies?”
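The usable-to-raw conversion can be expressed as a lookup of policy multipliers. The multipliers below are the standard approximations for vSAN policies; real sizing must also account for slack space, metadata overhead, and any deduplication/compression.

```python
# Approximate raw-capacity multipliers per vSAN storage policy.
POLICY_MULTIPLIER = {
    ("FTT=1", "RAID-1"): 2.0,   # full mirror
    ("FTT=1", "RAID-5"): 1.33,  # 3+1 erasure coding
    ("FTT=2", "RAID-1"): 3.0,   # two extra mirror copies
    ("FTT=2", "RAID-6"): 1.5,   # 4+2 erasure coding
}

def raw_capacity_tb(usable_tb: float, ftt: str, raid: str) -> float:
    """Raw capacity required to deliver the given usable capacity under a policy."""
    return usable_tb * POLICY_MULTIPLIER[(ftt, raid)]

print(raw_capacity_tb(1.0, "FTT=1", "RAID-1"))            # 2.0
print(round(raw_capacity_tb(10.0, "FTT=1", "RAID-5"), 1))  # 13.3
</antml>```

The same usable capacity can require very different raw capacity depending on policy, which is why the storage policy decision belongs in the design, not as an afterthought.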
This part answers:
“How do we keep the system running, even when things break?”
Configure HA clusters to meet SLAs
SLA might say: “The service must be available 99.9% of the time.”
Your design uses:
vSphere HA
Sufficient host redundancy (N+1, N+2)
Redundant networking
Redundant storage paths
You choose HA settings (admission control, isolation response, etc.) that align with business requirements.
Admission control based on failures to tolerate and workload criticality
If the business says, “We must tolerate 1 host failure,” you pick a policy that keeps enough resources free for 1 host failover
If critical workloads are few, you may use:
VM-level reservations or priority for those VMs
Specific clusters just for critical workloads
Decide: FT vs HA vs application clustering
FT (Fault Tolerance):
For highest criticality and small VMs
Zero downtime, zero data loss for supported failure types
HA:
For most workloads
Good compromise: some downtime (reboot time), no manual action
Application-level clustering (e.g., MSCS/WSFC, Oracle RAC):
Protects at the application layer
Often better for complex apps, databases
Rule of thumb:
Use HA as the default
Use FT for small, ultra-critical services where reboot is not acceptable
Use app clustering when application vendors recommend or require it
Business continuity ≈ “How do we keep the business running during major failures?”
Disaster recovery (DR) is the technical part of that.
Define RPO and RTO
RPO (Recovery Point Objective):
How much data can we lose during a disaster?
Example: RPO = 15 minutes → replication at least every 15 minutes
RTO (Recovery Time Objective):
How long can the service be down?
Example: RTO = 4 hours → must bring the service back within 4 hours
These strongly influence:
DR technology choice
Replication frequency
Level of automation
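The RPO/RTO definitions above are easy to check mechanically: worst-case data loss equals the replication interval, and recovery time must fit inside the RTO. A minimal sketch with assumed figures:

```python
def meets_rpo(replication_interval_min: float, rpo_min: float) -> bool:
    """Worst-case data loss equals the replication interval."""
    return replication_interval_min <= rpo_min

def meets_rto(failover_time_hours: float, rto_hours: float) -> bool:
    """Measured or estimated failover time must fit inside the RTO."""
    return failover_time_hours <= rto_hours

print(meets_rpo(replication_interval_min=15, rpo_min=15))  # True
print(meets_rto(failover_time_hours=6, rto_hours=4))       # False -> need more automation
```

A failed RTO check like the second one is what typically drives the move from manual recovery procedures to orchestrated failover (e.g., SRM).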
Decide scope: what is protected?
Options:
Entire sites (all workloads)
Specific business-critical applications
Only some databases
Protecting everything is expensive. Often you prioritize Tier 1 workloads.
Technologies
vSphere Replication:
Per-VM replication
Asynchronous
Array-based replication:
Replication performed by the storage array itself
Can be synchronous or asynchronous, depending on the array and the distance between sites
SRM (Site Recovery Manager):
Orchestrates failover and failback
Uses VR or array replication
Automates:
Boot order
IP changes
Network mappings
Third-party tools may also be used, but for the exam, these three are key.
DR runbooks
Runbooks define step-by-step procedures:
Start order of VMs
Dependencies (DB first, then app servers, then web servers)
Network mappings (old IP vs new IP)
Manual steps if needed
Failback steps after the primary site is restored
In real life, DR that is not documented and tested usually fails.
In design exams, you should always think about documentation and testing as part of the solution.
NUMA awareness
Modern servers have multiple NUMA nodes
Best practice: size VMs so their vCPUs and memory fit within a single NUMA node where possible
Very large VMs that span NUMA nodes may have reduced performance (remote memory access adds latency)
Design implications:
Do not oversize VMs unnecessarily
Understand the physical NUMA layout of hosts
Reservations, limits, shares
These are vSphere resource controls.
Reservation:
Minimum guaranteed resources
Pros: ensures critical VMs get what they need
Cons: reserved resources cannot be used by others
Limit:
Hard maximum cap
Dangerous: if set too low, you artificially throttle a VM even when resources are available
Shares:
Relative priority under contention
High shares = VM gets larger portion of resources if contention occurs
Design guidelines:
Use reservations sparingly and usually only for key VMs
Avoid limits unless you have a strong reason
Use shares to express business priority (Tier 1 > Tier 2 > Tier 3)
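Shares only matter under contention, where resources are divided in proportion to each VM's share value. A sketch using vSphere's default per-vCPU share values (High = 2000, Low = 500); the 10 GHz contended pool is an assumption for the example:

```python
def contention_allocation(shares: dict, total_ghz: float) -> dict:
    """Under contention, CPU is divided in proportion to each VM's shares."""
    total = sum(shares.values())
    return {vm: total_ghz * s / total for vm, s in shares.items()}

# Tier-1 VM with High shares (2000) vs Tier-3 VM with Low shares (500),
# competing for 10 GHz of contended CPU:
print(contention_allocation({"tier1": 2000, "tier3": 500}, 10.0))
# {'tier1': 8.0, 'tier3': 2.0}
```

This is why shares are the right tool for expressing business priority: they cost nothing when resources are plentiful and only bite when the cluster is actually contended.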
IOPS and latency requirements
Some apps (databases) need high IOPS and low latency
Others (archive servers) can live with low performance
You must match application storage requirements (IOPS, latency, throughput) to the design parameters below.
Design parameters
RAID levels:
RAID 10 → good performance, lower usable capacity
RAID 5/6 → more capacity, slower writes
Queue depths:
At the HBA (host bus adapter) and storage array
Too small → bottlenecks
Too big → can overload the array
Multipathing policies:
Round Robin
Fixed
Most Recently Used (MRU)
Choose policies based on storage vendor best practices.
MTU (Maximum Transmission Unit)
Standard MTU: 1500 bytes
Jumbo frames: 9000 bytes (commonly)
Larger MTU reduces CPU overhead and can improve throughput for:
vMotion traffic
vSAN traffic
Storage traffic (iSCSI, NFS)
But:
All devices along the path must support it
MTU mismatches can cause strange problems
NIC teaming policies
Examples:
Route based on originating port ID: each VM vNIC is pinned to one uplink; simple and requires no physical switch configuration
Route based on IP hash: distributes traffic by source/destination IP; requires EtherChannel/LACP on the physical switch
Route based on physical NIC load (LBT): distributed switch only; moves traffic off an uplink when it exceeds roughly 75% utilization
Design must respect:
Switch capabilities
Redundancy needs
Type of traffic
Traffic segmentation
Good practice: separate traffic types:
Management
vMotion
vSAN
Storage (iSCSI/NFS)
VM networks
Methods:
Different VLANs
Different port groups
Possibly different physical NICs or uplinks
This improves:
Performance
Security
Troubleshooting
Security is not an afterthought; it is part of design from the beginning.
vCenter SSO
SSO provides centralized authentication
Connects to identity sources like:
Active Directory
LDAP directories
This allows users to log in to vCenter with domain accounts.
RBAC (Role-Based Access Control)
Define roles (sets of permissions)
Assign roles to groups or users at specific scopes
Principle: least privilege (grant only the permissions required for the role)
Example roles:
VM admin (can manage VMs, but not hosts)
Storage admin (can manage datastores, not VMs)
Read-only (can view, not change)
ESXi host lockdown modes
Restricts direct login to hosts
Forces use of vCenter for management
Reduces attack surface
Disable unnecessary services
SSH
ESXi shell
Any management service not actively needed
Fewer open ports → less risk.
Network isolation
Separate management networks
Separate vMotion, vSAN, storage traffic
Use firewalls and NSX micro-segmentation to enforce security policies
Micro-segmentation:
Apply firewall rules at VM level
Example:
Web tier can talk to app tier
App tier can talk to DB tier
But web tier cannot talk directly to DB tier
This helps contain attacks and lateral movement.
vSphere VM encryption
Encrypts VM files (VMDKs, etc.)
Protects data at rest
Needs a Key Management Server (KMS) integration
vSAN encryption
Encrypts data at the vSAN datastore level
Can be data-at-rest or data-in-transit encryption (depending on version)
Also uses KMS
Compliance requirements
Standards like PCI DSS, GDPR, HIPAA, etc., may require:
Encryption
Access control and audit logs
Data locality (where data is stored)
Retention and deletion policies
Designers must map these requirements to:
vSphere capabilities (encryption, RBAC)
Network controls (NSX, firewalls)
DR and backup designs
In real enterprise architectures, not all requirements can be satisfied fully. Prioritization ensures the architecture focuses on what delivers the highest business value.
A commonly used method is the MoSCoW model:
Must
Mandatory requirements that must be satisfied for the solution to be accepted. Failure to meet a Must requirement means the design is invalid.
Should
Important requirements but not fundamental. If necessary, trade-offs can be made.
Could
Nice-to-have requirements that improve usability or convenience but are not essential.
Won’t
Out-of-scope items, intentionally excluded to avoid scope creep.
Architects use this method for structured decision-making and to guide discussions when trade-offs are required.
Conflicts among requirements are common. Examples include:
High availability versus limited budget
High performance versus limited hardware
Security versus operational simplicity
Scalability versus short timelines
Architects must evaluate relative priority, identify constraints, and choose the option that best aligns with business goals rather than purely technical preferences.
Trade-offs appear throughout design decisions. Common examples include:
Choosing RAID-10 for performance versus RAID-5/6 for cost efficiency
Choosing all-flash vSAN for latency-sensitive workloads versus hybrid vSAN to reduce cost
Applying strict network segmentation to enhance security, even if it increases operational complexity
In VMware design exams, the correct answer usually reflects the best alignment with business priorities (requirements + constraints), not necessarily the technically “best” product or feature.
A design decision defines how a specific architectural aspect is implemented. A well-formed design decision should include:
The option chosen
The alternatives considered
The justification for the choice
The impact (positive or negative) on requirements, cost, operations, and constraints
Strong justification is a key differentiator in VMware design exams.
A best-practice format:
Decision: Use vSAN all-flash as the primary storage for the production cluster.
Alternatives: External SAN, NFS.
Justification: Meets performance SLAs, offers policy-based management, integrates with lifecycle and automation tooling, and simplifies operations.
Impact: Higher initial cost, requires NVMe-based cache devices.
A complete design must show the architect has critically evaluated the options rather than choosing arbitrarily.
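The decision format above maps naturally onto a structured record. This is an illustrative sketch (the class and field names are ours, not a VMware artifact) showing how each decision carries its alternatives, justification, and impact together:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DesignDecision:
    """Illustrative record mirroring the decision format in the text."""
    decision: str
    alternatives: List[str]
    justification: str
    impact: str

d = DesignDecision(
    decision="Use vSAN all-flash as primary storage for the production cluster",
    alternatives=["External SAN", "NFS"],
    justification="Meets performance SLAs; policy-based management; simpler operations",
    impact="Higher initial cost; requires NVMe-based cache devices",
)
print(d.decision)
```

Keeping alternatives and impact as mandatory fields forces the architect to show the options were actually evaluated, which is exactly what reviewers (and exam graders) look for.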
Design-oriented questions often ask:
Which option represents the best design decision?
Which justification is most aligned with the given requirement set?
Correct answers follow requirements, constraints, and operational realities—not personal preference.
A complete design document typically includes:
Executive summary
Detailed RCAR (requirements, constraints, assumptions, risks)
Conceptual, logical, and physical designs
Security and compliance considerations
Operational considerations (patching, monitoring, backup)
Migration/integration plans
Diagrams, IP plans, VLAN matrices, cluster architecture details
Good documentation increases maintainability and ensures clear communication across teams.
Validation confirms that a design meets business and technical requirements. Methods may include:
Proof of Concept testing
Pilot deployments in controlled environments
Performance benchmarking
HA/DR failover testing
Stakeholder review and sign-off
User acceptance testing
Validation is mandatory because assumptions and theoretical sizing must be proven under real conditions.
Typical exam-style questions include:
Which activity validates that the design meets the performance requirement?
When a new risk is discovered, which design artifact must be updated?
Understanding documentation structure and validation steps is essential for correct answers.
A good design includes a full migration plan to transition from the current-state environment to the future-state architecture.
Common strategies include:
Cold migration (VM powered off)
vMotion or Storage vMotion for live migrations
Rolling cluster upgrades or replacement
Swing host / swing cluster techniques
HCX-assisted migration, including live bulk movement across sites
SRM-assisted planned migration
Choosing the correct method depends on workload criticality, downtime tolerance, and network/storage compatibility.
A structured migration plan typically includes:
Assessment and discovery
Pilot migration to validate tooling and cutover steps
Phased batch migrations
Final cutover window
Tested rollback plan
Post-migration validation and acceptance
Each step reduces operational risk and ensures predictable outcomes.
Typical risks include:
Application incompatibility with new hypervisor versions
Network/IP changes impacting communication
Cross-version or cross-storage compatibility issues
Storage format changes (VMFS version, vSAN policy differences)
Performance degradation during migration windows
Exam questions often ask which migration option best satisfies downtime, network, or compatibility constraints.
The design must reflect what is required to support and run the environment after implementation. Key operational areas include:
Monitoring and alerting policies
Log retention and analysis requirements
Capacity forecasting and reporting
Backup and recovery requirements
Patch and firmware management cycles
Vulnerability and configuration compliance cycles
Operational requirements influence cluster layout, networking choices, and lifecycle tooling.
Designers must plan for operational tooling, even if specific vendors are not named.
Categories include:
Performance monitoring and anomaly detection platforms
Centralized log aggregation or SIEM systems
Configuration or drift management tooling
Lifecycle patching/orchestration tools
The goal is operational consistency and reduced manual intervention.
Common exam questions:
Which design supports the monitoring and alerting requirement?
Which operational constraint impacts the host-count or cluster-size decision?
Correct answers depend heavily on understanding operational maturity and organizational constraints.
Designs must satisfy future-state growth, not only current-state requirements. Projections typically evaluate:
3- or 5-year capacity needs
Data center rack space, power, and cooling
Cluster scalability limits (hosts, VMs, memory)
vCenter and SSO domain scalability
Storage growth, IOPS growth, network bandwidth needs
Ignoring growth results in short-lived architectures.
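Growth projections are usually simple compound-growth arithmetic. A sketch with an assumed 20% annual data growth rate over a 5-year horizon:

```python
def projected_capacity(current: float, annual_growth: float, years: int) -> float:
    """Compound growth: capacity needed after `years` at `annual_growth` (e.g. 0.20 = 20%)."""
    return current * (1 + annual_growth) ** years

# 100 TB today, 20% annual data growth, 5-year horizon:
print(round(projected_capacity(100, 0.20, 5), 1))  # 248.8
```

Roughly 2.5x growth over five years at 20% per year is why a design sized only for current-state demand becomes a short-lived architecture.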
Architects must plan for the full lifecycle of hardware and software:
ESXi and vCenter upgrade paths
Deprecation/removal of older features (for example, SD cards for ESXi boot)
Hardware lifecycle and warranty cycles
VMware Compatibility Guide (VCG) dependencies
Interoperability between versions of vSphere, NSX, vSAN, and other components
This includes ensuring that all layers of the stack have valid upgrade routes over the next several years.
Common lifecycle risks include:
Hardware nearing end-of-support or end-of-life
CPU generations no longer supported by new ESXi versions
Firmware incompatibility with desired vSphere releases
vCenter and ESXi version mismatches in multi-site architectures
Legacy storage or network protocols approaching deprecation
A robust design identifies these risks early and incorporates mitigation strategies.
What is a key design principle for scalable VCF automation?
Use modular and reusable automation components.
Designing reusable blueprints and workflows ensures scalability and reduces duplication. Modular designs allow updates without impacting the entire system. A common mistake is creating monolithic automation workflows that are difficult to maintain and scale.
How should workload domains influence automation design?
Automation should be aligned with workload domain boundaries to maintain isolation and policy control.
Each workload domain in VCF represents a logical separation of resources. Automation must respect these boundaries to avoid policy conflicts and ensure security. Ignoring domain separation can lead to misconfigurations and governance issues.
What is a common mistake in VCF automation design?
Over-centralizing automation without considering domain-specific requirements.
Centralized automation can simplify management but may ignore unique needs of different domains. This leads to inflexible designs and operational bottlenecks. Proper balance between central governance and domain autonomy is critical.
Why is policy-based design important in VCF automation?
It ensures consistent governance and automated enforcement of standards.
Policies define how resources are provisioned and managed. Integrating them into automation prevents drift and enforces compliance. A common oversight is applying policies manually instead of embedding them into automation workflows.