HPE1-H05 Disaster Recovery and Advising

Detailed list of HPE1-H05 knowledge points

Disaster Recovery and Advising Detailed Explanation

1. DR Concepts and Strategy

This is the foundation: understanding what the business needs and what the technology can deliver.

1.1 RPO and RTO

RPO (Recovery Point Objective): maximum acceptable data loss

RPO answers the question:
“How much data can we afford to lose if a disaster happens?”

It is always measured in time.
Examples:

  • RPO = 0 seconds → no data loss allowed (requires synchronous replication).

  • RPO = 5 minutes → acceptable to lose up to 5 minutes of changes.

  • RPO = 24 hours → losing 1 day of data is acceptable (typical for non-critical systems).

If your last backup or replication completed 1 hour before the disaster, you have potentially lost 1 hour of data, so that design can only satisfy an RPO of 1 hour or longer.
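
This arithmetic can be sketched directly. A minimal illustration in Python; the dates and helper names are invented for the example:

```python
from datetime import datetime, timedelta

def potential_data_loss(last_copy_time, disaster_time):
    """Worst-case data loss equals the time since the last good copy."""
    return disaster_time - last_copy_time

def meets_rpo(last_copy_time, disaster_time, rpo):
    """A design meets its RPO only if the copy interval never exceeds it."""
    return potential_data_loss(last_copy_time, disaster_time) <= rpo

disaster = datetime(2024, 6, 1, 12, 0)
last_copy = datetime(2024, 6, 1, 11, 0)   # last replication one hour earlier

loss = potential_data_loss(last_copy, disaster)
print(loss)                                            # 1:00:00 at risk
print(meets_rpo(last_copy, disaster, timedelta(minutes=5)))   # False
print(meets_rpo(last_copy, disaster, timedelta(hours=24)))    # True
```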

RTO (Recovery Time Objective): maximum acceptable downtime

RTO answers:
“How long can the system be down before business suffers too much?”

Examples:

  • RTO = seconds → needs automatic failover (cluster, stretched cluster).

  • RTO = minutes → fast recovery (replication + orchestrated failover).

  • RTO = hours → manual restore from backup.

  • RTO = days → some archival systems with low business impact.

RTO strongly influences the DR architecture (cluster vs backup restore).

Mapping workloads to different RPO/RTO tiers

Because not all workloads are equal, you classify them:

  • Tier 0 (mission-critical)

    • RPO = near-zero

    • RTO = near-zero

    • Example: payment systems, trading systems, hospital systems

  • Tier 1 (business-critical)

    • RPO = minutes

    • RTO = minutes to an hour

    • Example: ERP, CRM, some databases

  • Tier 2 (important but not urgent)

    • RPO = hours

    • RTO = hours

    • Example: intranet, reporting systems

  • Tier 3 (non-critical)

    • RPO = 24 hours

    • RTO = >24 hours

    • Example: archives, test systems

This is essential because DR costs increase as RPO/RTO decrease.
You don’t want to spend Tier-0 budget on Tier-3 workloads.
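
The tiering above can be captured as a small lookup. This sketch is illustrative, not an official classification; the exact thresholds are assumptions taken from the list above:

```python
from datetime import timedelta

# Delivered (RPO, RTO) per tier, mirroring the list above (values illustrative)
TIERS = {
    "Tier 0": (timedelta(0),         timedelta(0)),        # near-zero
    "Tier 1": (timedelta(minutes=5), timedelta(hours=1)),
    "Tier 2": (timedelta(hours=4),   timedelta(hours=8)),
    "Tier 3": (timedelta(hours=24),  timedelta(hours=48)),
}

def classify(tolerated_rpo, tolerated_rto):
    """Pick the cheapest tier that still delivers within what the workload tolerates."""
    for name in reversed(list(TIERS)):     # check Tier 3 first: cheapest wins
        rpo, rto = TIERS[name]
        if rpo <= tolerated_rpo and rto <= tolerated_rto:
            return name
    raise ValueError("no tier satisfies these objectives")

print(classify(timedelta(hours=24), timedelta(hours=48)))   # Tier 3
print(classify(timedelta(minutes=5), timedelta(hours=1)))   # Tier 1
```

Checking the cheapest tier first is the point: it encodes the rule of never spending Tier-0 budget on a Tier-3 workload.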

1.2 DR topologies

Different DR architectures exist depending on needs and budget.

Backup and restore only (no live DR site)
  • Backups stored on disk, tape, or cloud.

  • During disaster:

    • Build new servers.

    • Restore data from backups.

Pros:

  • Cheapest option.

Cons:

  • Long RTO (hours to days).

  • RPO depends on backup frequency (often 24 hours).

Good for:

  • Tier 2–3 workloads.

Active/passive DR site: warm or cold standby

There is a secondary site, but it’s not active.

  • Cold standby

    • Minimal hardware; everything is powered off.

    • Longest RTO (many hours or days).

  • Warm standby

    • Infrastructure is running but workloads are not running.

    • Replication sends data regularly (asynchronous).

    • Faster RTO than cold standby.

Pros:

  • Much faster recovery than backup-only.

  • Good balance of cost vs performance.

Used for:

  • Tier 1 workloads.

Active/active multi-site: synchronous replication, stretched clusters

Here, both sites are live and hosting workloads.

  • Synchronous replication = RPO ≈ 0

  • Automatic failover across sites = RTO ≈ seconds

Examples:

  • Stretched clusters that look like “one giant cluster across two buildings.”

Pros:

  • Best availability and lowest RPO/RTO.

Cons:

  • Most expensive

  • Requires low-latency, high-bandwidth links between sites

  • More complex operationally

Used for:

  • Tier 0 workloads.

2. Backup and Restore

Backups are the last line of defense, especially against ransomware or data corruption.

2.1 Backup strategies

Full, incremental, differential backups
  • Full backup

    • Captures everything.

    • Slowest and largest but simplest for restores.

  • Incremental backup

    • Backs up only data changed since the last backup (full OR incremental).

    • Small and fast to create, but a restore needs the last full backup plus every subsequent incremental.

  • Differential backup

    • Backs up changes since the last full backup.

    • Larger than incrementals, but a restore needs only the last full backup plus the latest differential.

Most modern systems use a mix for efficiency.
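
The restore-chain difference between the two schemes can be made concrete. This is a toy sketch; the backup labels are invented:

```python
def restore_chain(backups, scheme):
    """Return which backup sets a restore needs, oldest first.

    backups: list of (id, kind) in chronological order,
    with kind in {"full", "inc", "diff"}.
    """
    last_full = max(i for i, (_, k) in enumerate(backups) if k == "full")
    if scheme == "incremental":
        # Last full plus EVERY later incremental.
        return [backups[last_full][0]] + [b for b, k in backups[last_full + 1:] if k == "inc"]
    if scheme == "differential":
        # Last full plus only the LATEST differential.
        diffs = [b for b, k in backups[last_full + 1:] if k == "diff"]
        return [backups[last_full][0]] + diffs[-1:]
    raise ValueError(scheme)

week = [("sun", "full"), ("mon", "inc"), ("tue", "inc"), ("wed", "inc")]
print(restore_chain(week, "incremental"))     # ['sun', 'mon', 'tue', 'wed']

week_d = [("sun", "full"), ("mon", "diff"), ("tue", "diff"), ("wed", "diff")]
print(restore_chain(week_d, "differential"))  # ['sun', 'wed']
```

The longer the incremental chain, the more sets a restore depends on, which is exactly why differentials trade backup size for restore simplicity.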

Backup frequency and retention policies
  • Frequency (e.g., every 1 hour, once per day).

    • Determines RPO for backup-based recovery.

  • Retention

    • How long backups are kept (days, weeks, months, years).

    • Influenced by compliance (finance, healthcare, legal requirements).

Example:

  • Daily backups kept 30 days.

  • Weekly full backups kept 3 months.

  • Monthly backups kept 1 year.

  • Yearly backups kept 7 years (common compliance rule).
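
The example policy above can be checked mechanically. A minimal sketch, assuming the retention periods just listed; the tier labels and helper are illustrative:

```python
from datetime import date, timedelta

# Retention policy from the example above (values illustrative)
RETENTION = {
    "daily":   timedelta(days=30),
    "weekly":  timedelta(days=90),      # ~3 months
    "monthly": timedelta(days=365),
    "yearly":  timedelta(days=7 * 365),
}

def is_expired(backup_date, tier, today):
    """A backup may be deleted once its tier's retention period has elapsed."""
    return today - backup_date > RETENTION[tier]

today = date(2024, 6, 1)
print(is_expired(date(2024, 4, 1), "daily", today))    # True: older than 30 days
print(is_expired(date(2024, 4, 1), "weekly", today))   # False: within 90 days
```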

Backup windows: scheduling to avoid peak times

Backups can generate heavy I/O.

To avoid slowing production systems:

  • Run backups at night or during non-peak hours.

  • Use incremental/differential during business hours if needed.

Some modern systems offer:

  • Continuous data protection

  • Snapshot integration to reduce application impact

2.2 Backup targets

Disk-based backup appliances
  • Fast backup and restore.

  • Commonly used for short- and medium-term retention.

  • Often include deduplication to reduce storage cost.

Used in most modern environments.

Tape libraries and offsite storage
  • Best cost per TB for long-term storage.

  • Not fast to restore, but great for archival data.

  • Tapes can be physically removed (air-gap), very useful against ransomware.

Still widely used in enterprise DR strategies.

Cloud-based backup repositories
  • Store backups in cloud storage.

  • Nearly infinite retention options.

  • No need to manage physical media.

Good for:

  • Offsite DR without huge hardware investment.

  • Long-term retention.

2.3 Backup integration

Application-consistent backups (e.g., VSS, database quiescing)

To avoid corrupted data, backups must be application-aware:

  • VSS (Volume Shadow Copy Service) for Windows apps.

  • Database quiescing (pausing writes briefly) to take a clean backup.

Without application consistency:

  • Restores may fail.

  • Databases may require repair or lose transactions.

Snapshot-based backup integration with arrays

Storage arrays offer snapshots:

  • Very fast, point-in-time copies.

  • Low impact compared to full backup.

Backup applications can:

  • Trigger array snapshots.

  • Copy snapshot data to backup storage.

This combines performance of snapshots with safety of true backups.

Catalog and index design for restore efficiency

Backups need catalogs to:

  • Locate which media/version contains the needed file or VM.

  • Allow fast searching.

If catalogs are poorly designed or corrupted:

  • Restores take longer or fail entirely.

So catalog backup is as important as data backup.
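
Conceptually, a catalog is an index from objects to the versions that contain them. A toy sketch, with invented paths, timestamps, and media IDs:

```python
# Toy catalog: object path -> chronologically sorted (timestamp, media_id) entries
CATALOG = {
    "/data/orders.db": [(100, "tape-01"), (200, "disk-07"), (300, "cloud-12")],
}

def locate(path, point_in_time):
    """Find the newest backup of `path` taken at or before `point_in_time`."""
    candidates = [(t, m) for t, m in CATALOG.get(path, []) if t <= point_in_time]
    return candidates[-1][1] if candidates else None

print(locate("/data/orders.db", 250))   # 'disk-07': newest copy not after t=250
print(locate("/data/orders.db", 50))    # None: no backup existed yet
```

Without this index (or with a corrupted one), finding the right media for a point-in-time restore becomes a manual search.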

3. Replication and Data Mobility

Replication copies data from one location to another for DR or mobility.

3.1 Local protection

Snapshots: point-in-time, copy-on-write vs redirect-on-write
  • Snapshot = a quick, space-efficient point-in-time image.

  • Two main technologies:

    • Copy-on-Write (CoW):

      • When original data changes, the old blocks are copied elsewhere.

      • Slower on heavy write workloads.

    • Redirect-on-Write (RoW):

      • New writes go to a new location; old data stays intact.

      • Usually better performance.

Snapshots are good for:

  • Rapid recovery from logical errors.

  • Short-term protection (hours/days).
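
Redirect-on-write can be sketched with a toy block map. Real arrays do this at block granularity with metadata trees; this only illustrates why snapshots are cheap to take and why old data stays intact:

```python
class RowVolume:
    """Toy redirect-on-write volume: a snapshot freezes the current block map."""

    def __init__(self, blocks):
        self.blocks = dict(enumerate(blocks))   # live block map
        self.snapshots = {}

    def snapshot(self, name):
        # A snapshot copies only the map, not the data blocks themselves.
        self.snapshots[name] = dict(self.blocks)

    def write(self, index, data):
        # New data goes to a new location; prior contents remain reachable
        # through any snapshot that still references them.
        self.blocks[index] = data

vol = RowVolume(["a0", "b0", "c0"])
vol.snapshot("before-change")
vol.write(1, "b1")

print(vol.blocks[1])                        # 'b1': live volume sees the new write
print(vol.snapshots["before-change"][1])    # 'b0': snapshot still sees old data
```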

Clones: full-volume copies for testing or backup
  • A clone is a complete copy of data.

  • Uses more space than snapshots but is:

    • Independent

    • Very safe for testing, analytics, development, or long-term backup

Clones don’t depend on original data after creation.

3.2 Remote replication

Synchronous replication: zero data loss
  • Writes go to both sites at the same time.

  • Only acknowledged when written in both places.

Pros:

  • RPO ≈ 0 (no data loss).

Cons:

  • Needs very low latency between sites (usually <5 ms).

  • Limited to short distances.

Used for:

  • Tier 0 workloads.

  • Stretched-cluster architectures.

Asynchronous replication: better for distance
  • Writes are acknowledged immediately at the primary site.

  • Data is sent to secondary site later (with delay).

Pros:

  • Works over long distances.

  • Less bandwidth required.

Cons:

  • RPO depends on replication interval (seconds/minutes).

Used for:

  • Tier 1 and some Tier 2 workloads.
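
The trade-off can be put in rough numbers. Both helpers below are simplistic models with invented inputs, not vendor formulas:

```python
def sync_write_latency(local_ms, inter_site_rtt_ms):
    """Synchronous: the host waits for both sites, so every write pays the link RTT."""
    return local_ms + inter_site_rtt_ms

def async_worst_case_rpo_s(replication_interval_s, transfer_time_s):
    """Asynchronous: data written just after a cycle starts waits a full interval."""
    return replication_interval_s + transfer_time_s

print(sync_write_latency(0.5, 3.0))        # 3.5 ms per write, but RPO ~ 0
print(async_worst_case_rpo_s(300, 20))     # up to 320 s of data at risk
```

This is why synchronous replication demands short, low-latency links while asynchronous replication trades a nonzero RPO for distance.
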

Multi-target replication
  • One primary volume replicates to multiple secondary sites.

    • Example: replicate to Site B and cloud site C.

Pros:

  • Very resilient; multiple DR options.

  • Good for geo-distributed enterprises.

Cons:

  • More complex management.

3.3 Cross-platform and cloud DR

Replication between on-prem and cloud storage/platform
  • Replicate on-prem arrays to cloud storage (object or block).

  • Use cloud VMs as DR compute environment.

Useful when:

  • No budget for a physical secondary site.

  • Need elasticity for DR tests.

Use of cloud as DR target for certain workloads

Some workloads may fail over into cloud temporarily:

  • Less critical apps

  • Web front ends

  • Stateless services

You can:

  • Replicate VM images to cloud

  • Bring them up in a DR event

  • Redirect DNS/load balancers

Cloud-based DR is flexible and cost-effective if designed properly.

4. DR Runbooks and Testing

DR isn’t complete until it’s proven and practiced.

4.1 DR runbooks

Runbooks are step-by-step instructions for disaster recovery.
They make sure people do the right actions quickly under stress.

Runbooks include:

Step-by-step failover procedures
  • How to declare a disaster

  • Which systems fail over first

  • How to activate replicated VMs or volumes

  • How to test and validate the DR site

Must be:

  • Clear

  • Tested

  • Available even if primary systems are down

Role assignments: who does what during a DR event

Examples:

  • Incident commander (decision maker)

  • Storage admin

  • Network admin

  • Application owners

  • Communications lead

  • Compliance officer

Everyone must know their responsibilities.

Communication plans: internal and external communication

Internal:

  • IT teams

  • Executives

  • Department leaders

External:

  • Customers

  • Partners

  • Regulators

During an incident, communication clarity is as important as technical actions.

4.2 Testing

DR drills: scheduled tests with partial or full failover

Drills simulate real events:

  • Partial failover

    • Fail over specific workloads only.

  • Full failover

    • Entire environment switches to DR site.

Drills identify:

  • Gaps

  • Misconfigurations

  • People or process weaknesses

Tabletop exercises: dry-runs without actual failover

A group sits in a room and walks through a hypothetical disaster scenario.

Benefits:

  • Zero risk

  • Improves team coordination

  • Reveals documentation weaknesses

Post-test reporting: gaps, issues, improvement actions

After tests:

  • Document what went well

  • Document what went wrong

  • Update the DR plan and runbooks

  • Schedule fixes and improvements

Continuous improvement is essential.

5. Advisory and Governance

Finally, DR is not only technical—it involves policy, compliance, and advising the business.

5.1 Advising on trade-offs

Cost vs RPO/RTO

Key concept:

  • Lower RPO and RTO = higher cost

Examples:

  • RPO 0 → synchronous replication → expensive storage, high-bandwidth links

  • RTO seconds → stretched cluster → complex networking, licensing, support

Your job:

  • Explain these trade-offs clearly

  • Help the business choose what they really need

Never push a Tier-0 solution on a Tier-2 workload.

Complexity vs resilience

More resilience often means:

  • More components

  • More moving parts

  • More operational challenges

Sometimes a simpler solution:

  • Reduces mistakes

  • Lowers cost

  • Increases reliability in practice

Your role:

  • Recommend solutions that match the organization’s capability to operate them.

5.2 Policy and compliance

Aligning DR plan with regulatory requirements

DR must support compliance needs such as:

  • Data retention periods

  • Encryption rules

  • Required RTO/RPO for regulated systems

  • Mandatory testing schedules

Industries like banking, healthcare, and government have strict DR expectations.

Ensuring data sovereignty and retention policies are followed

Sovereignty:

  • Where data physically resides

  • Some laws require data to stay within a country or region

Retention:

  • How long data must be kept

  • When it must be deleted

Your DR design must respect:

  • Location rules

  • Retention rules across all DR copies, including backup and replicated data

  • Deletion rules (e.g., GDPR “right to be forgotten”)

Disaster Recovery and Advising (Additional Content)

1. Failback Strategy (Returning Workloads to the Primary Site)

Failback is often more complex than failover and must be explicitly planned, documented, and tested. A complete DR strategy includes not only moving workloads to the DR site but also safely returning them to the primary location once conditions allow.

Key Elements of Failback

  • Validation of primary site readiness, including power, cooling, network, and storage health.

  • Synchronization of changed data from the DR site back to the primary site.

  • Reversal of replication direction with safe cutover preparation.

  • Minimizing downtime during failback by sequencing applications appropriately.

  • Business-defined criteria for authorizing failback, including risk assessment.

  • Acceptance testing and sign-off after production workloads return to the primary site.

2. Classification of DR Testing: Non-Disruptive, Disruptive, Automated

DR testing must be structured to balance operational assurance with business impact. Categorizing tests enables better planning and compliance alignment.

Non-Disruptive Tests

  • No impact to production systems.

  • Validate DR readiness using read-only mounts, clones, or isolated test networks.

  • Suitable for frequent validation with minimal operational risk.

Disruptive Tests

  • Actual failover of selected or all workloads to the DR site.

  • Requires business approval and planned maintenance windows.

  • Validates operational readiness under real-world failover conditions.

Automated or Orchestrated Tests

  • Conducted using DR orchestration tools that automate recovery sequences.

  • Provide repeatability, auditability, and reduced human error.

  • Useful for regulatory reporting and continuous validation.

3. DR Automation and Orchestration Tools

Automation enhances recovery speed, reliability, and consistency. Modern DR tools provide structured execution of failover tasks that would be error-prone if done manually.

Key Capabilities

  • Automated failover workflows with predefined steps.

  • Dependency mapping between infrastructure and application tiers, such as database → application → web.

  • Automated adjustment of IP addressing, DNS, or load-balancer routing during DR events.

  • Execution of scripted processes representing “runbooks as code.”

  • Regular orchestrated DR testing without manual intervention.
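
A runbook expressed as code can be as simple as an ordered list of steps with dependencies, executed and logged mechanically. The step names below are invented and no real APIs are called; this only sketches the orchestration idea:

```python
# Dependency-ordered failover: database first, then app tier, then web tier
RUNBOOK = [
    ("promote-replica-db", []),
    ("start-app-tier",     ["promote-replica-db"]),
    ("start-web-tier",     ["start-app-tier"]),
    ("update-dns",         ["start-web-tier"]),
]

def execute(runbook):
    """Run each step only after its dependencies succeeded; log the order."""
    done, log = set(), []
    for step, deps in runbook:
        missing = [d for d in deps if d not in done]
        if missing:
            raise RuntimeError(f"{step} blocked by {missing}")
        # A real tool would invoke APIs or scripts here; we only record the order.
        done.add(step)
        log.append(step)
    return log

print(execute(RUNBOOK))   # steps in dependency order, auditable after the fact
```

The returned log is what makes orchestrated tests repeatable and auditable: the same sequence runs every time, and the record can go into regulatory reporting.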

4. Network Design Considerations for DR

DR plans must incorporate full network behavior across sites; otherwise, even perfectly replicated workloads may not function during failover.

Essential Considerations

  • Layer-2 stretch or Layer-3 routing models used for multi-site clusters and application mobility.

  • Firewall rules allowing replication, management traffic, and administrative failover operations.

  • Bandwidth sizing based on peak change rates, deduplication efficiency, and replication intervals.

  • Failover IP strategy, including DNS redirection, load-balancer adjustments, VIP failover, or SD-WAN routing updates.
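
Bandwidth sizing can be estimated back-of-the-envelope. All inputs below are assumptions to be replaced with measured values from the environment:

```python
def required_mbps(changed_gb_per_hour, dedup_ratio, interval_minutes, headroom=1.3):
    """Link speed needed so each replication cycle finishes within its interval."""
    data_gb = changed_gb_per_hour * (interval_minutes / 60) / dedup_ratio
    seconds = interval_minutes * 60
    return data_gb * 8 * 1000 / seconds * headroom   # GB -> megabits, plus headroom

# 50 GB/h peak change rate, 2:1 dedup, replicate every 15 minutes
print(round(required_mbps(50, 2.0, 15), 1))   # 72.2 Mbps
```

Sizing against the peak change rate rather than the average is what keeps the replication lag (and therefore the effective RPO) bounded.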

5. Immutable Backups and Ransomware-Resilient DR

Ransomware resiliency has become a core part of modern DR design. Protecting backup data from alteration or deletion ensures recovery remains possible even after a cyberattack.

Key Protective Measures

  • Immutable or locked backups that cannot be modified for a defined retention period.

  • WORM retention enforcing write-once-read-many guarantees.

  • Air-gapped or offline backups kept outside reachable networks.

  • Multi-factor authentication requirements for backup deletions or retention changes.

  • Zero-trust access controls applied to backup and snapshot systems.
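
The WORM rule can be expressed as a simple guard. This is illustrative only; real systems enforce immutability in the storage layer, not in client code:

```python
from datetime import datetime, timedelta

class ImmutableBackup:
    """Toy WORM record: deletion is refused until retention expires."""

    def __init__(self, name, created, retention):
        self.name = name
        self.retain_until = created + retention

    def delete(self, now):
        if now < self.retain_until:
            raise PermissionError(f"{self.name} is locked until {self.retain_until}")
        return True

b = ImmutableBackup("orders-2024-06-01", datetime(2024, 6, 1), timedelta(days=30))
try:
    b.delete(datetime(2024, 6, 10))
except PermissionError as e:
    print(e)                             # refused: still inside the retention window

print(b.delete(datetime(2024, 7, 15)))   # True: retention has elapsed
```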

Frequently Asked Questions

What is the primary purpose of replication in a storage disaster recovery strategy?

Answer:

Replication copies data from a primary storage system to a secondary system to ensure data availability during failures.

Explanation:

Replication is used to maintain an up-to-date copy of data at another location or storage system. If the primary site becomes unavailable due to hardware failure, natural disaster, or network outage, the replicated data can be used to restore operations. Replication can be synchronous, where writes are committed to both systems simultaneously, or asynchronous, where updates are transferred with a delay. Synchronous replication provides stronger data protection but requires low-latency links, while asynchronous replication is more flexible for long-distance deployments. Organizations select replication methods based on recovery point objectives (RPO) and recovery time objectives (RTO).

How do snapshots contribute to data protection strategies?

Answer:

Snapshots create point-in-time copies of data that allow systems to recover from corruption or accidental deletion.

Explanation:

Snapshots capture the state of a storage volume at a specific moment without requiring a full copy of all data. Most storage systems implement snapshots using copy-on-write or redirect-on-write techniques, which only store changed data blocks. This makes snapshots efficient and fast to create. Administrators can use snapshots to quickly restore files or entire volumes if data becomes corrupted or deleted. However, snapshots typically reside on the same storage system and may not protect against hardware failure or site outages, so they are often combined with replication for comprehensive protection.

What are RPO and RTO in disaster recovery planning?

Answer:

RPO (Recovery Point Objective) defines acceptable data loss, while RTO (Recovery Time Objective) defines how quickly systems must be restored.

Explanation:

RPO determines how much data loss an organization can tolerate during a disaster. For example, an RPO of 15 minutes means the system must replicate or back up data frequently enough that only 15 minutes of data could be lost. RTO defines how long systems can remain unavailable before business operations are affected. Critical applications may require RTO values of minutes, while less important systems may tolerate hours or days. Storage architects design replication and backup strategies that meet these objectives while balancing cost and infrastructure complexity.

Why is geographic separation important in disaster recovery planning?

Answer:

Geographic separation protects data from site-wide disasters affecting the primary location.

Explanation:

If both primary and backup systems are located in the same data center, a single disaster such as fire, flooding, or power failure could destroy all copies of the data. By replicating data to a geographically separate site, organizations reduce the risk of total data loss. Distance requirements depend on risk tolerance and network capabilities. Many enterprises maintain secondary sites in different cities or regions. This approach ensures that critical data remains accessible even during large-scale incidents affecting the primary location.

Why should disaster recovery plans be tested regularly?

Answer:

Regular testing ensures that recovery procedures actually work during real incidents.

Explanation:

Many organizations create disaster recovery plans but fail to verify them through testing. Over time, infrastructure changes, software updates, and configuration modifications can invalidate recovery procedures. Regular testing allows administrators to identify gaps such as missing replication configurations, outdated documentation, or insufficient network capacity. Simulated recovery exercises also train operational teams to execute recovery steps efficiently. Without testing, organizations may discover problems only during real disasters, which can significantly extend downtime and increase operational risk.
