This is the foundation: understanding what the business needs and what the technology can deliver.
RPO (Recovery Point Objective) answers the question:
“How much data can we afford to lose if a disaster happens?”
It is always measured in time.
Examples:
RPO = 0 seconds → no data loss allowed (requires synchronous replication).
RPO = 5 minutes → acceptable to lose up to 5 minutes of changes.
RPO = 24 hours → losing 1 day of data is acceptable (typical for non-critical systems).
If your last backup or replication was 1 hour ago, and disaster happens, you have potentially lost 1 hour of data → RPO = 1 hour.
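The loss-window arithmetic above can be sketched in a few lines. This is a minimal illustration (the function and variable names are made up for this example, not from any real tool):

```python
from datetime import datetime, timedelta

def potential_data_loss(last_copy: datetime, disaster: datetime) -> timedelta:
    """Worst case: everything written since the last good copy is lost."""
    return disaster - last_copy

def meets_rpo(last_copy: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """True if the loss window stays within the agreed RPO."""
    return potential_data_loss(last_copy, disaster) <= rpo

last_backup = datetime(2024, 1, 1, 12, 0)
outage = datetime(2024, 1, 1, 13, 0)             # disaster one hour later
print(potential_data_loss(last_backup, outage))  # 1:00:00 of data at risk
print(meets_rpo(last_backup, outage, timedelta(minutes=5)))  # False
```

A copy taken 3 minutes before the outage would satisfy the same 5-minute RPO.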
RTO (Recovery Time Objective) answers:
“How long can the system be down before business suffers too much?”
Examples:
RTO = seconds → needs automatic failover (cluster, stretched cluster).
RTO = minutes → fast recovery (replication + orchestrated failover).
RTO = hours → manual restore from backup.
RTO = days → some archival systems with low business impact.
RTO strongly influences the DR architecture (cluster vs backup restore).
Because not all workloads are equal, you classify them:
Tier 0 (mission-critical)
RPO = near-zero
RTO = near-zero
Example: payment systems, trading systems, hospital systems
Tier 1 (business-critical)
RPO = minutes
RTO = minutes to an hour
Example: ERP, CRM, some databases
Tier 2 (important but not urgent)
RPO = hours
RTO = hours
Example: intranet, reporting systems
Tier 3 (non-critical)
RPO = 24 hours
RTO = >24 hours
Example: archives, test systems
This is essential because DR costs increase as RPO/RTO decrease.
You don’t want to spend Tier-0 budget on Tier-3 workloads.
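The tier table above is easy to encode as a small lookup, which is handy when classifying many workloads. The catalogue values mirror the tiers listed here; the structure itself is a hypothetical sketch, not any vendor's schema:

```python
# Illustrative tier catalogue based on the targets described above.
TIERS = {
    0: {"rpo": "near-zero", "rto": "near-zero",
        "architecture": "stretched cluster, synchronous replication"},
    1: {"rpo": "minutes", "rto": "minutes to an hour",
        "architecture": "async replication + orchestrated failover"},
    2: {"rpo": "hours", "rto": "hours",
        "architecture": "warm standby or frequent backups"},
    3: {"rpo": "24 hours", "rto": ">24 hours",
        "architecture": "backup and restore"},
}

def dr_targets(tier: int) -> dict:
    """Look up RPO/RTO targets and a typical architecture for a tier."""
    return TIERS[tier]

print(dr_targets(2)["architecture"])  # warm standby or frequent backups
```

Keeping the mapping explicit makes it obvious when someone asks for Tier-0 targets on a Tier-3 budget.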
Different DR architectures exist depending on needs and budget.
Backups stored on disk, tape, or cloud.
During disaster:
Build new servers.
Restore data from backups.
Pros:
Lowest cost; simple to operate.
Cons:
Long RTO (hours to days).
RPO depends on backup frequency (often 24 hours).
Good for:
Tier 2 and Tier 3 workloads with relaxed RPO/RTO.
There is a secondary site, but it’s not active.
Cold standby
Minimal hardware; everything is powered off.
Longest RTO (many hours or days).
Warm standby
Infrastructure is powered on, but workloads are not running.
Replication sends data regularly (asynchronous).
Faster RTO than cold standby.
Pros:
Much faster recovery than backup-only.
Good balance of cost vs performance.
Used for:
Tier 1 and Tier 2 workloads that need faster recovery than backup-only DR.
Here, both sites are live and hosting workloads.
Synchronous replication = RPO ≈ 0
Automatic failover across sites = RTO ≈ seconds
Examples:
Stretched storage clusters, metro clusters.
Pros:
Near-zero RPO and RTO; continuous availability.
Cons:
Most expensive
Requires low-latency, high-bandwidth links between sites
More complex operationally
Used for:
Tier 0 mission-critical workloads.
Backups are the last line of defense, especially against ransomware or data corruption.
Full backup
Captures everything.
Slowest and largest but simplest for restores.
Incremental backup
Backs up only data changed since the last backup (full OR incremental).
Small and fast, but restore requires multiple incremental data sets.
Differential backup
Backs up changes since the last full backup.
Larger than incrementals but quicker to restore.
Most modern systems use a mix for efficiency.
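The restore-chain difference between incremental and differential backups can be made concrete with a short sketch. This is an illustrative function (names invented for the example), assuming a simple history of one full backup followed by incrementals or differentials:

```python
def restore_chain(history):
    """Given a backup history ordered oldest-to-newest, return the backups
    needed to restore to the latest point.
    Each entry is ("full" | "incremental" | "differential", label)."""
    chain = []
    for kind, label in reversed(history):
        if kind == "incremental":
            chain.append((kind, label))   # need every incremental back to the full
        elif kind == "differential":
            if not chain:                 # only the newest differential matters
                chain.append((kind, label))
        elif kind == "full":
            chain.append((kind, label))   # the chain is anchored by the last full
            break
    return list(reversed(chain))

# Incremental strategy: full plus every incremental since.
print(restore_chain([("full", "sun"), ("incremental", "mon"), ("incremental", "tue")]))
# Differential strategy: full plus only the newest differential.
print(restore_chain([("full", "sun"), ("differential", "mon"), ("differential", "tue")]))
```

The incremental chain grows with every backup (more restore steps, more risk of a broken link); the differential chain stays two items long but each differential grows larger.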
Frequency
How often backups run (e.g., every hour, once per day).
Retention
How long backups are kept (days, weeks, months, years).
Influenced by compliance (finance, healthcare, legal requirements).
Example:
Daily backups kept 30 days.
Weekly full backups kept 3 months.
Monthly backups kept 1 year.
Yearly backups kept 7 years (common compliance rule).
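A retention schedule like the one above can be expressed as a simple expiry check. The rules below mirror the example schedule; the function and naming are a hypothetical sketch, not a real backup product's API:

```python
from datetime import date, timedelta

# Illustrative retention rules mirroring the schedule above.
RETENTION = {
    "daily":   timedelta(days=30),
    "weekly":  timedelta(days=90),       # roughly 3 months
    "monthly": timedelta(days=365),      # 1 year
    "yearly":  timedelta(days=7 * 365),  # 7 years (common compliance rule)
}

def is_expired(kind: str, taken: date, today: date) -> bool:
    """A backup expires once it is older than its tier's retention period."""
    return today - taken > RETENTION[kind]

today = date(2024, 6, 1)
print(is_expired("daily", date(2024, 4, 1), today))   # True: older than 30 days
print(is_expired("yearly", date(2020, 6, 1), today))  # False: within 7 years
```

Real schedulers also promote backups between tiers (daily to weekly to monthly), but the expiry logic is the same idea.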
Backups can generate heavy I/O.
To avoid slowing production systems:
Run backups at night or during non-peak hours.
Use incremental/differential during business hours if needed.
Some modern systems offer:
Continuous data protection
Snapshot integration to reduce application impact
Disk:
Fast backup and restore.
Commonly used for short- and medium-term retention.
Often deduplication appliances to reduce storage cost.
Used in most modern environments.
Tape:
Best cost per TB for long-term storage.
Not fast to restore, but great for archival data.
Tapes can be physically removed (air-gap), very useful against ransomware.
Still widely used in enterprise DR strategies.
Cloud:
Store backups in cloud storage.
Nearly infinite retention options.
No need to manage physical media.
Good for:
Offsite DR without huge hardware investment.
Long-term retention.
To avoid restoring corrupted or inconsistent data, backups must be application-aware:
VSS (Volume Shadow Copy Service) for Windows apps.
Database quiescing (pausing writes briefly) to take a clean backup.
Without application consistency:
Restores may fail.
Databases may require repair or lose transactions.
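The quiesce pattern (pause writes, capture, resume) can be sketched as a context manager. Everything here is hypothetical for illustration: `freeze()`/`thaw()` stand in for whatever hook the real application exposes (a VSS writer on Windows, a database quiesce command, etc.):

```python
from contextlib import contextmanager

@contextmanager
def quiesced(db):
    """Briefly pause writes so the on-disk state is consistent, then resume.
    `db` is any object exposing hypothetical freeze()/thaw() hooks."""
    db.freeze()      # flush and pause writes (e.g., VSS writer or DB quiesce)
    try:
        yield
    finally:
        db.thaw()    # resume writes no matter what happened in between

class DemoDb:
    def __init__(self):
        self.frozen = False
        self.log = []
    def freeze(self):
        self.frozen = True
        self.log.append("freeze")
    def thaw(self):
        self.frozen = False
        self.log.append("thaw")

db = DemoDb()
with quiesced(db):
    db.log.append("snapshot")  # take the snapshot while writes are paused
print(db.log)  # ['freeze', 'snapshot', 'thaw']
```

The `finally` block is the important part: writes must resume even if the snapshot step fails, or the pause turns into an outage.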
Storage arrays offer snapshots:
Very fast, point-in-time copies.
Low impact compared to full backup.
Backup applications can:
Trigger array snapshots.
Copy snapshot data to backup storage.
This combines performance of snapshots with safety of true backups.
Backups need catalogs to:
Locate which media/version contains the needed file or VM.
Allow fast searching.
If catalogs are poorly designed or corrupted:
Restores become slow, incomplete, or impossible.
So catalog backup is as important as data backup.
Replication copies data from one location to another for DR or mobility.
Snapshot = a quick, space-efficient point-in-time image.
Two main technologies:
Copy-on-Write (CoW):
When original data changes, the old blocks are copied elsewhere.
Slower on heavy write workloads.
Redirect-on-Write (RoW):
New writes go to a new location; old data stays intact.
Usually better performance.
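The redirect-on-write idea can be demonstrated with a toy volume: a snapshot freezes the current block map, and later writes land in new blocks while the snapshot keeps referencing the old ones. This is a deliberately simplified sketch (real arrays share maps instead of copying them, and track blocks, not Python strings):

```python
class RoWVolume:
    """Toy redirect-on-write volume."""
    def __init__(self, blocks):
        self.blocks = dict(enumerate(blocks))  # block number -> data
        self.snapshots = []

    def snapshot(self):
        """Freeze the current block map; the data itself is not copied."""
        self.snapshots.append(dict(self.blocks))
        return len(self.snapshots) - 1

    def write(self, block, data):
        """New data replaces the live mapping; snapshots keep the old blocks."""
        self.blocks[block] = data

    def read_snapshot(self, snap_id, block):
        return self.snapshots[snap_id][block]

vol = RoWVolume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")                  # live volume changes...
print(vol.blocks[1])               # B
print(vol.read_snapshot(snap, 1))  # b ...but the snapshot still sees the old data
```

Copy-on-write differs only in ordering: the old block is copied aside at write time, which is why heavy write workloads pay a penalty.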
Snapshots are good for:
Rapid recovery from logical errors.
Short-term protection (hours/days).
A clone is a complete copy of data.
Uses more space than snapshots but is:
Independent
Very safe for testing, analytics, development, or long-term backup
Clones don’t depend on original data after creation.
Synchronous replication:
Writes go to both sites at the same time.
Only acknowledged when written in both places.
Pros:
RPO ≈ 0: no committed data is lost.
Cons:
Needs very low latency between sites (usually <5 ms).
Limited to short distances.
Used for:
Tier 0 workloads.
Stretched-cluster architectures.
Asynchronous replication:
Writes are acknowledged immediately at the primary site.
Data is sent to secondary site later (with delay).
Pros:
Works over long distances.
Less bandwidth required.
Cons:
Some data loss is possible (RPO > 0, typically seconds to minutes).
Used for:
Tier 1 and Tier 2 workloads and long-distance DR.
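The acknowledgement difference between the two modes is the whole story, and a toy sketch makes it visible. The functions and dictionaries below are purely illustrative stand-ins for the primary array, the remote array, and an async transfer queue:

```python
def sync_write(primary, secondary, key, value):
    """Synchronous: acknowledge only after BOTH sites persist the write."""
    primary[key] = value
    secondary[key] = value        # in reality this blocks until the remote confirms
    return "ack"                  # RPO is near zero: the copies never diverge

def async_write(primary, pending, key, value):
    """Asynchronous: acknowledge immediately, ship the change later."""
    primary[key] = value
    pending.append((key, value))  # queued for a later transfer to the DR site
    return "ack"                  # fast, but queued changes are lost in a disaster

primary, secondary, queue = {}, {}, []
sync_write(primary, secondary, "order-1", "paid")
async_write(primary, queue, "order-2", "shipped")
print(secondary)  # {'order-1': 'paid'}        -- already safe at the DR site
print(queue)      # [('order-2', 'shipped')]   -- still in flight
```

Anything sitting in the queue when disaster strikes is exactly the data your RPO allows you to lose.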
One primary volume replicates to multiple secondary sites.
Pro:
Very resilient; multiple DR options.
Good for geo-distributed enterprises.
Con:
Higher bandwidth usage and management complexity.
Replicate on-prem arrays to cloud storage (object or block).
Use cloud VMs as DR compute environment.
Useful when:
No budget for a physical secondary site.
Need elasticity for DR tests.
Some workloads may fail over into cloud temporarily:
Less critical apps
Webfronts
Stateless services
You can:
Replicate VM images to cloud
Bring them up in a DR event
Redirect DNS/load balancers
Cloud-based DR is flexible and cost-effective if designed properly.
DR isn’t complete until it’s proven and practiced.
Runbooks are step-by-step instructions for disaster recovery.
They make sure people do the right actions quickly under stress.
Runbooks include:
How to declare a disaster
Which systems fail over first
How to activate replicated VMs or volumes
How to test and validate the DR site
Must be:
Clear
Tested
Available even if primary systems are down
Typical DR team roles:
Incident commander (decision maker)
Storage admin
Network admin
Application owners
Communications lead
Compliance officer
Everyone must know their responsibilities.
Internal:
IT teams
Executives
Department leaders
External:
Customers
Partners
Regulators
During an incident, communication clarity is as important as technical actions.
Drills simulate real events:
Partial failover
Full failover
Drills identify:
Gaps
Misconfigurations
People or process weaknesses
A group sits in a room and walks through a hypothetical disaster scenario.
Benefits:
Zero risk
Improves team coordination
Reveals documentation weaknesses
After tests:
Document what went well
Document what went wrong
Update the DR plan and runbooks
Schedule fixes and improvements
Continuous improvement is essential.
Finally, DR is not only technical—it involves policy, compliance, and advising the business.
Key concept:
The lower the RPO/RTO, the higher the cost and complexity.
Examples:
RPO 0 → synchronous replication → expensive storage, high-bandwidth links
RTO seconds → stretched cluster → complex networking, licensing, support
Your job:
Explain these trade-offs clearly
Help the business choose what they really need
Never push a Tier-0 solution on a Tier-2 workload.
More resilience often means:
More components
More moving parts
More operational challenges
Sometimes a simpler solution:
Reduces mistakes
Lowers cost
Increases reliability in practice
Your role:
Recommend the simplest architecture that still meets the agreed RPO/RTO.
DR must support compliance needs such as:
Data retention periods
Encryption rules
Required RTO/RPO for regulated systems
Mandatory testing schedules
Industries like banking, healthcare, and government have strict DR expectations.
Sovereignty:
Where data physically resides
Some laws require data to stay within a country or region
Retention:
How long data must be kept
When it must be deleted
Your DR design must respect:
Location rules
Retention rules across all DR copies, including backup and replicated data
Deletion rules (e.g., GDPR “right to be forgotten”)
Failback is often more complex than failover and must be explicitly planned, documented, and tested. A complete DR strategy includes not only moving workloads to the DR site but also safely returning them to the primary location once conditions allow.
Validation of primary site readiness, including power, cooling, network, and storage health.
Synchronization of changed data from the DR site back to the primary site.
Reversal of replication direction with safe cutover preparation.
Minimizing downtime during failback by sequencing applications appropriately.
Business-defined criteria for authorizing failback, including risk assessment.
Acceptance testing and sign-off after production workloads return to the primary site.
DR testing must be structured to balance operational assurance with business impact. Categorizing tests enables better planning and compliance alignment.
No impact to production systems.
Validate DR readiness using read-only mounts, clones, or isolated test networks.
Suitable for frequent validation with minimal operational risk.
Actual failover of selected or all workloads to the DR site.
Requires business approval and planned maintenance windows.
Validates operational readiness under real-world failover conditions.
Conducted using DR orchestration tools that automate recovery sequences.
Provide repeatability, auditability, and reduced human error.
Useful for regulatory reporting and continuous validation.
Automation enhances recovery speed, reliability, and consistency. Modern DR tools provide structured execution of failover tasks that would be error-prone if done manually.
Automated failover workflows with predefined steps.
Dependency mapping between infrastructure and application tiers, such as database → application → web.
Automated adjustment of IP addressing, DNS, or load-balancer routing during DR events.
Execution of scripted processes representing “runbooks as code.”
Regular orchestrated DR testing without manual intervention.
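A "runbook as code" can be as simple as an ordered list of steps executed with fail-fast semantics, so dependent tiers are never started on top of a broken layer. The step names and lambdas below are invented placeholders; a real orchestrator would call storage and hypervisor APIs:

```python
# Hypothetical runbook: recover tiers in dependency order
# (database -> application -> web), halting on the first failure.
RUNBOOK = [
    ("database", lambda: "db volumes promoted"),
    ("application", lambda: "app VMs powered on"),
    ("web", lambda: "DNS repointed to DR site"),
]

def execute_runbook(steps):
    """Run each recovery step in order and record the outcome;
    a failure stops the sequence so later tiers are not started."""
    results = []
    for name, action in steps:
        try:
            results.append((name, "ok", action()))  # audit trail per step
        except Exception as exc:
            results.append((name, "failed", str(exc)))
            break
    return results

for name, status, detail in execute_runbook(RUNBOOK):
    print(f"{name}: {status} ({detail})")
```

The recorded results double as the audit evidence that regulators and post-test reviews ask for.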
DR plans must incorporate full network behavior across sites; otherwise, even perfectly replicated workloads may not function during failover.
Layer-2 stretch or Layer-3 routing models used for multi-site clusters and application mobility.
Firewall rules allowing replication, management traffic, and administrative failover operations.
Bandwidth sizing based on peak change rates, deduplication efficiency, and replication intervals.
Failover IP strategy, including DNS redirection, load-balancer adjustments, VIP failover, or SD-WAN routing updates.
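Bandwidth sizing from change rate and deduplication efficiency is a back-of-the-envelope calculation. The figures below are assumed purely for illustration (60 GB changed in the peak hour, 3:1 dedup); the function name is made up:

```python
def replication_bandwidth_mbps(change_gb_per_hour: float, dedup_ratio: float) -> float:
    """Bandwidth needed to keep up with one hour's changes after dedup,
    in megabits per second (decimal units: 1 GB = 8000 megabits)."""
    effective_gb = change_gb_per_hour / dedup_ratio  # data actually sent
    return effective_gb * 8000 / 3600                # spread across the hour

# Assumed figures for illustration: 60 GB peak-hour change rate, 3:1 dedup.
print(round(replication_bandwidth_mbps(60, 3.0), 1))  # 44.4 (Mbps)
```

Size the link for the peak hour, not the daily average, or the replication lag (and therefore the effective RPO) will balloon exactly when the system is busiest.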
Ransomware resiliency has become a core part of modern DR design. Protecting backup data from alteration or deletion ensures recovery remains possible even after a cyberattack.
Immutable or locked backups that cannot be modified for a defined retention period.
WORM retention enforcing write-once-read-many guarantees.
Air-gapped or offline backups kept outside reachable networks.
Multi-factor authentication requirements for backup deletions or retention changes.
Zero-trust access controls applied to backup and snapshot systems.
What is the primary purpose of replication in a storage disaster recovery strategy?
Replication copies data from a primary storage system to a secondary system to ensure data availability during failures.
Replication is used to maintain an up-to-date copy of data at another location or storage system. If the primary site becomes unavailable due to hardware failure, natural disaster, or network outage, the replicated data can be used to restore operations. Replication can be synchronous, where writes are committed to both systems simultaneously, or asynchronous, where updates are transferred with a delay. Synchronous replication provides stronger data protection but requires low-latency links, while asynchronous replication is more flexible for long-distance deployments. Organizations select replication methods based on recovery point objectives (RPO) and recovery time objectives (RTO).
Demand Score: 82
Exam Relevance Score: 87
How do snapshots contribute to data protection strategies?
Snapshots create point-in-time copies of data that allow systems to recover from corruption or accidental deletion.
Snapshots capture the state of a storage volume at a specific moment without requiring a full copy of all data. Most storage systems implement snapshots using copy-on-write or redirect-on-write techniques, which only store changed data blocks. This makes snapshots efficient and fast to create. Administrators can use snapshots to quickly restore files or entire volumes if data becomes corrupted or deleted. However, snapshots typically reside on the same storage system and may not protect against hardware failure or site outages, so they are often combined with replication for comprehensive protection.
Demand Score: 79
Exam Relevance Score: 84
What are RPO and RTO in disaster recovery planning?
RPO (Recovery Point Objective) defines acceptable data loss, while RTO (Recovery Time Objective) defines how quickly systems must be restored.
RPO determines how much data loss an organization can tolerate during a disaster. For example, an RPO of 15 minutes means the system must replicate or back up data frequently enough that only 15 minutes of data could be lost. RTO defines how long systems can remain unavailable before business operations are affected. Critical applications may require RTO values of minutes, while less important systems may tolerate hours or days. Storage architects design replication and backup strategies that meet these objectives while balancing cost and infrastructure complexity.
Demand Score: 77
Exam Relevance Score: 83
Why is geographic separation important in disaster recovery planning?
Geographic separation protects data from site-wide disasters affecting the primary location.
If both primary and backup systems are located in the same data center, a single disaster such as fire, flooding, or power failure could destroy all copies of the data. By replicating data to a geographically separate site, organizations reduce the risk of total data loss. Distance requirements depend on risk tolerance and network capabilities. Many enterprises maintain secondary sites in different cities or regions. This approach ensures that critical data remains accessible even during large-scale incidents affecting the primary location.
Demand Score: 75
Exam Relevance Score: 80
Why should disaster recovery plans be tested regularly?
Regular testing ensures that recovery procedures actually work during real incidents.
Many organizations create disaster recovery plans but fail to verify them through testing. Over time, infrastructure changes, software updates, and configuration modifications can invalidate recovery procedures. Regular testing allows administrators to identify gaps such as missing replication configurations, outdated documentation, or insufficient network capacity. Simulated recovery exercises also train operational teams to execute recovery steps efficiently. Without testing, organizations may discover problems only during real disasters, which can significantly extend downtime and increase operational risk.
Demand Score: 73
Exam Relevance Score: 79