Disaster Recovery Runbook

Aligned with NIST SP 800-34r1, FCA regulatory expectations, and best practices for operational resilience.

Document Owner: IT Continuity Lead
Version: 1.0
Review Cycle: Semi-annually or post-major change
Last Reviewed: [Insert Date]

1. Overview

| Field | Description |
|---|---|
| Purpose | Guide the recovery of critical systems and services following a disruption. |
| Scope | Covers systems, applications, networks, and data critical to business operations. |
| Assumptions | Staff availability, backup availability, offsite access. |
| References | BIA Document, Crisis Comms Plan, Incident Register, Configuration Management DB. |

2. Roles & Responsibilities

| Role | Name | Contact | Responsibilities |
|---|---|---|---|
| DR Manager | [Insert Name] | [Insert Contact] | Coordinates execution of the runbook |
| IT Lead | [Insert Name] | [Insert Contact] | Leads system recovery |
| Comms Lead | [Insert Name] | [Insert Contact] | Handles stakeholder updates |
| Security Officer | [Insert Name] | [Insert Contact] | Ensures security during recovery |
| Vendor Liaison | [Insert Name] | [Insert Contact] | Engages external partners |

3. DR Scenarios & Triggers

| Scenario | Trigger | Examples |
|---|---|---|
| Cybersecurity Incident | Ransomware, DDoS | Data encrypted, systems offline |
| Infrastructure Outage | Power or hosting failure | Cloud provider outage |
| Data Loss | Accidental or malicious | Critical database corruption |
| Site Unavailability | Flood, fire, etc. | Data centre destroyed |

4. Critical Assets and Recovery Details

| Asset Name | Type | RTO | RPO | Recovery Steps | Location |
|---|---|---|---|---|---|
| Payment API | Application | 2 hours | 15 mins | See Section 6 | AWS (EU-WEST) |
| Customer DB | Database | 4 hours | 1 hour | See Backup Procedure | Azure |
| IAM System | Security | 1 hour | 0 mins | Activate failover | On-prem |
| Jira/Confluence | Collaboration | 8 hours | 24 hours | Contact Atlassian support | SaaS |

5. Communication Plan

| Audience | Method | Frequency | Responsible |
|---|---|---|---|
| Internal Teams | Slack / Email | Hourly | DR Manager |
| Executives | SMS / Call | 2-hourly | Comms Lead |
| Customers | Status Page | As needed | Comms Lead |
| Regulators (e.g. FCA) | Email / Call | Within 24h | Compliance Officer |

6. Step-by-Step Recovery Procedures

Example: Payment API Recovery

  1. Trigger: Monitoring system flags downtime or data loss.

  2. Notify: DR team via PagerDuty; escalate to DR Manager.

  3. Initial Response:

    • Isolate affected VMs

    • Snapshot logs and preserve evidence

  4. Activate DR site:

    • Deploy from the pre-approved CloudFormation template (AWS); see the sketch after this procedure

  5. Data Restoration:

    • Recover from S3 backup (timestamp T-15min)

  6. Validation:

    • QA team runs regression test suite

  7. Go/No-Go Decision: DR Manager signs off

  8. Customer Notification: Update status page and email notices

  9. Post-Incident Review:

    • Root cause analysis

    • Lessons learned

    • Update this runbook if needed
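
The sketch below shows how steps 4 and 5 might be scripted with boto3. It is a minimal, hedged illustration only: the region, stack name, template URL, backup bucket, and key prefix are placeholder assumptions, not the approved values for this environment.

```python
# Hedged sketch only: region, stack name, template URL, bucket, and prefix are
# placeholder assumptions, not the values approved for this environment.
import boto3

REGION = "eu-west-1"                                   # assumed DR region
cfn = boto3.client("cloudformation", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# Step 4: activate the DR site from the pre-approved CloudFormation template.
stack = cfn.create_stack(
    StackName="payment-api-dr",                        # placeholder stack name
    TemplateURL="https://example-dr-templates.s3.amazonaws.com/payment-api.yaml",
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dr"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
cfn.get_waiter("stack_create_complete").wait(StackName=stack["StackId"])

# Step 5: pull the most recent backup object (written roughly every 15 minutes,
# in line with the Payment API RPO) from the assumed backup bucket.
listing = s3.list_objects_v2(Bucket="example-payment-backups", Prefix="payment-api/")
contents = listing.get("Contents", [])
if not contents:
    raise RuntimeError("No backup objects found under the expected prefix")
latest = max(contents, key=lambda obj: obj["LastModified"])
s3.download_file("example-payment-backups", latest["Key"], "/tmp/payment-api-restore")
print(f"DR stack active; restoring from {latest['Key']}")
```

In practice the downloaded backup would be handed to the Payment API's own restore tooling before the QA validation in step 6; that tooling is outside the scope of this sketch.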

7. Testing & Maintenance Schedule

| Test Type | Frequency | Last Performed | Next Due | Owner |
|---|---|---|---|---|
| Tabletop Exercise | Quarterly | [Insert Date] | [Insert Date] | IT Continuity Lead |
| Full DR Test | Annually | [Insert Date] | [Insert Date] | Security Officer |
| Backup Restore Test | Monthly | [Insert Date] | [Insert Date] | IT Ops |
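
As an aid to the monthly Backup Restore Test, the sketch below shows a minimal freshness pre-check that IT Ops could run before attempting a full restore. It is illustrative only: the bucket name, key prefix, region, and 24-hour threshold are assumptions, not values defined in this runbook.

```python
# Hedged sketch: bucket, prefix, region, and the 24-hour threshold are
# illustrative assumptions, not values defined by this runbook.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")       # assumed backup region

def latest_backup_age(bucket: str, prefix: str) -> timedelta:
    """Return the age of the newest backup object under the given prefix."""
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    contents = listing.get("Contents", [])
    if not contents:
        raise RuntimeError(f"No backup objects found under s3://{bucket}/{prefix}")
    newest = max(obj["LastModified"] for obj in contents)
    return datetime.now(timezone.utc) - newest

if __name__ == "__main__":
    age = latest_backup_age("example-backup-bucket", "example-system/")  # placeholders
    if age > timedelta(hours=24):
        raise SystemExit(f"FAIL: newest backup is {age} old; investigate before the restore test")
    print(f"OK: newest backup is {age} old; proceed with the restore test")
```

A passing check only confirms that a recent backup exists; the actual restore and data validation still need to be performed and recorded in the schedule above.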

8. Appendices

  • Appendix A: Contact Sheet

  • Appendix B: Asset Configuration Docs

  • Appendix C: Backup Schedules & Locations

  • Appendix D: DR Site Map & Network Diagrams

  • Appendix E: Incident Response Plan Link

Last updated: [Insert Date]