Disaster Recovery Runbook

Aligned with NIST SP 800-34r1, FCA regulatory expectations, and best practices for operational resilience.

Document Owner: IT Continuity Lead
Version: 1.0
Review Cycle: Semi-annually or post-major change
Last Reviewed: [Insert Date]

1. Overview

| Field | Description |
|---|---|
| Purpose | Guide the recovery of critical systems and services following a disruption. |
| Scope | Covers systems, applications, networks, and data critical to business operations. |
| Assumptions | Staff availability, backup availability, offsite access. |
| References | BIA Document, Crisis Comms Plan, Incident Register, Configuration Management DB. |

2. Roles & Responsibilities

| Role | Name | Contact | Responsibilities |
|---|---|---|---|
| DR Manager | [Insert Name] | [Insert Contact] | Coordinates execution of the runbook |
| IT Lead | [Insert Name] | [Insert Contact] | Leads system recovery |
| Comms Lead | [Insert Name] | [Insert Contact] | Handles stakeholder updates |
| Security Officer | [Insert Name] | [Insert Contact] | Ensures security during recovery |
| Vendor Liaison | [Insert Name] | [Insert Contact] | Engages external partners |

3. DR Scenarios & Triggers

| Scenario | Trigger | Examples |
|---|---|---|
| Cybersecurity Incident | Ransomware, DDoS | Data encrypted, systems offline |
| Infrastructure Outage | Power or hosting failure | Cloud provider outage |
| Data Loss | Accidental or malicious | Critical database corruption |
| Site Unavailability | Flood, fire, etc. | Data centre destroyed |

4. Critical Assets and Recovery Details

| Asset Name | Type | RTO | RPO | Recovery Steps | Location |
|---|---|---|---|---|---|
| Payment API | Application | 2 hours | 15 mins | See Section 6 | AWS (EU-WEST) |
| Customer DB | Database | 4 hours | 1 hour | See Backup Procedure | Azure |
| IAM System | Security | 1 hour | 0 mins | Activate failover | On-prem |
| Jira/Confluence | Collaboration | 8 hours | 24 hours | Contact Atlassian support | SaaS |

5. Communication Plan

| Audience | Method | Frequency | Responsible |
|---|---|---|---|
| Internal Teams | Slack / Email | Hourly | DR Manager |
| Executives | SMS / Call | 2-hourly | Comms Lead |
| Customers | Status Page | As needed | Comms Lead |
| Regulators (e.g. FCA) | Email / Call | Within 24h | Compliance Officer |

6. Step-by-Step Recovery Procedures

Example: Payment API Recovery

  1. Trigger: Monitoring system flags downtime or data loss.

  2. Notify: DR team via PagerDuty; escalate to DR Manager.

  3. Initial Response:

    • Isolate affected VMs

    • Snapshot logs and preserve evidence

  4. Activate DR site:

    • Deploy from the pre-approved CloudFormation template (AWS); see the sketch after this procedure

  5. Data Restoration:

    • Recover from S3 backup (timestamp T-15min)

  6. Validation:

    • QA team runs regression test suite

  7. Go/No-Go Decision: DR Manager signs off

  8. Customer Notification: Update status page and email notices

  9. Post-Incident Review:

    • Root cause analysis

    • Lessons learned

    • Update this runbook if needed
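
The sketch below shows how steps 4 and 5 might be scripted with boto3. It is a minimal, hedged illustration only: the region, stack name, template URL, backup bucket, and key prefix are placeholder assumptions, not the approved values for this environment.

```python
# Hedged sketch only: region, stack name, template URL, bucket, and prefix are
# placeholder assumptions, not the values approved for this environment.
import boto3

REGION = "eu-west-1"                                   # assumed DR region
cfn = boto3.client("cloudformation", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# Step 4: activate the DR site from the pre-approved CloudFormation template.
stack = cfn.create_stack(
    StackName="payment-api-dr",                        # placeholder stack name
    TemplateURL="https://example-dr-templates.s3.amazonaws.com/payment-api.yaml",
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dr"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
cfn.get_waiter("stack_create_complete").wait(StackName=stack["StackId"])

# Step 5: pull the most recent backup object (written roughly every 15 minutes,
# in line with the Payment API RPO) from the assumed backup bucket.
listing = s3.list_objects_v2(Bucket="example-payment-backups", Prefix="payment-api/")
contents = listing.get("Contents", [])
if not contents:
    raise RuntimeError("No backup objects found under the expected prefix")
latest = max(contents, key=lambda obj: obj["LastModified"])
s3.download_file("example-payment-backups", latest["Key"], "/tmp/payment-api-restore")
print(f"DR stack active; restoring from {latest['Key']}")
```

In practice the downloaded backup would be handed to the Payment API's own restore tooling before the QA validation in step 6; that tooling is outside the scope of this sketch.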

7. Testing & Maintenance Schedule

| Test Type | Frequency | Last Performed | Next Due | Owner |
|---|---|---|---|---|
| Tabletop Exercise | Quarterly | [Insert Date] | [Insert Date] | IT Continuity Lead |
| Full DR Test | Annually | [Insert Date] | [Insert Date] | Security Officer |
| Backup Restore Test | Monthly | [Insert Date] | [Insert Date] | IT Ops |
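
As an aid to the monthly Backup Restore Test, the sketch below shows a minimal freshness pre-check that IT Ops could run before attempting a full restore. It is illustrative only: the bucket name, key prefix, region, and 24-hour threshold are assumptions, not values defined in this runbook.

```python
# Hedged sketch: bucket, prefix, region, and the 24-hour threshold are
# illustrative assumptions, not values defined by this runbook.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")       # assumed backup region

def latest_backup_age(bucket: str, prefix: str) -> timedelta:
    """Return the age of the newest backup object under the given prefix."""
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    contents = listing.get("Contents", [])
    if not contents:
        raise RuntimeError(f"No backup objects found under s3://{bucket}/{prefix}")
    newest = max(obj["LastModified"] for obj in contents)
    return datetime.now(timezone.utc) - newest

if __name__ == "__main__":
    age = latest_backup_age("example-backup-bucket", "example-system/")  # placeholders
    if age > timedelta(hours=24):
        raise SystemExit(f"FAIL: newest backup is {age} old; investigate before the restore test")
    print(f"OK: newest backup is {age} old; proceed with the restore test")
```

A passing check only confirms that a recent backup exists; the actual restore and data validation still need to be performed and recorded in the schedule above.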

8. Appendices

  • Appendix A: Contact Sheet

  • Appendix B: Asset Configuration Docs

  • Appendix C: Backup Schedules & Locations

  • Appendix D: DR Site Map & Network Diagrams

  • Appendix E: Incident Response Plan Link

Last updated: [Insert Date]