Hurricane Exercise: Live Disaster Recovery Failover and Business Continuity
A full-scale live disaster recovery (DR) failover during a simulated hurricane in the Cayman Islands validated end-to-end resilience. The exercise confirmed business-continuity communications across the primary and DR sites, verified transaction integrity for last-minute activity and validated global office connectivity; it ran over a weekend to avoid disrupting day-to-day operations.
Context
The platform combined fund administration, investor relations and HR/business-continuity functionality: employee, client and investor records; templated communications with storm merge fields; phone trees; and pre/post-evacuation lists and reports. The exercise needed to validate that these capabilities worked seamlessly from both the primary and DR sites and across all global offices.
Approach
- Scope and objectives – validate live failover, confirm communications continuity, verify transaction integrity and confirm global office connectivity.
- Pre-staged alternate sites – websites and web services (for example, the Investor Portal and the client fund-manager application, a desktop client that reached its data via a web service) had alternate DR instances pre-installed and configured. These alternate sites required only a DNS change to go live; the system held a pre-defined email template to instruct the DNS provider (a sketch of such an instruction follows this list).
- DNS change not executed – the actual DNS alteration was not performed during the exercise because DNS propagation and provider processing would have exceeded the accelerated timeframe. Instead, connectivity to the DR site was validated by navigating directly to the DR webservers and services (see the connectivity check sketched after this list).
- Communications artefacts prepared – mail templates with merge fields and evacuation lists were treated as core DR assets and preloaded for the exercise; phone tree data was held as a fallback, but the phone tree was not activated in this run.
- Live transactional testing – operational users processed transactions against demonstration funds in production to create realistic, reconcilable data.
- Global participation and timing – operational users from every office joined the exercise; it ran over a weekend to avoid business disruption while still exercising realistic load and behaviours.
- Emergency failover capability – a secondary emergency failover application could trigger failover from an alternate location (for example, the US IT office) even if the primary database was offline, providing an additional safety path should the primary site be unreachable or unable to initiate failover (an illustrative outline follows this list).
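
For illustration, the sketch below shows how a pre-defined DNS-provider instruction of this kind might be rendered from a merge-field template and wrapped in an email message. The template text, field names, hostnames and addresses are assumptions for the example, not the system's actual artefacts.

```python
# Minimal sketch: render a DNS-provider instruction from a merge-field
# template. All names, hostnames and addresses are illustrative.
from email.message import EmailMessage
from string import Template

# Hypothetical template standing in for the pre-defined instruction held in the system.
DNS_INSTRUCTION_TEMPLATE = Template(
    "Please repoint the following record to the disaster recovery site:\n"
    "  $record_name -> $dr_ip\n"
    "Change requested by: $requested_by\n"
    "Reason: $reason\n"
)

def build_dns_instruction(record_name: str, dr_ip: str,
                          requested_by: str, reason: str) -> EmailMessage:
    """Fill the merge fields and wrap the result in an email message."""
    msg = EmailMessage()
    msg["Subject"] = f"DNS change request: {record_name}"
    msg["To"] = "support@dns-provider.example"   # illustrative address
    msg["From"] = "it-dr@fundadmin.example"      # illustrative address
    msg.set_content(DNS_INSTRUCTION_TEMPLATE.substitute(
        record_name=record_name, dr_ip=dr_ip,
        requested_by=requested_by, reason=reason,
    ))
    return msg

if __name__ == "__main__":
    # During the exercise the message would be reviewed rather than sent.
    print(build_dns_instruction(
        "investorportal.example.com", "203.0.113.10",
        "DR coordinator", "Hurricane DR failover exercise",
    ))
```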
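The direct-navigation check can also be scripted. The following is a minimal sketch that calls DR webservers and web services by their direct addresses and reports reachability; the endpoint names and URLs are placeholders, not the exercise's actual endpoints.

```python
# Minimal sketch: validate DR connectivity by calling the DR webservers and
# web services directly, bypassing DNS. Endpoint names and URLs are placeholders.
import urllib.error
import urllib.request

DR_ENDPOINTS = {
    "Investor Portal (DR)": "https://dr-portal.example.com/health",
    "Fund-manager web service (DR)": "https://dr-services.example.com/api/ping",
}

def check_endpoint(name: str, url: str, timeout: float = 10.0) -> bool:
    """Return True if the DR endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
            print(f"[{'OK' if ok else 'FAIL'}] {name}: HTTP {response.status}")
            return ok
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, TLS errors and HTTP error statuses.
        print(f"[FAIL] {name}: {exc}")
        return False

if __name__ == "__main__":
    results = [check_endpoint(name, url) for name, url in DR_ENDPOINTS.items()]
    print("All DR endpoints reachable" if all(results) else "One or more DR endpoints failed")
```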
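As a rough outline of the emergency failover path, the sketch below shows a tool that could be run from an alternate location: it makes a best-effort check of the primary database and, if the primary is unreachable (or the operator forces it), hands off to site-specific promotion steps. Hostnames, ports and the promotion step itself are placeholders, not the actual application.

```python
# Illustrative outline of an emergency failover trigger run from an alternate
# location (for example, the US IT office). Hostnames and the promotion step
# are placeholders; a real application would call site-specific tooling.
import socket

PRIMARY_DB = ("primary-db.example.com", 1433)   # placeholder host and port
CONNECT_TIMEOUT_SECONDS = 5.0

def primary_reachable(host: str, port: int, timeout: float) -> bool:
    """Best-effort TCP check; a False result does not require the primary
    to cooperate, which is the point of the emergency path."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_dr_site() -> None:
    """Placeholder for the site-specific promotion steps: promote the DR
    database replica, start DR application services, notify stakeholders."""
    print("Promoting DR site (placeholder for vendor/site-specific commands)")

def emergency_failover(force: bool = False) -> None:
    if force or not primary_reachable(*PRIMARY_DB, CONNECT_TIMEOUT_SECONDS):
        promote_dr_site()
    else:
        print("Primary still reachable; use the standard failover path instead")

if __name__ == "__main__":
    emergency_failover(force=False)
```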
Exercise execution
The exercise ran in phases: pre-failover checks (final sync, template and data checks), a controlled live failover to the DR site, verification of communications and transaction reconciliation, and recovery/rollback once stability was confirmed. User behaviour was deliberately mixed: some users followed shutdown instructions while others continued working to the last moment, to reveal edge cases and validate in-flight transaction handling. Communications verification focused on email templates and direct coordination; the phone tree dataset was held as a fallback but was not activated during the exercise.
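As a simple illustration of the reconciliation step, the sketch below compares transaction identifiers captured on the primary just before failover with those present on DR, flagging anything missing or duplicated. The identifiers and data sources are illustrative; the actual reconciliation ran against the demonstration funds.

```python
# Minimal sketch of post-failover reconciliation: compare transactions captured
# on the primary just before failover with what the DR database holds afterwards.
# Identifiers and data sources are illustrative.
from collections import Counter
from typing import Iterable

def reconcile(primary_ids: Iterable[str], dr_ids: Iterable[str]) -> dict:
    """Report transaction IDs missing from DR and IDs duplicated on DR."""
    primary_counts = Counter(primary_ids)
    dr_counts = Counter(dr_ids)
    missing = sorted(tx for tx in primary_counts if tx not in dr_counts)
    duplicated = sorted(tx for tx, n in dr_counts.items() if n > 1)
    return {"missing_on_dr": missing,
            "duplicated_on_dr": duplicated,
            "clean": not missing and not duplicated}

if __name__ == "__main__":
    # Demonstration-fund transactions entered up to the moment of failover.
    before_failover = ["TX-1001", "TX-1002", "TX-1003"]
    on_dr = ["TX-1001", "TX-1002", "TX-1003"]
    print(reconcile(before_failover, on_dr))   # expect a clean result
```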
Outcomes
- Communications continuity – templated email messages and direct coordination worked from both primary and DR sites; phone tree data was confirmed available as a fallback, but the tree itself was not exercised.
- Rapid failover – the primary-to-DR failover completed in three seconds during the test, demonstrating the effectiveness of the automated failover path.
- Emergency path confirmed – the emergency failover application, which allows IT staff to trigger failover from an alternate location even if the primary database is offline, was confirmed as an available secondary path, providing resilience against scenarios where the primary site cannot initiate failover.
- Transaction integrity – transactions entered immediately prior to failover and those processed on DR reconciled without loss or duplication.
- Global connectivity – every office successfully connected to DR webservers and validated cross-site workflows by direct navigation to DR endpoints.
- Runbook and automation improvements – routing and DNS steps were automated where possible and manual handoffs were reduced, shortening recovery time against the recovery time objective (RTO); a sketch of one automatable DNS check follows this list.
- Stakeholder confidence – embedding a live DR failover within a broader business-continuity exercise, with operational staff participating, increased stakeholders' confidence that the DR process was efficient and effective.
- Operational risk identified – exercising DR and business-continuity together revealed staffing and availability assumptions for IT and communications teams that were subsequently mitigated through cross-training and escalation planning.
- Audit artefacts – logs, verification checklists and timelines were produced for regulatory assurance and after-action review.
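
One DNS step that lends itself to automation is verifying that a cutover has become visible. The sketch below polls until a hostname resolves to the expected DR address; the hostname, address and polling cadence are assumptions for illustration, not the exercise's actual tooling.

```python
# Minimal sketch of one automatable DNS step: polling until a hostname resolves
# to the expected DR address after the provider applies the change.
# The hostname, address and polling cadence are placeholders.
import socket
import time

def resolved_addresses(hostname: str) -> set:
    """Return the set of addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}

def wait_for_cutover(hostname: str, expected_ip: str,
                     poll_seconds: int = 60, max_attempts: int = 30) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            addresses = resolved_addresses(hostname)
        except socket.gaierror:
            addresses = set()
        if expected_ip in addresses:
            print(f"Cutover visible after {attempt} attempt(s): {addresses}")
            return True
        time.sleep(poll_seconds)
    print("Cutover not observed within the polling window")
    return False

if __name__ == "__main__":
    wait_for_cutover("investorportal.example.com", "203.0.113.10")
```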
Key takeaways
- Treat communications templates and HR lists as first-class DR artefacts and validate them under exercise conditions; keep phone tree data current as a fallback even if not exercised.
- Pre-stage alternate web and web-service sites so a DNS change is the only step to make DR instances live; include a pre-defined provider instruction in the system to speed execution when required.
- When DNS changes cannot be executed in an accelerated exercise, validate DR connectivity by navigating directly to DR webservers and services.
- Include global offices in exercises to validate connectivity and cross-site operational procedures.
- Run live exercises outside business hours where possible to avoid disruption while still testing realistic behaviours and loads.
- Embed DR failover tests within business continuity scenarios that involve operational staff and activity to build stakeholder confidence and reveal real-world dependencies.
- Provide an emergency failover path that can be triggered from an alternate location and that works even if the primary database is offline; this reduces single-point-of-failure risk and increases operational options during catastrophic primary-site loss.
- Exercise DR and business continuity together where possible to surface staffing and availability assumptions for IT and communications teams; mitigate those risks through cross-training and clear escalation plans.