Revolutionizing Disaster Recovery with OpenStack: Project Aegis

This is part 3 of our series on OpenStack disaster recovery. Read Part 2: Pure Storage FlashArray Implementation Deep Dive
Beyond Traditional Failover: A New Approach to Disaster Recovery
Traditional OpenStack Cinder replication has a significant limitation: failover operations occur at the backend level, affecting all volumes on a storage backend simultaneously. For multi-tenant environments or organizations with diverse application requirements, this all-or-nothing approach creates operational challenges and unnecessary risk.
The OpenStack-DR project (Project Aegis), currently in proof-of-concept phase, introduces a revolutionary approach that achieves granular, per-tenant disaster recovery without the limitations of traditional Cinder failover mechanisms.
The Problem with Traditional Cinder Failover
Before exploring the solution, it’s important to understand the limitations of standard Cinder replication failover:
Backend-Level Operations: The traditional `openstack volume service failover` command operates on entire backends, meaning all volumes – replicated and non-replicated – are impacted.
Non-Replicated Volume Disruption: Perhaps most critically, failover operations make non-replicated volumes completely unavailable. In mixed environments where only some volumes require replication, this can be catastrophic.
Limited Granularity: You cannot selectively fail over specific tenants, applications, or individual volumes without affecting everything else on the backend.
Service Coordination Complexity: Backend failover requires careful coordination between Cinder services and can necessitate maintenance windows.
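To see why this is so coarse-grained, note that failover is requested against a whole Cinder backend (host@backend), never against individual volumes. A typical invocation with the legacy cinder client looks like this (host, backend, and replication target names are placeholders):

```shell
# Fails over EVERY volume on the named backend to its replication target
cinder failover-host <cinder-host>@<backend-name> --backend_id <replication-backend-id>
```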
The OpenStack-DR Alternative: Volume Management Approach
The OpenStack-DR project takes a fundamentally different approach. Instead of using Cinder’s built-in failover mechanisms, it leverages direct volume management to achieve surgical precision in disaster recovery operations.
Core Methodology:
- Individual Volume Promotion: Ansible playbooks identify and promote specific volume copies within Pure Storage Protection Groups on the target array
- Volume Import: The `volume_manage` Ansible module imports the promoted volumes into the target OpenStack cloud (sketched below)
- Instance Management: For bootable volumes, instances are automatically created and started on the target cloud
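To make the flow more concrete, here is a minimal playbook sketch of the promote-and-import sequence. It is illustrative only, not the actual Project Aegis code: the Protection Group name, snapshot suffix, backend host string, and cloud name are placeholders; the `purefa_pgsnap` task from the purestorage.flasharray collection is one way to copy a volume out of a Protection Group snapshot; and because `volume_manage` is still pending upstream, its parameter names here are assumptions rather than the module's real interface.

```yaml
# Illustrative sketch only - not the actual Project Aegis playbooks.
- name: Promote and import one tenant's replicated volumes (sketch)
  hosts: localhost
  gather_facts: false
  vars:
    dr_fa_url: "<FA_DR_MGMT_IP>"
    dr_fa_token: "<FA_DR_API_TOKEN>"
    # Replicated Protection Groups appear on the target array as <source-array>:<pgroup-name>
    pgroup: "<source-array>:tenant_1_async"
    snap_suffix: "<snapshot-suffix>"
    volumes:
      - tenant1-app-data
  tasks:
    - name: Copy each volume out of the latest Protection Group snapshot
      purestorage.flasharray.purefa_pgsnap:
        fa_url: "{{ dr_fa_url }}"
        api_token: "{{ dr_fa_token }}"
        name: "{{ pgroup }}"
        suffix: "{{ snap_suffix }}"
        state: copy                    # copy a volume out of the snapshot
        restore: "{{ item }}"
        target: "{{ item }}-dr"
        overwrite: true
      loop: "{{ volumes }}"

    - name: Import the promoted copies into the target OpenStack cloud
      volume_manage:                   # pending upstream; parameter names assumed
        cloud: dr-cloud                # entry in clouds.yaml for the target cloud
        host: "cinder-volume@fa_tenant_1_async#fa_tenant_1_async"  # host@backend#pool (placeholder)
        source_name: "{{ item }}-dr"
        name: "{{ item }}"
        volume_type: Tenant1_Async
      loop: "{{ volumes }}"
```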
This approach provides several key advantages:
- Surgical Precision: Only specific volumes are promoted and affected
- Non-Replicated Volume Protection: Volumes without replication remain completely unaffected
- Operational Flexibility: DR operations can be performed during business hours
- Tenant Isolation: Individual tenants can be recovered independently by promoting only their volumes
Multi-Backend Strategy for Granular DR
Problem: Standard Cinder failover is per-backend, limiting flexibility for multi-tenant environments.
Solution: Configure multiple Cinder backends pointing to the same physical FlashArray, each with different Protection Group (async) or Pod (sync) configurations.
Asynchronous Replication Implementation:
```ini
[DEFAULT]
enabled_backends = fa_tenant_1_async, fa_tenant_2_async, fa_tenant_3_sync

[fa_tenant_1_async]
volume_backend_name = fa_tenant_1_async
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
san_ip = <FA_PRIMARY_MGMT_IP>
replication_device = backend_id:fa-2,san_ip:<FA_DR_MGMT_IP>,api_token:<FA_DR_API_TOKEN>,type:async
pure_api_token = <FA_PRIMARY_API_TOKEN>
pure_replication_pg_name = tenant_1_async

[fa_tenant_2_async]
volume_backend_name = fa_tenant_2_async
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
san_ip = <FA_PRIMARY_MGMT_IP>
replication_device = backend_id:fa-2,san_ip:<FA_DR_MGMT_IP>,api_token:<FA_DR_API_TOKEN>,type:async
pure_api_token = <FA_PRIMARY_API_TOKEN>
pure_replication_pg_name = tenant_2_async
```
Synchronous Replication Implementation:
```ini
[fa_tenant_3_sync]
volume_backend_name = fa_tenant_3_sync
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
san_ip = <FA_PRIMARY_MGMT_IP>
replication_device = backend_id:fa-2,san_ip:<FA_DR_MGMT_IP>,api_token:<FA_DR_API_TOKEN>,type:sync
pure_api_token = <FA_PRIMARY_API_TOKEN>
pure_replication_pod_name = tenant_3_sync_pod
```
Each backend uses the same physical arrays but maintains separate Protection Groups (for async) or Pods (for sync), enabling independent replication policies and failover operations.
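Because each stanza registers as its own backend, every tenant effectively gets an independently manageable replication domain on shared hardware. A quick sanity check after restarting the Cinder volume service is to confirm that each backend reports as a separate, healthy cinder-volume service:

```shell
# Each configured backend should appear as its own cinder-volume service
openstack volume service list --service cinder-volume
```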
Tenant-Specific Volume Types
Create dedicated volume types for each tenant with specific backend assignments and replication types:
Asynchronous Replication Volume Types:
```shell
# Tenant 1 Async Volume Type
openstack volume type create --private Tenant1_Async
openstack volume type set --property volume_backend_name=fa_tenant_1_async Tenant1_Async
openstack volume type set --property replication_enabled='<is> True' Tenant1_Async
openstack volume type set --property replication_type='<in> async' Tenant1_Async
openstack volume type set --project <tenant_1_name> Tenant1_Async

# Tenant 2 Async Volume Type
openstack volume type create --private Tenant2_Async
openstack volume type set --property volume_backend_name=fa_tenant_2_async Tenant2_Async
openstack volume type set --property replication_enabled='<is> True' Tenant2_Async
openstack volume type set --property replication_type='<in> async' Tenant2_Async
openstack volume type set --project <tenant_2_name> Tenant2_Async
```
Synchronous Replication Volume Types:
```shell
# Tenant 3 Sync Volume Type
openstack volume type create --private Tenant3_Sync
openstack volume type set --property volume_backend_name=fa_tenant_3_sync Tenant3_Sync
openstack volume type set --property replication_enabled='<is> True' Tenant3_Sync
openstack volume type set --property replication_type='<in> sync' Tenant3_Sync
openstack volume type set --project <tenant_3_name> Tenant3_Sync
```
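With the types in place and scoped to their projects, tenants consume replication simply by selecting the type at creation time. For example, with a placeholder volume name and size:

```shell
# Tenant 1 creates a replicated volume using its dedicated type
openstack volume create --type Tenant1_Async --size 50 tenant1-app-data

# Confirm the volume was scheduled and is available
openstack volume show tenant1-app-data
```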
OpenStack-DR Workflow Process
Prerequisites:
- Authentication Setup: Configure `clouds.yaml` with credentials for both OpenStack clouds (a sketch follows this list)
- Volume Management Module: Deploy the `volume_manage` Ansible module (currently pending upstream)
- Network Connectivity: Ensure Pure FlashArray replication is configured between sites
- Volume Types: Create appropriate volume types on both source and target clouds
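A minimal `clouds.yaml` for the playbooks might look like the following; the cloud names (prod-cloud, dr-cloud), endpoints, and credentials are placeholders for your own environments:

```yaml
# Minimal sketch - all names, URLs, and credentials are placeholders
clouds:
  prod-cloud:
    auth:
      auth_url: https://prod.openstack.example.com:5000/v3
      username: dr-operator
      password: <PASSWORD>
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  dr-cloud:
    auth:
      auth_url: https://dr.openstack.example.com:5000/v3
      username: dr-operator
      password: <PASSWORD>
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```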
Asynchronous Replication DR Execution Steps:
- Pre-Failover Assessment:
  - Verify Protection Group replication status and latest snapshots
  - Identify individual volumes requiring promotion within Protection Groups
  - Confirm target cloud resource availability
- Asynchronous Failover Process:
  - Volume Identification: Ansible playbooks identify specific volumes within tenant Protection Groups
  - Individual Volume Promotion: Promote volume copies from the latest Protection Group snapshots
  - Volume Import: Import promoted volumes into the target OpenStack cloud using `volume_manage`
  - Instance Recovery: Automatically start instances for bootable volumes (source instances can remain running)
- Post-Failover Validation:
  - Verify imported volumes are accessible (example commands below)
  - Validate application functionality with acceptable RPO (typically 15 minutes)
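For the validation step, a couple of CLI checks against the target cloud are usually enough to confirm the import succeeded before handing systems back to application teams (assuming the target cloud is named dr-cloud in `clouds.yaml`):

```shell
# Point the CLI at the target cloud defined in clouds.yaml
export OS_CLOUD=dr-cloud

# Imported volumes should be 'available' (or 'in-use' once attached)
openstack volume list --long

# For bootable volumes, recovered instances should be ACTIVE
openstack server list
```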
Synchronous Replication DR Execution Steps:
- Pre-Failover Assessment:
  - Verify ActiveCluster Pod status and stretched volume health
  - Confirm zero replication lag for sync volumes
  - Plan instance quiescing strategy
- Synchronous Failover Process:
  - Instance Quiescing: Stop or pause source instances to ensure data consistency
  - Pod Volume Promotion: Ansible playbooks clone individual volumes within stretched Pods (see the sketch after this list)
  - Volume Import: Import cloned sync volumes with a zero data loss guarantee
  - Instance Recovery: Start instances on the target cloud with complete data integrity
- Post-Failover Validation:
  - Verify zero data loss (RPO = 0)
  - Validate application startup with complete data consistency
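As a rough illustration of the quiesce-and-clone step, the commands below sketch the idea; the instance, volume, and Pod names are placeholders, and the purevol syntax is shown from memory, so verify it against your Purity version (the Ansible playbooks drive the equivalent operation via the API):

```shell
# Quiesce the source instance so the stretched volume is consistent at a known point
openstack --os-cloud prod-cloud server stop tenant3-db-01

# Clone the stretched volume out of the ActiveCluster Pod on the target array
purevol copy tenant_3_sync_pod::<array-volume-name> <array-volume-name>-dr
```

The cloned copy is then imported into the target cloud exactly as in the asynchronous case.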
Operational Workflow: DR Without Disruption
The OpenStack-DR workflow represents a paradigm shift in disaster recovery operations:
Pre-Failover Preparation:
- Verify replication status via Pure Storage management interfaces
- Confirm target cloud capacity and volume type availability
- Validate Ansible playbook configuration and authentication
DR Execution:
- Individual Volume Identification: Current Ansible playbooks identify specific volumes within the relevant tenant’s Protection Group that need to be promoted
- Volume Promotion: Promote individual volume copies on the target FlashArray (not the entire Protection Group in the current alpha code)
- Volume Import: Import the promoted individual volumes into the target OpenStack cloud using the `volume_manage` module
- Instance Recovery: For bootable volumes, automatically create and start instances using the imported volumes (see the example after this list)
- Validation: Confirm application functionality and update network configurations as needed
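For the instance recovery step, booting a replacement server from an imported bootable volume is a standard Nova operation; the flavor, network, and names below are placeholders:

```shell
# Boot a replacement instance from the imported bootable volume on the target cloud
openstack --os-cloud dr-cloud server create \
  --flavor m1.medium \
  --volume tenant1-app-data \
  --network tenant1-dr-net \
  tenant1-app-01
```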
Key Operational Advantages:
- No Service Disruption: Cinder services continue normal operation
- Selective Recovery: Only affected tenant volumes are processed
- Business Hours Operation: No maintenance window required for most scenarios
- Parallel Processing: Multiple tenants can be recovered simultaneously if needed
Real-World Scenario: Tenant-Specific Application Failure
Consider a scenario where Tenant 1’s critical application requires immediate disaster recovery activation:
Traditional Approach Would:
- Require failing over the entire backend
- Make all non-replicated volumes unavailable
- Affect all other tenants on the same backend
- Necessitate a maintenance window
OpenStack-DR Approach:
- Identifies specific volumes within the `tenant_1` Protection Group
- Promotes only those individual volume copies on the target array
- Imports only Tenant 1’s promoted volumes to the target cloud
- Leaves all other tenants and non-replicated volumes unaffected
- Completes in 10-15 minutes during business hours
Implementation Benefits
Advantages of OpenStack-DR Approach:
- No Cinder Service Disruption: Avoids backend-wide failover that affects all volumes
- Selective Volume Recovery: Import only specific replicated volumes without impacting others
- Non-Replicated Volume Protection: Non-replicated volumes remain accessible and unaffected
- Mixed Environment Support: Perfect for environments with both replicated and non-replicated storage
- Operational Flexibility: Can perform partial DR operations during business hours
- Reduced Complexity: Eliminates need for complex Cinder failover coordination
- Cross-Cloud Compatibility: Works between different OpenStack deployments without version dependencies
Comparison with Traditional Cinder Failover:
| Aspect | Traditional Cinder Failover | OpenStack-DR Volume Management |
|---|---|---|
| Granularity | Backend-level (all volumes) | Individual volume level |
| Service Impact | Requires Cinder service coordination | No Cinder service disruption |
| Tenant Isolation | Cannot isolate specific tenants | Full tenant isolation |
| Non-Replicated Volumes | Makes non-replicated volumes unavailable | No impact on non-replicated volumes |
| Operational Window | Requires maintenance window | Can operate during business hours |
| Rollback Complexity | Complex backend rollback | Simple volume re-import |
Critical Limitation of Traditional Cinder Failover: Traditional Cinder failover operations affect the entire backend, making ALL volumes on that backend unavailable, including non-replicated volumes. This means:
- Any volumes without replication configuration become inaccessible
- Mixed environments with both replicated and non-replicated volumes face complete disruption
- Recovery requires full backend restoration, affecting all tenants and applications
OpenStack-DR Advantage: The OpenStack-DR volume management approach operates only on specifically targeted replicated volumes:
- Non-replicated volumes remain completely unaffected and accessible
- Mixed storage environments can maintain partial operations during DR scenarios
- Only the specific volumes being recovered are impacted during the import process
- Provides true surgical precision for disaster recovery operations
Implementation Prerequisites
The OpenStack-DR approach requires several components:
Authentication: Properly configured clouds.yaml with credentials for both source and target OpenStack environments
Ansible Integration: The volume_manage module (currently pending upstream integration) for importing volumes across OpenStack clouds
Network Connectivity: Pure Storage replication must be properly configured between arrays
Target Cloud Preparation: Appropriate volume types and capacity must be available on the target OpenStack deployment
In my next series of posts, I’ll explore specific disaster recovery scenarios, enterprise orchestration solutions, and comprehensive implementation strategies for production environments.
If you would like to know more about the progress of Project Aegis, or even get involved in its development, please let me know in the comments section below.