OpenStack Cinder Replication and Disaster Recovery, Pt. 3

Revolutionizing Disaster Recovery with OpenStack: Project Aegis

This is part 3 of our series on OpenStack disaster recovery. Read Part 2: Pure Storage FlashArray Implementation Deep Dive

Beyond Traditional Failover: A New Approach to Disaster Recovery

Traditional OpenStack Cinder replication has a significant limitation: failover operations occur at the backend level, affecting all volumes on a storage backend simultaneously. For multi-tenant environments or organizations with diverse application requirements, this all-or-nothing approach creates operational challenges and unnecessary risk.

The OpenStack-DR project (Project Aegis), currently in proof-of-concept phase, introduces a revolutionary approach that achieves granular, per-tenant disaster recovery without the limitations of traditional Cinder failover mechanisms.

The Problem with Traditional Cinder Failover

Before exploring the solution, it’s important to understand the limitations of standard Cinder replication failover:

Backend-Level Operations: The traditional openstack volume service failover command operates at the backend level, meaning all volumes – replicated and non-replicated – are impacted.

Non-Replicated Volume Disruption: Perhaps most critically, failover operations make non-replicated volumes completely unavailable. In mixed environments where only some volumes require replication, this can be catastrophic.

Limited Granularity: You cannot selectively fail over specific tenants, applications, or individual volumes without affecting everything else on the backend.

Service Coordination Complexity: Backend failover requires careful coordination between Cinder services and can necessitate maintenance windows.

The OpenStack-DR Alternative: Volume Management Approach

The OpenStack-DR project takes a fundamentally different approach. Instead of using Cinder’s built-in failover mechanisms, it leverages direct volume management to achieve surgical precision in disaster recovery operations.

Core Methodology:

  1. Individual Volume Promotion: Ansible playbooks identify and promote specific volume copies within Pure Storage Protection Groups on the target array
  2. Volume Import: The volume_manage Ansible module imports promoted individual volumes into the target OpenStack cloud
  3. Instance Management: For bootable volumes, instances are automatically created and started on the target cloud

This approach provides several significant advantages:

  • Surgical Precision: Only specific volumes are promoted and affected
  • Non-Replicated Volume Protection: Volumes without replication remain completely unaffected
  • Operational Flexibility: DR operations can be performed during business hours
  • Tenant Isolation: Individual tenants can be recovered independently by promoting only their volumes

Multi-Backend Strategy for Granular DR

Problem: Standard Cinder failover is per-backend, limiting flexibility for multi-tenant environments.

Solution: Configure multiple Cinder backends pointing to the same physical FlashArray, each with different Protection Group (async) or Pod (sync) configurations.

Asynchronous Replication Implementation:
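
A minimal cinder.conf sketch for two per-tenant asynchronous backends on the same FlashArray, assuming the iSCSI variant of the Pure driver; the backend names, array addresses, API tokens, and Protection Group names are all placeholders:

    [DEFAULT]
    enabled_backends = pure-tenant1-async,pure-tenant2-async

    [pure-tenant1-async]
    volume_backend_name = pure-tenant1-async
    volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
    san_ip = <source-array-mgmt-ip>
    pure_api_token = <source-array-api-token>
    # Replicate asynchronously to the DR array
    replication_device = backend_id:pure-dr,san_ip:<target-array-mgmt-ip>,api_token:<target-array-api-token>,type:async
    # Dedicated Protection Group so this backend replicates independently
    pure_replication_pg_name = tenant_1-pg

    [pure-tenant2-async]
    volume_backend_name = pure-tenant2-async
    volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
    san_ip = <source-array-mgmt-ip>
    pure_api_token = <source-array-api-token>
    replication_device = backend_id:pure-dr,san_ip:<target-array-mgmt-ip>,api_token:<target-array-api-token>,type:async
    pure_replication_pg_name = tenant_2-pg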

Synchronous Replication Implementation:
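
A comparable sketch for a synchronous (ActiveCluster) backend; the same stanza repeats per tenant, and again every name is a placeholder:

    [pure-tenant1-sync]
    volume_backend_name = pure-tenant1-sync
    volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
    san_ip = <source-array-mgmt-ip>
    pure_api_token = <source-array-api-token>
    # type:sync enables ActiveCluster synchronous replication
    replication_device = backend_id:pure-dr,san_ip:<target-array-mgmt-ip>,api_token:<target-array-api-token>,type:sync
    # Dedicated stretched Pod for this backend
    pure_replication_pod_name = tenant_1-pod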

Each backend uses the same physical arrays but maintains separate Protection Groups (for async) or Pods (for sync), enabling independent replication policies and failover operations.

Tenant-Specific Volume Types

Create dedicated volume types for each tenant with specific backend assignments and replication types:

Asynchronous Replication Volume Types:
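
As a sketch, each tenant gets a volume type pinned to its own async backend via extra specs; the type and backend names below are illustrative and match the configuration sketch above:

    openstack volume type create tenant1-async
    openstack volume type set \
        --property volume_backend_name='pure-tenant1-async' \
        --property replication_enabled='<is> True' \
        --property replication_type='<in> async' \
        tenant1-async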

Synchronous Replication Volume Types:
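
The synchronous equivalent, again with illustrative names; only the backend assignment and replication_type differ:

    openstack volume type create tenant1-sync
    openstack volume type set \
        --property volume_backend_name='pure-tenant1-sync' \
        --property replication_enabled='<is> True' \
        --property replication_type='<in> sync' \
        tenant1-sync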

OpenStack-DR Workflow Process

Prerequisites:

  1. Authentication Setup: Configure clouds.yaml with credentials for both OpenStack clouds (see the example after this list)
  2. Volume Management Module: Deploy the volume_manage Ansible module (currently pending upstream integration)
  3. Network Connectivity: Ensure Pure FlashArray replication is configured between sites
  4. Volume Types: Create appropriate volume types on both source and target clouds
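
For step 1, a minimal clouds.yaml sketch covering both clouds; the cloud names, URLs, and credentials are placeholders:

    clouds:
      source:
        auth:
          auth_url: https://source-cloud.example.com:5000/v3
          username: dr-admin
          password: <password>
          project_name: admin
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne
      target:
        auth:
          auth_url: https://target-cloud.example.com:5000/v3
          username: dr-admin
          password: <password>
          project_name: admin
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne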

Asynchronous Replication DR Execution Steps:

  1. Pre-Failover Assessment:
    • Verify Protection Group replication status and latest snapshots
    • Identify individual volumes requiring promotion within Protection Groups
    • Confirm target cloud resource availability
  2. Asynchronous Failover Process:
    • Volume Identification: Ansible playbooks identify specific volumes within tenant Protection Groups
    • Individual Volume Promotion: Promote volume copies from latest Protection Group snapshots
    • Volume Import: Import promoted volumes into target OpenStack cloud using volume_manage (a manual CLI equivalent is sketched after these steps)
    • Instance Recovery: Automatically start instances for bootable volumes (source instances can remain running)
  3. Post-Failover Validation:
    • Verify imported volumes are accessible
    • Validate application functionality with acceptable RPO (typically 15 minutes)
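
The project automates the import step with volume_manage; as a point of reference, the manual equivalent uses Cinder's existing manage facility. A sketch, with placeholder host, backend, and volume names:

    # Bring an already-promoted array volume under Cinder's control,
    # running with credentials pointed at the target cloud's API
    cinder manage \
        --name tenant1-app-vol \
        --volume-type tenant1-async \
        --id-type source-name \
        target-host@pure-tenant1-async#pure-tenant1-async \
        tenant1-app-vol-promoted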

Synchronous Replication DR Execution Steps:

  1. Pre-Failover Assessment:
    • Verify ActiveCluster Pod status and stretched volume health
    • Confirm zero replication lag for sync volumes
    • Plan instance quiescing strategy
  2. Synchronous Failover Process:
    • Instance Quiescing: Stop or pause source instances to ensure data consistency (see the sketch after these steps)
    • Pod Volume Promotion: Ansible playbooks clone individual volumes within stretched Pods
    • Volume Import: Import cloned sync volumes with zero data loss guarantee
    • Instance Recovery: Start instances on target cloud with complete data integrity
  3. Post-Failover Validation:
    • Verify zero data loss (RPO = 0)
    • Validate application startup with complete data consistency
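
A minimal sketch of the quiescing step, assuming an admin account that can list the tenant's instances; the project name is a placeholder:

    # Stop all of Tenant 1's instances on the source cloud before sync failover
    for server in $(openstack --os-cloud source server list \
          --project tenant_1 -f value -c ID); do
        openstack --os-cloud source server stop "$server"
    done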

Operational Workflow: DR Without Disruption

The OpenStack-DR workflow represents a paradigm shift in disaster recovery operations:

Pre-Failover Preparation:

  • Verify replication status via Pure Storage management interfaces
  • Confirm target cloud capacity and volume type availability
  • Validate Ansible playbook configuration and authentication

DR Execution:

  1. Individual Volume Identification: The current Ansible playbooks identify the specific volumes within the relevant tenant’s Protection Group that need to be promoted
  2. Volume Promotion: Promote individual volume copies on the target FlashArray (not the entire Protection Group in the current alpha code)
  3. Volume Import: Import the promoted individual volumes into the target OpenStack cloud using the volume_manage module (see the playbook sketch after these steps)
  4. Instance Recovery: For bootable volumes, automatically create and start instances using the imported volumes
  5. Validation: Confirm application functionality and update network configurations as needed
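
Because volume_manage has not yet merged upstream, its interface may change; the sketch below pairs a Protection Group snapshot copy (via the purestorage.flasharray collection) with a hypothetical volume_manage task and an instance boot. All names, variables, and the volume_manage parameters are illustrative:

    - name: Recover Tenant 1 volumes on the DR site
      hosts: localhost
      gather_facts: false
      tasks:
        - name: Copy a volume out of the latest Protection Group snapshot on the target array
          purestorage.flasharray.purefa_pgsnap:
            name: tenant_1-pg
            suffix: latest              # snapshot suffix to copy from
            restore: volume-1           # volume inside the Protection Group
            target: tenant1-app-vol     # promoted copy created on the target array
            state: copy
            fa_url: "{{ target_array_ip }}"
            api_token: "{{ target_array_api_token }}"

        - name: Import the promoted copy into the target cloud (hypothetical module)
          volume_manage:
            cloud: target
            host: target-host@pure-tenant1-async#pure-tenant1-async
            source_name: tenant1-app-vol
            name: tenant1-app-vol
            volume_type: tenant1-async
          register: imported

        - name: Boot an instance from the imported bootable volume
          openstack.cloud.server:
            cloud: target
            state: present
            name: tenant1-app
            flavor: m1.medium
            boot_volume: "{{ imported.volume.id }}"
            auto_ip: false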

Key Operational Advantages:

  • No Service Disruption: Cinder services continue normal operation
  • Selective Recovery: Only affected tenant volumes are processed
  • Business Hours Operation: No maintenance window required for most scenarios
  • Parallel Processing: Multiple tenants can be recovered simultaneously if needed

Real-World Scenario: Tenant-Specific Application Failure

Consider a scenario where Tenant 1’s critical application requires immediate disaster recovery activation:

Traditional Approach Would:

  • Require failing over the entire backend
  • Make all non-replicated volumes unavailable
  • Affect all other tenants on the same backend
  • Necessitate a maintenance window

OpenStack-DR Approach:

  • Identifies specific volumes within the tenant_1 Protection Group
  • Promotes only those individual volume copies on the target array
  • Imports only Tenant 1’s promoted volumes to the target cloud
  • Leaves all other tenants and non-replicated volumes unaffected
  • Completes in 10-15 minutes during business hours

Implementation Benefits

Advantages of OpenStack-DR Approach:

  • No Cinder Service Disruption: Avoids backend-wide failover that affects all volumes
  • Selective Volume Recovery: Import only specific replicated volumes without impacting others
  • Non-Replicated Volume Protection: Non-replicated volumes remain accessible and unaffected
  • Mixed Environment Support: Perfect for environments with both replicated and non-replicated storage
  • Operational Flexibility: Can perform partial DR operations during business hours
  • Reduced Complexity: Eliminates need for complex Cinder failover coordination
  • Cross-Cloud Compatibility: Works between different OpenStack deployments without version dependencies

Comparison with Traditional Cinder Failover:

Aspect                 | Traditional Cinder Failover               | OpenStack-DR Volume Management
Granularity            | Backend-level (all volumes)               | Individual volume level
Service Impact         | Requires Cinder service coordination      | No Cinder service disruption
Tenant Isolation       | Cannot isolate specific tenants           | Full tenant isolation
Non-Replicated Volumes | Makes non-replicated volumes unavailable  | No impact on non-replicated volumes
Operational Window     | Requires maintenance window               | Can operate during business hours
Rollback Complexity    | Complex backend rollback                  | Simple volume re-import

Critical Limitation of Traditional Cinder Failover: Traditional Cinder failover operations affect the entire backend, making ALL volumes on that backend unavailable, including non-replicated volumes. This means:

  • Any volumes without replication configuration become inaccessible
  • Mixed environments with both replicated and non-replicated volumes face complete disruption
  • Recovery requires full backend restoration, affecting all tenants and applications

OpenStack-DR Advantage: The OpenStack-DR volume management approach operates only on specifically targeted replicated volumes:

  • Non-replicated volumes remain completely unaffected and accessible
  • Mixed storage environments can maintain partial operations during DR scenarios
  • Only the specific volumes being recovered are impacted during the import process
  • Provides true surgical precision for disaster recovery operations

Implementation Prerequisites

The OpenStack-DR approach requires several components:

Authentication: Properly configured clouds.yaml with credentials for both source and target OpenStack environments

Ansible Integration: The volume_manage module (currently pending upstream integration) for importing volumes across OpenStack clouds

Network Connectivity: Pure Storage replication must be properly configured between arrays

Target Cloud Preparation: Appropriate volume types and capacity must be available on the target OpenStack deployment

In my next series of posts, I’ll explore specific disaster recovery scenarios, enterprise orchestration solutions, and comprehensive implementation strategies for production environments.

If you would like to know more about the progress of Project Aegis, or even get involved with its development, please let me know in the comments section below.
