OpenStack Cinder Replication and Disaster Recovery, Pt. 6

This is part 6 of the series on OpenStack disaster recovery. Read Part 5: Operational Excellence and Future Considerations

Beyond Manual Processes: Enterprise DR Orchestration

While the OpenStack-DR project (Aegis) provides surgical precision for disaster recovery operations, enterprise environments often require higher-level orchestration, automated workflows, and self-service capabilities for tenants. This is where Trilio’s upcoming Disaster Recovery as a Service (DRaaS) solution complements the technical foundation we’ve explored, providing enterprise-grade orchestration on top of Pure Storage’s replication capabilities.

Note that we are discussing an in-development Trilio project, so some concepts and commands may change as development continues.

The Enterprise DR Gap

Organizations migrating from VMware to OpenStack frequently discover that disaster recovery represents a significant capability gap. VMware’s Site Recovery Manager (SRM) provided comprehensive, tenant-driven DR workflows that many OpenStack environments struggle to replicate. The challenge isn’t just technical—it’s operational:

  • Self-Service Requirements: Tenants need the ability to manage their own DR policies without requiring infrastructure team intervention
  • Workflow Orchestration: DR operations involve complex sequences of storage, compute, and network configuration changes
  • Metadata Management: Successful DR requires preserving and replicating VM configurations, network mappings, and security policies
  • Testing and Validation: Regular DR testing without production impact is essential but operationally complex

Trilio’s DRaaS Architecture

Trilio plans to address these challenges by providing an orchestration layer that leverages the Pure Storage and OpenStack integration we’ve discussed while adding enterprise workflow capabilities.

Core Architecture Components:

  1. Trilio DR Service: Installed per OpenStack region/cluster, manages metadata replication and workflow orchestration
  2. Pure Storage Integration: Leverages FlashArray Protection Groups and Pods for storage replication
  3. OpenStack Native APIs: Uses Cinder replication APIs and volume management capabilities
  4. Tenant Self-Service: Horizon dashboard panels and CLI/API interfaces for tenant-driven operations

Trilio’s Approach to Tenant Isolation

Trilio will build upon the multi-backend strategy we explored in the OpenStack-DR project (Aegis) and add sophisticated orchestration for both replication types:

Asynchronous Replication Per-Tenant Configuration:

# Trilio leverages multi-backend approach for async replication
[fa_tenant_finance_async]
volume_backend_name = fa_tenant_finance_async
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pg_name = finance_protection_group
replication_device = backend_id:fa-dr,type:async

[fa_tenant_ecommerce_async]
volume_backend_name = fa_tenant_ecommerce_async
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pg_name = ecommerce_protection_group
replication_device = backend_id:fa-dr,type:async

Synchronous Replication Per-Tenant Configuration:

# Trilio configuration for sync replication using Pods
[fa_tenant_trading_sync]
volume_backend_name = fa_tenant_trading_sync
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pod_name = trading_critical_pod
replication_device = backend_id:fa-dr,type:sync

[fa_tenant_banking_sync]
volume_backend_name = fa_tenant_banking_sync
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pod_name = banking_critical_pod
replication_device = backend_id:fa-dr,type:sync
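
Because each tenant stanza differs only in the tenant name, the replication mode, and the Protection Group or Pod name, stanzas like these can be generated rather than hand-edited. The following is a minimal Python sketch under that assumption. `render_backend` is a hypothetical helper, not part of Cinder or Trilio, and like the fragments above it leaves out array addresses and credentials.

```python
import configparser
import io


def render_backend(tenant: str, mode: str, target: str, backend_id: str = "fa-dr") -> str:
    """Render one cinder.conf backend stanza for a tenant (illustrative only)."""
    if mode not in ("async", "sync"):
        raise ValueError(f"unknown replication mode: {mode}")
    section = f"fa_tenant_{tenant}_{mode}"
    # Async backends name a Protection Group; sync backends name a Pod.
    target_option = (
        "pure_replication_pg_name" if mode == "async" else "pure_replication_pod_name"
    )
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser[section] = {
        "volume_backend_name": section,
        "volume_driver": "cinder.volume.drivers.pure.PureISCSIDriver",
        target_option: target,
        "replication_device": f"backend_id:{backend_id},type:{mode}",
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_backend("finance", "async", "finance_protection_group"))
print(render_backend("trading", "sync", "trading_critical_pod"))
```

Generating the stanzas from a single tenant list keeps the backend name, the section name, and the replication target from drifting apart as tenants are added.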

Automated Volume Type Assignment:

  • Each tenant receives default replicated volume types (async OR sync based on requirements)
  • Only VMs with volumes entirely of the replicated type are eligible for protection
  • Prevents mixed-storage scenarios that complicate DR operations
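
The eligibility rule above can be expressed as a small check. This is a sketch only: the `Volume` class and `is_eligible` function are hypothetical names, and treating a VM with no Cinder volumes as ineligible is an assumption of this example, not something Trilio has stated.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Volume:
    name: str
    volume_type: str


def is_eligible(volumes: Iterable[Volume], replicated_type: str) -> bool:
    """A VM qualifies only if every attached volume uses the replicated type."""
    attached = list(volumes)
    # Assumption: a VM with no Cinder volumes has nothing to replicate.
    if not attached:
        return False
    return all(v.volume_type == replicated_type for v in attached)


db_volumes = [Volume("db-root", "finance-async"), Volume("db-data", "finance-async")]
web_volumes = [Volume("web-root", "finance-async"), Volume("web-scratch", "standard")]

print(is_eligible(db_volumes, "finance-async"))   # True
print(is_eligible(web_volumes, "finance-async"))  # False: one volume is not replicated
```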

Trilio’s Orchestrated Workflow

Asynchronous Replication Workflow:

  1. VM Quiescing: Trilio automatically quiesces VM I/O operations at the primary site (optional for async)
  2. Pure Storage Coordination:
    • Identifies latest Protection Group snapshots
    • Clones remote snapshots to read-write volumes on target array
    • Coordinates individual volume promotion across the tenant’s Protection Group
  3. Cinder Integration: Uses cinder manage commands to import promoted volumes into target cloud
  4. VM Reconstruction: Rebuilds VMs using replicated metadata including flavors, networks, and security groups
  5. VM Thawing: Resumes operations (for planned failover scenarios)
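
The ordering of these steps matters: a later step should not run if an earlier one fails. The sketch below illustrates that idea with a generic step runner. The step names mirror the list above, while `run_failover` and the `noop` placeholder are hypothetical and stand in for whatever Trilio's workflow engine actually does.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order; halt at the first failure and report what completed."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"failover halted at '{name}'; completed steps: {completed}"
            ) from exc
        completed.append(name)
    return completed


def noop() -> None:
    """Placeholder for real storage, Cinder, and Nova calls."""


async_steps: List[Step] = [
    ("quiesce_vms", noop),
    ("promote_pg_snapshots", noop),
    ("manage_volumes_into_cinder", noop),
    ("rebuild_vms", noop),
    ("thaw_vms", noop),
]

print(run_failover(async_steps))
```

Reporting the completed steps in the error makes it clear how far a failed run progressed, which is the information an operator needs before retrying or rolling back.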

Recovery Characteristics:

  • RPO: Based on Protection Group replication interval (15 minutes default, configurable)
  • RTO: 10-15 minutes including volume promotion and VM reconstruction
  • Source Impact: Minimal – source VMs can continue running during async DR

Synchronous Replication Workflow:

  1. VM Quiescing: Mandatory I/O quiescing to ensure data consistency for sync replication
  2. Pod Management:
    • Validates stretched Pod status and zero replication lag
    • Clones volumes within Pure Storage Pods to read-write volumes
    • Manages Pod-to-container relationships on target array
  3. Volume Assignment: Assigns volumes to non-replicated volume types using Cinder manage
  4. VM Restoration: Restores VM state using replicated metadata with zero data loss guarantee
  5. Operations Resume: Thaws VMs at primary site (planned scenarios only)
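
The Pod validation in step 2 works as a gate: cloning should proceed only when the Pod is stretched and fully in sync. A minimal sketch of such a gate follows. The `PodStatus` fields are hypothetical and would have to be populated from the array's own management API, which is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PodStatus:
    name: str
    stretched: bool   # Pod is present on both arrays
    lag_bytes: int    # data not yet replicated; 0 when fully in sync


def sync_failover_blockers(pod: PodStatus) -> List[str]:
    """Return the reasons a sync failover should not proceed (empty means go)."""
    blockers: List[str] = []
    if not pod.stretched:
        blockers.append(f"pod {pod.name} is not stretched to the target array")
    if pod.lag_bytes != 0:
        blockers.append(f"pod {pod.name} has {pod.lag_bytes} bytes of replication lag")
    return blockers


healthy = PodStatus("trading_critical_pod", stretched=True, lag_bytes=0)
degraded = PodStatus("banking_critical_pod", stretched=False, lag_bytes=4096)

print(sync_failover_blockers(healthy))
print(sync_failover_blockers(degraded))
```

Returning a list of reasons instead of a single boolean lets the orchestration layer show a tenant exactly why a failover was refused.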

Recovery Characteristics:

  • RPO: Zero data loss (synchronous replication guarantee)
  • RTO: 5-10 minutes including quiescing and VM restoration
  • Source Impact: Requires VM quiescing for data consistency

Self-Service Tenant Experience

Failover Group Management: Tenants create and manage “Failover Groups” – logical collections of VMs that fail over together:

# Tenant CLI experience
trilio failover-group create --name "Finance-Production" --description "Critical finance applications"
trilio failover-group add-vm --group "Finance-Production" --vm "finance-db-01"
trilio failover-group add-vm --group "Finance-Production" --vm "finance-app-01"
trilio failover-group show "Finance-Production"
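
Conceptually, a failover group is just a named collection of VMs. The sketch below models that idea in Python to make the semantics concrete. The `FailoverGroup` class is a hypothetical illustration and does not represent Trilio's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailoverGroup:
    name: str
    description: str = ""
    vms: List[str] = field(default_factory=list)

    def add_vm(self, vm: str) -> None:
        """Add a VM once; duplicates are rejected so membership stays unambiguous."""
        if vm in self.vms:
            raise ValueError(f"{vm} is already in group {self.name}")
        self.vms.append(vm)


group = FailoverGroup("Finance-Production", "Critical finance applications")
group.add_vm("finance-db-01")
group.add_vm("finance-app-01")
print(group)
```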

Horizon Dashboard Integration:

  • Protection Status: Visual dashboards showing replication health per VM
  • Group Management: GUI for creating and managing failover groups
  • DR Testing: One-click DR testing without production impact
  • Replication Monitoring: Real-time visibility into replication lag and status

Advanced Enterprise Features

DR Testing Without Production Impact: One of Trilio’s key innovations will be the ability to test DR workflows without affecting production systems:

  • Creates isolated test environments using replicated data
  • Validates application functionality in DR site
  • Provides detailed test reports and validation metrics
  • Automatic cleanup of test resources
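
Automatic cleanup is the part of DR testing most easily forgotten when a test fails halfway. One common way to guarantee it is a context manager, sketched below. The `isolated_dr_test` function and the log entries are hypothetical placeholders; the real work of cloning replicated volumes and building an isolated network is not shown.

```python
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def isolated_dr_test(group: str, log: List[str]) -> Iterator[str]:
    """Create a test environment and always clean it up afterwards."""
    env = f"drtest-{group}"
    log.append(f"create {env}")  # placeholder: clone replicated data into isolation
    try:
        yield env
    finally:
        log.append(f"cleanup {env}")  # runs even if validation raises


events: List[str] = []
with isolated_dr_test("Finance-Production", events) as env:
    events.append(f"validate {env}")  # placeholder: application health checks

print(events)
```

Because cleanup sits in the `finally` clause, a failed validation still removes the test resources before the error propagates.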

Multiple Replication Types: Trilio will support mixed protection strategies within a single tenant or across multiple tenants:

# Tenant with critical databases (sync) and application servers (async)
trilio failover-group create --name "Finance-Critical" --replication-type sync
trilio failover-group create --name "Finance-Standard" --replication-type async

# Add VMs based on criticality
trilio failover-group add-vm --group "Finance-Critical" --vm "finance-db-01"  # Sync replication
trilio failover-group add-vm --group "Finance-Standard" --vm "finance-web-01" # Async replication

Reprotection After Failover: Post-failover reprotection ensures continued DR capability for both replication types:

Async Reprotection:

# Option 1: New async volume type assignment
trilio reprotect --group "Finance-Standard" --method volume-type --target async

# Option 2: Cinder retype operation for async
trilio reprotect --group "Finance-Standard" --method retype --target async

Sync Reprotection:

# Option 1: New sync volume type assignment
trilio reprotect --group "Finance-Critical" --method volume-type --target sync

# Option 2: Cinder retype operation for sync (pending patch support)
trilio reprotect --group "Finance-Critical" --method retype --target sync
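
At the Cinder level, a retype-based reprotect plausibly comes down to retyping each failed-over volume back to a replicated volume type. The sketch below only builds the equivalent standard OpenStack client commands and does not run them. The volume IDs and the type name are placeholders, `retype_commands` is a hypothetical helper, and whether Trilio issues these exact calls is an assumption.

```python
from typing import List


def retype_commands(volume_ids: List[str], target_type: str) -> List[List[str]]:
    """Build one 'openstack volume set' retype command per volume (not executed)."""
    return [
        [
            "openstack", "volume", "set",
            "--type", target_type,
            # Assumption: allow migration if the retype cannot happen in place.
            "--retype-policy", "on-demand",
            volume_id,
        ]
        for volume_id in volume_ids
    ]


for command in retype_commands(["vol-0001", "vol-0002"], "finance-async"):
    print(" ".join(command))
```

Building the argument lists separately from executing them makes it easy to review the plan before any volume is touched.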

Beyond Disaster Recovery: Additional Use Cases

Cloud Upgrades and Migrations:

  • Seamlessly migrate workloads between OpenStack versions
  • Zero-downtime region migrations
  • Planned maintenance failovers

Compliance and Data Locality:

  • Automated failover based on regulatory requirements
  • Geographic data sovereignty management
  • Audit trail for compliance reporting

Development and Testing:

  • Replicate production environments for testing
  • Validate patches and upgrades safely
  • Security response testing without production impact

Infrastructure as Code Integration:

  • CI/CD pipeline integration for automated DR testing
  • API-driven DR validation
  • Automated compliance verification
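
In a CI/CD pipeline, automated DR testing usually ends in a pass or fail decision. The sketch below shows one way to turn DR test results into a process exit code. The `DrTestResult` fields and the 900-second threshold (the upper end of the async RTO quoted earlier) are assumptions for illustration, not output of any Trilio API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrTestResult:
    group: str
    recovered: bool
    rto_seconds: float


def ci_exit_code(results: List[DrTestResult], max_rto_seconds: float) -> int:
    """Return 0 if every group recovered within the RTO target, otherwise 1."""
    failures = [
        r for r in results
        if not r.recovered or r.rto_seconds > max_rto_seconds
    ]
    for r in failures:
        print(f"FAIL {r.group}: recovered={r.recovered} rto={r.rto_seconds:.0f}s")
    return 1 if failures else 0


results = [
    DrTestResult("Finance-Critical", recovered=True, rto_seconds=420),
    DrTestResult("Finance-Standard", recovered=True, rto_seconds=1020),
]
exit_code = ci_exit_code(results, max_rto_seconds=900)
print(exit_code)
```

A pipeline stage would pass this value to `sys.exit` so that a missed RTO target fails the build.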

Implementation Requirements

Minimum System Requirements:

  • At least two Pure Storage FlashArrays
  • Trilio DR Service deployed per region/cluster

Operational Considerations:

  • Volume Type Strategy: VMs must use replicated volume types exclusively
  • Encryption Limitations: Encrypted volumes not supported (no Barbican integration)
  • Pod Limitations: Synchronous replication tenant count limited by FlashArray model
  • Failover Granularity: Initial MVP operates at volume type level

Comparing Approaches: OpenStack-DR vs. Trilio

| Aspect | OpenStack-DR Project | Trilio DRaaS |
| --- | --- | --- |
| Target Audience | Infrastructure teams, custom automation | Enterprise tenants, self-service |
| Orchestration | Manual Ansible playbooks | Automated workflow engine |
| User Interface | CLI/API only | Horizon dashboard + CLI/API |
| Metadata Management | Manual configuration | Automated metadata replication |
| Async Replication Support | Full support via Protection Groups | Full support with orchestration |
| Sync Replication Support | Full support via ActiveCluster Pods | Full support with orchestration |
| Replication Type Mixing | Manual configuration per tenant | Automated mixed-type support |
| Testing Framework | Custom scripting required | Built-in DR testing (both sync/async) |
| Tenant Experience | Admin-managed only | Self-service tenant management |
| VM Quiescing | Manual process in playbooks | Automated (optional for async, required for sync) |
| Enterprise Features | Basic DR functionality | Advanced orchestration, reporting, compliance |

Operational Excellence with Trilio

Automated Monitoring and Alerting:

  • Real-time replication health monitoring
  • Automated alert generation for protection failures
  • SLA compliance tracking and reporting
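
Replication health and SLA tracking both reduce to comparing observed replication lag against the RPO target. The sketch below illustrates that comparison. The function names, the "twice the RPO" critical threshold, and the sample values are all assumptions chosen for the example.

```python
from typing import List


def classify_lag(lag_seconds: float, rpo_seconds: float) -> str:
    """Map replication lag to a health state relative to the RPO target."""
    if lag_seconds <= rpo_seconds:
        return "ok"
    if lag_seconds <= 2 * rpo_seconds:  # assumed threshold for escalation
        return "warning"
    return "critical"


def sla_compliance(lag_samples: List[float], rpo_seconds: float) -> float:
    """Percentage of lag samples that met the RPO target."""
    if not lag_samples:
        raise ValueError("no lag samples to evaluate")
    within = sum(1 for lag in lag_samples if lag <= rpo_seconds)
    return 100.0 * within / len(lag_samples)


samples = [300.0, 600.0, 900.0, 1200.0]  # seconds of lag, illustrative values
print(classify_lag(1200.0, rpo_seconds=900.0))    # warning
print(sla_compliance(samples, rpo_seconds=900.0)) # 75.0
```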

Policy-Based Management:

  • Template-based DR policies
  • Automated enforcement of protection requirements
  • Tag-based resource management

Comprehensive Reporting:

  • DR readiness assessments
  • Recovery time tracking and optimization
  • Compliance audit reports

Future Roadmap and Evolution

Planned Enhancements:

  • Per-VM Failover: Granular control beyond volume type level
  • Automated Failback: Orchestrated return to primary site
  • Enhanced Horizon UX: Improved tenant self-service experience
  • SLA-Driven Scheduling: Policy-based DR execution timing

Integration Opportunities:

  • Kubernetes Integration: Container workload protection
  • Multi-Cloud Support: Protection across different cloud platforms
  • AI/ML Optimization: Intelligent DR timing and resource allocation

In my final post of this series, I provide a comprehensive synthesis of all approaches and a strategic implementation guide to help you choose the right combination of technologies and methodologies for your organization’s needs.
