This is part 6 of the series on OpenStack disaster recovery. Read Part 5: Operational Excellence and Future Considerations
Beyond Manual Processes: Enterprise DR Orchestration
While the OpenStack-DR project (Aegis) provides surgical precision for disaster recovery operations, enterprise environments often require higher-level orchestration, automated workflows, and self-service capabilities for tenants. This is where Trilio’s upcoming Disaster Recovery as a Service (DRaaS) solution complements the technical foundation we’ve explored, providing enterprise-grade orchestration on top of Pure Storage’s replication capabilities.
Note that we are discussing an in-development project from Trilio, so some concepts and commands may change as development continues.
The Enterprise DR Gap
Organizations migrating from VMware to OpenStack frequently discover that disaster recovery represents a significant capability gap. VMware’s Site Recovery Manager (SRM) provided comprehensive, tenant-driven DR workflows that many OpenStack environments struggle to replicate. The challenge isn’t just technical—it’s operational:
- Self-Service Requirements: Tenants need the ability to manage their own DR policies without requiring infrastructure team intervention
- Workflow Orchestration: DR operations involve complex sequences of storage, compute, and network configuration changes
- Metadata Management: Successful DR requires preserving and replicating VM configurations, network mappings, and security policies
- Testing and Validation: Regular DR testing without production impact is essential but operationally complex
Trilio’s DRaaS Architecture
Trilio plans to address these challenges by providing an orchestration layer that leverages the Pure Storage and OpenStack integration we’ve discussed while adding enterprise workflow capabilities.
Core Architecture Components:
- Trilio DR Service: Installed per OpenStack region/cluster, manages metadata replication and workflow orchestration
- Pure Storage Integration: Leverages FlashArray Protection Groups and Pods for storage replication
- OpenStack Native APIs: Uses Cinder replication APIs and volume management capabilities
- Tenant Self-Service: Horizon dashboard panels and CLI/API interfaces for tenant-driven operations
Trilio’s Approach to Tenant Isolation
Trilio will build upon the multi-backend strategy we explored in the OpenStack-DR project (Aegis), adding sophisticated orchestration for both replication types:
Asynchronous Replication Per-Tenant Configuration:
# Trilio leverages multi-backend approach for async replication
[fa_tenant_finance_async]
volume_backend_name = fa_tenant_finance_async
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pg_name = finance_protection_group
replication_device = backend_id:fa-dr,type:async
[fa_tenant_ecommerce_async]
volume_backend_name = fa_tenant_ecommerce_async
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pg_name = ecommerce_protection_group
replication_device = backend_id:fa-dr,type:async
Synchronous Replication Per-Tenant Configuration:
# Trilio configuration for sync replication using Pods
[fa_tenant_trading_sync]
volume_backend_name = fa_tenant_trading_sync
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pod_name = trading_critical_pod
replication_device = backend_id:fa-dr,type:sync
[fa_tenant_banking_sync]
volume_backend_name = fa_tenant_banking_sync
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
pure_replication_pod_name = banking_critical_pod
replication_device = backend_id:fa-dr,type:sync
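Because every tenant stanza follows the same pattern, the backend sections can be generated rather than written by hand. The following is a minimal sketch, not part of Trilio or the Pure Storage driver: the function name `build_backend_sections` is hypothetical, the section and group naming simply mirrors the examples above, and connection settings (array address, API token) are left out just as they are in those examples.

```python
import configparser
import io

PURE_DRIVER = "cinder.volume.drivers.pure.PureISCSIDriver"


def build_backend_sections(tenants):
    """Return cinder.conf text with one Pure backend stanza per tenant.

    `tenants` maps a tenant name to "async" or "sync". Naming follows the
    pattern of the hand-written stanzas above (an assumption, not a rule).
    """
    config = configparser.ConfigParser()
    for tenant, mode in tenants.items():
        if mode not in ("async", "sync"):
            raise ValueError(f"unknown replication type: {mode}")
        section = f"fa_tenant_{tenant}_{mode}"
        config[section] = {
            "volume_backend_name": section,
            "volume_driver": PURE_DRIVER,
            "replication_device": f"backend_id:fa-dr,type:{mode}",
        }
        if mode == "async":
            # Async replication is grouped by Protection Group.
            config[section]["pure_replication_pg_name"] = f"{tenant}_protection_group"
        else:
            # Sync replication is grouped by Pod.
            config[section]["pure_replication_pod_name"] = f"{tenant}_critical_pod"
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(build_backend_sections({"finance": "async", "trading": "sync"}))
```

Generating the stanzas from a single tenant-to-replication-type mapping keeps naming consistent across tenants, which matters later when volume types are matched to backends.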
Automated Volume Type Assignment:
- Each tenant receives default replicated volume types (async OR sync based on requirements)
- Only VMs with volumes entirely of the replicated type are eligible for protection
- Prevents mixed-storage scenarios that complicate DR operations
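The eligibility rule above amounts to a simple "all volumes must match" check. Here is a small sketch of that rule; the function name and the volume type names are illustrative assumptions, not Trilio identifiers.

```python
def is_eligible_for_protection(vm_volume_types, replicated_types):
    """Return True only if the VM has volumes and every one of them uses
    a replicated volume type. A single non-replicated volume disqualifies
    the VM, which is what prevents mixed-storage scenarios."""
    return bool(vm_volume_types) and all(
        volume_type in replicated_types for volume_type in vm_volume_types
    )


# Hypothetical volume type names for one tenant.
replicated = {"finance-async", "finance-sync"}

print(is_eligible_for_protection(["finance-async", "finance-async"], replicated))
print(is_eligible_for_protection(["finance-async", "local-ssd"], replicated))
```

The first VM qualifies because both of its volumes are replicated; the second does not, because one volume lives on a non-replicated type.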
Trilio’s Orchestrated Workflow
Asynchronous Replication Workflow:
- VM Quiescing: Trilio automatically quiesces VM I/O operations at the primary site (optional for async)
- Pure Storage Coordination:
- Identifies latest Protection Group snapshots
- Clones remote snapshots to read-write volumes on target array
- Coordinates individual volume promotion across the tenant’s Protection Group
- Cinder Integration: Uses cinder manage commands to import promoted volumes into target cloud
- VM Reconstruction: Rebuilds VMs using replicated metadata including flavors, networks, and security groups
- VM Thawing: Resumes operations (for planned failover scenarios)
Recovery Characteristics:
- RPO: Based on Protection Group replication interval (15 minutes default, configurable)
- RTO: 10-15 minutes including volume promotion and VM reconstruction
- Source Impact: Minimal – source VMs can continue running during async DR
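To make the RPO figure concrete, consider how the "latest Protection Group snapshot" step determines the data-loss window. The sketch below is an illustration only: `PgSnapshot` and its fields are hypothetical stand-ins for whatever the array reports, not a Pure Storage or Trilio API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PgSnapshot:
    """Hypothetical record of one replicated Protection Group snapshot."""
    name: str
    created: datetime
    transfer_complete: bool


def latest_usable_snapshot(snapshots):
    """Pick the newest snapshot that has fully arrived on the target array."""
    usable = [s for s in snapshots if s.transfer_complete]
    if not usable:
        raise LookupError("no fully replicated snapshot available")
    return max(usable, key=lambda s: s.created)


def data_loss_window(snapshot, failure_time):
    """Writes made after the snapshot was taken are lost on failover."""
    return failure_time - snapshot.created


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
snapshots = [
    PgSnapshot("pg.1", start, True),
    PgSnapshot("pg.2", start + timedelta(minutes=15), True),
    PgSnapshot("pg.3", start + timedelta(minutes=30), False),  # still in flight
]
chosen = latest_usable_snapshot(snapshots)
print(chosen.name, data_loss_window(chosen, start + timedelta(minutes=37)))
```

In this example the newest snapshot has not finished transferring, so the effective data-loss window is longer than the replication interval itself. That is worth keeping in mind when reading the RPO figure above.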
Synchronous Replication Workflow:
- VM Quiescing: Mandatory I/O quiescing to ensure data consistency for sync replication
- Pod Management:
- Validates stretched Pod status and zero replication lag
- Clones volumes within Pure Storage Pods to read-write volumes
- Manages Pod-to-container relationships on target array
- Volume Assignment: Assigns volumes to non-replicated volume types using Cinder manage
- VM Restoration: Restores VM state using replicated metadata with zero data loss guarantee
- Operations Resume: Thaws VMs at primary site (planned scenarios only)
Recovery Characteristics:
- RPO: Zero data loss (synchronous replication guarantee)
- RTO: 5-10 minutes including quiescing and VM restoration
- Source Impact: Requires VM quiescing for data consistency
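The preconditions for a planned sync failover described above (a stretched Pod, zero replication lag, quiesced VMs) can be expressed as a pre-flight check. This is a sketch under stated assumptions: `PodStatus` and its fields are hypothetical and do not correspond to a real FlashArray or Trilio data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodStatus:
    """Hypothetical summary of a Pod's replication state."""
    name: str
    stretched: bool
    replication_lag_seconds: float


def sync_failover_blockers(pod, vms_quiesced):
    """Return the reasons a planned sync failover should not proceed.

    An empty list means all three preconditions from the workflow hold.
    """
    blockers = []
    if not pod.stretched:
        blockers.append(f"pod {pod.name} is not stretched across both arrays")
    if pod.replication_lag_seconds > 0:
        blockers.append(f"pod {pod.name} reports non-zero replication lag")
    if not vms_quiesced:
        blockers.append("VM I/O has not been quiesced")
    return blockers


healthy = PodStatus("trading_critical_pod", stretched=True, replication_lag_seconds=0.0)
print(sync_failover_blockers(healthy, vms_quiesced=True))
```

Returning a list of reasons rather than a single boolean makes it easier to show a tenant exactly why a failover was refused.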
Self-Service Tenant Experience
Failover Group Management: Tenants create and manage “Failover Groups” – logical collections of VMs that fail over together:
# Tenant CLI experience
trilio failover-group create --name "Finance-Production" --description "Critical finance applications"
trilio failover-group add-vm --group "Finance-Production" --vm "finance-db-01"
trilio failover-group add-vm --group "Finance-Production" --vm "finance-app-01"
trilio failover-group show "Finance-Production"
Horizon Dashboard Integration:
- Protection Status: Visual dashboards showing replication health per VM
- Group Management: GUI for creating and managing failover groups
- DR Testing: One-click DR testing without production impact
- Replication Monitoring: Real-time visibility into replication lag and status
Advanced Enterprise Features
DR Testing Without Production Impact: One of Trilio’s key innovations will be the ability to test DR workflows without affecting production systems:
- Creates isolated test environments using replicated data
- Validates application functionality in DR site
- Provides detailed test reports and validation metrics
- Automatic cleanup of test resources
Multiple Replication Types: Trilio is designed to support mixed protection strategies within a single tenant or across multiple tenants:
# Tenant with critical databases (sync) and application servers (async)
trilio failover-group create --name "Finance-Critical" --replication-type sync
trilio failover-group create --name "Finance-Standard" --replication-type async
# Add VMs based on criticality
trilio failover-group add-vm --group "Finance-Critical" --vm "finance-db-01" # Sync replication
trilio failover-group add-vm --group "Finance-Standard" --vm "finance-web-01" # Async replication
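One way to picture the grouping rule behind these commands is a failover group that only accepts VMs whose volumes match its replication type. The model below is purely illustrative: the `FailoverGroup` class and its behavior are assumptions made for this sketch, not Trilio's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverGroup:
    """Hypothetical model of a failover group tied to one replication type."""
    name: str
    replication_type: str  # "sync" or "async"
    vms: list = field(default_factory=list)

    def add_vm(self, vm_name, vm_replication_type):
        """Add a VM, rejecting it if its volumes use a different replication type."""
        if vm_replication_type != self.replication_type:
            raise ValueError(
                f"{vm_name} uses {vm_replication_type} replication, "
                f"but group {self.name} is {self.replication_type}"
            )
        if vm_name not in self.vms:  # adding the same VM twice is a no-op
            self.vms.append(vm_name)


critical = FailoverGroup("Finance-Critical", "sync")
standard = FailoverGroup("Finance-Standard", "async")
critical.add_vm("finance-db-01", "sync")
standard.add_vm("finance-web-01", "async")
print(critical.vms, standard.vms)
```

Keeping each group to a single replication type means every VM in the group shares the same RPO and the same failover workflow.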
Reprotection After Failover: Post-failover reprotection ensures continued DR capability for both replication types:
Async Reprotection:
# Option 1: New async volume type assignment
trilio reprotect --group "Finance-Standard" --method volume-type --target async
# Option 2: Cinder retype operation for async
trilio reprotect --group "Finance-Standard" --method retype --target async
Sync Reprotection:
# Option 1: New sync volume type assignment
trilio reprotect --group "Finance-Critical" --method volume-type --target sync
# Option 2: Cinder retype operation for sync (pending patch support)
trilio reprotect --group "Finance-Critical" --method retype --target sync
Beyond Disaster Recovery: Additional Use Cases
Cloud Upgrades and Migrations:
- Seamlessly migrate workloads between OpenStack versions
- Zero-downtime region migrations
- Planned maintenance failovers
Compliance and Data Locality:
- Automated failover based on regulatory requirements
- Geographic data sovereignty management
- Audit trail for compliance reporting
Development and Testing:
- Replicate production environments for testing
- Validate patches and upgrades safely
- Security response testing without production impact
Infrastructure as Code Integration:
- CI/CD pipeline integration for automated DR testing
- API-driven DR validation
- Automated compliance verification
Implementation Requirements
Minimum System Requirements:
- At least two Pure Storage FlashArrays
- Trilio DR Service deployed per region/cluster
Operational Considerations:
- Volume Type Strategy: VMs must use replicated volume types exclusively
- Encryption Limitations: Encrypted volumes not supported (no Barbican integration)
- Pod Limitations: Synchronous replication tenant count limited by FlashArray model
- Failover Granularity: Initial MVP operates at volume type level
Comparing Approaches: OpenStack-DR vs. Trilio
| Aspect | OpenStack-DR Project | Trilio DRaaS |
| --- | --- | --- |
| Target Audience | Infrastructure teams, custom automation | Enterprise tenants, self-service |
| Orchestration | Manual Ansible playbooks | Automated workflow engine |
| User Interface | CLI/API only | Horizon dashboard + CLI/API |
| Metadata Management | Manual configuration | Automated metadata replication |
| Async Replication Support | Full support via Protection Groups | Full support with orchestration |
| Sync Replication Support | Full support via ActiveCluster Pods | Full support with orchestration |
| Replication Type Mixing | Manual configuration per tenant | Automated mixed-type support |
| Testing Framework | Custom scripting required | Built-in DR testing (both sync/async) |
| Tenant Experience | Admin-managed only | Self-service tenant management |
| VM Quiescing | Manual process in playbooks | Automated (optional for async, required for sync) |
| Enterprise Features | Basic DR functionality | Advanced orchestration, reporting, compliance |
Operational Excellence with Trilio
Automated Monitoring and Alerting:
- Real-time replication health monitoring
- Automated alert generation for protection failures
- SLA compliance tracking and reporting
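SLA compliance tracking comes down to comparing observed replication lag against each group's RPO target. Below is a minimal sketch of that comparison; the function name, the group names, and the minute-based units are assumptions chosen for illustration.

```python
def rpo_violations(observed_lag_minutes, rpo_targets_minutes):
    """Return the names of groups whose replication lag exceeds their RPO target.

    Groups without a configured target are skipped rather than guessed at.
    """
    return sorted(
        group
        for group, lag in observed_lag_minutes.items()
        if group in rpo_targets_minutes and lag > rpo_targets_minutes[group]
    )


observed = {"Finance-Standard": 22, "Finance-Critical": 0}
targets = {"Finance-Standard": 15, "Finance-Critical": 0}
print(rpo_violations(observed, targets))
```

A check like this could feed the alerting described above: any group returned is out of compliance and should raise an alert.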
Policy-Based Management:
- Template-based DR policies
- Automated enforcement of protection requirements
- Tag-based resource management
Comprehensive Reporting:
- DR readiness assessments
- Recovery time tracking and optimization
- Compliance audit reports
Future Roadmap and Evolution
Planned Enhancements:
- Per-VM Failover: Granular control beyond volume type level
- Automated Failback: Orchestrated return to primary site
- Enhanced Horizon UX: Improved tenant self-service experience
- SLA-Driven Scheduling: Policy-based DR execution timing
Integration Opportunities:
- Kubernetes Integration: Container workload protection
- Multi-Cloud Support: Protection across different cloud platforms
- AI/ML Optimization: Intelligent DR timing and resource allocation
In the final post of this series, I provide a comprehensive synthesis of all approaches, along with a strategic implementation guide to help you choose the right combination of technologies and methodologies for your organization's needs.