How to Configure Storage Failback, Freeze Operations & Volume Management
When your primary storage backend goes down at 3 AM, you don’t want to be fumbling through documentation. OpenStack Cinder’s failover capabilities can be your lifeline, but only if you understand how to use them properly. Today, we’re diving deep into the world of Cinder failover, failback, and freeze operations—complete with real-world examples and a critical exception that could save your data.
The Big Picture: What Are We Talking About?
If you’re running OpenStack with replication-capable storage backends, Cinder gives you four powerful tools to manage disaster recovery and maintenance scenarios:
- Failover: Switch from your primary to your secondary backend when things go wrong
- Failback: Return operations to your original primary backend once it’s healthy
- Freeze/Thaw: Lock down management operations while keeping your data flowing
- Strategic volume creation: Know when it’s safe (and when it’s not) to create new volumes during failover
Think of it as having a well-rehearsed emergency plan for your storage infrastructure.
The Command Arsenal: Your Failover Toolkit
Let’s get practical. Here are the commands you need to master:
Failover Operations
# Switch to secondary backend
openstack volume service failover --backend <host@backend> [--secondary-backend-id <secondary_id>]
# Return to primary backend
openstack volume service failover --backend <host@backend> --secondary-backend-id default
Freeze/Thaw Operations
# Lock down management operations
openstack volume service freeze --host <host@backend>
# Resume full operations
openstack volume service thaw --host <host@backend>
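All four commands need the `host@backend` name of a cinder-volume service. A quick way to find those names is to parse the service list; the snippet below works on sample text in the format of `openstack volume service list -f value -c Binary -c Host` (the host names are illustrative; pipe real CLI output in your environment):

```shell
# Sample output in the shape of:
#   openstack volume service list -f value -c Binary -c Host
# The host names below are placeholders.
sample='cinder-scheduler controller
cinder-volume controller@rbd-primary
cinder-volume controller@pure-secondary'

# Keep only cinder-volume rows and print their host@backend names.
printf '%s\n' "$sample" | awk '$1 == "cinder-volume" { print $2 }'
```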
What Actually Happens During Failover?
Understanding the mechanics is crucial for successful operations. During a failover:
Your replicated volumes: These become available from the secondary backend. Your applications should be able to continue working with existing data, though there are important connectivity considerations covered in the next section.
Your non-replicated volumes: These might become unavailable until the primary backend returns. This is why replication planning is so important.
Volumes that struggle: If something goes wrong during the transition, Cinder marks these volumes with status = error and replication_status = FAILOVER_ERROR. You’ll need to investigate these individually.
During failback, the reverse happens—but only if your driver supports it. If not, you’ll get an InvalidReplicationTarget error, and you’ll need to plan alternative recovery strategies.
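Volumes that struggled during the transition are the ones to hunt down first. A minimal sketch, assuming the two-column shape of `openstack volume list -f value -c ID -c Status` (the sample rows are illustrative, not real output):

```shell
# Sample rows in the shape of:
#   openstack volume list -f value -c ID -c Status
volumes='vol-aaa available
vol-bbb error
vol-ccc in-use'

# Print the IDs of volumes marked "error" so they can be investigated.
printf '%s\n' "$volumes" | awk '$2 == "error" { print $1 }'
```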
Volume Connectivity After Failover: What You Need to Know
Here’s a critical detail that can impact your application availability during failover operations:
Asynchronous replication: Volumes will need to be reconnected to their Nova instances during failover. This means there will be a connectivity interruption that requires intervention to restore VM access to storage.
Synchronous replication: Reconnection may also be required depending on your specific storage backend implementation.
Pure Storage exception: If you’re using Pure Storage with uniform=True (which is the default setting), no reconnection is required. Your applications can continue working seamlessly without any connectivity interruption during failover operations.
This connectivity behavior is an important factor when planning maintenance windows and estimating downtime during disaster recovery scenarios.
The Freeze/Thaw Safety Net
Here’s where things get interesting. When you freeze a backend, you’re essentially putting up a “do not disturb” sign for management operations while keeping the data flowing:
What gets blocked: creating volumes, deleting volumes, taking snapshots, and other management operations
What keeps working: I/O operations plus volume attachments and detachments, the stuff your applications actually need
This is incredibly useful during maintenance windows or when you want to prevent accidental changes during a failover scenario.
⚠️ The Golden Rule (With One Important Exception)
Here’s the rule that will save you from data consistency nightmares: Don’t create new volumes on the secondary backend during failover.
Why? With asynchronous replication, any writes happening during failover might not properly sync back to your primary. After failback, those volumes you created could become unavailable or, worse, contain inconsistent data.
But Wait—There’s an Exception!
If you’re using Pure Storage with synchronous replication (specifically ActiveCluster or tri-sync configurations), you can safely create new volumes during failover. Pure Storage’s documentation explicitly confirms that these volumes will remain available after failback.
The key configuration parameters to look for:
- pure_replication_pod_name
- pure_trisync_pg_name
- uniform=True (the default setting that enables failover without reconnection)
This works because synchronous replication ensures that all writes are immediately replicated to both locations, maintaining consistency even during failover scenarios. The uniform=True setting provides the additional benefit of eliminating the need for volume reconnection during failover.
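As an illustration only, a synchronously replicated Pure Storage backend in cinder.conf might look roughly like the sketch below. The option names follow the Pure Storage Cinder driver, but treat the exact keys and every value as placeholders to verify against the driver documentation for your release:

```ini
# Illustrative sketch, not a verified configuration.
[puredriver-1]
volume_backend_name = puredriver-1
volume_driver = cinder.volume.drivers.pure.PureISCSIDriver
san_ip = <primary_array_mgmt_ip>
pure_api_token = <primary_api_token>
# Synchronous (ActiveCluster) replication target; "uniform:true" is what
# avoids volume reconnection on failover.
replication_device = backend_id:<secondary_array>,san_ip:<secondary_mgmt_ip>,api_token:<secondary_api_token>,type:sync,uniform:true
pure_replication_pod_name = <pod_name>
# For tri-sync setups, additionally set:
# pure_trisync_pg_name = <protection_group_name>
```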
Your Step-by-Step Failover Playbook
Here’s a battle-tested workflow you can adapt for your environment:
Pre-Failover Preparation
- Verify your replication setup: Confirm whether you’re using synchronous or asynchronous replication
- Document your backend IDs: Know your primary and secondary backend identifiers
- Test driver capabilities: Ensure your backend driver supports failover, failback, and freeze operations
The Failover Process
- Optional but recommended—freeze the primary:
openstack volume service freeze --host <host@primary_backend>
- Execute the failover:
openstack volume service failover --backend <host@primary_backend> [--secondary-backend-id <secondary_id>]
- During failover operations:
- Let existing volumes operate normally (if your replication supports it)
- Be aware of volume connectivity requirements (see Volume Connectivity section above)
- Avoid creating new volumes unless you’re using Pure Storage with synchronous replication
- Consider keeping the backend frozen to prevent accidental changes
- Once failover stabilizes:
openstack volume service thaw --host <host@secondary_backend>
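The steps above can be collected into one script. This is a sketch with placeholder backend names; it defaults to a dry run that prints the commands rather than executing them, so set DRY_RUN=0 only once you have adapted it to your environment:

```shell
#!/bin/sh
# Dry-run sketch of the failover playbook. PRIMARY/SECONDARY are
# placeholders; DRY_RUN defaults to 1 (print, don't execute).
PRIMARY="${PRIMARY:-myhost@primary_backend}"
SECONDARY="${SECONDARY:-myhost@secondary_backend}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Freeze the primary so no management operations race the failover.
run openstack volume service freeze --host "$PRIMARY"
# 2. Fail over to the configured secondary backend.
run openstack volume service failover --backend "$PRIMARY"
# 3. Once failover stabilizes, thaw the now-active secondary.
run openstack volume service thaw --host "$SECONDARY"
```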
The Return Journey: Failback
- Initiate failback:
openstack volume service failover --backend <host@primary_backend> --secondary-backend-id default
- Post-failback verification:
- Check volume statuses, especially any created during failover
- Look for volumes stuck in error or FAILOVER_ERROR states
- Verify data consistency and replication status
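The verification step can be sketched as a quick pass over a volume listing. The three-column layout assumed here (ID, status, replication status) and the sample rows are illustrative; substitute real CLI output in practice:

```shell
# Sample rows: <volume_id> <status> <replication_status>
report='vol-aaa available enabled
vol-bbb error FAILOVER_ERROR
vol-ccc in-use enabled'

# Flag any volume in an error or FAILOVER_ERROR state and print a summary.
printf '%s\n' "$report" | awk '
    $2 == "error" || $3 == "FAILOVER_ERROR" { bad++; print "NEEDS ATTENTION:", $1 }
    END { printf "checked %d volumes, %d need attention\n", NR, bad + 0 }'
```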
Driver Considerations: Know Your Backend
Not all storage drivers are created equal when it comes to failover capabilities:
Failback support: Your driver must explicitly support failback operations. If it doesn’t, you’ll get an InvalidReplicationTarget error when you try to return to the primary.
Freeze/thaw implementation: Some drivers might not implement these operations, defaulting to no-ops. Test these features in your environment before relying on them in production.
Pure Storage specific: If you’re using Pure Storage, verify your backend is configured for synchronous replication with the correct pod names and protection group parameters. The uniform=True setting (default) eliminates volume reconnection requirements during failover, providing truly seamless operations.
Wrapping Up: Your Failover Success Strategy
Mastering Cinder failover isn’t just about memorizing commands—it’s about understanding your storage architecture, testing your procedures, and knowing the specific capabilities and limitations of your backend drivers.
The key takeaways:
- Always test failover and failback procedures in a non-production environment first
- Understand the difference between synchronous and asynchronous replication in your setup
- Be aware of volume connectivity requirements during failover operations
- Use freeze operations to prevent accidental changes during maintenance
- Remember the volume creation rule (and its Pure Storage exception)
- Document your backend IDs and driver capabilities
When disaster strikes or maintenance windows arrive, having this knowledge at your fingertips will make the difference between a smooth operation and a long night of troubleshooting.
Have you implemented Cinder failover in your environment? Share your experiences and any additional tips in the comments below!
