Breaking the Data Migration Shackle
Tech Blog
Mar 8, 2016

NOTE: The zero-downtime storage-cutover feature mentioned in this article is included with DMS version 4.0 and is called “pMotion.”
We have many posts on our website describing data migration and our DMS appliance. Storage administrators love DMS because it can be inserted transparently, with no disruption and no changes to the environment. Their end-users love it because it minimizes impact during migration by yielding to production I/O. The entire process requires no downtime until cutover – the moment when the application hosts are moved from the old storage to the new. As far as we are aware, no product on the market can complete the entire migration – including cutover – without downtime.
Now some of our users are asking for a total zero-downtime migration capability in DMS – including the cutover to new storage – with no application stoppage at all. Many enterprise environments desperately need this ability, since they normally have very little scheduled downtime for system maintenance during an entire year.
Scheduling additional downtime for cutover is a huge deal for any data migration project, yet customers frequently have to switch over to the new storage as soon as possible. One typical scenario is when the old storage has reached the end of its lease; in that case, not having to wait for a maintenance window to cut over is economically significant.
Nevertheless, until now, the actual cutover to new storage has always required downtime. There is currently no true zero-downtime capability anywhere in the industry; if anyone claims to offer zero-downtime migration, just ask a few simple questions and the claim will crumble. To understand why, let's examine what is required to pull off such a feat.
We previously published two articles describing the concept of LUNs, the identity of LUNs, and the SCSI nature of all storage devices in today's operating systems. Those posts provide the technical background for this one, and for other forthcoming detailed technical discussions. Building on that baseline, let's examine the challenges involved in achieving true zero-downtime data migration.
In all current operating systems, a LUN is presented as a SCSI device. In fact, to provide redundancy, each LUN usually appears as multiple SCSI devices, one per path; a multipath driver aggregates them and presents them to the OS as a single multipath device.
The multipath driver uses the vital product data (VPD) in one of the SCSI inquiry pages to identify these devices as paths to one specific LUN. We loosely call that identifier the GUID, where "GU" stands for "Globally Unique": each LUN has its own. When data is migrated from one LUN to another, the destination LUN has – among many other differing parameters – its own, different GUID. Because of this alone, there is no way for the OS, and by extension the application, to continue its I/O without stopping, recognizing the new device, and reconfiguring before using the migrated LUNs.
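To make this concrete, here is a minimal Python sketch of what a multipath driver conceptually does: read the VPD page 0x83 identifier from every SCSI path device and group the paths by it. It is Linux-only and assumes the udev scsi_id utility is installed at the path shown; the device names are simply whatever your system exposes.

```python
#!/usr/bin/env python3
"""Sketch: group SCSI path devices by the LUN identifier ("GUID")
reported via VPD page 0x83 - essentially how a multipath driver
decides which paths belong to the same LUN."""

import os
import subprocess
from collections import defaultdict

SCSI_ID = "/lib/udev/scsi_id"  # location varies by distribution


def lun_guid(device):
    """Return the VPD page 0x83 identifier (WWID) for one path device."""
    try:
        result = subprocess.run(
            [SCSI_ID, "--whitelisted", "--device", device],
            capture_output=True, text=True, check=True)
        return result.stdout.strip() or None
    except (OSError, subprocess.CalledProcessError):
        return None


def group_paths_by_lun():
    """Map each LUN GUID to the /dev/sdX path devices that expose it."""
    luns = defaultdict(list)
    for name in sorted(os.listdir("/sys/block")):
        if not name.startswith("sd"):
            continue  # skip non-SCSI block devices
        guid = lun_guid(f"/dev/{name}")
        if guid:
            luns[guid].append(f"/dev/{name}")
    return luns


if __name__ == "__main__":
    for guid, paths in group_paths_by_lun().items():
        # Paths reporting the same GUID are aggregated into one
        # multipath device.
        print(guid, "->", ", ".join(paths))
```

Paths that report the same GUID are one LUN; after a conventional migration, the destination LUN shows up under a new GUID, which is exactly why the OS must stop and reconfigure.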
In theory, migration software developers such as Cirrus Data Solutions could work with the vendor of the new storage to allow the VPD information to be configured to impersonate the original storage's identity. However, identity is only the first strand of the intricate web of identity information and capabilities that must be matched, and until a standard is developed to allow such a transition, even this first hurdle cannot be cleared – and if it cannot be cleared, the rest is moot. That is why true zero-downtime migration directly from old storage to new storage has been impossible.
But is this truly an impossible dream?
This is where our CEO, Wayne, likes to push the envelope and force us to innovate "outside the box." If we step back and rethink the reasons for zero-downtime cutover, we realize that the real objective is to control the timing of cutover, so that it need not be synchronized with a rigid and short maintenance window. Suppose, for example, that after migration is completed, the migration appliance could sustain the I/O continuously using the new storage. The old storage could then be disconnected and removed. Wouldn't this achieve the objective?

With this inspiration, a design can be devised such that, upon completion of the data transfer, the migration appliance maintains the appearance of the old storage while directing all I/O to the new storage – LUN by LUN, path by path. As far as the application servers are concerned, nothing has changed (except, potentially, better performance). This configuration can be kept in place until the maintenance window opens and the servers can be reconfigured to access the new storage directly, under its native identity. The migration appliance is responsible for all of the impersonation and for sustaining the I/O. It is a simple and elegant solution!
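To make the idea concrete, here is a toy model in Python. It is purely illustrative – the real appliance operates at the SCSI protocol level, not in Python – but it shows the essential trick: identity queries keep being answered from the old LUN's saved VPD data, while all reads and writes are serviced by the new LUN.

```python
"""Conceptual sketch (not DMS's actual implementation) of
zero-downtime cutover: keep the old identity, redirect the I/O."""

from dataclasses import dataclass, field


@dataclass
class Lun:
    guid: str                              # VPD page 0x83 identifier
    blocks: dict = field(default_factory=dict)

    def inquiry_vpd83(self):
        return self.guid

    def read(self, lba):
        return self.blocks.get(lba, b"\x00" * 512)

    def write(self, lba, data):
        self.blocks[lba] = data


class CutoverProxy:
    """Presents the old LUN's identity; services I/O from the new LUN."""

    def __init__(self, old, new):
        self._old_guid = old.inquiry_vpd83()  # identity frozen at cutover
        self._backend = new                   # where the data now lives

    def inquiry_vpd83(self):
        # Hosts and multipath drivers still see the original GUID,
        # so no reconfiguration - and no downtime - is needed.
        return self._old_guid

    def read(self, lba):
        return self._backend.read(lba)

    def write(self, lba, data):
        self._backend.write(lba, data)


if __name__ == "__main__":
    old = Lun(guid="60050768-OLD")            # illustrative GUIDs
    new = Lun(guid="624a9370-NEW")
    new.blocks = dict(old.blocks)             # migration already completed
    proxy = CutoverProxy(old, new)
    proxy.write(0, b"hello")                  # lands on the new storage
    assert proxy.inquiry_vpd83() == "60050768-OLD"
    assert new.read(0) == b"hello"
    print("host still sees", proxy.inquiry_vpd83())
```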
Of course, this looks simple, but what is involved is technically sophisticated, requiring a deep understanding of the SCSI protocol, its command and information structures, multipath behavior, and the specifics of cluster operation. All of these aspects must be carefully, methodically, and precisely handled. Nevertheless, it is possible and practicable – we know, because Cirrus Data has already created a beta version of the new DMS that functions exactly in this manner.
So here is the end-to-end data migration operation, using DMS, the ultimate migration tool:
1. DMS appliances are transparently inserted into the storage system, where they automatically discover the entire SAN configuration, including all servers, LUNs, and paths.
2. The storage admin specifies which LUNs to migrate, the exact time slots in which to migrate them, and how much impact to production is allowed (see the sketch after this list).
3. For supported new storage, the admin provides the IP address and credentials so the destination LUNs can be automatically provisioned and matched to the selected source LUNs.
4. After migration has completed, the old storage can be removed, while production continues on the new storage without interruption.
5. During the next scheduled maintenance window, the DMS appliance automatically transfers the entire storage configuration to the new storage system, including all server entities, WWPNs, and LUN masking, so no manual operation is required from beginning to end.
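For illustration only, here is what the inputs from steps 2 and 3 might look like as a job definition. Every field name below is invented for this sketch and does not reflect DMS's actual configuration interface.

```python
# Hypothetical migration job definition - field names are invented.
migration_job = {
    "source_luns": [
        "36005076801810523c800000000000071",   # GUIDs of LUNs to migrate
        "36005076801810523c800000000000072",
    ],
    "schedule": [
        # Copy only inside these windows, yielding to production I/O.
        {"days": "Mon-Fri", "start": "22:00", "end": "06:00"},
        {"days": "Sat-Sun", "start": "00:00", "end": "24:00"},
    ],
    "max_production_impact_pct": 10,           # throttle for the copy
    "destination_array": {
        "address": "198.51.100.20",            # example (RFC 5737) address
        "user": "dms-provision",
        # Credentials let the appliance auto-provision matching LUNs.
    },
}
```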
Again, as far as we are aware, there is no other migration product in the world that even comes close.
Wai Lam