Server Snapshots and Service Interruptions

By Enabliser Albert

I write this in the hope that those who read these words will heed the advice and venture forth forearmed.

I write not of what I experienced.

I write of a young man, hailing from the island nation of Kee Wii, who, aided by the Oracle of the Repository of All Knowledge, Googelia, ventured forth to rid a land of a troublesome menace disrupting its peace and harmony.

Not long ago, in a kingdom (client) that shall remain nameless in the interests of confidentiality, a strange phenomenon was observed that brought much consternation to the King (business owner), and to the Trusted Advisors entrusted with maintaining the infrastructure (programs and services) that bridged two busy townships (applications) vital to the prosperity of the land (the business).

To the tune of bells ringing at the top and bottom of the hour, a sudden tremendous wave surged down the river and swept away the great bridge (integration) between the essential townships. The Advisors scrambled to repeatedly rebuild the bridge whenever the King vociferously complained from the castle balcony, “the system is still down”, or dispatched a message by carrier pigeon – err, e-mail – expressing his disappointment that the danger to his land’s wellbeing was hitherto unresolved.

This troublesome phenomenon persisted for much of the morning, fouling the demeanor of the afflicted townsfolk found stagnating at the mouths of the open bridge, until one intrepid soul, a young Advisor new to the ranks, stepped forth. With reckless bravery befitting the young, he declared he would seek out and vanquish the unseen menace wreaking havoc on the vital bridgeway between the two townships.

Furnished with his youthful stamina, mental prowess, and clad in armor woven of the fabric of knowledge and experience, he ventured forth with a resolute heart upon the arduous journey; his derriere firmly planted upon his trusty swiveling steed, the reins of his keyboard firmly in hand, and eyes keenly surveying the myriad landscape of numbers, names and terminology that lay before him. With perilous acronym pitfalls to be found behind every hill, and the subversive whisper of Forum Sylphs in the air, he approached the dark forest at the foot of the mountains that sheltered the mouth of the great river. Accompanied and aided by Googelia, who had pledged him her unwavering support, he entered the realm of ancient trees and journeyed into the shadows.

Passage through the forest was fraught with danger, yet with Googelia at his side, the young Advisor – a renowned CRM specialist – persevered until at last he stumbled upon a hidden waterfall feeding the mouth of the river, and observed at work the infernal mechanism that was repeatedly bringing down the bridge between the two townships of great importance to the kingdom. As he stood on the bank, a cold dread running like ice through his veins, he hefted his enchanted weapon of choice – the Telephone – and challenged the one responsible for instigating the process that had laid waste to the Advisor’s peaceful morning hours – the Sorcerer commissioned by the King to protect the vast collection of roads and bridges across his lands (the network).

Defeated by the words of the young Advisor, the Sorcerer had no choice but to bring down the mechanism he had implemented.

That mechanism was known as…the Server System Snapshot.

It is perhaps the strangest behavior I’ve seen in many years as part of the I.T. industry, that an application’s services would repeatedly stop and fail to restart after a server snapshot took place. However, the problem was regular, and it was found by the CRM specialist to be triggered inadvertently by the implementation of a server snapshot which was simply intended to add a measure of security to the system by creating a backup of the server that the client could resort to in the event of a server failure. It seemed innocent enough, but what no one knew or suspected was that this server snapshot would cause severe problems for the software that used message queues to integrate two distinct applications.
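Because the failures arrived with the bells at the top and bottom of the hour, one quick way to test a suspicion like this is to check how many service stops fall close to a :00 or :30 mark. What follows is only a minimal sketch: the function names, the two-minute tolerance and the sample timestamps are all hypothetical, and in practice the stop times would come from whatever event log the services write to.

```python
from datetime import datetime


def minutes_from_half_hour(ts: datetime) -> float:
    """Distance in minutes from ts to the nearest :00 or :30 mark."""
    offset = (ts.minute % 30) + ts.second / 60
    return min(offset, 30 - offset)


def fraction_near_schedule(stops, tolerance_minutes=2.0):
    """Share of stop events that fall within the tolerance of a :00 or :30 mark."""
    if not stops:
        return 0.0
    near = sum(
        1 for ts in stops if minutes_from_half_hour(ts) <= tolerance_minutes
    )
    return near / len(stops)


# Hypothetical stop times, invented for illustration only.
stops = [
    datetime(2015, 3, 2, 9, 0, 40),
    datetime(2015, 3, 2, 9, 30, 55),
    datetime(2015, 3, 2, 10, 1, 10),
    datetime(2015, 3, 2, 10, 47, 0),
]

print(fraction_near_schedule(stops))  # 0.75
```

A high fraction proves nothing on its own, but it is a strong hint to go looking for anything scheduled on the half hour, and to ask whoever manages the servers what that might be.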

This raises the question: how best to provide a reliable system backup that goes beyond replicating or backing up databases? There are fault-tolerant applications and hardware, and countless proponents of myriad configurations of the aforementioned hardware. However, while investigating these and other issues, we stumbled upon a site that claimed to debunk many SQL myths.

One of the myths being debunked was: “After a failover, any in-flight transactions are continued.”

Apparently this is not the case, and fault recovery systems don’t support picking up the pieces after recovery takes place. Once lost, it’s lost. However, the author proposed there was one system that could support the continued “uninterrupted” processing of in-flight transactions.

In his words:

“The only technology that allows unbroken connections to a database when a failure occurs, is using virtualization with a live migration feature, where the VM comes up and the connections don’t know they’re talking to a different physical host server of the VM.”

So what is VM live migration?

It can be simplified to mean the “live” copying of virtual machines (VMs in use) from one node to another within a clustered environment. That is, copying a running VM from one server to another.

Those of you familiar with this terminology and technology need read no further. However, those of you like me who wondered if it was possible to take a server and copy it while it’s actively in use, might want to look into VMware’s vMotion, which has been a part of vSphere (now in its sixth generation). It might seem surprising that a server can be copied in “real-time” and indeed there are limitations, but with solid-state drives slowly becoming more prevalent in high-end servers, the performance they deliver makes such virtualization with migration feasible to an extent previously not achieved. In fact, one of the requirements to make vMotion work within vSphere was that the roundtrip network latency had to be below five (5) milliseconds. It was possible to extend this requirement to ten (10) milliseconds if Metro vMotion was used in vSphere Enterprise Plus. The transfer of data was aided by vSphere supporting the use of multiple NICs at once.
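To make those two latency figures concrete, here is a small sketch that sorts a measured roundtrip time into the cases described above. It is not a VMware tool or API: the function name and the treatment of a reading of exactly ten milliseconds are my own assumptions, and the only facts it encodes are the five and ten millisecond limits quoted in this article.

```python
# Limits as quoted in this article; check current VMware documentation
# before relying on them.
STANDARD_LIMIT_MS = 5.0   # roundtrip latency must be below this for vMotion
METRO_LIMIT_MS = 10.0     # extended limit with Metro vMotion (Enterprise Plus)


def migration_option(rtt_ms: float) -> str:
    """Classify a measured roundtrip latency against the quoted limits.

    Hypothetical helper: treating exactly 10 ms as acceptable for
    Metro vMotion is an assumption, not something the article states.
    """
    if rtt_ms < 0:
        raise ValueError("latency cannot be negative")
    if rtt_ms < STANDARD_LIMIT_MS:
        return "standard vMotion"
    if rtt_ms <= METRO_LIMIT_MS:
        return "Metro vMotion"
    return "outside the quoted limits"


for sample in (3.2, 7.5, 14.0):  # made-up measurements in milliseconds
    print(sample, "->", migration_option(sample))
```

The measurements themselves would come from whatever tool you trust to time the link between the two hosts; the sketch only shows where the thresholds bite.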

With all the above said, we never confirmed how the Sorcerer implemented the server snapshot. It’s quite possible he was working in a virtualized environment. Why the snapshot caused the integration application’s various services to fail is still a mystery, and time and financial constraints shackle any attempts to investigate the matter further. However, it’s clear that when implementing server snapshots, careful consideration should be given to testing the mechanism for unwanted side effects. The issue my colleague experienced could have been resolved more easily had the other party deigned to inform him of the snapshot put in place. Lack of communication has cost many a victory in the past.

It certainly would have saved the valiant Advisor from a morning of trials and tribulations.

