Downtime not an option? Learn the basics of VMware's Fault Tolerance and what you will need to get up and running

By Tom McDonald | Mar 25, 2011 11:32:00 AM

Is a server crash not an option for your company? Is keeping your servers up and running the lifeblood of your business? Then you may want to consider VMware’s Fault Tolerance (FT) feature. Fault Tolerance is a step up from VMware High Availability (HA). High Availability is VMware’s safety net for a host crash: if a server running a VM goes down, the VM is restarted on a different host. This means only a minute or two of downtime while the virtual machine starts up on a new server and the failed host is rebooted, if possible. That alone is extremely useful and can keep a business functioning with only a moment of interruption. What Fault Tolerance adds is the elimination of even those couple of minutes of downtime, so that when a server crashes, users feel nothing. This feature gives companies that can’t stop functioning, even for a minute, the security they need to run their businesses.

How does FT work? With HA there is a primary host that runs the VM and a secondary host standing by in case of failure; if and when a failure occurs, the VM is restarted on the secondary host. The failure is detected by VMware’s heartbeat mechanism, which checks each host every second to ensure it is still active on the network; if a host stops responding, it is considered failed and its VMs are moved to another machine. FT builds on this, but instead of waiting for a host to fail and then restarting, it uses vLockstep to keep a primary and a secondary VM in sync, so that if one host fails the other keeps running without the user ever noticing the failure. Because the VM’s files sit on shared virtualized storage, they are accessible to both hosts, and the primary constantly updates the secondary to keep the two VMs’ memory (RAM) in sync. FT has a few rules to ensure it works properly (a rough prerequisite check is sketched after the list):

  • Hosts must be in an HA cluster
  • Primary and secondary VMs must run on different hosts
  • Anti-affinity must be enabled (a rule that ensures the primary and secondary VMs cannot run on the same host)
  • The VMs must be stored on shared storage
  • A minimum of two Gigabit NICs, to allow for vMotion and FT logging traffic
  • Additional NICs for VM and management network traffic
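
A rough way to picture these rules is as a pre-flight checklist. Below is a minimal Python sketch, not a call into the real vSphere API: the `Host` class, its fields, and the example hosts are assumptions made up for illustration, standing in for information you would normally pull from vCenter.

```python
# Minimal sketch of an FT prerequisite check. The data model is hypothetical;
# in practice this information would come from vCenter, not hand-built objects.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Host:
    name: str
    in_ha_cluster: bool
    gigabit_nics: int          # NICs available for vMotion + FT logging
    extra_nics: int            # NICs for VM and management traffic
    datastores: set = field(default_factory=set)

def check_ft_ready(primary: Host, secondary: Host, vm_datastore: str) -> List[str]:
    """Return the FT rules this host pair violates (empty list = looks OK)."""
    problems = []
    if not (primary.in_ha_cluster and secondary.in_ha_cluster):
        problems.append("Both hosts must be in an HA cluster")
    if primary.name == secondary.name:
        problems.append("Primary and secondary VMs must run on different hosts (anti-affinity)")
    if vm_datastore not in primary.datastores or vm_datastore not in secondary.datastores:
        problems.append("The VM must live on storage shared by both hosts")
    for h in (primary, secondary):
        if h.gigabit_nics < 2:
            problems.append(f"{h.name}: need at least two Gigabit NICs for vMotion and FT logging")
        if h.extra_nics < 1:
            problems.append(f"{h.name}: need additional NICs for VM and management traffic")
    return problems

# Example usage with made-up hosts: esx2 is short one NIC, so it gets flagged.
esx1 = Host("esx1", True, gigabit_nics=2, extra_nics=2, datastores={"san-lun1"})
esx2 = Host("esx2", True, gigabit_nics=1, extra_nics=2, datastores={"san-lun1"})
for issue in check_ft_ready(esx1, esx2, "san-lun1"):
    print("FT blocker:", issue)
```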
Read More >

Human Aspect of Disaster Recovery Part 2

By Tom McDonald | Mar 16, 2011 2:00:00 PM

If you missed it, check out Part 1 on setting up your DR plan.

Setting up your team

Read More >

Creating a Disaster Recovery Plan: How to Set Up the Right Team

By Tom McDonald | Mar 16, 2011 10:09:00 AM

NSI is a New England IT consulting company specializing in virtualization, disaster recovery, and managed print. That work has exposed us to an array of IT problems across Connecticut, New York, and Massachusetts, giving us a view not only of the problems inside different companies’ IT departments, but also a good understanding of where most companies fall short in IT policy and where the general flaws in their DR plans lie. We hope this will help people prepare for disaster from the human side of things.

When most people think of a network going down, they generally attribute the problem to a hardware failure; whether it is a server or a hard drive that dies, people blame the devices themselves. But 29% of all data loss is attributed to human error, whether from an IT professional who forgot to perform the correct backup or an office employee who accidentally deletes an important file; data loss is real and happens all too often. Unplanned downtime occurs whenever something serious and unexpected happens to your network, and its impact depends on when it hits and how bad it is: if a server crashes at 2 a.m. and your business operates 9-5, you are probably alright, but if things shut down at 10 a.m. during the holiday season, the company can suffer serious revenue loss. Juniper Networks reports that human error is the cause of 50-80% of all downtime. This, along with the 29% of data loss that is also human error, shows that a great deal of lost money and many headaches can be avoided by implementing policies that don’t just focus on hardware and software fixes, but also help keep mistakes from happening in the first place.
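
To put some rough numbers on that timing point, here is a back-of-the-envelope calculation. Every figure in it (the revenue per business hour, the outage windows, the 9-5 schedule) is a made-up assumption; the point is only that the same length of outage costs very different amounts depending on when it lands.

```python
# Back-of-the-envelope downtime cost comparison. All figures are hypothetical;
# plug in your own revenue-per-hour and outage windows.

REVENUE_PER_BUSINESS_HOUR = 5_000   # assumed $/hour while the office is open (9-5)

def downtime_cost(outage_start_hour: float, outage_hours: float,
                  open_hour: float = 9, close_hour: float = 17) -> float:
    """Cost of an outage, counting only the hours that overlap business hours."""
    outage_end = outage_start_hour + outage_hours
    overlap = max(0.0, min(outage_end, close_hour) - max(outage_start_hour, open_hour))
    return overlap * REVENUE_PER_BUSINESS_HOUR

# A 2 a.m. crash fixed by 6 a.m. never touches business hours...
print(downtime_cost(outage_start_hour=2, outage_hours=4))    # 0.0
# ...while the same 4-hour outage starting at 10 a.m. does.
print(downtime_cost(outage_start_hour=10, outage_hours=4))   # 20000.0
```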

Read More >

Prevent IT Disasters. How VMware High Availability protects your data center

By Tom McDonald | Mar 9, 2011 10:46:00 AM

VMware HA (High Availability) is a major step toward meeting a disaster recovery objective. With HA enabled, each ESXi host checks in on the other hosts and watches for a failure; if a failure occurs, the VMs on the failed host are restarted on another server. Enabling HA on your network has a few prerequisites:

  • All VMs and their configuration files must reside on shared storage, so that every host can access a VM if the host running it fails
  • Each host in a VMware HA cluster must have a host name and a static IP address, which guarantees the hosts can monitor one another without false positives on failure if an address were to change
  • Hosts must be configured with access to the VM network
  • Finally, VMware recommends a redundant network connection, so that if a network card fails the host can still be reached; without this redundancy the host would be seen as failed
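
The monitoring-and-restart behavior described above can be pictured with a small simulation. This is only a conceptual sketch, not VMware’s actual HA logic: the one-second heartbeat loop, the missed-heartbeat threshold, and the host/VM objects are all assumptions made for illustration.

```python
# Conceptual sketch of HA-style failure detection: hosts exchange heartbeats,
# and a host that misses too many in a row has its VMs restarted elsewhere.
# The threshold and data model are illustrative, not VMware's implementation.

MISSED_HEARTBEATS_TO_FAIL = 3   # assumed threshold

class Host:
    def __init__(self, name, vms):
        self.name = name
        self.vms = list(vms)     # VMs this host is currently running
        self.missed = 0          # consecutive missed heartbeats

def record_heartbeat(host, responded):
    """Called once per second per host; returns True if the host is declared failed."""
    host.missed = 0 if responded else host.missed + 1
    return host.missed >= MISSED_HEARTBEATS_TO_FAIL

def fail_over(failed, survivors):
    """Restart the failed host's VMs on the surviving hosts (round-robin)."""
    for i, vm in enumerate(failed.vms):
        target = survivors[i % len(survivors)]
        target.vms.append(vm)
        print(f"Restarting {vm} from {failed.name} on {target.name}")
    failed.vms.clear()

# Example: esx1 stops answering its heartbeats while esx2 stays healthy.
esx1, esx2 = Host("esx1", ["mail-vm", "db-vm"]), Host("esx2", ["web-vm"])
for second in range(5):
    if record_heartbeat(esx1, responded=False):
        fail_over(esx1, survivors=[esx2])
        break
    record_heartbeat(esx2, responded=True)
```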

Read More >