Monday, February 20, 2012

UPS, vSphere, APCUPSD, PowerCLI and IPMI

 

One area that I feel is overlooked in VMware vSphere environments is the UPS. Many organizations have a UPS and a backup generator, but no good way of shutting everything down gracefully. I have heard people say they don’t need one because the generator means the power will never be out. But what happens if the generator doesn’t work correctly?

Anyway…I wanted a system for my home lab to shut everything down nicely and then bring it back up in the correct order.

I own a 1500 VA APC UPS. Off of that, I run two Tyan motherboard white box ESXi 5 servers, one HP ML 350 G4p running ESX 4.0, a Linksys 48-port switch, one small iSCSI server using the StarWind software and one large storage server also running the StarWind iSCSI target.

All of that puts the UPS at about 68% load. I should probably get another one and split the load, but I will do that later. The UPS is connected to the large storage server via a serial cable, so that server handles the monitoring duties.

So I want to have the following procedure in the event of a power failure:

  1. Shut down the VMs.
  2. Shut down the ESXi hosts.
  3. Send a “kill” command to the UPS so it turns off the power after a grace period (3 mins).
  4. Shut down the storage servers.

Point #3 is important because I want the servers to come back on automatically once power is restored. If the UPS itself doesn’t shut off, the servers never see a power loss and so won’t power back on when utility power returns.

Also, the BIOS of the storage servers must be set to “Power On” after a power loss. The ESXi servers, however, must be set to NOT power on after a power loss. Why, you ask?

The problem is that the new ESXi 5 servers boot way faster than the iSCSI storage servers, so they are up before the storage is. And if that happens, the VMs obviously won’t start.

Since my fancy new servers (actually under $500 each!) have built-in IPMI, I will have a script on the storage server that powers them on and starts the VMs once the storage server is back up. This ensures that everything comes back up in the correct order.

APCUPSD

First I downloaded and installed the apcupsd application – I found it much easier to use and understand than the standard PowerChute software – however, PowerChute or any other UPS monitoring application that can run scripts will work just fine.

Once installed, you have to adjust the apcupsd.conf file.  Since I installed my application in C:\apcupsd, mine was in c:\apcupsd\etc\apcupsd\apcupsd.conf and looks like this:

UPSCABLE 940-0095A
UPSTYPE apcsmart
DEVICE COM1
UPSCLASS standalone
UPSMODE disable
TIMEOUT 30

This configuration only gives me 30 seconds of running on the batteries before everything is going to start shutting down.  For me, I don’t have a lot of run time and any power outage longer than 30 seconds is likely to last longer than the run time of my batteries so it works for me….do what works for you.

Next, you have to adjust the apccontrol.bat file in that same directory. There is one spot that allows you to “Kill” the UPS (have it shut itself off) after a shutdown so that everything will come back up when the power is restored.  It is the section labelled “doshutdown”

:doshutdown
rem
rem  If you want to try to power down your UPS, uncomment
rem    out the following lines, but be warned that if the
rem    following shutdown -h now doesn't work, you may find
rem    the power being shut off to a running computer :-(
rem  Also note, we do this in the doshutdown case, because
rem    there is no way to get control when the machine is
rem    shutdown to call this script with --killpower. As
rem    a consequence, we do both killpower and shutdown
rem    here.
rem  Note that Win32 lacks a portable way to delay for a
rem    given time, so we use the trick of pinging a
rem    non-existent IP address with a given timeout.
rem

   %APCUPSD% /kill
   ping -n 1 -w 5000 10.255.255.254 > NUL
   %POPUP% "Doing %APCUPSD% --killpower"
   %APCUPSD% --killpower
   ping -n 1 -w 12000 10.255.255.254 > NUL
   %SHUTDOWN% -h now
   GOTO :done

Also, you need to change the apcupsd service so that it runs with “-p”, otherwise the “kill” UPS command won’t work:

apcupsd.exe /service -p

You can hack the registry, but probably the easiest way is to delete the service and re-create it with the SC command (I named mine “test”; start= auto makes the new service start at boot, since SC otherwise creates it as a manual-start service):

sc create test binPath= "C:\apcupsd\bin\apcupsd.exe /service -p" start= auto

Once that is done, you need to create a .BAT file and a PowerCLI script to actually perform the shutdown actions. The script above, apccontrol.bat, will CALL other scripts based on the event, and my 30 second timeout is an event called “timeout”. So I created a timeout.bat in that same directory. Here is what mine looks like:

%SystemRoot%\system32\windowspowershell\v1.0\powershell.exe -psc "C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" -NoLogo -NonInteractive -ExecutionPolicy RemoteSigned -Command "C:\UPS\PowerOffAll.ps1"

shutdown /s /m \\192.168.1.56

The first command runs the PowerCLI script I have called “PowerOffAll.ps1” in the C:\UPS directory. Since this runs from the APCUPSD service, it is important to include -ExecutionPolicy RemoteSigned or it won’t work, even if you have already set that policy. The next line connects to the 2nd storage server and shuts it down too; it works because I use the same username/password combination on both storage servers.

Thanks go to Mike Foley for pointing out the ExecutionPolicy tip!

PowerCLI

For shutting everything down, I wanted to use PowerCLI. First I downloaded/installed PowerCLI from VMware. I found a number of scripts that did almost what I wanted and just modified and combined till I got what I wanted:

start-transcript -path c:\UPS\shutdownlog.txt

Set-PowerCLIConfiguration -DefaultVIServerMode multiple -Confirm:$false

# Connect to each ESX(i) server
Connect-VIServer 192.168.1.60,192.168.1.65 -user 'root' -password 'password'

# set the startup options on each VM - changed to using an auto-start PowerCLI script on boot instead
#Get-VM -Name "SBS" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 1
#Get-VM -Name "vCenter" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 2
#Get-VM -Name "PBX In A Flash" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 3
#Get-VM -Name "YNWP" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 4
#Get-VM -Name "WHS" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 5

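# For each powered-on VM on the ESX hosts (skips VMs that are already off)
Foreach ($VM in (Get-VM | Where { $_.PowerState -eq "PoweredOn" })){
    # Shutdown the guest cleanly (requires VMware Tools in the guest)
    $VM | Shutdown-VMGuest -Confirm:$false
}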

# Set the amount of time to wait before assuming the remaining powered on guests are stuck
$waittime = 180 #Seconds
sleep 10.0
do {
    # Wait for the VMs to be Shutdown cleanly
    sleep 10.0
    $waittime = $waittime - 10
    $numvms = @(Get-VM | Where { $_.PowerState -eq "PoweredOn" }).Count
    Write "Waiting for shutdown of $numvms VMs or until $waittime seconds"
} until ((@(Get-VM | Where { $_.PowerState -eq "PoweredOn" }).Count) -eq 0 -or $waittime -le 0)

Write "About to shutdown ESXi hosts"
# Shutdown the ESX Hosts
sleep 5.0
Get-VMHost | Foreach {Get-View $_.ID} | Foreach {$_.ShutdownHost_Task($TRUE)}

Write-Host "Shutdown Complete"
Disconnect-VIServer 192.168.1.60,192.168.1.65 -Confirm:$False

Stop-Transcript

I wanted a log of everything, so I used the Start-Transcript command. I also wanted to connect to all of the ESXi servers directly, as one group, instead of going through vCenter. There are two reasons: 1) my vCenter is virtual, so once it shuts down as part of the process I can no longer talk to it to handle the rest of the shutdown, and 2) with a single multi-server connection I don’t have to connect to each ESXi server individually. The other benefit is that “Get-VM” then runs across all the ESXi servers. One easy modification would be to get the list of ESXi hosts from vCenter and then connect to them. That way you wouldn’t have to hard-code the addresses into this script.

My original idea was to set the startup/shutdown options on each ESXi server and let it start the VMs in the correct order. However, with HA the startup/shutdown options don’t work anymore, and besides, if a VM gets vMotioned it won’t necessarily be the first to come up out of all the VMs on all the ESXi servers. So I decided to go with a script to start things up on boot instead. I left the code in, commented out, in case someone else wants to use it.

Next the script shuts down all the VMs (this requires that VMware Tools is installed in all VMs!). Then it waits up to 180 seconds for all VMs to power off. Once that has happened, or the time has expired, it shuts down all the ESXi hosts.
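
The idea of getting the list of ESXi hosts from vCenter could be sketched roughly like this. It is a minimal, untested sketch: the helper name Get-EsxHostNames, the vCenter address 192.168.1.50 and the administrator credentials are made-up placeholders, and it only works if it runs while the vCenter VM is still up, i.e. at the very start of the shutdown script.

```powershell
# Sketch only: Get-EsxHostNames is a hypothetical helper, not part of PowerCLI.
function Get-EsxHostNames {
    param(
        [string]$VCenter,
        [string]$User,
        [string]$Password
    )
    # Ask vCenter which hosts it manages, then drop the vCenter connection
    $vc = Connect-VIServer $VCenter -User $User -Password $Password
    $names = @(Get-VMHost -Server $vc | ForEach-Object { $_.Name })
    Disconnect-VIServer -Server $vc -Confirm:$false
    return $names
}

# Example use (address and credentials are placeholders):
# $esxHosts = Get-EsxHostNames -VCenter '192.168.1.50' -User 'administrator' -Password 'password'
# Connect-VIServer $esxHosts -User 'root' -Password 'password'
```

Note that the names returned are whatever the hosts were added to vCenter as, so if that is a hostname rather than an IP address, the storage server needs to be able to resolve it.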

Then my original batch file shuts off the 2nd storage server and then finally the apccontrol.bat sends the kill command to the UPS and shuts off the main storage server.

Startup and IPMI

Now…when the power comes back on, the BIOS of the 2 ESXi servers is set to stay off, but the BIOS of the 2 storage servers is set to Power On. So…they come back on. Then, I have a scheduled task in the main storage server that is set to run on boot.

All I need now is a command from Windows that this scheduled task can run to turn the ESXi servers back on, and then another PowerCLI command to actually start up the VMs in the order I want.

I had a heck of a time getting ahold of a copy of IPMITool for Windows. Sun used to make one, but Oracle bought out Sun a few years ago. They still have it, but it is much harder to find. So…find it here.

Once you have it unzipped, you can run a command like the one below to power on your server:

ipmitool-1.8.10.3-3.win.i386.exe -I lan -H 192.168.1.160 -U root -P password chassis power on
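
For more than one server, the same command can be wrapped in a small loop that the boot-time scheduled task runs. This is only a sketch under assumptions: the install path C:\UPS, the second BMC address 192.168.1.165 and the helper names are made up for illustration, and it assumes ipmitool returns a non-zero exit code when the command fails.

```powershell
# Sketch only: path, second BMC address and helper names are assumptions.
$ipmitool = 'C:\UPS\ipmitool-1.8.10.3-3.win.i386.exe'   # assumed install path
$bmcAddresses = '192.168.1.160', '192.168.1.165'        # second address is made up

# Build the same argument list as the one-line command above
function Get-IpmiPowerOnArgs {
    param(
        [string]$BmcAddress,
        [string]$User = 'root',
        [string]$Password = 'password'
    )
    return @('-I', 'lan', '-H', $BmcAddress, '-U', $User, '-P', $Password,
             'chassis', 'power', 'on')
}

# Send "chassis power on" to each BMC in turn
function Start-EsxHosts {
    param([string[]]$Addresses)
    foreach ($bmc in $Addresses) {
        $ipmiArgs = Get-IpmiPowerOnArgs -BmcAddress $bmc
        & $ipmitool @ipmiArgs
        if ($LASTEXITCODE -ne 0) {
            Write-Warning "ipmitool reported a failure for $bmc"
        }
    }
}

# Start-EsxHosts -Addresses $bmcAddresses   # call this from the scheduled task
```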

Lastly, we need a PowerCLI script to power on all the VMs in the right order. This isn’t written yet – so I will have to update this post later on once I have completed it.
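
In the meantime, here is a rough, untested sketch of one way such a startup script might look. It is not the finished script: the helper name Start-VMsInOrder and the 60 second delay are assumptions for illustration, and the VM names and order come from the commented-out start policy lines in the shutdown script above. It also assumes the ESXi hosts have finished booting and that Connect-VIServer has already succeeded.

```powershell
# Sketch only: Start-VMsInOrder is a hypothetical helper; the delay is a guess.
function Start-VMsInOrder {
    param(
        [string[]]$Names,          # VM names, first to start listed first
        [int]$DelaySeconds = 60    # pause after each start so the VM can boot
    )
    foreach ($name in $Names) {
        $vm = Get-VM -Name $name -ErrorAction SilentlyContinue
        if (-not $vm) {
            Write-Warning "VM '$name' not found, skipping"
            continue
        }
        # Only start VMs that are not already running
        if ($vm.PowerState -ne 'PoweredOn') {
            Start-VM -VM $vm -Confirm:$false | Out-Null
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Example use (same hosts and credentials as the shutdown script):
# Connect-VIServer 192.168.1.60,192.168.1.65 -User 'root' -Password 'password'
# Start-VMsInOrder -Names 'SBS','vCenter','PBX In A Flash','YNWP','WHS'
```

A fixed delay is the simplest option; how long each VM really needs depends on the guest, so the value would have to be tuned.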

I hope this is useful to someone!

Jim
