Monday, April 9, 2012

High Availability Windows Share Using Linux Samba/DRBD/CTDB and GFS

 

This combination should allow me to have a single DNS name that maps to a Virtual IP that can move back and forth between two Virtual Machines (running in my vSphere environment). This way I can have them on different storage (i.e. one on local storage and the other on iSCSI storage) and be able to turn one storage unit off for maintenance and still allow access to the Windows share. This setup automatically handles the replication between the nodes.

I was trying to follow these instructions: http://www.howtoforge.com/setting-up-an-active-active-samba-ctdb-cluster-using-gfs-and-drbd-centos-5.5 – but there are a number of errors in the instructions, plus some things weren’t clear to me.

We will be using Samba running on top of the GFS clustered filesystem, using CTDB (the clustered version of the TDB database used by Samba) and DRBD to handle all the replication duties.

We will have static IPs for each machine; in my case:

smb1 - 192.168.1.30 and 10.10.10.1

smb2 - 192.168.1.31 and 10.10.10.2

The 2nd IP on each VM is for the DRBD interface for replication

Lastly, there will be two Virtual IPs that are shared between them – 192.168.1.28 and 192.168.1.29

- First, install the 32-bit version of CentOS 5.5 with no options checked. Single hard drive only – 20 GB, thin provisioned. Set a static IP. Give it 2 NICs (one for network connectivity and the other for DRBD replication).

- Run setup and turn off SELinux

- Next, perform a yum update, then reboot. This updates to CentOS 5.8 (as of this writing).

- Install VMware Tools.

- Clone the machine

- Add a 2nd hard drive and a 2nd NIC to both VMs

- Boot the VMs, then run fdisk /dev/sdb: n (new), p (primary partition), 1, accept the defaults for the start/end cylinders, w (write)

- No need to create a filesystem on the partition, as we will get DRBD to use the new partition (/dev/sdb1) directly
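
You can double-check that the new partition is there before handing it to DRBD:

fdisk -l /dev/sdb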

- Make sure that the /etc/hosts file only has "127.0.0.1 localhost.localdomain localhost" and not the hostname of the machine

- Adjust /etc/hosts on 2nd VM
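
A sketch of what I mean – the assumption here is that smb1.yniw.local and smb2.yniw.local resolve through your DNS; if they don't, add explicit entries for them as in the commented lines:

# /etc/hosts on both nodes - note: no machine hostname on the 127.0.0.1 line
127.0.0.1       localhost.localdomain localhost

# only needed if your DNS does not resolve the node names:
# 192.168.1.30  smb1.yniw.local smb1
# 192.168.1.31  smb2.yniw.local smb2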

- Adjust the IP of the 2nd NIC on both VMs (run setup and then service network restart – you might want to erase ifcfg-eth0.bak from /etc/sysconfig/network-scripts on the 2nd VM)
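
For reference, a minimal sketch of the 2nd NIC's config file (assuming the replication NIC shows up as eth1 – adjust the device name, and use 10.10.10.2 on smb2):

# /etc/sysconfig/network-scripts/ifcfg-eth1 on smb1
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.10.10.1
NETMASK=255.255.255.0
ONBOOT=yes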

- Confirm that you can ping both machines from each other using the private IP as well as the Public IP

- Allow root login (edit /etc/ssh/sshd_config and change PermitRootLogin to yes – we are behind a firewall, right?!), then service sshd restart

- yum -y install drbd82 kmod-drbd82 samba joe autoconf automake gcc-c++

(I like the joe editor)

- yum -y groupinstall "Cluster Storage" "Clustering"

- Adjust /etc/drbd.conf on both nodes:

global {
  usage-count yes;
}

common {
  syncer {
    rate 100M;
    al-extents 257;
  }
}

resource r0 {

  protocol C;

  startup {
    become-primary-on both;              ### For Primary/Primary ###
    degr-wfc-timeout 60;
    wfc-timeout  30;
  }

  disk {
    on-io-error   detach;
  }

  net {
    allow-two-primaries;                 ### For Primary/Primary ###
    cram-hmac-alg sha1;
    shared-secret "mysecret";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }

  on smb1.yniw.local {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.10.10.1:7788;
    meta-disk  internal;
  }

  on smb2.yniw.local {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.10.10.2:7788;
    meta-disk  internal;
  }
}
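
Before going further, you can sanity-check that DRBD parses the config cleanly on each node (it just dumps the resource back at you if everything is OK):

drbdadm dump r0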

On both nodes do:


drbdadm create-md r0


On both nodes (at almost the same time), do:


service drbd start


Then, to put the two nodes as primary, on both nodes do:


drbdsetup /dev/drbd0 primary -o

Make drbd service start automatically at boot:


chkconfig --level 35 drbd on


Check on status of the drbd replication:


cat /proc/drbd

or you can do:


service drbd status
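
What you are waiting for is both nodes showing Connected, Primary/Primary and UpToDate/UpToDate once the initial sync finishes. A couple of ways I keep an eye on it:

watch -n1 cat /proc/drbd

# or query the states directly:
drbdadm cstate r0    # connection state - should end up as Connected
drbdadm state r0     # roles - should end up as Primary/Primary
drbdadm dstate r0    # disk states - should end up as UpToDate/UpToDate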


Next we configure the cluster that GFS needs for its locking. Put the following into /etc/cluster/cluster.conf on each system (the file should be identical on both nodes).



<?xml version="1.0"?>
<cluster name="cluster1" config_version="3">

<cman two_node="1" expected_votes="1"/>

<clusternodes>
<clusternode name="smb1.yniw.local" votes="1" nodeid="1">
        <fence>
                <method name="single">
                        <device name="manual" ipaddr="192.168.1.30"/>
                </method>
        </fence>
</clusternode>

<clusternode name="smb2.yniw.local" votes="1" nodeid="2">
        <fence>
                <method name="single">
                        <device name="manual" ipaddr="192.168.1.31"/>
                </method>
        </fence>
</clusternode>
</clusternodes>

<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>

<fencedevices>
        <fencedevice name="manual" agent="fence_manual"/>
</fencedevices>

</cluster>

Next, we start cman on both systems:


service cman start


And then we start the other services:


service clvmd start
service gfs start
service gfs2 start


chkconfig --level 35 cman on
chkconfig --level 35 clvmd on
chkconfig --level 35 gfs on
chkconfig --level 35 gfs2 on
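
Before creating the GFS filesystem, it's worth confirming that both nodes have actually joined the cluster – both smb1 and smb2 should show up as members here:

cman_tool status
cman_tool nodes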

Then we format the cluster filesystem (only on one node). The -j 2 creates one journal per node, and the cluster1 part of -t must match the cluster name in cluster.conf:

gfs_mkfs -p lock_dlm -t cluster1:gfs -j 2 /dev/drbd0

Then we create the mount point and mount the drbd device (on both nodes):

mkdir /clusterdata
mount -t gfs /dev/drbd0 /clusterdata

Then we insert the following line into /etc/fstab (mine is different from the original instructions):

/dev/drbd0              /clusterdata            gfs


I found that the gfs argument was necessary – however, do not add the other items (defaults 1 1), as this will cause the system to try to fsck the filesystem at startup, which won't work.

Now…you should be able to copy data onto that /clusterdata mount point on one node and have it show up on the other automatically.
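
A quick way to prove it to yourself (the file name is just an example):

# on smb1:
echo "hello from smb1" > /clusterdata/testfile

# on smb2:
cat /clusterdata/testfile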

Next we configure Samba. Again, my working file is different from the original instructions. Edit /etc/samba/smb.conf:



[global]

clustering = yes
idmap backend = tdb2
private dir = /clusterdata/ctdb
fileid:mapping = fsname
use mmap = no
nt acl support = yes
ea support = yes
security = user
map to guest = Bad Password
max protocol = SMB2

[public]
comment = public share
path = /clusterdata/public
public = yes
writeable = yes
only guest = yes
guest ok = yes

Next, create the directories needed by samba:


mkdir /clusterdata/ctdb
mkdir /clusterdata/public
chmod 777 /clusterdata/public
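
You can have Samba check the config before going any further – testparm just parses smb.conf and prints back what it understood:

testparm -s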

Follow the same instructions to install CTDB:

First, we need to download it:

cd /usr/src
rsync -avz samba.org::ftp/unpacked/ctdb .
cd ctdb/

Then we can compile it:

cd /usr/src/ctdb/
./autogen.sh
./configure
make
make install

Creating the init scripts and config links to /etc:

cd /usr/src/ctdb
cp config/ctdb.sysconfig /etc/sysconfig/ctdb
cp config/ctdb.init /etc/rc.d/init.d/ctdb
chmod +x /etc/init.d/ctdb
ln -s /usr/local/etc/ctdb/ /etc/ctdb
ln -s /usr/local/bin/ctdb /usr/bin/ctdb
ln -s /usr/local/sbin/ctdbd /usr/sbin/ctdbd

Next, we need to configure /etc/sysconfig/ctdb on both nodes:

joe /etc/sysconfig/ctdb

Again…there are mistakes in the example originally given and I have provided my copy:



CTDB_RECOVERY_LOCK="/clusterdata/ctdb.lock"
CTDB_PUBLIC_INTERFACE=eth0
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
ulimit -n 10000
CTDB_NODES=/etc/ctdb/nodes
CTDB_LOGFILE=/var/log/log.ctdb
CTDB_DEBUGLEVEL=2
CTDB_PUBLIC_NETWORK="192.168.1.0/24"
CTDB_PUBLIC_GATEWAY="192.168.1.8"

Now, configure /etc/ctdb/public_addresses on both nodes. These are the two Virtual IPs from the start of this post:

vi /etc/ctdb/public_addresses

192.168.1.28/24
192.168.1.29/24

Then, configure /etc/ctdb/nodes on both nodes. These are the static IPs of the two machines (you could also use the 10.10.10.x replication IPs here if you would rather keep CTDB's internal traffic on the private network):

vi /etc/ctdb/nodes

192.168.1.30
192.168.1.31

Then, configure /etc/ctdb/events.d/11.route on both nodes:

vi /etc/ctdb/events.d/11.route



#!/bin/sh

. /etc/ctdb/functions
loadconfig ctdb

cmd="$1"
shift

case $cmd in
    takeip)
         # we ignore errors from this, as the route might be up already when we're grabbing
         # a 2nd IP on this interface
         /sbin/ip route add $CTDB_PUBLIC_NETWORK via $CTDB_PUBLIC_GATEWAY dev $1 2> /dev/null
         ;;
esac

exit 0

Make the script executable:

chmod +x /etc/ctdb/events.d/11.route

Next…start the ctdb service:

service ctdb start
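
Give it a minute or two to go healthy, then check that CTDB sees both nodes and has handed out the public IPs. The smbclient line is just a quick, anonymous sanity check that the share is answering on one of the Virtual IPs:

ctdb status
ctdb ip

# quick test of the share from any Linux box with smbclient installed:
smbclient -L 192.168.1.28 -N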

Here is one place where I differ from the original instructions – they say to have the smb service auto-start, but CTDB handles starting and stopping Samba itself (that is what CTDB_MANAGES_SAMBA=yes is for), so don't do that.


Plus, I couldn't make everything start properly when done from init.d (using the chkconfig --level commands). The problem is that the GFS filesystem gets mounted from fstab before the other pieces are ready, so it doesn't work. So I put the following lines into /etc/rc.local:


service drbd start
mount -a
service ctdb start


Also….to stop one of the servers and take it offline:


service ctdb stop
umount /clusterdata
service drbd stop


You should now have a working, fully redundant active/active Windows-style share.


You can get to it by using a Windows PC and going to \\virtualip\public


So…in my example:  \\192.168.1.29\public


Enjoy!


Jim

Thursday, March 1, 2012

GoPro Camera found after 10 months in the Ocean – very cool picture!

 

So I was listening to the radio and found out about this guy who lost his GoPro camera while taking some pics with it on vacation. He had it on the front of his board while he was paddle surfing. The camera came off and he lost it and figured it was gone. Some 10 months later, a couple found it 15 miles away on the beach…they put the SD card in a computer and pulled up the pics.  The camera had continued to take pictures until it ran out of battery/space on the SD card.

The reason I even mention it here is because he is from my home town!

You can read more about the story here and here.

Anyway…the picture is the amazing thing – check this out!

 

[Photo: Zach's lost GoPro shot]

Windows 8 Consumer Preview on vSphere (ESXi) 5

 

So Microsoft just released Windows 8 Consumer Preview for people to download and try.

You can get it here.

So….I decided there was no harm in installing it in a virtual machine on my vSphere 5 home lab.

When you first set up the VM, pick Windows 7 as the guest OS. This makes sure that you get the correct Ethernet adapter (e1000). Then make sure you check the little “Enable 3D Support” box on the video card when you edit the settings of the virtual machine.

After installation, don’t install the WDDM driver or you will likely experience problems.

The installation looks a lot like Windows 7 (except for the initial boot with the fish).

[Image: the Windows 8 fish boot screen]

Once you are in, if you click the Desktop, you will see a desktop screen that looks very much like Windows 7.  All the same commands work – services.msc etc.

Seems to work pretty well – it just looks to me like they have put a front end on it to make it look snazzier.

Jim

Tuesday, February 28, 2012

VMware View 5 Design - Use cases, Pools

 

In creating a good VMware View 5 design, the most important part is deciding on use cases and mapping those use cases to pools.

You can think of a use case as an individual master image of an OS.

When mapping use cases to pools, anything that would force us to put a user into a separate pool effectively creates a new pool.

In my analysis, I am assuming the use of ThinApp as well.

Here are all the factors I could think of that will cause us to create additional pools:

  • Type of User
    o    Call center/help desk user
    o    Task user
    o    Power user/Executive
    o    Technical ?
    o    Any other “classes”
  • The # of users in a pool – maximum 500 users in a single pool, but you might want to set a lower limit
  • Internal and/or External access – these will be different Pools – the reason is that you can only currently use Tags to restrict access and Tags are done at the Pool level
    o    If we need external but not internal, then that is a pool (my guess is that you won’t have/use this one)
    o    If we need internal but not external, then that is a pool
    o    If we need Internal and External, then that is a pool
  • Virtual Machine configuration (I recommend limiting this to these 3 or maybe only the 1st two)
    o    A 2 GB, 1 vCPU machine would be a separate pool
    o    A 4 GB, 1 vCPU machine would be a separate pool
    o    A 4 GB, 2 vCPU machine would be a separate pool
  • OS, Applications and Drivers
    • Operating System
      • If you are going to have a Windows 7 and a Windows XP, these will be different pools
      • If you are going to have a Windows 7 32bit and a Windows 7 64 bit, these will be different pools   
    • Applications (not including ThinApp’ed apps – any app that will be delivered via ThinApp does not need to be included in the base image).
      • Base image with standard apps that are not ThinApp’ed (make a list of applications)
      • Above Base image plus specialized apps that are not ThinApp’ed
    • Base image plus any special drivers required – You might consider just adding these to the Base image anyway. It will increase the size, but it will reduce the number of Base images that we have to maintain.
  • Local Admin access required (static pool). For example, if a user needs to be able to install applications into their desktop
  • Users requiring 1-to-1 mapping of specific VM - Static Pool (You should try and use Dynamic pools as much as possible and with the new Persona Management in View 5 it should be possible to do this much more than you could previously).
  • Whether Offline desktop mode (Also called Local Mode desktop) allowed

Your next steps would be to analyze each user and decide what the pool for that user will need to look like.

For example, for a standard call center worker: he doesn’t need external access, has access to a standard set of applications (either already included in the base image or ThinApp’ed) and only needs a basic Windows 7 64-bit, single-vCPU machine with 2 GB of RAM.

So…that is Pool 1. We can then put all other Call Center workers into that same pool up to 500.

Then we perform the same analysis on each of the users (or groups of users) in your organization. Once we have put each user into a pool, we add up the pools and then know how many of each we need and what the config for those pools will look like.

This in turn dictates what the CPU/RAM, Network design and storage requirements will look like.

I will be back to this topic later on.

Jim

Monday, February 20, 2012

VMware PEX 2012

 


Training and competencies are important, and I am pleased to attend various training courses and product events throughout the year, one of them being the VMware Partner Exchange (VMware PEX), held this year in Las Vegas from Feb. 10-17, 2012. It is the second PEX event I’ve attended.

VMware PEX is a showcase event run by VMware and designed specifically for VMware Partners. It is sponsored by many of the vendors—including EMC and Cisco. There were over 4,000 attendees and it was a great opportunity to network with other like-minded individuals and VMware professionals to discuss issues and problems that occur on a daily basis. It was also a chance to upgrade my credentials: this year I took and passed the VMware Certified Professional 5 (VCP 5) exam.

At PEX, some interesting statistics were revealed:

  • Someone in the world powers up a new Virtual Machine every 6 seconds
  • There are now more than 25 million Virtual Machines running on VMware
  • More than 50% of the world’s workloads are now run on Virtual Machines
  • 81% of all virtualized workloads run on VMware

VMware has built their virtualization foundation upon vSphere. I have been actively assisting customers to virtualize their environments. After virtualizing their servers, many customers are further interested in virtualizing their desktops using VMware View—also called VDI (Virtual Desktop Infrastructure). Many organizations are using VMware View to allow for centralized management and easy access to virtual desktops.

As a leader in the virtualization space, VMware is always looking to the future, and this year was no exception. They are building upon their solid foundation and have created two new products: VMware vCloud Director (vCD) and VMware vCentre Operations Server (vCOPS).

VMware vCloud Director (vCD) allows organizations to look at their datacentres in a whole new way—rather than refer to individual servers, or even clusters of servers, vCD allows an organization to view their datacentre as a group of resources that they can utilize and deploy in any fashion they wish. They can give individual departments access to infrastructure without those departments having to a) purchase new gear or b) know anything about the underlying infrastructure. The IT department takes care of the underlying infrastructure and adds more servers, networking and storage as required. Other departments can simply consume these resources as needed using an easy to understand web-based interface. vSphere virtualizes the server; vCloud Director virtualizes the entire datacentre.

VMware vCentre Operations Server (vCOPS) is a monitoring and analytics engine that gives the organization the ability to watch over the datacentre (or multiple datacentres) and collect metrics. It learns what is “normal” in your environment and will alert you to anomalies. It also provides trending so the organization can be made proactively aware that it will, for example, run out of disk space in 6 months based on previous trends. This allows the organization to actively plan for the future instead of guessing at potential issues or dealing with them on the fly. In addition, when problems do arise, it allows the organization to pinpoint exactly what piece is causing the issue.

In addition, VMware PEX hosts break-out sessions and hands-on labs where Partners can test out products in a live environment. This year, there were over 190 breakout sessions on many different topics and 27 different hands-on labs.

On the second-last night, VMware had a party for all attendees with the Barenaked Ladies – and I took some footage.

Overall, the experience was fantastic and I would highly recommend that customers attend VMworld, the event that is open to customers, in 2012.

Jim Nickel

VMware VCP5

UPS, vSphere, APCUPSD, PowerCLI and IPMI

 

One area that I feel is overlooked when dealing with VMware vSphere environments is UPS management. Many organizations have a UPS and also a backup generator, but they don’t have a nice way of shutting everything down gracefully. I have heard the argument that they don’t need that because of the generator, so the power will never be out. But what happens if the generator doesn’t work correctly?

Anyway…I wanted a system for my home lab to shut everything down nicely and then bring it back up in the correct order.

I own a 1500 VA APC UPS. Off of that, I run 2 Tyan-motherboard white box ESXi 5 servers, 1 HP ML350 G4p running ESX 4.0, a Linksys 48-port switch, one small iSCSI server using the Starwind software and one large storage server also running the Starwind iSCSI target.

All that runs the UPS at about 68% of its load capacity – I should probably get another one and split the load, but I will do that later. This UPS is connected to the large storage server via a serial cable so that server can handle the monitoring duties.

So…I want to have the following procedure in the event of a power failure:

  1. Shut down the VMs.
  2. Shut down the ESXi hosts.
  3. Send a “kill” command to the UPS so it turns off the power after a grace period (3 mins).
  4. Shut down the storage servers.

So…Point #3 is important because I want the servers to come back on automatically once power is restored. If the UPS itself doesn’t shut off, the servers won’t see a power failure and won’t come back on once power is restored.

Also…it is important to note that the BIOS setting of the storage servers must be adjusted so that they are set to “Power On” in the event of a power loss. However, I must set the ESXi servers to NOT “Power On” in the event of a power loss – why?  You ask?

The problem is that the new ESXi 5 servers boot way faster than the iSCSI storage servers, so they are up before the storage is. And if that happens then the VM’s obviously won’t start.

So….since my fancy new servers (actually under $500 each!) have builtin IPMI – I will have a script on the storage server that will power them on and start up the VM’s once the storage server is started back up. This will ensure that everything comes back up in the correct order.

APCUPSD

First I downloaded and installed the apcupsd application – I found it much easier to use and understand than the standard PowerChute software – however, PowerChute or any other UPS monitoring application that can run scripts will work just fine.

Once installed, you have to adjust the apcupsd.conf file.  Since I installed my application in C:\apcupsd, mine was in c:\apcupsd\etc\apcupsd\apcupsd.conf and looks like this:

UPSCABLE 940-0095A
UPSTYPE apcsmart
DEVICE COM1
UPSCLASS standalone
UPSMODE disable
TIMEOUT 30

This configuration only gives me 30 seconds of running on the batteries before everything is going to start shutting down.  For me, I don’t have a lot of run time and any power outage longer than 30 seconds is likely to last longer than the run time of my batteries so it works for me….do what works for you.

Next, you have to adjust the apccontrol.bat file in that same directory. There is one spot that allows you to “Kill” the UPS (have it shut itself off) after a shutdown so that everything will come back up when the power is restored.  It is the section labelled “doshutdown”

:doshutdown
rem
rem  If you want to try to power down your UPS, uncomment
rem    out the following lines, but be warned that if the
rem    following shutdown -h now doesn't work, you may find
rem    the power being shut off to a running computer :-(
rem  Also note, we do this in the doshutdown case, because
rem    there is no way to get control when the machine is
rem    shutdown to call this script with --killpower. As
rem    a consequence, we do both killpower and shutdown
rem    here.
rem  Note that Win32 lacks a portable way to delay for a
rem    given time, so we use the trick of pinging a
rem    non-existent IP address with a given timeout.
rem

   %APCUPSD% /kill
   ping -n 1 -w 5000 10.255.255.254 > NUL
   %POPUP% "Doing %APCUPSD% --killpower"
   %APCUPSD% --killpower
   ping -n 1 -w 12000 10.255.255.254 > NUL
   %SHUTDOWN% -h now
   GOTO :done

Also…you need to change the apcupsd service to have a “-p” in order for the UPS “kill” command to work:

apcupsd.exe /service -p

You can hack the registry, but probably the easiest way is to just delete the service and re-create it with the SC command (assuming the service is named apcupsd – adjust if yours differs):

sc delete apcupsd
sc create apcupsd binPath= "C:\apcupsd\bin\apcupsd.exe /service -p"

Once that is done, you need to create your .BAT file and your PowerCLI script to actually perform the shutdown actions. The apccontrol.bat script above will CALL other scripts based on the action – my 30 second timeout is an action called “timeout”, so I create a timeout.bat in that same directory. Here is what mine looks like:

%SystemRoot%\system32\windowspowershell\v1.0\powershell.exe -psc "C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" -NoLogo -NonInteractive -ExecutionPolicy RemoteSigned -Command "C:\UPS\PowerOffAll.ps1"

shutdown /s /m \\192.168.1.56

The first command runs the PowerCLI script I have called “PowerOffAll.ps1” in the C:\UPS directory. Since we are running this from the APCUPSD service, it is important that we have -ExecutionPolicy RemoteSigned in there or it won’t work – even if you have already set that policy. The next line simply connects to the 2nd storage server to shut it down too – it works because I am using the same username/password combination on both storage servers.

Thanks go to Mike Foley for pointing out the ExecutionPolicy tip!

PowerCLI

For shutting everything down, I wanted to use PowerCLI. First I downloaded/installed PowerCLI from VMware. I found a number of scripts that did almost what I wanted and just modified and combined till I got what I wanted:

start-transcript -path c:\UPS\shutdownlog.txt

Set-PowerCLIConfiguration -DefaultVIServerMode multiple -Confirm:$false

# Connect to each ESX(i) server
Connect-VIServer 192.168.1.60,192.168.1.65 -user 'root' -password 'password'

# set the startup options on each VM - changed to using a auto start PowerCLI script on boot instead
#Get-VM -Name "SBS" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 1
#Get-VM -Name "vCenter" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 2
#Get-VM -Name "PBX In A Flash" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 3
#Get-VM -Name "YNWP" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 4
#Get-VM -Name "WHS" | Get-VMStartPolicy | Set-VMStartPolicy -StartAction PowerOn -StartOrder 5

# For each of the VMs on the ESX hosts
Foreach ($VM in Get-VM){
    # Shutdown the guest cleanly
    $VM | Shutdown-VMGuest -Confirm:$false
}

# Set the amount of time to wait before assuming the remaining powered on guests are stuck
$waittime = 180 #Seconds
sleep 10.0
do {
    # Wait for the VMs to be Shutdown cleanly
    sleep 10.0
    $waittime = $waittime - 10
    $numvms = @(Get-VM | Where { $_.PowerState -eq "PoweredOn" }).Count
    Write "Waiting for shutdown of $numvms VMs or until $waittime seconds"
} until ((@(Get-VM | Where { $_.PowerState -eq "PoweredOn" }).Count) -eq 0 -or $waittime -le 0)

Write "About to shutdown ESXi hosts"
# Shutdown the ESX Hosts
sleep 5.0
Get-VMHost | Foreach {Get-View $_.ID} | Foreach {$_.ShutdownHost_Task($TRUE)}

Write-Host "Shutdown Complete"
Disconnect-VIServer 192.168.1.60,192.168.1.65 -Confirm:$False

Stop-Transcript

I wanted a log of everything, so I used the Start-Transcript command. I also want to handle all of the ESXi servers directly in one group instead of talking to the vCenter – this is for 2 reasons – 1) my vCenter is virtual, so if it shuts down as part of the process, I won’t be able to talk to it to handle the rest of the shutdown process and 2) I don’t want to have to connect to each ESXi server individually. The other benefit is that when I do a “Get-VM”, it does it across all the ESXi servers. One easy modification would be to get the list of ESXi hosts from vCenter and then connect to them. That way you wouldn’t have to modify/hard code the addresses into this script.

My original idea was to set the startup/shutdown options on each ESXi server and let it start them up in the correct order. However, with HA, the startup/shutdown options don’t work anymore, and besides, if a machine gets vMotioned, it won’t necessarily be the 1st to come up out of all the VMs on all the ESXi servers. So…I decided to go with a script to start things up on boot instead. I left the code in case someone else wants to use it.

So…next it shuts down all the VMs (this requires that VMware Tools is installed in all VMs!). Then it waits up to 180 seconds for all VMs to power off. Once that has happened, or the time has expired, it shuts down all the ESXi hosts.

Then my original batch file shuts off the 2nd storage server and then finally the apccontrol.bat sends the kill command to the UPS and shuts off the main storage server.

Startup and IPMI

Now…when the power comes back on, the BIOS of the 2 ESXi servers is set to stay off, but the BIOS of the 2 storage servers is set to Power On. So…they come back on. Then, I have a scheduled task in the main storage server that is set to run on boot.

All I need now is a command from Windows that this scheduled task can run to turn the ESXi servers back on, and then another PowerCLI command to actually start up the VMs in the order I want.

I had a heck of a time getting ahold of a copy of IPMITool for Windows. Sun used to make one, but Oracle bought out Sun a few years ago. They still have it, but it is much harder to find. So…find it here.

Once you have it unzipped, you can run a command like the one below to power on your server:

ipmitool-1.8.10.3-3.win.i386.exe -I lan -H 192.168.1.160 -U root -P password chassis power on

Lastly, we need a PowerCLI script to power on all the VMs in the right order. This isn’t written yet – so I will have to update this post later on once I have completed it.

I hope this is useful to someone!

Jim