Tuesday, November 13, 2007

Rebooting virtual desktops without a VDI connection broker

One problem I ran into after deploying a few thin clients to users: how do they reboot their virtual machines? Usually they don't need to, but after a week or two of just logging off, things can get a little stale. And we instruct our (normal) users to restart their computer every night, so if we push an update out, then their computer is in the perfect state.

Some connection brokers actually interface directly with VirtualCenter, and can power machines on and off as users need them. We don't really have a connection broker per se. We have a few HP thin clients that we purchased, and we are transforming older PCs into thin clients (using the RDP client in Windows XP). We're also testing 2x's ThinClientServer and ThinClientOS, but they're not really ready for prime time.

So my idea is to have a script that runs nightly. It would check all of our virtual desktops, and reboot any that have no user logged on. It will be a batch script, so I ran some tests to see what programs show up in TASKLIST during different stages. After looking over the results, I think I can use the existence of explorer.exe in TASKLIST to decide if a user is logged in or not.

Now I need a list of virtual desktops. I looked through the SQL tables for VirtualCenter. One table has a list of all virtual machines in the inventory: VPX_VM. There are a couple of columns we can use in here:

  • DNS_NAME: we can pass this on to the shutdown command to reboot the machine
  • POWER_STATE: no sense in trying to reboot a machine that is off
  • ANNOTATION: this notes field could be used to specify which VMs are eligible for a reboot (I'm not going to use it, but it might be useful for someone else)
  • GUEST_OS: maybe we only want to reboot machines matching "winXPProGuest"
I need a SQL query that returns a list of virtual desktops. I don't want any NULLs from DNS_NAME, machines that are powered off (hence the POWER_STATE <> 0 condition), or machines that don't match winXPProGuest. The query doesn't filter out NULLs itself; the FIND pipe I add later drops those. Here's what I came up with:
select DNS_NAME from VirtualCenter.dbo.VPX_VM where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest'
I'm going to run the script once a day on our VirtualCenter server. I typically keep a directory called C:\Scripts for these sorts of things, so that I don't lose track of them.

SQL 2005 has a command-line utility called SQLCMD.EXE. It's pretty picky about dependencies, so I installed the workstation/client components from the SQL disc to the VirtualCenter server, and after that it worked fine. After the install, I copied the SQLCMD files (SQLCMD.EXE, batchparser90.dll, and SQLCMD.rll) from our SQL 2005 Server installation to this directory. They can be found under \90\Tools\Binn\.

SQLCMD and TASKLIST can each take a specified username and password. For SQLCMD, any user with read access to the VirtualCenter DB will be fine. And for TASKLIST, I'll use the local administrator on each of the machines. SHUTDOWN doesn't have the ability to specify an account to use, so use whatever is good for your situation. I'll add the computer account for my VirtualCenter server (example: VCSRV$) to the Administrators group on each of our virtual desktops, and run the scheduled task as Local System.

Here's the command to get the list of machines using SQLCMD. I added a FIND pipe to remove any unnecessary clutter from the output. Using most of the FQDN in FIND will return just our VMs.

sqlcmd -S SQLSERVER -U Username -P Password -Q "select DNS_NAME from VirtualCenter.dbo.VPX_VM where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest'" -W | find /i ".subdomain.domain.com"
This returns a list of machines, sort of like this:
JOHNDvm.subdomain.domain.com
JANEDvm.subdomain.domain.com
So for each result, I want to first check that no user is logged on (explorer.exe is not running), and then issue a shutdown command. The batch script will be formed generically like so:
For each line of the SQL results, call :ParseResult
Goto the end of the file

:ParseResult
Is this machine running explorer.exe?
If it is not, then reboot it.
Wait a couple of minutes so that we don't overload the ESX server
Exit this sub so I can check the next one.
So here's the actual code. I'm nesting the SQLCMD inside of the FOR statement, which might seem confusing, but is really the best way to do it.
for /F %%a IN ('sqlcmd -S SQLSERVER -U Username -P Password -Q "select DNS_NAME from VirtualCenter.dbo.VPX_VM where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest'" -W ^| find /i ".subdomain.domain.com"') do call :ParseResult %%a
::Notice the ^ before the pipe; this is required inside a FOR statement
goto :EOF
::Go to the end of the file when the FOR statement finishes
:ParseResult
::This is called by the FOR statement and gets passed the DNS name for each VM
set ThisHost=%1
::I like to make it an actual variable before doing anything with it
tasklist /S %ThisHost% /U %ThisHost%\Administrator /P Password /FI "IMAGENAME eq explorer.exe" | find /i "explorer.exe" >nul
::Lists if explorer.exe is running on this host. The find is there to set the errorlevel.
if %errorlevel%==0 exit /b
::If it finds explorer.exe, forget about it.
shutdown -f -r -t 300 -m \\%ThisHost% -c "Message to users"
::Shutdown the machine: force, restart, wait 5 minutes, the target, and a message
sleep 120
::Wait two minutes (120 seconds) before going on to the next one.
exit /b
::Continues to the next line inside of the FOR statement.
Obviously, you'll need to change usernames, passwords, and server names to match your environment.

You might ask yourself why I specify a filter for explorer.exe even though I'm going to do a FIND right after that. Well, I've noticed some issues with TASKLIST and FIND; it seems inconsistent. After switching to this method, it works fine, so that's enough reason for me.

Also, you'll need the SLEEP command to run the above script. It isn't built into Windows, but it is included in the Windows Server 2003 Resource Kit Tools. If you only have 3 or 4 virtual desktops, staggering them isn't really an issue. But what if your ESX servers and VirtualCenter have 50 or 100 machines rebooting at the same time? It's good to stagger them a little, and you should adjust the time depending on your environment.
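
If SLEEP isn't available on the machine running the script, a common workaround is to PING the loopback address and throw away the output. This is only a sketch, not what the script above uses. The count of 121 is my approximation of the 120-second pause: PING waits roughly one second between echoes, so N+1 echoes take about N seconds.

```batch
:: Rough stand-in for "sleep 120".
:: PING pauses about one second between echoes, so 121 echoes is roughly 120 seconds.
ping -n 121 127.0.0.1 >nul
```

The timing is approximate, which is fine here since the only goal is to stagger the reboots.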

Some things that could be better about all of this:
  1. POWER_STATE might not be 0, but the machine could still be in another state that wouldn't let us reboot it. I can't readily find a schema for this field/table, but it's not a big deal.
  2. Maybe we want VMs to reboot once a week. Well, if I schedule it for once a week, and someone stays logged on over the weekend, now we're at two weeks. So if the script still ran nightly, but skipped machines that haven't been up for a week yet, that would be cool. SYSTEMINFO | FIND /i "System Up Time:" would do this. But again, not a big deal.
  3. Maybe it should turn on VMs that are off every morning before users get in to work. This would require API scripting (or a real connection broker), so maybe something for the future.
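
For item 2, the uptime check might look something like the sketch below, dropped into :ParseResult just before the shutdown line. I haven't put it into production, so treat it as untested. It assumes the English Windows XP format of the SYSTEMINFO line (for example "System Up Time: 7 Days, 2 Hours, 15 Minutes, 4 Seconds"), and that the account running the script already has rights on the remote machine, so no /U and /P are passed. UpDays is just a variable name I picked for this example.

```batch
:: Hypothetical addition to :ParseResult, placed before the shutdown line.
:: With the default delimiters, the tokens of the uptime line are:
::   System(1) Up(2) Time:(3) 7(4) Days,(5) ...
set UpDays=
for /F "tokens=4" %%d in ('systeminfo /S %ThisHost% ^| find /i "System Up Time:"') do set UpDays=%%d
:: If SYSTEMINFO failed or the line was missing, skip this host rather than guess.
if not defined UpDays exit /b
:: Skip machines that have been up for less than 7 days.
if %UpDays% LSS 7 exit /b
```

SYSTEMINFO is slow against remote machines, so this would add noticeable time per host on top of the stagger delay.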
All that's left is to make a scheduled task for the script (and pipe the output to a log file if you're interested). It does a very simple specific task, and I'm pretty happy with the result. I'd love to hear feedback if anyone else tries this.
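
As a rough example, the scheduled task could be created from the command line with SCHTASKS instead of the GUI. The task name, script name, log file name, and start time below are all placeholders of mine. The /st value uses the HH:MM:SS form that Windows XP and Server 2003 expect; newer versions of Windows want HH:MM instead.

```batch
:: Placeholders: task name, C:\Scripts\RebootVMs.cmd, C:\Scripts\RebootVMs.log, 02:00 start.
:: Runs daily as Local System and appends all output (including errors) to the log.
schtasks /create /tn "Reboot idle virtual desktops" /tr "cmd /c C:\Scripts\RebootVMs.cmd >> C:\Scripts\RebootVMs.log 2>&1" /sc daily /st 02:00:00 /ru SYSTEM
```

Wrapping the script in cmd /c is what lets the redirection to the log file work, since the task itself doesn't run inside a shell.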

Tuesday, November 06, 2007

Buying stuff

We made the decision for our new iSCSI SAN: an EqualLogic PS3700X. It has 16 400GB 10k rpm SAS drives, for a total of 4.8 TB usable space. We were planning on purchasing a PS3600X (300GB drives), and I'm glad we decided to wait as long as possible for newer products. For the extra 1.4 TB we only paid about $1k more. We are also looking at purchasing a PS83E for off-site replication, but that's not a definite yet.

Yesterday there was an announcement that Dell was purchasing EqualLogic. I've heard a few people's opinions on this, and everyone for the most part seems to be apathetic. I don't have the impression that Dell will screw up what a great company EqualLogic is, and I'm also not worried about support. From my perspective, everything should stay the same. Marc Farley from EqualLogic is keeping a good summary of the reactions from the internet over at his Storage @ Work blog. On a side note, our LeftHand salesman (whom I had recently informed that they lost to EqualLogic) was emailing me this morning trying to use the situation to their benefit. Nice.

We are also in the process of upgrading our network infrastructure. We are purchasing a Cisco 4510 with 6 48-port/gigabit/PoE cards and 2 48-port/gigabit cards. We are also purchasing 3 Cisco Catalyst 3560G-48PS switches for our IDF closets, and a 3560G-48TS for our iSCSI SAN.

One of the things we wanted to implement was a way to dynamically assign VLANs based on MAC address. I've already written a script that currently stores all of our MAC addresses in a SQL database when users log in. There are a couple of different technologies that will accomplish this (VMPS, URT, 802.1x), but we haven't picked one yet. It seems that Cisco is trying to phase out VMPS in favor of 802.1x, and I can't find a lot of information on the internet regarding VMPS usage. URT is in the same boat, and one guy on the internet seems to think they were both "Cisco trying to figure out which direction to go in". 802.1x seems great, but it's more than we really need. It dynamically assigns VLANs as an afterthought; its main purpose is port security regulated by user authentication. That's great, but what about our 40 or so network printers that don't support 802.1x? What about our thin clients that need to PXE boot? So yeah, I'm still researching other possibilities. When it's done, I'll be sure to post a detailed explanation of how we did it, because I can't really find anything that is straightforward and gives a complete picture (most miss the section about how to get MAC addresses into the actual database).