Tuesday, November 13, 2007

Rebooting virtual desktops without a VDI connection broker

One problem I ran into after deploying a few thin clients to users: how do they reboot their virtual machines? Usually they don't need to, but after a week or two of just logging off, things can get a little stale. And we instruct our (normal) users to restart their computer every night, so if we push an update out, then their computer is in the perfect state.

Some connection brokers actually interface directly with VirtualCenter, and can power on and off machines as users need them. We don't really have a connection broker per se. We have a few HP thin clients that we purchased, and we are transforming older PCs into thin clients (using the RDP client in Windows XP). We're also testing 2X's ThinClientServer and ThinClientOS, but it's not really ready for prime-time.

So my idea is to have a script that would run nightly. It would check all of our virtual desktops, and reboot any of them that are not logged on. It will be a batch script, so I ran some tests to see what programs show up in TASKLIST during different stages. After looking over the results, I think I can use the existence of explorer.exe in TASKLIST to decide if a user is logged in or not.
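The explorer.exe check can be sketched outside of batch as well. Here is a minimal Python version, assuming TASKLIST prints one process per line with the image name in the first column. The function name and the sample output in the usage note are made up for illustration, not captured from a real machine.

```python
def user_logged_on(tasklist_output: str) -> bool:
    """Return True if any line of TASKLIST output lists explorer.exe.

    Assumption: the image name is the first whitespace-separated field
    on each process line, as in TASKLIST's default table format.
    """
    for line in tasklist_output.splitlines():
        fields = line.split()
        # Compare case-insensitively, like FIND /i does in the batch version.
        if fields and fields[0].lower() == "explorer.exe":
            return True
    return False
```

Feeding it a line such as "explorer.exe 1234 Console 0 20,000 K" gives True, while the "INFO: No tasks running..." message TASKLIST prints for an empty filter result gives False.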

Now I need a list of virtual desktops. I looked through the SQL tables for VirtualCenter. One table has a list of all virtual machines in the inventory: VPX_VM. There are a couple of columns we can use in here:

  • DNS_NAME: we can pass this on to the shutdown command to reboot the machine
  • POWER_STATE: no sense in trying to reboot a machine that is off
  • ANNOTATION: this notes field could be used to specify which VMs are eligible for a reboot (I'm not going to use these, but it might be useful for someone else)
  • GUEST_OS: maybe we only want to reboot machines matching "winXPProGuest"
I need to make a SQL query to return a list of virtual desktops. I don't want any NULLs from DNS_NAME, machines that are powered off (POWER_STATE <> 0), or machines that don't match winXPProGuest. Here's what I came up with:
select DNS_NAME from VirtualCenter.dbo.VPX_VM where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest'
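To see what the filter does with NULLs and powered-off machines, here is a small sketch that runs the same WHERE clause against an in-memory SQLite table. This is only a stand-in: the real VPX_VM table lives in SQL Server, the three columns are the ones named above, the test rows are invented, and the "DNS_NAME is not null" and "order by" clauses are my additions for the demo (the query above does not filter NULLs on its own).

```python
import sqlite3


def eligible_desktops(rows):
    """Return DNS names of powered-on XP guests from (dns, power, guest_os) rows.

    Assumption: a throwaway SQLite table stands in for VirtualCenter's VPX_VM.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table VPX_VM (DNS_NAME text, POWER_STATE integer, GUEST_OS text)"
    )
    conn.executemany("insert into VPX_VM values (?, ?, ?)", rows)
    cur = conn.execute(
        "select DNS_NAME from VPX_VM "
        "where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest' "
        "and DNS_NAME is not null "  # added here; not in the original query
        "order by DNS_NAME"          # added only to make the output stable
    )
    names = [row[0] for row in cur.fetchall()]
    conn.close()
    return names
```

A row with a NULL DNS_NAME passes the first two conditions, which is why it has to be dropped either in SQL like this or later on by the FIND pipe.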
I'm going to run the script once a day on our VirtualCenter server. I typically keep a directory called C:\Scripts for these sorts of things, so that I don't lose track of them.

SQL 2005 has a command-line utility called SQLCMD.EXE. It's pretty picky about dependencies, so I installed the workstation/client components from the SQL disc to the VirtualCenter server, and after that it worked fine. After the install, I copied the SQLCMD files (SQLCMD.EXE, batchparser90.dll, and SQLCMD.rll) from our SQL 2005 Server installation to this directory. They can be found under \90\Tools\Binn\.

SQLCMD and TASKLIST can each take a specified username and password. For SQLCMD, any user with read access to the VirtualCenter DB will be fine. And for TASKLIST, I'll use the local administrator on each of the machines. SHUTDOWN doesn't have the ability to specify an account to use, so use whatever is good for your situation. I'll add the computer account for my VirtualCenter server (example: VCSRV$) to the Administrators group on each of our virtual desktops, and run the scheduled task as Local System.

Here's the command to get the list of machines using SQLCMD. I added a FIND pipe to remove any unnecessary clutter from the output. Using most of the FQDN in FIND will return just our VMs.

sqlcmd -S SQLSERVER -U Username -P Password -Q "select DNS_NAME from VirtualCenter.dbo.VPX_VM where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest'" -W | find /i ".subdomain.domain.com"
This returns a list of servers, sort of like this:
JOHNDvm.subdomain.domain.com
JANEDvm.subdomain.domain.com
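The FIND step is just a case-insensitive substring filter over SQLCMD's output. A rough Python equivalent, assuming the output arrives as one string with a header, a dashed separator, the rows, and a "rows affected" footer (the sample layout in the usage note is an assumption, and the function name is hypothetical):

```python
from typing import List


def filter_vm_names(sqlcmd_output: str,
                    suffix: str = ".subdomain.domain.com") -> List[str]:
    """Keep only lines containing the domain suffix, ignoring case.

    Mirrors `find /i ".subdomain.domain.com"`: headers, separators,
    NULL rows and the row-count footer fall out because they do not
    contain the suffix.
    """
    wanted = suffix.lower()
    names = []
    for line in sqlcmd_output.splitlines():
        name = line.strip()
        if wanted in name.lower():
            names.append(name)
    return names
```

Given the lines "DNS_NAME", "--------", "JOHNDvm.subdomain.domain.com", "NULL" and "(2 rows affected)", only the JOHNDvm line survives.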
So for each result, I want to first check that no user is logged on (explorer.exe is not running), and then issue a shutdown command. The batch script will be formed generically like so:
For each line of the SQL results, call :ParseResult
Goto the end of the file

:ParseResult
Is this machine running explorer.exe?
If it is not, then reboot it.
Wait a couple of minutes so that we don't overload the ESX server
Exit this sub so I can check the next one.
So here's the actual code. I'm nesting the SQLCMD inside of the FOR statement, which might seem confusing, but is really the best way to do it.
for /F %%a IN ('sqlcmd -S SQLSERVER -U Username -P Password -Q "select DNS_NAME from VirtualCenter.dbo.VPX_VM where POWER_STATE <> 0 and GUEST_OS = 'winXPProGuest'" -W ^| find /i ".subdomain.domain.com"') do call :ParseResult %%a
::Notice the ^ before the pipe; this is required inside of a FOR statement
goto :EOF
::Go to the end of the file when the FOR statement finished
:ParseResult
::This is called by the FOR statement and gets passed the DNS name for each VM
set ThisHost=%1
::I like to make it an actual variable before doing anything with it
tasklist /S %ThisHost% /U %ThisHost%\Administrator /P Password /FI "IMAGENAME eq explorer.exe" | find /i "explorer.exe" >nul
::Lists if explorer.exe is running on this host. The find is there to set the errorlevel.
if %errorlevel%==0 exit /b
::If it finds explorer.exe, forget about it.
shutdown -f -r -t 300 -m \\%ThisHost% -c "Message to users"
::Shutdown the machine: force, restart, wait 5 minutes, the target, and a message
sleep 120
::Wait two minutes (120 seconds) before going on to the next one.
exit /b
::Continues to the next line inside of the FOR statement.
Obviously, you'll need to change usernames, passwords, and server names to match your environment.
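For anyone who would rather not do this in batch, the same control flow can be sketched in Python. This is not a drop-in replacement: the logged-on check and the reboot are passed in as functions (in practice they would wrap TASKLIST and SHUTDOWN), and every name here is hypothetical.

```python
import time
from typing import Callable, Iterable, List


def reboot_idle_desktops(hosts: Iterable[str],
                         is_logged_on: Callable[[str], bool],
                         reboot: Callable[[str], None],
                         pause_seconds: int = 120,
                         sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Reboot every host nobody is logged on to, pausing after each reboot.

    Returns the list of hosts that were rebooted, in order.
    """
    rebooted = []
    for host in hosts:
        if is_logged_on(host):
            # Same as `if %errorlevel%==0 exit /b`: skip machines in use.
            continue
        reboot(host)
        rebooted.append(host)
        # Stagger the reboots so the ESX host is not hit all at once.
        sleep(pause_seconds)
    return rebooted
```

Passing the two checks in as functions also makes it easy to do a dry run: hand it a reboot function that only logs the host name and see which machines would have been restarted.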

You might ask yourself why I specify a filter for explorer.exe even though I'm going to do a FIND right after that. Well, I've noticed some issues with TASKLIST and FIND; it seems inconsistent. After switching to this method, it works fine, so that's enough reason for me.

Also, you'll need the SLEEP command to run the above script. It isn't built into Windows; it comes with the Windows Server 2003 Resource Kit Tools. If you only have 3 or 4 virtual desktops, staggering them isn't really an issue. But what if your ESX servers and VirtualCenter have 50 or 100 machines rebooting at the same time? It's good to stagger them a little, and you should adjust the time depending on your environment.

Some things that could be better about all of this:
  1. POWER_STATE might not be 0, but the machine could still be in another state that wouldn't let us reboot it. I can't readily find a schema for this field/table, but it's not a big deal.
  2. Maybe we want VMs to reboot once a week. Well if I schedule it for once a week, and someone stays logged on over the weekend, now we're at two weeks. So if the script still ran nightly, but skipped machines that haven't been up for a week yet, that would be cool. SYSTEMINFO | FIND /i "System Up Time:" would do this. But again, not a big deal.
  3. Maybe it should turn on VMs that are off every morning before users get in to work. This would require API scripting (or a real connection broker), so maybe something for the future.
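The weekly-reboot idea in item 2 comes down to parsing the uptime line. A minimal sketch, assuming SYSTEMINFO prints it in the form "System Up Time: N Days, N Hours, N Minutes, N Seconds" as on Windows XP; the function names and the 7-day default are mine.

```python
import re
from typing import Optional

# Assumed line format: "System Up Time:   8 Days, 2 Hours, 5 Minutes, 10 Seconds"
_UPTIME = re.compile(
    r"System Up Time:\s*(\d+) Days?, (\d+) Hours?, (\d+) Minutes?, (\d+) Seconds?",
    re.IGNORECASE,
)


def uptime_days(systeminfo_output: str) -> Optional[float]:
    """Return uptime in days (fractional), or None if the line is missing."""
    match = _UPTIME.search(systeminfo_output)
    if match is None:
        return None
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return days + hours / 24 + minutes / 1440 + seconds / 86400


def due_for_reboot(systeminfo_output: str, min_days: float = 7) -> bool:
    """True only when the uptime line was found and is at least min_days."""
    days = uptime_days(systeminfo_output)
    return days is not None and days >= min_days
```

If the uptime line can't be found, due_for_reboot returns False, so a machine that doesn't answer is left alone rather than rebooted blindly.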
All that's left is to make a scheduled task for the script (and pipe the output to a log file if you're interested). It does a very simple specific task, and I'm pretty happy with the result. I'd love to hear feedback if anyone else tries this.

Tuesday, November 06, 2007

Buying stuff

We made the decision for our new iSCSI SAN: an EqualLogic PS3700X. It has 16 400GB 10k rpm SAS drives, for a total of 4.8 TB usable space. We were planning on purchasing a PS3600X (300GB drives), and I'm glad we decided to wait as long as possible for newer products. For the extra 1.4 TB we only paid about $1k more. We are also looking at purchasing a PS83E for off-site replication, but that's not a definite yet.

Yesterday there was an announcement that Dell was purchasing EqualLogic. I've heard a few people's opinions on this, and everyone for the most part seems to be apathetic. I don't have the impression that Dell will screw up what a great company EqualLogic is, and I'm also not worried about support. From my perspective, everything should stay the same. Marc Farley from EqualLogic is keeping a good summary of the reactions from the internet over at his Storage @ Work blog. On a side note, our LeftHand salesman (whom I had recently informed that they lost to EqualLogic) was emailing me this morning trying to use the situation to their benefit. Nice.

We are also in the process of upgrading our network infrastructure. We are purchasing a Cisco 4510 with 6 48-port/gigabit/PoE cards and 2 48-port/gigabit cards. We are also purchasing 3 Cisco Catalyst 3560G-48PS switches for our IDF closets, and a 3560G-48TS for our iSCSI SAN.

One of the things we wanted to implement was a way to dynamically assign VLANs based on MAC address. I've already written a script that is currently storing all of our MAC addresses in a SQL database when users log in. There are a couple different technologies that will accomplish this (VMPS, URT, 802.1x), but we haven't picked one yet. It seems that Cisco is trying to phase out VMPS in favor of 802.1x, and I can't find a lot of information on the internet regarding VMPS usage. URT is in the same boat, and one guy on the internet seems to think they were both "Cisco trying to figure out which direction to go in". 802.1x seems great, but it's more than we really need. It dynamically assigns VLANs as an afterthought; its main purpose is port-security regulated by user authentication. That's great, but what about our 40 or so network printers that don't support 802.1x? What about our thin clients that need to PXE boot? So yeah, I'm still researching other possibilities. When it's done, I'll be sure to post a detailed explanation of how we did it, because I can't really find anything that is straightforward and gives a complete picture (most miss the section about how to get MAC addresses into the actual database).

Wednesday, October 17, 2007

SQL server down, detectives continue the search for answers

I woke up this morning to find our database server down (which took VirtualCenter with it). This is a production SQL server, and our accounting software depends on it.

I logged in directly to the ESX host using the VirtualCenter Client. The VM was still running, but I couldn't do anything on the console. I saw the screensaver but it wasn't moving, so I reset the VM, and it started up fine.

The server went down at 6:21 and was down for about 45 min until I caught it.

The ESX performance logs showed normal (i.e. little or no) disk or network activity. However, memory usage dropped to zero, and the SMP processors (2 of them) were at ~2.66 GHz (max) and ~1.7 GHz for those 45 min (total average was ~65%).

Windows event logs show nothing, except that the shutdown was unexpected.

No other VM was affected. So it was either an error on the ESX process hosting the VM, or the OS itself.

Unique facts about this VM:

  • It was switched from 4 to 2 processors at one point due to SQL licensing (over a month ago). I researched this and assumed that the NT multiprocessor kernel wouldn't be affected unless we went to 1.
  • Hourly (for our purposes 6:00 am), SQL full backups occur. They create about 2.5 gigs worth of data that takes about 2 min. Should have been done long before the server crashed.
  • I have VCplus (http://www.run-virtual.com/?page_id=184) running on our VirtualCenter server. It was using a domain admin login for DB access during testing, so it had rights to break something if it wanted I suppose.
  • The SQL services are running under different logins than default. We were trying to get SCE 2007 to work with our SQL server about a month ago, and they suggested trying this.
      • SQL Server Integration Services: NT AUTHORITY\NetworkService
      • SQL Server FullText Search: LocalSystem
      • SQL Server: LocalSystem
      • SQL Server Analysis Services: LocalSystem
      • SQL Server Reporting Services: LocalSystem
      • SQL Server Browser: LocalSystem
      • SQL Server Agent: LocalSystem
I'm going to rebuild it this weekend just to be safe, considering the CPU went crazy, and we changed the processor count after the kernel was installed...

I've posted a question on the VMTN forums, so hopefully someone has some insight.

Tuesday, October 09, 2007

Thin clients and virtual desktops

We started testing virtual desktops for our company. It all originally started as a joke, saying we already had ESX server, why not throw all of our desktops on it as well. Then one of the co-founders of the company asked my boss about it, saying he read about it somewhere and it seemed neat. Then we were planning on virtualizing all of our desktops. But a lot of users need CD drives. So as of this writing, we are planning on using virtual desktops for any contractors or new users that do not require a CD drive.

So, for the thin client, we are going to be reusing old PCs where possible, or buying an HP t5135 for new hires. I couldn't get my hands on any version of VMware's VDM 2 Beta from their VDI initiative, and I'm not sure if that will include a thin client when it's released towards the end of this year. So for now, we are using 2X's ThinClientServer, along with their ThinClientOS.

We first installed ThinClientServer on one of our Windows VMs. It does a couple things. First, when your thin client boots up, it acts as a DHCP helper to listen for DHCP requests, and returns a TFTP server and boot image filename. It also includes the TFTP server that hosts the ThinClientOS. Then, after the OS boots and the user tries to log in, it makes decisions based on the login as to which machine to redirect the user to.

I had no problem getting the ThinClientOS up and running on our test HP t5135. But getting it to work on our PCs is a different story. I first tried it on my PC, a relatively new HP dc5700, and it didn't recognize my NIC (after it successfully downloaded the boot image). I then tried it on a much older PC that one of our scanners is hooked up to, and it couldn't detect the video card properly (while I was looking at the error message). So I'll spend a lot of time troubleshooting these driver issues if we go live using ThinClientOS.
I'm looking forward to seeing exactly how VMware's VDM 2 is going to fit into all of this when it gets released in December. I know that they are going to do the same things as ThinClientServer (connection broker), but I'm not sure about a PXE boot image (but I would imagine they would have to include this in order to compete).

Friday, September 28, 2007

Windows Server 2008 Core not what I had expected.

I decided to poke around the "Core" mode of Windows Server 2008 RC0 this morning. I had expected the graphical installer to finish, and then to reboot into a command line similar to Linux. Nope. It boots like Windows, but without the shell. So you just see the background and a command prompt window.

The resolution is set at 800x600 at this point. I decided to see if VMware Tools would work. It didn't automatically launch, which I didn't expect it to anyway. I switched to the CD drive and ran the installer manually.
The install gave me a warning about my help files being out of date or something, so I just hit "No" on that dialog and everything else seemed to work fine, and I rebooted. Then when Windows came back up, I had a 640x480 resolution. Yay. I imagine I could change this through the registry or something but I haven't looked into it too deep.

Next step is to get the network up and running. It's still too new to join to my domain, so I'll skip that part. I used this guide for instruction.

netsh interface ipv4 show interfaces
netsh interface ipv4 set address name="2" source=static address=192.168.1.223 mask=255.255.255.0 gateway=192.168.1.1
netsh interface ipv4 add dnsserver name="2" address=192.168.1.5 index=1
netsh interface ipv4 add dnsserver name="2" address=192.168.1.6 index=2
netdom renamecomputer %computername% /newname:W2K82


According to the manual, you can use the Core installation for the following:

  • Active Directory Domain Services (AD DS)
  • Active Directory Lightweight Directory Services (AD LDS)
  • DHCP Server
  • DNS Server
  • File Services
  • Print Services
  • Streaming Media Services
As of now we plan to use it for our 2 AD/DNS/DHCP servers, and our file server. It's a pain to set up, but still easier than Linux. And for these services we won't be touching the machines at the console level very much.

Thursday, September 20, 2007

NTP Time synchronization for Windows domains and ESX Server


We ran into a problem last week that our phone system was out of sync with the time on our computers, and I was asked to fix it. Unfortunately I don't have access to the inner workings of our phone system, but here's how to do it on VMware ESX Server and a Windows 2003 domain (probably Windows 2000 too). Our clients are all Windows XP.

I chose to use the NTP pool from pool.ntp.org. It does a DNS round-robin to a list of donated servers. Most of them are web or DNS servers that also act as time servers. We use 3 different pool hostnames (0, 1, and 2) in case we happen to be given a bad server, and we append "us" to the FQDN so we only get US servers (visit pool.ntp.org to look up other countries):

0.us.pool.ntp.org
1.us.pool.ntp.org
2.us.pool.ntp.org
For your Windows domain, you need to do the following...

On your Windows domain controllers:
net time /setsntp:"0.us.pool.ntp.org 1.us.pool.ntp.org 2.us.pool.ntp.org"
For your Windows clients, they will typically get their time info from the PDC. But just to be sure, create or edit an existing GPO that is applied to all of your workstations and servers. You can use the "Default Domain Policy" if you like:

Open "Computer Configuration > Administrative Templates > System > Windows Time Service > Time Providers".
Set "Enable Windows NTP Client" to Enabled.
Open the properties for "Configure Windows NTP Client". I set the following:
NtpServer = (Set to your domain name, which will direct it to one of your domain controllers)
Type = NT5DS
CrossSiteSyncFlags = 2
ResolvePeerBackoffMinutes = 15
ResolvePeerBackoffMaxTimes = 7
SpecialPollInterval = 900 (I set this to 15 minutes, but the default might be better for larger environments)
EventLogFlags = 0
After making the GPO changes, you can apply it to a computer by issuing "gpupdate /force", or just give it a few hours or so.

On the ESX Server, in the service console, I used root privileges (su -). You can use this handy script by VMColonel, or do the following manually...

Open /etc/ntp.conf with your favorite text editor, and make it look like this:
restrict 127.0.0.1
restrict default kod nomodify notrap
server 0.us.pool.ntp.org
server 1.us.pool.ntp.org
server 2.us.pool.ntp.org
driftfile /var/lib/ntp/drift
And then open /etc/ntp/step-tickers and do the same:
0.us.pool.ntp.org
1.us.pool.ntp.org
2.us.pool.ntp.org
Then run these commands:
esxcfg-firewall --enableService ntpClient
service ntpd restart
chkconfig --level 345 ntpd on
hwclock --systohc
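Since ntp.conf and step-tickers share the same server list, both can be generated from one place. A small sketch that reproduces the two files shown above; the function names are hypothetical and nothing here writes to disk or touches the service.

```python
from typing import List


def pool_servers(country: str = "us", count: int = 3) -> List[str]:
    """Build pool hostnames such as 0.us.pool.ntp.org, 1.us.pool.ntp.org, ..."""
    return ["%d.%s.pool.ntp.org" % (i, country) for i in range(count)]


def render_ntp_conf(servers: List[str]) -> str:
    """Return the text of /etc/ntp.conf in the layout used in this post."""
    lines = ["restrict 127.0.0.1", "restrict default kod nomodify notrap"]
    lines += ["server %s" % server for server in servers]
    lines.append("driftfile /var/lib/ntp/drift")
    return "\n".join(lines) + "\n"


def render_step_tickers(servers: List[str]) -> str:
    """Return the text of /etc/ntp/step-tickers: one hostname per line."""
    return "\n".join(servers) + "\n"
```

Calling pool_servers("de") would give the German pool names instead, which is the only thing that should need changing for another country.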
And that's pretty much it. To see the offset between your computer and the timeservers, you can issue these commands...

ESX Server (and most Linux distros):
watch "ntpq -p"
On any Windows 2003/XP machine:
w32tm /stripchart /computer:pool.ntp.org
You might need to set your Command Prompt window width to 100 for proper display.
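For the curious, the offset these tools report comes from four timestamps per exchange: when the client sent the request, when the server received it, when the server replied, and when the client got the reply. Here is the standard NTP calculation as a sketch; the parameter names t1 to t4 are just labels I picked, and all values are in seconds.

```python
from typing import Tuple


def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Compute clock offset and round-trip delay from one NTP exchange.

    t1: client clock when the request left
    t2: server clock when the request arrived
    t3: server clock when the reply left
    t4: client clock when the reply arrived
    A positive offset means the client clock is behind the server.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```

The calculation assumes the network delay is about the same in both directions, which is one reason to prefer time servers that are close by.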

All that's left is to get our phone system synced up to the same servers...

Tuesday, September 18, 2007

EqualLogic performance confusion

I spoke with our EqualLogic sales rep yesterday, and they are suggesting we go with a PS300E. It has 7,200 rpm drives. Not 10k. We always avoided 7,200 drives like the plague. But here is EqualLogic, one of the biggest players in the iSCSI industry, telling us we should put their 7,200 rpm drives in. Ok...

We had originally planned on buying a PS3900XV, which has 15k drives. But now it's between the PS3600X (10k drives) and one of the 7,200 rpm models. They are all the same box, but with different drive sizes. Right now it looks like the 7TB raw model (PS300E) is the one we should buy, but we might splurge and get something larger.

But back to the performance...7,200 rpm! I had to ask around, so I threw a post up on to the VMTN forums. And here's what I have so far:

  • joergriether tested a PS300E and a PS3900XV, and says they were about the same...
  • Yps has 90 VMs running on one PS400E, and has no trouble with it...
  • christianZ pointed me over to the thread he had started for SAN comparison, and also suggested the PS3600X over the PS3900XV. Not sure if he meant over the PS300E also...
So right now it looks like the PS300E is going to be what we go with. Sigh...

Monday, September 17, 2007

Configuring LUNs for virtual machines

Two nice folks responded to my post on the VMTN forums regarding how to set up VMFS partitions and LUNs for virtual machines on our SAN.

Before I talk about all of this, Stephan Stelter from LeftHand Networks also included a link about "How to Align Exchange I/O with Storage Track Boundaries". Good to know...

But anyway, this is apparently what most people do:

  • Create VMFS partitions for system drives, and place virtual machines that you would back up together on the same LUN.
  • For data drives, map to the LUN directly with the OS iSCSI initiator.
I haven't decided if we should place all our system drives on a single LUN or break them up into groups. I think I want to place domain controllers on their own LUN so I can restore our entire AD without the possibility of a USN rollback. And all our other servers don't really fall into groups... if I start to think about splitting them up, it gets to the point where I might as well make a LUN for each of them. And from what I've read so far, performance isn't really a problem if you have less than 16 VMs on a LUN.

So with all the system drives (except AD) on a single LUN, restores become a little more complicated. Not too hard.. just harder than using individual LUNs. We would need to bring up a snapshot on a new LUN, mount the LUN on ESX server (or maybe somewhere else), and copy the VMDK of the machine we're restoring over to the production LUN.

If there is a LUN for each system drive, restoring from a snapshot would just require a click on the EqualLogic array's web interface, and everything would be back.

So what are we going to do? I still don't know yet. And there's plenty of time to decide, and neither way is the "wrong way" (nor the "right" way). So we'll see...

Friday, September 14, 2007

EqualLogic wants to sell us stuff

EqualLogic invited themselves in to our conference room yesterday to show us one of their iSCSI SAN arrays. All I can say is...wow. If you're looking at a SAN solution, definitely consider EqualLogic. And they will jump at the chance to give you a 2-hour in-house demo.

But regarding my post the other day...boy was I way off. Apparently I didn't understand storage virtualization. I was debating how to split the drives up, half RAID 5, half RAID 10, or all of it RAID 10...

EqualLogic introduced me to RAID 50. It requires a minimum of 6 drives not including a spare, so I'm not surprised I never came across it before. It basically takes 2 RAID 5 sets and stripes between them (RAID 0).

The EqualLogic PS3900XV has 16 300 GB drives. 2 RAID 5 sets is 8+8. A spare on each leaves us with 7+7. And one drive's worth of parity per set brings us down to 6+6. So out of the 16 drives, we get to use about 12 of them space-wise (12 x 300 GB = 3.6 TB), at least 3.4 TB once overhead is taken out. EqualLogic also reserves some space on each drive for housekeeping, but this shouldn't end up being more than 200 gig or so (the tech couldn't remember if it was 20 megs or 20 gigs reserved per drive).

RAID 5 would have given us 4.5 TB but would have been pretty slow on write speed. And RAID 10 would have given us 2.4 TB. So 3.4 is pretty good...
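The drive math above is easy to check. This sketch counts usable drives for each layout using the same assumptions as the post (two RAID 5 sets with one spare each for RAID 50, no spares for plain RAID 5 or RAID 10, 300 GB drives, 1 TB = 1000 GB); the function names are mine and the housekeeping reserve is left out.

```python
def raid50_usable_drives(total: int = 16, sets: int = 2, spares_per_set: int = 1) -> int:
    """Drives' worth of usable space in RAID 50: per set, minus spare, minus parity."""
    per_set = total // sets
    data_per_set = per_set - spares_per_set - 1  # one drive's worth of parity per RAID 5 set
    return data_per_set * sets


def raid5_usable_drives(total: int = 16) -> int:
    """Single RAID 5 set with no spare: one drive's worth of parity."""
    return total - 1


def raid10_usable_drives(total: int = 16) -> int:
    """RAID 10 with no spare: half the drives hold mirror copies."""
    return total // 2


def capacity_tb(drives: int, drive_gb: int = 300) -> float:
    """Convert a drive count to decimal terabytes."""
    return drives * drive_gb / 1000
```

With the defaults this gives 12, 15 and 8 usable drives, or 3.6, 4.5 and 2.4 TB; taking roughly 200 GB of reserve off the RAID 50 figure lands on the 3.4 TB quoted above.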

And even better is the fact that it's all virtual. We make a LUN for our file server, one for Exchange, one for SQL, etc. But we can grow these LUNs at any time, so instead of saying the file server gets 2 TB and Exchange gets 600 GB, we can just set them to about 150% of what they are right now and then expand them as they need it.

My only question now, and one that EqualLogic couldn't answer for me, was do we put all of our system drive VMDKs on a single LUN or make a separate LUN for each server? I've posted a question on the VMware forums, and I'm sure a lot of people thought about the same thing while setting up VI3 and a SAN.

Thursday, September 13, 2007

VMware demos Continuous Availability @ VMworld

VMware HA (High Availability) is one of the main benefits of moving to a virtualized environment. In a typical environment, the VM's disks reside on shared storage. 2 or more ESX servers monitor each other, and if there is a hardware failure on an ESX box, another ESX server can take over and start the machine back up. This is great, because it typically means that downtime for applications is reduced to the amount of time it takes the OS to boot.

I've been watching Scott Lowe's blog as he liveblogs today's VMworld keynote, and apparently they showed something called Continuous Availability.

Continuous Availability actually keeps a running copy of the same VM on a second ESX server. The secondary VM is in a standby state, and the execution stream from the active VM is constantly replicated over to the secondary. If a failure is detected on the primary VM, the secondary takes over with almost no downtime at all.

I can't wait for this to be implemented as a feature in ESX (maybe 3.5?). It takes availability and disaster recovery to a new level, and further justifies our move to a virtual infrastructure.

Wednesday, September 12, 2007

HP 4x4 virtualization beast

HP announced their DL 580 G5, a quad-socket, quad-core server, with 16 drives and 128 GB of memory.

Could be in our future...

Tuesday, September 11, 2007

VMware announces ESX Server 3i

VMware announced ESX Server 3i today, and it's basically an embedded version of ESX server. We just purchased an 8-core HP server with 16 gigs of memory for our first ESX box. We were planning on boosting it to 32 gigs and purchasing a 2nd ESX server along with an EqualLogic iSCSI SAN. After today's announcement, I think we will wait to see what HP's ESX 3i offering looks like. I'm hoping that it will lower our acquisition cost when we do make our 2nd ESX purchase.

And I'm grateful I had a chance to use ESX 3.0.2 in the meantime. I've learned a lot from setting it up, and can now say that I am comfortable in a Linux environment. I can't imagine learning as much about VMware from an appliance rather than actually building the box from scratch. But the simplified installation and upgrades should be nice. Plus there's less shit to break.

LeftHand introduces iSCSI target appliance for ESX server

LeftHand Networks recently announced a software iSCSI target for ESX Server. It basically lets you reclaim all your unused disk space in your ESX box, and stick it on your SAN. I'm excited about this: we maxed out our ESX box to hold us over until we purchase our SAN, and this is a perfect way to use all that space. I'm thinking VM backups in case we lose the SAN for some reason. Or maybe IT storage (ISOs, Ghost images). We'll evaluate our needs after our EqualLogic box gets here and the local drives on the ESX server are emptied.