Wednesday, October 17, 2007

SQL server down, detectives continue the search for answers

I woke up this morning to find our database server down (which took VirtualCenter with it). This is a production SQL server, and our accounting software depends on it.

I logged in directly to the ESX host using the VirtualCenter client. The VM was still running, but the console was unresponsive: the screensaver was up but frozen. So I reset the VM, and it started up fine.

The server went down at 6:21 am and was down for about 45 minutes before I caught it.

The ESX performance logs showed normal (i.e., little or no) disk and network activity. However, memory usage dropped to zero, and the VM's two virtual processors sat at ~2.66 GHz (the maximum) and ~1.7 GHz for those 45 minutes (overall CPU average was ~65%).

Windows event logs show nothing, except that the shutdown was unexpected.

No other VM was affected, so it was either a problem in the ESX process hosting the VM or in the guest OS itself.

Unique facts about this VM:

  • It was switched from 4 to 2 processors (over a month ago) due to SQL licensing. I researched this and assumed the NT multiprocessor kernel wouldn't be affected unless we dropped to 1.
  • SQL full backups run hourly, which for our purposes means one started at 6:00 am. Each creates about 2.5 GB of data and takes about 2 minutes, so it should have finished long before the server crashed.
  • I have VCplus (http://www.run-virtual.com/?page_id=184) running on our VirtualCenter server. It was using a domain admin login for DB access during testing, so I suppose it had the rights to break something if it wanted to.
  • The SQL services are running under logins other than the defaults. We were trying to get SCE 2007 to work with our SQL server about a month ago, and they suggested trying this. The current assignments (a quick command-line check follows the list):

SQL Server Integration Services    NT AUTHORITY\NetworkService
SQL Server FullText Search         LocalSystem
SQL Server                         LocalSystem
SQL Server Analysis Services       LocalSystem
SQL Server Reporting Services      LocalSystem
SQL Server Browser                 LocalSystem
SQL Server Agent                   LocalSystem
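
If you ever want to double-check which account a service runs under from the command line, sc will show it as SERVICE_START_NAME (the service names below are the defaults for a default instance; a named instance will differ):
sc qc MSSQLSERVER
sc qc SQLSERVERAGENT
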
I'm going to rebuild it this weekend just to be safe, considering the CPU went crazy, and we changed the processor count after the kernel was installed...

I've posted a question on the VMTN forums, so hopefully someone has some insight.

Tuesday, October 09, 2007

Thin clients and virtual desktops

We started testing virtual desktops for our company. It originally started as a joke: we already had ESX Server, so why not throw all of our desktops on it as well? Then one of the co-founders of the company asked my boss about it, saying he had read about it somewhere and it seemed neat. Suddenly we were planning to virtualize all of our desktops. But a lot of users need CD drives, so as of this writing, we are planning to use virtual desktops for contractors and new users who do not require a CD drive.

So, for the thin client, we are going to reuse old PCs where possible, or buy an HP t5135 for new hires. I couldn't get my hands on any version of VMware's VDM 2 beta from their VDI initiative, and I'm not sure whether it will include a thin client when it's released toward the end of this year. So for now, we are using 2X's ThinClientServer along with their ThinClientOS.

We first installed ThinClientServer on one of our Windows VMs. It does a couple of things. First, when a thin client boots, it acts as a DHCP helper: it listens for DHCP requests and returns a TFTP server address and a boot image filename. It also includes the TFTP server that hosts the ThinClientOS image. Then, after the OS boots and the user tries to log in, it decides, based on the login, which machine to redirect the user to.
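
2X's server wraps all of this up for you, but underneath it's the standard PXE handshake. Purely as an illustration (the addresses and filename here are hypothetical, not what ThinClientServer actually serves), the equivalent ISC dhcpd config would look something like this:
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.150;
    next-server 192.168.1.10;   # TFTP server hosting the boot image
    filename "pxelinux.0";      # boot file name handed back to the client
}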

I had no problem getting ThinClientOS up and running on our test HP t5135, but getting it to work on our PCs is a different story. I first tried it on my PC, a relatively new HP dc5700, and it didn't recognize my NIC (after it had successfully downloaded the boot image). I then tried it on a much older PC that one of our scanners is hooked up to, and it couldn't detect the video card properly (which I could see for myself while looking at the error message). So I'll be spending a lot of time troubleshooting driver issues if we go live with ThinClientOS.

I'm looking forward to seeing exactly how VMware's VDM 2 will fit into all of this when it's released in December. I know it will do the same job as ThinClientServer (connection brokering), but I'm not sure about a PXE boot image (though I would imagine they'll have to include one in order to compete).

Friday, September 28, 2007

Windows Server 2008 Core not what I had expected.

I decided to poke around the "Core" mode of Windows Server 2008 RC0 this morning. I had expected the graphical installer to finish and then reboot into a command line, similar to Linux. Nope. It boots like Windows, just without the Explorer shell, so you see only the desktop background and a command prompt window.

The resolution is set at 800x600 at this point. I decided to see if VMware Tools would work. It didn't launch automatically, which I didn't expect it to anyway, so I switched to the CD drive and ran the installer manually.
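
Assuming the Tools CD mounts as D: with setup.exe at its root (check your own drive letter), that amounts to:
d:
setup.exe
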
The installer gave me a warning about my help files being out of date or something, so I just hit "No" on that dialog; everything else seemed to work fine, and I rebooted. When Windows came back up, I had 640x480 resolution. Yay. I imagine I could change this through the registry or something, but I haven't looked into it too deeply.

The next step is to get the network up and running. The OS is still too new to join to my domain, so I'll skip that part. I used this guide for instruction.

rem List the NICs; the Idx value becomes the interface name below
netsh interface ipv4 show interfaces
rem Assign a static address, mask, and gateway to interface index 2
netsh interface ipv4 set address name="2" source=static address=192.168.1.223 mask=255.255.255.0 gateway=192.168.1.1
rem Primary and secondary DNS servers
netsh interface ipv4 add dnsserver name="2" address=192.168.1.5 index=1
netsh interface ipv4 add dnsserver name="2" address=192.168.1.6 index=2
rem Rename the computer (takes effect after a reboot)
netdom renamecomputer %computername% /newname:W2K82


According to the manual, you can use the Core installation for the following:

  • Active Directory Domain Services (AD DS)
  • Active Directory Lightweight Directory Services (AD LDS)
  • DHCP Server
  • DNS Server
  • File Services
  • Print Services
  • Streaming Media Services
As of now we plan to use it for our 2 AD/DNS/DHCP servers and our file server. It's a pain to set up, but still easier than Linux. And for these services we won't be touching the machines at the console very much anyway.
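
I haven't gotten as far as installing roles yet, but from the documentation it's done with ocsetup from the command line; for example, the DNS Server role (package name taken from the docs, and it's case-sensitive):
start /w ocsetup DNS-Server-Core-Role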

Thursday, September 20, 2007

NTP Time synchronization for Windows domains and ESX Server


We ran into a problem last week: our phone system was out of sync with the time on our computers, and I was asked to fix it. Unfortunately I don't have access to the inner workings of our phone system, but here's how to set up time sync on VMware ESX Server and a Windows 2003 domain (probably Windows 2000 too). Our clients are all Windows XP.

I chose to use the NTP pool at pool.ntp.org. It does a DNS round-robin across a list of donated servers, most of which are web or DNS servers that also act as time servers. We use 3 different pool hostnames (0, 1, and 2) in case any one of them hands us a bad server, and we add "us" to the name so we only get US servers (visit pool.ntp.org to look up other countries):

0.us.pool.ntp.org
1.us.pool.ntp.org
2.us.pool.ntp.org
For your Windows domain, you need to do the following...

On your Windows domain controllers (strictly speaking, only the one holding the PDC emulator role needs this, since the rest of the domain syncs from it):
net time /setsntp:"0.us.pool.ntp.org 1.us.pool.ntp.org 2.us.pool.ntp.org"
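
"net time /setsntp" is the legacy way of doing this; on Windows 2003 you can accomplish the same thing with w32tm, which also lets you force an immediate sync:
w32tm /config /manualpeerlist:"0.us.pool.ntp.org 1.us.pool.ntp.org 2.us.pool.ntp.org" /syncfromflags:manual /update
w32tm /resync
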
Your Windows clients will typically get their time from the PDC automatically. But just to be sure, create a GPO (or edit an existing one) that applies to all of your workstations and servers. You can use the "Default Domain Policy" if you like:

Open "Computer Settings > Administrative Templates > System > Windows Time Service > Time Providers".
Set "Enable Windows NTP Client" to Enabled.
Open the properties for "Configure Windows NTP Client". I set the following:
NtpServer = (set to your domain name, which will direct it to one of your domain controllers)
Type = NT5DS
CrossSiteSyncFlags = 2
ResolvePeerBackoffMinutes = 15
ResolvePeerBackoffMaxTimes = 7
SpecialPollInterval = 900 (I set this to 15 minutes, but the default might be better for larger environments)
EventLogFlags = 0
After making the GPO changes, you can apply them to a computer immediately by issuing "gpupdate /force", or just give it a few hours.
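
To confirm a client picked up the new settings, force the policy refresh and a resync, then check the System event log for W32Time events:
gpupdate /force
w32tm /resync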

On the ESX Server, in the service console, I used root privileges (su -). You can use this handy script by VMColonel, or do the following manually...

Open /etc/ntp.conf with your favorite text editor, and make it look like this:
restrict 127.0.0.1                    # full access for localhost
restrict default kod nomodify notrap  # everyone else: no config changes, no traps, kiss-of-death for abusers
server 0.us.pool.ntp.org
server 1.us.pool.ntp.org
server 2.us.pool.ntp.org
driftfile /var/lib/ntp/drift          # tracks the local clock's frequency drift
And then open /etc/ntp/step-tickers and do the same:
0.us.pool.ntp.org
1.us.pool.ntp.org
2.us.pool.ntp.org
Then run these commands:
esxcfg-firewall --enableService ntpClient   # open the ESX firewall for outbound NTP
service ntpd restart                        # restart ntpd with the new config
chkconfig --level 345 ntpd on               # make ntpd start at boot
hwclock --systohc                           # write the corrected time to the hardware clock
And that's pretty much it. To see the offset between your computer and the timeservers, you can issue these commands...

ESX Server (and most Linux distros):
watch "ntpq -p"
On any Windows 2003/XP machine:
w32tm /stripchart /computer:pool.ntp.org
You might need to set your Command Prompt window width to 100 for proper display.

All that's left is to get our phone system synced up to the same servers...

Tuesday, September 18, 2007

EqualLogic performance confusion

I spoke with our EqualLogic sales rep yesterday, and they are suggesting we go with a PS300E. It has 7,200 rpm drives. Not 10k. We've always avoided 7,200 rpm drives like the plague. But here is EqualLogic, one of the biggest players in the iSCSI industry, telling us we should put their 7,200 rpm drives in. Ok...

We had originally planned on buying a PS3900XV, which has 15k drives. But now it's between the PS3600X (10k drives) and one of the 7,200 rpm models. They are all the same box, just with different drives. Right now it looks like the 7 TB raw model (PS300E) is the one we should buy, but we might splurge and get something larger.

But back to the performance...7,200 rpm! I had to ask around, so I threw a post up on to the VMTN forums. And here's what I have so far:

  • joergriether tested a PS300E and a PS3900XV, and says they were about the same...
  • Yps has 90 VMs running on one PS400E, and has no trouble with it...
  • christianZ pointed me over to the thread he had started for SAN comparison, and also suggested the PS3600X over the PS3900XV. Not sure if he meant over the PS300E also...
So right now it looks like the PS300E is going to be what we go with. Sigh...

Monday, September 17, 2007

Configuring LUNs for virtual machines

Two nice folks responded to my post on the VMTN forums regarding how to set up VMFS partitions and LUNs for virtual machines on our SAN.

Before I talk about all of this, Stephan Stelter from LeftHand Networks also included a link about "How to Align Exchange I/O with Storage Track Boundaries". Good to know...

But anyway, this is apparently what most people do:

  • Create VMFS partitions for system drives, and place virtual machines that you would back up together on the same LUN.
  • For data drives, map to the LUN directly with the OS iSCSI initiator.
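
For the second bullet, on our Windows guests that means the Microsoft iSCSI Initiator inside the VM. A rough command-line sketch (the portal IP and target IQN are made up for illustration; the GUI does the same job):
iscsicli QAddTargetPortal 192.168.20.10
iscsicli ListTargets
iscsicli QLoginTarget iqn.2001-05.com.equallogic:0-8a0906-datavol01
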
I haven't decided if we should place all our system drives on a single LUN or break them up into groups. I think I want to place the domain controllers on their own LUN so I can restore our entire AD without the possibility of a USN rollback. And our other servers don't really fall into groups... if I start to think about splitting them up, it gets to the point where I might as well make a LUN for each of them. And from what I've read so far, performance isn't really a problem with fewer than 16 VMs on a LUN.

So with all the system drives (except AD) on a single LUN, restores become a little more complicated. Not too hard... just harder than with individual LUNs. We would need to bring up a snapshot on a new LUN, mount that LUN on the ESX server (or maybe somewhere else), and copy the VMDK of the machine we're restoring over to the production LUN.
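
From the ESX service console, that copy step boils down to a vmkfstools clone (the volume and VM names here are made up for illustration):
vmkfstools -i /vmfs/volumes/snap-lun/sqlvm/sqlvm.vmdk /vmfs/volumes/prod-lun/sqlvm/sqlvm.vmdk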

If there is a LUN for each system drive, restoring from a snapshot would just require a click on the EqualLogic array's web interface, and everything would be back.

So what are we going to do? I still don't know yet. And there's plenty of time to decide, and neither way is the "wrong way" (nor the "right" way). So we'll see...

Friday, September 14, 2007

EqualLogic wants to sell us stuff

EqualLogic invited themselves into our conference room yesterday to show us one of their iSCSI SAN arrays. All I can say is...wow. If you're looking at a SAN solution, definitely consider EqualLogic. And they will jump at the chance to give you a 2-hour in-house demo.

But regarding my post the other day...boy, was I way off. Apparently I didn't understand storage virtualization. I was debating how to split the drives up: half RAID 5 and half RAID 10, or all of it RAID 10...

EqualLogic introduced me to RAID 50. It requires a minimum of 6 drives, not counting spares, so I'm not surprised I never came across it before. It basically takes 2 RAID 5 sets and stripes across them (RAID 0).

The EqualLogic PS3900XV has 16 300 GB drives. Two RAID 5 sets means 8+8. A spare in each set leaves us with 7+7, and a drive's worth of parity in each brings us down to 6+6. So out of the 16 drives, we get to use about 12 of them space-wise, at least 3.4 TB. EqualLogic also reserves some space on each drive for housekeeping, but this shouldn't end up being more than 200 GB or so (the tech couldn't remember if it was 20 MB or 20 GB reserved per drive).

RAID 5 would have given us 4.5 TB but would have been pretty slow on writes, and RAID 10 would have given us 2.4 TB. So 3.4 TB is pretty good...
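
For anyone checking my math (300 GB drives, before EqualLogic's housekeeping reserve; the RAID 5 and RAID 10 figures don't subtract spares, which is how they were quoted to us):
RAID 50: 16 drives - 2 spares - 2 parity = 12 x 300 GB = 3.6 TB raw (~3.4 TB usable)
RAID 5:  16 drives - 1 parity = 15 x 300 GB = 4.5 TB
RAID 10: 16 drives / 2 (mirrored) = 8 x 300 GB = 2.4 TB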

And even better is the fact that it's all virtual. We make a LUN for our file server, one for Exchange, one for SQL, etc. But we can grow these LUNs at any time, so instead of deciding up front that the file server gets 2 TB and Exchange gets 600 GB, we can just set them to about 150% of their current sizes and expand them as needed.
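
One catch to remember: growing the LUN on the array is only half the job, because the volume inside Windows has to be extended too. On Windows 2003 that's diskpart (get the volume number from "list volume"; this works on basic data volumes, not the system volume):
diskpart
list volume
select volume 2
extend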

My only question now, and one that EqualLogic couldn't answer for me, is whether we put all of our system drive VMDKs on a single LUN or make a separate LUN for each server. I've posted a question on the VMware forums, and I'm sure a lot of people have thought about the same thing while setting up VI3 and a SAN.