Wednesday, October 17, 2007

SQL server down, detectives continue the search for answers

I woke up this morning to find our database server down (which took VirtualCenter with it). This is a production SQL server, and our accounting software depends on it.

I logged in directly to the ESX host using the VirtualCenter Client. The VM was still running, but I couldn't do anything on the console. I saw the screensaver but it wasn't moving, so I reset the VM, and it started up fine.

The server went down at 6:21 and was down for about 45 min until I caught it.

The ESX performance logs showed normal (i.e., little or no) disk and network activity. However, memory usage dropped to zero, and the two SMP processors sat at ~2.66 GHz (the maximum) and ~1.7 GHz for those 45 minutes (total average was ~65%).

Windows event logs show nothing, except that the shutdown was unexpected.

No other VM was affected, so it was either an error in the ESX process hosting the VM or a problem in the guest OS itself.

Unique facts about this VM:

  • It was switched from 4 to 2 processors at one point due to SQL licensing (over a month ago). I researched this and assumed that the NT multiprocessor kernel wouldn't be affected unless we went down to 1.
  • Full SQL backups run hourly; the relevant one here started at 6:00 am. Each backup creates about 2.5 GB of data and takes about 2 minutes, so it should have finished long before the server crashed.
  • I have VCplus (http://www.run-virtual.com/?page_id=184) running on our VirtualCenter server. It was using a domain admin login for DB access during testing, so I suppose it had the rights to break something.
  • The SQL services are running under different logins than default. We were trying to get SCE 2007 to work with our SQL server about a month ago, and they suggested trying this.
      SQL Server Integration Services: NT AUTHORITY\NetworkService
      SQL Server FullText Search: LocalSystem
      SQL Server: LocalSystem
      SQL Server Analysis Services: LocalSystem
      SQL Server Reporting Services: LocalSystem
      SQL Server Browser: LocalSystem
      SQL Server Agent: LocalSystem
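
To sanity-check the claim that the backup was finished well before the hang, here is a rough timeline sketch in Python. It only uses the approximate times quoted in this post; it assumes 6:21 is a morning time like the 6:00 am backup, and the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Approximate times quoted in the post (assumption: 6:21 is am).
day = datetime(2007, 10, 17)
backup_start = day.replace(hour=6, minute=0)
backup_duration = timedelta(minutes=2)       # "about 2 min"
hang_time = day.replace(hour=6, minute=21)   # when the server went down
outage = timedelta(minutes=45)               # "about 45 min"

backup_end = backup_start + backup_duration
gap = hang_time - backup_end                 # quiet time between backup and hang
reset_time = hang_time + outage              # roughly when the VM was reset

print("backup finished:", backup_end.strftime("%H:%M"))
print("gap before hang:", int(gap.total_seconds() // 60), "min")
print("VM reset around:", reset_time.strftime("%H:%M"))
```

If those figures are right, the backup had been done for roughly 19 minutes when the VM hung, so it looks like an unlikely direct trigger.
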
I'm going to rebuild it this weekend just to be safe, considering the CPU went crazy, and we changed the processor count after the kernel was installed...

I've posted a question on the VMTN forums, so hopefully someone has some insight.
