For our BI setup running under Microsoft BI 2005, we have an environment that comes straights from scholar manual. We have a development environment (DEV), composed of 4 servers: two are SQL 2005 database servers (staging database and a data warehouse ), one server is a SharePoint 2007 server hosting also the SharePoint database (SQL 2005) and a Visual Studio server. Similarly, we have a testing environment (TEST) composed of 3 servers: a staging database, a data warehouse, and a SharePoint server hosting the SharePoint database as well. Both the DEV and the TEST environments are virtualized. Then we have our production environment (PROD) composed of 4 physical servers: a staging server, a data warehouse, a SharePoint server, and a SharePoint database server. The only difference being that in production, we have separated the SharePoint front layer from the database layer for performance reasons.
Developers develop in the DEV environment and technical approvals are done there. Testers test in the TEST environment, which is also used for the functional approvals. Production data run in the production environment. So far so good and nothing new. I just described an out-of-the-manual architectural setup. All databases are up-to-date SQL 2005 and the SharePoint version is 2007.
We had planned to upgrade this environment to Microsoft SQL 2008 and thought it was quite well planned. We had waited for SQL 2008 SP1 to be released. We had waited for our sub-contractor to accumulate some upgrade experience and finally decided to migrate progressively. During last July, we migrated our DEV environment. Everything went smoothly. Some of us then went on vacations. The migration of our TEST environment was planned for mid-August and took place as expected. All functional tests we had prepared were passed without any problem. As this point, and as all lights were green, I decided to authorize the migration of the PROD environment.
And guess what ? Everything went fine with only one unexplained manual reboot. We later even noticed a small but noticeable performance gain with SQL 2008.
On the next days, I got an alarming phone call from the IT operations department. They reported that since the SQL upgrade all production servers kept rebooting every 2 to 12 hours without apparent reasons.
We then immediately started our investigations but quickly discovered that there was no logged Windows event or error message. A closeup monitoring of SQL did not give us any clue. A couple of days were necessary to find out that the reboots were caused by the monitoring systems because the servers were freezing for more than 10 minutes. Still, we had no idea why they were freezing. It only happened on the physical PROD servers, not on the virtual DEV or TEST servers.
The main difference between virtual and physical machines concerns the underlying drivers. After a couple of days, we upgraded several drivers on one of the blade servers: HBA StorPot card, Ilo2 card firmware, blade power management control and various other minor drivers. It worked and fixed the problem.
Conclusion of the story ? You cannot foresee everything and even if you work by the book, shit may happen. The good news is that we only lost 3 days with a very limited impact on end-users.