/v3-uk/news/1964805/intel-study-centres-cool
19 Sep 2008, Iain Thomson, V3
A pilot study on data centre cooling has shown that systems architects may be over-specifying data centre operations.
Intel set up a trial data centre (PDF) of 900 blade servers split into two sections. One side was run in the traditional manner, with air-conditioners and dust filters, and the other used an air economiser which simply sucked out hot air and replaced it with outside air, with no adjustment for temperature or humidity.
The results were surprising. Machines in the un-cooled, minimally filtered section, which ran at an internal air temperature of around 90 degrees Fahrenheit, suffered a failure rate of 4.46 per cent, compared to 3.83 per cent for the air conditioned centre.
Air in the un-cooled centre was passed through simple household dust filters. The servers became covered with a thick layer of dust over the 10-month trial period but this did not seem to affect reliability.
If the results are extrapolated to a 10-megawatt data centre, Intel estimates that it could cut power consumption by 67 per cent, yielding a cost saving of $2.87m per annum.
"A data centre equipped with an air economiser could substantially reduce Intel's environmental footprint by reducing consumption of power and water," said Don Atwood, regional data centre manager at Intel Information Technology.
"In dry climates, traditional air-conditioned data centres typically include evaporative cooling using water towers as a pre-cooling stage.
"With an economiser, this would not typically be used, potentially saving up to 76 million gallons of water annually in a 10-megawatt data centre."
The test centre was set up in a dry, temperate desert and the results suggest that data centres are capable of operating perfectly well at temperatures approaching 90 degrees Fahrenheit and using much less expensive air conditioning equipment.
The company is now planning on building a 1-megawatt test facility and conducting a longer trial to get a more accurate picture of reliability rates.
Failure rate increase of 82%
This is an interesting experiment, but it still leaves a lot of questions unanswered for me, a designer of HVAC systems for data centers.
If you read Intel's white paper you will find that the failure rate of the data center with air economizer was 82% higher than the failure rate of a similar data center with 100% recirculated air. 4.46% vs. 2.45%. Yet Intel characterizes this result with: "We observed no consistent increase in server failure rates as a result of the greater variation in temperature and humidity, and the decrease in air quality, in the trailer." (Note that the 3.83% failure rate mentioned in this article applies to the failure rate of servers in Intel's main data center at that location, which was configured differently, and not to the identically configured experimental control.)
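The 82% figure is a relative increase, not a difference in percentage points. A minimal Python sketch of that arithmetic, using only the failure rates quoted above (the function name `relative_increase` is made up for illustration):

```python
def relative_increase(test_rate: float, control_rate: float) -> float:
    """Return how much larger test_rate is than control_rate, as a fraction."""
    return (test_rate - control_rate) / control_rate


# Failure rates in per cent, as quoted in the comment and the article.
economizer_rate = 4.46   # economizer compartment
control_rate = 2.45      # identically configured control compartment
main_dc_rate = 3.83      # Intel's main data center at the site

# Against the experimental control: about 82% higher.
print(f"vs control: {relative_increase(economizer_rate, control_rate):.0%}")

# Against the main data center figure used in the article: about 16% higher.
print(f"vs main data center: {relative_increase(economizer_rate, main_dc_rate):.0%}")
```

The choice of baseline drives the headline number, which is why the distinction between the control compartment and the main data center matters here.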
I suspect that the increased failure rate was due to overheating caused by dust buildup on the internal surfaces of the server components. I also suspect that their failure rate was increasing over time, as the dust built up, and lengthening the experiment would have resulted in increasing failure rates. However, I am not sure, because the experiment was not designed to eliminate other possibilities. The failure rate could have been caused by humidity fluctuations, absolute humidity highs or lows, temperature fluctuations, or simply temperatures that were too high.
Let's do a back-of-the-envelope cost/benefit analysis:
According to Intel, this test was run on a 100 kW data center, with 900 servers. They say that a typical 500 kW data center could save $143,000/year. Scaling that linearly, this data center could save $28,600/year. A failure rate 2.01 percentage points higher (4.46% vs. 2.45%) implies approximately 18 more server failures than the control group over the 10-month trial. Adjusting for 12-month operation implies an "excess failure rate" of 21.6 servers per year. Unless you are buying servers for less than $1325 each, and the labor to replace them is free, you are not saving any money. If your server downtime costs you anything, you are losing money.
I am not about to sell my clients a system that results in an 82% increase in their server failure rate, even if the absolute level of server failures is low. If Intel (or anyone else) can identify the key reasons for server failure, and a design modification (e.g. better filters) could minimize the impact of the problem, I would jump at a chance to implement an energy-saving approach such as this.
Intel is designing a larger data center using this approach, with the intent to look into failure rates over a longer time scale. I am anxiously awaiting the results of that experiment. I hope they continue to use a control group, so we get good data. I would suggest they try to isolate the variables that they want to study, so we designers can make better use of their results. Designing the system with high-efficiency air filters would be interesting, too.
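The break-even estimate above can be reproduced with a short Python sketch. It uses only the figures quoted in this comment and assumes, as the comment does, that savings scale linearly with data center load and that failures accrue evenly over time. The function and parameter names are illustrative, not Intel's.

```python
def break_even(
    typical_savings_per_year: float = 143_000.0,  # Intel's figure for 500 kW
    typical_load_kw: float = 500.0,
    test_load_kw: float = 100.0,
    servers: int = 900,
    economizer_failure_pct: float = 4.46,
    control_failure_pct: float = 2.45,
    trial_months: int = 10,
):
    """Return (yearly savings, excess failures per year, break-even cost per server)."""
    # Assumption: savings scale linearly with load.
    savings = typical_savings_per_year * test_load_kw / typical_load_kw

    # Extra failures during the trial, rounded to whole servers as in the comment.
    excess_in_trial = round(
        servers * (economizer_failure_pct - control_failure_pct) / 100
    )

    # Assumption: failures accrue evenly, so scale the trial period to 12 months.
    excess_per_year = excess_in_trial * 12 / trial_months

    return savings, excess_per_year, savings / excess_per_year


savings, excess_per_year, cost_per_server = break_even()
print(f"Yearly savings:          ${savings:,.0f}")
print(f"Excess failures/year:    {excess_per_year:.1f}")
print(f"Break-even server cost:  ${cost_per_server:,.2f}")
```

With the default inputs this gives $28,600 in savings, 21.6 excess failures per year, and a break-even replacement cost of a little over $1,324 per server, in line with the roughly $1325 quoted above. Labor and downtime costs are left out, so the real break-even price would be lower.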
John Hazucha
Ellerbe Becket
Posted by John Hazucha, 02 Oct 2008