The prospect of a business losing its most important data and, as a result, losing important business or money, is not one that any IT professional takes lightly, or so you would think. Yet many businesses run so-called "servers" based on non-fault-tolerant designs and, in so doing, risk days without access to their data.

Looked at in its simplest sense, fault tolerance means that your server can cope with what would otherwise be a disaster scenario and still give you complete access to your data. Where, for instance, a power supply unit (PSU) failing in a non-fault-tolerant machine would leave it unusable until the supply was fixed, a single PSU failing in a fault-tolerant server would cause no more than an email to go to the support department, alerting them to the failure. The server would just keep on going.

In a non-fault-tolerant server, if the CPU is dangerously close to thermal overload, and DMI services are in place to warn the system administrator (failing which, it would simply go Phuttt! and die), there would be no option but to close the machine down to fix the problem. This may entail a full day without access to data - not good news for a company that needs constant access to its data in order to make its daily bread. In the fault-tolerant server, however, access to data would be restricted for only as long as it takes the operating system to close down and reboot. Such a restart is a scenario most concerns can handle, especially when all users connected to the server are pre-warned and can make a controlled exit from their programs.

So, what constitutes fault tolerance? In simplest terms, the server can tolerate faults and, in the wider sense, can keep on working when what would normally be a fatal fault occurs. Taking it subsystem by subsystem, the remainder of this article examines how fault tolerance can be defined and achieved.
Central Processing Units

A processor failure in a single-processor machine is fatal. Fault tolerance backs up the single processor by adding another. There are performance advantages, too, with a multi-processor operating system, and the total processor count can be as high as six for Intel Pentium Pro based machines - Advanced Logic Research's Six-way is an example. With its Infor-Manager software in place and running under Windows NT, should a single processor's core temperature exceed a pre-set value, that processor is taken out of operation and the machine keeps on running. If a processor fails completely, it takes just a few moments to reboot the machine using only the good processors. Processors cannot be changed while the machine is running, but downtime is kept to the bare minimum - minutes as opposed to days.

Power Supplies

As with processors, a power supply unit failure in a single-PSU machine is fatal. Fault tolerance backs up the PSU with a second, third or further PSU that shares the load with the others in the PSU array. If one should fail, the others continue to supply power. PSUs can - in our ALR Six-way example - be hot-swappable, that is to say, they can be changed while the machine is powered up, which, in turn, means that no downtime is necessary in the event of a failure.

Additionally, an Uninterruptible Power Supply (UPS) attached to any one of an array of redundant power supplies assures adequate power for an orderly shutdown in case of a total power outage at the site, although it's wise to use different wiring circuits, fed from different phases and separately fused, for each supply in an array in order to minimise server downtime from planned power cuts. In truth, the use of a UPS with a fault tolerant server is best viewed as a method of ensuring that mains spikes and other forms of mains-related interference are isolated from the server.
In a situation where, for instance, 24-hour remote operation is required, as in the case of a Web server, the UPS needs to be specified to a high enough capacity to keep the machine running smoothly for the maximum amount of time a power cut may last. If 24-hour running is of paramount importance, then a backup generator would be a wise investment, wired separately to the +1 power supply unit via a second UPS.

Memory

The plethora of memory types available today can sometimes be confusing - does your PC use Page Mode, EDO, ECC, DRAM, SRAM, Parity or Non-Parity RAM? In the fault tolerant server, Error Checking and Correcting (ECC) RAM should be used. If a failure occurs, the affected memory is mapped out of the memory map and the machine carries on. With non-ECC memory, a fatal memory error may occur, requiring unscheduled downtime to replace the faulty memory. In the ALR server, Informanager can pinpoint the kind of error - correctable or non-correctable - and on which SIMM or DIMM it has occurred. If the error is correctable, it can re-map the affected memory locations back into the memory map without downtime. If it is non-correctable - for instance, the SIMM or DIMM is unserviceable - it can easily be located and replaced in minutes. In a redundant memory array, at least twice the design specification of RAM should be installed to guard against multiple failures.

Storage

A single drive - or even a collection of single drives - can never be fault tolerant. If a single, non-RAID (see RAID box-outs) disk fails, whether from a head crash or some other similarly fatal happenstance, the data on it is lost to immediate usage, and unscheduled downtime will be required to get the operating system back up and running, let alone the data restored from backups. Data can be recovered from crashed drives - companies like Dr.
Solomon's, for instance, have made a business of being able to get data off dead disks - but the time-to-fix is measured in days or weeks, with no guarantee of a fix.

By the same token, backup procedures in some installations are woefully inadequate. When was the last time your backup was fully restored to check whether it was completely reliable (a verify or compare does not assure you of 100% reliability)? The only reliable way of making sure that your data is available 99.9% of the time on a single server is to use RAID storage coupled with fully verified and reliable backups.

RAID arrays can be added to a server as an upgrade, and the disks themselves do not need to be built into the server's box. As in the case of Storage Dimensions' RAIDPro storage system, they can be purchased as a SCSI RAID controller and external enclosure which feature all of the advantages of a bus-based RAID system, including hot-swapping and hot-spare, as well as line expansion. At RAID level 5, with three 2GB disks making up a 4GB volume, it performs very quickly and offers excellent redundancy. As a guide to its performance, a full format of the 4GB volume (which physically occupies 6GB of disk, one disk's worth being given over to parity) takes between eight and 10 seconds. Data access across a 10Mbit/s Ethernet network is very quick, in the order of three to four times quicker than accessing the same data on a standard, single IDE drive.

Taken all together, a server machine that exhibits all of the fault tolerance features discussed is highly unlikely to suffer unplanned downtime. With multiple redundant power supplies, multiple redundant processors, error checking and correcting memory, a properly specified RAID array for storage, and a sensible and proven reliable backup regime, there's very little prospect of unplanned downtime costing more than a few minutes in the event that it does occur.
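
To put two of the figures quoted above into perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the purpose, the RAID 5 rule used (one disk's worth of capacity given over to parity) is the standard one rather than anything specific to the RAIDPro system, and the year is assumed to be 365 days long.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year permitted by a given availability figure."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Three 2GB disks, as in the example above: 6GB raw, 4GB usable.
print(raid5_usable_gb(3, 2))                     # 4
# 99.9% availability still allows most of a working day of downtime a year.
print(round(downtime_hours_per_year(99.9), 2))   # 8.76
```

In other words, even a 99.9% target leaves room for under nine hours of outage a year, which is why the restart-in-minutes behaviour described above matters.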
Given the reliability that such a machine would provide, even under fault conditions, any remedial maintenance can be planned for and scheduled at a time of minimum disruption to the company.

As with most pieces of technology, what was considered very high-end a few years ago is now commonplace on PC servers. Paul Stowe, general manager for server development at Fujitsu, said: "A few years ago, features like redundant power supplies and hot-pluggable disks were only available on minicomputers and high-end servers. Now they are a commodity on PC servers."

A fully fault tolerant server may seem like a belt, braces, staples, bit of string and a spare pair of trousers approach, but when your business depends on your data, you can never take too many safety measures. Today, with the price of building in fault tolerance coming down to affordable levels, more and more businesses can reap the benefits on commodity servers.

RAID: fact file

RAID is an acronym for a Redundant Array of Inexpensive Disks - an array of smaller, less expensive disks that gives a performance boost over a Single Large Expensive Drive (SLED), and can also provide varying levels of fault tolerance.

There are three basic RAID disk array types:

- Software based - the array is managed by a software application that might be built into the operating system, or may be an additional application - Windows NT provides basic RAID capabilities in software, for instance. The drawbacks include the load on the host processor(s) and the temptation to opt for less reliable IDE drives.
- SCSI-to-SCSI - a dedicated array controller is located in an enclosure separate from the host computer, and communicates via a SCSI adapter in the host. RAID functions are transparent to the host and independent of the operating system.
Drawbacks include the limited bandwidth between the RAID controller and SCSI host adapter.
- Bus-based - the RAID controller is situated on an expansion card on the host computer's bus (PCI, EISA or MCA, for example) and uses its own processor to drive array management firmware. Such a controller is obviously limited to the bus for which it was designed, but is capable of transferring data at the speed of the bus. Drawbacks include limitations in the speed of data transfer from the physical drives to the controller itself - for a SCSI based controller, the current top speed per channel is 20MB/second.

RAID levels: Is Windows NT ready for the enterprise?

Major businesses around the world have already piloted and adopted NT for their critical applications, and are readying enterprise-wide NT deployment. Market forecasts plainly call for Windows NT to be a predominant enterprise computing platform by the end of the decade. However, discussions about NT's suitability for the enterprise often omit the subject of mass storage.

Storage is a crucial component of any enterprise computing environment. A customer deploying NT in an enterprise environment requires enterprise-class storage. Data marts, intranets, Internet hosting, Ecommerce, messaging and workgroup computing are among the key business applications that are driving NT adoption in the enterprise. These applications demand a scalable storage architecture that ensures high availability, high capacity and high performance from the desktop to the data centre. Customers cannot trust their businesses to NT unless these requirements are met.

Storage systems employ a variety of techniques under the RAID banner in order to achieve fault-tolerant storage. There are varying levels of performance, each using a combination of Mirroring, Striping and Spanning, as well as allowing for Hot Swap, Hot Spare and Line Expansion/Volume Growth.
- RAID 0 - Disk Striping
Disk striping writes data across multiple disks rather than to one disk at a time - stripes from multiple drives are interleaved in sequence, so that in a three-drive stripe set, for example, stripe 1 would be written to disk 1, stripe 2 to disk 2, stripe 3 to disk 3 and so on. The net effect is to increase the data transfer rate. Disk striping alone, however, provides no data redundancy - it only enhances performance.
- RAID 1 - Disk Mirroring
The very simplest RAID level to provide redundancy and fault tolerance, RAID 1 simultaneously writes data to two identical drives. If one drive fails, the other steps in and takes over, and can be used to reconstruct the replacement drive once the failed one has been swapped out. Disk mirroring provides 100% data redundancy and requires at least two drives.
- RAID 3 - Disk Striping with dedicated parity
Parity is a method of generating redundancy data from two or more sets of data. The data generated can be used to reconstruct one of the parent data sets, but does not fully duplicate (mirror) the data. If a single data disk fails, it can be rebuilt from the remaining data disks together with the parity held on the dedicated parity disk. If the dedicated parity disk fails, it can be rebuilt from the parent disks.
- RAID 5 - Disk Striping with distributed parity
RAID 5 is similar to RAID 3 - it combines striping with parity, but distributes the parity data across the physical disks in the array, making data reliability higher than with Level 3. It needs a minimum of three drives.
- RAID 10
RAID 10 is a combination of Levels 1 and 0, and involves the striping of mirrored arrays - it needs twice the number of disks of a Level 0 array, and provides not only the 100% redundancy of Level 1, but also the enhanced performance of Level 0 striped arrays.
- RAID 30
Like RAID 10, RAID 30 is a combination of two RAID Levels - in this case, Levels 3 and 0. It stripes two or more RAID 3 arrays, and requires a minimum of six disks.
It provides fault tolerance and high speed, but is more suited to non-interactive processes that access large files sequentially.
- RAID 50
No prizes for guessing that this is a combination of Levels 5 and 0, striping two or more RAID 5 arrays. It provides highly reliable storage, high request rates and high data transfer performance.
- Hot Swap/Hot Spare
A RAID array can be built with n+1 drives, where n is the number of drives needed for the array to function at whatever RAID level it has been designed for. The extra drive is designated as a Hot Spare, and automatically replaces a failed drive at the time of failure. Hot Swap means a drive can be removed from the array, and a new one added, without downtime.
- Line Expansion/Volume Growth
A drive can be added to the system and incorporated into a RAID array on the fly, increasing the capacity of the logical drive (volume) without needing downtime to reconstruct the array - invaluable when an array is filling up. Without line expansion, adding a new disk to a volume might mean a full day's downtime.
- Clustering
As an adjunct to RAID arrays of disks, there is also the concept of clustering. This involves establishing a backup server that mirrors the main server's devices and functionality, and also mirrors its disk contents. In the event of the primary server failing, the backup server takes over control of the network. Under Windows NT, this second, subservient server would be known as a Backup Domain Controller, and regular replication would be required to keep the disk contents synchronised with the Primary Domain Controller.
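
The parity idea behind RAID 3 and RAID 5 described above can be illustrated with a short Python sketch. It assumes the usual XOR form of parity, which the fact file does not spell out, and it works on in-memory byte strings standing in for stripes; the names xor_blocks, data, parity and rebuilt are hypothetical and exist only for this example. A real controller does the same thing in firmware, stripe by stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data stripes, one per data disk.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity stripe is the XOR of all the data stripes.
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the survivors with the parity stripe
# and the missing stripe falls out.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])  # True
```

The same calculation rebuilds a lost parity stripe from the data stripes, which is why an array of this kind survives the failure of any one disk but not two at once.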