Headroom is the amount of excess capacity that is considered routine. Any service will have usage spikes or edge conditions that require extended resource usage occasionally. To prevent these edge conditions from triggering outages, spare resources must be routinely available. How much headroom is needed for any given service is a business decision. Since excess capacity is largely unused capacity, by its very nature it represents potentially wasted investment. Thus a financially responsible company wants to balance the potential for service interruption with the desire to conserve financial resources.
Your monitoring data should be picking up these resource spikes and providing hard statistical data on when, where and how often they occur. Data on outages and postmortem reports are also key in determining reasonable headroom.
Another component in determining how much headroom is needed is the amount of time it takes to have additional resources deployed into production from the moment that someone realizes that additional resources are required. If it takes three months to make new resources available, then you need to have more headroom available than if it takes two weeks or one month. At a minimum, you need sufficient headroom to allow for the expected growth during that time period.
Reliable services also need additional capacity to meet their SLAs. The additional capacity allows for some components to fail, without the end users experiencing an outage or service degradation. The additional capacity needs to be in a different failure domain; otherwise, a single outage could take down both the primary machines and the spare capacity that should be available to take over the load.
Failure domains also should be considered at a large scale, typically at the data-center level. For example, facility-wide maintenance work on the power systems requires the entire building to be shut down. If an entire datacenter is offine, the service must be able to smoothly run from the other data centers with no capacity problems. Spreading the service capacity across many failure domains reduces the additional capacity required for handling the resiliency requirements, which is the most cost-effective way to provide this extra capacity. For example, if a service runs in one data center, a second data center is required to provide the additional capacity, about 50 percent. If a service runs in nine data centers, a tenth is required to provide the additional capacity; this configuration requires only 10 percent additional capacity.
The gold standard is to provide enough capacity for two data centers to be down at the same time. This permits one to be down for planned maintenance while the organization remains prepared for another data center going down unexpectedly.
Sign up for Computerworld eNewsletters.