When Google needed to expand its network beyond what could be supported by the commercial switches of the day, it looked back in time for a solution—deploying a decades-old architecture novel to the computer industry but widely used by telephone companies.
The company found that the Clos architecture, originally designed for telephones, also provided an effective way to connect large numbers of chatty servers, and paved the way for Google to build a single control plane to route all its traffic.
Going with Clos was one of the design choices that Google engineers will discuss Wednesday at the Association for Computing Machinery’s SIGCOMM (Special Interest Group on Data Communications) conference being held in London.
“Because we’ve been willing to go outside the box to design our network infrastructure, we’ve learned a lot of lessons,” said Amin Vahdat, a Google distinguished engineer who co-authored the paper accompanying the talk.
To date, Google has largely kept quiet about the design of its own network infrastructure, which now underpins Google’s internal and public operations.
“We don’t have individual compute infrastructures for individual applications. It’s not like we have a Gmail cluster or a Photos cluster. It all runs on shared infrastructure,” Vahdat said.
A unified platform can save money, because it allows the company to use its compute resources more efficiently.
It has also led to the development of new big data technologies such as MapReduce, which wouldn’t work if network connections had to be configured manually for each new job.
In June, at another conference, Vahdat sketched out at a high level how the Google network is designed, though the new paper offers the gritty details of all the work Google did to arrive at its current architecture.
The paper describes the five generations of network topologies that Google iterated through in the past decade to get to its current design.
The problem that Google faced was rapid growth in traffic. Its requirements for network bandwidth roughly double every 12 to 15 months.
“We couldn’t buy, at any price, a network that would meet our performance requirements, our size requirements or meet the manageability requirements we had,” Vahdat said.
Google data centers now handle about 50 times as much network traffic as they did in 2008. Credit: Google
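As a rough sanity check, the two figures quoted above (bandwidth doubling every 12 to 15 months, and about 50 times the traffic of 2008) can be compared with simple exponential-growth arithmetic. The sketch below assumes smooth, constant-rate doubling, which real traffic almost certainly does not follow exactly, and the function names are illustrative only, not anything from Google's paper.

```python
import math


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` when demand doubles every `doubling_months`.

    Assumes idealized, constant-rate exponential growth.
    """
    return math.log2(factor) * doubling_months


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Growth multiple after `elapsed_months` at the given doubling period."""
    return 2 ** (elapsed_months / doubling_months)


if __name__ == "__main__":
    # The 12- and 15-month doubling periods and the 50x figure come from the article.
    for doubling in (12, 15):
        months = months_to_grow(50, doubling)
        print(f"doubling every {doubling} months: 50x takes about {months / 12:.1f} years")
```

Under these assumptions, a 50-fold increase takes roughly five and a half to seven years, so the two figures are broadly consistent with each other.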
Even the largest commercially available data center switches, the ones built for telecommunications companies, weren’t really suited to accommodate such a fast-growing load of traffic, for a variety of reasons.
For one, commercial switches of the day were designed to be managed independently. Google wanted switches that could be managed as a group, just like its servers and storage arrays. This would allow the company to treat an entire data center as if it were a single humongous compute node.