For example, Alfred Spector, vice president of research and special initiatives at Google, said it's not inconceivable that a cluster of servers could someday include 16 million processors creating one MPP data warehouse.
"It doesn't seem there are any limitations other than good engineering required to get there," he said. "Moore's law or not, we have essentially unlimited computation available."
Spector sees a day when distributed computing systems will offer Web developers what he calls "totally transparent processing," where the big data analytics engines learn over time to, say, translate file or block data into whatever language is preferred based on a user's profile history, or act as a moderator on a website, identifying spam and culling it out. "We want to make these capabilities available to users through a prediction API. You can provide data sets and train machine algorithms on those data sets," he said.
Yahoo has been the largest contributor to Hadoop to date, creating about 70% of the code. The Web search company uses it across all lines of its businesses and has standardized on Apache Hadoop, because it favors that version's open-source attributes.
Papaioannou said Yahoo has 43,000 servers, many of which are configured in Hadoop clusters. By the end of the year, he expects his server farms to have 60,000 machines because the site is generating 50TB of data per day and has stored more than 200 petabytes.
"Our approach is not to throw any data away," he said.
That is exactly what other corporations want to be able to do: Use every piece of data to their advantage so nothing goes to waste.
Sign up for Computerworld eNewsletters.