"Our Web logs on how customers are using e-readers and e-books ... have produced 35TB of data and will load us up with another 20TB this year," said Parrish, who noted that e-book sales outstripped hardback book sales on Amazon.com last year.
With that kind of data, the retailer can determine consumer behavior, such as what percentage of shoppers make book-buying decisions based on their fondness for a particular author.
"We have to decide with analytics on hand how we capture the customer's imagination and how we move forward," he said.
Other companies are using big data analytics to track the use of content on their websites in order to better tailor that content to users' tastes.
Sondra Russell, a metrics analyst at National Public Radio, said she needed a way to track website audience usage trends in near real time. NPR offers podcasts, live streams, on-demand streams and other audio content on its website. Her organization had been using the Web analytics engine Omniture, but she said it felt like trying to jam log-based data into a client-side tracking system that couldn't handle the volume.
Russell said NPR experienced query delays that, at best, lasted six to 12 hours and, at worst, lasted for weeks. The organization finally switched to a reporting tool from Splunk that crawls logs, metrics and other application, server and network data and indexes it in a searchable repository.
"I just want to know how many times someone listened to a program during a certain period of time," she said. "With Splunk, I had no delays between data appearing in a query folder and data appearing in reports. I can get any number of graphs without weeks of prep time."
IBM's Jonas compared big data to puzzle pieces, which don't look like anything on their own but create a detailed picture once you put them together. That's where Hadoop, Cassandra and other analytics engines come in. Hadoop is an open-source framework that pairs a distributed file system with an implementation of Google's MapReduce programming model, allowing large-scale computations (batch processing) to be performed in parallel across large server clusters. The computations can be performed on user- or machine-generated data, whether structured or unstructured. But Hadoop works best on unstructured, random data sets, where it allows analytics engines to gather information from queries more quickly.
MapReduce systems differ from traditional databases in that they can quickly presort data in a batch process, regardless of the type of data: file or block. They can also interface with any number of languages, including C++, C#, Java, Perl, Python and Ruby. Once the data is sorted, a more specialized analytical application is required to run specific queries. Traditional databases can be considerably slower, requiring table-by-table analysis. They also don't scale nearly as well.
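The map, sort and reduce steps behind that batch process can be illustrated with a minimal single-machine sketch in Python. This is not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the word-count task are assumptions chosen for illustration. In a real cluster, each phase would run in parallel across many servers.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle and reduce in sequence on a list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return [reduce_phase(key, values) for key, values in shuffle_phase(pairs)]

# Two tiny stand-in "documents"; real inputs would be large files or blocks.
records = ["big data big clusters", "data in clusters"]
result = map_reduce(records)
```

Because each record is mapped independently and each key is reduced independently, the work can be split across machines without coordination inside a phase, which is the property that lets this style of processing scale.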