Watching his teams put together the faulty puzzles, he noticed a number of interesting traits. One obvious trait: the larger the puzzle, the more time it takes to complete. "As the working space expands the computational effort increases," he said. Ambiguity also increases computational complexity. Puzzle pieces that had similar colors and shapes were harder to fit together than those with distinct details.
"Excessive ambiguity really drives up the computational cost," Jonas said.
Jonas was also impressed with how little information someone needed to get an idea of the image that the puzzle held. After assembling only four pieces, one of his teams was able to guess that its puzzle depicted a Las Vegas vista. "That is not a lot of fidelity to figure that out," he said. Having only about 50 percent of the puzzle pieces fitted together provided enough detail to show the outline of the entire puzzle image. This is good news for organizations unable to capture all the data they are studying -- even a statistical sampling might be enough to provide the big picture, so to speak.
"When you have less than half the observation space, you can make a fairly good claim about what you are seeing," Jonas said.
Also, studying how his teams finished the puzzles gave Jonas a new appreciation for batch processing, he said.
The key to analysis is a mixture of streaming and batch processing. The Apache Hadoop data framework is designed for batch processing, in which a large body of data in a static file is analyzed. This differs from stream processing, in which a continually updated stream of data is observed. "Until this project, I didn't know the importance of the little batch jobs," he said.
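The distinction can be sketched in a few lines of toy Python. The function names are illustrative assumptions, not Hadoop or any streaming framework's API: the batch function sees the whole static file at once, while the stream function updates its answer one record at a time.

```python
# Illustrative sketch only: toy functions mimicking the two modes.

def batch_process(records):
    """Batch mode: the full, static data set is available at once."""
    return sum(records) / len(records)  # e.g., one global average

def stream_process(record_iter):
    """Stream mode: records arrive one at a time; state updates incrementally."""
    count, running_total = 0, 0.0
    for record in record_iter:
        count += 1
        running_total += record
        yield running_total / count  # a running average after each record

static_file = [3.0, 5.0, 4.0, 8.0]
print(batch_process(static_file))               # one answer over the whole file: 5.0
print(list(stream_process(iter(static_file))))  # an answer after every record
```

The batch pass gives a single, complete answer; the streaming pass gives a provisional answer that sharpens as data flows in.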
Batch processing is a bit like "deep reflection," Jonas said. "This is no different than staying at home on the couch mulling what you already know," he said. Instead of just staring at each puzzle piece, participants would try to understand what the puzzle depicted, or how larger chunks of assembled pieces could possibly fit together.
For organizations, the lesson should be clear, Jonas explained. They should analyze data as it comes across the wire, but such analysis should be informed by the results generated by deeper batch processes, he said.
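One way to picture that combination is a sketch in which a periodic batch job computes a baseline from historical data, and a streaming scorer checks each incoming event against it. The function names and the three-standard-deviation threshold are assumptions made for illustration, not a prescription from Jonas.

```python
# Hypothetical sketch of the pattern described above: score events as
# they arrive, using a baseline that a batch job recomputes offline.

def batch_baseline(history):
    """Deep batch pass: mean and spread over all historical data."""
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    return mean, var ** 0.5

def score_stream(events, mean, std):
    """Streaming pass: flag each incoming event against the batch baseline."""
    for value in events:
        deviation = abs(value - mean) / std if std else 0.0
        yield value, deviation > 3  # True means unusually far from baseline

history = [10, 11, 9, 10, 12, 11, 10, 9]
mean, std = batch_baseline(history)   # "deep reflection," done offline
live = [10, 11, 30]                   # data coming across the wire
for value, flagged in score_stream(live, mean, std):
    print(value, "anomaly" if flagged else "ok")
```

The streaming code stays fast and simple because the expensive thinking happened in the batch pass, which is the division of labor the paragraph describes.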
Jonas' talk, while seemingly irreverent, actually illustrated many important lessons of data analysis, said Seth Grimes, an industry analyst focusing on text and content analytics who attended the talk. Among the lessons: data is important, context accumulates, and real-time streams of data should be augmented with deeper analysis.
"These are great lessons, communicated really effectively," Grimes said.