Cost differences can, however, become significant if you’re considering the cloud. A variety of pre-integrated solutions are now available as cloud-based services (or even hybrid services, where some data remains on-premises). This model lets organizations start adopting big data at a much lower upfront cost, and much faster, than building their own solution or even deploying a full-scale pre-integrated solution on-premises.
Collecting and using data is not the same thing
It’s important to remember that data science takes more than just aggregating data in one place. There are many steps between collecting data and being able to use it.
Take a common example of extracting structured information from unstructured data, such as email. Here’s one way that could work: First, thousands of emails arrive in basic HTML. To extract meaningful insights, you now need to parse the documents, cleanse them, extract terms, define a meaningful vocabulary, etc.
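To make those steps concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a production pipeline: the function names, the tiny stopword list, and the rule that a term must appear in at least two emails to enter the vocabulary are all hypothetical choices made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Parse step: reduce an HTML email body to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Assumption: a deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "my", "for", "in", "on", "it"}


def extract_terms(text):
    """Cleanse and extract steps: lowercase, keep alphabetic tokens, drop noise."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def build_vocabulary(emails, min_docs=2):
    """Vocabulary step: keep terms that appear in at least `min_docs` emails."""
    doc_freq = Counter()
    for html in emails:
        doc_freq.update(set(extract_terms(html_to_text(html))))
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)
```

Calling `build_vocabulary` on a list of HTML strings returns a sorted list of terms. Real email adds complications this sketch ignores, such as MIME decoding, quoted replies, signatures, and multiple languages, and each of those is another workflow step someone has to build and maintain.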
Out-of-the-box solutions often provide prebuilt tools to manage job scheduling workflows alongside data collection to make your data analytics-ready. A more general-purpose pre-built platform is also likely to be flexible—allowing your developers to write programs using the language of their choice and be confident that they’ll work on any data in the system. So, it should be easy to create and continually update workflows around the data you’re collecting.
If you go DIY, make sure your infrastructure can handle all of the workflow processes around data collection, or that IT is willing to support them. And be sure to design your custom solution to be as open as possible so you’re not limiting your options in the future.
Going from lab to production
One of the bigger risks in DIY projects comes when it is time to move from the lab to production. Here’s what can happen: You set up a demo Hadoop environment to show what you can do with it. Everyone is impressed, and you get the green light to move forward. But then, when it’s time to put it into production, you face some uncomfortable questions from IT: How will this fit into our operational workflow? How will you secure access? Is the data encrypted at rest? How will this tie into our identity infrastructure?
Enterprise IT takes many things for granted—that any database platform will have encrypted storage, integration with Active Directory, rigorous audit logging, and a means to define fine-grained access control policies. If your solution hasn’t checked all those boxes—none of which were necessary in the lab—it’s not going anywhere near your production network.
Unfortunately, stock Hadoop doesn’t offer great answers to those questions. Even basic encryption and AD integration are complicated, and the default access control mechanisms are coarse-grained. There’s no mechanism to give different users different levels of access to the same data. For example, your platform might serve both customer service agents who need access to full records and analysts who are only authorized to see de-identified information.
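The kind of per-role view described here can be sketched at the application layer in a few lines of Python. This is an illustration of the requirement only, not a Hadoop API: the role names, the `PII_FIELDS` set, the `customer_id` field, and the salt are all hypothetical, and replacing an ID with a salted hash is pseudonymization rather than a rigorous de-identification scheme.

```python
import hashlib

# Assumption: which fields count as personally identifying is a policy
# decision; this set is invented for the example.
PII_FIELDS = {"name", "email", "phone"}


def deidentify(record, salt):
    """Drop PII fields and replace the customer ID with a salted hash token."""
    out = {k: v for k, v in record.items() if k not in PII_FIELDS}
    digest = hashlib.sha256((salt + str(record["customer_id"])).encode("utf-8"))
    out["customer_id"] = digest.hexdigest()[:12]
    return out


def view_for(role, record, salt="demo-salt"):
    """Return the view of `record` that a given (hypothetical) role may see."""
    if role == "support_agent":
        return dict(record)             # full record
    if role == "analyst":
        return deidentify(record, salt)  # same data, identifiers removed
    raise PermissionError(f"role {role!r} has no access to customer records")
```

In this sketch, `view_for("analyst", record)` returns the non-identifying fields plus a stable token in place of the customer ID, while `view_for("support_agent", record)` returns everything. The hard part in production is not this function but enforcing it everywhere data can be read, which is why IT expects the platform itself to provide fine-grained policies.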