If data warehouses are for tidiness freaks (information packaged into neat inferences, sorted and stacked, the rest discarded) and data lakes are for hoarders (tip everything in, you never know what might be useful) then SAP's new Data Hub may be for the rest of us.
It's a new data management tool intended to process only the data you need -- and to go looking for it where it's created or stored, without requiring you to pull it all into one place.
Data scientists will be able to use it to analyze data from multiple sources and systems.
"Data Hub is a strong data management umbrella layer that allows for data integration, data processing and data governance," said Irfan Khan, global head of SAP database and data management sales.
"It allows us to look across all the data that you own, and access all of the information. But it doesn't look to centralize all this data in a data lake of its own; it's looking at capturing data and accessing data exactly where it resides today," said Khan, speaking ahead of the product's launch Monday.
While the notion of an enterprise data hub has been around for a while, SAP is using the term a little differently from most: Where others such as MapR or Cloudera favor importing all the data into a giant Hadoop cluster or other central repository before processing, SAP intends to leave data in situ until it's needed.
It will do that by creating data pipelines -- flows of data composed of reusable, configurable operations that process data pulled from a variety of sources, including CSV files, web services APIs, and commercial cloud services, as well as SAP's own data stores. The operations could be connectors to different file systems or APIs, analytics or machine learning libraries such as TensorFlow, or custom-coded tasks.
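To make the idea concrete, here is a minimal sketch of that pipeline pattern -- reusable, configurable operations chained into a flow. This is purely illustrative: it is not SAP Data Hub's actual API, and all class and function names here are invented for the example.

```python
import csv
import io


class Operator:
    """A reusable pipeline step: takes a list of records, returns records."""
    def __call__(self, records):
        raise NotImplementedError


class CsvSource(Operator):
    """Connector operator: parses CSV text into dict records."""
    def __init__(self, text):
        self.text = text

    def __call__(self, records=None):
        return list(csv.DictReader(io.StringIO(self.text)))


class Filter(Operator):
    """Processing operator: keeps only records matching a predicate."""
    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, records):
        return [r for r in records if self.predicate(r)]


class Project(Operator):
    """Processing operator: keeps only the named fields of each record."""
    def __init__(self, *fields):
        self.fields = fields

    def __call__(self, records):
        return [{f: r[f] for f in self.fields} for r in records]


def run_pipeline(source, *operators):
    """Run the source, then push its records through each operator in turn."""
    records = source()
    for op in operators:
        records = op(records)
    return records


# Example: read CSV data "where it resides" and process only what's needed.
CSV_TEXT = "name,region,revenue\nAcme,EU,120\nGlobex,US,90\nInitech,EU,45\n"
result = run_pipeline(
    CsvSource(CSV_TEXT),
    Filter(lambda r: r["region"] == "EU"),
    Project("name", "revenue"),
)
print(result)  # [{'name': 'Acme', 'revenue': '120'}, {'name': 'Initech', 'revenue': '45'}]
```

In a real deployment the `CsvSource` connector would be swapped for a file system, API, or cloud-service connector, and the processing operators for analytics or machine learning steps, but the composition idea is the same.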
SAP provides a graphical tool for modeling workflows and pipelines, and an orchestration layer for invoking jobs and restarting or rolling back tasks in the event of failure. This can take the place of workflow scheduling systems such as Apache Oozie, Khan said.
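The restart-and-rollback behavior described above follows a familiar orchestration pattern, sketched below. Again, this is a generic illustration under assumed names, not SAP's implementation: retry a failed task a few times, and roll back if every attempt fails.

```python
import time


def run_with_retry(task, rollback, attempts=3, delay=0.0):
    """Run `task`; on failure, retry up to `attempts` times, then roll back."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                rollback()  # undo partial work before giving up
                raise
            time.sleep(delay)  # back off before restarting the task


# Usage: a flaky task that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result = run_with_retry(flaky_task, rollback=lambda: print("rolled back"))
print(result)  # done
```

A scheduler such as Apache Oozie wraps this same loop in workflow definitions and job state tracking; the orchestration layer described here would play that role for data pipelines.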
The execution of the pipeline can be pushed down to other platforms, such as SAP's Vora computing engine, he said.
Data Hub doesn't require a company to be built on SAP in order to work: It can also be integrated with third-party products, he said. "You don't need to be using SAP's ETL processing; you may be using Informatica," he said, or perhaps the open-source Kafka messaging layer.