Big-data stewardship takes the main stage at DataWorks Summit in Berlin

By James Kobielus | May 03, 2018

Big-data repositories hold much of the world’s personally identifiable data. Many data management professionals are now laser-focused on the European Union’s General Data Protection Regulation, or GDPR, which will take effect in a little over a month and will place strict data-stewardship mandates on any enterprise that does business in any EU member state.

Since it was founded in 2011, Hortonworks Inc. has evolved from a Hadoop big-data software distribution startup into a diversified provider of big-data governance tooling for private, public, hybrid and multicloud deployments. GDPR is now the principal global focus in that regard, though other country- and sector-specific laws, such as HIPAA in the U.S., still drive demand for such capabilities. As I discussed in this recent article, GDPR mandates stringent enterprise controls on the processing, movement and use of the personal data of citizens of EU member states, and imposes significant financial penalties for failure to maintain them.

At the recent DataWorks Summit in Berlin (* disclosure below), GDPR was the predominant focus, but it was far from the only topic. As John Kreisa, Hortonworks’ marketing vice president, said on theCUBE at the event, the vendor’s growing customer base is increasingly investing in its predominantly open-source-based products, as well as those of its more than 2,300 partners, to drive the “internet of things,” stream computing, data science, artificial intelligence, machine learning, data warehousing, cybersecurity and other important applications. In addressing these opportunities, Hortonworks continues to pursue a community-led market strategy focused on open-source technology, as company co-founder Alan Gates discussed in this Cube interview.

There were well-attended breakout tracks on these and other emerging technologies at DataWorks Summit. On theCUBE, we had great discussions with a Hortonworks financial-services customer in Germany who’s building an enterprise data lake, as well as partners in Uruguay and South Africa working on customer projects with a data science focus. For a good discussion of how Hortonworks and its partners are generally addressing opportunities in data science, AI and related areas, check out this Cube interview with Piotr Mierzejewski, program director for IBM’s Data Science Experience solution, for which Hortonworks is a principal reseller.

Nevertheless, there’s no denying that GDPR’s looming deadline, which is barely a month away, has caused Hortonworks customers to bump compliance-related data-stewardship projects to the top of their priority stacks. As influential big-data analytics expert Bernard Marr (pictured at DataWorks) told me on theCUBE:

“My sense is that there is a lot of catching up to do, I think people are scrambling to get ready at the moment. But nobody really knows what getting ready really means, I think there are a lot of different interpretations. I’ve been talking to a few lawyers recently, and everybody has different interpretations of how they can push the boundaries.”

That explains why the event’s main news was Hortonworks’ announcement of the new Data Steward Studio, a software-as-a-service offering that helps enterprises automate their GDPR-compliance processes. Launched from the conference mainstage in the Day One keynote by Chief Technology Officer Scott Gnau, the new offering is now in technical preview and is slated for general availability later this quarter. It is going to market as a component of the larger Hortonworks DataPlane Service family of services for managing complex big-data multi-clouds.

As Gnau told me on theCUBE later that day:

“There’s definitely a big tie-in. GDPR is certainly creating a milestone, a kind of a trigger for people to really think about their data assets. But it’s certainly even larger than that. Because when you’ve been thinking about driving digitization of a business, driving new business models, connecting data and finding new use cases, it’s all about finding the data you have, understanding what it is, where it came from, who has access to it, what did they do with it. These are all governance kinds of things, which are now mandated by laws such as GDPR.”

As demonstrated by Srikanth Venkat in the Day Two keynote, Data Steward Studio supports the following privacy-protecting data-stewardship practices:

Discover, catalog and maintain detailed records of the personal data that an enterprise stores and manages across one or more data lakes in private, public or hybrid clouds;
Provide a secure, comprehensive environment for subjects to access and review their personal data wherever it is stored;
Disclose to subjects why the enterprise is processing their data, where it obtained the data and where it is sending it, when it will delete the data, why it needs to retain the data until then and what rights the subjects have over that data;
Enable subjects to register or withdraw their specific, informed, and unambiguous consent to varying levels of processing, use and transfer of that data; and
Execute subjects’ consent to processing, use and transfer of the data, as well as their requests to erase all or some of it, to withdraw consent to various uses or restrict profiling and processing.
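
To make the access, consent and erasure practices above more concrete, here is a minimal in-memory sketch in Python. It is purely illustrative and is not the Data Steward Studio API: the names `SubjectRecord`, `ConsentRegistry` and every method on them are hypothetical, and a real system would need persistent storage, auditing and authentication.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SubjectRecord:
    """Hypothetical per-subject record: stored personal data plus consented purposes."""
    personal_data: Dict[str, str] = field(default_factory=dict)
    consents: Set[str] = field(default_factory=set)


class ConsentRegistry:
    """Hypothetical registry; not part of any Hortonworks product."""

    def __init__(self) -> None:
        self._records: Dict[str, SubjectRecord] = {}

    def store(self, subject_id: str, key: str, value: str) -> None:
        # Record one item of personal data for a subject.
        self._records.setdefault(subject_id, SubjectRecord()).personal_data[key] = value

    def review(self, subject_id: str) -> Dict[str, str]:
        # Subject access: return a copy of everything held about the subject.
        record = self._records.get(subject_id)
        return dict(record.personal_data) if record else {}

    def register_consent(self, subject_id: str, purpose: str) -> None:
        # Consent is tracked per purpose (for example "marketing" or "profiling").
        self._records.setdefault(subject_id, SubjectRecord()).consents.add(purpose)

    def withdraw_consent(self, subject_id: str, purpose: str) -> None:
        record = self._records.get(subject_id)
        if record:
            record.consents.discard(purpose)

    def may_process(self, subject_id: str, purpose: str) -> bool:
        # Processing is allowed only for purposes the subject has consented to.
        record = self._records.get(subject_id)
        return record is not None and purpose in record.consents

    def erase(self, subject_id: str) -> None:
        # Erasure request: drop the data and the consents together.
        self._records.pop(subject_id, None)
```

In this sketch, a processing job would call `may_process(subject_id, purpose)` before touching a subject's data, so that a withdrawal or erasure takes effect on the next check.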

Metadata is the heart of the big-data catalog that powers Data Steward Studio. The solution enables enterprise data administrators to search, catalog, classify, tag and manage data globally based on origin, value, protection level, sensitivity or functional use, as well as other descriptive metadata. It enables data stewards to analyze data lineage and impact. It can also secure both the personal data and associated metadata in keeping with enterprise-wide authorization, data protection and anonymization policies.
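
As a rough illustration of what a metadata-driven catalog with tagging, lineage and impact analysis might look like, here is a small Python sketch. It is an assumption-laden toy, not Hortonworks' implementation: `CatalogEntry`, `Catalog`, the field names and the dataset names in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Hypothetical descriptive metadata for one dataset (the data itself lives elsewhere)."""
    name: str
    origin: str
    sensitivity: str
    tags: Set[str] = field(default_factory=set)
    upstream: List[str] = field(default_factory=list)  # datasets this one is derived from


class Catalog:
    """Hypothetical catalog keyed by dataset name."""

    def __init__(self) -> None:
        self.entries: Dict[str, CatalogEntry] = {}

    def add(self, entry: CatalogEntry) -> None:
        self.entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> List[str]:
        # Search: names of all datasets carrying the given tag, sorted for stable output.
        return sorted(n for n, e in self.entries.items() if tag in e.tags)

    def lineage(self, name: str) -> Set[str]:
        # Lineage: every dataset the named one derives from, directly or transitively.
        seen: Set[str] = set()
        stack = list(self.entries[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self.entries:
                stack.extend(self.entries[current].upstream)
        return seen

    def impact(self, name: str) -> Set[str]:
        # Impact: every dataset whose lineage includes the named one.
        return {n for n in self.entries if name in self.lineage(n)}


# Usage with made-up datasets.
catalog = Catalog()
catalog.add(CatalogEntry("crm_raw", "CRM export", "high", {"pii"}))
catalog.add(CatalogEntry("crm_clean", "ETL job", "high", {"pii"}, ["crm_raw"]))
catalog.add(CatalogEntry("churn_features", "feature job", "medium", {"analytics"}, ["crm_clean"]))
print(catalog.find_by_tag("pii"))
print(catalog.impact("crm_raw"))
```

With metadata in this shape, a steward answering a regulator's question about personal data could start from the `pii` tag and follow `impact` to see which derived datasets are affected by a change or an erasure request.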

For a great discussion of metadata’s central role in big-data catalogs such as the one underpinning the new Hortonworks offering, check out my Cube interview of IBM Distinguished Engineer Mandy Chessell. The gist of her discussion was the following:

“A lot of companies are trying to build a data catalog. It’s not a catalog that actually contains the data, but a catalog that describes the data. It’s a list of all the data sets, plus links to glossary definitions of what those data items mean within the data sets, plus information about the lineage of the data. It includes information about who’s using it, what they’re using it for, how it should be governed…. It is a central resource for an organization that has a strong data strategy, is interested in becoming a data-driven organization. This becomes their major catalog for how they’re using data sets. So then when the regulator comes in and asks can you show me how you’re managing personal data, the catalog has the information about where the data is located, what type of infrastructure it’s sitting on, how it’s being used by different services. So they can really show that they know what they’re doing, and then from that they can show how they’re using the metadata in order to manage the data appropriately day by day.”

To view these and other Cube interviews from DataWorks Summit 2018 Berlin, please visit this page. Here are all my interviews and keynote analyses for both Day One and Day Two. (* Disclosure: TheCUBE was a paid media partner for DataWorks Summit 2018. Neither Hortonworks, the event sponsor, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)