AI powers the catalogs of next-generation big data

By James Kobielus | May 03, 2018

Data’s relevance doesn’t always jump out at you. It takes work to distill useful insights from enterprise data lakes that are increasingly too large, diverse and dynamic to be explored through entirely manual methods.

Discoverability and visibility are what unlock data’s value. More enterprises are embracing big-data catalogs to harness insights that would otherwise stay dormant and overlooked. Recognizing this growing demand, more data management solution providers are building sophisticated catalogs into their solution portfolios, as discussed in Wikibon’s recent big-data market study.

Artificial intelligence is a key force driving the evolution of big-data catalogs into enterprisewide platforms for collaborative curation. Increasingly, providers are integrating AI into their offerings to help users discover, refine, explore, analyze and apply complex data sets more rapidly and intelligently to diverse applications.

Among data management vendors, Informatica LLC has set the pace in weaving AI-infused metadata-management capabilities into its solution portfolio. In the breadth and sophistication of its AI capabilities, Informatica stands apart from other data catalog solution providers such as Alation Inc., Cloudera Inc., Hortonworks Inc. and Microsoft Corp.

The company briefed Wikibon last summer on its roadmap to integrate AI as an enabling capability across its entire product line, with its Enterprise Data Catalog at the center. At that time, Informatica had already incorporated AI — which it brands as “CLAIRE” — into its catalog to automate data clustering, tagging, and domain/entity recognition. The AI-powered catalog intelligently scans data assets from across the enterprise and automatically adds business context metadata. In its data integration offerings, Informatica had already integrated such CLAIRE AI technologies as genetic algorithms (to identify complex data sub-structures), natural language processing algorithms (to drive semantics-based modifications to data models) and machine learning algorithms (to parse clickstream, log, system, JSON and other “internet of things” data).
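To make the idea of automated tagging and domain recognition concrete, here is a minimal sketch of how a catalog might attach business-context tags to columns by sampling their values. It is an illustration only and not Informatica's implementation: the pattern table, the function names `tag_column` and `tag_table`, and the 0.8 match threshold are all hypothetical, and simple regular expressions stand in for the machine learning a product like CLAIRE would apply.

```python
import re

# Hypothetical domain patterns (assumption for illustration only).
# A real AI-driven catalog would learn such rules rather than hard-code them.
DOMAIN_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def tag_column(values, threshold=0.8):
    """Return the domain tags whose pattern matches at least
    `threshold` of the non-empty sample values in a column."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return []
    tags = []
    for domain, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if hits / len(samples) >= threshold:
            tags.append(domain)
    return tags


def tag_table(columns):
    """Map each column name to its inferred domain tags."""
    return {name: tag_column(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    # Toy table with made-up values.
    table = {
        "contact": ["ana@example.com", "raj@example.org", ""],
        "signup": ["2018-05-03", "2018-04-30"],
        "notes": ["called twice", "prefers email"],
    }
    print(tag_table(table))
```

The threshold lets a column earn a tag even when a few values are dirty, which is the kind of tolerance a catalog scanning real enterprise data would need.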

At Informatica World 2017, CEO Anil Chakravarthy spoke to theCUBE about how CLAIRE figures into its product roadmap going forward. “When we built CLAIRE,” he said, “we did not invent the artificial intelligence or the machine learning. A lot of that is already available. So we took a lot of the best algorithms in machine learning and applied them to metadata and data management. That’s the secret sauce. It’s not the building the AI itself, it’s the use of the AI for data management.”

Chakravarthy emphasized that CLAIRE is “not a product. It’s … a cloud-scale, AI-powered real time engine that powers other products.” He added that CLAIRE will be embedded in Informatica products so that customers won’t have to deploy it explicitly. “So it means once you have any product like our enterprise data catalog or data governance solutions, you’re starting to use CLAIRE and then you can use CLAIRE for other use cases as well.”

In a product announcement today, Informatica rolled out features that infuse CLAIRE’s AI smarts more deeply into the catalog at the heart of its solution portfolio. The company’s core announcements were twofold: It has introduced enhanced AI algorithms for improved curation and classification of structured and unstructured data, and it now provides an integrated metadata-driven intelligent API.

These new features support self-service discovery of the catalogued data that is best for the task at hand, such as training a machine learning model or curating customer datasets. They also enable users, such as data scientists and stewards, to apply the catalogued data via a single click to whatever application environment they’re working within. In addition, Informatica now provides single-click deployment of the catalog to Amazon Web Services and Microsoft Azure, so all of these features are available within those public clouds.

Over the next several years, Wikibon expects to see big-data catalogs become ubiquitous in enterprise data environments, with AI, intelligent metadata, recommendation engines and automated task-specific guidance as essential features. These capabilities will help organizations to manage their growing information assets across more complex hybrid clouds.