Premise. The big data arena is at a crossroads. Use cases and tools are proliferating faster than most big data teams are gaining experience. CIOs must establish the big data business capabilities required to streamline complexity and deliver promised business outcomes.
(This is Part 1 of a two-part series on emerging approaches to establishing big data capabilities and accelerating big data time-to-value by streamlining toolset complexity. This report focuses on the strategic big data capabilities IT leaders must enact to manage big data complexity and achieve crucial big data-related business outcomes.)
History may not repeat itself, but it often rhymes. Just as OLTP technologies changed to solve more sophisticated business operations problems, big data technologies are evolving in response to use cases that feature increasingly complex and strategic analytics, as well as new cloud-based implementation options. But here’s a big difference: Single vendors methodically enhanced OLTP products like relational DBMSs as understanding of requirements progressed, whereas big data has embraced an open source ecosystem to accelerate innovation. Big data tools are proliferating, often based on divergent assumptions about the problems they solve. The side effect is complexity for developers and administrators at mainstream enterprises who, for the most part, don’t yet have the capabilities to consistently achieve big data objectives. Worse, many aren’t sure where to start because:
- Big data use cases are a moving target. The good news is that businesses are seeing big data opportunities everywhere. The bad news is that few firms are clear on how to pursue those opportunities. The data lake use cases that initially called for “ingest,” “transform,” and “discover” are morphing to support more sophisticated forms of analytics, including machine learning and AI.
- Big data developer models are under development. Most big data deployments are tool driven. Outcomes remain important, but establishing big data business capabilities and reusable programming elements is a tertiary goal, at best. This becomes especially challenging as enterprises seek to close the loop between big data and more rigorously managed operational applications. CIOs must invest in big data development capabilities that can reliably deliver business outcomes even as tools and tool sets evolve.
- Administrative roles and tasks are constantly being reshuffled. Big data projects typically involve business people, data scientists, infrastructure administrators, developers — and often third parties. Each of these groups identifies a different primary problem and often sees the others as obstructing progress. Similar to the need for development capabilities, CIOs need to establish administrative regimes that span IT, business, and partner roles to ensure smooth and streamlined execution of big data administrative tasks.
Big Data Use Cases Are A Moving Target
Delivering analytics applications in the OLTP era was challenging, but two factors made it easier. First, OLTP technologies, like DBMSs, transaction monitors, or admin tools, generally were managed by a single, strong vendor that could use its influence or control to design and deliver a comprehensive OLTP tool set, including tools for reporting and business intelligence. Second, use cases were well understood and stable, reflecting process conventions, like accounting or HR, set by third parties, like the SEC or the courts. Data administrators and database administrators had significant visibility into data model and run-time requirements, which were relatively common across implementations. Consequently, analytics applications often were refinements on similar themes, making the process of discovering, extracting, transforming, and loading data sources more manageable.
Today, big data pros don’t enjoy the same foundation. First, big data tool sets are evolving based on a highly distributed open source model. While this fuels tremendous innovation, it also means that big data tools per se can’t be the basis for defining strategic big data capabilities. Second, many of the use cases for big data are not predicated on common business conventions. For example, customer engagement and experience is itself a rapidly evolving art, but one that is crucial to business success in a digital world. Big data pros and IT leaders should therefore take two steps. First, they should seek tool packaging that is designed to solve concrete big data development and administration problems while sustaining access to the continuous flow of big data tool innovation emanating from open source communities. Second, they should invest in big data-related business capabilities — especially in the developer and administrator domains — that can accrete big data experience and deliver big data successes even as tool sets evolve.
Big Data Developer Models Are Under Development
To consistently achieve big data business outcomes, even as use cases proliferate, CIOs and other senior IT leaders must establish strategic big data management capabilities and choose tools that can streamline developer efforts to generate crucial big data constructs (see Figure 1). Our research shows that these capabilities are best organized in terms of a complex analytic pipeline, starting with ingestion, then moving through transformation, data discovery, and analysis, and ending with operationalization. Tool choices strongly influence the speed, costs, and execution quality of developer efforts to create key big data constructs.
Figure 1: Big data tools developed as different projects are creating unintended complexity for developers
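To make the pipeline stages concrete, here is a minimal PySpark sketch that walks through ingestion, transformation, discovery, analysis, and operationalization. The clickstream scenario, HDFS paths, and column names are hypothetical, and Spark is only one of many tool choices for these stages.

```python
# A minimal sketch of the analytic pipeline stages, expressed in PySpark.
# The clickstream scenario, HDFS paths, and column names are illustrative
# assumptions, not a prescribed implementation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Ingest: land raw events from a source system (here, JSON files on HDFS).
raw = spark.read.json("hdfs:///landing/clickstream/")

# Transform: cleanse and reshape the data into an analysis-friendly form.
events = (raw
          .filter(F.col("user_id").isNotNull())
          .withColumn("event_date", F.to_date("timestamp")))

# Discover: profile the data to understand its shape before deeper analysis.
events.groupBy("event_type").count().show()

# Analyze: compute a simple per-user engagement metric.
engagement = (events
              .groupBy("user_id", "event_date")
              .agg(F.count("*").alias("events_per_day")))

# Operationalize: publish results where downstream applications can read them.
engagement.write.mode("overwrite").parquet("hdfs:///serving/engagement/")
```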
The challenge facing big data leaders is that many big data tools are highly specialized, each supporting specific data management capabilities with its own approach to generating developer constructs. As a result, big data tools usually don’t mix and match across the pipeline. For example, developers have to identify and locate data sets in different ways (addressing and namespaces) for each tool in the pipeline. Each tool also has a different programming model for manipulating data, from the assembly language-equivalent of MapReduce to the multi-language Spark execution engine that combines streaming, batch processing, SQL, graph processing, and machine learning. Tools also differ in their transaction models (where they exist at all), in how they source data, and in how they pipe data to the next step in the tool chain.
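As an illustration of how far apart these programming models sit, compare the classic word count written in the MapReduce style (here as Hadoop Streaming-style mapper and reducer functions) with the same logic as a single chained Spark expression. The HDFS input path is a hypothetical example.

```python
# Word count two ways, illustrating the programming-model gap between
# MapReduce and Spark. The HDFS input path is a hypothetical example.

# --- MapReduce style: a Hadoop Streaming-style mapper and reducer, normally
# two separate scripts wired together by the framework's shuffle/sort phase.
def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")            # emit (key, value) pairs

def reducer(sorted_lines):
    current, total = None, 0
    for line in sorted_lines:              # input arrives sorted by key
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

# --- Spark style: the same computation as one chained expression.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")
counts = (sc.textFile("hdfs:///data/corpus/")  # addressing via an HDFS URI
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
```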
Administrative Roles And Tasks Are Constantly Being Reshuffled
DBAs and database architects managing use cases for analytic data pipelines work with the same capabilities as developers, but with a different set of tasks: admins have to keep the pipeline operating, while developers are responsible for the results. Admin tasks cover security, logging, availability and recovery, scheduling & orchestration, and elasticity, among others.
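To illustrate just one of these tasks, scheduling & orchestration, the sketch below uses Apache Airflow as one representative orchestrator to run the pipeline stages in dependency order. The DAG name, schedule, and spark-submit commands are hypothetical assumptions.

```python
# A sketch of the scheduling & orchestration admin task, using Apache Airflow
# as one representative orchestrator. The DAG name, schedule, and commands
# are hypothetical assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="clickstream_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run the pipeline once per day
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest",
                          bash_command="spark-submit ingest.py")
    transform = BashOperator(task_id="transform",
                             bash_command="spark-submit transform.py")
    analyze = BashOperator(task_id="analyze",
                           bash_command="spark-submit analyze.py")

    # Express the dependency order of the pipeline stages.
    ingest >> transform >> analyze
```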
Figure 2: Big data tools developed as different projects are creating unintended complexity for administrators
As with developer capabilities and tasks or constructs in traditional, integrated SQL DBMSs, admin tasks typically feature consistency and integration. Authenticating and authorizing users relies on one common service that works across ingesting, transforming, discovering, and analyzing data, and operationalizing the results (see Figure 2). In a mix-and-match open source world, that seamless integration of admin tasks typically doesn’t exist because the different tools each have their own implementation of tasks or constructs. For example, different components have different authentication and authorization models, which makes intrusion prevention much more difficult. The components also handle high availability differently, with different approaches to handling faults and recovery. Elasticity is another example: components claim and release infrastructure resources differently in order to support concurrency for their internal workload managers. Vendors attack this problem of admin complexity with technologies such as unified storage or proprietary management tools, among others. But these approaches are having trouble keeping up with the seemingly endless stream of new technologies coming from all corners of the big data ecosystem.
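For example, a team wiring Python jobs to both HDFS and Kafka must configure security twice, in two unrelated idioms. The sketch below assumes the third-party hdfs and kafka-python client libraries; the hostnames, topic, and credentials are hypothetical.

```python
# Two components, two unrelated security idioms. Assumes the third-party
# `hdfs` and `kafka-python` client libraries; hostnames, topic, and
# credentials are hypothetical.

# HDFS (via WebHDFS) commonly authenticates with Kerberos/SPNEGO tickets.
from hdfs.ext.kerberos import KerberosClient

hdfs_client = KerberosClient("https://namenode.example.com:50470")
hdfs_client.list("/landing/clickstream")

# Kafka, by contrast, is commonly configured with SASL credentials passed
# directly on the consumer.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers=["broker.example.com:9093"],
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="pipeline-svc",
    sasl_plain_password="CHANGE_ME",  # inject from a secret store in practice
)
```

Each additional component in the pipeline brings yet another such idiom, which is precisely the integration burden that unified storage layers and proprietary management tools try, with mixed success, to absorb.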