Contributing Analysts
- George Gilbert
- Ralph Finos
- Peter Burris
Overview
The big data market is changing rapidly as mainstream use moves beyond data lakes. Vendors are helping customers better manage the complexity of the big data ecosystem. Applications increasingly require real-time analytics to inform a decision, whether by a human or machine. And the analytics driving those decisions are beginning to get their own pipeline, especially when machine learning is involved.
This report describes Wikibon’s definition of big data and the big data categories we track. It also explains the methodology underpinning our big data market forecasts and related forecasts, as well as our vendor big data market shares, and details how definitions have changed since the 2016 forecast. This research document serves as a basis for the following Wikibon research:
- “Forecasting Spark’s Adoption in the Context of Systems of Intelligence”
- “Forecasting Big Data Application Patterns”
- “Enhancing Systems of Intelligence”
- “2015 Big Data Market Shares”
Big Data Definition and Examples
The big data market is defined as workloads with data sets whose size, type and variety, speed of creation, and velocity make them impractical to process and analyze with traditional infrastructure and software technologies, and which therefore require new tools and management processes to execute and manage successfully.
Wikibon counts revenues derived from sales of hardware, software, and services to the end users of big data, who in turn apply big data and big data analytics in their enterprises. Typically, this involves an enterprise purchasing:
- Compute clusters, shared or commodity storage, or other big data-related hardware infrastructure.
- Big data tools, analytic software, databases, middleware, application software and application or infrastructure services (e.g., SaaS services) that will utilize the big data hardware infrastructure and create business value.
- External professional services (business and IT consulting, system integration, application development, data management) that are necessary to realize the value of big data for an enterprise.
Representative workloads that meet our big data definition include the following; the list is meant to be illustrative, not exhaustive.
- Customer data analytics
  - Customer segmentation
  - Customer churn analysis
  - Recommendation engine (e.g., upselling customers)
  - Patient diagnosis, outcomes and remote management
  - Sentiment analysis
- Transaction analytics
  - Fraud detection
  - Marketing campaign analysis
  - Supply chain optimization
  - Workflow optimization
  - Risk management
    - Financial
    - Weather
- Security Analytics – anomaly detection, threat assessment
- Machine-related monitoring/prediction
  - IT operations support
  - Industrial equipment predictive maintenance and failure reporting
  - Gas & oil field monitoring
  - Network monitoring and analysis
  - Smart meters and Smart Grid
  - Application performance management
- Distributed sensor data management, analysis, and coordination (Internet of Things)
- Spatial and Location-based applications
  - Smart Cities
  - Transportation
  - Emergency response
  - Field force management
- Rich media analytics
  - Surveillance
  - Entertainment
  - Content Analytics
- Data Lakes
Hardware, Services, and Public Cloud Categories
HARDWARE includes commodity compute, storage, and networking that supports scale-out big data software. A fundamental premise of big data is to leverage a scale-out (versus scale-up) software approach where possible. As a result, the majority of hardware-related revenue in the big data market is associated with commodity servers with direct-attached storage. A major change in hardware will start to be deployed in the second five years of the forecast: edge computing will apply ever more intelligence to make ever faster decisions on information coming from sensors on physical devices at the edge of the network. For the purposes of this study, Wikibon includes the IaaS portion of big data public cloud services (e.g., AWS, Microsoft Azure) as hardware.
PROFESSIONAL SERVICES help practitioners apply the technology to real-world business problems. This includes identifying initial use cases, designing and deploying the supporting infrastructure, architecting data flows and transformations to create data lakes, managing the data science to derive and operationalize insights from the data, and consulting and education. The market is led by the usual large system integrators (IBM, Accenture, etc.), but also includes thousands of small and mid-sized SIs and consultancies, especially those with traditional data warehouse, analytics, and business intelligence DNA, as well as those with vertical market expertise. While professional services make up the largest slice of the overall big data market today, Wikibon expects them to become less critical to big data projects over the long term (5-10 years out) as software (both platforms and applications) and hardware (private and public clouds) mature and as more packaged applications emerge. Technology maturity should make big data applications more accessible to less sophisticated practitioners, thereby requiring a smaller mix of professional services. Highly sophisticated professional services firms will always have a critical role to play at the leading edge of the market; vendors such as IBM and Accenture will likely continue to serve customers with sophisticated application requirements that deliver competitive differentiation. Among the most challenging applications over the next 5-10 years will be the management of industrial devices informed by sensor data analyzed by a “digital twin”, as GE has termed it. The knowledge to build these applications will primarily come from successive implementations by leading-edge professional services firms in a particular domain, with each implementation adding to a body of expertise that gradually becomes more widely shared and embodied in packaged applications.
PUBLIC CLOUD SERVICES are becoming the primary choice for big data workloads. The market for IaaS and PaaS supporting native and third-party big data tools and workloads is an increasingly important factor in the market. Since early 2016 a rapidly growing portion of big data workloads and apps have been finding their way to the cloud as customers experience the challenge of running heterogeneous distributed systems. Multi-vendor complexity is also driving adoption of cloud-native services from AWS (Kinesis, EMR, DynamoDB, Redshift, etc.), Azure, Google, and IBM. In addition, as more data originates across widely dispersed networks, including at the edge, more of that data gets aggregated in the cloud. Traditionally, terminals created data at the center and applications pushed it outward to larger, consuming audiences. Hybrid solutions in which data moves back and forth between on-premises workloads and the cloud are also becoming more frequent, within the obvious latency constraints posed by moving such large data sets. Public cloud revenue is treated as an orthogonal view of the overall forecast. PaaS and SaaS software native to public clouds is included in our software figures, while software delivered via third parties is attributed to each third-party provider. Hardware (typically ODM) required to deliver big data public cloud services includes IaaS that supports native and third-party software, as well as hardware services that enterprises use directly for their own in-house big data code.
SOFTWARE CATEGORIES
SOFTWARE CATEGORIES for 2017 have changed to capture application trends.
- There are now two classes of databases, one to support application processing, the other to support analytics.
- Data science pipelines, which manage the machine learning process, are now their own category after having been incorporated into 2016’s “Data Management” and “Big Data Applications, Analytics, and Tools” categories.
- Application infrastructure partially replaces “Core Technologies”, with the balance mostly in “Stream Processing.”
- Machine learning packaged applications, while nascent, have their own category in order to reflect their future importance.
See the section covering “2017 Changes” for details of the software taxonomy changes relative to the 2016 report. That section also details the changes made in 2016 relative to 2015.
APPLICATION DATABASES manage application state and will coopt functionality from analytic databases. Managing the state of applications used to mean there was a single source of truth, and traditionally SQL RDBMSs managed that single source of truth so it could be shared across the various application modules. But with the rise of big data applications and internet-scale Web and mobile apps, NoSQL databases relaxed some of the transactional consistency restrictions to enable orders of magnitude greater scale; those restrictions also mattered less in the context of microservices. Application databases include AWS DynamoDB, Cassandra, HBase, MongoDB, Kudu, and VoltDB. AWS RDS databases also belong in this category. The category includes a share of the traditional vendors working at the extreme end of their scalability, including Oracle Exadata, IBM’s DashDB and Cloudant, Azure SQL DB, and the recently available Spanner service from Google. Databases that currently span the analytic and application database categories include SnappyData, Splice Machine, MemSQL, and Iguaz.io. We expect application databases to coopt more analytic functionality over time as applications need low-latency analytics to inform decisions, whether made by humans or machines. For more detail see the Application Databases section of the 2017 Big Data and Analytics Forecast.
Rationale for separation from analytic databases: Application databases warrant their own category because they occupy a different place in the data pipeline than analytic databases.
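To make the pattern concrete, the sketch below shows single-record state management against an application database. It is a minimal, illustrative example only, assuming the boto3 client for DynamoDB (one of the products named above); the table name and schema are hypothetical, and any of the scalable application databases in this category could play the same role.

```python
# Minimal sketch (assumes boto3 is installed and AWS credentials are
# configured). An application database keeps the current state of each
# record at scale; single-item operations are atomic, while cross-record
# transactional guarantees are relaxed relative to a traditional RDBMS.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders = dynamodb.Table("orders")  # hypothetical table and schema

# Write the current state of one order.
orders.put_item(Item={
    "order_id": "o-1001",   # partition key (hypothetical)
    "status": "shipped",
    "total_cents": 4599,
})

# Read it back. A strongly consistent read reflects the latest committed
# write, illustrating the tunable-consistency trade-off these stores expose.
resp = orders.get_item(Key={"order_id": "o-1001"}, ConsistentRead=True)
print(resp.get("Item"))
```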
ANALYTIC DATABASES start with Data Lakes, which have taken on progressively more of the traditional roles of data warehouses, from serving as an archive for data too voluminous to be collected in one repository, to supporting traditional business intelligence, to supporting the machine learning pipeline. With the line blurring between data lakes and data warehouses, this category includes a fraction of the data warehouse revenue of AWS Redshift and Azure SQL DW, as well as emerging products from independent vendors such as Crate.io and Snowflake. It also includes all of the revenue of the MPP SQL DBMSs on Hadoop, such as Pivotal HAWQ, Teradata’s Aster and Presto, and the offerings from Hadoop distribution vendors, including Cloudera Impala, MapR Drill, and Hortonworks Druid. Several databases span the data warehouse and application database categories, such as Splice Machine, MemSQL, and Transwarp; for these, half their revenues are counted here and half in the application database category. Iguaz.io and SnappyData span data warehouses, application databases, and stream processors, so we allocate them across three categories. Also included here is a share of the business intelligence tools, such as Tableau, Qlik, PowerBI, and Zoomdata, that present the data in these databases.
Rationale for separation from application databases: Specialized analytic databases need their own category for several reasons. Most important, applications don’t rely on them to keep track of every change in the state of a process. In addition, they occupy a different place in the analytic data pipeline than application databases: they serve big, historical data.
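As a simple illustration of the analytic side, the sketch below runs a BI-style aggregation over historical data in a data lake. It is a minimal example under stated assumptions: PySpark is installed and a hypothetical Parquet dataset of customer events exists at the path shown; any of the analytic engines named above could serve a similar scan-and-aggregate query.

```python
# Minimal sketch (assumptions: local PySpark install; a Parquet dataset of
# customer events at the hypothetical path below with event_ts, region, and
# customer_id columns). An analytic database scans big, historical data in
# one pass rather than tracking per-record application state.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("monthly-active-customers").getOrCreate()

# Historical events land in the lake as columnar files.
events = spark.read.parquet("/data/lake/customer_events/")  # hypothetical path

# Typical analytic query: aggregate months of history for BI or ML features.
monthly = (
    events
    .groupBy(F.date_trunc("month", "event_ts").alias("month"), "region")
    .agg(F.countDistinct("customer_id").alias("active_customers"))
    .orderBy("month", "region")
)
monthly.show()
```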
DATA SCIENCE PIPELINES manage the machine learning lifecycle with their own design-time and run-time states that are distinct from the design-time and run-time states of the broader data pipeline. The data science pipeline includes stages for ingesting and exploring data, preparing and integrating it, identifying model features, testing the model, deploying it, and integrating continual data feedback into the model. Dozens of vendors offer specialized functionality at various stages of the pipeline. The public cloud vendors have traditionally offered raw algorithms for customers to train. More recently, Google, Microsoft, and Amazon have begun offering fully trained services that support conversational user interfaces as well as machine vision. Microsoft and IBM have also begun offering higher-level untrained templates for subscriber churn, recommendations, etc. GE and other industrial companies have begun offering more fully trained asset management and predictive maintenance for equipment their customers operate.
Rationale for separation from Data Management; Applications & Analytics; and Hadoop, Spark, and Streaming: The data science tool chain has its own design-time and run-time pipelines that are separate from those of the broader analytic data pipeline. In addition, these tools are maturing well behind the other categories.
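The sketch below illustrates the design-time portion of such a pipeline: prepare the data, fit a model, and score it on held-out data before deployment and feedback integration. It is a minimal, hypothetical example assuming scikit-learn and synthetic data, not a depiction of any specific vendor’s pipeline product.

```python
# Minimal sketch (assumes scikit-learn is installed; data is synthetic).
# The stages (prepare/integrate, train, hold-out test) are chained into one
# pipeline object so the whole artifact can be versioned, deployed, and later
# retrained as feedback data arrives.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a prepared, labeled feature table.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # prepare/integrate features
    ("model", LogisticRegression()),  # candidate model to train
])
pipe.fit(X_train, y_train)

# Test stage: evaluate on data the model has never seen before deployment.
print("hold-out accuracy:", pipe.score(X_test, y_test))
```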
APPLICATION INFRASTRUCTURE is the big data equivalent of an application server. This infrastructure is the platform or execution engine for the application logic defined by the developer. Examples include Splunk and the various execution engines in Hadoop such as MapReduce and Hive. The category includes a portion of Spark and Flink, since they also belong partly in stream processing. Application infrastructure also includes the management software that keeps applications and infrastructure running, such as AppDynamics, New Relic, Rocana, and Unravel Data. Cask also belongs in this category by virtue of creating an application abstraction layer above Hadoop infrastructure, including Spark and Kafka.
Rationale for separation from Data Management and Hadoop, Spark, and Streaming: The big data platform technology on which applications are built contains the application logic and needs to be separate from data managers such as databases and stream processors.
STREAM PROCESSING is complementing batch and request/response application patterns, especially when low-latency analytics have to be applied to each event in the stream. Streaming is becoming an application design pattern in its own right, one that continuously connects autonomous microservices which manage their own state and which may themselves operate in any of the three application patterns. In addition to microservices, stream processing is fundamental to IoT applications, which might analyze each event locally and forward to the cloud only those events relevant for global model retraining. Stream processing represents a very different programming model from batch and request/response, and that difference makes its adoption slower. Developers who are familiar with the other two patterns and with the use of databases to share data will continue to be pervasive for years. However, newer developers, as well as those building applications that process data continuously, such as IoT applications, will grow stream processing’s share. Products that include stream processing in whole or in part include Spark’s Structured Streaming, Flink, Samza, Azure Stream Analytics, Google Dataflow, AWS Kinesis Analytics, IBM Streams, and Kafka Streams. Products that are currently transport-focused include Kafka, NiFi, and MapR Streams.
Rationale for separation from Hadoop, Spark, and Streaming: Stream processing needs its own category for several reasons. Core Hadoop as well as core Spark were designed for batch processing. Streaming, by contrast, is the foundation for a new application design pattern where data is continually processed, typically by a collection of microservices. In addition, stream processing and analytics are the foundation for front-line analytics for IoT applications.
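To ground the pattern, the sketch below shows the event-at-a-time style: each incoming event updates local state, and only the relevant events are forwarded downstream. It is a minimal, illustrative example assuming the kafka-python client, a broker at a placeholder address, and hypothetical topic names and thresholds; the streaming engines and services listed above express the same pattern through their own APIs.

```python
# Minimal sketch (assumptions: kafka-python is installed, a broker runs at the
# placeholder address, and the topic names and threshold are hypothetical).
# Each event is analyzed as it arrives; local state is kept in-process and
# only anomalous events are forwarded downstream (e.g., edge to cloud).
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "sensor-readings",                      # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

running_max = {}                            # local, per-device state
for msg in consumer:                        # continuous, event-at-a-time loop
    reading = msg.value
    device, temp = reading["device_id"], reading["temperature"]
    running_max[device] = max(temp, running_max.get(device, temp))
    if temp > 90.0:                         # forward only the relevant events
        producer.send("anomalies", value=reading)
```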
MACHINE LEARNING packaged applications are still emerging. Some of the first categories include micro-apps such as anti-money laundering, departmental apps such as cybersecurity, vertical apps such as CPG demand and replenishment planning, and ecosystem apps such as ad-tech. A separate taxonomy of machine learning application services designed to be integrated into broader applications is also emerging; the conversational user interface services are described in the data science pipeline category. IBM, Microsoft, and AWS are starting to include fully trained models in horizontal and vertical micro functions. This last category is currently extremely nascent.
Rationale for separation from applications and analytics: Applications have to be separate from the tools used to build them. As the tools mature, it becomes easier to build packaged apps, but tools and applications each have their own data pipelines and lifecycles.
Category Changes in 2016 Report (Last Year’s) From 2015 Report
Software categories in 2016 (last year’s report) were changed in order to capture application trends; Wikibon redefined some of the software categories to accommodate trends we saw emerging over the forecast period. The following are all of the 2016 software categories, with an explanation of which changed and why:
- Data Management – No change from our 2015 Big Data Market Forecast.
- Core Technologies – Hadoop was changed to include Spark. We extended the Hadoop software category to account for Spark and Streaming, since we expected these to become increasingly important components of big data solutions in the coming years.
- Big Data Database – Reflected the combination of SQL and NoSQL databases. Wikibon expected to see NoSQL databases gain SQL query capabilities.
- Big Data Applications, Analytics, and Tools – No change from our 2015 Big Data Market Forecast.
Database: Last year’s forecast put all databases in one category, including traditional application SQL OLTP and HTAP databases, MPP data warehouses, and the newer NoSQL application databases.
Hadoop, Spark, Streaming: The core big data compute engines were grouped in one category. The engines included MapReduce, Hive (on Tez or MapReduce), Spark and all its libraries including Streaming, Flink, and Kafka, among others.
Data management: A catch-all category for data preparation, integration, and transformation, as well as governance tools such as lineage, quality, and compliance. Vendors included IBM, SAP, Informatica, Oracle, Talend, Syncsort, Pentaho, Datameer, Attunity, Paxata, and Trifacta.
Applications & analytics: Combined machine learning apps with business intelligence and machine learning tools. Packaged big data and machine learning applications have been so immature that we included the tools used in custom development in this category. We defined the tools category broadly to include machine learning (H2O, Parallel Machines, Revolution Analytics, etc.) as well as business intelligence and data visualization (Tableau, Qlik, PowerBI, and Zoomdata).
Themes driving 2016 definition changes (last year’s forecast) from prior years:
- Narrower definitions focus more on relevant technologies than just projects with a big data “mindset.”
- Customer adoption spreads beyond Web-native enterprises and requires commercial support.
It is critically important to understand how Wikibon defines big data as it relates to the overall market size and, in particular, to revenue estimates for specific vendors. In our 2016 report, which sized the 2014 and 2015 markets and projected through 2026 (“2016 – 2026 Worldwide Big Data Market Forecast”), we altered our definition to reflect the evolution of the market and the appearance of product suites from a wide variety of vendors aimed at solving big data problems.
In Wikibon’s prior forecasts, we included spending on projects where practitioners embraced an exploratory and experimental mindset regarding data and analytics, replacing gut instinct with data-driven decision-making. In prior years, workloads and projects whose processes were informed by this mindset met Wikibon’s definition of big data, even in cases where some of the tools and technologies involved might not otherwise have been included.
This definition was suitable for the state of an early market where practitioners were often using free, open-source tools to experiment with the possibilities of realizing the business value of this technology. The market was led in this period by Fortune 500 enterprises vying for advantage at large scale and by web-native data-driven companies. We believe that the market has moved beyond its infancy into (perhaps) its adolescence and is well on its way to a modest level of maturity and rationalization that will help facilitate real business process solutions and value.
As a result, our 2016 forecast took a narrower perspective that reflected big data’s move into the mainstream, with a landscape of targeted tools and solutions aimed at enabling more modest enterprises to participate meaningfully. We also recognized that users were moving beyond free open-source tools to paying for support, enabling them to move confidently to real workloads with real business value.
As such, we believed it was time to require that the big data workload and data set be the primary gate, with “mindset”, per se, excluded. However, we do include some traditional databases and application tools that offer capabilities and extensions to support big data use cases where workload and data set size, type, and speed of creation are the primary considerations for solution selection.
Big Data Market Share & Forecast Research Methodology
Wikibon has built a body of vendor and user research on big data since 2012, when we initiated our first annual Big Data Market Forecast. Wikibon’s big data market size, forecast, and related market-share data are based on hundreds of extensive interviews with vendors, venture capitalists, and resellers regarding customer pipelines and product roadmaps; conversations on theCUBE at big data events; feedback from the Wikibon community of IT practitioners; and third-party sources such as public financial data, media reports, and Wikibon and third-party surveys of big data practitioners. Information types used to estimate the revenue of private big data vendors included supply-side data collection, number of employees, number of customers, size of average customer engagement, amount of venture capital raised, and the vendor’s years of operation.
Wikibon’s overarching research approach is “Top-Down & Bottom-Up.” That is, we consider the state and possibilities of technology in the context of the potential business value that is deliverable (Top-Down) and compare that with both supply-side (vendor revenue and directions, product segment conditions) and demand-side (user deployment, expectations, application benefits, adoption friction, and business attitudes) perspectives. In general, we believe a ten-year forecast window is preferable to a five-year forecast for emerging, disruptive, and dynamic markets because significant market forces, involving both providers and users, won’t play out completely over a shorter period. By extending our window, we can describe these trends better and indicate more clearly how Wikibon believes they will play out.
Action Item
Big data and machine learning represent a journey, not a destination, toward applying data-driven decisions in applications. While big data pros may not be able to skip entire generations of technology, they can time their investments to match their internal skills with appropriate technology maturity levels.