Premise
- The analytic data pipeline was stable for decades, providing business intelligence in the form of historical performance reporting from enterprise applications
- We are in the midst of a transition toward a near real-time “convergence” of analytics within Systems of Intelligence
- The emerging analytic pipeline, which we call Big Data 3.0, will collect data from smart, connected products (IoT) and optimize their performance as part of a larger system
- There is a major intermediate step in this transition that can provide “training wheels” for IT leaders and practitioners: managing the operation of their modern apps through the analysis of their log data
The emerging analytic data pipeline
Sometimes the best way to look forward is to look back and find hints about the future by connecting the dots between past and present. If we look at the evolution of the analytic data pipeline, three key directions are emerging. Harnessing them fully in the future requires aligning investment in human and financial capital now. They are:
- Fully leverage the decline in the cost of capturing and storing data from $700M/TB 30 years ago to roughly $50/TB today
Covered in Part 2:…
- Deliver near real-time responsiveness between capturing data and driving an action
- Build towards “converged” analytics, which enables any type of analytics on any type of data
Fully leveraging the decline in the cost of storing (and processing) data from $700M/TB 30 years ago to roughly $50/TB today
If data volumes for any workload can grow by drawing on an ever-larger number of sources, the default database choice should be one that runs on commodity clusters
The widely cited study claiming that the world’s digital data is growing 40% per year is wrong. That number conveys neither the real growth nor the urgency of adopting the infrastructure and skills to build a new generation of analytic data pipelines.
The supply of data, as opposed to the cost of storing it, is essentially limitless. Traditional applications captured all their data through human data entry, at a roughly constant cost of about $1bn/TB. But almost all information from all sources is now generated in digital form, at zero marginal cost.
Harnessing as much of that new data as possible starts with capturing the log data from applications. Mainstream database technology has traditionally bottlenecked on expensive shared storage that required SAN or NAS appliances. Learning how to capture and process data on commodity clusters is critical, and the new event log data being collected makes that much easier than traditional business application data did.
Traditional business transactions often updated existing records, and supporting that was much easier when all the database processing nodes could share the same (scale-up and expensive) storage. Log events, by contrast, are all unique: each component or sensor emits one event at a time, each with its own timestamp. Events only get appended or inserted into the database, which makes commodity clusters much easier to use because different database nodes never need to update the same data at the same time.
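To make the append-only pattern concrete, here is a minimal sketch in Python; the event fields, source names, and log file path are illustrative assumptions, not details from any particular product:

```python
import json
import time
import uuid

def emit_event(source, payload, log_path="events.log"):
    """Append one immutable log event; existing records are never updated."""
    event = {
        "event_id": str(uuid.uuid4()),  # every event is unique
        "timestamp": time.time(),       # every event carries its own timestamp
        "source": source,               # the component or sensor that emitted it
        **payload,
    }
    # Append-only: each writer adds a new record, so nodes in a commodity
    # cluster never contend to update the same row at the same time.
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

emit_event("checkout-service", {"action": "order_placed", "order_id": 1234})
emit_event("temp-sensor-07", {"celsius": 21.5})
```

Because each record is immutable and carries its own timestamp, it can be routed to any node in the cluster without cross-node locking, which is exactly what makes the commodity-cluster approach tractable.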
The variety of “things” emitting events requires new ways of storing information, almost like ingesting everything into a Data Lake and then reorganizing it into a Teradata data warehouse
So many applications, and the services within them, are emitting events whose structure evolves along with them that new database storage techniques are necessary. JSON has emerged as the preferred way of representing this data. It has the flexibility to handle the variety of machine-generated information, and it is easy for developers to read when they’re working with it.
Traditional business transactions come from the same forms, so common transactions can all be stored together in common tables. But JSON “documents” carry no such guarantee that they will all be alike. So for a database to be as accessible as a traditional SQL database, it has to be much cleverer about organizing the data. Under the covers it has to take on more of the admin work of tuning the physical layout of the data to deliver on the performance expectations of end users.
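A small, hypothetical example of why that is harder than it sounds: two perfectly valid JSON events from different services share almost no fields, so there is no single fixed table layout that fits both without the database reorganizing things under the covers.

```python
# Two well-formed JSON events from different (hypothetical) services;
# their field sets barely overlap.
order_event = {
    "timestamp": "2024-05-01T12:00:00Z",
    "service": "checkout",
    "order_id": 1234,
    "items": [{"sku": "A-17", "qty": 2}],
}
sensor_event = {
    "timestamp": "2024-05-01T12:00:01Z",
    "service": "thermostat",
    "device_id": "temp-sensor-07",
    "celsius": 21.5,
}

# A single relational table would need either a sparse schema full of NULLs
# or a separate table per event shape; a document database stores each event
# as-is and tunes the physical layout behind the scenes.
print(set(order_event) & set(sensor_event))  # only {'timestamp', 'service'}
```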
(to be covered in part 2 of 2 posts on this topic)…
- Deliver near real-time responsiveness between capturing data and driving an action
- Build towards “converged” analytics, which enables any type of analytics on any type of data
Action Items
- To get ready for the future of smart, connected products, practitioners must deal with the current equivalent of IoT: the portfolio of modern applications.
- Properly collecting and analyzing their log data in order to manage them requires putting in place many of the skills, infrastructure, and processes necessary to support the IoT of tomorrow.
- Running Hadoop on a public cloud is likely the quickest way to acquire the skills to manage elastic big data applications on shared infrastructure.
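As one illustration of how low the barrier to entry has become, the sketch below asks Amazon EMR for a small, transient Hadoop cluster via boto3; the region, release label, instance sizes, and IAM role names are placeholder assumptions to replace with values from your own account.

```python
import boto3

# Request a small, transient Hadoop cluster on Amazon EMR.
# Region, release label, instance types, and role names are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-analytics-sandbox",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down when idle
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
)

print("Cluster requested:", response["JobFlowId"])
```

Starting and discarding clusters like this, rather than provisioning fixed on-premises hardware, is what builds the habit of treating big data infrastructure as elastic.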