
Getting Ready for Big Data 3.0 and the IoT

Premise

  • The analytic data pipeline was stable for decades, providing business intelligence in the form of historical performance reporting from enterprise applications
  • We are in the midst of a transition toward a near real-time “convergence” of analytics within Systems of Intelligence
  • The emerging analytic pipeline, which we call Big Data 3.0, will collect data from smart, connected products (IoT) and optimize their performance as part of a larger system
  • There is a major intermediate step in this transition that can provide “training wheels” for IT leaders and practitioners: managing the operation of their modern apps through the analysis of their log data

 

The emerging analytic data pipeline

Sometimes the best way to look forward is to look back and see if there are hints about the future by connecting the dots between past and present.  If we look at the evolution of the analytic data pipeline, three key directions are emerging.  Harnessing them fully in the future requires aligning investment in human and financial capital now.  They are:

  • Fully leverage the decline in the cost of capturing and storing data from $700M/TB 30 years ago to roughly $50/TB today

Covered in Part 2:…

  • Deliver near real-time responsiveness between capturing data and driving an action
  • Build towards “converged” analytics, which enables any type of analytics on any type of data

 

Fully leveraging the decline in the cost of storing (and processing) data from $700M/TB 30 years ago to roughly $50/TB today

If the data volume for any workload can grow by drawing on an ever larger number of sources, the default database choice should be one that works on commodity clusters

The widely cited study claiming that the world’s digital data is growing 40% per year is wrong.  The number conveys neither the real growth nor the urgency of adopting the infrastructure and skills to build a new generation of analytic data pipelines.

The supply of data, unlike the cost of storing it, is essentially limitless.  Traditional applications captured all of their data through human data entry, a cost that has stayed roughly constant at about $1bn/TB.  But almost all information from all sources is now generated in digital form, at essentially zero marginal cost.

Harnessing as much of that new data as possible starts with capturing the log data from applications.  Mainstream database technology has traditionally bottlenecked on expensive shared storage in the form of SAN or NAS appliances.  Learning how to capture and process data on commodity clusters is critical, and the new event log data being collected makes that much easier than it is with traditional business application data.
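To make that concrete, here is a minimal sketch of the first step in that pipeline: turning a raw application log line into a structured event.  The log format, field names, and regular expression are assumptions for illustration, not a prescription.

```python
import json
import re

# Hypothetical web-server access-log line; real pipelines keep one pattern per log format.
RAW = '203.0.113.7 - - [12/Mar/2016:10:05:03 +0000] "GET /api/orders HTTP/1.1" 500 1042'

PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

match = PATTERN.match(RAW)
if match:
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    # Emit the event as JSON so any downstream store can ingest it without a rigid schema.
    print(json.dumps(event))
```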

Traditional business transactions often updated information in place, and supporting that was much easier when all of the database processing nodes could share the same (scale-up, expensive) storage.  Log events, by contrast, are all unique: each component or sensor emits one event at a time, each with its own timestamp.  Events are therefore only ever appended or inserted into the database, which makes it much easier to use commodity clusters, because different database nodes never need to update the same data at the same time.
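A minimal sketch of that append-only pattern is shown below.  The local directory stands in for a distributed file system on a commodity cluster, and every field name is hypothetical; the point is simply that each event is written once and never updated.

```python
import json
import time
from pathlib import Path

LOG_DIR = Path("events")        # stand-in for HDFS or object storage on a commodity cluster
LOG_DIR.mkdir(exist_ok=True)

def ingest(event: dict) -> None:
    """Append one immutable event; nothing is ever updated in place."""
    event["ts"] = time.time()   # every event carries its own timestamp
    # Partitioning by hour lets any node own a shard without coordinating writes with its peers.
    shard = LOG_DIR / time.strftime("%Y-%m-%d-%H.jsonl")
    with shard.open("a") as f:
        f.write(json.dumps(event) + "\n")

ingest({"source": "checkout-service", "level": "ERROR", "msg": "payment timeout"})
ingest({"source": "sensor-42", "level": "INFO", "reading_c": 21.7})
```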

The variety of “things” emitting events requires new ways of storing information – almost like ingesting everything into a data lake and then reorganizing it into a Teradata data warehouse

So many applications, and the services within them, are emitting events whose formats evolve along with the software that new database storage techniques are necessary.  JSON has emerged as the preferred way of representing this data: it has the flexibility to handle the variety of machine-generated information, and it is easy for developers to read when they’re working with it.
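As an illustration (the device and field names are invented), a single machine-generated event might look like this in JSON: nested structure is preserved, and a developer can read it at a glance.

```python
import json

# Hypothetical event from a smart, connected product; all field names are illustrative.
event = {
    "device_id": "thermostat-7",
    "ts": "2016-03-12T10:05:11Z",
    "firmware": "2.4.1",
    "readings": {"temperature_c": 21.7, "humidity_pct": 40},
    "alerts": [],
}

print(json.dumps(event, indent=2))   # human-readable and machine-parseable in the same form
```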

Traditional business transactions come from the same forms, so common transactions can all be stored together in common tables.  JSON “documents” carry no such guarantee that they will all be alike.  For a database to be as accessible as a traditional SQL database, it has to be much cleverer about organizing the data: under the covers it must take on more of the administrative work of tuning the physical layout of the data to deliver on the performance expectations of end users.
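The sketch below hints at what that extra cleverness involves.  Here pandas stands in for the database’s internal machinery: given documents whose fields differ, it derives a tabular layout from the union of the keys so the data can be queried the way a SQL table would be.  The documents themselves are invented for illustration.

```python
import pandas as pd   # used only as a stand-in for a document database's internal layout work

docs = [
    {"source": "checkout-v1", "ts": 1457777103, "level": "ERROR", "msg": "payment timeout"},
    {"source": "checkout-v2", "ts": 1457777110, "level": "ERROR", "msg": "payment timeout",
     "region": "eu-west-1"},                          # a field added in a later release
    {"source": "thermostat-7", "ts": 1457777111, "reading_c": 21.7},  # a different "thing" entirely
]

# Take the union of every document's keys and fill the gaps: roughly the physical-layout
# work a document database must do under the covers to answer SQL-style queries quickly.
table = pd.json_normalize(docs)
print(table.columns.tolist())
print(table[table["level"] == "ERROR"])               # now queryable like a relational table
```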

(To be covered in Part 2 of 2 posts on this topic)…

  • Deliver near real-time responsiveness between capturing data and driving an action
  • Build towards “converged” analytics, which enables any type of analytics on any type of data

 

Action Items

  • To get ready for the future of smart, connected products, practitioners must deal with the current equivalent of IoT: the portfolio of modern applications.
  • Properly collecting and analyzing their log data in order to manage them requires putting in place many of the skills, infrastructure, and processes necessary to support the IoT of tomorrow.
  • Running Hadoop on a public cloud is likely the quickest way to acquire the skills to manage elastic big data applications on shared infrastructure; a minimal sketch of that kind of job follows below.
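To show the flavor of such a job, here is a minimal Hadoop Streaming sketch in Python that counts log events by severity.  The JSON-lines input format, the `level` field, and the script name are assumptions for illustration; the same logic should run on any cloud Hadoop cluster that ships the Hadoop Streaming jar.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop Streaming job: count JSON-lines log events by severity level."""
import json
import sys

def mapper():
    # Emit "<level>\t1" for every parseable event arriving on stdin.
    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue                      # skip malformed lines instead of failing the job
        print(f"{event.get('level', 'UNKNOWN')}\t1")

def reducer():
    # Hadoop Streaming sorts mapper output by key, so identical levels arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        level, count = line.rstrip("\n").split("\t")
        if level != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = level, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

For a local smoke test the same script can be exercised with ordinary pipes, for example `cat sample.jsonl | python log_counts.py map | sort | python log_counts.py reduce`; on a cluster it is submitted through the distribution’s Hadoop Streaming jar, whose path and options vary by provider.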