Small World Big Data
Published: 23 Aug 2018
Have you noticed yet? Those geeky big data platforms based on clusters of commodity nodes running open source parallel processing algorithms are evolving into some seriously advanced IT functionality.
The popular branded distributions of the Apache projects, including those from Hortonworks, Cloudera and MapR, are no longer simply made up of relatively basic big data batch query tools, such as Hadoop MapReduce, the way they were 10 years ago. We've seen advances in machine learning, SQL-based transaction support, in-memory acceleration, interactive query performance, streaming data handling, and enterprise IT data governance, protection and security. Even container services, scheduling and management are on a new level. Big data platforms now present a compelling vision for the future of perhaps all IT data processing.
Wait, do I really mean all IT data center processing will be big data processing? Most of us are just getting used to the idea of investing in and building out functional data lakes to capture and collect tons of unstructured data for business intelligence tasks, offline machine learning, active archive and other secondary data applications. And many are having a hard time making those data lake initiatives successful. It's a challenge to develop staff expertise, assure data provenance, manage metadata and master implied schemas, i.e., creating a single version of truth.
Many organizations may be waiting for things in the big data market to settle out. Unfortunately, especially for those more comfortable being late adopters, big data processing technology development is accelerating. Use cases are proliferating rapidly, and the general IT manageability of big data streams is improving greatly, which eases adoption and integration.
The universal big data onslaught is not going to slow down, nor will it wait for slackers to catch up. And those able to harness their big data streams today aren't just using them to look up old baseball stats. They are able to use data to improve and accelerate operations, gain greater competitiveness and achieve actual ROI. I'm not even going to point out the possibility that savvy big data processing will uncover new revenue opportunities and business models. Oops, just did.
If you think you are falling behind today on big data initiatives, I'd recommend you consider doubling down now. This area is moving way too fast to jump on board later and still expect to catch competitors. Big data is proving to be a huge game changer. There simply won't be a later with big data.
I've written before that all data is eventually going to be big data. I'll now add that all processing is eventually going to be big data processing. In my view, the focus of big data technology has moved from building out systems of insight over trailing big data sets to now offering ways to build convergent systems of action over all data.
In other words, big data isn't just for backroom data science geeks. The technologies involved are going to define the next-generation IT data center platform.
Big data system of action
What makes this concept a system of action? First, platforms now converge support for both analytical and transactional big data processing. Early big data platforms focused on offline batch-style analysis of largely read-only big data sets. They were inevitably driven to support faster, more interactive queries for more ad hoc analytics. As query performance picked up, a wider set of business-side users clamored for familiar SQL semantics, at first for friendlier query construction but then also for big data transactional processing.
At least two more trends lead to the system of action. One is that more data now streams in from IoT deployments in real time, which drives a more real-time business application focus. These data streams are often the very definition of big data, and big data stream processing platforms now need to host a multiplicity of workloads, including business applications.
Another important trend is machine learning and AI. Big data offers a wealth of information points to train predictive models. In today's world of convergence and hyper-convergence, it makes sense to use a big data platform that handles big data and streaming data, recurring (and even dynamic) model building and training, and the application of models in business-optimizing applications.
Fundamentally, as big data streams in faster, model building and training iterations become more continuous, and business applications merge analytics-derived optimizations with current transactional operations. The overall data processing loop gets tighter, shorter and faster.
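To make that tightening loop concrete, here is a minimal sketch in plain Python rather than on any particular big data platform. It assumes a stream of numeric readings and uses a simple running-statistics model (Welford's online algorithm) as a stand-in for whatever model a real system would train. The names `RunningStats` and `process_stream`, the `threshold` and `warmup` parameters, and the "hold_for_review" policy are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance, updated one event at a time."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; 0.0 until there are two observations.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def process_stream(readings, threshold=3.0, warmup=5):
    """Score each event with the current model, act on it, then retrain."""
    stats = RunningStats()
    actions = []
    for value in readings:
        # Act: apply the model as it stands right now to the incoming event.
        if stats.n >= warmup and stats.std > 0:
            z = abs(value - stats.mean) / stats.std
            if z > threshold:
                actions.append(("hold_for_review", value))
                continue  # illustrative policy: do not train on flagged events
        actions.append(("accept", value))
        # Learn: fold the accepted event into the model immediately.
        stats.update(value)
    return actions, stats


if __name__ == "__main__":
    actions, stats = process_stream([10, 11, 9, 10, 12, 10, 11, 50, 10])
    for action, value in actions:
        print(action, value)
```

The point of the sketch is only the shape of the loop: scoring, acting and retraining happen per event in one pass, instead of as a nightly batch job feeding a separate application.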
This all points to a single, converged big data processing platform in the future. In some ways, this vision is reminiscent of how the mainframe of the past was seen as the system of record. If you are still skeptical, take a good look at plans for Apache Hive, the capabilities of MapR-DB and what IBM is up to with DB2.
Data virtualization and hybrid deployment
I'm not at all implying that this future system of action needs to actually live in one location or that it should be hosted on premises -- even in part. There are some great advances in data virtualization that allow data to sit where it sits best; tools such as SwiftStack and Hortonworks Connected Data Platform Services help move data when necessary -- but only to where it's processed most effectively.
Process data in place if you can, whether that's on site, in the cloud or at some IoT edge or device. But there has to be unifying IT governance throughout. Enterprise-capable IT governance facilities can now be found in most big data distributions. These provide data protection, business continuity and disaster recovery, security and audits, and compliance and regulatory controls. They do so through the latest generation of big data processing tools, which offer data discovery, validation, cataloging, access controls, collaboration and more.
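The "process in place, under one governance policy" idea can be sketched as a small placement decision. This is an illustration in plain Python, not the API of any product named above. The `Dataset` type, the `allowed_sites` policy field and the `choose_site` function are hypothetical, and the sketch assumes the caller lists capable sites in order of preference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    location: str             # where the data currently sits
    allowed_sites: frozenset  # sites the governance policy permits


def choose_site(dataset, capable_sites):
    """Return (site, must_move) for a job that reads `dataset`.

    `capable_sites` is assumed to be ordered by preference.
    """
    # Governance applies everywhere: filter out sites the policy forbids.
    permitted = [s for s in capable_sites if s in dataset.allowed_sites]
    if not permitted:
        raise PermissionError(f"no permitted site can process {dataset.name}")
    # Prefer processing in place when the data's home can run the job.
    if dataset.location in permitted:
        return dataset.location, False
    # Otherwise move the data, but only to the best permitted site.
    return permitted[0], True


if __name__ == "__main__":
    sensor = Dataset("sensor-feed", "edge", frozenset({"edge", "onprem"}))
    print(choose_site(sensor, ["cloud", "onprem", "edge"]))
    print(choose_site(sensor, ["cloud", "onprem"]))
```

In the first call the data stays at the edge even though the cloud is listed first, because the policy forbids the cloud and the edge can do the work. In the second call the data has to move, and it moves only to the on-premises site the policy allows.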
A truly converged big data platform could serve as an active, operational system of truth and not just sit there as a vast data lake of assorted and aggregated bits.