Small World Big Data
Published: 17 Nov 2016
Big data and artificial intelligence will affect the world -- and already are -- in mind-boggling ways. That includes, of course, our data centers.
The term artificial intelligence (AI) is making a comeback. I interpret AI as a larger, encompassing umbrella that includes machine learning -- which in turn includes deep learning methods -- but also implies thought. Meanwhile, machine learning is somehow safe to talk about. It's just some applied math under the hood -- built on probability, linear algebra and differential equations, for example. But use the term AI and, suddenly, you get wildly different emotional reactions -- for example, the Terminator is coming. However, today's broader field of AI is working toward providing humanity with enhanced and automated vision, speech and reasoning.
If you'd like to stay on top of what's happening practically in these areas, here are some emerging big data and AI trends to watch that might affect you and your data center sooner rather than later:
Where there is a Spark… Apache Spark is replacing basic Hadoop MapReduce for latency-sensitive big data jobs with its in-memory, real-time queries and fast machine learning at scale. And with familiar, analyst-friendly data constructs and languages, Spark brings it all within reach of us middling hacker types.
As far as production bulletproofing goes, it's not quite fully baked. But version two of Spark was released in mid-2016, and it's solidifying fast. Even so, this ecosystem moves quickly, and potential "Next Big Things" such as Apache Flink are already turning heads.
Even I can do it. A few years ago, all this big data and AI stuff required doctorate-level data scientists. In response, a few creative startups attempted to short-circuit those rare and expensive math geeks out of the standard corporate analytics loop and provide the spreadsheet-oriented business intelligence analyst some direct big data access.
Today, as with Spark, I get a real sense that big data analytics is finally within reach of the average engineer or programming techie. The average IT geek may still need to put in some serious study but can achieve great success creating massive organizational value. In other words, there is now a large and growing middle ground where smart non-data scientists can be very productive with applied machine learning, even on big and real-time data streams. Platforms such as Spark are making big data more accessible through higher-level programming languages such as Python and R.
We can see even easier approaches emerging with new point-and-click, drag-and-drop big data analytics products from companies such as Dataiku or Cask. To achieve the big data and AI goals, you still need to understand extract, transform and load (ETL) concepts and what machine learning is and can do, but you certainly don't need to program low-level parallel linear algebra in MapReduce anymore.
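To make the ETL idea concrete, here is a minimal sketch in plain Python, not tied to any of the products named above. The sample readings, the `cpu` table and the function names are all hypothetical, chosen only for illustration. It extracts rows from raw CSV text, transforms them by dropping readings that are not numeric, and loads the rest into an in-memory SQLite table.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one reading is malformed on purpose.
RAW_CSV = """host,cpu_pct
web-1,71.5
web-2,not_available
db-1,93.0
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only rows with a numeric reading, converted to float."""
    clean = []
    for row in rows:
        try:
            clean.append((row["host"], float(row["cpu_pct"])))
        except ValueError:
            continue  # skip malformed readings
    return clean


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS cpu (host TEXT, cpu_pct REAL)")
    conn.executemany("INSERT INTO cpu VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text=RAW_CSV):
    """Run extract, transform and load in order, then read the table back."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute("SELECT host, cpu_pct FROM cpu ORDER BY host").fetchall()


if __name__ == "__main__":
    print(run_pipeline())
```

A real pipeline would swap each step for something heavier (a message queue, a Spark job, a data lake), but the three-stage shape stays the same.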
Data flow management now tops the IT systems management stack. At a lower level, we are all familiar with silo data storage management, which is down in the infrastructure layer. But new paradigms are enabling IT to manage data itself and data flows as first-class systems management resources, the same as network, storage, server, virtualization and applications.
For example, enterprise data lakes and end-to-end production big data flows need professional data monitoring, managing, troubleshooting, planning and architecting. Like other systems management areas, data flows can have their own service-level agreements, availability goals, performance targets, capacity shortfalls and security concerns. And flowing data has provenance, lineage, veracity and a whole lot of related metadata to track dynamically.
Much of this may seem familiar to longtime IT experts. But this is a new world, and providing big data and big data flows with their own systems management focus has real merit as data grows larger and faster.
I wrote recently about how the classic siloed IT practitioner might think to grow his career; big data management would be an interesting career direction. New vendors such as StreamSets are tackling this area head-on, while others that started with more ETL and data lake catalog and security products are evolving in this direction.
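As a rough illustration of treating a data flow as a managed resource, the sketch below attaches lineage and latency metadata to each batch and checks it against a service-level target. It is a toy model under stated assumptions: the `FlowBatch` class, its fields and the `sla_breaches` helper are hypothetical names, not the API of StreamSets or any other product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlowBatch:
    """One batch of data moving through a flow, plus the metadata used to manage it."""
    source: str                 # provenance: where the data originated
    records: int                # batch size, useful for capacity planning
    latency_s: float = 0.0      # accumulated end-to-end latency in seconds
    lineage: List[str] = field(default_factory=list)  # stages passed so far

    def through(self, stage: str, added_latency_s: float) -> "FlowBatch":
        """Return a new batch that has passed one more stage."""
        return FlowBatch(
            source=self.source,
            records=self.records,
            latency_s=self.latency_s + added_latency_s,
            lineage=self.lineage + [stage],
        )


def sla_breaches(batches, max_latency_s):
    """Return the batches whose accumulated latency exceeds the target."""
    return [b for b in batches if b.latency_s > max_latency_s]


if __name__ == "__main__":
    batch = FlowBatch("sensors", 1000).through("parse", 2.0).through("enrich", 4.5)
    print(batch.lineage, batch.latency_s)
    print(len(sla_breaches([batch], max_latency_s=5.0)))
```

Returning a new batch at each stage, rather than mutating the old one, keeps every intermediate state available for troubleshooting.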
Super scale-up comes around. Those of us long in the IT world know that there are two megatrends that cycle back and forth: centralize vs. distribute and scale-up vs. scale-out. Sure, every new cycle uses newer technology and brings a distinct flavor, but if you step back far enough, you can see a cyclical frequency.
Big data has been aiming at scale-out on commodity hardware for a decade. Now, it's bouncing back a bit toward scale-up. To be fair, it is really scale-up within scale-out grids, but a new crop of graphics processing units (GPUs) is putting the spotlight on bigger -- and not necessarily commodity -- nodes. For example, Kinetica worked with IBM on a custom system with four Nvidia GPUs and 1 TB of RAM to power its fast, agile-query big data database -- no static pre-indexing needed. And Nvidia recently rolled out a powerful eight-GPU DGX-1 appliance designed especially for deep learning.
I have no doubt this trend will keep swinging back and forth, although it could result in a greater connection between big data and AI. Internet of things applications are going to push quite a few of the big data opportunities out toward the edge, which means super scale-out by definition. As always, a practical approach will likely use both scale-up and scale-out in new combinations. (How many folks kept mainframes that now can run thousands of VMs, each capable of supporting unknown numbers of containers?)
Eventually, all data will be big data, and machine learning -- and the broader AI capabilities -- will be applied everywhere to dynamically optimize just about everything. Given the power easily available to anyone through cloud computing, the impending explosion of internet of things data sources and increasingly accessible packaged algorithms, the possibilities of big data and AI are becoming real in our lifetimes.
The data center of the near future may soon be a converged host of all the data an organization can muster, continually fed by real-time data flows, supporting both transactional systems of record and opportunistic systems of engagement, and all driven by as much automated intelligence as possible. The number of enterprise IT management startups touting machine learning as part of their value proposition is increasing daily.
Mike Matchett is senior analyst at Taneja Group. Reach him on Twitter: @smworldbigdata.