Published: 17 Nov 2016
How corporations use big data can be the difference between business success and business failure. Now, the fast data architecture promises more possibilities, with the right data management to support it.
Data generation is increasing at mind-boggling rates, and the evidence surrounds us: 21 million tweets and 9 billion email messages are sent every hour. Soon, even more information will be created. Sensors will collect performance data on items such as light bulbs, personal medical devices will monitor insulin rates and inventory will be tracked as it moves from place to place.
As a result, analyst firm IDC expects data volumes to double every two years and reach 40 zettabytes -- a zettabyte equals one million petabytes -- in 2020. Enterprises want to do more than collect information for future analysis -- they want to evaluate it in real time, a desire that is dramatically changing the data management market.
Recently, big data systems have been all the rage. In fact, IDC projects that the market will grow at 23.1% annually and reach $48.6 billion in 2019. Big data systems have been gaining traction for a few reasons. They allow organizations to collect large volumes of information and use commodity hardware and open source tools to examine it. Businesses can then justify deployments that are much less expensive than traditional proprietary database management systems (DBMSes). Consequently, Hadoop clusters built from thousands of nodes have become common in many organizations.
With competition increasing, management is placing new demands on IT.
"Knowledge is power, and knowledge of yesterday is not as valuable as knowledge about what's happening now in many -- but not all -- circumstances," said W. Roy Schulte, vice president and analyst at Gartner.
Businesses want to analyze information in real time, an emerging practice dubbed fast data. Traditionally, acting on large volumes of data instantly was viewed as impossible because the hardware needed to support such applications was too expensive. That thinking has recently been changing. The use of commodity servers and the rapidly decreasing cost of flash memory now make it possible for organizations to process large volumes of data without breaking the bank, giving rise to the fast data architecture. In addition, new data management techniques enable firms to analyze information instantly.
For example, transaction systems include checks so that only valid transactions take place. A bank would not want to approve two transactions entered within milliseconds that took all of the money out of a checking account. Analytical systems collect information and illustrate trends, such as more time being taken by call center staff handling customer inquiries. By linking the two, corporations could build new applications that perform tasks, like instantly approving a customer's request for an overdraft because the client's payment history is strong.
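The kind of decision described above can be sketched in a few lines. This is a hypothetical illustration, not any bank's actual logic: the function names, the 95% on-time threshold and the overdraft limit are all invented for the example. The point is that an analytical signal (payment history) feeds a transactional decision in real time.

```python
def approve_overdraft(request, payment_history, limit=500):
    """Approve an overdraft request instantly when the customer's
    payment record is strong and the amount is within the limit."""
    if not payment_history:
        return False
    on_time = sum(1 for p in payment_history if p["on_time"])
    strong_history = on_time / len(payment_history) >= 0.95
    within_limit = request["amount"] <= limit
    return strong_history and within_limit

# A customer with 19 of 20 payments on time requests a $200 overdraft.
history = [{"on_time": True}] * 19 + [{"on_time": False}]
print(approve_overdraft({"amount": 200}, history))  # True
```

In a real deployment the payment history would come from an analytical store and the request from a live transaction stream; the join between the two is what the fast data architecture makes practical.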
How big is big data?
Technically speaking, a zettabyte is 10^21 bytes, or a billion terabytes -- but data capacity numbers that large can be hard to digest. In more practical terms, a zettabyte could be expressed as the equivalent of 152 million years of high-definition video. Forty zettabytes -- the level IDC expects data volume to reach in 2020 -- split among the 7 billion people on Earth equates to about 5.7 terabytes per person.
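The arithmetic behind those figures is easy to verify:

```python
ZETTABYTE = 10 ** 21  # bytes
PETABYTE = 10 ** 15
TERABYTE = 10 ** 12

# One zettabyte is a billion terabytes, or a million petabytes.
print(ZETTABYTE // TERABYTE)  # 1,000,000,000
print(ZETTABYTE // PETABYTE)  # 1,000,000

# 40 ZB split among 7 billion people works out to roughly 5.7 TB each.
per_person_tb = 40 * ZETTABYTE / (7 * 10 ** 9) / TERABYTE
print(round(per_person_tb, 1))  # 5.7
```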
New data management products
Traditional data management systems worked only with data at rest, storing information in memory, on a disk, in a file, in a database or in an in-memory data grid and evaluating it later. Emerging products, which are being labeled as streaming systems, work with data in motion, information that is evaluated the instant it arrives.
The new streaming platforms use various approaches, all with the goal of delivering immediate analysis. "You don't need any DBMS at all for some fast data applications," noted Gartner's Schulte.
In certain cases, traditional DBMS products have morphed to support the fast data architecture. For example, Hadoop is a parallel data processing framework that has traditionally relied on a MapReduce job model. Here, data is collected, and batch jobs, which take minutes or hours to complete, eventually present the data to users for evaluation. To address the demand for fast data, the Apache Software Foundation -- the group that oversees Hadoop's development -- took on Spark, a framework that can run on top of Hadoop and provides an alternative to the traditional batch MapReduce model. Spark supports real-time data streams and fast, interactive queries.
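The difference between the two models can be sketched without any framework at all. The plain-Python code below is a conceptual illustration, not Spark's actual API: a batch job processes the full data set after it is collected, while a micro-batch streaming job updates running totals as small batches arrive, so results are available while data is still flowing in.

```python
from collections import Counter

def batch_word_count(records):
    """Batch model: all data is collected first, then processed once."""
    return Counter(word for line in records for word in line.split())

def micro_batch_word_count(stream, batch_size=2):
    """Streaming model: running totals are updated as each small batch
    of records arrives, instead of waiting for the whole data set."""
    totals = Counter()
    batch = []
    for line in stream:
        batch.append(line)
        if len(batch) == batch_size:
            totals.update(word for l in batch for word in l.split())
            batch = []
    if batch:  # flush any final partial batch
        totals.update(word for l in batch for word in l.split())
    return totals

lines = ["a b", "b c", "a a", "c"]
# Both models reach the same answer; the streaming model just gets
# there incrementally.
print(batch_word_count(lines) == micro_batch_word_count(lines))  # True
```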
In some cases, a business wants to store a copy of the event stream and use it for later analysis. Apache Storm, open sourced by Twitter's engineering team, processes unbounded streams of data at a rate of millions of messages per second. Apache Kafka, developed by engineers at LinkedIn, is a high-throughput distributed message queue system designed to support fast data applications. In addition, start-ups are adding streaming functionality to NewSQL and NoSQL databases, trying to bridge the traditionally antithetical desires of processing fast data and storing information for later analysis.
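The core idea behind a log-based message queue can be illustrated in miniature. The plain-Python sketch below is not Kafka's API; it only shows the principle that makes storing and replaying a stream possible: events are appended to a durable log, and each consumer tracks its own read position, so a real-time dashboard and a later analytics job can read the same events independently.

```python
class EventLog:
    """A toy append-only event log with per-consumer offsets."""

    def __init__(self):
        self._log = []      # append-only record of every event
        self._offsets = {}  # consumer name -> next index to read

    def append(self, event):
        self._log.append(event)

    def poll(self, consumer):
        """Return this consumer's unread events and advance its offset."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._log)
        return self._log[start:]

log = EventLog()
log.append({"user": 1, "action": "click"})
log.append({"user": 2, "action": "buy"})

# Two consumers read the same stream at their own pace.
print(len(log.poll("realtime-dashboard")))  # 2
print(len(log.poll("nightly-analytics")))   # 2 -- same events, own offset
```

A production system adds partitioning, replication and disk persistence on top of this model, but the append-only log with independent offsets is the piece that reconciles fast processing with later analysis.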
These various streaming data products are built to handle the increasingly high-volume, complex information streams that businesses generate. Examples of the new data sources include news feeds, web clickstreams, social media posts and email. The new, usually unstructured data -- information that does not fit neatly into the rows and columns found with a traditional DBMS -- is growing at higher rates than structured data. Consequently, these emerging data repositories ingest large amounts of diverse information -- as many as millions of inputs every second.
The potential reach of these systems is enormous. Large companies already have thousands of event streams running at any given moment, and firms increasingly want to tap into that information to improve operations. Such moments are "perishable insights," urgent business risks and opportunities that firms can only detect and act on at a moment's notice, according to Mike Gualtieri, vice president and principal analyst at Forrester Research.
Equipped for battle
Wargaming.net is an online multiplayer game developer that was founded in 1998. In June 2015, the company was searching for a platform to support a 100-node, 200 TB fast data architecture running Apache Spark. The gaming firm evaluated products from several vendors and opted for the Apache Hadoop provider Cloudera because of its strong customer support, according to Sergei Vasiuk, development director at Wargaming.net.
The gaming supplier began deploying the Cloudera platform in June 2015 and had it operating by the end of the year. Currently, a dozen fast data applications support functions that range from securing network connections to analytics outlining how well individual players fare.
Building new infrastructure
Most companies are not ready to support a fast data architecture for a number of reasons. First, the applications are complex and hard to build, and almost always combine data from multiple sources. For example, a telecom support application links incoming call data with customer profiles to enable contact center agents to upsell, offering coupons for an upgrade to a higher-tier calling plan during a customer engagement.
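The telecom example above amounts to a join between a live event stream and stored reference data. The sketch below is a hypothetical illustration of that pattern -- the field names, plan tiers and offer rule are all invented -- showing an incoming call event being enriched with a customer profile so the agent sees an upsell prompt during the engagement.

```python
# Stored reference data: customer profiles keyed by ID (illustrative).
profiles = {
    "c-100": {"plan": "basic", "tenure_years": 4},
    "c-200": {"plan": "premium", "tenure_years": 1},
}

def enrich_call(call, profiles):
    """Join an incoming call event with the caller's stored profile
    and attach an upsell offer when a simple rule matches."""
    profile = profiles.get(call["customer_id"], {})
    offer = None
    if profile.get("plan") == "basic" and profile.get("tenure_years", 0) >= 2:
        offer = "coupon: upgrade to a higher-tier calling plan"
    return {**call, **profile, "offer": offer}

event = enrich_call({"customer_id": "c-100", "duration_s": 240}, profiles)
print(event["offer"])  # coupon: upgrade to a higher-tier calling plan
```

Writing this join so it keeps up with millions of events per second, and keeps the profile data fresh, is exactly the part that demands the new development tools discussed below.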
For such connections to be coded into the applications, new development tools are needed.
Developers require products that create streaming flows and rely on new runtime platforms. These tools, in early forms today, lack the amenities found in more established products, such as robust development, testing, integration and administration functionality. Often, users have to write the code that delivers such functionality themselves, which increases development time as well as the complexity of system maintenance.
Because the streaming platforms and development tools are new, many IT departments have little to no experience with them. Firms need to develop different design practices for fast data architectures than those now used with traditional IT architectures. Employees then need to work with IT shops to understand how to best write these applications.
Finally, the products that enable fast data are expensive. While many suppliers offer free, limited-function entry-level systems, pricing for fast data products can quickly rise into six figures. These projects can cost millions of dollars to deploy once other factors, such as development tools and labor costs, are factored into the equation.
The future is rushing up
Despite the current limitations, fast data's future looks bright. The advent of mobile and social media is altering customer expectations: They want answers right now. So firms need to collect more information and move immediately to satisfy customer demands.
As noted, new data sources are gaining traction, and their future is bright: "The internet of things is the single biggest driver for fast data demand," Gartner's Schulte stated. "By 2020, more than half of all new application projects will incorporate some -- large or small -- amount of IoT processing. Some of these will use a stream analytics platform, and the remainder will write stream processing into the application code."
Paul Korzeniowski is a freelance writer who specializes in modern infrastructure issues. He has been covering technology issues for more than two decades, is based in Sudbury, Mass., and can be reached at firstname.lastname@example.org.