So, we have data -- lots and lots of data. We have blocks, files and objects in storage. We have tables, key values and graphs in databases. And increasingly, we have media, machine data and event streams flowing in.
It must be a fun time to be an enterprise data architect, figuring out how to best take advantage of all this potential intelligence -- without missing or dropping a single byte.
Big data platforms such as Spark help process this data quickly and converge traditional transactional data center applications with advanced analytics. If you haven't yet seen Spark show up in the production side of your data center, you will soon. Organizations that don't, or can't, adopt big data platforms to add intelligence to their daily business processes are soon going to find themselves way behind their competition.
Spark, with its distributed in-memory processing architecture -- and native libraries providing both expert machine learning and SQL-like data structures -- was expressly designed for performance with large data sets. Even with such a fast start, competition and larger data volumes have made Spark performance acceleration a sizzling hot topic. You can see this trend at big data shows, such as the recent, sold-out Spark Summit in Boston, where it seemed every vendor was touting some way to accelerate Spark.
If Spark already runs in memory and scales out to large clusters of nodes, how can you make it faster, processing more data than ever before? Here are five Spark acceleration angles we've noted:
- In-memory improvements. Spark can use a distributed pool of memory-heavy nodes. Still, there is always room to improve how memory management works (sharding and caching, for example), how much memory can be stuffed into each node and how far clusters can effectively scale out. Recent versions of Spark use native Tungsten off-heap memory management with compact data encoding, along with the optimizing Catalyst query planner, to greatly reduce both execution time and memory demand. According to Databricks, the leading Spark sponsor, future releases will continue to aggressively pursue greater Spark acceleration.
- Native streaming data. The hottest topic in big data is how to deal with streaming data, which is really about how to process data bits as they arrive. Real-time streaming data sets require special handling, and this presents quite a management challenge. In the past, this often required complex, carefully managed workflows along with messaging and queuing algorithms; sometimes the answer was to run separate infrastructure clusters with a different software stack altogether. Today we are seeing streaming data support converging into -- and under -- more friendly paradigms. Spark 2.0, for instance, natively supports structured streaming, which folds new kinds of streaming data sources into the existing developer-friendly big data platform.
- Unifying big data. Products such as MapR, Alluxio and Splice Machine aim to create unified big data sources, databases and storage that can natively ingest many different kinds of data and serve them in a unified manner to a downstream application, such as Spark. Some of these tools converge transactional data with other big data types -- and provide SQL access to all. Others merge streaming data into historical data sets, providing a consistent data API. Either way, upstream integrated big data sources can help make Spark application processing far more focused and efficient.
- Hardware acceleration. There is, of course, a lot of speeding up to be had using specialized hardware. While many would prefer to stick with a strictly vanilla commodity server infrastructure, it's clear that harnessing large numbers of graphics processing units (GPUs) or custom field-programmable gate arrays (FPGAs), such as those from Kinetica or BigStream, can greatly accelerate Spark processing. In addition to dense computation, the video RAM contained on GPU cards offers another tier of memory; systems can put that extra memory to good use, accelerating certain Spark functions.
- Purpose-built platforms. There are some attractive non-commodity platforms and appliances specially built to offer high-end Spark performance. These vendor products might converge high-performance compute, network and storage components, use dense server-side NVMe flash, and even toy with new kinds of low-level memory management specifically applied to Spark acceleration. Examples include the Cray Spark platform, Oracle SPARC servers, DriveScale racks and Iguaz.io appliances.
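The structured streaming idea in the list above can be pictured as a query over a table that keeps growing: each small batch of newly arrived events is folded into a running result rather than recomputed from scratch. The following is a minimal plain-Python sketch of that model only. It does not use Spark's API, and the function name `running_counts` and the sample events are hypothetical, chosen purely for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def running_counts(micro_batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    """Yield the updated per-key counts after each micro-batch arrives."""
    totals: Counter = Counter()  # the continuously updated "result table"
    for batch in micro_batches:
        totals.update(batch)     # fold only the new rows into the running result
        yield dict(totals)       # independent snapshot of the result so far


# Hypothetical event stream, already split into three small batches.
batches = [["click", "view"], ["click"], ["view", "view", "purchase"]]
snapshots = list(running_counts(batches))
final_counts = snapshots[-1]
```

Each snapshot reflects everything seen so far, which is roughly what a developer gets when a streaming source is treated like an ordinary, ever-growing data set instead of a separate messaging pipeline.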
Other products and open source projects can deliver even greater performance than Spark in certain cases. Apache Flink, for example, is designed for low-latency streaming.
In general, though, the Spark architecture has plenty of room to evolve and has the momentum to remain the big data platform of choice for at least the next five years. And given the market-wide acceleration effort, there is no good reason to keep pushing off Spark adoption.