Modern Infrastructure

The underlying costs of cloud apps

Spark speeds up adoption of big data clusters and clouds

Infrastructure that supports big data comes from both the cloud and clusters. Enterprises can mix and match these seven infrastructure choices to meet their needs.

If enterprise IT was slow to support big data analytics in production on the decade-old Hadoop, the ramp-up has been much faster now that Spark is part of the overall package. After all, applying the same old business intelligence approach to broader, bigger data (with MapReduce) isn't exciting, but producing operational-time predictive intelligence that guides and optimizes business with machine precision is a competitive must-have.

With traditional business intelligence (BI), an analyst studies a lot of data, forms some hypotheses and draws a conclusion to make a recommendation. Using the many big data machine learning techniques supported by Spark's MLlib, a company's big data can dynamically drive operational-speed optimizations. Massive in-memory machine learning algorithms enable businesses to immediately recognize and act on inherent patterns, even in big streaming data.
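To make the idea of acting on patterns in streaming data concrete, here is a minimal single-machine sketch in plain Python. It is not Spark or MLlib code; the class name `StreamingAnomalyDetector`, its `threshold` and `warmup` parameters, and the sample values are all hypothetical, chosen only to illustrate the principle of keeping a small model in memory and updating it as each record arrives.

```python
import math


class StreamingAnomalyDetector:
    """Hypothetical illustration: flags values far from the running mean.

    Keeps only three numbers in memory (count, mean, sum of squared
    deviations) and updates them per record using Welford's algorithm.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # how many standard deviations count as unusual
        self.warmup = warmup        # records to observe before flagging anything
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, value):
        """Score the value against history so far, then learn from it."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) > self.threshold * std:
                is_anomaly = True

        # Welford's online update of mean and squared deviations.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


# Made-up stream: steady readings followed by one spike.
detector = StreamingAnomalyDetector()
stream = [10, 11] * 10 + [100]
flags = [detector.observe(v) for v in stream]
print("spike flagged:", flags[-1])
```

A production system would distribute this kind of incremental scoring across a cluster and use far richer models, but the operational pattern is the same: score each record as it arrives, then fold it into the model.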

But the commoditization of machine learning itself isn't the only new driver here. A decade ago, IT needed to either stand up a "baby" high-performance computing cluster for serious machine learning or learn to write low-level distributed parallel algorithms to run on the commodity-based Hadoop MapReduce platform. Either option required both data science expertise and exceptionally talented IT admins who could stand up and support massive physical scale-out clusters in production. Today there are many infrastructure options for big data clusters that can help IT deploy and support big data-driven applications.

Here are seven types of big data infrastructures for IT to consider, each with core strengths and differences:

1. Dedicated physical big data clusters

When Hadoop was first released, it enabled anyone to stand up any old cluster of commodity servers to tackle big data. The canonical Hadoop cluster consists of a scale-out set of servers that hosts and converges both distributed processing and a purpose-built distributed file system (the Hadoop Distributed File System, or HDFS). Whereas the original Hadoop distribution was still almost a science project, recent versions of both core Apache Hadoop and commercial distributions from popular vendors such as Cloudera, IBM and Hortonworks offer enterprise-friendly features.

Still, an organization that builds a monolithic dedicated physical cluster must worry about server lifetime issues like increasing heterogeneity, global patching, isolated resources, and mixed workload performance and capacity management.

2. Orthogonal compute and storage

Some enterprise storage vendors have "externalized" HDFS as an API or protocol so that their own enterprise storage area network (SAN) can be substituted for it. Here, familiar data center storage can sit adjacent to one or more compute-node Hadoop/Spark clusters. While comparatively costly on a Capex basis, big data kept in a SAN can be broadly shared, governed and protected like other enterprise data.

Companies such as DataDirect Networks have worked hard to evolve their HPC-capable storage and integrated storage-focused application expertise (like Lustre) to offer HPC-powered but enterprise-quality big data products for Hadoop and Spark.

3. Virtualized/containerized clusters

Like every other enterprise application these days, Hadoop compute and storage can be hosted in VM form. Virtual clusters are easy to provision, and to extend or contract as desired. VM hosting can even be set up to isolate fungible compute node clusters while sharing data readily from persistent data storage nodes. For example, VMware offers native Big Data Extensions functionality in vSphere for big data clusters.

Of course, there is implicit overhead and contention when multiple big data applications compete for shared virtual resources, but virtual hosting is great for temporary bursting, data sharing, quick development and testing sandboxes, and serving on-demand intermittent processing tasks.

Big data tools are also increasingly available in container form, which can increase agility and greatly reduce the combined overhead of the VM approach.

4. Hyper-converged appliances

Plain old Hadoop nodes already converge server and storage in each node, but companies such as Diamanti go further by offering I/O acceleration and the virtualized pooling of storage within and between their agile, containerized Hadoop appliances. This plug-and-play hyper-converged appliance approach to big data makes it easy for IT to stand up, scale and serve high-performance big data clusters to multiple end users.

5. Cloud hosted big data tools

Of course, Amazon Web Services, Google and other cloud providers offer a software as a service (SaaS) version of big data analytics. Anyone can bring a credit card to a public cloud and immediately purchase tremendous computing power for a few dollars. But while the burst cost for access to a large cluster is empowering and compelling, the ongoing costs of persistent big data cloud storage and big data I/O transactions can add up. Security and other data governance issues can also be a challenge to cloud-based big data processing.
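The trade-off between cheap bursts and accumulating storage charges can be sketched with some back-of-the-envelope arithmetic. The function name, parameters and every price below are hypothetical placeholders, not any provider's actual rates; the point is only how the three cost components compare when the data set is large and the compute is used briefly.

```python
def monthly_cloud_cost(cluster_hours, nodes, node_hourly_rate,
                       stored_tb, storage_rate_per_tb_month,
                       io_requests_millions, io_rate_per_million):
    """Estimate one month's bill from three components (all rates hypothetical)."""
    compute = cluster_hours * nodes * node_hourly_rate   # pay only while the cluster runs
    storage = stored_tb * storage_rate_per_tb_month      # pay all month, used or not
    io = io_requests_millions * io_rate_per_million      # pay per read/write request
    return {"compute": compute, "storage": storage, "io": io,
            "total": compute + storage + io}


# Made-up scenario: a 100-node cluster rented for 4 hours,
# with 500 TB kept in cloud storage for the whole month.
bill = monthly_cloud_cost(
    cluster_hours=4, nodes=100, node_hourly_rate=0.25,
    stored_tb=500, storage_rate_per_tb_month=20.0,
    io_requests_millions=200, io_rate_per_million=0.5,
)
for item, dollars in bill.items():
    print(f"{item:>8}: ${dollars:,.2f}")
```

Under these assumed numbers the burst of compute is a small fraction of the bill and persistent storage dominates, which is the pattern the paragraph above warns about. Plugging in a provider's real price list would change the figures but not the structure of the calculation.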

6. Hybrid cloud big data

One new architectural approach was recently introduced by Galactic Exchange. This company uses a public cloud SaaS offering to host and provide all the fussy Hadoop/cluster management functionalities, while keeping all analytical processing, compute and data storage nodes on-premises. This avoids cloud data storage issues, can minimize long-term costs and augments overworked IT staff with expertly managed big data coverage. If desired, management functionality can always be migrated back to on-premises hosting, so there is no hard lock-in.

7. Bare-metal cloud infrastructure

DriveScale and HPE Synergy promise yet another option for large enterprises looking for high agility and dense resource pooling with a new type of bare-metal cloud provisioning. DriveScale is a rack-sized pool of servers and disks that can be composed dynamically into Hadoop clusters as desired. Synergy is a new dense blade-style infrastructure out of which physical machines, storage services and network resources can be dynamically carved out and provisioned on-demand. An IT shop can benefit from cloud pooling, elastic provisioning and utility pricing models, while offering fully isolated and dedicated physical machine-based services to business clients.

No longer does IT have to make a one-size-fits-all commitment to a monolithic, homogeneous cluster of servers. Instead, IT can take advantage of and profit from an intelligent mix-and-match approach, building out an architecture that aligns with both flexible and unique big data requirements.
