Machine learning is the force behind many big data initiatives. But things can go wrong when implementing it, with significant effects on IT operations.
Unfortunately, predictive modeling can be fraught with peril if you don't have a firm grasp of the quality and veracity of the input data, the actual business goal and the real-world limits of prediction (e.g., you can't avoid black swans).
It's also easy for machine learning and big data beginners to either build needlessly complex models or "overtrain" on the given data (learning too many details of the specific training data that don't apply generally). In fact, it's quite hard to know when you have achieved the smartest yet still "generalized" model to take into production.
Another challenge is that the metrics of success vary widely depending on the use case. There are dozens of metrics used to describe the quality and accuracy of the model output on test data. Even as an IT generalist, it pays to at least get comfortable with the matrix of machine learning outcomes, expressed with quadrants for the counts of true positives, true negatives, false positives (items falsely identified as positive) and false negatives (positives that were missed).
Many key metrics derive from these four basic measurements when using machine learning and big data. For example, overall accuracy is usually defined as the number of instances that were correctly labeled (true positives plus true negatives) divided by the total number of instances. If you want to know how many of the actual positive instances you are identifying, sensitivity (or recall) is the number of true positives found divided by the total number of actual positives (true positives plus false negatives).
Precision is often important too. It is the number of true positives divided by all items labeled positive (true positives plus false positives). A simplistic model that labels everything positive would have 100% recall, but terrible accuracy and precision whenever actual positives are rare -- it finds everything, but you can't tell the wheat from the chaff. Usually some tradeoff is made between these metrics to find an optimal balance.
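To make those definitions concrete, here is a minimal Python sketch that computes accuracy, recall and precision from the four quadrant counts. The class name `ConfusionCounts` and the sample numbers (1,000 items, 50 of them actually positive) are assumptions made up for illustration, not figures from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from the four quadrants of a binary outcome matrix."""
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives (items falsely identified as positive)
    fn: int  # false negatives (positives that were missed)

    def accuracy(self) -> float:
        # Correctly labeled instances divided by all instances.
        total = self.tp + self.tn + self.fp + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    def recall(self) -> float:
        # True positives divided by all actual positives (sensitivity).
        actual_positives = self.tp + self.fn
        return self.tp / actual_positives if actual_positives else 0.0

    def precision(self) -> float:
        # True positives divided by everything labeled positive.
        labeled_positive = self.tp + self.fp
        return self.tp / labeled_positive if labeled_positive else 0.0


# Hypothetical data set: 1,000 items, only 50 actually positive.
# A simplistic model labels every item positive.
label_everything = ConfusionCounts(tp=50, tn=0, fp=950, fn=0)

print(f"recall:    {label_everything.recall():.2f}")
print(f"accuracy:  {label_everything.accuracy():.2f}")
print(f"precision: {label_everything.precision():.2f}")
```

With these made-up counts, recall comes out at 1.00 while accuracy and precision both sit at 0.05, which is the "finds everything, but can't tell the wheat from the chaff" case in numbers.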
In some uses for big data based on machine learning, such as targeted marketing, a 20% advantage over randomly flipping a coin might be great (in Las Vegas, the house really needs only a 1% advantage to prosper over time). In other situations, such as when screening a million people for cancer, even a 99% accuracy rate can lead to bad consequences: assuming a low incidence of actual cancer, most of the 1% inaccuracy would be false positives, and that might translate to 10,000 unnecessary treatments.
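The screening arithmetic can be sketched in a few lines of Python. The 0.5% incidence and the choice to apply the same 99% rate to both sick and healthy people are assumptions picked only to illustrate the point; the function name `screening_outcomes` is likewise hypothetical.

```python
def screening_outcomes(population: int, prevalence: float,
                       sensitivity: float, specificity: float) -> dict:
    """Split a screened population into the four outcome counts."""
    actual_pos = round(population * prevalence)
    actual_neg = population - actual_pos
    tp = round(actual_pos * sensitivity)   # sick people correctly flagged
    fn = actual_pos - tp                   # sick people missed
    tn = round(actual_neg * specificity)   # healthy people correctly cleared
    fp = actual_neg - tn                   # healthy people wrongly flagged
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp}


# Assumed scenario: one million people screened, 0.5% incidence,
# and a test that is right 99% of the time for both groups.
result = screening_outcomes(
    population=1_000_000, prevalence=0.005,
    sensitivity=0.99, specificity=0.99,
)

errors = result["fp"] + result["fn"]
flagged = result["tp"] + result["fp"]
print(f"total errors:    {errors:,}")
print(f"false positives: {result['fp']:,}")
print(f"false negatives: {result['fn']:,}")
print(f"share of flagged people who are actually sick: {result['tp'] / flagged:.1%}")
```

Under these assumed rates, 9,950 of the 10,000 errors are false positives, and only about a third of the people flagged actually have the disease, even though the test is "99% accurate."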
This brings us to machine learning's impact on IT. First, the host storage and the processing platform should match the kind of learning you're attempting. Sometimes learning is done offline, and the resulting model is applied as a simple processing step in production. Other times the learning is continuous or recurring (e.g., reinforcement learning) and needs to be closer to the current data stream.
Some machine learning algorithms scale better than others, with partitionable libraries suitable for big data scale-out clusters (e.g., Apache Mahout, MLlib, MADlib), while others might require high-speed, high-performance computing-style interconnects and read-write transactional storage architectures to calculate efficiently.
In-memory tools can be the way to go for heavy-duty interactive data mining or predictions that require low latency. And there are cloud-hosted machine learning services that charge by the API call in production, which may be cost-effective if your data is already hosted in the cloud.
If you have programming chops and want to play around with or start developing machine learning, there are free packages for Python and other languages. You can even sign up for a free-to-develop, cloud-hosted machine learning studio on Microsoft Azure. Many of these products can run on small data sets locally on your laptop and scale to large data sets for production. This is a hot area, and every day we hear of vendor-specific offerings that promise to make machine learning simple enough for the average business analyst.
All of this predictive modeling isn't artificial intelligence. Yes, it can provide a real and distinct business advantage by looking for and exploiting deeper patterns in the data, but all you've established is correlation. And as they always told us in school, correlation isn't causation.
Still, given how easy it's becoming to apply machine learning methods to just about any interesting data set, it is valuable for all IT organizations to start developing in-house expertise -- gathering and cleaning data, hosting development, assisting modeling efforts and applying them in production. Expertise in data science is valuable, but given the democratization going on in this field, you shouldn't wait until you can justify a team of full-blown data scientists to get started.
Mike Matchett is a senior analyst and consultant at Taneja Group. Contact him via email at firstname.lastname@example.org.