
Big data testing to production: All Vertica systems not a go

IT pros at TurboTax spotted trouble with big data query times on Vertica after standing up 40 new servers, with the tax deadline just weeks away.

BOSTON -- It was a few weeks before the U.S. tax deadline of April 15 when the data scientists at TurboTax started to notice that Vertica was "taxing" the company's brand new servers.

Just a few months earlier, Intuit Inc., the parent company of TurboTax, had undertaken a cluster upgrade. The organization went from 16 Dell servers to 40 as part of a plan to put the big data analytics platform HP Vertica into production.

All of the company's business happens during the three-month window of tax season, and about half of it occurs over just 10 days, when customers file their annual tax returns.

"We can lose or win in the matter of a few days -- that is our biggest challenge," said Massimo Mascaro, chief data scientist for the consumer tax group TurboTax at Intuit, based in Mountain View, Calif.

The company has 190 active users on Vertica and runs up to 65,000 queries per day. Intuit uses the data to understand what questions to ask users as they file their taxes. For example, it helps decide that a retiree won't be asked about school loans and 20-somethings won't be asked about their retirement income.

"There is a lot of statistical inference that goes on while asking the questions," Mascaro said.

TurboTax can also predict whether or not a taxpayer should itemize their taxes. The company estimates that the feature saves taxpayers a cumulative 2 million hours of preparation time each year.

Last year, TurboTax used a 16-server cluster to run the Vertica prototype as a backup system, with a plan to go to 40 servers before moving into production for the 2015 tax-filing season. The company conducted an in-season test with 16 nodes in April 2014. Just a few production queries ran on it before it went live with 40 nodes in December 2014.

Each node consisted of a Dell PowerEdge R620 with one rack unit, as well as two rack units of Dell PowerVault MD1220 direct-attached storage.

The move was part of a mantra at Intuit that "nothing hits production until it is fire-tested and fireproof."

To make the transition, HP's support team advised TurboTax to run Vertica on the identical hardware with the identical configuration to production. The new machines were supposed to be configured the same way as the pre-existing machines, Mascaro said.

But in March, TurboTax staff started to notice a big spike in the duration of queries on Vertica.

It was getting close to what the company calls the "second peak" -- the time of the year when tax filings peak for the second and final time just before the April 15 deadline. TurboTax got worried and reached out to HP's support.

HP did a full analysis of the servers and found that a BIOS flag was set up differently on some of the machines. After 48 hours, the problem was resolved and query time dropped by 80%.
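A mismatch like this can in principle be caught by comparing settings across nodes and flagging any machine that disagrees with the majority. The following is a minimal sketch of that idea, not the method HP used. The function name `find_config_drift` and the setting names in the usage note are hypothetical, since the article does not say which BIOS flag was involved, and the sketch assumes each node's settings have already been collected into a dictionary.

```python
from collections import Counter


def find_config_drift(node_settings):
    """Report settings whose value on some node differs from the majority.

    node_settings maps a node name to a dict of setting -> value.
    Returns {setting: {node: outlier_value}}; an empty dict means no drift.
    A setting missing on a node is treated as the value None.
    """
    drift = {}
    all_keys = sorted({key for settings in node_settings.values() for key in settings})
    for key in all_keys:
        values = {node: settings.get(key) for node, settings in node_settings.items()}
        # The most common value across the cluster is taken as the baseline.
        majority_value, _ = Counter(values.values()).most_common(1)[0]
        outliers = {node: value for node, value in values.items() if value != majority_value}
        if outliers:
            drift[key] = outliers
    return drift
```

Given three nodes where only one reports a hypothetical `power_profile` of `balanced` while the others report `performance`, the function returns that one node under `power_profile` and nothing else. Majority voting is a design assumption here; a team with a known-good reference machine could compare against that instead.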

"The hardware was identical, but there was a configuration at the BIOS level that was wrong by mistake," Mascaro said. "It was messing up Vertica when we had very high throughputs."

Intuit staffers had no idea it was a hardware issue, and later found out that they had gone through most of the season with a "performance hit" they didn't realize, Mascaro said.

"It looked normal to us," Mascaro said. "It probably would have taken us a few weeks to figure it out without their help and we would have missed our second peak."

Big data testing to production issues

The scenario TurboTax encountered was one of "a million things that can go wrong between test and production," specifically for big data projects and other applications that require real-time interactive performance, according to Mike Matchett, an analyst and consultant for the Taneja Group Inc. in Hopkinton, Mass.

Moving into production in a large server environment creates the opportunity for many server-specific configuration issues. For example, the application might not scale automatically across all available cores and sockets because of threading constraints. A move into production may also reveal long-running processes that should be pinned to a given socket or core to avoid thrashing, in which processes share CPU memory and cache inefficiently.
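As a rough illustration of core pinning, the sketch below uses the CPU affinity calls in Python's standard `os` module, which are available on Linux only. It is an assumption-laden example, not something described by Matchett or used at Intuit; the helper names `usable_cpu_count` and `pin_to_cpus` are hypothetical.

```python
import os


def usable_cpu_count():
    """Number of CPUs this process may run on, which can be fewer than the machine has."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity calls: total CPUs reported by the OS.
    return os.cpu_count()


def pin_to_cpus(cpus, pid=0):
    """Restrict process `pid` (0 means the current process) to the given CPU ids.

    Only CPUs the process is already allowed to use are kept.
    Returns the affinity set in effect after the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this platform")
    allowed = os.sched_getaffinity(pid)
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

Comparing `usable_cpu_count()` with `os.cpu_count()` is one quick way to see whether a process is confined to fewer cores than the server provides. Which cores belong to which socket varies by machine, so choosing the CPU ids to pin to is left to the operator.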

The move from test to production may also show that the memory available to the application in production isn't what IT thinks it should be, or it may not be configured or allocated as optimally as it was in test.
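One simple guard against that kind of surprise is to compare the memory a server actually reports with the amount the team expects. The sketch below is a hypothetical check, not a tool named in the article: `memory_shortfall`, its 5% tolerance and the use of POSIX `sysconf` values are all assumptions. It measures total physical RAM, which may still differ from what an application is allowed to use under container or process limits.

```python
import os

GIB = 1024 ** 3


def physical_memory_bytes():
    """Total physical RAM reported by the OS via POSIX sysconf (works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def memory_shortfall(expected_gib, actual_bytes=None, tolerance=0.05):
    """Return how many GiB short the machine is of `expected_gib`.

    Returns 0.0 when actual memory is within `tolerance` (a fraction) of the
    expected amount or above it. `actual_bytes` can be passed in explicitly,
    for example when the figure was collected from a remote node.
    """
    if actual_bytes is None:
        actual_bytes = physical_memory_bytes()
    actual_gib = actual_bytes / GIB
    if actual_gib >= expected_gib * (1 - tolerance):
        return 0.0
    return expected_gib - actual_gib
```

The tolerance exists because the operating system usually reports slightly less than the installed amount, with some memory reserved by firmware and the kernel.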

Some other hiccups IT pros should watch for in making the transition include changes in access, such as user IDs and permissions, Matchett said. Additionally, test environments often don't test with system or other batch workloads, so server management could impact the application in production.

Matchett, who worked to solve those issues during his time as the principal UNIX performance consultant for BMC Software Inc., said there are too many misconfigurations to list.

"That would take a book or three," Matchett said. 

Robert Gates covers data centers, data center strategies, server technologies, converged and hyperconverged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT.
