SAN JOSE, Calif.—Business, scientific and academic researchers have access to an unprecedented array of data to mine for significant trends, from social network chatter to consumer buying patterns, credit card transactions and even sports statistics.
But speakers at the Hadoop Summit here June 10 noted that many organizations aren’t aware of the novel techniques they can use to analyze mountains of data to gain meaningful insights.
One speaker used sports statistics to illustrate new approaches companies should consider in dealing with big data.
“In sports we’re drowning in data, but it’s largely ineffective because it needs to be married with small data,” said David Epstein, author of The Sports Gene: Inside the Science of Extraordinary Athletic Performance.
He used the example of sprinters, pointing out that typically only a second or less separates those who consistently finish first or second from those who finish farther back in the field. He said the emerging field of sports science is using “small data” to see how athletes can improve performance.
In one case, researchers analyzed three basic variables in how three top Olympic shot putters released the shot. They discovered that the gold medalist released it at an angle one degree higher than his competitors did.
Similarly, researchers took a new approach to studying broad jumpers’ techniques. While past studies looked at factors like speed and the force with which jumpers took off from the board, a smaller data set gathered by a biomechanics specialist revealed that the key difference for the winner was the angle of takeoff. Using that data, a broad jumper from Great Britain changed his training and won a gold medal even though he wasn’t favored.
What is the lesson for business in these examples? As in sports, the difference between good and great is often less than 1 percent. A company might find, for example, that some small glitch in customer service or responsiveness is keeping it from leading its market.
TrueCar Finds Hadoop Drives Value
One company that has moved aggressively to get more from big data is car-buying service TrueCar, which maintains a massive, up-to-the-minute database of selling prices. Russ Foltz-Smith, head of the company’s data platform, said the biggest challenge it faced when it ramped up efforts to use a Hadoop-powered system to manage its “couple of petabytes” of data was finding qualified developers.
Finding few qualified applicants, TrueCar decided to hire developers and train them in Hadoop, and went from there. “It was a hard decision, but now we have over 25 Hadoop experts and we’re extremely effective at hiring more,” Foltz-Smith said.
TrueCar has 600 TB of data in active use at any one time and over 20 million buyer profiles.
“The idea is to be the brain of the industry,” said Foltz-Smith. “The important thing is you can’t be wrong in the automotive industry. If you’re wrong, you lose the transaction.”
Staying at the cutting edge, TrueCar recently developed what Foltz-Smith said is an advanced, multidimensional, real-time search capability.
“It’s very much like a Minority Report experience within TrueCar. It’s not science fiction,” he said.
The big advantage of working with Hadoop for TrueCar, which uses the Hortonworks Data Platform implementation of Hadoop, is its ability to scale. Foltz-Smith said TrueCar’s data has grown 24-fold in the past year, with the system processing 12,000 data feeds and 65 billion data points.
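Foltz-Smith didn’t describe TrueCar’s pipeline in detail, but a minimal PySpark sketch gives a sense of the kind of feed-level aggregation that a Hadoop cluster at this scale makes routine. The paths, schema and column names below are hypothetical illustrations, not TrueCar’s actual system:

```python
# Minimal sketch (hypothetical paths and schema, not TrueCar's pipeline):
# distill billions of raw listing records stored on HDFS into per-feed price statistics.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feed-aggregation").getOrCreate()

# Raw records landed from thousands of feeds, stored in their original partitions.
listings = spark.read.parquet("hdfs:///data/listings/")

daily_prices = (
    listings
    .filter(F.col("sale_price") > 0)  # drop malformed records
    .groupBy("feed_id", "make", "model", F.to_date("sold_at").alias("sale_date"))
    .agg(
        F.count("*").alias("transactions"),
        F.avg("sale_price").alias("avg_price"),
        F.expr("percentile_approx(sale_price, 0.5)").alias("median_price"),
    )
)

daily_prices.write.mode("overwrite").parquet("hdfs:///marts/daily_prices/")
```

Because the work is distributed across the cluster, the same job keeps running as the volume of feeds and data points grows; scaling is a matter of adding nodes rather than rewriting the logic.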
The company also manages some 700 million car images that it makes available to customers. “If there is no vehicle image, the car doesn’t exist (as far as the consumer is concerned),” said Foltz-Smith. “And there is a ton of intelligence embedded in those images.”
Is Your Data Lake Polluted?
Walter Maguire, chief field technologist at HP’s Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes. A data lake is a storage repository that holds large amounts of raw data in native format until it’s needed.
But Maguire said he has heard IT professionals disparage the concept with terms like “data dump” and “data swamp” because, while data lakes can be a convenient way to store vast amounts of raw data, it’s not always easy to get at the data you need. “A CIO told me ‘there are three petabytes in my Hadoop data lake and I don’t know which 100 terabytes are really important.’ I’ve heard this again and again,” said Maguire.
After showing a picture of a murky, polluted lake, Maguire used an image of a clear lake to detail HP’s solution, Haven for Hadoop, which he said “makes the data lake business-ready. An analyst can sit at a console and get at the data no matter what format it’s in.”
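Maguire didn’t walk through Haven for Hadoop’s internals, and the sketch below is not HP’s product; it is a generic illustration of the schema-on-read idea behind a usable data lake, written with Spark SQL and hypothetical paths and field names. Raw files stay in the lake in their native formats, and structure is applied only when someone queries them:

```python
# Schema-on-read sketch (generic Spark SQL, not HP Haven; paths and fields are hypothetical):
# raw files remain in the lake as they landed, and a schema is imposed only at query time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Point directly at raw, untransformed files in the lake.
clicks = spark.read.json("hdfs:///lake/raw/clickstream/")                    # JSON events
orders = spark.read.option("header", True).csv("hdfs:///lake/raw/orders/")   # CSV extracts

clicks.createOrReplaceTempView("clicks")
orders.createOrReplaceTempView("orders")

# An analyst can join across formats with plain SQL, with no upfront ETL.
spark.sql("""
    SELECT o.region, COUNT(DISTINCT c.session_id) AS sessions
    FROM clicks c
    JOIN orders o ON c.user_id = o.user_id
    GROUP BY o.region
""").show()
```

The trade-off Maguire alluded to is that without cataloging and curation on top of this pattern, analysts still have to know which raw files matter, which is how a lake turns into a swamp.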
Quentin Clark, CTO of SAP, said data and digitization are at the heart of huge changes in society.
“Imagine we live in a world where Uber and Airbnb are the largest rental companies and they don’t own any assets. How is that possible? Data is at the heart of it. These companies deeply embrace data to understand what is going on with the user’s experience,” he said.
Clark said he expects big data systems like SAP’s own HANA in-memory database to help transform more industries.
“You can imagine any walk of life seeing transformation over the next decade. In retail, it’s the ability to understand where customers are in a retail shop, using big data to realize what products you need, seeing in real time the effectiveness of sales associates, and being able to change how the store operates on an hour-to-hour basis.” He expects big data systems to help oil and gas companies proactively identify when systems or machinery need downtime for maintenance, saving millions of dollars.
In health care, he expects wearables and other advances to yield vast new sources of information. “We should be striving to make every doctor smarter in real time so their knowledge can be augmented in real time rather than having to chase down medical journals,” he said.