NEW ORLEANS—It was clear when I stood up to speak at the session on big data at the Society of Professional Journalists’ Excellence in Journalism 2016 conference here that I wasn’t addressing your average trade show audience.
The hundred or so people in front of me were all professional journalists, which meant that they were expecting a no-nonsense, practical look at how they could use vast data archives to find the truth on a wide variety of topics, ranging from political corruption to the spread of the Zika virus.
With me at the front of the room were Pam Baker, the highly respected author of Data Divination: Big Data Strategies, and Louis Lyons, chief operating officer of ICG Solutions, the company that created the LUX2016 data analysis engine and helped with our examination of viewer reaction to last year’s Democratic and Republican primary debates.
Our discussion started out with my description of how we used data analysis to figure out who won last year’s debates well before the major news organizations had polling data to release. But as valuable as that effort to use big data to support a feature article was, the fact is that data analysis goes far beyond what we accomplished in eWEEK’s first attempt.
This is no surprise because the analysis of large data sets is still in its infancy, and while data analysis can be used by most news organizations to gain important insights, finding the way is still hard.
Fortunately, I had Pam Baker at my side in this effort, and she was able to part the seas of confusion and explain what big data can do and what it can’t. The first lesson was that big data analysis isn’t magic, and just because you have a lot of data doesn’t mean it’s useful data.
What really matters, Baker said, is that your big data archive contains accurate data that is most relevant to the information you hope to discover.
In addition, Baker pointed out that it’s critical that you know the origin of the data you’re planning to analyze and that you’re comfortable with the way that the data was collected so that you can be more confident that you are working with valid information. One example that Baker cited was the difference in the influenza infection rates reported by the Centers for Disease Control and by Google Flu Trends.
Google Flu Trends was an effort to determine infection rates using only information obtained on social media. The CDC, on the other hand, compiled influenza infection rates using a variety of sources, including social media, but also including government sources, health care providers and more.
After a few years of testing, Google Flu Trends was withdrawn because it wasn’t accurate. Social media alone, it seems, is not a good way to track disease.
Baker also said it’s important to question the output once you’ve performed an analysis. She pointed out that many sources provide data that has already been analyzed, and that analysis can contain mistakes. She added that you need to check your own analysis as well, starting with basics such as checking the math.
One way to make your data analysis more accurate is to diversify your sources, which is what the CDC did with its flu reporting. Baker added that you have to assume your analysis will fail and be prepared to figure out what to do next.
“Data is essential and analytics certainly speed results, but don’t assume results to be infallible,” Baker said. She recommended running any analysis at least three more times to make sure it is correct.
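Baker’s advice—recompute the math yourself and compare independent sources—can be put into practice with a short script. The sketch below is purely illustrative: the weekly figures, field names and tolerance threshold are invented, not real CDC or Google Flu Trends numbers. It recomputes an infection rate from raw counts and flags weeks where a second, pre-analyzed source disagrees by more than a set tolerance:

```python
# Cross-check our own computed rates against an independently analyzed source.
# All figures below are invented for illustration.

# Source A: raw weekly counts (cases, population surveyed)
source_a = {
    "2016-W01": (120, 40_000),
    "2016-W02": (95, 38_000),
    "2016-W03": (210, 41_000),
}

# Source B: already-analyzed rates per 1,000 from another provider
source_b = {
    "2016-W01": 3.1,
    "2016-W02": 2.4,
    "2016-W03": 9.8,   # suspiciously high -- worth a phone call
}

def rate_per_thousand(cases, population):
    """Recompute the rate ourselves instead of trusting a published figure."""
    return 1000 * cases / population

def flag_disagreements(a, b, tolerance=1.0):
    """Return weeks where the two sources differ by more than `tolerance`."""
    flagged = []
    for week, (cases, pop) in a.items():
        ours = rate_per_thousand(cases, pop)
        theirs = b.get(week)
        if theirs is not None and abs(ours - theirs) > tolerance:
            flagged.append((week, round(ours, 2), theirs))
    return flagged

for week, ours, theirs in flag_disagreements(source_a, source_b):
    print(f"{week}: we computed {ours}, source B reports {theirs} -- investigate")
```

A disagreement doesn’t tell you which source is wrong—only that a reporter needs to dig further before publishing either number.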
Fortunately, much of the data you’re likely to need for analysis is readily available. The government has a wealth of information and provides a site, Data.gov, where vast amounts of government data can be found and much of it is very useful.
Many government agencies, ranging from the CDC to the Federal Communications Commission, have stores of data that are available for analysis, much of the time simply on request. But it’s important to try to confirm that data, just as you would information from any other source. Just because it’s from the government doesn’t mean it’s accurate, current or relevant.
It’s also worth noting that much of the data you may need in business, or in my case in journalism, isn’t in a useful form. It may be in files that need to be converted to a format that can be readily analyzed with the available tools, or the data may be in printed reports where it must be entered manually or scanned to be useful.
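That conversion step is often mundane but unavoidable. As a hypothetical illustration (the report layout, region names and figures below are all invented), a short script can turn the text dump of a printed table into clean records and write them out as CSV, a format most analysis tools accept:

```python
import csv
import io

# A made-up text dump, as might come from scanning a printed agency report.
raw_report = """\
Region          Cases   Population
Northeast       1,204   5,600,000
South           3,411   11,200,000
Midwest         987     4,900,000
"""

def parse_report(text):
    """Convert whitespace-aligned report text into clean numeric records."""
    lines = text.strip().splitlines()
    records = []
    for line in lines[1:]:                  # skip the header row
        region, cases, population = line.split()
        records.append({
            "region": region,
            "cases": int(cases.replace(",", "")),        # drop thousands separators
            "population": int(population.replace(",", "")),
        })
    return records

def to_csv(records):
    """Write the records as CSV so standard analysis tools can read them."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "cases", "population"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

records = parse_report(raw_report)
print(to_csv(records))
```

Real exports are rarely this tidy—OCR errors, merged columns and footnotes all demand hand-checking—but the principle is the same: get the data into a structured, machine-readable form before any analysis begins.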
If it looks like using big data may be a lot of trouble, you’re right. There’s nothing magical about big data, including the fact that it’s big. As Baker told me, it’s more important to have the right data than it is to have a lot of data.
The old adage of “garbage in, garbage out” holds true when it comes to data analysis. Accumulating lots of useless data still leaves you with useless data; there’s just more of it. But once you do find the right data, and analyze it properly, it can show you things that you can’t find any other way, and that is what makes this technology so valuable to business managers and journalists.