Hadoop Drives Down Costs, Drives Up Usability With SQL Convergence

SPECIAL FEATURE: As more enterprises begin to adopt the Hadoop big data wrangling technology, there is a growing need for SQL convergence.

In 2011, Charles Boicey looked at Twitter, Facebook, Yahoo and other major Web entities and said to himself, "Why do those guys get to have all the fun?"

Boicey, an informatics solutions architect at the UC Irvine Medical Center, said he could very much see that the underlying big data technologies driving the big Web companies could help in the IT environment at the medical center.

Boicey told eWEEK, "I was intrigued by the volume of data and the speed with which they could access it, and I said, 'Why can't we do that' in healthcare?"

"We came to the conclusion that healthcare data although domain specific is structurally not much different than a tweet, Facebook posting or LinkedIn profile and that the environment powering these applications should be able to do the same with health care data," he wrote in a 2012 blog post.

Moreover, "A lab result is not that much different than a Twitter message," he told eWEEK. "Pathology and radiology reports share the same basic structure as a LinkedIn profile with headers, sections and subsections. A medical record shares characteristics of Facebook in that both represent events over time."

Indeed, in health care, data shares many of the same qualities as that found in the large Web properties. Both have a seemingly infinite volume of data to ingest, and it is all types and formats across structured, unstructured, video and audio, Boicey said. "We also noticed the near zero latency in which data was not only ingested but also rendered back to users was important. Intelligence was also apparent in that algorithms were employed to make suggestion such as people you may know."

That was the beginning of a strategy to employ Hadoop at the medical center. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports the running of applications on large clusters of commodity hardware. Hadoop, derived from Google's MapReduce and Google File System (GFS) papers, became a target technology because of its attractive scale-to-cost ratio and because it is open source.

The UCI Medical Center's first big data project was to build an environment capable of receiving Continuity of Care Documents (CCDs) via a JSON pipeline, store them in MongoDB and then render them via a Web user interface that had search capabilities. From there, the new system, known as Saritor, went online.

Boicey said Saritor became a necessity because electronic medical records (EMR) cannot handle complex operations such as anomaly detection, machine learning, complex algorithms or pattern set recognition, and the Enterprise Data Warehouse (EDW) supports quality control, operations, clinicians and researchers.

"We, like many organizations with data warehouses, run ETL [extract, transform, load] processes at night to minimize the load on the production systems," Boicey said in his post. "We have some real-time interfaces with the data warehouse, but not all data is ingested in real time. In turn, our data suffers from a latency factor of up to 24 hours in many cases, making this environment suboptimal."