NEW YORK—Organizations planning to use
Hadoop to aggregate and analyze data from multiple sources need to consider
potential security issues beforehand, according to IT professionals at the
Hadoop World conference here.
Hadoop makes it easier for organizations
to get a handle on the large volumes of data being generated each day, but it can
also create problems related to security, data access, monitoring, high
availability and business continuity, Larry Feinsmith, managing director of IT
operations at banking giant JPMorgan Chase, said in a keynote speech at Hadoop
World on Nov. 8.
Data is growing faster than ever,
thanks to blogs, social media networks, machine sensors and location-based data
from mobile devices. Companies can analyze the data to gain insights into
customers and industry trends that were previously out of reach. However,
organizations now face the prospect of managing and securing petabytes of
data, Richard Clayton, a software engineer with
Berico Technologies, said during a security panel at the conference.
The data is not monolithic, as there
may be mixed classifications and varying levels of security sensitivity,
Clayton said. As an IT services contractor for federal agencies, Berico
Technologies had to consider varying encryption technologies, retention
policies and access requirements for individual pieces of data.
Most organizations don't have the
visibility they need to understand what they have and to properly secure it,
Ken Cheney, vice president of business development and marketing at storage
management software vendor Likewise, told eWEEK
before the conference. That visibility is essential to "know who owns the
data, and who has access to it," Cheney said.
Enterprises need to implement
appropriate security controls for enforcing role-based access to the data,
according to Clayton. However, he said the security features built into the
Hadoop Distributed File System (HDFS), such as access control lists and
Kerberos authentication, are not adequate to meet enterprise needs.
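To make the idea of role-based access over data with mixed classifications
concrete, here is a minimal Python sketch. It is not drawn from Clayton's
remarks or from any Hadoop API; the sensitivity levels, role names, and the
Record, can_read and readable helpers are all hypothetical, chosen only to
show a per-record check against a role's clearance.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}

# Hypothetical mapping from a role to the highest level that role may read.
ROLE_CLEARANCE = {"guest": "public", "analyst": "internal", "auditor": "restricted"}


@dataclass(frozen=True)
class Record:
    key: str
    classification: str  # one of the keys in LEVELS
    payload: str


def can_read(role: str, record: Record) -> bool:
    """Return True if the role's clearance covers the record's classification."""
    clearance = ROLE_CLEARANCE.get(role)
    if clearance is None:
        # Unknown roles are denied by default.
        return False
    return LEVELS[clearance] >= LEVELS[record.classification]


def readable(role: str, records: list) -> list:
    """Filter a mixed-classification dataset down to what the role may see."""
    return [r for r in records if can_read(role, r)]


# Illustrative data only.
RECORDS = [
    Record("r1", "public", "press release"),
    Record("r2", "internal", "clickstream summary"),
    Record("r3", "restricted", "account numbers"),
]

if __name__ == "__main__":
    for role in ("guest", "analyst", "auditor"):
        print(role, [r.key for r in readable(role, RECORDS)])
```

In this sketch the check is applied to each record rather than to a whole
file, which is the granularity a dataset with mixed classifications would
call for.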
Many organizations tie the data being
stored to identity management systems, such as Active Directory or LDAP, as the
"source of truth," according to Cheney. By linking the data with an
actual identity, IT departments can track what is being done with the data and
by whom, he said.
Another big concern for organizations
using Hadoop is that analyzing the data within the environment creates
new datasets that also need to be protected, Clayton said. Aggregating the
data in one place also increases the risk of data theft or accidental
disclosure, he said. An effective data security approach in many Hadoop
environments would be to encrypt the data at the individual record level,
whether it is in transit or at rest, according to Clayton.
Many government agencies are putting
Hadoop-stored data into separate "enclaves," or network segments, to
ensure that only people with the proper level of security clearance can view
the information, he said. Others are building firewalls that protect Hadoop
environments and restrict access, Clayton said.
Some agencies have opted out of using
Hadoop altogether because of these data access concerns, according to
Clayton.
Technology companies such as IBM, Yahoo and
Google have been using Hadoop for years, but only recently have large
enterprises started looking at Hadoop to rein in their out-of-control
data.
JPMorgan Chase has been using the open-source
storage and data analysis framework for almost three years in various
applications, such as fraud detection, IT risk management and self-service,
Feinsmith said. Chase relies on Hadoop to collect and store Weblogs,
transaction data and social media information on a common platform and runs
data mining and analytics applications to gather intelligence, according to
Feinsmith.