Plan for Information Security From the Start
Get In Early on Projects, Ask Questions About the Data
Apache Hadoop projects are probably already popping up in your organization; don’t wait until after the fact to ask questions about the data. As a leader charged with protecting your organization’s sensitive data, you need to know where the sensitive data is, who will have access to it, what access rules apply in the source system and whether those rules carry over into Hadoop. You will also need to know whether any of the data is subject to HIPAA (Health Insurance Portability and Accountability Act), PCI DSS (Payment Card Industry Data Security Standard), SOX (Sarbanes-Oxley Act) or any other regulatory requirements.
Tie Into Your Corporate Email and Identity System
Chances are that you already have a corporate identity system, LDAP, Active Directory or a simple Gmail.com log-in in place; tie your Apache Hadoop users and groups to this. Establishing centralized user access control and management early on will help you in many administrative tasks as well as security audits down the line.
Encrypt Your Data
The argument that encryption slows systems down no longer holds. Apache Hadoop distributions support over-the-wire encryption and are starting to enable data-at-rest encryption that has little to no impact on speed. With faster hardware and built-in cryptographic acceleration available, there is little reason to skip this critical step.
Log Everything and Keep Backups
IT and/or security managers need to enable all the logging and monitoring capabilities of the platform and maintain a centralized way of viewing, auditing and archiving this data. They need to continually monitor logs and transactions proactively for any suspicious activity and reactively for forensics, root cause analysis and sometimes evidence retention.
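As a rough illustration of proactive log monitoring, the sketch below counts denied requests per user. It assumes audit lines in a tab-separated key=value layout similar to the HDFS audit log (`allowed=`, `ugi=`, `cmd=`, `src=`); the function names, the sample lines and the threshold of three denials are hypothetical choices, not platform defaults.

```python
import re
from collections import Counter

# Matches key=value pairs; values end at the first whitespace character.
FIELD = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line):
    """Return the key=value fields of one audit line as a dict."""
    return dict(FIELD.findall(line))


def denied_counts(lines):
    """Count denied requests (allowed=false) per user (ugi field)."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("allowed") == "false":
            counts[fields.get("ugi", "unknown")] += 1
    return counts


def suspicious_users(lines, threshold=3):
    """Users with at least `threshold` denials (threshold is an assumption)."""
    return sorted(u for u, n in denied_counts(lines).items() if n >= threshold)


# Made-up sample lines in the assumed layout.
DENIED = ("2014-06-01 10:00:0{i},000 INFO FSNamesystem.audit: allowed=false\t"
          "ugi=mallory (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\t"
          "src=/data/restricted/pci/cards.csv\tdst=null\tperm=null")
ALLOWED = ("2014-06-01 10:01:00,000 INFO FSNamesystem.audit: allowed=true\t"
           "ugi=alice (auth:SIMPLE)\tip=/10.0.0.7\tcmd=listStatus\t"
           "src=/data/public\tdst=null\tperm=null")
SAMPLE_LINES = [DENIED.format(i=i) for i in range(3)] + [ALLOWED]

if __name__ == "__main__":
    print(suspicious_users(SAMPLE_LINES))
```

In practice a check like this would run against the centralized log archive rather than an in-memory list, and its output would feed whatever alerting the security team already uses.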
Set Up a Security Steering Committee
Security has many layers, including everything from physical security and risk mitigation when using your laptop, mobile phone or public WiFi to having security steps during the HR on-boarding and termination processes. Set up a security steering committee comprising members from IT, HR and even line-of-business employees (marketing, sales, etc.). If you don’t already have an information security officer, at a minimum assign this role to someone in IT and send him or her to a security class to learn where to start.
Identify and Tag Your Sensitive Data
Data access should never be open by default; it should always be set on a “need-to-know” basis. Make sure you have processes in place that allow you to identify and tag sensitive data and request access to that data. Data security tagging capabilities are in very early stages within Apache Hadoop, but you can start now by segregating data in directories using naming conventions or separate metadata to tag and identify your sensitive data.
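The directory naming approach can be sketched in a few lines. The example below is only an illustration under stated assumptions: the path prefixes, the tag names and the `can_read` helper are hypothetical, and real enforcement would rely on the cluster's own permissions rather than application code.

```python
# Hypothetical naming convention: directory prefix -> sensitivity tag.
# More specific prefixes are listed first so they win over general ones.
TAG_RULES = [
    ("/data/restricted/pci/", "PCI"),
    ("/data/restricted/hipaa/", "HIPAA"),
    ("/data/restricted/", "SENSITIVE"),
]

UNTAGGED = "UNTAGGED"


def tag_for_path(path):
    """Return the tag implied by the directory a file lives in."""
    for prefix, tag in TAG_RULES:
        if path.startswith(prefix):
            return tag
    return UNTAGGED


def can_read(approved_tags, path):
    """Need-to-know check: allow access only if the path's tag was
    explicitly approved for the user. Untagged data is denied too,
    so nothing is open by default."""
    return tag_for_path(path) in approved_tags


if __name__ == "__main__":
    print(tag_for_path("/data/restricted/pci/cards.csv"))
    print(can_read({"PCI"}, "/data/restricted/hipaa/claims.csv"))
```

Treating untagged paths as denied is a deliberate choice in this sketch: it forces new data sets to be classified before anyone can request access to them.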
Voice Your Security Requirements
Apache Hadoop distribution vendors, developers, users and the security community all want to hear security requirements drawn from real customer use cases. Reach out or, even better, contribute back to Hadoop under the Apache license, even if that only means opening a ticket and writing up a requirement. There are many security features on the open-source Apache Hadoop roadmap, and the ones that garner the most interest will go to the top of the list.
Expect More From Your Commercial Hadoop Distribution
Add security to the list of things you should expect from your Hadoop support subscription. Setting up a secure Hadoop cluster is not trivial and touches many areas, including Kerberos and keytab configuration, SSH (Secure Shell), SSL (Secure Sockets Layer) certificates, RSA key management, SSO (single sign-on) integration, secure logging, cryptographic ciphers, role-based access control and secure cluster provisioning, just to name a few.
Empower and Layer Security, One Coat at a Time
Be a friend to business and productivity by enabling your business to securely tap into data sets in Hadoop and extract knowledge in ways that were not possible before. Add security in layers that reduce risk without completely blocking business; if you put up complete barriers, users will go around security altogether with skunkworks projects, which is a more dangerous proposition.
Understand Data’s Lineage
Hadoop provides many ways to ingest data from various sources. It is a good security practice to keep track of data lineage (where the data came from). It is important to understand the sources of all data sets, including derived data sets, to support compliance and audit requirements. Hadoop provides tools that will automatically track the upstream sources of new data sets and provide full lineage and auditing; enable them.
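To make the idea of lineage concrete, here is a minimal sketch of a registry that records which sources each data set was derived from and walks back to its origins. It is not one of the Hadoop tools mentioned above; the `LineageRegistry` class and the data set names are hypothetical.

```python
class LineageRegistry:
    """Toy lineage store: maps each data set to its direct sources."""

    def __init__(self):
        self._parents = {}

    def record(self, dataset, sources=()):
        """Register a data set and the data sets it was derived from."""
        self._parents.setdefault(dataset, set()).update(sources)
        for source in sources:
            # Make sure every source is known, even if it has no parents.
            self._parents.setdefault(source, set())

    def upstream(self, dataset):
        """All direct and indirect sources of a data set."""
        seen = set()
        stack = list(self._parents.get(dataset, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._parents.get(current, ()))
        return seen

    def origins(self, dataset):
        """Sources with no recorded parents, i.e. where the data entered."""
        candidates = self.upstream(dataset) | {dataset}
        return {d for d in candidates if not self._parents.get(d)}


if __name__ == "__main__":
    registry = LineageRegistry()
    registry.record("web_logs")
    registry.record("crm_export")
    registry.record("sessions", ["web_logs"])
    registry.record("customer_360", ["sessions", "crm_export"])
    print(sorted(registry.origins("customer_360")))
```

An auditor asking which source systems feed a derived data set is effectively asking for `origins`; if any origin is regulated, the derived data set likely inherits that obligation.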
Protect All the Data
Not all of the important and/or interesting data is stored directly in the Hadoop Distributed File System (HDFS). Many important data repositories exist outside HDFS in the form of metadata stores and files; the protection of all sensitive data inside and out of HDFS requires careful consideration.