How do you protect the data in big data analytics projects?
As big data initiatives focus on the volume, velocity, and variety of data, the security of that data is often overlooked. This is especially important in financial services, healthcare, and government, or any time sensitive data is analyzed.
This webinar highlights:
*Hadoop security landscape
*Hadoop encryption, masking, and access control
*Customer examples of securing Hadoop environments
Architectural issues:
*Big data environments typically offer no finer granularity of access than the schema level (see the sketch after this list)
*Lack of secure inter-node communication (create a separate layer so customers don't have to worry about this)
*Hadoop security is still developing: improvements to role-based HDFS security are in progress, open-source projects are just beginning, and vendors offer immature solutions
*Attempts to apply traditional methods to Hadoop create a central chokepoint and do not operate at the node level
*Lack of solutions in production
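The granularity problem is easiest to see at the storage layer. A Hive table, for instance, is just an HDFS directory, so the finest control plain HDFS gives you is a Unix-style mode on that directory; columns and rows are all or nothing. A minimal sketch (the warehouse path is a common default, not something from the webinar):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class TablePermissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // A Hive table is a directory; built-in access control stops here.
        // There is no native way to restrict individual columns or rows.
        Path table = new Path("/user/hive/warehouse/customers");
        fs.setPermission(table, new FsPermission((short) 0750)); // rwxr-x---
    }
}
```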
The big data users I have spoken with about data security agreed that data masking at that scale is infeasible. Given the rate of data insertion (also called 'velocity'), masking sensitive data before loading it into a cluster would require "an entire ETL cluster to front the Hadoop cluster". But apparently it's doable, and Netflix did just that – it fronted its analytics cluster with a data transformation cluster, all within EC2: 500 nodes massaging data for another 500 nodes. While Netflix's ETL cluster is not used for masking, note that it is about the same size as the analysis cluster, and it's this one-to-one mapping that I often worry about with security. Ask yourself: do we need another whole cluster for masking? No? Then what about NoSQL activity monitoring? What about IAM, application monitoring, and every other security task? Do you start to see the problem with bolting on security? Logging and auditing are embeddable; most everything else is not.
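For context on what a masking tier actually does: the usual technique is to replace each sensitive value with a consistent surrogate before it lands in the analytics cluster. A minimal sketch of one common approach, deterministic salted hashing; the field name is illustrative and this is certainly not Netflix's actual pipeline:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Mask {
    // Deterministic masking: equal inputs map to equal tokens,
    // so joins and group-bys still work on the masked column.
    static String mask(String ssn, byte[] salt) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(salt); // secret salt defeats precomputed rainbow tables
        byte[] digest = md.digest(ssn.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```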
*Kerberos provides mutual authentication – both the user and the server verify each other's identity
*Gazzang – block-level encryption for big data
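On the Hadoop side, Kerberos logins are surfaced through the UserGroupInformation API. A minimal sketch of a service logging in from a keytab; the principal and keytab path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // A service logs in from a keytab instead of an interactive password.
        UserGroupInformation.loginUserFromKeytab(
                "datameer@EXAMPLE.COM", "/etc/security/keytabs/datameer.keytab");
        FileSystem fs = FileSystem.get(conf); // RPCs now carry Kerberos credentials
    }
}
```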
*A proper PKI infrastructure inside the organization
*The certificate warning screen is weak protection – users are used to clicking through certificate warnings
*CDH 4.1 – Kerberos and SSL support
*Disable direct Hadoop web access
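Most of these hardening steps reduce to a handful of configuration properties. A sketch using CDH4-era property names; names vary between Hadoop versions, so treat these as assumptions to verify against your release:

```java
import org.apache.hadoop.conf.Configuration;

public class SecureConf {
    static Configuration secure() {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // instead of "simple"
        conf.set("hadoop.security.authorization", "true");      // service-level ACLs
        conf.set("hadoop.rpc.protection", "privacy");           // encrypt RPC traffic
        conf.set("dfs.https.enable", "true");                   // HTTPS for web UIs
        return conf;
    }
}
```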
*Object types
*Unix-based permissions
*Results sharing
*Easy to understand and audit
Granular roles:
*Granted per type of object, not just per individual object
*Example: a Hadoop admin role can access Hadoop settings and create import jobs, but has no access to the data itself (sketched in code below)
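A hypothetical sketch of what a per-object-type role looks like in code; the object types and the Hadoop-admin role mirror the example above, but none of this is Datameer's actual API:

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class Roles {
    enum ObjectType { SETTINGS, IMPORT_JOB, DATA }
    enum Action { READ, WRITE }

    // A role grants actions per *type* of object, not per individual object.
    static final Map<ObjectType, Set<Action>> HADOOP_ADMIN = Map.of(
            ObjectType.SETTINGS,   EnumSet.of(Action.READ, Action.WRITE),
            ObjectType.IMPORT_JOB, EnumSet.of(Action.READ, Action.WRITE),
            ObjectType.DATA,       EnumSet.noneOf(Action.class)); // no data access

    static boolean allowed(Map<ObjectType, Set<Action>> role, ObjectType t, Action a) {
        return role.getOrDefault(t, EnumSet.noneOf(Action.class)).contains(a);
    }
}
```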
Different groups in an organization provide more security – Hadoop admins do not have rights to add or remove people from groups.
*Join the company's Active Directory infrastructure
*Kerberos has been adopted by Hadoop as an authentication mechanism
*Integration with other services across platforms – ZooKeeper and, for example, MSSQL services
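Under the hood, validating a user against AD is an LDAP bind. A minimal JNDI sketch; the host and DN format are placeholder assumptions:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.InitialDirContext;

public class AdAuth {
    // Returns true if the user's AD credentials are valid (a simple LDAP bind).
    static boolean authenticate(String userDn, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldaps://ad.example.com:636");
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, userDn); // e.g. CN=alice,CN=Users,DC=example,DC=com
        env.put(Context.SECURITY_CREDENTIALS, password);
        try {
            new InitialDirContext(env).close(); // bind succeeds only with valid credentials
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```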
*Delegation – Datameer can run jobs as the owner of the job
*With impersonation, only the owner can access his own files
*When a user is deleted from the system ……
*Jobs are run, and their output stored, as the owner of the job
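Hadoop supports this pattern natively as "proxy users". A minimal sketch of a trusted service running work as the job owner; it assumes the service is already logged in (e.g., via the keytab shown earlier) and whitelisted in Hadoop's proxyuser configuration:

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class RunAsOwner {
    static void runJobAs(String owner) throws Exception {
        // The trusted service impersonates the job owner, so HDFS sees
        // the owner's identity and enforces the owner's permissions.
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
                owner, UserGroupInformation.getLoginUser());
        proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(new Configuration());
            // ... submit the job / read the owner's files here ...
            return null;
        });
    }
}
```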
- Show access rights, role screen, LDAP screen, Kerberos setup
*Intel – implemented in the Hadoop API
*A young project – the future will show whether others (Cloudera ….) participate
*Others: Voltage, Protegrity – not open source and not widely used
*Detailed information about user access
*Detailed information about job runs – dependent on Hadoop logs
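The raw material here is the NameNode audit log. A minimal parsing sketch; the regex targets the stock audit line format, which can differ across versions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditLine {
    // Stock NameNode audit entries look roughly like:
    // ... allowed=true ugi=bob (auth:KERBEROS) ip=/10.0.0.1 cmd=open src=/data/x dst=null perm=null
    static final Pattern P = Pattern.compile(
            "allowed=(\\S+)\\s+ugi=(\\S+).*?cmd=(\\S+)\\s+src=(\\S+)");

    static void parse(String line) {
        Matcher m = P.matcher(line);
        if (m.find()) {
            System.out.printf("user=%s cmd=%s path=%s allowed=%s%n",
                    m.group(2), m.group(3), m.group(4), m.group(1));
        }
    }
}
```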
Datameer for big data:
*Use Datameer to analyze Datameer's own access logs
*Abnormality detection
*Security breach detection
*Behavior analysis

* HDFS – Hadoop Distributed File System
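As a toy illustration of the abnormality-detection bullet above: count accesses per user and flag anyone far above the mean. The 3x threshold is an arbitrary assumption; real behavior analysis would be far richer:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Anomaly {
    // Flags users whose access count exceeds 3x the average across all users.
    static void flagHeavyUsers(List<String> userPerAccess) {
        Map<String, Integer> counts = new HashMap<>();
        for (String u : userPerAccess) counts.merge(u, 1, Integer::sum);
        double mean = (double) userPerAccess.size() / counts.size();
        counts.forEach((user, n) -> {
            if (n > 3 * mean)
                System.out.println("suspicious access volume: " + user + " (" + n + ")");
        });
    }
}
```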