Sqrrl Blog

Apr 28, 2014 2:12:00 PM

Big Data Security Roundup

by Joe Travaglini, Director of Product Marketing for Sqrrl

As Big Data products continue to gain traction and enter the mainstream, they must also provide the security and compliance capabilities that users expect of trusted enterprise software.

In order to break down data silos and analyze business events from a 360-degree vantage point, folks are consolidating disparate sources of data into a single location. This consolidation also brings with it a compounding of risk, as the blast radius of a security event now affects multiple assets at once. In effect, Big Data has amplified the stakes of security.

Upping the Ante

The good news is that we’ve acknowledged the dearth of security solutions and lack of an overarching security strategy, and are taking steps along the right path.

In the early days, functionality trumped security as a product development priority: the web companies innovating these Big Data products did not have the same regulatory, security, or compliance concerns as many other businesses.

But things have changed.  Companies both large and small are buying into the promise that Big Data analytic techniques can bring significant impact to their business.

Recognizing that the security gap was a major deterrent for many end users, the community got to work, and it delivered. This past month has brought a flurry of product announcements and shipped solutions that address specific needs:

  • Hortonworks released HDP 2.1, which introduced advanced ACLs for HDFS and perimeter security via Apache Knox
  • MongoDB 2.6 was released, bringing with it a whole slew of security improvements, including a “field-level redaction” feature
  • Cloudera released Cloudera Search 1.2, which introduced “index-level security”

Each of these improvements is a step in the right direction, but there’s still a long way to go.  Let’s take a closer look at each of them.

Hadoop DFS Advanced ACLs

Hortonworks posted some nice detail on the updates to HDFS ACLs that are finding their way into the more recent Hadoop distributions.  In a nutshell, these improvements bring Hadoop from the basic owner/group/world style permissioning toward a POSIX-style ACL definition.

Why It Matters

POSIX-style HDFS ACLs enable more complex combinations of authorization rules and reduce overhead of group management within Hadoop.
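
To make the difference concrete, here is a minimal sketch of how a POSIX-style ACL check is typically evaluated: owner first, then named users, then groups, then everyone else, with a mask capping the non-owner entries. The `Acl` class, the `check_access` function, and the user and group names are all hypothetical and exist only for illustration; this is a simplified model of the general POSIX ACL algorithm, not Hadoop's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Simplified POSIX-style ACL for one path (illustrative model only)."""
    owner: str
    group: str
    owner_perms: str                  # e.g. "rwx"
    group_perms: str                  # e.g. "r-x"
    other_perms: str                  # e.g. "---"
    mask: str = "rwx"                 # caps named-user and group entries
    named_users: dict = field(default_factory=dict)
    named_groups: dict = field(default_factory=dict)


def _allows(perms, mask, wanted):
    """True if every wanted permission survives the mask."""
    effective = set(perms) & set(mask)
    return set(wanted) <= effective


def check_access(acl, user, groups, wanted):
    """Evaluate entries in POSIX order: owner, named user, groups, other."""
    if user == acl.owner:
        # The owner entry is not subject to the mask.
        return set(wanted) <= set(acl.owner_perms)
    if user in acl.named_users:
        return _allows(acl.named_users[user], acl.mask, wanted)
    group_entries = []
    if acl.group in groups:
        group_entries.append(acl.group_perms)
    group_entries += [p for g, p in acl.named_groups.items() if g in groups]
    if group_entries:
        # Once any group entry matches, "other" is never consulted.
        return any(_allows(p, acl.mask, wanted) for p in group_entries)
    return set(wanted) <= set(acl.other_perms)


# Roughly what `hdfs dfs -setfacl -m group:execs:r-x /sales-data` would add
# to a directory owned by bruce:sales (names are made up for this example).
acl = Acl(owner="bruce", group="sales",
          owner_perms="rwx", group_perms="r-x", other_perms="---",
          mask="r-x",
          named_users={"diana": "rwx"},
          named_groups={"execs": "r-x"})

print(check_access(acl, "clark", ["execs"], "r"))   # named group grants read
print(check_access(acl, "diana", [], "w"))          # write is removed by the mask
```

The point of the named entries is that "execs" gets read access without anyone having to create a new combined group or change the owning group of the directory.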

What Can Still Be Done

Hadoop ACLs are set via a command-line interface at the folder level.  There is no file-level or information-level access rule capability at this time.

Apache Knox

Hortonworks also announced the availability of Apache Knox within HDP 2.1.  Knox is a RESTful API gateway to Hadoop clusters.

Why It Matters

Knox gives Hadoop administrators an easy way to secure the perimeter of their clusters.  This enables exposing Hadoop services within an Enterprise environment and integrating with account provisioning and auditing tools, while protecting cluster configuration information.  It also gives clients a single interface point into the cluster, thereby reducing the complexity of the tools required to get at the underlying data and applications.
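
As a rough illustration of the "single interface point" idea, the sketch below builds the URL a client would use for a WebHDFS directory listing, first directly against a NameNode and then through a Knox gateway. The host names and the topology name `prod` are hypothetical, and the ports and `gateway` path are what I understand to be common defaults for Hadoop 2.x and Knox; check your own deployment before relying on them. No network call is made.

```python
from urllib.parse import urlencode


def direct_webhdfs_url(namenode_host, path, op, port=50070):
    """URL for talking to the NameNode's WebHDFS endpoint directly.

    The client must know an internal host name and port (assumed default).
    """
    return f"http://{namenode_host}:{port}/webhdfs/v1{path}?{urlencode({'op': op})}"


def knox_webhdfs_url(gateway_host, topology, path, op,
                     port=8443, gateway_path="gateway"):
    """URL for the same request routed through a Knox gateway.

    The client only ever sees the gateway host; the topology name
    (hypothetical here) selects which cluster the request is proxied to.
    """
    query = urlencode({"op": op})
    return (f"https://{gateway_host}:{port}/{gateway_path}/"
            f"{topology}/webhdfs/v1{path}?{query}")


# Hypothetical hosts, for illustration only.
print(direct_webhdfs_url("nn1.internal.example.com", "/tmp", "LISTSTATUS"))
print(knox_webhdfs_url("knox.example.com", "prod", "/tmp", "LISTSTATUS"))
```

In the second form, internal host names and ports never leave the cluster, which is the "protecting cluster configuration information" benefit described above.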

What Can Still Be Done

Knox introduces a new critical piece of infrastructure into an already complex system.  High-availability and seamless failover must be configured in order to keep your cluster running, but these features require laborious setup and configuration.

Because it is new, Knox does not yet protect all the subprojects of the Hadoop ecosystem. Knox is also subject to the underlying security capabilities of the components that it protects, in that it provides all-or-nothing access to those components.  It does not enhance security, per se, but rather provides a secure checkpoint for access.

MongoDB 2.6

MongoDB recently released MongoDB 2.6.  An important theme of this release was Security – MongoDB now has a “field-level redaction” feature.

Why It Matters

Field-level redaction restricts access to data based upon information contained within the data itself.  This is a great step forward, and is part of Sqrrl’s Data-Centric approach (more on this later).  Field-level redaction is partially analogous to Sqrrl’s Cell-Level Security Enforcement, a feature we shipped on day one.
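
For readers who have not seen it, here is a sketch of what a `$redact` stage can look like, along with a small pure-Python simulation of its `$$DESCEND` / `$$PRUNE` behavior so the example runs without a database. The `tags` field name, the sample document, and both function names are assumptions made for illustration, and the sketch assumes every document and embedded document carries a `tags` array. The simulation approximates the semantics for this one pipeline shape; it is not MongoDB's implementation.

```python
def redact_pipeline(user_access):
    """Aggregation pipeline: keep a (sub)document only if its tags overlap
    the user's access list, otherwise prune it."""
    return [{
        "$redact": {
            "$cond": {
                "if": {"$gt": [
                    {"$size": {"$setIntersection": ["$tags", user_access]}},
                    0,
                ]},
                "then": "$$DESCEND",
                "else": "$$PRUNE",
            }
        }
    }]


def simulate_redact(doc, user_access):
    """Pure-Python approximation of the pipeline above (illustrative only)."""
    if not set(doc["tags"]) & set(user_access):
        return None                                   # $$PRUNE
    out = {}
    for key, value in doc.items():                    # $$DESCEND
        if isinstance(value, dict):
            kept = simulate_redact(value, user_access)
            if kept is not None:
                out[key] = kept
        elif isinstance(value, list):
            items = []
            for item in value:
                if isinstance(item, dict):
                    kept = simulate_redact(item, user_access)
                    if kept is not None:
                        items.append(kept)
                else:
                    items.append(item)
            out[key] = items
        else:
            out[key] = value
    return out


report = {
    "_id": 1,
    "tags": ["G"],
    "title": "Quarterly report",
    "sections": [
        {"tags": ["G"], "text": "Public summary"},
        {"tags": ["TS"], "text": "Sensitive details"},
    ],
}

print(simulate_redact(report, ["G"]))
```

With a driver such as pymongo, the pipeline would be passed to something like `db.reports.aggregate(redact_pipeline(["G"]))`, where `reports` is a hypothetical collection name. Note that the caller supplies the access list, which is why the application or middleware has to be trusted.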

What Can Still Be Done

Field-level redaction is enforced at the application layer, rather than at the data layer itself.   Mongo’s $redact syntax introduces additional necessary complexity to an otherwise elegant data access pattern.  It therefore places a greater burden on both the application developer and “trusted middleware” maintainer to provide assurance this security model isn’t circumvented.

Cloudera Search Index-Level Security

With Cloudera Search 1.2, administrators can now integrate with Apache Sentry to define access rules around their search indexes.

Why It Matters

This improvement introduces a net new feature to CDH.  It allows a Hadoop administrator to protect search indexes, granting access only to folks who ought to have it.

What Can Still Be Done

Index-level security is similar to Knox in that it is a secure perimeter, all-or-nothing approach.  Either you have access to the search index, or you don’t.

This is a great improvement, especially where the data contained within the index has a uniform security classification.  However, more and more we’re seeing this is not the case, as evidenced by the adoption of Apache Accumulo by Hadoop distribution providers, and the introduction of field-level security enforcement features in HBase and MongoDB.

Search indexes can be a source of data leakage.  If a user has full access to a search index but only granular access to the data contained within it, the index will give the user information about the data that they didn’t know existed.  Unfortunately, early adopters of HBase’s cell-level security features are not able to take advantage of this particular search capability, unless they are willing to bear the risk of leaking information.
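
The leakage is easy to reproduce with a toy inverted index. In the sketch below, the documents, labels, and function names are all made up for illustration: an index-level check returns every matching document ID to anyone allowed to query the index, while a term-level check filters each posting against the user's labels.

```python
from collections import defaultdict

# Hypothetical documents, each with a single visibility label.
docs = {
    "doc1": {"text": "quarterly revenue forecast", "label": "public"},
    "doc2": {"text": "merger target acme", "label": "exec"},
}


def build_index(documents):
    """Inverted index: term -> list of (doc_id, label) postings."""
    index = defaultdict(list)
    for doc_id, doc in documents.items():
        for term in set(doc["text"].split()):
            index[term].append((doc_id, doc["label"]))
    return index


def search_index_level(index, term):
    """All-or-nothing: anyone who may query the index sees every hit."""
    return sorted(doc_id for doc_id, _ in index.get(term, []))


def search_term_level(index, term, user_labels):
    """Each posting is checked against the user's labels before it is returned."""
    return sorted(doc_id for doc_id, label in index.get(term, [])
                  if label in user_labels)


index = build_index(docs)

# A user holding only the "public" label searches for "merger".
print(search_index_level(index, "merger"))             # reveals that doc2 exists
print(search_term_level(index, "merger", {"public"}))  # reveals nothing
```

Even if the first user is later denied when fetching doc2, the hit itself has already told them that a document about a merger exists.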

Sqrrl’s Data-Centric Security

Sqrrl is the only Big Data solution where security was designed and delivered from the very beginning, not added on after the fact.  We believe that if you want to really make data the center of your datacenter, your approach to data security should mirror that effort.  Our unique Data-Centric Security solution provides the most comprehensive protection for your Big Data needs while still preserving the integrity of your analytic results.

Sqrrl Enterprise’s Data-Centric Security entails:

  • Fine-grained, cell-level security enforcement: each field of information written to the system contains information about who ought to be able to access it, enforced at the data level, not the application level
  • Encryption at rest: pluggable encryption via the Java Cryptography Extensions (JCE) and integration with Enterprise key management systems
  • Encryption in motion: protect client-server and inter-process communication with SSL or TLS
  • Role-Based and Attribute-Based Access Control (RBAC + ABAC): ABAC enables definition of rich user access rules at a dynamic, granular level
  • Secure Search: Sqrrl is the only product with term-level security on search indexes
  • Labeling and Policy Definition Engines: allow users to specify rules by which data is tagged with visibility labels, and by which users are assigned point-in-time authorizations to those labels
  • Audit: providing full visibility into system activity for compliance purposes
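
To give a feel for what cell-level enforcement with visibility labels means in practice, here is a minimal sketch in the style of Apache Accumulo's visibility expressions, where each cell carries a boolean expression over labels and a reader sees the cell only if their authorizations satisfy it. This is not Sqrrl's implementation: the function names, labels, and sample row are hypothetical, and the grammar is simplified (it lets `&` bind tighter than `|` rather than requiring parentheses when the two are mixed).

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a visibility expression into labels, operators, and parens."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad visibility expression: {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, authorizations):
    """Evaluate the expression against a set of authorization labels."""
    tokens = tokenize(expr)
    if not tokens:
        return True          # an empty expression means visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def visible_cells(row, authorizations):
    """Filter a row of {field: (value, visibility)} down to what a user may see."""
    return {name: value for name, (value, vis) in row.items()
            if is_visible(vis, authorizations)}


# Hypothetical row: every cell carries its own visibility expression.
row = {
    "name": ("Alice", ""),
    "salary": ("120000", "hr&manager"),
    "diagnosis": ("flu", "medical|(hr&legal)"),
}

print(visible_cells(row, {"hr", "manager"}))
```

Because the filter runs where the data is read rather than in each application, two users issuing the same query simply get different rows back.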


For more information about Sqrrl and our Data-Centric Security approach, drop us a line at @SqrrlData or send us an email at info@sqrrl.com.
