How to Choose a NoSQL Database

The world of NoSQL databases is a very noisy (and confusing) space.  Matt Aslett at the 451 Group has done an amazing job of cataloging various databases (including NoSQL) in his Database Landscape Map.

image

To simplify the NoSQL world, let's take a look at the top 3 databases in terms of current popularity and how they compare to Apache Accumulo, which is at the core of our product, Sqrrl Enterprise.

MongoDB:  It is a wonderfully easy-to-use document store that many select as a flexible replacement for a SQL database, as it (like all NoSQL databases) does not require pre-defined schemas.   However, MongoDB has difficulty scaling to very large datasets (e.g., 100+ TB) and does not natively work with your Hadoop cluster.  It also does not possess fine-grained security controls.

Cassandra:  This is an excellent choice if your data is too big for MongoDB and you require multi-datacenter replication.  Although Cassandra was not originally designed to run natively on your Hadoop cluster, it now has integrations with MapReduce, Pig, and Hive.  It does not possess fine-grained security controls.

HBase:  HBase natively integrates with Hadoop, and it can handle very large datasets.  However, it does not have fine-grained security controls. 

Accumulo:  Accumulo has an architecture most similar to HBase, which allows it to plug natively into your Hadoop cluster as well.  It is far more scalable than MongoDB, and with reported cluster sizes in the multiple thousands within the Intelligence Community it is also significantly more scalable than HBase and Cassandra.  Accumulo is the only NoSQL database with cell-level security capabilities.  Accumulo also has other features that could lead one to choose it over HBase or Cassandra for reasons other than security or scalability.  For example, Accumulo has a powerful server-side programming mechanism called Iterators, which gives it the capability to do a variety of real-time aggregations and analytics.
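To make "cell-level security" a bit more concrete, here is a minimal Python sketch of the idea: every cell carries its own visibility requirement, and a scan returns only the cells that the reader's authorizations satisfy.  This is a toy in-memory model, not Accumulo's API.  The names Cell, can_read, and scan and the sample labels are all hypothetical, and the visibility is simplified to a list of label groups, where the reader must hold every label in at least one group.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Cell(NamedTuple):
    """One key/value entry with its own visibility requirement (toy model)."""
    row: str
    column: str
    # Hypothetical simplification: a list of label groups. A reader may see
    # the cell if they hold ALL labels in at least ONE group.
    visibility: List[FrozenSet[str]]
    value: str


def can_read(cell: Cell, authorizations: Set[str]) -> bool:
    # An empty visibility list means the cell is unrestricted.
    if not cell.visibility:
        return True
    return any(group <= authorizations for group in cell.visibility)


def scan(cells: List[Cell], authorizations: Set[str]) -> List[Cell]:
    # Filtering happens per cell, not per table or per row.
    return [cell for cell in cells if can_read(cell, authorizations)]


# Made-up sample data: three cells in the same row, each with a different
# visibility requirement.
cells = [
    Cell("patient1", "name", [], "Alice"),
    Cell("patient1", "diagnosis", [frozenset({"doctor"})], "flu"),
    Cell("patient1", "billing",
         [frozenset({"doctor"}), frozenset({"finance", "audit"})], "$120"),
]

visible = [cell.column for cell in scan(cells, {"finance", "audit"})]
print(visible)  # ['name', 'billing']
```

The point of the sketch is that two readers scanning the same row can get different results, because the access decision is made cell by cell rather than for the table as a whole.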

These high-level differences between MongoDB, Cassandra, HBase, and Accumulo are summarized in the decision tree diagram below.  Of course, there are many more detailed technical differences, which will be explored in a later post.  This decision tree can be summarized with a few simple statements:

  • If you need a quick, simple solution and have “small” Big Data (e.g., a few dozen terabytes), MongoDB may be the answer.
  • If you need cell-level security or multi-petabyte scalability, Accumulo is the right answer.
  • If you have data that is too big for MongoDB and don’t need cell-level security or massive scalability, we would recommend testing HBase, Cassandra, and Accumulo for your specific workloads.  Each has its own nuanced advantages and disadvantages.
  • If you don’t need real-time analytics, you are probably on the wrong decision tree and can stick with the Hadoop Distributed File System and batch analytics.
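For readers who prefer code to diagrams, the statements above can be sketched as a small Python function.  This is only an illustration of the summary, not a sizing tool.  The Requirements class, the recommend function, the order of the checks, and the two numeric thresholds are all assumptions, since the post only says "a few dozen terabytes" and "multi-petabyte".

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_real_time: bool            # real-time analytics required?
    needs_cell_level_security: bool  # per-cell access control required?
    data_size_tb: float              # approximate dataset size in terabytes


# Hypothetical cut-offs chosen only to make the sketch runnable.
SMALL_BIG_DATA_TB = 50      # stand-in for "a few dozen terabytes"
MULTI_PETABYTE_TB = 2000    # stand-in for "multi-petabyte"


def recommend(req: Requirements) -> str:
    # No real-time need: batch analytics on HDFS is enough.
    if not req.needs_real_time:
        return "HDFS with batch analytics"
    # Cell-level security or multi-petabyte scale points to Accumulo.
    if req.needs_cell_level_security or req.data_size_tb >= MULTI_PETABYTE_TB:
        return "Accumulo"
    # "Small" Big Data with no special security needs.
    if req.data_size_tb <= SMALL_BIG_DATA_TB:
        return "MongoDB"
    # Everything in between: benchmark on your own workload.
    return "Test HBase, Cassandra, and Accumulo"


print(recommend(Requirements(True, False, 500)))
```

The security check is placed before the size check so that a small dataset that still needs cell-level security lands on Accumulo rather than MongoDB, which matches the comparison earlier in the post.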

 image

It is worth noting that the NoSQL databases above are all open source databases.  Sqrrl Enterprise builds upon Accumulo and adds a number of features, including streaming ingest, JSON, encryption, identity management integrations, full-text search, SQL queries, graph search, and statistics.  We believe that these features set Sqrrl Enterprise apart from other Big Data platforms.

The History of Sqrrl

Interested in the history of Sqrrl?  Check out this podcast with Ely Kahn from Sqrrl, Luke Fretwell from FedScoop, and Gunnar Hellekson (Red Hat’s Public Sector Chief Technology Strategist).

http://fedscoop.com/fedoss-sqrrl-brings-open-source-big-data/

CSO Article on Securing Big Data Infrastructure

CSO discusses here the difficulty in bolting on security to Big Data infrastructure.  Sqrrl Enterprise is the only Big Data platform with security baked in from the start. 

Read more here:

http://www.csoonline.com/article/732342/big-data-can-be-a-big-headache-for-data-defenders

Sqrrl April Newsletter

New CEO, New Website, New Jobs.  Check out our April newsletter here.

Sqrrl’s Take on the Big Data Ecosystem

We are often asked where Sqrrl resides in the Big Data ecosystem.  This is a great question since there is so much buzz (and confusion) about Big Data, and the larger picture often gets lost.

We view the ecosystem in 11 large buckets (which have many similarities to the buckets in Dave Feinleib’s ecosystem map in Forbes):

  1. Hardware providers:  Big Data software runs on both commodity disks and flash/SSD.
  2. Services providers:  These folks help with both strategy and implementation of Big Data solutions.
  3. Cloud providers:  Many organizations run their Big Data solutions in public, private, or hybrid clouds.
  4. Enterprise Data Warehouse (EDW) vendors:  These are traditional EDW vendors and the relational databases that typically sit on top of them.
  5. Data Integration vendors:  These companies sell the tools that assist in getting data into Hadoop or Scale-Out databases.
  6. Hadoop vendors:  These folks license commercial distributions of the Hadoop Distributed File System and related Apache projects (or in some cases, just sell support services around them).
  7. Security vendors:  They sell security tools, such as encryption and key management, specifically designed for Big Data.
  8. Scale-Out Database vendors:  Includes both NoSQL (unstructured and semi-structured data) and NewSQL (structured data) databases.
  9. Horizontal Big Data Platforms:  These are application development platforms that are often built on top of Hadoop and/or scale-out platforms and that provide additional analytical capabilities beyond what the underlying database can natively provide.
  10. Vertical Big Data Platforms:  Similar to Horizontal Big Data Platforms, but these are specialized applications for a specific industry vertical.
  11. Business Intelligence and Analytical Tools:  Focused on static reporting, analytics, and dashboards for data held in Hadoop.

As depicted in the diagram above, today Sqrrl falls in the intersection of four of these boxes:  Hadoop, Security, Scale-Out Databases, and Horizontal Platforms.  This is because our solution, Sqrrl Enterprise, consists of the following:

  • Hadoop:  Although we prefer to partner with Hadoop vendors, we can also ship our solution with open source HDFS.
  • Security:  We are the only Big Data solution with cell-level security, including fine-grained access controls and encryption.
  • Scale-Out Database:  At our core is Apache Accumulo, which is a NoSQL database with scalability to the tens of petabytes.
  • Horizontal Big Data Platform:  Sqrrl Enterprise powers real-time Big Data applications (aka “Big Apps”), and we do this by layering a number of real-time analytic APIs on top of Accumulo, including full-text search, statistics, and graph analysis.

Hope this helps folks trying to make sense of the Big Data landscape.

Big Apps > Big Data

The team here at Sqrrl has coined a new term:  Big Apps™.  We are seeing an important trend in the marketplace in that many organizations want to move beyond storing and querying Big Data.  More and more, organizations now want to build real-time applications on top of Big Data.  We refer to these applications as Big Apps.

Big Apps could be used for a variety of different use cases, ranging from clinical analysis, stock trade analysis, energy trading, and immigration analysis to cybersecurity.  In all of these cases, organizations need to bridge the gap between traditional OLAP and OLTP capabilities and build applications that can process and analyze petabytes of data in real time.

To read more about how Sqrrl can help organizations create Big Apps using Apache Accumulo, click here.

New sqrrl CEO

image

sqrrl is excited to announce that we have a new CEO, Mark Terenzoni from F5 Networks.  Mark brings a wealth of knowledge around growing technology startups into large successful companies.  Check out some of the media coverage here:

http://siliconangle.com/blog/2013/03/19/sqrrl-appoints-new-ceo-to-spearhead-big-data-security-push/

http://www.bizjournals.com/boston/blog/startups/2013/03/f5-networks-database-startup-sqrrl.html

Full press release here:

http://www.prweb.com/releases/2013/3/prweb10559578.htm

sqrrl and Technica Team Up

Today sqrrl and Technica Corporation announced a reseller partnership that makes it easier for government agencies to license sqrrl’s software product, sqrrl enterprise.  

sqrrl is now on Technica’s NASA SEWP IV contract vehicle.  This means that both DoD and civilian agencies can utilize the SEWP IV procurement process to quickly and easily procure sqrrl products.  

You can read more about it here.

Another Step Forward… Accumulo on Amazon Elastic MapReduce

Today, Amazon has released some important new work related to Apache Accumulo.  Now Accumulo users can easily spin up Accumulo clusters utilizing Amazon’s Elastic MapReduce (EMR) Framework. 

sqrrl is excited about this for a few reasons.  First, we strongly support any effort to decrease the friction associated with installing and using Accumulo.  Our engineering team consists of many of the original developers of Accumulo, and we are eager to further increase the Accumulo user base.  Second, our software product, sqrrl enterprise, runs on top of Apache Accumulo and Apache Hadoop, and Amazon’s efforts with EMR provide our customers with another use pattern for our product.

If you are interested in taking sqrrl enterprise for a spin on AWS, send us a note at info@sqrrl.com.

sqrrl Software and Services Now Available Via GSA

Today we officially announced a partnership with Triad Technology Partners to offer sqrrl software and services to government customers via Triad’s GSA schedule.

The GSA schedule provides government clients with a simple and fast way of purchasing sqrrl products.  You can read more about the announcement here:

http://www.prnewswire.com/news-releases/triad-technology-partners-expands-gsa-schedule-with-sqrrl-189641841.html