Full video here:
Hi, everybody. This is Dave Vellante. We’re here at Wikibon headquarters. Adam Fuchs is back. He’s the CTO and Founder of Sqrrl. We’ve been doing a number of whiteboard sessions around Accumulo, some of the innovations that Accumulo brings, some of the things that Sqrrl has added to Accumulo. You hear a lot about schema-less databases in the NoSQL database world. Well, how do you build structure and add schema to a schema-light environment? That’s what Adam is going to talk about today. So, Adam, take it away.
All right. Thanks, Dave. I get a lot of folks asking me, “Once I’ve gotten my database up and running I can store key value pairs in it. I can do searches. I can do range queries for those. But how do I organize my data?” There are basic ways of organizing their data and there are somewhat complex ones, and we’re going to touch on a couple of more complex concepts there. One of them is de-normalizing data, bringing it together to answer a particular set of queries, and the other is secondary indexing. I’m going to show you a basic diagramming technique that we use to diagram data models inside of Accumulo. Then I’m going to show you how that applies to indexing techniques.
To start off here we’ll look at a simple organization of data which we have diagrammed here in what I like to call a hierarchical data decomposition diagram. Obviously, I need to find a better name for that. But what we have is a data set centered around people and people have friends. They’re friends with other people. There’s some history of those relationships. They have things that they own, and maybe there’s a count of those things. Inside of Accumulo, we have a couple of features for storing and retrieving data. We can insert keys in random order. We can transform this data into a set of key value pairs. Insert them, update them in random order.
But then when we query them, we’re limited to range queries inside of that key space. So, a range in a sorted key space actually turns into a hierarchy in the row, column family, column qualifier value format. Every key has elements in it. It’s got a row, a column family, a column qualifier and these are associated with a value or potentially a set of values. If I have a range, I can select a particular row. I can select a row in a particular column family. I can select column qualifiers under that and in fact, I can select prefixes of those elements as well. If I want a query for a set of people and all of the things about them, if I use this hierarchical data decomposition, then I can query for prefixes in this tree structure. A range would translate into a single person or a person and their friends, or a person and their particular other friend. Right? Any of those hierarchies inside of this hierarchical organization really translates into a very simple query for Accumulo.
If we take this abstract view and we instantiate it, here we have Alice and Bob, which are two people in our person table. Alice has friends. Alice is friends with Bob and Charlie and they have some history. Alice also owns a couple of Oldsmobile’s, and Bob is over here. He has a couple of friends and owns five houses. Why not? All right, so what we’ve done here is essentially we’ve grouped all of the information associated with those people together so that we can query them all at the same time. If you think about it, a document store in general where you might have a hierarchical document, you have a number of features that are all grouped together underneath the title of that document. We’ve done a similar thing here. In fact, in this instantiated view, I can take a traversal of the tree from root down to leaf. Right? This actually forms a key value pair. So, Alice being perhaps the row portion of that key, the fact that Alice owns something being the column family, and the thing that she owns being the column qualifier, and then the count being the value in that case.
So we have a number of concepts that tag along with this mapping of data concept into key portion. In the row portion here, what we’re doing there is really controlling how the data gets partitioned throughout the cluster. Inside of the column family, we’re doing column-based partitioning. These are locality groups, or vertical orientation of the database. Inside of the column qualifier, that’s where we put anything else that has to do with uniqueness. So row, column family, column qualifier determine the uniqueness of the thing that we’re trying to store and then the value is any extra information that we would want to tag onto that. We can also take this basic diagramming technique and we can make a couple of other abstract versions of it. Some basic models that we use or some basic table designs that we use for indexing in particular and one of them is a document table with an inverted index table.
These are two tables that were paired together. The document table is organized by having UUIDs or just IDs of the document. Within that we have fields of the documents and within that we have values associated with the documents. Given that I can query ranges on this and ranges turn into prefixes, that gives me an ability to very quickly retrieve all of the fields in a document or a particular field in a particular document, and retrieve those. But maybe I don’t know my document ID. Maybe I don’t know exactly which document I’m trying to query, but I know characteristics of that and I want to search on it based off of those characteristics. So that’s when I start querying on the value itself, and this is our basic secondary indexing. That’s the basic concept of secondary indexing that we want to use here. So in order to support that, we’ll take parts of the value, we’ll create those as inverted index entries. From that value I generate a set of terms. Each of those terms maps to a set of UUIDs. Those UUIDs are references to this other table, to our document table.
Then maybe I keep which field was seen in there and perhaps some other information. What that looks like is when I’m ingesting data I have a record or a document. That record goes into a single place, everything grouped under the same UUID in my document table, and then I generate a set of index entries. So one place in the document table becomes multiple places in the index table.
On the flip side, during query if I have a particular term, I’m going to find that term at one spot in the index table. But it’s going to map back to several UUIDs inside the document table. So that’s basic, what we would call term distributed information retrieval. It’s a great technique. We use it all over the place. We extend it to do a whole bunch of things like geographical indexing. But it has flaws. It’s not perfect, and in fact in the field of information retrieval there are sort of two dominant spaces. One is term-distributed information retrieval. The other is document-distributed information retrieval. For document-distributed information retrieval, we tend to group things together into partitions or into shards.
This is the diagram associated with that. We take a set of documents, group those together in a partition and generate index entries, and put those index entries for that set of documents into the exact same partition. In this type of hierarchy I’ve grouped my index entries and my document entries together. That gives us a different view here. As we’re ingesting data, we’ll take a record, put that into one partition in our table and inside of that it has document portions and index portions just like our simple or indexing model. On the query side, if we have a single term, we can map that term into each of the partitions and look for the index entries associated with that particular term in each of those partitions. We parallelize the query across all of our partitions. These are a couple of the simple, generic table designs that we’re using in particular inside of Sqrrl, inside of our product. We’ve extended these. We’ve added a whole bunch of others to do graph organization, to do some other specialized types of indexing and some modeling associated with schema. But there you have it. That’s really how we’re adding a little bit of organization to a schema-less world.
Thank you, Adam. So you’re seeing the infrastructure for big data becoming hardened, and so-called enterprise-ready. Everybody talks about that. Accumulo is playing a key part of that, not only in terms of its ability to provide fine-grain levels of security but also high levels of scalability and performance. So check out sqrrl - S-Q-R-R-L - .com for more information. Some of Adam’s work will also be on there. Also, check out youtube.com/siliconangle for these and other videos associated with this topic. Go to wikibon.org for all the research and check out siliconangle.com for all the blogs. Thanks for watching, everybody. This is Dave Vellante of Wikibon and this is The Cube. We’ll see you next time.