Sqrrl Blog

Aug 2, 2013 10:01:30 AM

Sqrrl Whiteboard Video Transcript: Secrets to Accumulo's Real-Time Performance

Full video here:

http://www.youtube.com/watch?v=5RPaDzOwqQ8

Transcript below:

Dave:

This is Dave Vellante of Wikibon.org and this is The Cube, where we extract the signal from the noise and bring you the data that helps you make better decisions. Adam Fuchs is here. This is the third in the series of Chalk Talks that we’ve done. In the first one, we looked at big data lessons learned at the NSA. In the second, we took a look at the security ecosystem around Accumulo, the database that Adam helped build while he was at the NSA. Today, we’re going to look at how Accumulo is optimized to maximize performance. Adam, take it away.

Adam:

Thanks, Dave. Good to be here. We’re going to cover some of the secrets of how Accumulo really gets all of its performance in real time. A lot of it is derived from Google’s Bigtable, along with things that we’ve extended, both in my previous career at the NSA and in what I’m doing now with Sqrrl. To start out, we’ll look at one of the key elements here, which is the log-structured merge tree, which underlies a lot of the key-value operations inside of Accumulo. Accumulo stores key-value pairs, and it stores them in sorted order. We take tables, which are really just collections of key-value pairs, and break them up into partitions, which are known as tablets.
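
As a concrete illustration of writing key-value pairs into a table, here is a minimal sketch using the Accumulo 1.5-era Java client API. The instance name, ZooKeeper host, credentials, table name, and column names are placeholders, not anything from the video.

```java
import org.apache.accumulo.core.client.*;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your own cluster.
        Instance instance = new ZooKeeperInstance("myInstance", "zk1:2181");
        Connector conn = instance.getConnector("user", new PasswordToken("secret"));

        // A table is a sorted collection of key-value pairs, split into tablets.
        BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());

        // Each mutation targets a row; Accumulo keeps rows (and the columns within
        // them) in sorted order, regardless of the order we write them in.
        Mutation m = new Mutation("row_0042");
        m.put("attributes", "color", new ColumnVisibility("public"), new Value("blue".getBytes()));
        writer.addMutation(m);

        writer.close(); // flushes buffered mutations to the tablet servers
    }
}
```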

Focusing in on just one particular tablet, we can talk about a couple of operations specific to that data. In particular, if we’re just looking at the tablet data flow here, we have data coming in from the left: key-value pairs arriving in random order within the boundaries of that tablet. On the right we have key-value pairs coming out, but they come out in sorted order. So: in, in random order; out, in sorted order. In between there’s a mess of operations to perform that sort, and we’re going to optimize two different elements as we perform that sort operation. The first is to minimize latency.

As data streams in, we want that data to be able to contribute to a query with as little latency as possible. At the same time, we want to minimize our impact on disk resources. Accumulo is a disk-based architecture. As we store data on spinning disks, we don’t want to do a whole lot of random IO; we want to perform sequential IO on disk whenever possible. That’s achieved through this mechanism, and I’ll take you through it here. Key-value pairs coming in, in random order, immediately go into an in-memory map, which is a balanced binary tree. That’s random IO, but it’s inside of memory, so it’s efficient. As soon as a pair is in that in-memory map, it can contribute to a query.
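
The in-memory map Adam describes can be pictured with a toy sketch like the following. This is just an illustration of a sorted, balanced-tree map that accepts random-order inserts and serves range scans immediately; it is not Accumulo's actual in-memory map implementation.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// A toy stand-in for the tablet's in-memory map: a balanced binary tree
// (TreeMap is a red-black tree) that keeps entries in key order.
public class InMemoryMapSketch {
    private final TreeMap<String, String> map = new TreeMap<>();

    // Random-order inserts are cheap: O(log n) pointer chasing, all in memory.
    public void insert(String key, String value) {
        map.put(key, value);
    }

    // The moment a key is inserted, it can contribute to a sorted range scan.
    public SortedMap<String, String> scan(String startKey, String endKey) {
        return map.subMap(startKey, endKey);
    }
}
```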

We’ve minimized latency there. However, the in-memory map is volatile memory. In order to preserve durability, we write the data to a write-ahead log at the same time. This write-ahead log, for Accumulo version 1.5 and beyond, is stored in HDFS, so it’s actually replicated on multiple nodes, and it’s synced to disk by the time we acknowledge a write. Another issue with just storing things in memory is that memory is much smaller than disk. At some point, this memory will fill up and we’ll need to perform what’s known as a compaction operation, or minor compaction, also known as a flush. We flush that in-memory map to a file, which is stored, again, in HDFS.
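
Minor compactions normally fire automatically when the in-memory map fills, but the same flush can also be requested through the client API. A minimal sketch, assuming a connected Connector named conn and a placeholder table name:

```java
import org.apache.accumulo.core.client.Connector;

public class FlushExample {
    // Ask the tablet servers to flush (minor-compact) the in-memory map for
    // every tablet in the table; null start/end rows mean "the whole table",
    // and 'true' waits for the flush to finish before returning.
    static void flushTable(Connector conn, String table) throws Exception {
        conn.tableOperations().flush(table, null, null, true);
    }
}
```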

That write to disk, we want to make sequential. So we buffer a fair amount of data in memory and write it all as one sequential stream to an RFile. Memory is much smaller than disk, so it’ll fill up again, and we’ll do another compaction operation, writing to an additional RFile. Each of these RFiles then participates in queries, and as we stream data out, we merge data from each of these independent containers through a mechanism here to provide a single sorted stream of key-value pairs to the query. The more RFiles we have, the more sources of data we have, and the more seeks we have to do on disk. Query latency is impacted by the number of these files.
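
The merging read path can be pictured as a k-way merge over individually sorted sources. The following toy Java sketch illustrates that idea (one heap entry per source, so roughly one seek per source rather than per key); it is not Accumulo's actual implementation.

```java
import java.util.*;

// A toy k-way merge: given several individually sorted streams (conceptually,
// the in-memory map plus each RFile), produce one globally sorted stream.
public class MergeSketch {
    public static List<String> merge(List<Iterator<String>> sortedSources) {
        // Heap entries: (current value, index of the source it came from)
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByKey());
        for (int i = 0; i < sortedSources.size(); i++) {
            if (sortedSources.get(i).hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(sortedSources.get(i).next(), i));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> smallest = heap.poll();
            out.add(smallest.getKey());
            Iterator<String> src = sortedSources.get(smallest.getValue());
            if (src.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(src.next(), smallest.getValue()));
            }
        }
        return out; // globally sorted, with only sequential reads from each source
    }
}
```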

In order to minimize query latency, we do an additional background operation here, known as a major compaction, where we take multiple files and merge them together to create one single, globally sorted file across all of them. That happens as a background process. We have background threads doing the minor compactions, a background thread doing the major compaction, and then threads attached to the query or the write performing those operations. All of this is great. This is basically a standard log-structured merge tree design; the technology goes back to the mid-’90s.
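
Major compactions likewise run in the background, but they can also be forced through the client API, for example to collapse each tablet down to a single RFile. A minimal sketch, again assuming a connected Connector:

```java
import org.apache.accumulo.core.client.Connector;

public class CompactExample {
    // Force a major compaction over the whole table: merge each tablet's
    // RFiles down to a single, globally sorted file. The 'flush' flag first
    // minor-compacts the in-memory map; 'wait' blocks until the merge is done.
    static void majorCompact(Connector conn, String table) throws Exception {
        conn.tableOperations().compact(table, null, null, true, true);
    }
}
```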

One of the things that we’ve done with Accumulo is to put another mechanism in here, known as the iterator tree. As we stream data from these multiple sources and merge them together, it actually goes through a series of operators. Those operators take place not only on the query path, but also on the compaction paths. There’s a series of operators that these sorted streams of key-value pairs go through. If we focus in on what those operators are, there’s another view here of that iterator tree. You can think of a series of RFiles and the in-memory map contributing to some following process here. The first one that we’re going to do is really a merge: we merge all of these individually sorted key-value sources together to produce a single sorted stream of key-value pairs.
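
To make the "query path and compaction paths" point concrete, here is a hedged sketch of attaching a simple filtering iterator (an age-off filter) to all three scopes: scans, minor compactions, and major compactions. The priority, iterator name, and 90-day TTL are arbitrary placeholders.

```java
import java.util.EnumSet;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
import org.apache.accumulo.core.iterators.user.AgeOffFilter;

public class IteratorScopesExample {
    // Attach a filtering iterator and tell Accumulo to run it on all three
    // paths: scans (queries), minor compactions, and major compactions.
    static void attachAgeOff(Connector conn, String table) throws Exception {
        IteratorSetting setting = new IteratorSetting(30, "ageoff", AgeOffFilter.class);
        setting.addOption("ttl", Long.toString(90L * 24 * 60 * 60 * 1000)); // ~90 days, in ms
        conn.tableOperations().attachIterator(table, setting,
                EnumSet.of(IteratorScope.scan, IteratorScope.minc, IteratorScope.majc));
    }
}
```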

But we’re not done yet. Operations like deleting key-value pairs, or the cell-level security that you’re familiar with inside of Accumulo, all go through iterators; they’re performed in iterators. That’s a set of system iterators. On top of that, we can extend our capabilities to application-specific processing. You might consider the case of inserting key-value pairs. Say I insert the same key twice with two different values. What do I do with the values? Do I keep the most recent one? Do I keep all versions? One of the operations that we do inside of the application-specific iterators is versioning.
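
Here is a minimal sketch of configuring that versioning behavior through the client API. It assumes a connected Connector and replaces the default versioning iterator (new tables ship with one named "vers" at priority 20) so the table keeps ten versions per key instead of one.

```java
import java.util.EnumSet;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
import org.apache.accumulo.core.iterators.user.VersioningIterator;

public class VersioningExample {
    // Keep the ten most recent values per key instead of the default one.
    static void keepTenVersions(Connector conn, String table) throws Exception {
        // Remove the default "vers" iterator so we can re-attach it reconfigured.
        conn.tableOperations().removeIterator(table, "vers", EnumSet.allOf(IteratorScope.class));
        IteratorSetting vers = new IteratorSetting(20, "vers", VersioningIterator.class);
        VersioningIterator.setMaxVersions(vers, 10);
        conn.tableOperations().attachIterator(table, vers);
    }
}
```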

We can decide to keep the most recent value. We can keep the ten most recent values. We can keep the most recent 90 days of values. Those are basic, what we would call, filtering operations. In addition to that, we can do aggregation operations. Instead of just filtering out some of the values, we can calculate a function of all of the values that we’ve seen. If that function is associative and commutative, then we can apply it to partial sets of values in any order, so any online statistic like that fits naturally inside of the iterator tree. One example of that is word count. Anybody who’s ever done the MapReduce tutorial has gone through a series of documents, pulled out the words, mapped those words to a count, and then, inside of the reduce phase, collapsed all of those counts, adding them up to provide the final word count.
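
For the aggregation case, Accumulo ships combiner iterators that implement exactly this kind of associative, commutative fold. A hedged sketch of attaching a summing combiner to a placeholder "count" column family (the priority and iterator name are arbitrary):

```java
import java.util.Collections;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.Combiner;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class WordCountCombinerSetup {
    // Attach a SummingCombiner so that all values written under the "count"
    // column family get added together inside the iterator tree, during
    // scans, minor compactions, and major compactions alike.
    static void configure(Connector conn, String table) throws Exception {
        IteratorSetting setting = new IteratorSetting(25, "wordcount", SummingCombiner.class);
        Combiner.setColumns(setting, Collections.singletonList(new IteratorSetting.Column("count")));
        LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING); // values stored as decimal strings
        conn.tableOperations().attachIterator(table, setting);
    }
}
```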

We can do the same thing, but we can do it in a streaming fashion, using the iterators’ combining, or aggregation, capability to perform a very efficient merge. For example, if I have these documents, I generate a set of terms, and I associate each term with the number of times I see it. I’m going to see the term A many times. Inside of the iterators, I may see two of those key-value pairs; I can add those up and I get another version. Maybe that’s one of my compaction operations. I do another compaction operation on the next two. Maybe I do a major compaction merging those two together to get my final, or at least most recent, count. So the term A being seen four times would happen as an aggregation inside of the iterator tree.
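
And here is a sketch of what the streaming word count might look like from the client side, under a summing-combiner configuration like the one above: each occurrence is written as a blind "+1", and the combiner folds the partial counts together at scan and compaction time. The table and column names are placeholders.

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class WordCountExample {
    // Blind-write a "+1" for a term: no read-modify-write, just another
    // key-value pair that the SummingCombiner will fold in later.
    static void increment(BatchWriter writer, String term) throws Exception {
        Mutation m = new Mutation(term);
        m.put("count", "", new Value("1".getBytes())); // STRING-encoded long
        writer.addMutation(m);
    }

    static void run(Connector conn, String table) throws Exception {
        BatchWriter writer = conn.createBatchWriter(table, new BatchWriterConfig());
        for (String term : new String[] {"a", "b", "a", "a", "c", "a"}) {
            increment(writer, term);
        }
        writer.close();

        // The scan sees one combined entry per term, e.g. "a" -> 4, even if
        // some of the partial sums still live in separate RFiles.
        Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
        scanner.setRange(new Range("a"));
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey().getRow() + " -> " + entry.getValue());
        }
    }
}
```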

When we look at the performance of that, essentially what we’re trying to do here is avoid read-modify-write. As I’m inserting randomly ordered key-value pairs, if I need to look up the previous value, I’m essentially reading that off of disk, causing a seek on disk. Then, writing a new value could also result in a write to disk. That type of random IO on disk is very inefficient. Since we’re piggybacking on top of this log-structured merge tree mechanism, we get much more efficient writes of online statistical data, keeping that aggregated footprint instead.

Accumulo has this thing called a compaction ratio, and that actually allows us to shift between extremes here. On one extreme, I might say that I only want to keep one file at any point, to really minimize the read latency that I see. The cost of that is that every time I write new data, every time it flushes to disk, I need to compact it with that existing file. As such, I’m going to be making a large number of copies of each of the key-value pairs that I write, where those copies are essentially rewriting that key-value pair on disk.

The flip side is that maybe I don’t want to do a lot of copies; I want to minimize my impact on disk, so I keep the copies down. But the cost of that is that I get a lot of intermediate RFiles, and that has an impact on read latency. What we’ve offered is a ratio that you can set in between. We’ve tried to optimize it for the general use case, where there’s a mix of reads and writes. But we can bring that number of copies down, which makes our writes much more efficient, and at the same time keep our read latency fairly low. So there you have it. That’s the secret to Accumulo’s online real-time performance for a number of applications.
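
The compaction ratio Adam mentions is exposed as a per-table property. A minimal sketch of adjusting it, assuming a connected Connector:

```java
import org.apache.accumulo.core.client.Connector;

public class CompactionRatioExample {
    // Tune how aggressively the tablet server re-merges RFiles. Higher ratios
    // mean fewer rewrites of each key-value pair (cheaper writes) but more
    // files per tablet (more seeks per query); lower ratios do the opposite.
    // The default is 3.
    static void setRatio(Connector conn, String table, double ratio) throws Exception {
        conn.tableOperations().setProperty(table, "table.compaction.major.ratio",
                Double.toString(ratio));
    }
}
```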

Dave:

Thank you, Adam. So you’re seeing some of the innovations that Accumulo built on top of Bigtable. Of course, Sqrrl just announced Sqrrl Enterprise, which is an application development environment that allows programmers to take greater advantage of Accumulo and dramatically simplifies that environment. You can go to sqrrl - S-Q-R-R-L - .com to see more information about that and some more of Adam’s work. Go to YouTube.com/siliconangle to see this and other videos that are relevant to Accumulo and the big data space. This is Dave Vellante of Wikibon.org. This is The Cube. Thanks for watching, everybody. We’ll see you next time.

Topics: Blog Post