Sqrrl Blog

Dec 12, 2013 11:41:00 PM

What are Iterators in Accumulo?

Accumulo iterators are a real-time processing framework. Iterators take in a stream of key-value pairs and emit modified versions of that data. If you are familiar with MapReduce, they provide Reduce-like functionality, and then some. But unlike MapReduce, which relies on batch processing, iterators can do this with very low latency; in other words, they operate at interactive speeds.
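To make the Reduce-like behavior concrete, here is a minimal conceptual sketch (in Python, not the actual Accumulo Java API): an iterator consumes a sorted stream of key-value pairs and lazily emits transformed pairs as it goes, which is what makes the low latency possible. The function name and data are illustrative assumptions.

```python
# Conceptual sketch only -- not the Accumulo iterator API.
# An iterator consumes a sorted stream of (key, value) pairs and lazily
# emits transformed pairs, giving Reduce-like aggregation without a batch job.
def summing_iterator(entries):
    """Collapse consecutive entries that share a key by summing their values."""
    current_key, total = None, 0
    for key, value in entries:
        if key == current_key:
            total += value
        else:
            if current_key is not None:
                yield current_key, total  # emit as soon as a key is finished
            current_key, total = key, value
    if current_key is not None:
        yield current_key, total

data = [("a", 1), ("a", 2), ("b", 5)]
print(list(summing_iterator(data)))  # [('a', 3), ('b', 5)]
```

Because results are yielded as each key completes, a reader can start seeing aggregated output immediately rather than waiting for a full pass over the data.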

Iterators can also be stacked on top of one another, or chained together, which enables some powerful operations. You can configure a table to always apply a set of iterators (distinct from those on other tables), in which case they act like a filter on the table that maintains real-time aggregates, somewhat like a materialized view in a SQL database, except that the unviewed data will eventually disappear from the datastore. You can also apply iterators to a single scan of the database ("scan" is roughly "SELECT" in Accumulo vernacular).
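The stacking idea can be sketched as function composition: each iterator wraps the stream produced by the one below it, so the pair behaves like a pipeline. This is a hypothetical Python analogy, not Accumulo code; the function names and rows are made up for illustration.

```python
# Hypothetical sketch of stacking iterators: each one wraps the stream
# produced by the iterator below it, composing like a pipeline.
def filter_iterator(entries, predicate):
    """Pass through only the entries whose key matches the predicate."""
    for key, value in entries:
        if predicate(key):
            yield key, value

def uppercase_values(entries):
    """Transform each surviving entry's value."""
    for key, value in entries:
        yield key, value.upper()

rows = [("user:1", "alice"), ("sys:0", "boot"), ("user:2", "bob")]
# Stack: filter first, then transform what survives.
stack = uppercase_values(filter_iterator(rows, lambda k: k.startswith("user:")))
print(list(stack))  # [('user:1', 'ALICE'), ('user:2', 'BOB')]
```

In Accumulo the same layering is expressed by assigning iterators priorities, so lower-priority iterators see the output of the ones beneath them.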

Here are some illustrations of iterators.

[Diagrams: examples of iterators stacked and chained]

Iterators also operate at compaction time. Compactions govern how Accumulo ingests and manages data. Key-value pairs first land in an in-memory map, and they are eventually flushed to disk in a so-called minor compaction. Duplicate entries are then removed or merged depending on the iterators that you've set up for your table. Eventually (based on some tunable settings), the files created by minor compactions are themselves re-compacted; this process is the "major" compaction.
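The flush-then-merge lifecycle above can be sketched with a toy model (an assumed simplification in Python, not Accumulo's internals): writes land in an in-memory map, a minor compaction flushes it as a sorted run, and a major compaction merges runs while an iterator-style combine function collapses duplicate keys.

```python
import heapq

# Toy model of compactions -- an assumed simplification, not Accumulo internals.
def minor_compaction(memtable):
    """Flush the in-memory map to a sorted run (a 'file' on disk)."""
    return sorted(memtable.items())

def major_compaction(runs, combine):
    """Merge sorted runs; the table's iterator collapses duplicate keys."""
    out = []
    for key, value in heapq.merge(*runs):    # streaming merge of sorted runs
        if out and out[-1][0] == key:
            out[-1] = (key, combine(out[-1][1], value))  # merge duplicates
        else:
            out.append((key, value))
    return out

run1 = minor_compaction({"a": 1, "b": 2})
run2 = minor_compaction({"a": 10, "c": 3})
print(major_compaction([run1, run2], combine=lambda x, y: x + y))
# [('a', 11), ('b', 2), ('c', 3)]
```

The key point the diagrams below make is the same one this sketch makes: the iterator runs inside the compaction, so the merged output is already aggregated by the time it lands in the new file.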

The pictures below illustrate the compaction process and how iterators are positioned with regards to that process.

[Diagrams: minor and major compactions, with iterators applied during each]

 

The Accumulo manual has more information on compaction: http://accumulo.apache.org/1.5/accumulo_user_manual.html#_compaction

Topics: Accumulo, Big Data, Blog Post