NSA just released some fascinating new data on Apache Accumulo performance as a massively scalable graph store. The data was presented last week at Carnegie Mellon University, and the abstract of the report reads:
“Developing scalable graph algorithms in the Big Data era is an imposing challenge. Graph algorithms can be especially difficult to scale because of highly skewed data distribution and irregular data access patterns. Graphs at Big Data scales can exceed the physical memory of even the largest supercomputers which inhibit conventional algorithm implementations. Cloud technologies like Accumulo and MapReduce are promising for Big Graphs but require careful algorithm design to be effective. We present an experiment using Accumulo and MapReduce for Breadth-First Search (BFS) on the largest problem sizes in the Graph500.org industry benchmark.”
The slides go into greater detail. I think these are the most important highlights:
- Accumulo was tested on a 1,200-node cluster holding over a petabyte of data and 70 trillion edges. Massive scalability!
- Performance scaled linearly from 1 trillion to 70 trillion edges. In practice, this means that as an organization's data grows, it simply adds more commodity hardware to maintain performance. Easy horizontal scaling for graphs!
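To make the benchmark concrete: the Graph500 BFS kernel expands a graph level by level from a root vertex. The sketch below is a minimal, in-memory illustration of that idea, not the NSA implementation — the `edge_table` dict stands in for Accumulo's sorted key-value layout (source vertex as row key, destinations as values), and all names are hypothetical.

```python
from collections import deque

# Hypothetical edge table: Accumulo keeps rows sorted by key, so an
# adjacency list (source vertex -> destination vertices) maps naturally
# onto its data model. A plain dict stands in for the table here.
edge_table = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": ["f"],
    "f": [],
}

def bfs(table, root):
    """Level-by-level breadth-first search. At cluster scale, each
    frontier expansion would correspond to a round of scans and writes
    against the edge table rather than in-memory lookups."""
    visited = {root: 0}        # vertex -> BFS level (distance from root)
    frontier = deque([root])
    while frontier:
        vertex = frontier.popleft()
        for neighbor in table.get(vertex, []):
            if neighbor not in visited:
                visited[neighbor] = visited[vertex] + 1
                frontier.append(neighbor)
    return visited

print(bfs(edge_table, "a"))
```

The level-synchronous structure is what makes BFS amenable to MapReduce-style execution: each level's frontier can be scanned and expanded in parallel before moving to the next.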
There are also some interesting use cases referenced here in which Accumulo serves as a scalable graph store, with subgraphs exported to Hadoop or other graph analytics platforms for further analysis.