Sqrrl Blog

Jul 31, 2013 8:02:43 AM

Sqrrl Whiteboard Video Transcript: NSA Lessons Learned

Full video here:

http://www.youtube.com/watch?v=1QrZmGDEGZ8

Transcript below:

Dave:

Hi, everybody. This is Dave Vellante of wikibon.org, and we’re here at Wikibon headquarters. This is The Cube. Adam Fuchs is here. Adam spent the better part of a decade at the NSA. If you’re an application designer, and a big data practitioner, he’s going to share with you some of the lessons that he learned there. Adam, welcome.

Adam:

Thanks, Dave. Good to be here. I’ve broken out some of my best handwriting to share with you some of these lessons learned. I’ve got four of them today that we’re going to talk about. Starting small, but designing for scalability. We’re going to talk about schemas and ontology development. We’re going to talk about application building blocks and discovery analytics. We’re going to talk a little bit about data-centric security.

I’ll just start out with a little discussion of adoption curves from an application developer’s perspective. I’ve worked on, at the National Security Agency, dozens of infrastructure components, applications, and I’ve seen a variety of approaches with them. Some of those have to do with, we pick a huge set of requirements, try to throw those into one application. That results in a huge amount of time spent building an application. Maybe you’re designing for scalability from the start but you’re also trying to bite off way too much. The adoption curve that you might see from that type of approach would be a huge amount of time spent. You get to market at the end of that time, and then your growth follows from that point.

A lot of people have shifted more towards a prototyping type effort, where they might do operational prototypes. One of the concepts there is, try to get the application to market very quickly. Here we’ve drawn a second curve where the application, we’ll design it for small scalability. Pick up a very small set of application requirements, build the app for that. It’ll reach a certain point, we’ll redesign for scalability. Grow from there. Redesign for scalability. Grow from there.

There are these levels. Across those tiers there’s a large amount of remodeling effort. We did get to market quickly and we got some early insight on how that application might be useful. A third curve, which is one of the things that we’re trying at Sqrrl, is to start small, but design from the beginning for scalability. A lot of the Hadoop ecosystem and the components in that are really designed to give a prototyping type capability, where you can bring that to market very quickly, but also scale up. All of the elements that you throw into that original application design are designed for scalability from the ground up. That’s a much nicer curve. You don’t have to take your system offline for a while to scale up to the next tier. Instead, you can just keep it running, add more boxes, scale horizontally, bring in a lot of elasticity. That’s a very nice lesson learned, and a lot of the National Security Agency’s applications are shifting more towards that same paradigm.

Another concept I want to explore today is the concept of data modelling. There are two extremes in that space that you see a lot of application developers tending towards. One of the extremes is to use a very flat schema. This might be exemplified by people throwing data into HDFS and running MapReduce on top of it in its raw form. For some applications, that’s okay. If the application uses data that really doesn’t have a very complex schema associated with it, that’ll get you a long way. There’s the other extreme, though, which is for applications that deal with data that has a lot of join points, a lot of complexity in the data, it’s nice to have a much more highly modeled form of the data.

If we consider these two extremes from an application development perspective, there’s a lot of complexity that shows up in the application when we’re dealing with unstructured data. A lot less complexity shows up when we’re dealing with a more structured approach. We actually have a complexity curve that looks something like that.

There’s a flip side of this as well, though. One of the things that is nice about flat schemas is you can bring data in very quickly. You can have it available for application development before you go through a large modeling process to bring that data in. We have a second complexity curve, which is essentially the ETL, or the data modelling curve. You might think of this as the amount of time that it takes to model the data.

Certainly, if we’re throwing it into flat files, we can just throw it in. It’s very quick. There’s some small structure that we can bring with other techniques, but if we go to this complete ontology approach, it could be months or even years before we get a model that really handles all of the data we have. Even then, as we bring in new data sets, we may have to totally reorganize that ontology to handle the new complexities that new data brings.

What we’re trying to support with Sqrrl is a middle of the road approach, where it’s not totally flat. It’s supporting some higher level application concepts, but we can also bring in data very quickly. This is sort of more an ELT approach rather than ETL, where we bring in the data quickly, understand it using the tools that we have for doing big data analysis. Then transform it later on when we understand it more. That schema refinement cycle is an iterative schema refinement cycle.

Inside of Sqrrl and inside of Accumulo, we’re building a lot of tools to support that, whether it’s through flexible schemas to bring it in or through schema statistics to learn about the schema that actually exists in there. Or through fault transformation tools that give us high throughput for transforming data, and support that type of activity. That iterative refinement cycle is very key for bringing applications to market very quickly.
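The load-first, transform-later cycle described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the records, the field names, and the functions `schema_statistics` and `refine` are hypothetical, and none of this is Sqrrl's or Accumulo's API. Raw records are loaded as-is, simple statistics reveal which fields and value types actually occur, and a transformation is written afterwards based on what the statistics show.

```python
from collections import Counter, defaultdict


def schema_statistics(records):
    """Count how often each field appears and which value types it holds."""
    field_counts = Counter()
    field_types = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            field_counts[field] += 1
            field_types[field][type(value).__name__] += 1
    return field_counts, field_types


# Hypothetical raw records, loaded with no up-front modeling (extract, load).
raw_records = [
    {"user": "alice", "bytes": 512, "src_ip": "10.0.0.1"},
    {"user": "bob", "bytes": "2048"},
    {"user": "carol", "bytes": 128, "dst_port": 443},
]

field_counts, field_types = schema_statistics(raw_records)


def refine(record):
    """Transform step, written after the statistics showed that the
    hypothetical 'bytes' field holds a mix of integers and strings."""
    refined = dict(record)
    if "bytes" in refined:
        refined["bytes"] = int(refined["bytes"])
    return refined


refined_records = [refine(r) for r in raw_records]
```

Running the statistics again on `refined_records` would show a single type for `bytes`, which is one turn of the refinement cycle; the next turn might address the sparsely populated fields.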

A third concept that I want to explore is this concept of discovery analytics as application building blocks. The space where I see a lot of innovation in government spaces as well as in commercial spaces is in that application development. You might think of an overall application for risk analysis, or fraud detection, or cyber security, whether it’s intrusion detection or forensic analysis. Those are all a suite of applications. There’s no single application for any of those use cases that covers everything. The more applications that we can develop, the faster we can innovate, and the faster we can evolve to support a broader set of scenarios, a broader case of applications in these cases.

In order to do that with good scalability and with good security, it’s necessary to have a set of building blocks on top of which to build those applications. Nobody goes out and builds an application end to end from the bytes that they put on disks all the way up through the visualization layer that’s human digestible. There’s always the layers in-between. At Sqrrl, what we think is the right layer to build applications on top of is something we call discovery analytics. This came out of years of efforts in building lots of different applications at NSA, but we think that some of the things that show up in that space are things like universal search, where it’s structured and unstructured data using languages that people are familiar with.

We see also in that space basic statistics, aggregations that are parallelized across the cluster. Document structures, building models over time online using hierarchical document structures and using graph structures. Those things all fit into this discovery analytics layer. If we can build those in a generic way such that they’re reusable at these higher level applications, then what we can do is figure out and solve the scalability and security problems inside of that discovery analytics layer, so that the cost of building applications on top of it is very small, or is significantly smaller at least.

This layer down here, this is really where Sqrrl fits. Our product, Sqrrl Enterprise, encapsulates that whole layer up through the database, up through the indexing, the organizations of data, up through that discovery analytics layer exposing things in the right API that’s useful for application development. Along all of these concerns, coming from NSA, security is always a big concern. At NSA, all of the applications that we developed had a multilevel security concern of some level.

That generally comes out of the concept of privacy policies, or legal restrictions, on how data can be used. As you get into the big data, big application space, you really run across more and more of those types of restrictions on how data can be used. Some of those things come from Sarbanes-Oxley or HIPAA restrictions on data usage. Some of them come from internal privacy policies, or information sharing between different organizations. We’re seeing, not only in the government space, but also in commercial spaces, more and more of those data restrictions coming into place.

The traditional approach to dealing with complex data restrictions is to put some business logic into the application space. That’s bad for a couple of reasons. One is that it complicates the application. Bringing an application to market with all of the security concerns built into it is a longer process. It might take you months instead of hours to build that application with all those security concerns. Not only that, but this is a point of vulnerability. The more times you implement complex security requirements, the more times you’re going to get it wrong. If we can push that down into the infrastructure layer, and get it right in that one place, we’ve increased our security, and we’ve increased innovation through cheaper application development.

The way we do that inside of Accumulo and inside of Sqrrl is through a concept called datacentric security. For that, data carries around with it some aspect of provenance, which really defines how it can be used when coupled with a set of security policies. Sqrrl and Accumulo implement that through a concept called cell-level security. Inside of Sqrrl and inside of Accumulo, each of those higher level concepts, like structured documents, graphs, and indexes, boils down to key-value pairs which are tagged with visibility labels. That datacentric security concept allows us to separate the modeling of security from the modeling of the application.
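As a rough illustration of key-value pairs tagged with visibility labels, here is a minimal Python sketch. It assumes a simplified label grammar (terms joined by `&` and `|`, with parentheses) and is not Accumulo's or Sqrrl's actual parser or API; the sample cells, the label names, and the functions `is_visible` and `scan` are hypothetical. The point it shows is that the filter runs in one place underneath the application, so application code never re-implements the policy.

```python
import re

# One token is either a label term or one of the operators & | ( )
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad label expression at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the set of authorizations satisfies the label expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # assumption: an unlabeled cell is visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_term()
            result = result and right
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


# Hypothetical cells: (row, column, visibility label, value)
cells = [
    ("patient_17", "diagnosis", "doctor", "flu"),
    ("patient_17", "billing_code", "doctor|billing", "J10"),
    ("patient_17", "ssn", "admin&(doctor|billing)", "000-00-0000"),
]


def scan(cells, authorizations):
    """Return only the cells whose label the caller's authorizations satisfy."""
    return [
        (row, column, value)
        for row, column, label, value in cells
        if is_visible(label, authorizations)
    ]
```

A caller holding only the hypothetical `billing` authorization gets back just the billing code, while a caller holding `admin` and `doctor` gets all three cells. Neither caller's application code contains any policy logic.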

Exposed at that discovery analytics layer are a series of methods that are used to model the applications. Each of them has that security built into it from the core. That drastically decreases the cost of building security and drastically increases the efficacy of that security. Those are the four basic lessons learned over my decade of experience at NSA. We follow through here. We start small, but design for scale. Iterative schema refinement. Discovery analytics as big app building blocks, and datacentric security throughout.

Dave:

Excellent, Adam. Thanks very much. We really appreciate you sharing your deep knowledge with Wikibon and Silicon Angle communities. If you want more information on this and other advice, go to sqrrl.com. It’s S-Q-R-R-L.com. Also check out YouTube.com/siliconangle for other videos like this, go to siliconangle.com for all the news, and check out wikibon.org for all the research. Thanks for watching, everybody. We’ll see you next time.
