Sqrrl Blog

Mar 31, 2015 8:30:00 AM

Linked Data > Log Data: The Power of Context

By George Aquila

Many enterprise security tools, including SIEMs, Incident Response, and Network Analysis tools, are log-based. However, making sense of log files can be tricky, since logs typically exist without context (i.e., it is hard to understand how they relate to the larger cybersecurity environment around them). Luckily, there is a more effective way of organizing your data: using a Linked Data approach.

Logged vs. Linked

Linked data enables more intuitive exploration, ad hoc interrogation, and more powerful analytics

Linked data describes a format for data representation that highlights the different types of relationships, or links, between entities. In this case, an entity is a logical item of interest, such as a ‘user’, a ‘website’, an ‘HTTP transaction’, and the like. These entities are then linked via different types of relationships – for example, a user can ‘know’ another user, an employee can ‘work for’ a manager, etc.
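To make the idea of entities and typed links concrete, here is a minimal Python sketch. The `LinkedData` class, its method names, and the sample users are hypothetical and exist only for illustration; they are not Sqrrl's data model or API.

```python
from collections import defaultdict


class LinkedData:
    """A tiny in-memory store of typed entities and typed, directed links.

    Hypothetical illustration only; not any product's actual data model.
    """

    def __init__(self):
        self.entities = {}             # entity id -> entity type
        self.links = defaultdict(set)  # (source id, relationship) -> target ids

    def add_entity(self, entity_id, entity_type):
        self.entities[entity_id] = entity_type

    def add_link(self, source, relationship, target):
        self.links[(source, relationship)].add(target)

    def related(self, source, relationship):
        # .get avoids creating empty entries for unknown keys
        return sorted(self.links.get((source, relationship), set()))


graph = LinkedData()
for name in ("alice", "bob", "carol"):
    graph.add_entity(name, "user")

graph.add_link("alice", "knows", "bob")        # a user 'knows' another user
graph.add_link("alice", "works_for", "carol")  # an employee 'works for' a manager

print(graph.related("alice", "knows"))
print(graph.related("alice", "works_for"))
```

Because each link carries its relationship type, the same pair of entities can be connected in several different ways without changing the structure.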

Surfacing the Context

As opposed to other, traditional data formats like flat files or database tables, Linked Data is especially helpful in surfacing the meaning and context of information. In other words, Linked Data is defined by its connections. These connections make it much easier to analyze a given element in your system and to apply analysis techniques that build on the related information you find.



An example Linked Data model

Linked Data Analysis gives cyber “hunters” and incident responders a way to quickly identify the important assets, actors, and events relevant to their organization, accentuating the natural connections between them and providing contextual perspective.

The Linked Advantage

There are a number of advantages that the Linked Data approach can provide over more traditional log management and analysis solutions.

  1. Easier to Ask Questions of the Data

Threats may well be there; you need only to make sense of them. The Linked Data model works particularly well in tandem with threat hunting because it enables you to ask iterative questions more easily. From a security perspective, this is fundamental in running any kind of manual analysis or hunt, since deciding what to focus on and how to investigate are the cornerstones of the art of detection. With the Linked Data approach, the kinds of questions you ask can be much more dynamic, and the answers can be found significantly faster than if you are dealing only with traditional collected logs.

For example, say you start with a ‘user’ and ask, “Show me all the websites this user has visited in the past day.” You can then dynamically expand relationships out from this data, asking questions like “Show me all the users that have also visited these websites within the same time window,” using a simple point-and-click operation. Then you can expand further and ask, “Show me how these users are connected to each other.”

In this way, linking data facilitates iterative question chaining, which streamlines the process of response and investigation. Our whitepaper discusses in more detail how Linked Data makes it easier to ask questions.
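The three chained questions above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `VISITS` and `KNOWS` sample data, the fixed `NOW` timestamp, and the helper function names. A real deployment would run these steps against a graph store rather than in-memory lists.

```python
from datetime import datetime, timedelta

# Hypothetical sample data: (user, website, timestamp) visit records
NOW = datetime(2015, 3, 31, 8, 30)
VISITS = [
    ("alice", "example.com", NOW - timedelta(hours=2)),
    ("alice", "files.example.net", NOW - timedelta(hours=5)),
    ("bob", "example.com", NOW - timedelta(hours=3)),
    ("carol", "files.example.net", NOW - timedelta(days=3)),
    ("dave", "other.example.org", NOW - timedelta(hours=1)),
]
# Hypothetical 'knows' links between users
KNOWS = {("alice", "bob"), ("bob", "dave")}


def sites_visited(user, since):
    """Question 1: which websites has this user visited since `since`?"""
    return {site for u, site, ts in VISITS if u == user and ts >= since}


def users_who_visited(sites, since, exclude=()):
    """Question 2: who else visited these websites in the same window?"""
    return {u for u, site, ts in VISITS
            if site in sites and ts >= since and u not in exclude}


def connections_among(users):
    """Question 3: how are these users connected to each other?"""
    return {(a, b) for a, b in KNOWS if a in users and b in users}


since = NOW - timedelta(days=1)
sites = sites_visited("alice", since)
others = users_who_visited(sites, since, exclude={"alice"})
links = connections_among(others | {"alice"})

print(sorted(sites), sorted(others), sorted(links))
```

Each answer becomes the input to the next question, which is the chaining pattern described above.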

  2. More Intuitive and Richer Visualization

Histograms, bar graphs, and pie charts can only get you so far. Linked Data visualization consists of weighted, directional nodes and edges that can provide compact representations of complex, dense datasets. As opposed to representing just simple trends and comparisons, Linked Data visualization enables users to easily infer relationships and second- and third-order connections in the data. This translates to stronger pattern discovery and pattern matching. With a quick glance, analysts can unravel how disparate pieces of data relate and visually “connect the dots.”


Here, blue edges represent flow relationships, while red edges are logins. The bolder blue arrow represents larger file transfers between entities.

Moreover, Linked Data visualization naturally aligns with the nature of cybersecurity data. Network diagrams are typically used to outline the structure of an organization’s IT systems. Linked Data visualization takes the basic concept of network diagrams and implements it at massive scale and in extreme detail. It also lets an analyst quickly zoom in and out to study both micro- and macro-level trends in the data.

  3. More Powerful Analytics

Linked data lends itself to far more effective automated graph analytics. Machine learning techniques can more easily traverse, find, and analyze specific entities when data points are already connected. These techniques include pattern matching, pattern discovery, and anomaly detection. A Linked Data implementation removes the need for the expensive join operations required in relational databases, since data points are pre-joined in the model. As a result, queries that would otherwise have to move through several different tables can run much faster as traversals across the graph.
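The difference between joining at query time and following pre-joined links can be illustrated with a small Python sketch. The `logins` and `flows` "tables," the addresses, and both function names are hypothetical. The sketch shows only the shape of the two approaches; actual performance depends on the storage engine and the data.

```python
from collections import defaultdict

# Relational style: two hypothetical "tables" that must be joined at query time
logins = [("alice", "host1"), ("bob", "host1"), ("alice", "host2")]
flows = [("host1", "203.0.113.7"), ("host2", "198.51.100.9")]


def external_ips_via_join(user):
    """Scan both tables and match rows on the shared host column."""
    return sorted({ip
                   for u, host in logins if u == user
                   for flow_host, ip in flows if flow_host == host})


# Linked style: connections are stored once, at ingest time
edges = defaultdict(set)
for user, host in logins:
    edges[user].add(host)
for host, ip in flows:
    edges[host].add(ip)


def external_ips_via_links(user):
    """Follow the stored links: user -> hosts -> external IPs."""
    return sorted({ip
                   for host in edges.get(user, ())
                   for ip in edges.get(host, ())})


print(external_ips_via_join("alice"))
print(external_ips_via_links("alice"))
```

Both functions return the same answer. The linked version simply starts from the entity of interest and walks outward, instead of matching rows across whole tables.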

Linked data analysis also includes the use of specific graph algorithms that are not available in traditional log analysis tools. Examples of graph algorithms include PageRank and Random Walk with Restart (RWR). In Big Data, pattern matching becomes more difficult on account of the sheer amount and variety of available information. Sqrrl’s automated knowledge extraction helps assure data quality and builds enriched contextual information to uncover non-obvious links between different points.
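As a rough illustration of Random Walk with Restart, the sketch below scores every node by how close it is to a chosen seed node. It is a plain power-iteration version written for this post, not Sqrrl's implementation. The sample graph, the `restart` value of 0.15, and the choice to send the score of any dead-end node back to the seed are all assumptions.

```python
def random_walk_with_restart(adjacency, seed, restart=0.15, iterations=100):
    """Score each node by its proximity to `seed`.

    `adjacency` maps every node to a list of its neighbors; each neighbor
    must itself be a key in `adjacency`. At every step the walker jumps
    back to the seed with probability `restart`, and otherwise moves to a
    uniformly chosen neighbor of its current node.
    """
    scores = {node: (1.0 if node == seed else 0.0) for node in adjacency}
    for _ in range(iterations):
        updated = {node: (restart if node == seed else 0.0) for node in adjacency}
        for node, neighbors in adjacency.items():
            walking = (1.0 - restart) * scores[node]
            if not neighbors:
                # Assumption: a dead-end node hands its score back to the seed
                updated[seed] += walking
                continue
            share = walking / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        scores = updated
    return scores


# Hypothetical graph of users and hosts, with links in both directions
adjacency = {
    "alice": ["host1"],
    "host1": ["alice", "bob", "host2"],
    "bob": ["host1"],
    "host2": ["host1", "eve"],
    "eve": ["host2"],
}

scores = random_walk_with_restart(adjacency, seed="alice")
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

Nodes that are only a hop or two from the seed end up with higher scores than distant ones, which is what makes this family of algorithms useful for ranking the entities most related to a starting point.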

  4. Schema Flexibility

Sqrrl’s Linked Data approach capitalizes on schema flexibility and the ability to store varied data types. As a comparison, many traditional tools are built on older non-adaptive architectures that make it difficult to update and query data points that don't fit into a tightly pre-defined schema. 

Flexible and dynamic schemas make it possible to ingest new information on the fly and to store dissimilar data points in the same set. The Linked Data model’s iterative development removes the up-front work of establishing a fixed schema, which means you can figure it out as you go along and don’t need to rely on any normalization.
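A minimal Python sketch of what schema flexibility looks like in practice is shown below. The `store` dictionary, the `ingest` and `with_property` helpers, and the sample fields are hypothetical; they only illustrate that entities of different types, with different properties, can sit side by side and gain new fields over time.

```python
store = {}  # entity id -> property dictionary; no fixed set of columns


def ingest(entity_id, entity_type, **properties):
    """Create the entity if it is new, then merge in whatever fields arrive."""
    record = store.setdefault(entity_id, {"type": entity_type})
    record.update(properties)
    return record


def with_property(name):
    """Find every entity that carries a given property, whatever its type."""
    return sorted(eid for eid, record in store.items() if name in record)


# Dissimilar data points land in the same set
ingest("alice", "user", department="finance")
ingest("10.0.0.5", "host", os="linux", open_ports=[22, 443])

# A field nobody planned for shows up later; no migration is needed
ingest("alice", "user", last_login="2015-03-30")

print(store["alice"])
print(with_property("os"))
```

The trade-off is that nothing enforces which fields an entity has, so queries need to tolerate missing properties.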

  5. Massive Scalability

The concept of Linked Data is not new. However, similar to most log management and analysis solutions, Linked Data solutions traditionally have been limited by the underlying scalability of the databases that powered them. With the advent of massively scalable non-relational databases, Linked Data capabilities have taken a leap forward.

Sqrrl’s Linked Data models are deployed on the Apache Accumulo database, which can scale horizontally to thousands of servers and tens of petabytes while maintaining linear performance. These performance figures enable Sqrrl to provide its customers with interactive search speeds across huge amounts of Linked Data. Since Accumulo is deployed on low-cost Hadoop hardware, the scaling can be done cost-effectively without sacrificing durability and resilience.

For more detail on how Linked Data works and how it can be applied to cyber defense, check out our whitepaper: Linked Data for Cyber Defense.



Topics: Accumulo, NoSQL, Big Data, Data Analysis, Linked Data