Querying the Database

This chapter discusses how the database contents generated by Joern can be queried to locate interesting code. We begin by reviewing the basics of the graph traversal language Gremlin and proceed to discuss how to select start nodes from the node index. The remainder of this chapter deals with code retrieval based on syntaxtaint-style queries in and finally, traversals in the function symbol graph.

The user-defined traversals presented throughout this chapter are all located in the directory joernsteps of python-joern. It may be worth studying their implementation to understand how to design your own custom steps for Gremlin.

Gremlin Basics

In this section, we will give a brief overview of the most basic functionality offered by the graph traversal language Gremlin developed by Marko A. Rodriguez. For detailed documentation of language features, please refer to http://gremlindocs.com and the Gremlin website.

Gremlin is a language designed to describe walks in property graphs. A property graph is simply a graph where key-value pairs can be attached to nodes and edges. (From a programmatic point of view, you can simply think of it as a graph that has hash tables attached to nodes and edges.) In addition, each edge has a type, and that’s all you need to know about property graphs for now.

Graph traversals proceed in two steps to uncover to search a database for sub-graphs of interest:

1. Start node selection. All traversals begin by selecting a set of nodes from the database that serve as starting points for walks in the graph.

2. Walking the graph. Starting at the nodes selected in the previous step, the traversal walks along the graph edges to reach adjacent nodes according to properties and types of nodes and edges. The final goal of the traversal is to determine all nodes that can be reached by the traversal. You can think of a graph traversal as a sub-graph description that must be fulfilled in order for a node to be returned.

The simplest way to select start nodes is to perform a lookup based on the unique node id as in the following query:

// Lookup node with given nodeId
g.v(nodeId)

Walking the graph can now be achieved by attaching so called emph{Gremlin steps} to the start node. Each of these steps processes all nodes returned by the previous step, similar to the way Unix pipelines connect shell programs. While learning Gremlin, it can thus be a good idea to think of the dot-operator as an alias for the unix pipe operator |. The following is a list of examples.

// Traverse to nodes connected to start node by outgoing edges
g.v(nodeId).out()

// Traverse to nodes two hops away.
g.v(nodeId).out().out()

// Traverse to nodes connected to start node by incoming edges
g.v(nodeId).in()

// All nodes connected by outgoing AST edges (filtering based
// on edge types)
g.v(nodeId).out(AST_EDGE)

// Filtering based on properties:
g.v(nodeId).out().filter{ it.type == typeOfInterest}

// Filtering based on edge properties
g.v.outE(AST_EDGE).filter{ it.propKey == propValue }.inV()

The last two examples deserve some explanation as they both employ the elementary step filter defined by the language Gremlin. filter expects a so called closure, an anonymous function wrapped in curly brackets (see Groovy - Closures). This function is applied to each incoming node and is expected to return true or false to indicate whether the node should pass the filter operation or not. The first parameter given to a closure is named it by convention and thus it.type denotes the property type of the incoming node in the first of the two examples. In the second example, edges are handed to the filter routine and thus it.propKey refers to the property propKey of the incoming edge.

Start Node Selection

In practice, the ids of interesting start nodes are rarely known. Instead, start nodes are selected based on node properties, for example, one may want to select all calls to function memcpy as start nodes. To allow efficient start node selection based on node properties, we keep an (Apache Lucene) node index (see Neo4J legacy indices). This index can be queried using Apache Lucene queries (see Lucene Query Language for details on the syntax). For example, to retrieve all AST nodes representing callees with a name containing the substring code{cpy}, one may issue the following query:

queryNodeIndex("type:Callee AND name:*cpy*") \end{verbatim}

The Gremlin step queryNodeIndex is defined in joernsteps/lookup.groovy of python-joern. In addition to queryNodeIndex, lookup.groovy defines various functions for common lookups. The example just given could have also been formulated as:

getCallsTo("*cpy*")

Please do not hesitate to contribute short-hands for common lookup operations to include in joernsteps/lookup.groovy.

Traversing Syntax Trees

In the previous section, we outlined how nodes can be selected based on their properties. As outline in Section Gremlin Basics, these selected nodes can now be used as starting points for walks in the property graph.

As an example, consider the task of finding all multiplications in first arguments of calls to the function malloc. To solve this problem, we can first determine all call expressions to malloc and then traverse from the call to its first argument in the syntax tree. We then determine all multiplicative expressions that are child nodes of the first argument.

In principle, all of these tasks could be solved using the elementary Gremlin traversals presented in Section Gremlin Basics. However, traversals can be greatly simplified by introducing the following user-defined gremlin-steps (see joernsteps/ast.py).

// Traverse to parent nodes in the AST
parents()

// Traverse to child nodes in the AST
children()

// Traverse to i'th children in the AST
ithChildren()

// Traverse to enclosing statement node
statements()

// Traverse to all nodes of the AST
// rooted at the input node
astNodes()

Additionally, joernsteps/calls.groovy introduces user-defined steps for traversing calls, and in particular the step ithArguments that traverses to i’th arguments of a given a call node. Using these steps, the exemplary traversal for multiplicative expressions inside first arguments to malloc simply becomes:

getCallsTo('malloc').ithArguments('0')
.astNodes().filter{ it.type == 'MultiplicativeExpression'}

Syntax-Only Descriptions

The file joernsteps/composition.groovy offers a number of elementary functions to combine other traversals and lookup functions. These composition functions allow arbitrary syntax-only descriptions to be constructed (see Modeling and Discovering Vulnerabilities with Code Property Graphs ). For example, to select all functions that contain a call to foo AND a call to bar, lookup functions can simply be chained, e.g.,

getCallsTo('foo').getCallsTo('bar')

returns functions calling both foo and bar. Similarly, functions calling code{foo} OR code{bar} can be selected as follows:

OR( getCallsTo('foo'), getCallsTo('bar') )

Finally, the not-traversal allows all nodes to be selected that do NOT match a traversal. For example, to select all functions calling foo but not bar, use the following traversal:

getCallsTo('foo').not{ getCallsTo('bar') }

Traversing the Symbol Graph

As outlined in Section Database Overview, the symbols used and defined by statements are made explicit in the graph database by adding symbol nodes to functions (see Appendix D of Modeling and Discovering Vulnerabilities with Code Property Graphs). We provide utility traversals to make use of this in order to determine symbols defining variables, and thus simple access to types used by statements and expressions. In particular, the file joernsteps/symbolGraph.groovy contains the following steps:

// traverse from statement to the symbols it uses
uses()

// traverse from statement to the symbols it defines
defines()

// traverse from statement to the definitions
// that it is affected by (assignments and
// declarations)
definitions()

As an example, consider the task of finding all third arguments to memcpy that are defined as parameters of a function. This can be achieved using the traversal

getArguments('memcpy', '2').definitions()
.filter{it.type == TYPE_PARAMETER}

where getArguments is a lookup-function defined in joernsteps/lookup.py.

As a second example, we can traverse to all functions that use a symbol named len in a third argument to memcpy that is not used by any condition in the function, and hence, may not be checked.

getArguments('memcpy', '2').uses()
.filter{it.code == 'len'}
.filter{
        it.in('USES')
        .filter{it.type == 'Condition'}.toList() == []
}

This example also shows that traversals can be performed inside filter-expressions and that at any point, a list of nodes that the traversal reaches can be obtained using the function toList defined on all Gremlin steps.

Taint-Style Descriptions

The last example already gave a taste of the power you get when you can actually track where identifiers are used and defined. However, using only the augmented function symbol graph, you cannot be sure the definitions made by one statement actually reach another statement. To ensure this, the classical reaching definitions problem needs to be solved. In addition, you cannot track whether variables are sanitized on the way from a definition to a statement.

Fortunately, joern allows you to solve both problems using the traversal unsanitized. As an example, consider the case where you want to find all functions where a third argument to memcpy is named len and is passed as a parameter to the function and a control flow path exists satisfying the following two conditions:

begin{itemize} item The variable code{len} is not re-defined on the way. item The variable is not used inside a relational or equality expression on the way, i.e., its numerical value is not ``checked’’ against some other variable. end{itemize}

You can use the following traversal to achieve this:

getArguments('memcpy', '2')
.sideEffect{ paramName = '.*len.*' }
.filter{ it.code.matches(paramName) }
.unsanitized{ it.isCheck( paramName ) }
.params( paramName )

where isCheck is a traversal defined in joerntools/misc.groovy to check if a symbol occurs inside an equality or relational expression and code{params} traverses to parameters matching its first parameter.

Note, that in the above example, we are using a regular expression to determine arguments containing the sub-string len and that one may want to be a little more exact here. Also, we use the Gremlin step sideEffect to save the regular expression in a variable, simply so that we do not have to re-type the regular expression over and over.