Querying the Database¶
This chapter discusses how the database contents generated by Joern can be queried to locate interesting code. We begin by reviewing the basics of the graph traversal language Gremlin and proceed to discuss how to select start nodes from the node index. The remainder of this chapter deals with code retrieval based on syntaxtaint-style queries in and finally, traversals in the function symbol graph.
The user-defined traversals presented throughout this chapter are all
located in the directory joernsteps
of python-joern
. It may be
worth studying their implementation to understand how to design your
own custom steps for Gremlin.
Gremlin Basics¶
In this section, we will give a brief overview of the most basic functionality offered by the graph traversal language Gremlin developed by Marko A. Rodriguez. For detailed documentation of language features, please refer to http://gremlindocs.com and the Gremlin website.
Gremlin is a language designed to describe walks in property graphs. A property graph is simply a graph where key-value pairs can be attached to nodes and edges. (From a programmatic point of view, you can simply think of it as a graph that has hash tables attached to nodes and edges.) In addition, each edge has a type, and that’s all you need to know about property graphs for now.
Graph traversals proceed in two steps to uncover to search a database for sub-graphs of interest:
1. Start node selection. All traversals begin by selecting a set of nodes from the database that serve as starting points for walks in the graph.
2. Walking the graph. Starting at the nodes selected in the previous step, the traversal walks along the graph edges to reach adjacent nodes according to properties and types of nodes and edges. The final goal of the traversal is to determine all nodes that can be reached by the traversal. You can think of a graph traversal as a sub-graph description that must be fulfilled in order for a node to be returned.
The simplest way to select start nodes is to perform a lookup based on the unique node id as in the following query:
// Lookup node with given nodeId
g.v(nodeId)
Walking the graph can now be achieved by attaching so called
emph{Gremlin steps} to the start node. Each of these steps processes
all nodes returned by the previous step, similar to the way Unix
pipelines connect shell programs. While learning Gremlin, it can thus
be a good idea to think of the dot-operator as an alias for the unix
pipe operator |
. The following is a list of examples.
// Traverse to nodes connected to start node by outgoing edges
g.v(nodeId).out()
// Traverse to nodes two hops away.
g.v(nodeId).out().out()
// Traverse to nodes connected to start node by incoming edges
g.v(nodeId).in()
// All nodes connected by outgoing AST edges (filtering based
// on edge types)
g.v(nodeId).out(AST_EDGE)
// Filtering based on properties:
g.v(nodeId).out().filter{ it.type == typeOfInterest}
// Filtering based on edge properties
g.v.outE(AST_EDGE).filter{ it.propKey == propValue }.inV()
The last two examples deserve some explanation as they both employ the
elementary step filter
defined by the language Gremlin. filter
expects a so called closure, an anonymous function wrapped in curly
brackets (see Groovy - Closures). This function is applied to
each incoming node and is expected to return true
or false
to
indicate whether the node should pass the filter operation or not. The
first parameter given to a closure is named it
by convention and
thus it.type
denotes the property type
of the incoming node in
the first of the two examples. In the second example, edges are handed
to the filter routine and thus it.propKey
refers to the property
propKey
of the incoming edge.
Start Node Selection¶
In practice, the ids of interesting start nodes are rarely
known. Instead, start nodes are selected based on node properties, for
example, one may want to select all calls to function memcpy
as
start nodes. To allow efficient start node selection based on node
properties, we keep an (Apache Lucene) node index (see Neo4J legacy
indices). This
index can be queried using Apache Lucene queries (see Lucene Query
Language for
details on the syntax). For example, to retrieve all AST nodes
representing callees with a name containing the substring code{cpy},
one may issue the following query:
queryNodeIndex("type:Callee AND name:*cpy*") \end{verbatim}
The Gremlin step queryNodeIndex
is defined in
joernsteps/lookup.groovy
of python-joern
. In addition to
queryNodeIndex
, lookup.groovy
defines various functions
for common lookups. The example just given could have also been
formulated as:
getCallsTo("*cpy*")
Please do not hesitate to contribute short-hands for common lookup
operations to include in joernsteps/lookup.groovy
.
Traversing Syntax Trees¶
In the previous section, we outlined how nodes can be selected based on their properties. As outline in Section Gremlin Basics, these selected nodes can now be used as starting points for walks in the property graph.
As an example, consider the task of finding all multiplications in
first arguments of calls to the function malloc
. To solve this
problem, we can first determine all call expressions to malloc
and then traverse from the call to its first argument in the syntax
tree. We then determine all multiplicative expressions that are child
nodes of the first argument.
In principle, all of these tasks could be solved using the elementary
Gremlin traversals presented in Section Gremlin Basics. However,
traversals can be greatly simplified by introducing the following
user-defined gremlin-steps (see joernsteps/ast.py
).
// Traverse to parent nodes in the AST
parents()
// Traverse to child nodes in the AST
children()
// Traverse to i'th children in the AST
ithChildren()
// Traverse to enclosing statement node
statements()
// Traverse to all nodes of the AST
// rooted at the input node
astNodes()
Additionally, joernsteps/calls.groovy
introduces user-defined
steps for traversing calls, and in particular the step
ithArguments
that traverses to i’th arguments of a given a call
node. Using these steps, the exemplary traversal for multiplicative
expressions inside first arguments to malloc
simply becomes:
getCallsTo('malloc').ithArguments('0')
.astNodes().filter{ it.type == 'MultiplicativeExpression'}
Syntax-Only Descriptions¶
The file joernsteps/composition.groovy
offers a number of
elementary functions to combine other traversals and lookup
functions. These composition functions allow arbitrary syntax-only
descriptions to be constructed (see Modeling and Discovering
Vulnerabilities with Code Property Graphs
). For example, to select all functions that contain a call to foo
AND a call to bar
, lookup functions can simply be chained, e.g.,
getCallsTo('foo').getCallsTo('bar')
returns functions calling both foo
and bar
. Similarly,
functions calling code{foo} OR code{bar} can be selected as follows:
OR( getCallsTo('foo'), getCallsTo('bar') )
Finally, the not
-traversal allows all nodes to be selected
that do NOT match a traversal. For example, to select all functions
calling foo but not bar, use the following traversal:
getCallsTo('foo').not{ getCallsTo('bar') }
Traversing the Symbol Graph¶
As outlined in Section Database Overview, the symbols used and
defined by statements are made explicit in the graph database by
adding symbol nodes to functions (see Appendix D of Modeling and Discovering
Vulnerabilities with Code Property Graphs). We
provide utility traversals to make use of this in order to determine
symbols defining variables, and thus simple access to types used by
statements and expressions. In particular, the file
joernsteps/symbolGraph.groovy
contains the following steps:
// traverse from statement to the symbols it uses
uses()
// traverse from statement to the symbols it defines
defines()
// traverse from statement to the definitions
// that it is affected by (assignments and
// declarations)
definitions()
As an example, consider the task of finding all third arguments to
memcpy
that are defined as parameters of a function. This can be achieved using the traversal
getArguments('memcpy', '2').definitions()
.filter{it.type == TYPE_PARAMETER}
where getArguments
is a lookup-function defined in
joernsteps/lookup.py
.
As a second example, we can traverse to all functions that use a
symbol named len
in a third argument to memcpy
that is not
used by any condition in the function, and hence, may not be checked.
getArguments('memcpy', '2').uses()
.filter{it.code == 'len'}
.filter{
it.in('USES')
.filter{it.type == 'Condition'}.toList() == []
}
This example also shows that traversals can be performed inside
filter-expressions and that at any point, a list of nodes that the
traversal reaches can be obtained using the function toList
defined on all Gremlin steps.
Taint-Style Descriptions¶
The last example already gave a taste of the power you get when you can actually track where identifiers are used and defined. However, using only the augmented function symbol graph, you cannot be sure the definitions made by one statement actually reach another statement. To ensure this, the classical reaching definitions problem needs to be solved. In addition, you cannot track whether variables are sanitized on the way from a definition to a statement.
Fortunately, joern allows you to solve both problems using the
traversal unsanitized
. As an example, consider the case where
you want to find all functions where a third argument to memcpy
is named len
and is passed as a parameter to the function and a
control flow path exists satisfying the following two conditions:
begin{itemize} item The variable code{len} is not re-defined on the way. item The variable is not used inside a relational or equality expression on the way, i.e., its numerical value is not ``checked’’ against some other variable. end{itemize}
You can use the following traversal to achieve this:
getArguments('memcpy', '2')
.sideEffect{ paramName = '.*len.*' }
.filter{ it.code.matches(paramName) }
.unsanitized{ it.isCheck( paramName ) }
.params( paramName )
where isCheck
is a traversal defined in joerntools/misc.groovy
to check if a symbol occurs inside an equality or relational
expression and code{params} traverses to parameters matching its
first parameter.
Note, that in the above example, we are using a regular expression to
determine arguments containing the sub-string len
and that one may
want to be a little more exact here. Also, we use the Gremlin step
sideEffect
to save the regular expression in a variable, simply so
that we do not have to re-type the regular expression over and over.