Code Analysis with joern-tools (Work in progress)¶
This tutorial shows how the command line utilities joern-tools can
be used for code analysis on the shell. These tools have been created
to enable fast programmatic code analysis, in particular to hunt for
bugs and vulnerabilities. Consider them a possible addition to your
GUI-based code browsing tools and not so much as a replacement. That
being said, you may find yourself doing more and more of your code
browsing on the shell with these tools.
This tutorial offers both short and concise commands that get a job
done as well as more lengthy queries that illustrate the inner
workings of the code analysis platform joern. The latter have been
provided to enable you to quickly extend joern-tools to suit your
specific needs.
Note: If you end up writing tools that may be useful to others,
please don’t hesitate to send a pull-request to get them included in
joern-tools.
Importing the Code¶
As an example, we will analyze the VLC media player, a medium-sized
code base containing code for both Windows and Linux/BSD. It is
assumed that you have successfully installed joern into the directory
$JOERN and Neo4J into $NEO4J as described in Installation. To begin,
you can download and import the code as follows:
cd $JOERN
mkdir tutorial; cd tutorial
wget http://download.videolan.org/pub/videolan/vlc/2.1.4/vlc-2.1.4.tar.xz
tar xfJ vlc-2.1.4.tar.xz
cd ..
./joern tutorial/vlc-2.1.4/
Next, you need to point Neo4J to the generated data in .joernIndex.
You can do this by setting the property
org.neo4j.server.database.location in the configuration file
neo4j-server.properties in the directory $NEO4J/conf as follows:
# neo4j-server.properties
org.neo4j.server.database.location=$JOERN/.joernIndex/
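A properties file is unlikely to expand shell variables such as $JOERN, so the value probably has to be written out as a literal path. The following is a minimal sketch of how the setting could be updated from the shell. The function name set_db_location is hypothetical (it is not part of joern or Neo4J), and the sketch assumes the property appears only as plain key=value lines in neo4j-server.properties.

```shell
# Hypothetical helper (not part of joern or Neo4J): set the database
# location property in a Neo4J server properties file.
#   $1: path to neo4j-server.properties
#   $2: literal path to the .joernIndex directory
set_db_location() {
    conf_file=$1
    index_dir=$2
    tmp_file="${conf_file}.tmp"
    # Drop any existing uncommented setting of the property ...
    grep -v '^org\.neo4j\.server\.database\.location=' "$conf_file" > "$tmp_file" || true
    # ... and append the new value, with shell variables already expanded.
    printf '%s=%s\n' 'org.neo4j.server.database.location' "$index_dir" >> "$tmp_file"
    mv "$tmp_file" "$conf_file"
}
```

It could then be called as set_db_location "$NEO4J/conf/neo4j-server.properties" "$JOERN/.joernIndex/", which lets the shell expand both variables before the file is written.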
Finally, please start the database server in a second terminal:
$NEO4J/bin/neo4j console
We will now take a brief look at how the code base has been stored in the database and then move on to joern-tools.
Exploring Database Contents¶
The Neo4J Rest API¶
Before we start using joern-tools, let’s take a quick look at the
way the code base has been stored in the database and how it can be
accessed. joern-tools uses the web-based API to Neo4J (REST API)
via the library python-joern, which in turn wraps py2neo. When
working with joern-tools, this will typically not be visible to
you. However, to get an idea of what happens underneath, point your
browser to:
http://localhost:7474/db/data/node/0
This is the reference node, which is the root node of the graph
database. Starting from this node, the entire database contents can be
accessed using your browser. In particular, you can get an overview of
all existing edge types as well as the properties attached to nodes
and edges. Of course, in practice, even for custom database queries,
you will not want to use your browser to query the database. Instead,
you can use the utility joern-lookup as illustrated in the next
section.
Inspecting node and edge properties¶
To send custom queries to the database, you can use the tool
joern-lookup. By default, joern-lookup will perform node index
lookups (see Fast lookups using the Node Index). For Gremlin
queries, the -g flag can be specified. Let’s begin by retrieving
all nodes directly connected to the root node using a Gremlin query:
echo 'g.v(0).out()' | joern-lookup -g
(1 {"type":"Directory","filepath":"tutorial/vlc-2.1.4"})
If this works, you have successfully injected a Gremlin script into
the Neo4J database using the REST API via joern-tools.
Congratulations, btw. As you can see from the output, the reference
node has a single child node. This node has two attributes: “type”
and “filepath”. In the joern database, each node has a “type”
attribute, in this case “Directory”. Directory nodes in particular
have a second attribute, “filepath”, which stores the complete path to
the directory represented by this node.
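When only the node ids are needed from such output, for example to pass them on to other tools, they can be cut out on the shell. This is a minimal sketch, assuming every result line has the form (id {properties}) as in the sample output above. The function name node_ids is hypothetical.

```shell
# Hypothetical helper: print the node id of each line of the form
# (<id> {<properties>}), ignoring lines that look different.
node_ids() {
    sed -n 's/^(\([0-9][0-9]*\) {.*})$/\1/p'
}

# Sample line copied from the output above; prints: 1
printf '%s\n' '(1 {"type":"Directory","filepath":"tutorial/vlc-2.1.4"})' | node_ids
```

Queries that end in .id, as used later in this tutorial, return ids directly and do not need this step.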
Let’s see where we can get by expanding outgoing edges:
# Syntax
# .outE(): outgoing Edges
echo 'g.v(0).out().outE()' | joern-lookup -g | sort | uniq -c
14 IS_PARENT_DIR_OF
This shows that, while the directory node only contains its path in the filepath attribute, it is connected to its sub-directories by edges of type IS_PARENT_DIR_OF, and thus its position in the directory hierarchy is encoded in the graph structure.
Filtering. Starting from a directory node, we can recursively enumerate all files it contains and filter them by name. For example, the following query returns all files in the directory ‘demux’:
# Syntax
# .filter(closure): allows you to filter incoming objects using the
# supplied closure, e.g., the anonymous function { it.type ==
# 'File'}. 'it' is the incoming pipe, which means you can treat it
# just like you would treat the return-value of out().
# loop(1){true}{true}: perform the preceding traversal
# exhaustively and emit each node visited
echo 'g.v(0).out("IS_PARENT_DIR_OF").loop(1){true}{true}.filter{ it.filepath.contains("/demux/") }' | joern-lookup -g
File nodes are linked to all definitions they contain, i.e., type, variable and function definitions. Before we look into functions, let’s quickly take a look at the node index.
Fast lookups using the Node Index¶
You will probably need the node index in all but the most basic queries. Instead of walking the graph database from its root node, you can look up nodes by their properties. Under the hood, this index is implemented as an Apache Lucene index, and thus you can make use of the full Lucene query language to retrieve nodes. Let’s see some examples.
echo "type:File AND filepath:*demux*" | joern-lookup -c
echo 'queryNodeIndex("type:File AND filepath:*demux*")' | joern-lookup -g
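Advantage: the result of an index lookup can serve as the starting point of a Gremlin traversal, e.g., to list the names of all functions defined in the matching files: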
echo 'queryNodeIndex("type:File AND filepath:*demux*").out().filter{it.type == "Function"}.name' | joern-lookup -g
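The index queries in this tutorial mostly follow the pattern type:X AND field:pattern, so they are easy to generate on the shell. The following is a small sketch. The function node_query is a hypothetical helper and not a joern-tools command.

```shell
# Hypothetical helper: build a Lucene query of the form
# "type:<type> AND <field>:<pattern>" for the node index.
node_query() {
    printf 'type:%s AND %s:%s\n' "$1" "$2" "$3"
}

# Prints: type:File AND filepath:*demux*
node_query File filepath '*demux*'

# With a running database, the result could be piped into joern-lookup:
#   node_query File filepath '*demux*' | joern-lookup -c
```

The pattern is passed in single quotes so that the shell does not expand the wildcards against file names in the current directory.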
Plotting Database Content¶
To enable users to familiarize themselves with the database contents
quickly, joern-tools offers utilities to retrieve graphs from the
database and visualize them using graphviz.
Retrieve functions by name
echo 'getFunctionsByName("GetAoutBuffer").id' | joern-lookup -g | joern-location
/home/fabs/targets/vlc-2.1.4/modules/codec/mpeg_audio.c:526:0:19045:19685
/home/fabs/targets/vlc-2.1.4/modules/codec/dts.c:400:0:13847:14459
/home/fabs/targets/vlc-2.1.4/modules/codec/a52.c:381:0:12882:13297
Note the use of the shorthand getFunctionsByName; see python-joern for reference.
Plot abstract syntax tree
Take one of the functions (here the last one, via tail -n 1) and use joern-plot-ast to generate a .dot file of its AST:
echo 'getFunctionsByName("GetAoutBuffer").id' | joern-lookup -g | tail -n 1 | joern-plot-ast > foo.dot
dot -Tsvg foo.dot -o ast.svg; eog ast.svg
Plot control flow graph
echo 'getFunctionsByName("GetAoutBuffer").id' | joern-lookup -g | tail -n 1 | joern-plot-proggraph -cfg > cfg.dot;
dot -Tsvg cfg.dot -o cfg.svg; eog cfg.svg
Show data flow edges
echo 'getFunctionsByName("GetAoutBuffer").id' | joern-lookup -g | tail -n 1 | joern-plot-proggraph -ddg -cfg > ddgAndCfg.dot;
dot -Tsvg ddgAndCfg.dot -o ddgAndCfg.svg; eog ddgAndCfg.svg
Mark nodes of a program slice
echo 'getFunctionsByName("GetAoutBuffer").id' | joern-lookup -g | tail -n 1 | joern-plot-proggraph -ddg -cfg | joern-plot-slice 1856423 'p_buf' > slice.dot;
dot -Tsvg slice.dot -o slice.svg;
Note: The node id 1856423 is specific to the database used here; you may need to replace it with an id from your own database.
Selecting Functions by Name¶
Lookup functions by name
echo 'type:Function AND name:main' | joern-lookup
Use Wildcards:
echo 'type:Function AND name:*write*' | joern-lookup
Output all fields:
echo 'type:Function AND name:*write*' | joern-lookup -c
Output specific fields:
echo 'type:Function AND name:*write*' | joern-lookup -a name
Shorthand to list all functions:
joern-list-funcs
Shorthand to list all functions matching pattern:
joern-list-funcs -p '*write*'
List signatures
echo "getFunctionASTsByName('write').code" | joern-lookup -g
Lookup by Function Content¶
Lookup functions by parameters:
echo "queryNodeIndex('type:Parameter AND code:*len*').functions().id" | joern-lookup -g
Shorthand:
echo "getFunctionsByParameter('*len*').id" | joern-lookup -g
From function-ids to locations: joern-location
echo "getFunctionsByParameter('*len*').id" | joern-lookup -g | joern-location
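The output of joern-location can be post-processed with standard tools. The sketch below splits each line into the file path and a number that is assumed to be the start line. This reading of the second field is inferred from the sample output shown earlier in this tutorial and is not confirmed by it. The function name location_lines is hypothetical.

```shell
# Hypothetical helper: turn "path:a:b:c:d" into "path<TAB>a".
# The last four colon-separated fields are treated as numbers, so
# colons inside the path itself are preserved.
location_lines() {
    awk -F: 'NF >= 5 {
        path = $1
        for (i = 2; i <= NF - 4; i++) path = path ":" $i
        printf "%s\t%s\n", path, $(NF - 3)
    }'
}

# Sample line copied from the joern-location output shown earlier.
printf '%s\n' '/home/fabs/targets/vlc-2.1.4/modules/codec/a52.c:381:0:12882:13297' | location_lines
```

With a running database, the same function could be appended to the pipeline above, i.e., after joern-location.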
Dumping code to text-files:
echo "getFunctionsByParameter('*len*').id" | joern-lookup -g | joern-location | joern-code > dump.c
Zapping through locations in an editor:
echo "getFunctionsByParameter('*len*').id" | joern-lookup -g | joern-location | tail -n 2 | joern-editor
Note that you need to be in the directory from which the code was imported, or the code must have been imported using full paths.
Lookup functions by callees:
echo "getCallsTo('memcpy').functions().id" | joern-lookup -g
You can also use wildcards here. Of course, joern-location, joern-code and joern-editor can be used on function ids again to view the code.
List call expressions:
echo "getCallsTo('memcpy').code" | joern-lookup -g
List arguments:
echo "getCallsTo('memcpy').ithArguments('2').code" | joern-lookup -g
Analyzing Function Syntax¶
- Plot of AST
- locate sub-trees and traverse to statements
Analyzing Statement Interaction¶
- some very basic traversals in the data flow graph