% !TEX root = main.tex
\section{Description of the Dataflow Library}
In bug finding, it is often useful to consider where data (often user input)
ends up, and how it reached that point in the program.
Generally, we would like to warn developers about security issues.
Many security issues stem from user input that is not sanitized (or is improperly sanitized).
For example, in a web application it constitutes a security issue (an SQL injection)
if unsanitized user input reaches a function that executes a SQL statement,
as this gives the user
the ability to run arbitrary SQL statements.
Except for database query tools, this is undesired behaviour that can be abused
by malicious users.
In the same spirit, it is a security issue if a web application displays
unsanitized user input directly on the web page, as this allows the user to execute
JavaScript in the context of the page. This is called an XSS (cross-site scripting) injection.
Warning about both SQL and XSS injections requires, among other things,
a concept of where user input will end up in the program.
We call this concept \emph{dataflow}.
Data enters the program at some point, at a \emph{dataflow source}, passes through
\emph{dataflow steps} (the data from the source is copied around, passed as a parameter
to function calls, \ldots) and then ends up in a \emph{dataflow sink}.
A sink could be a call to a relevant function, for example a function
executing an SQL statement.
All dataflow from sources via one or more steps to sinks makes up the
\emph{dataflow graph}.
Nodes in the dataflow graph are points in the program that can hold data,
and edges in the dataflow graph are dataflow steps.
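As a concrete, hypothetical illustration, consider the following fragment, in which the
invented methods \java{readUserInput} and \java{executeSql} stand in for a real source of
user input and a real SQL-executing sink:
\begin{listing}[H]
\begin{javacode}
class InjectionExample {
    void handleRequest() {
        String input = readUserInput();   // dataflow source: data enters the program here
        String copy = input;              // dataflow step: the value is copied
        executeSql(copy);                 // dataflow sink: the value reaches the SQL-executing call
    }

    String readUserInput() { return "user data"; } // invented source
    void executeSql(String query) { }              // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of a dataflow source, an intermediate step and a sink}
\label{lst:source-sink-sketch}
\end{listing}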
A \emph{dataflow library} is concerned with computing an approximation of
the dataflow graph.
As always in static analysis, it is impossible to precisely solve the task at hand.
Calculating an exact dataflow graph would require solving the halting problem.
Because of the halting problem, the dataflow library cannot be sound \textbf{and}
complete. However, it is actually neither sound \textbf{nor} complete.
In general, the dataflow library reports \emph{possible} dataflow, i.e.\ as soon
as there is a path in the control flow graph of the program, dataflow is reported.
Thus, the dataflow library is not complete, as it reports more flow than is present
in the program.
On the other hand, some dataflow paths that do exist are not reported by the
dataflow library, because finding them would be prohibitively expensive.
Thus, the dataflow library is not sound either.
These limitations will be explained later in more detail.
However, a lot of care is taken to be as precise as possible
without sacrificing performance.
Designing a dataflow library that works in theory is not too difficult.
However, designing a dataflow library that performs well on projects with
millions of lines of code is considerably more difficult.
In this section, we will describe the dataflow library used by the CodeQL platform % TODO check marketing material, is this correct?
for analyzing programs in Java.
We will describe the dataflow library without the contributions that are outlined
in the next sections.
The dataflow library tracks only the flow of exact values.
For example, dataflow of a string is interrupted when \java{toUpperCase()} is called.
However, using functions like \java{toUpperCase()} does not prevent XSS or SQL
injection attacks. To find paths on which flow is not interrupted as long as the
content of a variable still depends heavily on the input data
(i.e.\ it is \emph{tainted}), the \texttt{TaintTracking} library exists.
It is built on top of the dataflow library and provides a generic interface for
finding bug classes such as XSS, SQL injection or directory traversal.
While the taint tracking library is probably the biggest user of the dataflow library,
it is implemented strictly on top of the dataflow library.
Thus, we do not discuss it further, except to highlight the hooks in the dataflow
library on top of which the taint tracking library is built.
Because the taint tracking library is built on top of the dataflow library,
all improvements made to the dataflow library immediately also benefit taint tracking.
There is one exception to the rule that changing the data stops dataflow: All numeric types
(integers and floating point numbers) are treated as data of the same type,
as converting from one type to the other often preserves the value.
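The following hypothetical snippet illustrates both points (the method \java{sink} is an
invented stand-in for an arbitrary dataflow sink):
\begin{listing}[H]
\begin{javacode}
class ValueVsTaint {
    void example(String source, int count) {
        String upper = source.toUpperCase(); // not the same value: plain dataflow stops here,
                                             // but taint tracking adds this as an extra step
        sink(upper);

        long widened = count;                // numeric conversion: treated as value-preserving,
        sink(widened);                       // so plain dataflow continues
    }

    void sink(Object value) { }              // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of value-preserving flow versus taint flow}
\label{lst:value-vs-taint-sketch}
\end{listing}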
\subsection{Interface}
Before we delve into the internals of the dataflow library, we describe the API
that users of the dataflow library will encounter.
\subsubsection*{Nodes}
The main concept used throughout the dataflow library is that of a \texttt{Node}.
Every element in the program holding data is a node.
The dataflow library is concerned with the task of identifying
under which circumstances there is flow between nodes.
The set of nodes and the set of dataflow steps between those nodes make up the
\emph{dataflow graph}.
The QL type \texttt{Node} has several subtypes, corresponding to different kinds of nodes.
The most important subclass of \texttt{Node} is \texttt{ExprNode},
as every expression in the program corresponds to such a node.
Another important subclass is \texttt{ParameterNode}.
Both the implicit \texttt{this} parameter for methods and all explicit parameters
given to methods are introduced into a method via these nodes.
In a call, data flows from the node of the argument of the
call to the corresponding parameter node in the callee.
Objects that potentially change through an operation are modelled
via \texttt{PostUpdateNode}s.
This applies to objects that have a method called on them (which may mutate them),
to objects whose fields are accessed, and to arguments of method calls
(the callee could mutate the object).
Such an expression has two nodes: a so-called \emph{pre-update node} that represents
the object before the operation, and a \emph{post-update node} that represents
the object after the operation.
This becomes clearer when looking at the statement \java{a.field = "string"}.
In this example, there exists an expression node for
\java{"string"}, an expression node for the field access \java{a.field},
an expression node for \java{a} and an expression node for the entire assignment.
However, none of these expression nodes capture the fact that the object \java{a}
is mutated by the assignment.
To represent that fact a \texttt{PostUpdateNode} for \java{a} exists,
representing that the assignment (potentially) mutates the object \java{a}.
The corresponding pre-update node is the expression node for \java{a}.
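In code, the situation looks roughly as follows (the class \java{A} and the method
\java{use} are invented for illustration):
\begin{listing}[H]
\begin{javacode}
class A {
    String field;
}

class PostUpdateExample {
    void example() {
        A a = new A();
        a.field = "string"; // the expression node for a is the pre-update node;
                            // a separate PostUpdateNode represents a after the store
        use(a);             // later reads of a receive flow from the post-update node
    }

    void use(A a) { }       // invented helper
}
\end{javacode}
\caption{Hypothetical sketch of pre-update and post-update nodes}
\label{lst:postupdate-sketch}
\end{listing}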
Other subtypes of the \texttt{Node} class exist as well,
but are less important to the bigger picture of understanding the dataflow library.
These subtypes are more technical in nature and correspond, for example,
to instance creation, implicit reads of the \texttt{this} parameter or
the desugaring of vararg method calls.
One peculiarity of the \texttt{Node} class is that the dataflow graph
is stored in separate predicates, not on the nodes making up the dataflow
graph itself.
This means that the dataflow graph is effectively stored as an adjacency matrix,
encoded in the tuple set of a predicate.
%Different kinds of dataflow graphs can be represented by different predicates,
%using the same set of nodes.
% TODO elaborate - or maybe not?
\subsubsection*{Configuration}
The \texttt{Configuration} is the main class a user interacts with.
A configuration defines which nodes are sources and sinks of dataflow.
Furthermore, a configuration can also specify additional steps which yield flow.
For XSS, a configuration could add that a call to
\texttt{toUpperCase()} on a string still provides dataflow.
On the other hand, configurations can also specify nodes which block flow.
In the example, this could be a sanitizer method that strips HTML tags.
This enables the user to filter out dataflow paths that are known to be safe.
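For instance, in the hypothetical snippet below, an XSS configuration could mark the
result of the invented sanitizer \java{stripHtmlTags} as a barrier, so the flow in
\java{safe} is not reported, while the flow in \java{unsafe} is (\java{render} is an
invented sink):
\begin{listing}[H]
\begin{javacode}
class BarrierExample {
    void safe(String source) {
        String clean = stripHtmlTags(source); // barrier node: the configuration blocks flow here
        render(clean);                        // this flow is therefore not reported
    }

    void unsafe(String source) {
        render(source);                       // reported: the input reaches the sink unsanitized
    }

    String stripHtmlTags(String s) { return s.replaceAll("<[^>]*>", ""); } // invented sanitizer
    void render(String html) { }                                           // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of a barrier (sanitizer) blocking a dataflow path}
\label{lst:barrier-sketch}
\end{listing}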
The most important predicate on the \texttt{Configuration} class
is the \texttt{hasFlow(Node source, Node sink)} predicate.
Using that predicate, users can query the dataflow relation.
It holds whenever there is dataflow from the given node \texttt{source}
(which has to be a source) to the given node \texttt{sink} (which has to be a sink).
Note that this interface does \textbf{not} provide information about specific
dataflow paths.
It also limits dataflow queries to dataflow between the (previously defined) sources and sinks.
This design limitation is intentional, because it allows performance optimizations
that make it feasible to compute whole-program dataflow.
The configuration also allows the user to disable dataflow via field reads and stores.
Doing so improves performance and allows the dataflow library to explore all branches
from virtual dispatch.
If field flow is enabled, the user has to set a limit on the number of branches
allowed for virtual dispatch in the presence of field flow.
Method calls via virtual dispatch are only considered as flow targets if there are
sufficiently few possible implementations to analyze.
This restriction is needed to prevent the combinatorial explosion stemming from
fully respecting virtual dispatch and field flow.
This is explained in more detail later.
% TODO is it?
Furthermore, the configuration allows the specification of additional dataflow
steps, as well as barrier nodes.
These concepts are discussed in more detail below.
\subsection{General Design}
In theory, implementing a dataflow algorithm in a logic programming language is simple.
Given the \texttt{Node} concept discussed above, one only needs to provide a relation
\texttt{step(Node node1, Node node2)}.
This step relation holds if \texttt{node1} has dataflow in one step to \texttt{node2}
(covering both local and global steps).
Then, the transitive closure of \texttt{step} contains all possible dataflow paths in the program.
These dataflow paths then can be filtered so that only paths which begin in a source and
end in a sink are considered.
In practice, however, this approach would not perform well.
First, note that the number of \texttt{Node}s grows roughly linearly in the size of
the program.
Second, the transitive closure of the \texttt{step} relation enumerates all dataflow
paths in the program from one node to another.
It is defined on the Cartesian product of \texttt{Node} with itself, which is roughly quadratic
in the size of the program.
We would end up with a predicate that probably grows superlinearly in the program size.
The CodeQL toolchain is capable of handling very large programs,
for which asymptotically superlinear algorithms typically do not scale.
Thus, a more involved approach to compute dataflow is needed.
On a very high level, the algorithm used in the dataflow library is very similar
to the simple algorithm sketched above.
Its last step is indeed taking the transitive closure of a step relation.
However, a lot of care is taken to define the step relation only on suitable
nodes that with high probability will be part of a dataflow path from a source to a sink.
For example, nodes that cannot be reached from any source via a dataflow path
are ignored.
This pre-filtering of suitable nodes makes up most of the code of the
algorithm.
Reducing the node set makes it possible to define the step relation on a much
smaller set that is not quadratic in the size of the program.
\subsection{Description of the Algorithm}
We will now present the details of how dataflow is computed for Java programs.
First, the algorithm runs four phases of filtering, each identifying candidate nodes
that could be part of dataflow paths.
Every pass refines the set of nodes by computing more accurate information about the
possible dataflow.
Recording only candidate nodes (and not candidate dataflow edges) means that
the predicates for node filtering are only defined on the node set,
not on the node set times the node set.
Thus, the candidate node sets are inherently linear in the size of the program.
This makes their computation feasible.
Second, the algorithm computes a step relation and takes the transitive closure of that.
As the step relation is only defined on a very refined set of candidate nodes,
its size is small enough to handle even for large programs.
The implementation of the algorithm is divided into a generic dataflow algorithm that is
used for Java, C/C++ and C\#, and a language-specific implementation of a common interface.
In this description, we gloss over the separation of these two modules and
describe the algorithm formed by the generic module together with the language-specific
module for Java.
The implementation of the dataflow library for Java encompasses roughly 4000 lines of
QL code. Describing every single detail of the implementation would be neither feasible nor
useful. Thus, some details are omitted or the implementation of a particular step
of the algorithm is only described on a very abstract level.
Other parts of the algorithm are described in more depth.
\subsubsection*{Terminology}
Before we can start with the steps of the algorithm, we need to introduce some terminology.
The \emph{virtual dispatch} library provides functionality to resolve calls of
methods through interfaces and class hierarchies.
For any given method call, it provides a list of all potential targets of that
call site.
With an optionally provided call context, it can further narrow down the list of
possible method implementations.
A \emph{call context} encapsulates the information that the analyzed callable was
called from a specific call site, so that information about the arguments to the
method is available.
\begin{listing}
\begin{javacode}
interface I {
    public void f();
}

class A implements I {
    @Override
    public void f() {
        // Implementation A
    }
}

class B implements I {
    @Override
    public void f() {
        // Implementation B
    }
}

class User {
    public void g(I i) {
        i.f();
    }

    public void h() {
        A a = new A();
        g(a);
    }
}
\end{javacode}
\caption{Example code for virtual dispatch resolution}
\label{lst:virtual-dispatch}
\end{listing}
In the example in~\autoref{lst:virtual-dispatch}, the interface \java{I} defines a
method \java{f()}, and the two
classes \texttt{A} and \texttt{B} both implement the interface \texttt{I}.
Without call contexts, the virtual dispatch library would resolve the call
of \java{f()} in the method \java{User.g(i)} to both
implementations in \texttt{A} and \texttt{B}.
However, if call contexts are provided to the virtual dispatch library,
it is able to infer that when \java{User.g(i)} is invoked by the method
\java{User.h()}, it can only dispatch to the implementation of \texttt{f()} in
class \texttt{A}.
This functionality is orthogonal to the dataflow library, and the implementation
of virtual dispatch is not described here.
A \emph{local} flow step is a dataflow step where data flows from one node to another
and both nodes are inside the same callable.
Local flow computation utilizes an SSA form.
A standard textbook SSA construction with $\phi$-nodes is used.
Some examples of local flow steps are:
\begin{itemize}
\item Data flows from the right-hand side of a variable assignment to a
first read of that variable
\item Data flows from a variable read to an adjacent variable read (a use-use chain in SSA lingo).
This means that both nodes read the same variable, and there is no other operation between the
reads, except possibly the presence of $\phi$-nodes.
\item Data flows from a parameter node to a first read of the parameter
\item Data flows from the inside of an expression in parentheses to the dataflow node
representing the outer expression, i.e.\ in the expression \java{(var)} there is dataflow from the node
representing the variable read \java{var} to the dataflow node representing the entire expression \texttt{(var)}.
\item Data flows from a post-update node that represents an object to an adjacent read
of that object
\end{itemize}
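Some of the local flow steps listed above are illustrated in the following hypothetical
snippet (the helper methods \java{f} and \java{g} are invented):
\begin{listing}[H]
\begin{javacode}
class LocalFlowExample {
    void example(String p) {
        f(p);               // parameter node -> first read of the parameter p
        String x = g(p);    // use-use step: adjacent reads of p in f(p) and g(p);
                            // the right-hand side g(p) flows to the first read of x below
        f((x));             // step from the node for the read of x to the node for the
                            // parenthesized expression (x)
        f(x);               // use-use step between adjacent reads of x
    }

    void f(String s) { }             // invented helper
    String g(String s) { return s; } // invented helper
}
\end{javacode}
\caption{Hypothetical sketch of some local flow steps}
\label{lst:local-flow-sketch}
\end{listing}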
There are some more local flow steps that are quite technical in nature.
The local flow computation is sound, i.e.\ it never omits dataflow that is
present in the program.
% TODO check again, we're talking about single steps here, not entire graphs!
% This might be a bit early here
Unlike for global dataflow, the entire local dataflow graph for each function
is computed.
This is necessary because the local dataflow is an important building block in computing
the global dataflow.
A priori it is not clear whether a local dataflow path is part of a (global) dataflow path
from a source to a sink, thus it is not possible to filter paths based on that condition.
As methods are much smaller in relation to the whole program, computing the whole local
dataflow graph is feasible and no combinatorial explosion occurs.
A \emph{jump} flow step is a dataflow step that jumps between callables
in a way that ignores call contexts.
In Java, there are two possibilities for a jump step to occur.
\begin{itemize}
\item \textbf{Static fields:} Data flows from an assignment to a static field
to a read of that static field, if the field read is not determined by SSA.
The field read is only determined by SSA if a local value is assigned to the
static field.
However, in most cases the static field will be assigned to in one callable
and read in another.
\item \textbf{Closures:} Data flows from a use or a definition of a variable
in the outer callable to a use of a captured variable in the inner function.
\end{itemize}
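Both kinds of jump steps are illustrated in the following hypothetical snippet
(the \java{sink} method is invented):
\begin{listing}[H]
\begin{javacode}
import java.util.function.Supplier;

class JumpStepExample {
    static String shared;

    void writer(String source) {
        shared = source;            // store to a static field in one callable...
    }

    void reader() {
        sink(shared);               // ...read of the static field in another callable:
                                    // a jump step that ignores call contexts
    }

    Supplier<String> closure(String captured) {
        return () -> captured;      // jump step from the definition of captured in the
                                    // outer callable to its use inside the lambda
    }

    void sink(String s) { }         // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of jump steps via a static field and a closure}
\label{lst:jump-step-sketch}
\end{listing}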
An \emph{additional} (local or jump) flow step is a dataflow step that the user
provides via the \texttt{Configuration}.
The prime example of additional flow steps are taint steps.
In Java, string concatenation is a taint step, but not a (regular) dataflow step.
The object that flows from the source to the sink is changed by the concatenation,
but, for example, a string that merely has a newline appended to it is still dangerous
to use in a SQL query without further checks.
Note that an additional step (by definition) does not preserve the value of the
variable in question, as all value-preserving steps are already contained in the
dataflow library.
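For instance, in the hypothetical snippet below the concatenation does not preserve the
value of \java{source}, so it is not a plain dataflow step; a taint-tracking
configuration would add it as an additional step, because the resulting query is still
dangerous (the method \java{executeSql} is an invented sink):
\begin{listing}[H]
\begin{javacode}
class AdditionalStepExample {
    void buildQuery(String source) {
        // Concatenation changes the value, so it is not a regular dataflow step.
        // A taint-tracking configuration adds it as an additional step, because the
        // result is still dangerous to execute.
        String query = "SELECT * FROM users WHERE name = '" + source + "'";
        executeSql(query);
    }

    void executeSql(String query) { } // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of string concatenation as an additional (taint) step}
\label{lst:additional-step-sketch}
\end{listing}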
A \emph{barrier} is a node that prevents dataflow.
Barriers are user-specified via the \texttt{Configuration}.
For example, they are used by the taint tracking library to model sanitizing steps
that stop tainted values from flowing to a sink.
% TODO remove this, check for other references to it
%An \emph{extended return} is a generalized return that representing dataflow out of a callable.
%It is either a return statement, which we call a \emph{value return} or a parameter update.
%Parameter updates are often called having an \emph{out parameter}.
%The programmer supplies the reference to an object in a parameter, and the method then
% TODO check what exactly out parameters are in Java, have a look at https://lgtm.com/query/4328056797028651291/
% Should be in fact vacuous
\subsubsection*{Field Flow and Access Paths}
The dataflow library has support for (an approximation of) storing the data in a
field of an object. The details of the approximation used will be described later, but
we will introduce the concepts now.
Whenever data is stored in an object, we need to switch the data we are tracking ---
before a store, we track a value that originated at a source; after a store, the object
containing the field needs to be tracked, together with the information in which
field the originally tracked data is stored.
Conversely, when reading out of the field, we need to switch back from tracking
the containing object to tracking the data we were originally interested in.
Because objects can be stored in objects, just keeping track of whether we are tracking
the value directly or tracking an object that holds it in a field is not enough.
Instead, we need to keep track of a list of all the fields objects were stored into along
the current path.
We call such a list an \emph{access path}.
This access path provides the needed context to track dataflow through fields.
The access path is empty if the data that originated from the source was not
stored in any field.
The \emph{head} of the access path indicates which field was last stored into,
if the access path is non-empty.
The access path is treated as a stack-like data structure.
At every field store, the field is pushed onto the access path and thus becomes the new head.
At every field read, we pop the head from the access path, which restores the old context.
Using the access path, we can precisely track field flow.
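The following hypothetical snippet shows how the access path evolves along one dataflow
path (the classes \java{Inner} and \java{Outer} and the method \java{sink} are invented):
\begin{listing}[H]
\begin{javacode}
class Inner { String data; }
class Outer { Inner inner; }

class AccessPathExample {
    void example(String source) {
        Inner i = new Inner();
        i.data = source;         // store: push data  -> access path [data]
        Outer o = new Outer();
        o.inner = i;             // store: push inner -> access path [inner, data]
        Inner i2 = o.inner;      // read:  pop inner  -> access path [data]
        sink(i2.data);           // read:  pop data   -> empty access path, original value
    }

    void sink(String s) { }      // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of access path pushes and pops}
\label{lst:access-path-sketch}
\end{listing}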
As access paths can potentially be of unbounded length, for example
when dealing with recursive data structures, we cannot handle access paths
with full precision.
Thus, they need to be approximated.
Throughout the algorithm, several approximations of access paths and the push and
pop operation on the access path are used.
These approximations will be discussed in detail.
\subsubsection*{Flow Through Methods}
As a preparatory step, the dataflow algorithm summarizes methods that at some point
return one of their parameters into a single dataflow step.
This is independent of the configuration, and takes neither additional steps
nor barriers into account.
Later steps in the algorithm offset these shortcomings.
The dataflow from the argument node to the node for the value returned by
the method is thus represented by only a single edge.
\begin{listing}[H]
\begin{javacode}
class A {
    String name;

    public A setName(String name) {
        this.name = name;
        return this;
    }
}

class B {
    public void caller() {
        A a = new A();
        a.setName("a name");
    }
}
\end{javacode}
\caption{Example code for flow through method summarization}
\label{lst:flow-through-example}
\end{listing}
In the code example in~\autoref{lst:flow-through-example}, there is dataflow of the instance
of class \java{A} to the method \java{setName}, because the instance \java{a}
is (implicitly) passed as the \java{this} parameter.
Because of the return statement, there is dataflow back from \texttt{setName} to
\texttt{caller}.
After the summarization step, the dataflow at the call \java{a.setName("a name")} is described
in a single step, from the node representing \texttt{a} as the implicit argument of the method
call to the node representing the entire call expression.
This summarization helps the dataflow algorithm in several ways.
In the first passes of the dataflow algorithm, we do not keep track of call contexts
for method calls, as that would be too expensive.
However, that means that calls and returns are not matched up, and every return out of
a method flows to any of its call sites.
For example, some programmers write their setters such that they always return \java{this},
as that allows chaining setters.
If one of these setters has 100 call sites and we did not match up callers with callees,
the dataflow analysis would be very imprecise.
Using method summaries is an inexpensive way
to rectify that problem, as they match up callers and callees.
It also ensures that any dataflow path first returns out of methods, and only
then enters other methods. As soon as a path starts entering methods, it can
never encounter return steps anymore.
Furthermore, summarization provides shorter paths with fewer nodes involved.
This helps performance.
The summarization computes, for every parameter, a relation containing all nodes
that are reachable from that parameter by taking local (value-preserving) dataflow steps.
This marks all dataflow nodes that receive the value of that parameter.
A \emph{parameter flows through a method} if the parameter reaches a
dataflow node associated with a return statement.
Together with the virtual dispatch analysis (not described here), this information
is used to match up arguments to parameters, yielding a predicate
that describes all arguments that flow through their method calls.
% TODO the sentence before is very complicated
These predicates are mutually recursive.
This means that, in the setting of an outer method that calls an inner
method, both methods are summarized if a parameter flows through both of them.
The summarization is run in two passes.
The first pass works exactly as described above.
Its output is then used to detect getter and setter methods.
This relies on the predicate that indicates which dataflow nodes in a method are
reachable from a parameter.
A \emph{getter} is a method where data flows from a parameter node
(the implicit \java{this} argument to methods) to a field read (or recursively another getter)
and then to a return statement.
A \emph{setter} is a method with (at least) two parameters,
where data flows from one parameter to a field store (or recursively another setter),
and the store updates a field in an object that was provided as a parameter as well.
The second pass enhances the detection of method calls where an argument flows through
by also taking \emph{local store-read steps} into account.
These are stores of data into a field that is read later, where both
the store and the read happen in the same method.
This second pass relies on the first pass, as the store and read steps take the
(previously computed) getters and setters into account.
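The following hypothetical class sketches the patterns involved: \java{setName} is a
setter, \java{getName} is a getter, and \java{roundTrip} contains a local store-read
step, so its parameter flows through it:
\begin{listing}[H]
\begin{javacode}
class Person {
    private String name;

    // setter: the parameter name flows to a field store on the other
    // parameter (the implicit this)
    void setName(String name) { this.name = name; }

    // getter: this flows to a field read and then to the return statement
    String getName() { return this.name; }

    // local store-read step: the store and the read happen in the same method,
    // so the parameter value flows through roundTrip
    String roundTrip(String value) {
        this.name = value;
        return this.name;
    }
}
\end{javacode}
\caption{Hypothetical sketch of a getter, a setter and a local store-read step}
\label{lst:getter-setter-sketch}
\end{listing}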
\subsubsection*{Node Filtering --- Phase 1 forward}
Before computing the dataflow graph for a program,
the algorithm identifies several sets of candidate nodes that are possibly part of
that dataflow graph.
These sets are restricted further and further with every filtering pass.
These sets are configuration-specific.
They take sources, sinks, barriers and additional steps from the configuration into
account.
Note that in the description of the passes, \emph{candidate node} always refers
to being a candidate node of that pass, unless otherwise noted.
The first pass is done in the predicate \texttt{nodeCandFwd1}.
It uses a flooding-style algorithm to compute all nodes reachable from the source nodes
using dataflow steps, while ignoring all call contexts and access paths.
This yields a huge overapproximation of the node set of the dataflow graph,
but because it is very simple, it is cheap to compute.
The candidate node set is described recursively and computed by a fixed-point
iteration.
A \texttt{Node n} is a candidate node if it is not a barrier node and
one of the following conditions holds:
\begin{itemize}
\item The node \texttt{n} is a source node.
\item There exists a candidate node \texttt{mid} such that \texttt{n} is reachable from \texttt{mid}
with a single flow step, either a local step, a jump step or an additional step.
\item There exists a candidate dataflow node \texttt{arg}, such that
\texttt{arg} is an argument in a call, and \texttt{n} is a dataflow node
representing the parameter in the callee.
In this case, we say that we have dataflow \emph{into} a callable.
\item The opposite case is that we have dataflow \emph{out} of a callable.
That happens when the node \texttt{n} is a node representing a value returned by
a call, either directly or by being a \texttt{PostUpdateNode} that
represents a parameter that is possibly modified by a call.
Furthermore, there exists a candidate node that returns from the callee.
\item The node \texttt{n} is a \texttt{PostUpdateNode}, representing the state of an
object immediately after a field store (either direct, or by using a setter).
Furthermore, the data that is stored needs to be a candidate node already.
\item The node \texttt{n} takes the result of a field read
(either directly, or via a getter), the node from which we read is already a
candidate node \textbf{and} the field we read from is stored to
by a candidate node.
\end{itemize}
In the case of calls and returns, virtual dispatch is taken into account,
so calls branch out to all possible targets, and returns go back to all possible callers.
Field flow is only considered when it is enabled in the configuration.
If field flow is enabled, we track not only candidate nodes, as computed by
\texttt{nodeCandFwd1}, but also candidate fields that are possibly stored into as
part of a dataflow path.
These fields are tracked by the predicate \texttt{storeCandFwd1}, which is mutually
recursive with \texttt{nodeCandFwd1}.
While we do not track any access paths in this first pass,
the candidate fields can be understood as (a superset of) the fields that make up
access paths, as only fields in which we store can be part of an access path.
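The candidate node set is a recursive QL predicate that is evaluated to a fixed point.
As a rough, imperative sketch of the same idea (ignoring flow into and out of callables
and field flow; the generic type parameter \java{Node} and the explicit \java{step} edge
map are invented for illustration), the forward pass corresponds to a worklist
reachability search:
\begin{listing}[H]
\begin{javacode}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FloodingSketch {
    // Worklist sketch of the forward flooding pass: collect all nodes reachable
    // from the sources via single flow steps, stopping at barrier nodes.
    // Call contexts and access paths are ignored, as in nodeCandFwd1.
    static <Node> Set<Node> forwardCandidates(Set<Node> sources,
                                              Map<Node, List<Node>> step,
                                              Set<Node> barriers) {
        Set<Node> candidates = new HashSet<>();
        Deque<Node> worklist = new ArrayDeque<>(sources);
        while (!worklist.isEmpty()) {
            Node n = worklist.pop();
            if (barriers.contains(n) || !candidates.add(n)) {
                continue; // barrier nodes block flow; already-seen nodes are not revisited
            }
            worklist.addAll(step.getOrDefault(n, List.of()));
        }
        return candidates;
    }
}
\end{javacode}
\caption{Imperative sketch of the forward flooding pass (not the actual QL implementation)}
\label{lst:flooding-sketch}
\end{listing}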
\subsubsection*{Node Filtering --- Phase 1 backwards}
The second pass is done in the predicate \texttt{nodeCand1}.
The second pass filters the results from the first pass by only keeping nodes
that also are reachable from the sinks when taking all the steps backwards.
This pattern is used for all the candidate node filtering steps --- first, we
filter the node set starting from the sources and taking steps forward,
and then we filter that set further by starting with the sinks and taking steps backwards.
This eliminates all intermediary nodes on paths that never reach a sink.
Like \texttt{nodeCandFwd1}, \texttt{nodeCand1} ignores all call contexts
and all access paths.
All cases are symmetric to the ones in \texttt{nodeCandFwd1}.
In detail, this means a \texttt{Node n} is a candidate node as considered by
\texttt{nodeCand1} if it is contained in
\texttt{nodeCandFwd1} and one of the following is true:
\begin{itemize}
\item The node \texttt{n} is a sink node.
\item There exists a candidate node \texttt{mid} such that \texttt{mid} is reachable with a single flow step,
either a local step, a jump step or an additional step from the node \texttt{n}.
\item Flow into a callable:
There exists a candidate node \texttt{param} that is a parameter of a callable,
such that \texttt{n} is a possible argument that gets passed to the callable.
\item Flow out of a callable:
The node \texttt{n} returns from a callable, and there exists a candidate node
in a caller that receives the returned value.
\item The node \texttt{n} is an object from which a field \texttt{f} is read,
where the value of the read flows to a candidate node, and \texttt{f}
is a candidate field as identified by the predicate \texttt{storeCandFwd1}.
\item The node \texttt{n} is stored into a field \texttt{f}, where the
object after the store is already a candidate node, \texttt{f} is a
candidate field as identified by the predicate \texttt{storeCandFwd1}
\textbf{and} the field is read from by a candidate node.
\end{itemize}
Again, if field flow is enabled, we also track candidate fields that could make up
access paths. This predicate is called \texttt{readCand1} and is a subset of
\texttt{storeCandFwd1}.
It tracks the fields that are read from in a path from the sink to a source.
Both passes in phase 1 take virtual dispatch into account by dispatching to all possible callees
for each call site. Only later phases use a more advanced analysis that uses the call context
to restrict the virtual dispatch targets.
% TODO elaborate on type checking
%Note that no type checking is done in phase 1, so
\subsubsection*{Node Filtering --- Phase 2 forward}
The second phase of node filtering works very similarly to phase 1 --- it
operates on all nodes that are present in \texttt{nodeCand1} and filters them further.
It is again implemented in a flooding-style algorithm with a forward and backward
pass, like in phase 1.
However, it starts taking very simple call contexts and access paths
into account to make the analysis more precise.
In phase 2 both the call context and the access path are approximated by booleans.
The boolean for the call context tracks whether the data flow entered through
a parameter or not.
The boolean for the access path tracks whether the access path is empty or not.
The forward pass is implemented in the predicate \texttt{nodeCandFwd2}.
It differs from phase 1 in the following ways:
it skips over methods where the argument value
flows through and uses the summarized dataflow instead.
To take additional non-value-preserving steps into account,
summaries are computed for methods which take at least one
non-value-preserving step as well.
Flow out of a callable is only taken into account when the call context
indicates that we did not end up in the current node via a method call.
This excludes the non-summarized paths from the predicate.
Whenever we encounter a store, we simulate pushing onto the access path
by setting the flag indicating that the access path is non-empty.
Non-value-preserving steps are only allowed when the access path is empty.
When the access path is non-empty, we are tracking an object in which the tracked
data is stored. However, when a non-value-preserving step is applied to that object,
object identity does not hold anymore, and the value of the field with the tracked
data is lost.
Thus, additional dataflow steps are only allowed to be taken when the access path
is empty.
Whenever we encounter a read, we need to simulate a pop operation on the access path.
However, with the limited information we track it is not possible to determine
whether the access path is empty after a read or not.
Thus, we continue with both options considered possible.
Additionally, both reads and stores are only considered for fields that appear in the
predicate \texttt{readCand1}.
Furthermore, just like in \texttt{nodeCandFwd1}, the predicate
\texttt{storeCandFwd2} tracks all fields that are stored into in the dataflow
covered by this pass.
As we only consider fields from \texttt{readCand1}, it is a further refinement
of that set.
Whenever we encounter flow into or out of callables while having a non-empty
access path \textbf{and} we encounter virtual dispatch,
we check that the branch limit holds.
Calls that branch out to many dispatch targets are ignored when tracking objects
into which we stored earlier; what exactly counts as too many is determined by the
configuration.
This pruning can eliminate real dataflow paths from the results, but it is needed
to make the analysis feasible.
\subsubsection*{Node Filtering --- Phase 2 backwards}
The exact same filtering as done in phase 2 forwards from the sources to the sinks
is afterwards applied going backwards from the sinks to the sources.
The predicate for this is \texttt{nodeCand2}.
Furthermore, it also refines the potential fields making up access paths even further,
only recording fields that are present in \texttt{storeCandFwd2} and that are
also read from in the flow covered by \texttt{nodeCand2}.
The fields are stored in the predicate \texttt{readCand2}.
\subsubsection*{Local Flow Summarization --- The Bigstep Local Flow Relation}
Before filtering the node set even further, the local flow steps are summarized
into the so-called bigstep relation.
The local flow step relation contains very detailed dataflow information.
For example, the flow from a parameter to a method call
is in general not a single step.
Indeed, every use of the parameter along the control flow graph until the method
call is a part of the dataflow.
Having such a refined model of the dataflow is good for precision, but
reporting every one of these steps to the user would not be useful.
Furthermore, it incurs a considerable performance overhead, as a lot of nodes and edges
are needed to model local dataflow at that level.
To address those problems, a coarser version of the
local dataflow step relation is computed.
However, a simple transitive closure over the step relation would lose too much
information, as it would skip over interesting local flow steps,
like storing the data in a field, or calling methods with the information as
argument.
The \emph{local flow bigstep relation} strikes a balance between the refined
local flow step relation and its transitive closure.
It conflates steps until it reaches nodes that are considered to be interesting.
The intent is to skip over steps that are irrelevant in the presentation
to the programmer, while still
reporting enough steps in the dataflow graph that the programmer can understand
where exactly dataflow is happening.
It thus stops conflating steps when it encounters casts, flow into or out of callables,
jump steps, calls to methods where the argument value flows through (see above), field
stores and reads as well as when it reaches a sink.
Cast nodes are important stopping points, because in later steps the dataflow
algorithm prunes paths based on the recorded type information to rule out
paths where the type is not compatible.
The bigstep relation also keeps track of whether non-value-preserving steps have been used
in the summarization, because, as explained above, those steps are only allowed
when the current access path is empty.
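For illustration, the comments in the following hypothetical method mark points at which
the bigstep relation stops conflating local steps (the class and the \java{sink} method
are invented):
\begin{listing}[H]
\begin{javacode}
class BigstepExample {
    Object field;

    void example(Object source) {
        Object a = source;
        Object b = a;               // the plain local steps up to here are
                                    // conflated into a single bigstep
        String s = (String) b;      // cast: bigstep boundary, allowing later phases
                                    // to prune paths with incompatible types
        this.field = s;             // field store: another boundary
        sink(this.field);           // field read and sink: further boundaries
    }

    void sink(Object value) { }     // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of the stopping points of the bigstep relation}
\label{lst:bigstep-sketch}
\end{listing}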
\subsubsection*{Node Filtering --- Phase 3 forward}
The next filtering step filters the candidate nodes from \texttt{nodeCand2}
and the candidate fields for the access paths from \texttt{readCand2} even further.
It is implemented in the predicate \texttt{flowCandFwd}, which stores the candidate nodes,
and in the predicate \texttt{consCandFwd}, which tracks the fields that potentially
make up access paths.
The main differences to \texttt{nodeCand2} are that the bigstep relation is
used for local flow (thus reducing the number of nodes needed),
that the access path is approximated more precisely than by just a boolean,
and that types are checked along the potential dataflow paths.
Call contexts are still approximated by only a boolean that indicates whether
in the current context, dataflow came from a method parameter.
To approximate the access path, we now store the front of the access path.
The front of an access path is either Nil if the access path is empty,
or the head of the access path (a field).
This is clearly more accurate than the boolean that just stores whether the
access path is empty.
This approximation relies on the observation made by Semmle engineers that
in most programs the current head of the access path is a
good predictor for the next element of the access path.
Furthermore, given the current head of a non-empty access path, we get the
type of the currently-tracked object for free.
Thus, by tracking type information when the access path is empty, we can type-check
the whole path with low overhead.
The access path handling is implemented in the following way:
Sources enforce that the access path front is Nil.
Furthermore, non-value-preserving steps can only be used in contexts with
a Nil access path front.
Again, like in phase 2, we ignore calls that branch out to many dispatch targets when the
access path front is not Nil.
When encountering a field store, we simulate a push onto the access path by replacing
the current access path front with a new front that uses the stored-to field
as head.
In contrast to a pop, this operation is unambiguous, as the new head is known exactly.
However, when encountering a field read, we need to simulate a pop operation.
As we throw away the information about the remainder of the access path as soon as
we encounter a store, we have no reliable way to know what to set the access path front
to.
Thus, we explore all possible options and consider all access path fronts that could fit.
More precisely, during a read from a field \texttt{f},
we explore data flow for all access path fronts that were encountered before
any store into the field \texttt{f} in flow covered by phase 3.
% TODO type check - why?
The example in~\autoref{lst:accesspathfronts} shows how the access path approximation
used in the forward pass of phase 3 of the node filtering works in practice.
The read of \java{box.fo1} in \java{flowMethod} is ambiguous,
because there are two places in the program where data is stored into the box
(the assignments \java{box.fo1 = a1} in \java{flowMethod} and in \java{flowMethod2}),
and these two places have two different access path fronts.
Thus, after reading from \java{box.fo1}, both access path fronts are candidates
and the algorithm cannot distinguish between them.
To create this ambiguity, \java{flowMethod2} is provided, which stores
into the second field of class \texttt{A}.
\begin{listing}[H]
\begin{javacode}
class A {
    public String stringField1;
    public String stringField2;
}

class Box {
    public A fo1;
}

class Flow {
    // flow enters via the parameter source, access path front is Nil,
    // type is String
    public void flowMethod(String source) {
        A a1 = new A();
        // field store, access path front changes to the head for the field
        // stringField1 in class A - the new type of the object we're tracking is A
        a1.stringField1 = source;
        Box box = new Box();
        // field store, access path front changes to the head for
        // field fo1, class Box - the new type we're tracking is Box
        box.fo1 = a1;
        // field read, access path front needs to be restored.
        // two options, either A.stringField1, or A.stringField2
        // the new type we're tracking is A
        A a2 = box.fo1;
        // field read, access path front becomes Nil, tracked type String
        // as there is no read from A.stringField1 anymore on this path
        // it is a dead end and will be filtered out in the backwards pass
        String trackedObject = a2.stringField1;
        // flow of the tracked object to a sink
        sink(trackedObject);
    }

    public void flowMethod2(String source) {
        A a1 = new A();
        a1.stringField2 = source;
        Box box = new Box();
        box.fo1 = a1;
        A a2 = box.fo1;
        String trackedObject = a2.stringField2;
        sink(trackedObject);
    }
}
\end{javacode}
\caption{Example that shows how the access path front approximation works}
\label{lst:accesspathfronts}
\end{listing}
Types along dataflow paths are mostly checked by the compiler.
For example, we can be sure that the types of the left- and right-hand side of an
assignment are compatible.
However, there are a few places where the final type check is either deferred to
runtime (casts) or where the dataflow analysis could get confused.
Thus, we prune dataflow nodes using the recorded type information at nodes that represent
casts, parameters and returns.
These nodes are only candidate nodes if their type
matches the type recorded in the access path front.
% TODO maybe with example as well?
\subsubsection*{Node Filtering --- Phase 3 backwards}
This pass is symmetric to the forwards pass.
The resulting candidate nodes are stored in the predicate \texttt{flowCand},
and the fields encountered along the paths are recorded in
the predicate \texttt{consCand}.
Both the node and the field candidate sets are subsets of the ones generated
in the forward part of phase 3.
\subsubsection*{Node Filtering --- Phase 4 forward}
The fourth phase is the last phase of node filtering.
Afterwards, the node set consists of all nodes occurring in the transitive closure of
the actual step relation (and is thus not computed explicitly anymore).
The forwards pass is implemented in the predicate \texttt{flowFwd}.
The set of candidate fields that make up the access path is not
restricted further.
The fourth phase still approximates call contexts by a boolean indicating
whether we entered the current context via a parameter or not.
However, the access path approximation is more accurate.
This is in fact the last access path approximation done by the algorithm.
Storing the full access path is infeasible, as recursive data structures
would lead to access paths of unbounded size.
As lists in QL have to be fully instantiated, this would be impossible to
represent.
Furthermore, even for access paths of finite length $N$, the number of access paths
can grow exponentially in $N$.
This can be seen by looking at simple binary trees, where an access path
of a leaf node is exactly a path from the root to a leaf node, and for a tree
of depth $N$ (the length of the access path) there are $O(2^N)$ different paths.
Thus, the maximum length of access paths has to be chosen carefully to not
cause a combinatorial explosion.
The approximation chosen for access paths is that the first two elements of
the access path are tracked exactly, and for the remaining elements only the length is tracked.
Furthermore, only access paths up to length 5 are considered at all.
Again, like in phase 3, the access path always tracks the type information of the
dataflow, even when the access path is empty.
Pushing onto such an approximation is easy: we just prepend the new field to the
existing list.
Should the resulting list have more than 5 entries, the dataflow path is not considered
at all.
This helps performance, for example in the case of recursive data structures,
where there are candidate dataflow paths of unbounded length.
As in phase 3, popping from a list approximation is more complicated.
However, in phase 4 we know exactly
what to restore the access path to after a read for short access paths.
When the access path has length one or two, popping from that list
is possible with full precision.
In the case of access paths of length three to five, we can precisely determine
the new head \texttt{f1} of the access path, as we store the second element.
However, the second element \texttt{f2} in the access path after the read cannot be determined.
Again, like in phase 3, we have to resort to just trying all possible options.
For the second element, we consider all possible fields that occur in dataflow
covered by phase 3 (i.e.\ exactly the fields recorded in \texttt{consCand}).
This is further filtered by just considering fields that occur along any access path
where the head of that access path is the field \texttt{f1}.
This ensures a relatively high precision in most cases.
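For illustration, consider an access path \java{[f1, f2, f3, f4]} (head first): it is
represented by the pair \java{(f1, f2)} together with the total length 4.
Pushing a field \java{g} yields \java{(g, f1)} with length 5; pushing once more would
exceed the limit of 5, and such a path is dropped.
Popping yields length 3 and the exact new head \java{f2}, while the new second element
is unknown and is therefore approximated by the fields from \texttt{consCand} that occur
in access paths with head \java{f2}.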
\subsubsection*{Node Filtering --- Phase 4 backwards}
The backwards pass of phase 4 is implemented in the predicate \texttt{flow}.
This is the final candidate set of nodes on which the step relation is computed.
\subsubsection*{The Step Relation}
The step relation is computed in the predicate \texttt{pathStep}.
It can be viewed as the edge set of a directed acyclic graph,
the dataflow graph.
The node set of that graph is the union of all nodes that are either the beginning
or the end of a single dataflow step.
Taking the transitive closure of the step relation yields a relation that indicates
which sinks are reachable from a given source.
It reuses the same access path approximation as phase 4 of the node filtering,
but keeps track of more detailed call contexts.
It tracks the context of the current start node of the step, i.e.\ whether it is
\begin{enumerate}
\item in a context where the call context doesn't matter
\item in a context where data entered via a parameter. The parameter is recorded,
but the originating call itself is not tracked.
\item in a context where data entered via a parameter. The originating call,
as well as the parameter through which the dataflow entered are recorded.
This type of context is only used when it helps the virtual
dispatch analysis (see the example above). Otherwise, call contexts that don't
track the precise originating call are used.
%\item TODO something about returns (needs Anders)
\end{enumerate}
Distinguishing between entering via a parameter without tracking the exact originating
call and tracking the originating call helps performance, as
fewer call contexts have to be tracked.
In general, we do not have a model of a call stack.
Thus, when encountering a call, all possible targets of the call have to be considered.
When encountering a return, it is not clear which caller should be returned to,
so all callers are considered.
When taking virtual dispatch into account, that means that all callers of an interface
method are candidates to return to, even though they might call a different
implementation.
%TODO some call contexts make it possible to restrict this set of possible return
%targets.
The steps taken are the same as in the node filtering passes;
the only difference is that both the start and the end point of each step are
recorded.
\subsubsection*{Dataflow}
Using the step relation, it is now almost trivial to check if there is dataflow
between sources and sinks.
One only needs to use the transitive closure of the step relation, as it contains
all nodes reachable from sources, and restrict it to those nodes that are sinks.
All of those are reachable from sources, with the dataflow paths being encoded in the
\texttt{pathStep} predicate.
% TODO: why localFlowStep flows through casts?
% simpleLocalFlowStep (Util): Follows SSA adjacent uses and def-first-use.
% Data also flows from assignments to asignees (Check word??)
% interestingly enough, also through casts
% TODO: local data flow: We have flow from PostUpdateNodes to the next adjacent read,
% but we also have flow from the pre-update node to the next adjacent read?
% Isn't that (more or less) two paths for the same?
% flowthrough matches callsites with returns (flow summary)
% start of considered nodes: nodeCandFwd1(node, config) - simple local flow
% node is reachable via dataflow from the source,
% including local flow, field stores, reads, flows into callables via parameters
% and flow out of callables via modified parameters (takes VD into account)
% doesn't take call contexts into account
% then nodeCand1 filters nodeCandFwd1 by just keeping all nodes which are
% reachable from a sink (through the same steps as before)
% nodeCand2: very simple call contexts, TODO find out exact differences to Cand1
% nodeCand:=nodeCand2 excluding call contexts
% bigstep relation: localFlowBigStep
% flowCandFwd0 something something accesspathfront, starts again with flooding
% from the source with all nodeCand2's
% flowCand0 filters flowCandFwd from the sink backwards to the source
% flowCandFwd = flowCandFwd0 AND CastingNode compatible type check
% flowCand = flowCand0 AND flowCandFwd
% flowCand0 uses flowCand and flowCandFwd, so mutual recursion, goes backwards from sink
% TODO: Sections about casts
% TODO note about numerical types
% TODO: simpleArgumentValueFlowsThrough is not mentioned+only in empty AP
% ANDERS:
% consCandFwd why type check? Every store in a field needs to be compatible
% with the type of the field?
% TODO mention a bit more about the general design/ node exists versus computing explicit
% step relation - we have an implicit union over the different step relations, but we don't connect
% them to paths just yet, because that would've size NxN instead of size N
% TODO callable versus method