4-path-sensitivity.tex

% !TEX root = main.tex
\section{Path Sensitivity}
\subsection{Motivation}

As described previously, the dataflow library conflates all control flow paths 
in a function and just reports possible flow through a function.
This is obviously an overapproximation of flow.
For example in~\autoref{lst:ex-ps-1} flow would be reported,
although it is quite obvious that no flow can happen at runtime,
because the to if conditionals evaluating \java{b} are mutually exclusive.
In one call of \java{f(b)}, only one of the blocks is executed.
Thus, no dataflow happens.

To address this issue,
it makes sense to add a notion of path-sensitivity to the dataflow library.
This path-sensitivity tracks that some path contexts make it possible to 
filter out dataflow paths that will never occur at runtime,
because a path cannot visit dataflow nodes in mutually exclusive blocks.
As in~\autoref{sec:call-sens}, our goal is to not introduce new sources of 
unsoundness, so a pruned path by our proposed algorithm
is never allowed to be a live path.

\begin{listing}[h]
    \begin{javacode}
public void f(boolean b) {
    Object o = null;
    if (b) {
        o = source();
    }
    if (!b) {
        sink(o);
    }
}
    \end{javacode}
    \caption{Simple example of path-sensitive dataflow}
    \label{lst:ex-ps-1}
\end{listing}

The issue of path sensitivity is not just a theoretical issue, but of practical importance.
One analysis that runs on LGTM.com is the nullness analysis.
It warns when developers (potentially) dereference a pointer that is null.
Right now, it is built separately from the dataflow library and only works 
intraprocedurally.
However, it could be implemented on top of the dataflow library.
The formulation of the nullness analysis as dataflow problem is:
Any literal \java{null} and any variable declaration of an object 
without initialization is a source.
Any pointer dereference (i.e.\ field or method access on an object) is a sink.
Then any assignment to a variable with a non-null expression leads to flow to stop.
This approach would extend the nullness to be an interprocedural one, and furthermore 
reduce the amount of code that needs to be maintained.

Unfortunately with the current dataflow library, the resulting analysis would 
be way too imprecise to be feasible.
Often, the nullness-status of a variable is correlated with the value of a Boolean 
variable, like in~\autoref{lst:ex-ps-nullness}.

% TODO bla
While having path-sensitivity as presented in this section would not be enough 
to enable the precise analysis of~\autoref{lst:ex-ps-nullness},
having a concept that some dataflow nodes are only active in some program paths 
is an important prerequisite to implement this analysis.
The reason that the presented analysis wouldn't work on that example is that
the proposed path-sensitivity focuses on identifying mutually exclusive 
blocks like in~\autoref{lst:ex-ps-1}.
So the deduction of the fact that whenever \java{b} is \java{true},
then \java{o} is overwritten with something non-null is missing.
Adding this to the dataflow analysis would be another step needed to 
implement a precise nullness analysis on top of it.

\begin{listing}[h]
    \begin{javacode}
public void A(boolean b) {
    SomeClass o = null;
    if (b) {
        o = new SomeClass();
        o.method1();
    }
    // some more code
    if (b) {
        o.method2();
    }
}
    \end{javacode}
    \caption{Nullness example that needs path-sensitivity}
    \label{lst:ex-ps-nullness}
\end{listing}


The goal of this section is to describe our approach to add a notion of
path-sensitivity to the dataflow library.
Then we highlight where and why our approach introduced soundness issues
that made it unacceptable for Github to merge the code.
However, Github still benefits by learning more about the problems
concerning implementing path-sensitivity and that this approach does not work.
Furthermore, over the course of the project a lot of test cases containing
corner cases for path-sensitivity were contributed.
%TODO is there?
%At last, we propose a possible fix to our approach. 
% TODO possible fixes?
%Because we ran out of time, it was not possible to implement 
%and evaluate the fixed approach, though.


%\subsection{Theoretical Background}
%Is there any?

\subsection{Description of Implementation}
\subsubsection*{\texttt{Property}s and local flow filtering}
The first part of the implementation introduces the concept of a \texttt{Property}
to the local dataflow analysis.
A property is a fact that, given a Boolean, holds at one or more dataflow nodes.
Whenever a property holds with \java{true}, it is guaranteed that the dataflow nodes 
at which the property holds with \java{false} will not be live.
The concept of a property abstracts away from dealing with boolean condition variables
directly. It is intended to be easily extended.

The intuition behind a property is that it abstracts that a given 
dataflow node is guarded by a Boolean condition variable.
Thus, in~\autoref{lst:ex-ps-1} the property associated with the boolean variable \java{b}
holds with the value \java{true} for all dataflow nodes guarded by
\java{if (b)}, and it holds with the value \java{false} for all dataflow nodes 
guarded by \java{if (!b)}.
It does not hold for other dataflow nodes in the program.

In the current implementation, the only inhabitant of the \texttt{Property}
type is the \texttt{BooleanSsaVarProperty}, that represents the concept of
a Boolean variable that guards dataflow nodes.
While there are no other subtypes of \texttt{Property} yet, an implementation
for enum constants instead of Boolean variables would be fairly easy to write.

With the property concept in place, the local flow 
is filtered.
We only filter local flow to reduce the computational complexity of the problem,
as opposed to filter interprocedurally.
For any property that holds at any of the nodes of a local flow path,
the local flow path has to be compatible with that property, i.e.\ the same 
property never holds with both \java{true} and \java{false} for that local flow path.
If there are no properties that hold along a local flow path, nothing is filtered.

Note that this approach to filtering fixes one property first, and then filters 
for that, and then considers the next property independently of the first one.
Thus,~\autoref{lst:ps-diamond} is reported as having flow.
First, when considering the boolean \java{b1}, it is inferred that flow is only 
possible if \java{b1} is \java{false} 
(otherwise, the source is never part of the flow).
Then, independently of that it is inferred that flow is only possible if \java{b2}
is \java{false} (otherwise, the sink is never part of the flow).
However, the information about those properties are never combined to detect 
that the assignment \java{y = x} can never happen while both reaching the 
source and the sink.

\begin{listing}
    \begin{javacode}
public void f(boolean b1, boolean b2) {
    Object x = null;
    if (!b1) {
        x = source();
    }
    Object y = null;
    if (b1) {
        y = x;
    }
    if (b2) {
        y = x;
    }
    if (!b2) {
        sink(y);
    }
}
    \end{javacode}
    \caption{Example of false positive because of a diamond-shaped control flow graph}
    \label{lst:ps-diamond}
\end{listing}

A supposedly easy fix to correctly filter that false positive,
would be that the implementation considers a list of currently active
properties that need to be checked against at the same time.
For most programming languages, this is not a problem.
In QL however, lists are quite tricky.
They are not a first-class datatype, and because in QL all user-provided datatypes 
are fully instantiated, they need to be fully constructible.
Thus, any "list" datatype would need to be fully instantiated, this means it would
need to list all possible lists ahead of time.
This makes lists very complicated to deal with, and often not very useful.
Thus, lists do not make for a quick fix for this issue.

\subsubsection*{The Boolean SSA Variable property}
Now that we have seen how properties are used, how exactly do they work internally?

Conceptually, we want to capture the fact that a boolean variable guards 
control flow.
For that, we can rely on the guards library (as discussed shortly in~\autoref{sec:call-sens}).
Exposing all boolean variables as properties would incur a too high performance penalty,
though.
Because of that, we pre-filter the set of boolean variables used as guards.
We only allow Boolean variables that guard at least two different blocks,
with inverted conditions.
This means that the property is useful for pruning later.

Note that this implicitly assumes that pruning is always okay when we encounter
two blocks that cannot be live at the same time, because they read the same variable.
As we see in~\autoref{ssec:soundness}, this is unfortunately not true.
However, we still describe the proposed algorithm fully to present the insights
learned during this project.

The equality relation on Boolean variables used is equality on the 
SSA form of the program - after assigning to a variable, it most likely does not
contain the same value as before the assignment, so we treat it as a different variable.
Thus, the two guards (for example, two if conditionals) need to read the same SSA 
variable $b_1$ in their conditions, where one guard checks for $b_1$ to be \java{true},
and the other guard checks for $b_1$ to be \java{false}.
Then all the nodes that are guarded are candidates for dataflow pruning.

Note that this approach assumes that a SSA variable always evaluates 
to the same value.
However, this is not true.
In~\autoref{lst:ps-ssa-var-different}, the variable \java{b} has one SSA definition 
in line 4, but different values at run time, depending on the iteration.
Note that the SSA definition dominates all reads, thus no $\phi$-node is
needed.
We can conclude that relying that \java{b} is the same SSA variable 
does not guarantee 
that it always evaluates to the same value.
Without that guarantee, any pruning based on \java{b} would introduce 
unsoundness into the algorithm.

\begin{listing}[h]
    \begin{javacode}
public void f() {
    int j = 0;
    do {
        boolean b = j % 2 == 0;
        Object x = null;
        if (b) {
            x = source();
        }
        if (!b) {
            sink(x);
        }
        j++;
    } while (j < 100);
}
    \end{javacode}
    \caption{Example where the SSA variable \java{b} has different values at run time}
    \label{lst:ps-ssa-var-different}
\end{listing}

An oversimplistic fix for this issue to require that the SSA definition of the
Boolean variable needs to be outside of any loops.
A loop is defined on the control flow graph ---
if the SSA definition viewed as a control flow graph node can reach itself by taking
arbitrary (function-local) steps on the control flow graph, it is contained in a loop.
This covers all cases where unsoundness occurs, but many more legitimate cases.
Furthermore, it would lead to the very strange behaviour that the dataflow analysis 
does not report dataflow because of path pruning, but as soon as the code is 
wrapped in a loop statement, false positives would start to appear.
This kind of behaviour is difficult to explain to the user, and also difficult
to justify theoretically.

Thus, a more involved solution is needed that caputures the
essence of the problem with higher precision.
The key insight is that every time before the SSA variable is re-defined, it 
goes out of scope, so technically we have been treating different variables
as equal, just because they had the same name.

Thus, we need to ensure that whenever we prune dataflow based on Boolean variables,
all reads of the SSA variable refer to the SSA variable \textbf{in the same scope}.
Then, the assumption that the SSA variables have the same value is actually true.

As~\autoref{lst:ps-ssa-different-scopes} demonstrates,
the issue of whether the second read of \java{b} (outside of the loop)
refers always to the same definition of \java{b} that the first read of \java{b}
uses (inside the loop) is dependent on the details of the loop execution.
Thus, due to the halting problem, we cannot aim to find a precise solution to this.
However, we only need a conservative heuristic that tells us when
two reads will provably refer to the same definition of the SSA variable.
This approximation would not mark the two reads of \java{b}
in~\autoref{lst:ps-ssa-different-scopes} as referring to the same SSA definition.

We propose the following notion of that heuristic:
A SSA Boolean variable \java{b} with two reads is eligible for pruning 
if all control flow paths from one read of \java{b} to the SSA definition of
\java{b} include the other read as node on that control flow path.
This ensures that everytime one read gets a value from \java{b}, the other read
also gets it.
Furthermore, it covers the case that \java{b} is defined in a loop (possibly
multiple times), but both reads happen after the loop (thus, the value of \java{b}
has "settled" before reading it).
The case that both reads of \java{b} are in the loop defining \java{b} is also
included.

The case that one read of \java{b} happens inside the loop, and one outside of 
the loop (like in~\autoref{lst:ps-ssa-different-scopes}) is correctly covered - 
there exists a control flow path from the first read of \java{b} to the SSA 
definition of \java{b} that does \textbf{not} visit the second read of \java{b}.
Thus they might not get the same value of \java{b}, thus pruning is not safe
in this case.

\begin{listing}[h]
    \begin{javacode}
public void f(int k) {
    int j = 0;
    boolean b;
    Object x = null;
    do {
        b = j % 2 == 0;
        if (b) {
            x = source();
        }
        j++;
    } while (j < k);
    if (!b) {
        sink(x);
    }
}
    \end{javacode}
    \caption{An example where reads of 
    \java{b} might refer to different SSA definitions}
    \label{lst:ps-ssa-different-scopes}
\end{listing}

In general, this approach would need a formal argument to support it.
This argument would show that every time this heuristic indicates that two
SSA reads read from
the same definition, they actually read the same value.
This would convince the reader that this is indeed the right approach 
to tackle this problem.
However, as the whole proposed algorithm suffers soundness problems as described 
in the next subsection, a formal justification is omitted.

\subsection{Soundness Problem}
\label{ssec:soundness}
Even with the quite complicated logic in place ensuring that only the same 
instance of an SSA variable is read when pruning, unsoundness can occur.

In~\autoref{lst:ps-false-negative}, the reads of the SSA variable \java{b} clearly
fulfills all conditions described above, thus \java{b} and it two reads
are eligible to be used for dataflow pruning.
After pruning the example does not report dataflow.
However, in one loop iteration data flows from the source to \java{x},
and in another loop iteration the data flows from \java{x} to the sink.
Ergo pruning this dataflow is wrong.
This shows that the assumption that pruning dataflow is sound whenever two mutually 
exclusive reads of a SSA variable (even when always having the same value!) 
happen is wrong.

\begin{listing}[h]
    \begin{javacode}
public void f() {
    boolean[] boolArray = {true, false};
    Object x = null;
    for(boolean b: boolArray) {
        if (b) {
            x = source();
        }
        if (!b) {
            sink(x);
        }
    }
}
    \end{javacode}
    \caption{Example of a false negative with the proposed algorithm}
    \label{lst:ps-false-negative}
\end{listing}

An easy fix for this would be to return to the first proposed fix above and
only consider boolen variables for pruning that are defined outside of loops,
with all the associated drawbacks discussed above.
Because of those drawbacks, this is understandably
not an option that leads to an algorithm that Github wants to deploy.

However, for a more general sound algorithm, the example shows that it is not enough
to come up with the Boolean variables used for pruning dataflow without also 
looking at the control flow path used by the dataflow path under consideration.

A better algorithm would need to operate on the control flow graph itself to
factor in that knowledge into the dataflow pruning.
This would allow it to see that data flows from the source to the sink in different 
loop iterations.

With the current design of the dataflow algorithm, this is very difficult.
In the very first step of computing (possible) local dataflow, it already 
only looks at possible reads of SSA variables, thus ignoring control flow structures
entirely.
As this is a lossy process, it is not possible to go back from an edge in the
dataflow graph and get the corresponding control flow path used by that edge.
This is intentional by design, to make the dataflow representation as sparse 
as possible, and thus more performant.

Fixing this would require a major refactoring of the dataflow algorithm with 
uncertain outcome.
This is, while unsatisfactory, clearly outside the allocated time budget
of a 15 ECTS project outside the course scope.


%\subsection{Improved Algorithm Proposal}