-
Notifications
You must be signed in to change notification settings - Fork 0
/
3-call-sensitivity.tex
213 lines (184 loc) · 10.1 KB
/
3-call-sensitivity.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
% !TEX root = main.tex
\section{Call Sensitivity}
In this section, we switch gears.
In the previous sections, we looked at how to formalize what unsafe dataflow is, and how
to implement a provably sound dataflow algorithm in QL.
Now we move on to the discussion of an actual, production-grade dataflow library
that is used to analyze Java programs with millions of lines of code.
This dataflow library is obviously much more sophisticated than the implementation given above,
consisting of several thousands line of QL code,
as it deals with the whole Java semantics, tracks flow through fields stored in
objects, and works intraprocedurally.
We describe how we designed and implemented a restricted notion of call sensitivity
for the dataflow library for Java.
Our contribution was accepted by Semmle (now Github), and is live of LGTM.com.
% TODO see if this can be tied a bit more stronger to the theory
\label{sec:call-sens}
\subsection{Motivation}
A common pattern used by software developers is to
write functions that change their behavior based on
a Boolean parameter.
This happens when two functions are very similar, but only have different behavior
in one small area of the function.
Then a good abstraction is to merge these two functions into one function that takes
a Boolean parameter, toggling the two different sub-functionalities.
One example of this is a function that, depending on where it is called,
should write something to a logfile or not.
Having two copies of that function, one that writes to a logfile, and another one
that doesn't would be a bad design.
As in this example can be determined statically at the callsite whether the logfile
should be written to or not, a constant Boolean argument is passed to the called method.
On the other hand, sometimes programmers even use a top-level
\java{if} statement, so depending on the
value of the Boolean parameter, a function $f$ contains two entirely different implementations.
In those cases, a design with two separate functions containing both implementations,
i.e.\ $f_1(\ldots) = f(\text{true}, \ldots)$ and $f_2(\ldots) = f(\text{false}, \ldots)$
would be preferable from a software engineering perspective. But as this patterns
occurs in practice, it makes sense to handle it.
In the context of static analysis it makes sense to support all patterns
used by software developers.
Thus, it should not make a difference to the static analyzer
if the method $f(\text{true}, \ldots)$ or $f_1(\ldots)$ is invoked.
However, before the implementation of this project, the dataflow library did not
account for this.
Even in the presence of constant Boolean parameters for \java{if}-conditionals,
both branches were analyzed.
This lead to avoidable over-reporting of dataflow.
For example in the Apache Hadoop project, an analysis\footnote{\url{https://lgtm.com/query/6851478535536757210/}}
reveals over 5400 call sites
where a constant Boolean parameter is provided that later influences control flow.
Thus, it makes sense not only theoretically to work on this problem, but it also
solves problems occurring in practice.
The goal of this section is to remove false positives like in~\autoref{lst:fp-cs}
from the dataflow library.
Furthermore, no new sources of unsoundness should be introduced.
The example in~\autoref{lst:fp-cs} has a false positive,
as the dataflow analysis without our changes is unable to avoid.
This is because it does not know that \texttt{param} is constant and evaluates
to \java{true}
in the context of \texttt{B} calling \texttt{f}.
To remove these false positives, the local dataflow
analysis is made call context sensitive, so only relevant branches of a function
are analyzed when computing local
(and in later stages of the algorithm, interprocedural) dataflow.
\begin{listing}
\begin{javacode}
public void f(boolean param, Object o) {
if(param) {
// do X
}
else {
// do Y
sink(o);
}
}
public void B() {
Object tracked = source();
f(true, tracked);
}
\end{javacode}
\caption{Flow is reported from the source in line 12 to the sink in line 7}
\label{lst:fp-cs}
\end{listing}
\subsection{Theoretical Background}
Static analysis often has to deal with the fact that analyzing individual methods
is much cheaper than running an interprocedural analysis.
Various methods have been developed to make analyses context-sensitive.
A thorough treatment can be found in Chapter 8 of~\cite{spa}.
Here, we use the theoretical approach of call strings, but optimized to our purpose.
The call string (or call context) only keeps track of the Boolean parameter, not
all parameters.
Furthermore, call contexts are only tracked for functions where it actually makes sense
to do so --- if there are no eliminable dataflow nodes in a function,
no call context is tracked.
This is a mere performance optimization and does not affect correctness.
From a semantic perspective, the correctness of parameterizing a
dataflow algorithm with a call context is obvious ---
propagating a constant boolean from a callsite to the callee does not change
control flow.
The algorithm then prunes provably unreachable nodes from the dataflow graph.
These are all the nodes that are only reachable when the Boolean parameter is set
to the opposite value.
This approach does not affect soundness,
as nodes that are not reachable at run-time will never be part of dataflow.
The completeness of the algorithm
is improved in case an unreachable node was part of a dataflow path.
Thus, the computation of dataflow is improved, and from a theoretical perspective
there are no drawbacks associated with the changes.
\subsection{Description of Implementation}
The QL dataflow algorithm works by computing sets of candidate nodes that
in each phase are filtered to exclude nodes that are not part of the final dataflow graph.
Call context sensitivity is also implemented as a filter of candidate nodes
that are available for local flow steps.
First, we identify nodes that should be filtered in the
predicate \texttt{isUnreachableInCall(Node n, DataFlowCall call)}.
It computes the set of all nodes in the dataflow graph that are unreachable
in a given call.
It does so by utilizing the Guards library.
The Guards library is an abstraction that allows its users to detect
basic blocks in the control flow graph that are only reachable if an
expression (for example, a variable read) is true (or false).
A node is unreachable in a given call if
\begin{itemize}
\item one of the arguments in that call is Boolean
\item that Boolean is either a compile-time constant (as defined by \cite{jls}, section 15.28)
or a read of an SSA variable that is defined using a Boolean compile-time constant
\item the Boolean parameter is used in an expression that guards a basic block
\item the unreachable node is in that basic block
\end{itemize}
The analysis takes virtual dispatch into account by relying on the virtual
dispatch library.
For example,
arguments of method calls on interfaces are matched up with methods
on interface implementations.
If no node in a callable is unreachable in a call,
recording precise call contexts for the sake of eliminating dataflow paths
is not necessary.
During the computation of the step relation, the dataflow algorithm
decides whether to record a precise call context or not.
This check is modified to record precise call contexts whenever that call context
then allows the removal of unreachable nodes in that call.
The computation of the step relation relies on the local flow bigstep relation
to take local flow steps.
The computation of the local flow bigstep relation is modified to include checks
that all nodes that are part of a local bigstep step are not unreachable in the
given context.
% TODO what about this?
%The correctness of the implementation was not formally proven, but we provided
It should be noted that the implementation of this feature was contributed to Semmle/Github,
reviewed by their engineers and is live on the LGTM.com platform.
The final commit of the contribution contained a diff of 225 added lines and
79 deleted lines of QL code for Java.
Furthermore, our contribution contains 297 added lines of test code.
These are unit tests that test the contribution in different settings, to ensure
that the feature doesn't break in case of future changes to the product.
We provided both test cases that exhibit flow in the presence of boolean
arguments, and that don't exhibit flow (with our contribution).
Those tests build confidence that the implementation does filter out unreachable
dataflow paths, but does not filter out reachable dataflow paths.
Based on this contribution for Java, engineers at Github adopted
the approach and introduced call-sensitive path pruning for C\#.
%\subsection{Evaluation}
%Maybe find a real-world false-positive that has been removed, Anders had one
% or two in JDK
\subsection{Future Improvements to Call Sensitivity}
The analysis described in this pattern can be further generalized, as
this pattern applies to more than just functions that take a Boolean parameter.
Boolean types are generalized by enumerated types, providing (usually) more than two options.
In fact, using an enum-typed variable over a Boolean variable can be the better
from a software engineering perspective
even when just encoding a choice of two options.
Choices such as left and right are better represented
by enum constants \texttt{LEFT} and \texttt{RIGHT}, rather than the
non-descriptive Boolean constants \texttt{true} and \texttt{false}.
Thus, there is a theoretical justification of also looking at enum types.
Practically speaking, we find that the Apache Hadoop project has almost 1000
function calls\footnote{\url{https://lgtm.com/query/8607714098277820862/}} with
enum constants as parameter.
In all these calls the enum parameter influences control flow in the called method.
Enum constants were not implemented in this project.
However, our architecture is designed to be easily extended.
Only the predicate \texttt{isUnreachableInCall} would need to be modified
to also detects unreachable basic blocks in the presence of enum constant arguments.
Thus, adding support for enum constants is a software engineering effort
with no significant academic interest.