% !TEX root = main.tex
\section{Description of the Dataflow Library}
In bug finding, it is often useful to consider where data (often user input)
ends up, and how it reached that point in the program.
Generally, we would like to warn developers about security issues.
Many security issues stem from user input that is not sanitized (or is improperly sanitized).
For example, in a web application it constitutes a security issue (an SQL injection)
if unsanitized user input reaches a function that executes a SQL statement,
as this gives the user
the ability to run arbitrary SQL statements.
Except for database query tools, this is undesired behaviour that can be abused
by malicious users.
In the same spirit, it is a security issue if a web application displays
unsanitized user input directly on the web page, as this allows the user to execute
JavaScript in the context of the page. This is called an XSS (cross-site scripting) injection.
Warning about both SQL and XSS injections requires, among other things,
a concept of where user input will end up in the program.
We call this concept \emph{dataflow}.
Data enters the program at some point, at a \emph{dataflow source}, passes through
\emph{dataflow steps} (the data from the source is copied around, passed as a parameter
to function calls, \ldots) and then ends up in a \emph{dataflow sink}.
A sink could be a call to a relevant function, for example a function
executing an SQL statement.
All dataflow from sources via one or more steps to sinks makes up the
\emph{dataflow graph}.
Nodes in the dataflow graph are points in the program that can hold data,
and edges in the dataflow graph are dataflow steps.
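As a concrete, hypothetical illustration, consider the following fragment, in which the
invented methods \java{readUserInput} and \java{executeSql} stand in for a real source of
user input and a real SQL-executing sink:
\begin{listing}[H]
\begin{javacode}
class InjectionExample {
    void handleRequest() {
        String input = readUserInput();   // dataflow source: data enters the program here
        String copy = input;              // dataflow step: the value is copied
        executeSql(copy);                 // dataflow sink: the value reaches the SQL-executing call
    }

    String readUserInput() { return "user data"; } // invented source
    void executeSql(String query) { }              // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of a dataflow source, an intermediate step and a sink}
\label{lst:source-sink-sketch}
\end{listing}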
A \emph{dataflow library} is concerned with computing an approximation of
the dataflow graph.
As always in static analysis, it is impossible to precisely solve the task at hand.
Calculating an exact dataflow graph would require solving the halting problem.
Because of the halting problem, the dataflow library cannot be sound \textbf{and}
complete. However, it is actually neither sound \textbf{nor} complete.
In general, the dataflow library reports \emph{possible} dataflow, i.e.\ as soon
as there is a path in the control flow graph of the program, dataflow is reported.
Thus, the dataflow library is not complete, as it reports more flow than is present
in the program.
On the other hand, some dataflow paths that do exist are not reported by the
dataflow library, because finding them would be prohibitively expensive.
Thus, the dataflow library is not sound either.
These limitations will be explained later in more detail.
However, a lot of care is taken to be as precise as possible
without sacrificing performance.
Designing a dataflow library that works in theory is not too difficult.
However, designing a dataflow library that performs well on projects with
millions of lines of code is considerably more difficult.
In this section, we will describe the dataflow library used by the CodeQL platform % TODO check marketing material, is this correct?
for analyzing programs in Java.
We will describe the dataflow library without the contributions that are outlined
in the next sections.
The dataflow library tracks only the flow of exact values.
For example, dataflow of a string is interrupted when \java{toUpperCase()} is called.
However, using functions like \java{toUpperCase()} does not prevent XSS or SQL
injection attacks. To find paths on which flow is not interrupted as long as the
content of a variable still depends heavily on the input data
(i.e.\ it is \emph{tainted}), the \texttt{TaintTracking} library exists.
It is built on top of the dataflow library and provides a generic interface for
finding bug classes such as XSS, SQL injection or directory traversal.
While the taint tracking library is probably the biggest user of the dataflow library,
it is implemented strictly on top of the dataflow library.
Thus, we do not discuss it further, except to highlight the hooks in the dataflow
library on top of which the taint tracking library is built.
Because the taint tracking library is built on top of the dataflow library,
all improvements made to the dataflow library immediately also benefit taint tracking.
There is one exception to the rule that changing the data stops dataflow: All numeric types
(integers and floating point numbers) are treated as data of the same type,
as converting from one type to the other often preserves the value.
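The following hypothetical snippet illustrates both points (the method \java{sink} is an
invented stand-in for an arbitrary dataflow sink):
\begin{listing}[H]
\begin{javacode}
class ValueVsTaint {
    void example(String source, int count) {
        String upper = source.toUpperCase(); // not the same value: plain dataflow stops here,
                                             // but taint tracking adds this as an extra step
        sink(upper);

        long widened = count;                // numeric conversion: treated as value-preserving,
        sink(widened);                       // so plain dataflow continues
    }

    void sink(Object value) { }              // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of value-preserving flow versus taint flow}
\label{lst:value-vs-taint-sketch}
\end{listing}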
\subsection{Interface}
Before we delve into the internals of the dataflow library, we describe the API
that users of the dataflow library will encounter.
\subsubsection*{Nodes}
The main concept used throughout the dataflow library is that of a \texttt{Node}.
Every element in the program holding data is a node.
The dataflow library is concerned with the task of identifying
under which circumstances there is flow between nodes.
The set of nodes and the set of dataflow steps between those nodes make up the
\emph{dataflow graph}.
The QL type \texttt{Node} has several subtypes, corresponding to different kinds of nodes.
The most important subclass of \texttt{Node} is \texttt{ExprNode},
as every expression in the program corresponds to such a node.
Another important subclass is \texttt{ParameterNode}.
Both the implicit \texttt{this} parameter for methods and all explicit parameters
given to methods are introduced into a method via these nodes.
In a call, data flows from the node of the argument of the
call to the corresponding parameter node in the callee.
Objects that potentially change through an operation are modelled
via \texttt{PostUpdateNode}s.
This applies to objects that have a method called on them (which may mutate them),
to objects whose fields are accessed, and to arguments of method calls
(the callee could mutate the object).
Such an expression has two nodes: a so-called \emph{pre-update node} that represents
the object before the operation, and a \emph{post-update node} that represents
the object after the operation.
This becomes clearer when looking at the statement \java{a.field = "string"}.
In this example, there exists an expression node for
\java{"string"}, an expression node for the field access \java{a.field},
an expression node for \java{a} and an expression node for the entire assignment.
However, none of these expression nodes capture the fact that the object \java{a}
is mutated by the assignment.
To represent that fact a \texttt{PostUpdateNode} for \java{a} exists,
representing that the assignment (potentially) mutates the object \java{a}.
The corresponding pre-update node is the expression node for \java{a}.
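In code, the situation looks roughly as follows (the class \java{A} and the method
\java{use} are invented for illustration):
\begin{listing}[H]
\begin{javacode}
class A {
    String field;
}

class PostUpdateExample {
    void example() {
        A a = new A();
        a.field = "string"; // the expression node for a is the pre-update node;
                            // a separate PostUpdateNode represents a after the store
        use(a);             // later reads of a receive flow from the post-update node
    }

    void use(A a) { }       // invented helper
}
\end{javacode}
\caption{Hypothetical sketch of pre-update and post-update nodes}
\label{lst:postupdate-sketch}
\end{listing}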
Other subtypes of the \texttt{Node} class exist as well,
but are less important to the bigger picture of understanding the dataflow library.
These subtypes are more technical in nature and correspond, for example,
to instance creation, implicit reads of the \texttt{this} parameter or
the desugaring of vararg method calls.
One peculiarity of the \texttt{Node} class is that the dataflow graph
is stored in separate predicates, not on the nodes making up the dataflow
graph itself.
This means that the dataflow graph is effectively stored as an adjacency matrix,
encoded in the tuple set of a predicate.
%Different kinds of dataflow graphs can be represented by different predicates,
%using the same set of nodes.
% TODO elaborate - or maybe not?
\subsubsection*{Configuration}
The \texttt{Configuration} is the main class a user interacts with.
A configuration defines which nodes are sources and sinks of dataflow.
Furthermore, a configuration can also specify additional steps which yield flow.
For XSS, a configuration could add that a call to
\texttt{toUpperCase()} on a string still provides dataflow.
On the other hand, configurations can also specify nodes which block flow.
In the example, this could be a sanitizer method that strips HTML tags.
This enables the user to filter out dataflow paths that are known to be safe.
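For instance, in the hypothetical snippet below, an XSS configuration could mark the
result of the invented sanitizer \java{stripHtmlTags} as a barrier, so the flow in
\java{safe} is not reported, while the flow in \java{unsafe} is (\java{render} is an
invented sink):
\begin{listing}[H]
\begin{javacode}
class BarrierExample {
    void safe(String source) {
        String clean = stripHtmlTags(source); // barrier node: the configuration blocks flow here
        render(clean);                        // this flow is therefore not reported
    }

    void unsafe(String source) {
        render(source);                       // reported: the input reaches the sink unsanitized
    }

    String stripHtmlTags(String s) { return s.replaceAll("<[^>]*>", ""); } // invented sanitizer
    void render(String html) { }                                           // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of a barrier (sanitizer) blocking a dataflow path}
\label{lst:barrier-sketch}
\end{listing}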
The most important predicate on the \texttt{Configuration} class
is the \texttt{hasFlow(Node source, Node sink)} predicate.
Using that predicate, users can query the dataflow relation.
It holds whenever there is dataflow from the given node \texttt{source}
(which has to be a source) to the given node \texttt{sink} (which has to be a sink).
Note that this interface does \textbf{not} provide information about specific
dataflow paths.
It also limits dataflow queries to dataflow between the (previously defined) sources and sinks.
This design limitation is intentional, because it allows performance optimizations
that make it feasible to compute whole-program dataflow.
The configuration also allows the user to disable dataflow via field reads and stores.
Doing so improves performance and allows the dataflow library to explore all branches
from virtual dispatch.
If field flow is enabled, the user has to set a limit on the number of branches
allowed for virtual dispatch in the presence of field flow.
Method calls via virtual dispatch are only considered as flow targets if there are
sufficiently few possible implementations to analyze.
This restriction is needed to prevent the combinatorial explosion stemming from
fully respecting virtual dispatch and field flow.
This is explained in more detail later.
% TODO is it?
Furthermore, the configuration allows the specification of additional dataflow
steps, as well as barrier nodes.
These concepts are discussed in more detail below.
\subsection{General Design}
In theory, implementing a dataflow algorithm in a logic programming language is simple.
Given the \texttt{Node} concept discussed above, one only needs to provide a relation
\texttt{step(Node node1, Node node2)}.
This step relation holds if \texttt{node1} has dataflow in one step to \texttt{node2}
(covering both local and global steps).
Then, the transitive closure of \texttt{step} contains all possible dataflow paths in the program.
These dataflow paths then can be filtered so that only paths which begin in a source and
end in a sink are considered.
In practice, however, this approach would not perform well.
First, note that the number of \texttt{Node}s grows roughly linearly in the size of
the program.
Second, the transitive closure of the \texttt{step} relation enumerates all dataflow
paths in the program from one node to another.
It is defined on the Cartesian product of \texttt{Node} with itself, which is roughly quadratic
in the size of the program.
We would end up with a predicate that probably grows superlinearly in the program size.
The CodeQL toolchain is capable of handling very large programs,
for which asymptotically superlinear algorithms typically do not scale.
Thus, a more involved approach to compute dataflow is needed.
On a very high level, the algorithm used in the dataflow library is very similar
to the simple algorithm sketched above.
Its last step is indeed taking the transitive closure of a step relation.
However, a lot of care is taken to define the step relation only on suitable
nodes that with high probability will be part of a dataflow path from a source to a sink.
For example, nodes that cannot be reached from any source via a dataflow path
are ignored.
This pre-filtering of suitable nodes makes up most of the code of the
algorithm.
Reducing the node set makes it possible to define the step relation on a much
smaller set that is not quadratic in the size of the program.
\subsection{Description of the Algorithm}
We will now present the details of how dataflow is computed for Java programs.
First, the algorithm runs four phases of filtering, each identifying candidate nodes
that could be part of dataflow paths.
Every pass refines the set of nodes by computing more accurate information about the
possible dataflow.
Recording only candidate nodes (and not candidate dataflow edges) means that
the predicates for node filtering are only defined on the node set,
not on the node set times the node set.
Thus, the candidate node sets are inherently linear in the size of the program.
This makes their computation feasible.
Second, the algorithm computes a step relation and takes the transitive closure of that.
As the step relation is only defined on a very refined set of candidate nodes,
its size is small enough to handle even for large programs.
The implementation of the algorithm is divided into a generic dataflow algorithm that is
used for Java, C/C++ and C\#, and a language-specific implementation of a common interface.
In this description, we gloss over the separation of these two modules and
describe the algorithm formed by the generic module together with the language-specific
module for Java.
The implementation of the dataflow library for Java encompasses roughly 4000 lines of
QL code. Describing every single detail of the implementation would be neither feasible nor
useful. Thus, some details are omitted or the implementation of a particular step
of the algorithm is only described on a very abstract level.
Other parts of the algorithm are described in more depth.
\subsubsection*{Terminology}
Before we can start with the steps of the algorithm, we need to introduce some terminology.
The \emph{virtual dispatch} library provides functionality to resolve calls of
methods through interfaces and class hierarchies.
For any given method call, it provides a list of all potential targets of that
call site.
With an optionally provided call context, it can further narrow down the list of
possible method implementations.
A \emph{call context} encapsulates the information that the analyzed callable was
called from a specific call site, so that information about the arguments to the
method is available.
\begin{listing}
\begin{javacode}
interface I {
    public void f();
}

class A implements I {
    @Override
    public void f() {
        // Implementation A
    }
}

class B implements I {
    @Override
    public void f() {
        // Implementation B
    }
}

class User {
    public void g(I i) {
        i.f();
    }

    public void h() {
        A a = new A();
        g(a);
    }
}
\end{javacode}
\caption{Example code for virtual dispatch resolution}
\label{lst:virtual-dispatch}
\end{listing}
In the example in~\autoref{lst:virtual-dispatch}, the interface \java{I} defines a
method \java{f()}, and the two
classes \texttt{A} and \texttt{B} both implement the interface \texttt{I}.
Without call contexts, the virtual dispatch library would resolve the call
of \java{f()} in the method \java{User.g(i)} to both
implementations in \texttt{A} and \texttt{B}.
However, if call contexts are provided to the virtual dispatch library,
it is able to infer that when \java{User.g(i)} is invoked by the method
\java{User.h()}, it can only dispatch to the implementation of \texttt{f()} in
class \texttt{A}.
This functionality is orthogonal to the dataflow library, and the implementation
of virtual dispatch is not described here.
A \emph{local} flow step is a dataflow step where data flows from one node to another
and both nodes are inside the same callable.
Local flow computation utilizes an SSA form.
A standard textbook SSA construction with $\phi$-nodes is used.
Some examples of local flow steps are:
\begin{itemize}
\item Data flows from the right-hand side of a variable assignment to a
first read of that variable
\item Data flows from a variable read to an adjacent variable read (a use-use chain in SSA lingo).
This means that both nodes read the same variable, and there is no other operation between the
reads, except possibly the presence of $\phi$-nodes.
\item Data flows from a parameter node to a first read of the parameter
\item Data flows from the inside of an expression in parentheses to the dataflow node
representing the outer expression, i.e.\ in the expression \java{(var)} there is dataflow from the node
representing the variable read \java{var} to the dataflow node representing the entire expression \texttt{(var)}.
\item Data flows from a post-update node that represents an object to an adjacent read
of that object
\end{itemize}
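Some of the local flow steps listed above are illustrated in the following hypothetical
snippet (the helper methods \java{f} and \java{g} are invented):
\begin{listing}[H]
\begin{javacode}
class LocalFlowExample {
    void example(String p) {
        f(p);               // parameter node -> first read of the parameter p
        String x = g(p);    // use-use step: adjacent reads of p in f(p) and g(p);
                            // the right-hand side g(p) flows to the first read of x below
        f((x));             // step from the node for the read of x to the node for the
                            // parenthesized expression (x)
        f(x);               // use-use step between adjacent reads of x
    }

    void f(String s) { }             // invented helper
    String g(String s) { return s; } // invented helper
}
\end{javacode}
\caption{Hypothetical sketch of some local flow steps}
\label{lst:local-flow-sketch}
\end{listing}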
There are some more local flow steps that are quite technical in nature.
The local flow computation is sound, i.e.\ it never omits dataflow that is
present in the program.
% TODO check again, we're talking about single steps here, not entire graphs!
% This might be a bit early here
Unlike for global dataflow, the entire local dataflow graph for each function
is computed.
This is necessary because the local dataflow is an important building block in computing
the global dataflow.
A priori it is not clear whether a local dataflow path is part of a (global) dataflow path
from a source to a sink, thus it is not possible to filter paths based on that condition.
As methods are much smaller in relation to the whole program, computing the whole local
dataflow graph is feasible and no combinatorial explosion occurs.
A \emph{jump} flow step is a dataflow step that jumps between callables
in a way that ignores call contexts.
In Java, there are two possibilities for a jump step to occur.
\begin{itemize}
\item \textbf{Static fields:} Data flows from an assignment to a static field
to a read of that static field, if the field read is not determined by SSA.
The field read is only determined by SSA if a local value is assigned to the
static field.
However, in most cases the static field will be assigned to in one callable
and read in another.
\item \textbf{Closures:} Data flows from a use or a definition of a variable
in the outer callable to a use of a captured variable in the inner function.
\end{itemize}
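Both kinds of jump steps are illustrated in the following hypothetical snippet
(the \java{sink} method is invented):
\begin{listing}[H]
\begin{javacode}
import java.util.function.Supplier;

class JumpStepExample {
    static String shared;

    void writer(String source) {
        shared = source;            // store to a static field in one callable...
    }

    void reader() {
        sink(shared);               // ...read of the static field in another callable:
                                    // a jump step that ignores call contexts
    }

    Supplier<String> closure(String captured) {
        return () -> captured;      // jump step from the definition of captured in the
                                    // outer callable to its use inside the lambda
    }

    void sink(String s) { }         // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of jump steps via a static field and a closure}
\label{lst:jump-step-sketch}
\end{listing}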
An \emph{additional} (local or jump) flow step is a dataflow step that the user
provides via the \texttt{Configuration}.
The prime example of additional flow steps are taint steps.
In Java, string concatenation is a taint step, but not a (regular) dataflow step.
The object that flows from the source to the sink is changed by the concatenation,
but, for example, a string that merely has a newline appended to it is still dangerous
to use in a SQL query without further checks.
Note that an additional step (by definition) does not preserve the value of the
variable in question, as all value-preserving steps are already contained in the
dataflow library.
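For instance, in the hypothetical snippet below the concatenation does not preserve the
value of \java{source}, so it is not a plain dataflow step; a taint-tracking
configuration would add it as an additional step, because the resulting query is still
dangerous (the method \java{executeSql} is an invented sink):
\begin{listing}[H]
\begin{javacode}
class AdditionalStepExample {
    void buildQuery(String source) {
        // Concatenation changes the value, so it is not a regular dataflow step.
        // A taint-tracking configuration adds it as an additional step, because the
        // result is still dangerous to execute.
        String query = "SELECT * FROM users WHERE name = '" + source + "'";
        executeSql(query);
    }

    void executeSql(String query) { } // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of string concatenation as an additional (taint) step}
\label{lst:additional-step-sketch}
\end{listing}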
A \emph{barrier} is a node that prevents dataflow.
Barriers are user-specified via the \texttt{Configuration}.
For example, they are used by the taint tracking library to model sanitizing steps
that stop tainted values from flowing to a sink.
% TODO remove this, check for other references to it
%An \emph{extended return} is a generalized return that representing dataflow out of a callable.
%It is either a return statement, which we call a \emph{value return} or a parameter update.
%Parameter updates are often called having an \emph{out parameter}.
%The programmer supplies the reference to an object in a parameter, and the method then
% TODO check what exactly out parameters are in Java, have a look at https://lgtm.com/query/4328056797028651291/
% Should be in fact vacuous
\subsubsection*{Field Flow and Access Paths}
The dataflow library has support for (an approximation of) storing the data in a
field of an object. The details of the approximation used will be described later, but
we will introduce the concepts now.
Whenever data is stored in an object, we need to switch the data we are tracking ---
before a store, we track a value that originated at a source; after a store, the object
containing the field needs to be tracked, together with the information in which
field the originally tracked data is stored.
Conversely, when reading out of the field, we need to switch back from tracking
the containing object to tracking the data we were originally interested in.
Because objects can be stored in objects, just keeping track of whether we are tracking
the value directly or tracking an object that holds it in a field is not enough.
Instead, we need to keep track of a list of all the fields objects were stored into along
the current path.
We call such a list an \emph{access path}.
This access path provides the needed context to track dataflow through fields.
The access path is empty if the data that originated from the source was not
stored in any field.
The \emph{head} of the access path indicates which field was last stored into,
if the access path is non-empty.
The access path is treated as a stack-like data structure.
At every field store, the field is pushed onto the access path and thus becomes the new head.
At every field read, we pop the head from the access path, which restores the old context.
Using the access path, we can precisely track field flow.
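The following hypothetical snippet shows how the access path evolves along one dataflow
path (the classes \java{Inner} and \java{Outer} and the method \java{sink} are invented):
\begin{listing}[H]
\begin{javacode}
class Inner { String data; }
class Outer { Inner inner; }

class AccessPathExample {
    void example(String source) {
        Inner i = new Inner();
        i.data = source;         // store: push data  -> access path [data]
        Outer o = new Outer();
        o.inner = i;             // store: push inner -> access path [inner, data]
        Inner i2 = o.inner;      // read:  pop inner  -> access path [data]
        sink(i2.data);           // read:  pop data   -> empty access path, original value
    }

    void sink(String s) { }      // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of access path pushes and pops}
\label{lst:access-path-sketch}
\end{listing}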
As access paths can potentially be of unbounded length, for example
when dealing with recursive data structures, we cannot handle access paths
with full precision.
Thus, they need to be approximated.
Throughout the algorithm, several approximations of access paths and the push and
pop operation on the access path are used.
These approximations will be discussed in detail.
\subsubsection*{Flow Through Methods}
As a preparatory step, the dataflow algorithm summarizes methods that at some point
return one of their parameters into a single dataflow step.
This is independent of the configuration, and takes neither additional steps
nor barriers into account.
Later steps in the algorithm offset these shortcomings.
The dataflow from the argument node to the node for the value returned by
the method is thus represented by only a single edge.
\begin{listing}[H]
\begin{javacode}
class A {
    String name;

    public A setName(String name) {
        this.name = name;
        return this;
    }
}

class B {
    public void caller() {
        A a = new A();
        a.setName("a name");
    }
}
\end{javacode}
\caption{Example code for flow through method summarization}
\label{lst:flow-through-example}
\end{listing}
In the code example in~\autoref{lst:flow-through-example}, there is dataflow of the instance
of class \java{A} to the method \java{setName}, because the instance \java{a}
is (implicitly) passed as the \java{this} parameter.
Because of the return statement, there is dataflow back from \texttt{setName} to
\texttt{caller}.
After the summarization step, the dataflow at the call \java{a.setName("a name")} is described
in a single step, from the node representing \texttt{a} as the implicit argument of the method
call to the node representing the entire call expression.
This summarization helps the dataflow algorithm in several ways.
In the first passes of the dataflow algorithm, we do not keep track of call contexts
for method calls, as that would be too expensive.
However, that means that calls and returns are not matched up, and every return out of
a method flows to any of its call sites.
For example, some programmers write their setters such that they always return \java{this},
as that allows chaining setters.
If one of these setters has 100 call sites and we did not match up callers with callees,
the dataflow analysis would be very imprecise.
Using method summaries is an inexpensive way
to rectify that problem, as they match up callers and callees.
It also ensures that any dataflow path first returns out of methods, and only
then enters other methods. As soon as a path starts entering methods, it can
never encounter return steps anymore.
Furthermore, summarization provides shorter paths with fewer nodes involved.
This helps performance.
The summarization computes, for every parameter, a relation containing all nodes
that are reachable from that parameter by taking local (value-preserving) dataflow steps.
This marks all dataflow nodes that receive the value of that parameter.
A \emph{parameter flows through a method} if the parameter reaches a
dataflow node associated with a return statement.
Together with the virtual dispatch analysis (not described here), this information
is used to match up arguments to parameters, yielding a predicate
that describes all arguments that flow through their method calls.
% TODO the sentence before is very complicated
These predicates are mutually recursive.
This means that, in the setting of an outer method that calls an inner
method, both methods are summarized if a parameter flows through both of them.
The summarization is run in two passes.
The first pass works exactly as described above.
Its output is then used to detect getter and setter methods.
This relies on the predicate that indicates which dataflow nodes in a method are
reachable from a parameter.
A \emph{getter} is a method where data flows from a parameter node
(the implicit \java{this} argument to methods) to a field read (or recursively another getter)
and then to a return statement.
A \emph{setter} is a method with (at least) two parameters,
where data flows from one parameter to a field store (or recursively another setter),
and the store updates a field in an object that was provided as a parameter as well.
The second pass enhances the detection of method calls where an argument flows through
by also taking \emph{local store-read steps} into account.
These are stores of data into a field that is read later, where both
the store and the read happen in the same method.
This second pass relies on the first pass, as the store and read steps take the
(previously computed) getters and setters into account.
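The following hypothetical class sketches the patterns involved: \java{setName} is a
setter, \java{getName} is a getter, and \java{roundTrip} contains a local store-read
step, so its parameter flows through it:
\begin{listing}[H]
\begin{javacode}
class Person {
    private String name;

    // setter: the parameter name flows to a field store on the other
    // parameter (the implicit this)
    void setName(String name) { this.name = name; }

    // getter: this flows to a field read and then to the return statement
    String getName() { return this.name; }

    // local store-read step: the store and the read happen in the same method,
    // so the parameter value flows through roundTrip
    String roundTrip(String value) {
        this.name = value;
        return this.name;
    }
}
\end{javacode}
\caption{Hypothetical sketch of a getter, a setter and a local store-read step}
\label{lst:getter-setter-sketch}
\end{listing}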
\subsubsection*{Node Filtering --- Phase 1 forward}
Before computing the dataflow graph for a program,
the algorithm identifies several sets of candidate nodes that are possibly part of
that dataflow graph.
These sets are restricted further and further with every filtering pass.
These sets are configuration-specific.
They take sources, sinks, barriers and additional steps from the configuration into
account.
Note that in the description of the passes, \emph{candidate node} always refers
to being a candidate node of that pass, unless otherwise noted.
The first pass is done in the predicate \texttt{nodeCandFwd1}.
It uses a flooding-style algorithm to compute all nodes reachable from the source nodes
using dataflow steps, while ignoring all call contexts and access paths.
This yields a huge overapproximation of the node set of the dataflow graph,
but because it is very simple, it is cheap to compute.
The candidate node set is described recursively and computed by a fixed-point
iteration.
A \texttt{Node n} is a candidate node if it is not a barrier node and
one of the following conditions holds:
\begin{itemize}
\item The node \texttt{n} is a source node.
\item There exists a candidate node \texttt{mid} such that \texttt{n} is reachable from \texttt{mid}
with a single flow step, either a local step, a jump step or an additional step.
\item There exists a candidate dataflow node \texttt{arg}, such that
\texttt{arg} is an argument in a call, and \texttt{n} is a dataflow node
representing the parameter in the callee.
In this case, we say that we have dataflow \emph{into} a callable.
\item The opposite case is that we have dataflow \emph{out} of a callable.
That happens when the node \texttt{n} is a node representing a value returned by
a call, either directly or by being a \texttt{PostUpdateNode} that
represents a parameter that is possibly modified by a call.
Furthermore, there exists a candidate node that returns from the callee.
\item The node \texttt{n} is a \texttt{PostUpdateNode}, representing the state of an
object immediately after a field store (either direct, or by using a setter).
Furthermore, the data that is stored needs to be a candidate node already.
\item The node \texttt{n} takes the result of a field read
(either directly, or via a getter), the node from which we read is already a
candidate node \textbf{and} the field we read from is stored to
by a candidate node.
\end{itemize}
In the case of calls and returns, virtual dispatch is taken into account,
so calls branch out to all possible targets, and returns go back to all possible callers.
Field flow is only considered when it is enabled in the configuration.
If field flow is enabled, we track not only candidate nodes, as computed by
\texttt{nodeCandFwd1}, but also candidate fields that are possibly stored into as
part of a dataflow path.
These fields are tracked by the predicate \texttt{storeCandFwd1}, which is mutually
recursive with \texttt{nodeCandFwd1}.
While we do not track any access paths in this first pass,
the candidate fields can be understood as (a superset of) the fields that make up
access paths, as only fields in which we store can be part of an access path.
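The candidate node set is a recursive QL predicate that is evaluated to a fixed point.
As a rough, imperative sketch of the same idea (ignoring flow into and out of callables
and field flow; the generic type parameter \java{Node} and the explicit \java{step} edge
map are invented for illustration), the forward pass corresponds to a worklist
reachability search:
\begin{listing}[H]
\begin{javacode}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FloodingSketch {
    // Worklist sketch of the forward flooding pass: collect all nodes reachable
    // from the sources via single flow steps, stopping at barrier nodes.
    // Call contexts and access paths are ignored, as in nodeCandFwd1.
    static <Node> Set<Node> forwardCandidates(Set<Node> sources,
                                              Map<Node, List<Node>> step,
                                              Set<Node> barriers) {
        Set<Node> candidates = new HashSet<>();
        Deque<Node> worklist = new ArrayDeque<>(sources);
        while (!worklist.isEmpty()) {
            Node n = worklist.pop();
            if (barriers.contains(n) || !candidates.add(n)) {
                continue; // barrier nodes block flow; already-seen nodes are not revisited
            }
            worklist.addAll(step.getOrDefault(n, List.of()));
        }
        return candidates;
    }
}
\end{javacode}
\caption{Imperative sketch of the forward flooding pass (not the actual QL implementation)}
\label{lst:flooding-sketch}
\end{listing}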
\subsubsection*{Node Filtering --- Phase 1 backwards}
The second pass is done in the predicate \texttt{nodeCand1}.
The second pass filters the results from the first pass by only keeping nodes
that also are reachable from the sinks when taking all the steps backwards.
This pattern is used for all the candidate node filtering steps --- first, we
filter the node set starting from the sources and taking steps forward,
and then we filter that set further by starting with the sinks and taking steps backwards.
This eliminates all intermediary nodes on paths that never reach a sink.
Like \texttt{nodeCandFwd1}, \texttt{nodeCand1} ignores all call contexts
and all access paths.
All cases are symmetric to the ones in \texttt{nodeCandFwd1}.
In detail, this means a \texttt{Node n} is a candidate node as considered by
\texttt{nodeCand1} if it is contained in
\texttt{nodeCandFwd1} and one of the following is true:
\begin{itemize}
\item The node \texttt{n} is a sink node.
\item There exists a candidate node \texttt{mid} such that \texttt{mid} is reachable with a single flow step,
either a local step, a jump step or an additional step from the node \texttt{n}.
\item Flow into a callable:
There exists a candidate node \texttt{param} that is a parameter of a callable,
such that \texttt{n} is a possible argument that gets passed to the callable.
\item Flow out of a callable:
The node \texttt{n} returns from a callable, and there exists a candidate node
in a caller that receives the returned value.
\item The node \texttt{n} is an object from which a field \texttt{f} is read,
where the value of the read flows to a candidate node, and \texttt{f}
is a candidate field as identified by the predicate \texttt{storeCandFwd1}.
\item The node \texttt{n} is stored into a field \texttt{f}, where the
object after the store is already a candidate node, \texttt{f} is a
candidate field as identified by the predicate \texttt{storeCandFwd1}
\textbf{and} the field is read from by a candidate node.
\end{itemize}
Again, if field flow is enabled, we also track candidate fields that could make up
access paths. This predicate is called \texttt{readCand1} and is a subset of
\texttt{storeCandFwd1}.
It tracks the fields that are read from in a path from the sink to a source.
Both passes in phase 1 take virtual dispatch into account by dispatching to all possible callees
for each call site. Only later phases use a more advanced analysis that uses the call context
to restrict the virtual dispatch targets.
% TODO elaborate on type checking
%Note that no type checking is done in phase 1, so
\subsubsection*{Node Filtering --- Phase 2 forward}
The second phase of node filtering works very similarly to phase 1 --- it
operates on all nodes that are present in \texttt{nodeCand1} and filters them further.
It is again implemented in a flooding-style algorithm with a forward and backward
pass, like in phase 1.
However, it starts taking very simple call contexts and access paths
into account to make the analysis more precise.
In phase 2 both the call context and the access path are approximated by booleans.
The boolean for the call context tracks whether the data flow entered through
a parameter or not.
The boolean for the access path tracks whether the access path is empty or not.
The forward pass is implemented in the predicate \texttt{nodeCandFwd2}.
It differs from phase 1 in the following ways:
it skips over methods where the argument value
flows through and uses the summarized dataflow instead.
To take additional non-value-preserving steps into account,
summaries are computed for methods which take at least one
non-value-preserving step as well.
Flow out of a callable is only taken into account when the call context
indicates that we did not end up in the current node via a method call.
This excludes the non-summarized paths from the predicate.
Whenever we encounter a store, we simulate pushing onto the access path
by setting the flag indicating that the access path is non-empty.
Non-value-preserving steps are only allowed when the access path is empty.
When the access path is non-empty, we are tracking an object in which the tracked
data is stored. However, when a non-value-preserving step is applied to that object,
object identity does not hold anymore, and the value of the field with the tracked
data is lost.
Thus, additional dataflow steps are only allowed to be taken when the access path
is empty.
Whenever we encounter a read, we need to simulate a pop operation on the access path.
However, with the limited information we track it is not possible to determine
whether the access path is empty after a read or not.
Thus, we continue with both options considered possible.
Additionally, both reads and stores are only considered for fields that appear in the
predicate \texttt{readCand1}.
Furthermore, just like in \texttt{nodeCandFwd1}, the predicate
\texttt{storeCandFwd2} tracks all fields that are stored into in the dataflow
covered by this pass.
As we only consider fields from \texttt{readCand1}, it is a further refinement
of that set.
Whenever we encounter flow into or out of callables while having a non-empty
access path \textbf{and} we encounter virtual dispatch,
we check that the branch limit holds.
Calls that branch out to many dispatch targets are ignored when tracking objects
into which we stored earlier; what exactly counts as too many is determined by the
configuration.
This pruning can eliminate real dataflow paths from the results, but it is needed
to make the analysis feasible.
\subsubsection*{Node Filtering --- Phase 2 backwards}
The exact same filtering as done in phase 2 forwards from the sources to the sinks
is afterwards applied going backwards from the sinks to the sources.
The predicate for this is \texttt{nodeCand2}.
Furthermore, it also refines the potential fields making up access paths even further,
only recording fields that are present in \texttt{storeCandFwd2} and that are
also read from in the flow covered by \texttt{nodeCand2}.
The fields are stored in the predicate \texttt{readCand2}.
\subsubsection*{Local Flow Summarization --- The Bigstep Local Flow Relation}
Before filtering the node set even further, the local flow steps are summarized
into the so-called bigstep relation.
The local flow step relation contains very detailed dataflow information.
For example, the flow from a parameter to a method call
is in general not a single step.
Indeed, every use of the parameter along the control flow graph until the method
call is a part of the dataflow.
Having such a refined model of the dataflow is good for precision, but
reporting every one of these steps to the user would not be useful.
Furthermore, it incurs a considerable performance overhead, as a lot of nodes and edges
are needed to model local dataflow at that level.
To address those problems, a coarser version of the
local dataflow step relation is computed.
However, a simple transitive closure over the step relation would lose too much
information, as it would skip over interesting local flow steps,
like storing the data in a field, or calling methods with the information as
argument.
The \emph{local flow bigstep relation} strikes a balance between the refined
local flow step relation and its transitive closure.
It conflates steps until it reaches nodes that are considered to be interesting.
The intent is to skip over steps that are irrelevant in the presentation
to the programmer, while still
reporting enough steps in the dataflow graph that the programmer can understand
where exactly dataflow is happening.
It thus stops conflating steps when it encounters casts, flow into or out of callables,
jump steps, calls to methods where the argument value flows through (see above), field
stores and reads as well as when it reaches a sink.
Cast nodes are important stopping points, because in later steps the dataflow
algorithm prunes paths based on the recorded type information to rule out
paths where the type is not compatible.
The bigstep relation also keeps track of whether non-value-preserving steps have been used
in the summarization, because, as explained above, those steps are only allowed
when the current access path is empty.
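For illustration, the comments in the following hypothetical method mark points at which
the bigstep relation stops conflating local steps (the class and the \java{sink} method
are invented):
\begin{listing}[H]
\begin{javacode}
class BigstepExample {
    Object field;

    void example(Object source) {
        Object a = source;
        Object b = a;               // the plain local steps up to here are
                                    // conflated into a single bigstep
        String s = (String) b;      // cast: bigstep boundary, allowing later phases
                                    // to prune paths with incompatible types
        this.field = s;             // field store: another boundary
        sink(this.field);           // field read and sink: further boundaries
    }

    void sink(Object value) { }     // invented sink
}
\end{javacode}
\caption{Hypothetical sketch of the stopping points of the bigstep relation}
\label{lst:bigstep-sketch}
\end{listing}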
\subsubsection*{Node Filtering --- Phase 3 forward}
The next filtering step filters the candidate nodes from \texttt{nodeCand2}
and the candidate fields for the access paths from \texttt{readCand2} even further.
It is implemented in the predicate \texttt{flowCandFwd}, which stores the candidate nodes,
and in the predicate \texttt{consCandFwd}, which tracks the fields that potentially
make up access paths.
The main differences to \texttt{nodeCand2} are that the bigstep relation is
used for local flow (thus reducing the number of nodes needed),
that the access path is approximated more precisely than by just a boolean,
and that types are checked along the potential dataflow paths.
Call contexts are still approximated by only a boolean that indicates whether
in the current context, dataflow came from a method parameter.
To approximate the access path, we now store the front of the access path.
The front of an access path is either Nil if the access path is empty,
or the head of the access path (a field).
This is clearly more accurate than the boolean that just stores whether the
access path is empty.
This approximation relies on the observation made by Semmle engineers that
in most programs the current head of the access path is a
good predictor for the next element of the access path.
Furthermore, given the current head of a non-empty access path, we get the
type of the currently-tracked object for free.
Thus, by tracking type information when the access path is empty, we can type-check
the whole path with low overhead.
The access path handling is implemented in the following way:
Sources enforce that the access path front is Nil.
Furthermore, non-value-preserving steps can only be used in contexts with
a Nil access path front.
Again, like in phase 2, we ignore calls that branch out to many dispatch targets when the
access path front is not Nil.
When encountering a field store, we simulate a push onto the access path by replacing
the current access path front with a new front that uses the stored-to field
as head.
In contrast to a pop, this operation is unambiguous, as the new head is known exactly.
However, when encountering a field read, we need to simulate a pop operation.
As we throw away the information about the remainder of the access path as soon as
we encounter a store, we have no reliable way to know what to set the access path front
to.
Thus, we explore all possible options and consider all access path fronts that could fit.
More precisely, during a read from a field \texttt{f},
we explore data flow for all access path fronts that were encountered before
any store into the field \texttt{f} in flow covered by phase 3.
% TODO type check - why?
The example in~\autoref{lst:accesspathfronts} shows how the access path approximation
used in the forward pass of phase 3 of the node filtering works in practice.
The read of \java{box.fo1} in \java{flowMethod} is ambiguous,
because there are two places in the program where data is stored into the box
(the assignments \java{box.fo1 = a1} in \java{flowMethod} and in \java{flowMethod2}),
and these two places have two different access path fronts.
Thus, after reading from \java{box.fo1}, both access path fronts are candidates
and the algorithm cannot distinguish between them.
To create this ambiguity, \java{flowMethod2} is provided, which stores
into the second field of class \texttt{A}.
\begin{listing}[H]
\begin{javacode}
class A {
    public String stringField1;
    public String stringField2;
}

class Box {
    public A fo1;
}

class Flow {
    // flow enters via the parameter source, access path front is Nil,
    // type is String
    public void flowMethod(String source) {
        A a1 = new A();
        // field store, access path front changes to the head for the field
        // stringField1 in class A - the new type of the object we're tracking is A
        a1.stringField1 = source;
        Box box = new Box();
        // field store, access path front changes to the head for
        // field fo1, class Box - the new type we're tracking is Box
        box.fo1 = a1;
        // field read, access path front needs to be restored.
        // two options, either A.stringField1, or A.stringField2
        // the new type we're tracking is A
        A a2 = box.fo1;
        // field read, access path front becomes Nil, tracked type String
        // as there is no read from A.stringField1 anymore on this path
        // it is a dead end and will be filtered out in the backwards pass
        String trackedObject = a2.stringField1;
        // flow of the tracked object to a sink
        sink(trackedObject);
    }

    public void flowMethod2(String source) {
        A a1 = new A();
        a1.stringField2 = source;
        Box box = new Box();
        box.fo1 = a1;
        A a2 = box.fo1;
        String trackedObject = a2.stringField2;
        sink(trackedObject);
    }
}
\end{javacode}
\caption{Example that shows how the access path front approximation works}
\label{lst:accesspathfronts}
\end{listing}
Types along dataflow paths are mostly checked by the compiler.
For example, we can be sure that the types of the left- and right-hand side of an
assignment are compatible.
However, there are a few places where the final type check is either deferred to
runtime (casts) or where the dataflow analysis could get confused.
Thus, we prune dataflow nodes using the recorded type information at nodes that represent
casts, parameters and returns.
These nodes are only candidate nodes if their type
matches the type recorded in the access path front.
% TODO maybe with example as well?
\subsubsection*{Node Filtering --- Phase 3 backwards}
This pass is symmetric to the forwards pass.
The resulting candidate nodes are stored in the predicate \texttt{flowCand},
and the fields encountered along the paths are recorded in
the predicate \texttt{consCand}.
Both the node and the field candidate sets are subsets of the ones generated
in the forward part of phase 3.
\subsubsection*{Node Filtering --- Phase 4 forward}
The fourth phase is the last phase of node filtering.
Afterwards, the node set consists of all nodes occurring in the transitive closure of
the actual step relation (and is thus not computed explicitly anymore).
The forwards pass is implemented in the predicate \texttt{flowFwd}.
The set of candidate fields that make up the access path is not
restricted further.
The fourth phase still approximates call contexts by a boolean indicating
whether we entered the current context via a parameter or not.
However, the access path approximation is more accurate.
This is in fact the last access path approximation done by the algorithm.
Storing the full access path is infeasible, as recursive data structures
would lead to access paths of unbounded size.
As lists in QL have to be fully instantiated, this would be impossible to
represent.
Furthermore, even for access paths of finite length $N$, the number of access paths
can grow exponentially in $N$.
This can be seen by looking at simple binary trees, where an access path
of a leaf node is exactly a path from the root to a leaf node, and for a tree
of depth $N$ (the length of the access path) there are $O(2^N)$ different paths.
Thus, the maximum length of access paths has to be chosen carefully to not
cause a combinatorial explosion.
The approximation chosen for access paths is that the first two elements of
the access path are tracked exactly, and for the remaining elements only the length is tracked.
Furthermore, only access paths up to length 5 are considered at all.
Again, like in phase 3, the access path always tracks the type information of the
dataflow, even when the access path is empty.
Pushing onto such an approximation is easy: we just prepend the new field to the
existing list.
Should the resulting list have more than 5 entries, the dataflow path is not considered
at all.
This helps performance, for example in the case of recursive data structures,
where there are candidate dataflow paths of unbounded length.
As in phase 3, popping from a list approximation is more complicated.
However, in phase 4 we know exactly
what to restore the access path to after a read for short access paths.
When the access path has length one or two, popping from that list
is possible with full precision.
In the case of access paths of length three to five, we can precisely determine
the new head \texttt{f1} of the access path, as we store the second element.
However, the second element \texttt{f2} in the access path after the read cannot be determined.
Again, like in phase 3, we have to resort to just trying all possible options.
For the second element, we consider all possible fields that occur in dataflow
covered by phase 3 (i.e.\ exactly the fields recorded in \texttt{consCand}).
This is further filtered by just considering fields that occur along any access path
where the head of that access path is the field \texttt{f1}.
This ensures a relatively high precision in most cases.
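For illustration, consider an access path \java{[f1, f2, f3, f4]} (head first): it is
represented by the pair \java{(f1, f2)} together with the total length 4.
Pushing a field \java{g} yields \java{(g, f1)} with length 5; pushing once more would
exceed the limit of 5, and such a path is dropped.
Popping yields length 3 and the exact new head \java{f2}, while the new second element
is unknown and is therefore approximated by the fields from \texttt{consCand} that occur
in access paths with head \java{f2}.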
\subsubsection*{Node Filtering --- Phase 4 backwards}
The backwards pass of phase 4 is implemented in the predicate \texttt{flow}.
This is the final candidate set of nodes on which the step relation is computed.
\subsubsection*{The Step Relation}
The step relation is computed in the predicate \texttt{pathStep}.
It can be viewed as the edge set of a directed acyclic graph,
the dataflow graph.
The node set of that graph is the union of all nodes that are either the beginning
or the end of a single dataflow step.
Taking the transitive closure of the step relation yields a relation that indicates
which sinks are reachable from a given source.
It reuses the same access path approximation as phase 4 of the node filtering,
but keeps track of more detailed call contexts.
It tracks the context of the current start node of the step, i.e.\ whether it is
\begin{enumerate}
\item in a context where the call context doesn't matter
\item in a context where data entered via a parameter. The parameter is recorded,
but the originating call itself is not tracked.
\item in a context where data entered via a parameter. The originating call,
as well as the parameter through which the dataflow entered are recorded.
This type of context is only used when it helps the virtual
dispatch analysis (see the example above). Otherwise, call contexts that don't
track the precise originating call are used.
%\item TODO something about returns (needs Anders)
\end{enumerate}
Distinguishing between entering via a parameter without tracking the exact originating
call and tracking the originating call helps performance, as
fewer call contexts have to be tracked.
In general, we do not have a model of a call stack.
Thus, when encountering a call, all possible targets of the call have to be considered.
When encountering a return, it is not clear which caller should be returned to,
so all callers are considered.
When taking virtual dispatch into account, that means that all callers of an interface
method are candidates to return to, even though they might call a different
implementation.
%TODO some call contexts make it possible to restrict this set of possible return
%targets.
The steps taken are the same as in the node filtering passes;
the only difference is that both the start and the end point of each step are
recorded.
\subsubsection*{Dataflow}
Using the step relation, it is now almost trivial to check if there is dataflow
between sources and sinks.
One only needs to use the transitive closure of the step relation, as it contains
all nodes reachable from sources, and restrict it to those nodes that are sinks.
All of those are reachable from sources, with the dataflow paths being encoded in the
\texttt{pathStep} predicate.
% TODO: why localFlowStep flows through casts?
% simpleLocalFlowStep (Util): Follows SSA adjacent uses and def-first-use.
% Data also flows from assignments to asignees (Check word??)
% interestingly enough, also through casts
% TODO: local data flow: We have flow from PostUpdateNodes to the next adjacent read,
% but we also have flow from the pre-update node to the next adjacent read?
% Isn't that (more or less) two paths for the same?
% flowthrough matches callsites with returns (flow summary)
% start of considered nodes: nodeCandFwd1(node, config) - simple local flow
% node is reachable via dataflow from the source,
% including local flow, field stores, reads, flows into callables via parameters
% and flow out of callables via modified parameters (takes VD into account)
% doesn't take call contexts into account
% then nodeCand1 filters nodeCandFwd1 by just keeping all nodes which are
% reachable from a sink (through the same steps as before)
% nodeCand2: very simple call contexts, TODO find out exact differences to Cand1
% nodeCand:=nodeCand2 excluding call contexts
% bigstep relation: localFlowBigStep
% flowCandFwd0 something something accesspathfront, starts again with flooding
% from the source with all nodeCand2's
% flowCand0 filters flowCandFwd from the sink backwards to the source
% flowCandFwd = flowCandFwd0 AND CastingNode compatible type check
% flowCand = flowCand0 AND flowCandFwd
% flowCand0 uses flowCand and flowCandFwd, so mutual recursion, goes backwards from sink
% TODO: Sections about casts
% TODO note about numerical types
% TODO: simpleArgumentValueFlowsThrough is not mentioned+only in empty AP
% ANDERS:
% consCandFwd why type check? Every store in a field needs to be compatible
% with the type of the field?
% TODO mention a bit more about the general design/ node exists versus computing explicit
% step relation - we have an implicit union over the different step relations, but we don't connect
% them to paths just yet, because that would've size NxN instead of size N
% TODO callable versus method