-
Notifications
You must be signed in to change notification settings - Fork 0
/
1-introduction.tex
190 lines (166 loc) · 11.5 KB
/
1-introduction.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
% !TEX root = main.tex
\section{Introduction}
\subsection{Static Analysis}
Static analysis is the analysis of computer programs, while not executing the program.
Usually, the static analysis has access to the source code of the analyzed program.
Static analysis has several goals, one of them is to support the programmer in finding (possible) mistakes and
security issues in their programs.
Furthermore, static analysis is also employed by compilers in order to optimize code.
If for example a compiler can prove that the condition expression of an if statement is always false,
a branch instruction can be avoided.
In this project outside the course scope, we use a static analyser that is developed
solely to find bugs and report them to the developers.
\subsection{Description of CodeQL}
The software package we are using for static analysis is called \emph{CodeQL}.
It was first developed by a company called Semmle and sold to corporate customers worldwide.
Additionally, open-source projects hosted on GitHub and other open-source platforms could
get analysis results of their projects for free on the LGTM.com platform\footnote{LGTM
is an acronym for \emph{Looks good to me}, a common statement of reviewers before merging
pull requests.}.
In September 2019, GitHub announced that they bought Semmle.
Following the acquisition, the CodeQL CLI utility was released to the public,
which allows the analysis of any software, not only publicly hosted projects.
However, the license specifies that any usage has to either be academic research,
or that the analyzed codebase is released under an OSI-approved open-source license.
Still, this was quite a step up from only having code that has been uploaded to
GitHub analyzed.
During this project, the CodeQL CLI utility was used under the clause allowing academic research.
Throughout the report, there are links to CodeQL analyses on open-source projects.
These refer to a single analysis (called query in CodeQL lingo), as well as a specific result
set associated with a project revision at the time of writing the report.
We expect that these links are snapshots, i.e.\ as long as they are available,
the content does not change.
% TODO terminology callable, argument, parameter
\subsubsection*{Architecture of CodeQL}
Now we give a short overview of the CodeQL architecture, as well as where our
project fits in.
The CodeQL tool is built upon the domain specific language QL\footnote{QL is an acronym for \emph{Query Language}}
that enables declarative, logic-based programming.
It is based on first-order predicate logic.
Every analysis of a program is described as a query of a relational database built using
the program's source code. These queries are written in QL.
In order to analyze source code, it first needs to be processed into a database representation.
For each language an \emph{extractor} parses the source code and outputs a database file.
The extractor is language specific and usually hooks into a (possibly modified) compiler for that language.
The output of the extractor is a relational database in a language-specific \emph{database schema}.
This database schema evolves along with the language, as well as extractor capabilities.
For example, the Java database schema changed when preliminary support for the Java 13 yield statement was
added to the extractor.
The \emph{database engine} allows users to query the database file produced by the extractor.
These queries, written in QL, implement the actual static analyses.
Both the database engine and the QL language are agnostic to the language the analyzed source code
is written in.
This allows developers at GitHub with relative ease to support new languages in the CodeQL ecosystem.
In theory, QL queries are also independent of the language the analyzed source code is written.
However, in practice queries are tied to the language of the analyzed source code,
as they depend on the language-specific database schema.
Furthermore, queries can't gloss over language-specific semantics either {---} one example of many would be
that signed integer overflow needs to be treated differently in Java than in C/C++.
For each language, a set of \emph{libraries} assists in the development of queries.
These libraries are implemented in QL.
First, the libraries abstract away from the database view of the program source code,
and provide an object-oriented representation of the program.
Second, features such as taint tracking, unsafe dataflow analysis or control flow graphs is implemented in libraries.
For every language supported by the CodeQL analyzers, developers usually re-implement all of these libraries from
scratch, as sharing components across languages is quite difficult.
The only exception is the dataflow library, which is separated into a language-agnostic and
language-specific part.
This language-agnostic part of the dataflow library is shared between the libraries for Java, C/C++ and C\#.
Using those libraries, developers at GitHub write \emph{queries} to detect bugs in the analyzed program.
For example, one query detects when variable dereference in Java might be null.
Another query detects if user input may end up in a database query, without the user input being
sanitized properly for use in such a query.
Queries often rely on the supporting libraries to do the heavy lifting, e.g.\ the SQL injection detection
is built on top of the dataflow library.
All queries as well as the supporting libraries are open-source and published
on GitHub\footnote{\url{https://github.com/Semmle/ql}}.
Most of the extractors and the query engine are only available in compiled form under a
proprietary license\footnote{\url{https://github.com/github/codeql-cli-binaries}}.
While the license is quite restrictive,
usage of the CodeQL CLI utility for academic research is explicitly allowed.
Our project outside the course scope contributes a small improvement to the dataflow library,
both the shared and the language-specific part. We only implemented the language-specific
functionality, and a GitHub employee provided stubs for C/C++ and C\#.
Furthermore, the theoretical framework laid out in this project aims to further the understanding
of the dataflow algorithm employed by CodeQL.
\subsection{Philosophy of Bug Finding}
Using static analysis for bug finding is lacking in the theoretical foundation {---}
analyses are often neither sound nor complete.
This means that a query can report \emph{false positives}, problems flagged by the query that
are not bugs at all.
Furthermore, queries can also not report instances of a bug which are present in the code,
that instance is then called a \emph{false negative}.
Obviously both false positives and negatives are not desirable, but static analyzers go a long way
to reduce false positives as a means to reduce noise.
If false positives are reduced, almost every problem flagged by an analysis is a real bug.
This makes the static analyzer a very useful tool.
However, this makes it difficult to express in a theoretical framework to prove that
an analysis is useful, as this approach eludes an easy definition.
In this report, we deal with this problem by reducing the complexity of the analyzed language,
so in the theoretical setting we are dealing with a (provably) sound analysis, i.e.\
an analysis that has no false negatives.
\subsection{Evaluation Model of QL}
QL is an object-oriented logic programming language built on top of Datalog.
However, its syntax is very much inspired by Java, so it might look weird to the
experienced logic programmer at first.
The paper~\cite{qlpaper} provides a thorough introduction to QL, gives a
formal syntax, semantics and a translation to ordinary Datalog.
Here we provide a short overview about the evaluation model that should
enable the reader to understand this report.
At the root of QL is the concept of a \emph{predicate}.
A predicate contains a finite set of tuples of typed data.
There are two types of predicates in QL.
The contents of tables in the database are exposed to the programmer as \emph{database
predicates} --- these are predefined in the program and can only be queried, but not modified.
The types and names of these predicates are specified in the relational database schema.
An example of a database schema for a small language (introduced later) can be
seen in the appendix.
The data for these predicates is provided by the extractor (according to the database schema),
and loaded at runtime by the database engine.
All other predicates in a QL program are \emph{user-provided predicates}.
These predicates are built declaratively.
Thus, the programmer states, using first-order predicate logic,
which data tuples belong in a predicate.
Compared to Prolog, probably the best-known logic programming language,
predicates in QL can be built much more expressively,
as quantifications and logic operators can be freely mixed.
In Prolog, programmers are restricted to (implicitly universally quantified) Horn clauses.
Another difference to Prolog is that there is only single definition for each predicate,
whereas in Prolog a predicate is a list of Horn clauses.
In QL, user-provided predicates can often separated in two parts:
Base case(s), i.e.\ disjuncts that contain a list of logical requirements for data
to be part of the predicate, without invoking recursion.
Recursive case(s), i.e.\ disjuncts that declare how existing members of the predicate
can be used to generate new members of the predicate.
For example, a predicate that traverses a binary tree might contain a base case that visits
the root node, and then two recursive cases that visit the two children (if they exist).
All cases are joined together by a top-level disjunction.
Unlike Prolog, QL uses forward chaining.
This means that it uses the existing tuples in a predicate and applies the declarative
rules in the predicate as long as new tuples are generated.
In other words, the QL database engine computes the set of tuples that are a least fixed point
of the first-order formulae of the user-defined predicates.
In a general setting, this fixed-point might not always exist.
To ensure existence of least fixed-points while allowing recursion,
QL only allows recursion if the predicate appears under an even number of negations.
Another rule that is needed to ensure termination is that all data included in a tuple
ranges over a finite domain.
For data types defined in the database, this is trivially true.
For numbers, Booleans and strings, this is ensured by requiring that they are bound
in the predicate body, for example by restricting their values to a finite set of constants,
or by restricting their values to be in another predicate.
One of the user-provided predicates is designated to be the \emph{query predicate}.
Its content is provided as the output of a running a query.
Often, a query predicate will contain source code locations together with a message,
explaining a problem that was detected at that location.
While QL gives the programmer a very expressive language to state their intent,
the programmer can not control how the query is executed.
It is the sole task of the query engine to come up with and optimize a
query plan suitable for execution on the database.
A useful feature of QL is that it includes syntactic sugar to compute the reflexive-transitive
and the transitive closure of binary predicates.
For example, a predicate \texttt{getSucc} on control flow nodes that contains the
successor relationship can be turned into the set of all reachable
nodes of that control flow node by invoking \texttt{getSucc*}.
For the details of the object model, as well as more details of the evaluation, we refer
the interested reader to~\cite{qlpaper}.