Writing fast programs doesn't magically happen. This guide contains tips and best practices for making things go fast in Swift.
Performance optimization is incredibly subtle. Both hardware and the Swift compiler are very sophisticated; and both have non-obvious performance characteristics. By doing some preparation, you can remove a lot of guesswork and save yourself time.
-
Write a benchmark: The current recommendation is to use the Swift Benchmark library, as it has a nice suite of features. Ensure it is representative of the program / subroutine you're trying to optimize. Try to write your benchmarks so they run quickly to enable a fast iteration cycle. (Note: use Swift Benchmark's
--filter
flag to run only the relevant benchmarks you're trying to optimze.)When developing with SwiftPM, we've found that defining an (unexported)
Benchmarks
executable target that aggregates all your benchmarks across your project into a single binary works well. -
Start a spreadsheet & gather baselines: Keep track of what changes you're making, and how they affect performance. One spreadsheet template that works well looks like:
Perf (ns) Hash Notes 1234.5 abc123fd
Baseline 1500.2 cba321df
Switch to using classes instead of structs (whoops!) 36.4 df123abc
Avoid accidentally quadratic behavior by using in-place operations ... ... ... Tip: Every time you run the benchmark, commit the changes. That way you know exactly what you ran and when. (You can always squash commits & rewrite history later.) Additionally, always start with a baseline measurement for your benchmark on your machine. Don't rely on numbers from a previous version of the compiler or library, or having run them on another machine!
Profiling occurs differently depending on which platform you're running on.
Linux has sophisticated performance profiling infrastructure available. We currently recommend using
perf
which is available on all modern Linux distributions & kernels. You can install it on Debian
or Ubuntu machines by executing: sudo apt-get install linux-perf
.
To capture a profile, do something like the following:
swift build -c release -Xswiftc -cross-module-optimization
perf record -F 499 -g -- .build/release/Benchmarks --filter myBenchmark
where Benchmarks
is the name of the SwiftPM executable containing the benchmarks, and the
specific benchmark you're trying to optimize has myBenchmark
somewhere in its name. Once you've
captured the profile, it's time to analyze it!
Prerequisite: clone https://github.com/brendangregg/FlameGraph somewhere.
In the same directory where you ran perf record
above, execute (assuming you checked out the
FlameGraph repository at ../FlameGraph
):
perf script | ../FlameGraph/stackcollapse-perf.pl | swift-demangle | ../FlameGraph/flamegraph.pl --width 1800 > flamegraph.svg
You can then visualize the resulting SVG file with your browser.
Be sure to adjust the paths to the FlameGraph repository if you did not check it out in a sister directory to your current package.
To learn more about how to interpret flame graphs, check out:
gprof2dot
can provide even more sophisticated insights
into the performance of a program.
Prerequisite:
- Debian / Ubuntu:
sudo apt-get install python3 graphviz && pip install gprof2dot
- RedHat / Fedora:
sudo yum install python3 graphviz && pip install gprof2dot
In the same directory where you ran perf record
above, execute:
perf script | swift-demangle | gprof2dot -f perf | dot -Tpng -o perf.png
You can the visualize the resulting PNG file with your browser or other image viewing software.
gprof2dot has a number of extra variations / configuration options that can help you understand performance more deeply.
By default, gprof2dot
colors vertices based on the total time spent, resulting in a similar
visualization to flame graphs. While this is very useful to understand where time is spent in the
program, it can sometimes be useful to understand which function is itself consuming a significant
amount of time. You can use the --color-nodes-by-selftime
to see this quickly. Example:
perf script | swift-demangle | gprof2dot -f perf --color-nodes-by-selftime | dot -Tpng -o perf-selftime.png
When optimizing reference counting overheads (e.g. retain / releases), it can be helpful to filter
down the graph to more easily identify where the ARC operations are coming from. (ARC traffic can
be especially bad in performance-critical code, as it involves CPU atomics.) The following command
filters down the graph to the subset that calls swift_retain
. (You can perform similar analysis
looking for swift_release
as well.)
perf script | swift-demangle | gprof2dot -f perf -l "swift_retain" | dot -Tpng -o perf-retains.png
Note: filtering for swift_retain
and swift_release
can of course be combined with the
--color-nodes-by-selftime
option, as follows:
perf script | swift-demangle | gprof2dot -f perf -l "swift_retain" --color-nodes-by-selftime | dot -Tpng -o perf-retains-selftime.png
The Swift compiler is very good at optimizing and inlining code. While this is great for
performance, it can sometimes make it very difficult to understand where a particular performance
problem is coming from. For example: if functions b()
, c()
, and d()
are all inlined into
function a()
, only a()
will show up in the profiles and not b()
, c()
, or d()
, even if
a()
is itself fast, and it's actually c()
that's a very slow function. To pinpoint where the
hotspots are, you can annotate b()
, c()
, and d()
with @inline(never)
(temporarily) and
capture a new profile.
Be sure to remove unnecessary
@inline(never)
annotations, as they can adversely affect performance.
When trying to understand why a particular piece of Swift code is performing poorly, it can be helpful to look at the code the Swift compiler generates for a particular pattern.
The Godbolt compiler explorer supports Swift, and is a great way to look at
small snippets of code in isolation. Be sure to set the -O
flag when looking at the generated
assembly. Here is a quick link that sets the compiler to Swift nightly and enables
optimizations.
While Godbolt is fantastic for understanding what assembly is produced by the Swift compiler for small snippets of code, it is sometimes more effective to look at the code generated by the compiler in the context of your project. (e.g. Dependencies are difficult to disentangle / understand in isolation, etc.) Additionally, it can sometimes be more effective to see higher level representations of the program used by the compiler such as the SIL program instead of the LLVM IR or assembly instructions. Advantages of looking at SIL include much better tie-in with the original Swift code, and higher-level semantic operations. Looking at LLVM or the assembly has the advantage of being closer to the hardware, and thus there are fewer layers of abstraction to peer through. (Note that for the most performance critical operations, you'll want to look into the details of your hardware to understand hardware ports, instruction latency, prefetch buffer sizes, and beyond. This is out of scope for this document.)
Debugging the compiler
contains a section called "Printing the Intermediate
Representations"
which describes how to print the intermediate representations based on direct swiftc
invocations.
Thankfully, SwiftPM (when building) can be persuaded to tell you what swiftc
invocations it's
doing when building your project. With a little bit of munging, you can get a command line
invocation you can use to dump the IR's for a given file:
-
Clean: We begin by cleaning the generated artifacts:
swift package clean
-
Build, capturing command lines: Next, we re-build the project, capturing the command lines:
swift build -v -c release -Xswiftc -cross-module-optimization | tee /tmp/build_commands.txt
-
Generate swiftc invocation: We grep for the source filename that contains the code we'd like to analyze (in the example below:
CollectionAlgorithms.swift
), and perform some surgery to make it work independently from SwiftPM:cat /tmp/build_commands.txt | \ grep "swiftc" | \ grep "CollectionAlgorithms.swift" | \ sed 's/-parseable-output //'
This should get you a single command line like the following:
/Library/Developer/Toolchains/swift-tensorflow-DEVELOPMENT-2020-04-15-a.xctoolchain/usr/bin/swiftc -module-name PenguinStructures -incremental -emit-dependencies -emit-module -emit-module-path /Users/saeta/tmp/penguin/.build/x86_64-apple-macosx/release/PenguinStructures.swiftmodule -output-file-map /Users/saeta/tmp/penguin/.build/x86_64-apple-macosx/release/PenguinStructures.build/output-file-map.json -parse-as-library -whole-module-optimization -num-threads 12 -c /Users/saeta/tmp/penguin/Sources/PenguinStructures/AnyArrayBuffer.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/ArrayBuffer.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/ArrayStorage.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/CollectionAlgorithms.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/DefaultInitializable.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/Deque.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/Empty.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/FactoryInitializable.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/FixedSizeArray.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/Heap.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/HierarchicalArrays.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/HierarchicalCollection.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/IdIndexable.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/KeyValuePair.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/NominalElementDictionary.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/PCGRandomNumberGenerator.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/Random.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/Tuple.swift /Users/saeta/tmp/penguin/Sources/PenguinStructures/UnsignedInteger+Reduced.swift -I /Users/saeta/tmp/penguin/.build/x86_64-apple-macosx/release -target x86_64-apple-macosx10.10 -swift-version 5 -sdk /Users/saeta/Downloads/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -F /Users/saeta/Downloads/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/Library/Frameworks -I /Users/saeta/Downloads/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/usr/lib -L /Users/saeta/Downloads/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/usr/lib -O -g -j12 -DSWIFT_PACKAGE -module-cache-path /Users/saeta/tmp/penguin/.build/x86_64-apple-macosx/release/ModuleCache -emit-objc-header -emit-objc-header-path /Users/saeta/tmp/penguin/.build/x86_64-apple-macosx/release/PenguinStructures.build/PenguinStructures-Swift.h -cross-module-optimization
-
Append dump operations: You'll now need to modify the command line to request
swiftc
to dump whichever IR you'd like. You can see early SIL by passing-emit-silgen
, or optimized SIL by passing-emit-sil
. To see optimized LLVM IR, you can pass-emit-ir
. Finally, it's often helpful to put the output into a file, which can be done by appending-o $FILENAME.txt
as well.For example, to see optimized SIL, we would append
-emit-sil -o PenguinStructures.sil
to the command line generated in the previous step. After running the compiler, we can inspect the SIL by opening the filePenguinStructures.sil
with our favorite text editor. -
Making changes: As you tweak the program, just re-run the command from the previous step, and it will output the updated SIL into the same file. This makes it very easy to see the effect different changes to the source has on the resulting program. Beware, if you change files in other modules, you may need to re-build with SwiftPM and re-generate the command line to dump IRs.
The following are some common performance pitfalls in our experience:
-
Building without optimizations. The Swift language is designed to be easily optimizable. Additionally, the Swift compiler operates in 2 modes: optimizing mode and fast-build mode. Fast-build mode optimizes for quick iteration cycles, and as a result skips a significant number of optimizations. Code compiled without optimizations can easily be an order of magnitude slower than code built with optimizations. Additionally, Swift supports a stable ABI (on some platforms) including resilience boundaries which support library evolution. While this unlocks flexibility, it comes at the cost of performance. Fortunately, you can easily enable cross module optimization which allows for specialization across modules within a codebase.
-
Symptoms: Shockingly slow execution, lots of extra copies, ARC traffic, and thunking showing up in profiles.
-
Medicine: Turn on optimizations in the compiler by passing
-O -cross-module-optimization
. In SwiftPM, build with-c release -Xswiftc -cross-module-optimization
(e.g.:swift build -c release -Xswiftc -cross-module-optimization
). When using Bazel, build with-c opt
.
Note: Swift Benchmark will automatically detect if you've compiled without optimizations. :-)
-
-
Accidentally quadratic: CoW-based data structures can make what should be a linear algorithm into an O(n^2) algorithm.
-
Symptoms: If you see a lot of
swift_arrayInitWithCopy
,memcpy
,_swift_release_dealloc
,swift_release
, and...makeMutableAndUnique...
, you likely are accidentally quadratic. -
Medicine: Figure out where the copy is occuring (use
@inline(never)
as needed), and refactor the algorithm to either (a) avoid making mutations, or (b) avoid making a copy.
-
-
Classes inside Array's: Avoid putting classes inside arrays, as this inhibits a variety of optimizations.
-
Symptoms: Lots of ARC traffic (
swift_retain
,swift_release
) shows up in profiles (usegperf2dot
instead of flame graphs). -
Medicine: You can either (a) use a
struct
instead of aclass
, or (b) (dangerous) use one of the unsafe abstractions (UnsafeBufferPointer
,ManagedBuffer
, etc.), or (c) (even more dangerous) useUnsafePointer
and/orUnmanaged
.
-
-
Reflection: Swift's optimizer is very good at optimizing protocols and generics, but is very bad at optimizing reflection-based code. Avoid using reflection-based code, and instead optimize for using protocols and generics instead.
-
Escaping Closures: Swift has support for closures, and closures are often leveraged in Swift API design. While non-escaping closures are often inlined and specialized away, escaping closures incur heap allocations, reference counting traffic, and cost at least as much as a virtual function dispatch.
-
Symptoms: If you see "thunks" in your flame graphs and/or unexpectedly high numbers of ARC operations and an API uses escaping closures, it's possible that escaping closures are inhibiting performance optimizations.
-
Medicine: Consider (if possible) re-designing the API / library to avoid escaping closures. Note that sometimes an escaping closure is the most appropriate API.
Note: the "costs" of escaping closures are typically not in the escaping closure itself, but rather come from preventing a variety of other optimizations that would otherwise be possible.
-