diff --git a/_posts/2025-02-04-data-wants-to-be-free.md b/_posts/2025-02-04-data-wants-to-be-free.md new file mode 100644 index 00000000000..271074e3c34 --- /dev/null +++ b/_posts/2025-02-04-data-wants-to-be-free.md @@ -0,0 +1,306 @@ +--- +layout: post +title: "Data wants to be free: fast data exchange with Apache Arrow" +description: "" +date: "2025-02-04 00:00:00" +author: David Li, Ian Cook, Matt Topol +categories: [application] +image: + path: /img/arrow-result-transfer/part-1-share-image.png + height: 1200 + width: 705 +--- + + + + + +_This is the second in a series of posts that aims to demystify the use of +Arrow as a data interchange format for databases and query engines._ + +As data practitioners, we often find our data “held hostage”. Instead of being +able to use data as soon as we get it, we have to spend time—time to parse and +clean up inefficient CSV files, time to wait for an outdated query engine to +struggle with a few gigabytes of data, and time to wait for the data to make +it across a socket. It’s that last point we’ll focus on today. In an age of +multi-gigabit networks, why is it even a problem in the first place? And it is +a problem—research by Mark Raasveldt and Hannes Mühleisen in their [2017 +paper](https://ir.cwi.nl/pub/26415/p852-muehleisen.pdf) demonstrated that some +systems take over **ten minutes** to transfer a dataset that should only take +ten *seconds*. + +Why are we waiting 60 times as long as we need to? [As we've argued before, +serialization overheads plague our +tools](https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/)—and +Arrow can help us here. So let’s make that more concrete: we’ll compare how +PostgreSQL and Arrow encode the same data to illustrate the impact of the data +serialization format. Then we’ll tour various ways to build protocols with +Arrow, like Arrow HTTP and Arrow Flight, and how you might use each of them. + +# PostgreSQL vs Arrow: Data Serialization + +Let’s compare the [PostgreSQL binary +format](https://www.postgresql.org/docs/current/sql-copy.html#id-1.9.3.55.9.4) +and [Arrow +IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) +on the same dataset, and show how Arrow (with all the benefit of hindsight) +makes better trade-offs than its predecessors. + +First, we’ll create a table and fill it with data: + +``` +postgres=# CREATE TABLE demo (id BIGINT, val TEXT, val2 BIGINT); +CREATE TABLE +postgres=# INSERT INTO demo VALUES (1, 'foo', 64), (2, 'a longer string', 128), (3, 'yet another string', 10); +INSERT 0 3 +``` + +We can then use the COPY command to dump the raw binary data from PostgreSQL into a file: + +``` +postgres=# COPY demo TO '/tmp/demo.bin' WITH BINARY; +COPY 3 +``` + +Then we can look at the actual bytes of the data and annotate them, based on the [documentation](https://www.postgresql.org/docs/current/sql-copy.html#id-1.9.3.55.9.4) for parsing the PostgreSQL binary format: + +
00000000: 50 47 43 4f 50 59 0a ff  PGCOPY..  COPY signature, flags,
+00000008: 0d 0a 00 00 00 00 00 00  ........  and extension
+00000010: 00 00 00 00 03 00 00 00  ........  Values in row
+00000018: 08 00 00 00 00 00 00 00  ........  Length of value
+00000020: 01 00 00 00 03 66 6f 6f  .....foo  Data
+00000028: 00 00 00 08 00 00 00 00  ........
+00000030: 00 00 00 40 00 03 00 00  ...@....
+00000038: 00 08 00 00 00 00 00 00  ........
+00000040: 00 02 00 00 00 0f 61 20  ......a
+00000048: 6c 6f 6e 67 65 72 20 73  longer s
+00000050: 74 72 69 6e 67 00 00 00  tring...
+00000058: 08 00 00 00 00 00 00 00  ........
+00000060: 80 00 03 00 00 00 08 00  ........
+00000068: 00 00 00 00 00 00 03 00  ........
+00000070: 00 00 12 79 65 74 20 61  ...yet a
+00000078: 6e 6f 74 68 65 72 20 73  nother s
+00000080: 74 72 69 6e 67 00 00 00  tring...
+00000088: 08 00 00 00 00 00 00 00  ........
+00000090: 0a ff ff                 ...
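+
+Here’s a rough sketch of what it takes to walk this dump by hand in Python, just to show how much per-row and per-value bookkeeping there is (the `parse_pgcopy` helper is ours for illustration, and it assumes the column types are already known from elsewhere, since the format doesn’t carry them):
+
+```python
+import struct
+
+def parse_pgcopy(buf: bytes):
+    # 11-byte signature, then a 4-byte flags field and a 4-byte extension length.
+    assert buf[:11] == b"PGCOPY\n\xff\r\n\x00"
+    _flags, ext_len = struct.unpack_from(">II", buf, 11)  # everything is big-endian
+    pos = 19 + ext_len
+    rows = []
+    while True:
+        # Every row starts with a 2-byte field count; -1 marks the end of the data.
+        (nfields,) = struct.unpack_from(">h", buf, pos)
+        pos += 2
+        if nfields == -1:
+            break
+        row = []
+        for _ in range(nfields):
+            # Every value carries a 4-byte length prefix; -1 means NULL.
+            (length,) = struct.unpack_from(">i", buf, pos)
+            pos += 4
+            row.append(None if length == -1 else buf[pos : pos + length])
+            pos += max(length, 0)
+        rows.append(row)
+    return rows
+
+rows = parse_pgcopy(open("/tmp/demo.bin", "rb").read())
+print(struct.unpack(">q", rows[0][0])[0], rows[0][1].decode())  # 1 foo
+```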
+ +Honestly, PostgreSQL’s binary format is quite understandable, and pretty compact at first glance. But a closer look isn’t so favorable. **PostgreSQL has overheads proportional to the number of rows and columns**: + +* Every row has a 2 byte prefix for the number of values in the row. *But the data is tabular—we already know this info, and it doesn’t change\!* +* Every value of every row has a 4 byte prefix for the length of the following data, or \-1 if the value is NULL. *But we know the data types, and those don’t change—most values have a fixed, known length\!* +* All values are big-endian. *But most of our devices are little-endian, so the data has to be converted.* + +Sure, we need to store if a value is NULL or not, but 4 bytes is a *bit* much for a boolean. String data and other non-fixed-length types need per-value lengths, but PostgreSQL adds the length for *every* type of value. And converting big-endian to little-endian is pretty trivial…but it’s still work that stands in between you and your data. To PostgreSQL’s credit, its format is at least cheap and easy to parse—[other formats](https://protobuf.dev/programming-guides/encoding/) get fancy with tricks like “varint” encoding which are quite expensive. + +For example, a single column of int32 values would have 4 bytes of data and 6 bytes of overhead per row—**60% is “wasted\!”**[^1] The ratio gets a little better with more columns (but not with more rows); in the limit we approach “only” 50% overhead. + +How does Arrow compare? We can use [ADBC](https://arrow.apache.org/adbc/current/driver/postgresql.html) to pull the PostgreSQL table into an Arrow table, then annotate it like before: + +```console +>>> import adbc_driver_postgresql.dbapi +>>> import pyarrow.feather +>>> conn = adbc_driver_postgresql.dbapi.connect("...") +>>> cur = conn.cursor() +>>> cur.execute("SELECT * FROM demo") +>>> data = cur.fetchallarrow() +>>> pyarrow.feather.write_feather(data, "demo.arrow") +``` + +(Aside: look how easy that is!) + +
00000000: 41 52 52 4f 57 31 00 00  ARROW1..
+00000008: ff ff ff ff d8 00 00 00  ........
+00000010: 10 00 00 00 00 00 0a 00  ........
+00000018: 0c 00 06 00 05 00 08 00  ........
+00000020: 0a 00 00 00 00 01 04 00  ........
+00000028: 0c 00 00 00 08 00 08 00  ........
+00000030: 00 00 04 00 08 00 00 00  ........
+00000038: 04 00 00 00 03 00 00 00  ........
+00000040: 74 00 00 00 38 00 00 00  t...8...
+00000048: 04 00 00 00 a8 ff ff ff  ........
+00000050: 00 00 01 02 10 00 00 00  ........
+00000058: 18 00 00 00 04 00 00 00  ........
+00000060: 00 00 00 00 04 00 00 00  ........
+00000068: 76 61 6c 32 00 00 00 00  val2....
+00000070: 9c ff ff ff 00 00 00 01  ........
+00000078: 40 00 00 00 d8 ff ff ff  @.......
+00000080: 00 00 01 05 10 00 00 00  ........
+00000088: 18 00 00 00 04 00 00 00  ........
+00000090: 00 00 00 00 03 00 00 00  ........
+00000098: 76 61 6c 00 04 00 04 00  val.....
+000000a0: 04 00 00 00 10 00 14 00  ........
+000000a8: 08 00 06 00 07 00 0c 00  ........
+000000b0: 00 00 10 00 10 00 00 00  ........
+000000b8: 00 00 01 02 10 00 00 00  ........
+000000c0: 1c 00 00 00 04 00 00 00  ........
+000000c8: 00 00 00 00 02 00 00 00  ........
+000000d0: 69 64 00 00 08 00 0c 00  id......
+000000d8: 08 00 07 00 08 00 00 00  ........
+000000e0: 00 00 00 01 40 00 00 00  ....@...
+000000e8: ff ff ff ff 08 01 00 00  ........
+000000f0: 14 00 00 00 00 00 00 00  ........
+000000f8: 0c 00 18 00 06 00 05 00  ........
+00000100: 08 00 0c 00 0c 00 00 00  ........
+00000108: 00 03 04 00 1c 00 00 00  ........
+00000110: c8 00 00 00 00 00 00 00  ........
+00000118: 00 00 00 00 0c 00 1c 00  ........
+00000120: 10 00 04 00 08 00 0c 00  ........
+00000128: 0c 00 00 00 98 00 00 00  ........
+00000130: 1c 00 00 00 14 00 00 00  ........
+00000138: 03 00 00 00 00 00 00 00  ........
+00000140: 00 00 00 00 04 00 04 00  ........
+00000148: 04 00 00 00 07 00 00 00  ........
+00000150: 00 00 00 00 00 00 00 00  ........
+00000158: 00 00 00 00 00 00 00 00  ........
+00000160: 00 00 00 00 00 00 00 00  ........
+00000168: 2a 00 00 00 00 00 00 00  *.......
+00000170: 30 00 00 00 00 00 00 00  0.......
+00000178: 00 00 00 00 00 00 00 00  ........
+00000180: 30 00 00 00 00 00 00 00  0.......
+00000188: 27 00 00 00 00 00 00 00  '.......
+00000190: 58 00 00 00 00 00 00 00  X.......
+00000198: 3b 00 00 00 00 00 00 00  ;.......
+000001a0: 98 00 00 00 00 00 00 00  ........
+000001a8: 00 00 00 00 00 00 00 00  ........
+000001b0: 98 00 00 00 00 00 00 00  ........
+000001b8: 2a 00 00 00 00 00 00 00  *.......
+000001c0: 00 00 00 00 03 00 00 00  ........
+000001c8: 03 00 00 00 00 00 00 00  ........
+000001d0: 00 00 00 00 00 00 00 00  ........
+000001d8: 03 00 00 00 00 00 00 00  ........
+000001e0: 00 00 00 00 00 00 00 00  ........
+000001e8: 03 00 00 00 00 00 00 00  ........
+000001f0: 00 00 00 00 00 00 00 00  ........
+000001f8: 18 00 00 00 00 00 00 00  ........
+00000200: 04 22 4d 18 60 40 82 13  ."M.`@..
+00000208: 00 00 00 22 01 00 01 00  ..."....
+00000210: 12 02 07 00 90 00 03 00  ........
+00000218: 00 00 00 00 00 00 00 00  ........
+00000220: 00 00 00 00 00 00 00 00  ........
+00000228: 10 00 00 00 00 00 00 00  ........
+00000230: 04 22 4d 18 60 40 82 10  ."M.`@..
+00000238: 00 00 80 00 00 00 00 03  ........
+00000240: 00 00 00 12 00 00 00 24  .......$
+00000248: 00 00 00 00 00 00 00 00  ........
+00000250: 24 00 00 00 00 00 00 00  $.......
+00000258: 04 22 4d 18 60 40 82 24  ."M.`@.$
+00000260: 00 00 80 66 6f 6f 61 20  ...fooa
+00000268: 6c 6f 6e 67 65 72 20 73  longer s
+00000270: 74 72 69 6e 67 79 65 74  tringyet
+00000278: 20 61 6e 6f 74 68 65 72   another
+00000280: 20 73 74 72 69 6e 67 00   string.
+00000288: 00 00 00 00 00 00 00 00  ........
+00000290: 18 00 00 00 00 00 00 00  ........
+00000298: 04 22 4d 18 60 40 82 13  ."M.`@..
+000002a0: 00 00 00 22 40 00 01 00  ..."@...
+000002a8: 12 80 07 00 90 00 0a 00  ........
+000002b0: 00 00 00 00 00 00 00 00  ........
+000002b8: 00 00 00 00 00 00 00 00  ........
+000002c0: ff ff ff ff 00 00 00 00  ........
+000002c8: 10 00 00 00 0c 00 14 00  ........
+000002d0: 06 00 08 00 0c 00 10 00  ........
+000002d8: 0c 00 00 00 00 00 04 00  ........
+000002e0: 34 00 00 00 24 00 00 00  4...$...
+000002e8: 04 00 00 00 01 00 00 00  ........
+000002f0: e8 00 00 00 00 00 00 00  ........
+000002f8: 10 01 00 00 00 00 00 00  ........
+00000300: c8 00 00 00 00 00 00 00  ........
+00000308: 00 00 00 00 08 00 08 00  ........
+00000310: 00 00 04 00 08 00 00 00  ........
+00000318: 04 00 00 00 03 00 00 00  ........
+00000320: 74 00 00 00 38 00 00 00  t...8...
+00000328: 04 00 00 00 a8 ff ff ff  ........
+00000330: 00 00 01 02 10 00 00 00  ........
+00000338: 18 00 00 00 04 00 00 00  ........
+00000340: 00 00 00 00 04 00 00 00  ........
+00000348: 76 61 6c 32 00 00 00 00  val2....
+00000350: 9c ff ff ff 00 00 00 01  ........
+00000358: 40 00 00 00 d8 ff ff ff  @.......
+00000360: 00 00 01 05 10 00 00 00  ........
+00000368: 18 00 00 00 04 00 00 00  ........
+00000370: 00 00 00 00 03 00 00 00  ........
+00000378: 76 61 6c 00 04 00 04 00  val.....
+00000380: 04 00 00 00 10 00 14 00  ........
+00000388: 08 00 06 00 07 00 0c 00  ........
+00000390: 00 00 10 00 10 00 00 00  ........
+00000398: 00 00 01 02 10 00 00 00  ........
+000003a0: 1c 00 00 00 04 00 00 00  ........
+000003a8: 00 00 00 00 02 00 00 00  ........
+000003b0: 69 64 00 00 08 00 0c 00  id......
+000003b8: 08 00 07 00 08 00 00 00  ........
+000003c0: 00 00 00 01 40 00 00 00  ....@...
+000003c8: 00 01 00 00 41 52 52 4f  ....ARRO
+000003d0: 57 31                    W1
+ +Arrow looks quite…intimidating…at first glance. There’s a giant header, and lots of things that don’t seem related to our dataset at all, plus mysterious padding that seems to exist solely to take up space. But the important thing is that **the overhead is fixed**. Whether you’re transferring one row or a billion, the overhead doesn’t change. And unlike PostgreSQL, **no per-value parsing is required**. + +Instead of putting lengths of values everywhere, Arrow groups values of the same column (and hence same type) together, so it just needs the length of the buffer. Strings do still require a length per value, but the overhead isn’t added where it isn’t otherwise needed. And nullability is instead stored in a bitmap, which is omitted if there aren’t any NULL values in the first place, saving space. Because of that, adding more rows doesn’t increase the overhead; instead, the more data you have, the less you pay\! + +Even the header isn’t actually the disadvantage it looks like. The header contains the schema, which makes the data stream self-describing. With PostgreSQL, you need to get the schema from somewhere else. So we aren’t making an apples-to-apples comparison in the first place: PostgreSQL still has to transfer the schema, it’s just not part of the “binary format” that we’re looking at here. + +Meanwhile, there’s actually a more insidious problem with PostgreSQL we’ve overlooked so far: alignment. Remember that 2 byte field count at the start of every row? Well, that means all the 4 byte integers after it are now unaligned…so you can’t use them without copying them (or doing a very slow unaligned load). Arrow, on the other hand, strategically adds some padding (overhead) to align the data, and lets you use little-endian or big-endian byte order depending on your data. And Arrow doesn’t apply expensive encodings to the data that require further parsing; there’s just optional compression that can be enabled if it suits your data[^2]. So **you can use Arrow data as-is without having to parse every value**. + +That’s the benefit of Arrow being a standardized, in-memory data format. Data coming off the wire is already in Arrow format, and can be passed on directly to DuckDB, pandas, polars, cuDF, DataFusion, or any number of systems. Even if the PostgreSQL format fixed many of these problems we’ve discussed—adding padding to align fields, using little-endian or making endianness configurable, trimming the overhead—you’d still end up having to convert the data to another format to use it downstream. And even then, if you really did want to use the PostgreSQL binary format[^3], the documentation rather unfortunately points you to the source code for the details. Arrow, on the other hand, has a specification, documentation, and multiple implementations (including third-party ones) across a dozen languages for you to pick up and use in your own applications. + +Now, we don’t mean to pick on PostgreSQL here. Obviously, PostgreSQL is a full-featured database with a storied history and many happy users. Arrow isn’t trying to compete in that space anyway. But their domains do intersect. PostgreSQL’s wire protocol has [become a de facto standard](https://datastation.multiprocess.io/blog/2022-02-08-the-world-of-postgresql-wire-compatibility.html), with even brand new products like Google’s AlloyDB using it, and so its design affects many projects[^4]. In fact, AlloyDB is a great example of a shiny new columnar query engine being locked behind a row-oriented client protocol from the 90s.
So [Amdahl’s law](https://en.wikipedia.org/wiki/Amdahl's_law) rears its head again—optimizing the “front” and “back” of your data pipeline doesn’t matter when the middle is slow as molasses. + +# A quiver of Arrow (projects) + +So if Arrow is so great, how can we actually use it to build our own protocols? Luckily, Arrow comes with a variety of building blocks for different situations. + +* We just talked about [**Arrow IPC**](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) before. Where Arrow is the in-memory format defining how arrays of data are laid out, er, in memory, Arrow IPC defines how to serialize and deserialize Arrow data so it can be sent somewhere else—whether that means being written to a file, to a socket, into a shared buffer, or otherwise. Arrow IPC organizes data as a sequence of messages, making it easy to stream over your favorite transport, like WebSockets. +* [**Arrow HTTP**](https://github.com/apache/arrow-experiments/tree/main/http) is “just” streaming Arrow IPC over HTTP. It’s standardized so that different clients agree on how exactly to do this, and there are examples of clients and servers across several languages, of using HTTP Range requests, of sending combined JSON and Arrow responses via multipart/mixed requests, and more. While not a full protocol in and of itself, it’ll fit right in when building REST APIs (there’s a short sketch after the summary below). +* [**Dissociated IPC**](https://arrow.apache.org/docs/format/DissociatedIPC.html) combines Arrow with advanced network transports like [UCX](https://openucx.org/) and [libfabric](https://ofiwg.github.io/libfabric/). For those who require the absolute best performance and have the specialized hardware to boot, this allows you to send Arrow data at full throttle, taking advantage of scatter-gather, InfiniBand, and more. +* [**Arrow Flight SQL**](https://arrow.apache.org/docs/format/FlightSql.html) is a fully defined protocol for accessing relational databases. Think of it as an alternative to the full PostgreSQL wire protocol: it defines how to connect to a database, execute queries, fetch results, view the catalog, and so on. For database developers, Flight SQL provides a fully Arrow-native protocol with clients for several programming languages and drivers for ADBC, JDBC, and ODBC—all of which you don’t have to build yourself\! +* And finally, [**ADBC**](https://arrow.apache.org/docs/format/ADBC.html) actually isn’t a protocol. Instead, it’s an API abstraction layer for working with databases (like JDBC and ODBC—bet you didn’t see that coming) that’s Arrow-native and doesn’t require transposing or converting columnar data back and forth. ADBC gives you a single API to access data from multiple databases, whether they use Flight SQL or something else under the hood, and if a conversion is absolutely necessary, ADBC handles the details so that you don’t have to build out a dozen connectors on your own. + +So to summarize: + +* If you’re *using* a database or other data system, you want **ADBC**. +* If you’re *building* a database, you want **Arrow Flight SQL**. +* If you’re working with specialized networking hardware (you’ll know if you are—that stuff doesn’t come cheap\!), you want **Dissociated IPC**. +* If you’re *designing* a REST-ish API, you want **Arrow HTTP**. (gRPC users: stay tuned.) +* And otherwise, you can roll your own with **Arrow IPC**.
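+
+Here’s the kind of exchange that Arrow HTTP standardizes: a rough sketch of serving our demo table as an Arrow IPC stream over plain HTTP, using `pyarrow` and the Python standard library. The port and handler are made up for illustration; the Arrow HTTP examples linked above show how to do this properly.
+
+```python
+import io
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+import pyarrow as pa
+import pyarrow.ipc
+
+# The same three rows we inserted into PostgreSQL earlier.
+TABLE = pa.table({
+    "id": [1, 2, 3],
+    "val": ["foo", "a longer string", "yet another string"],
+    "val2": [64, 128, 10],
+})
+
+class ArrowHandler(BaseHTTPRequestHandler):
+    def do_GET(self):
+        # Serialize the table as an Arrow IPC stream...
+        sink = io.BytesIO()
+        with pa.ipc.new_stream(sink, TABLE.schema) as writer:
+            writer.write_table(TABLE)
+        body = sink.getvalue()
+        # ...and send it as the response body.
+        self.send_response(200)
+        self.send_header("Content-Type", "application/vnd.apache.arrow.stream")
+        self.send_header("Content-Length", str(len(body)))
+        self.end_headers()
+        self.wfile.write(body)
+
+HTTPServer(("localhost", 8008), ArrowHandler).serve_forever()
+```
+
+On the other side, a client reads the response body straight back into a table, with no row-by-row parsing anywhere:
+
+```console
+>>> import urllib.request
+>>> import pyarrow.ipc
+>>> resp = urllib.request.urlopen("http://localhost:8008/")
+>>> table = pyarrow.ipc.open_stream(resp.read()).read_all()
+>>> table.num_rows
+3
+```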
+ +# Conclusion + +Existing client protocols can be absurdly wasteful, but Arrow offers better efficiency and avoids design pitfalls from the past. And Arrow makes it easy to build and consume data APIs with a variety of standards like Arrow IPC, Arrow HTTP, and ADBC. By using Arrow serialization in protocols, everyone benefits from easier, faster, and simpler data access, and we can avoid accidentally holding data captive behind slow and inefficient interfaces. + +[^1]: Of course, it’s not fully wasted, as null/not-null is data as well. But for accounting purposes, we’ll be consistent and call lengths, padding, bitmaps, etc. “overhead”. + +[^2]: And if your data really benefits from heavy compression, you can always use something like Apache Parquet, which implements lots of fancy encodings to save space and can still be decoded to Arrow data reasonably quickly. + +[^3]: [And people do…](https://datastation.multiprocess.io/blog/2022-02-08-the-world-of-postgresql-wire-compatibility.html) + +[^4]: [We have some experience with the PostgreSQL wire protocol, too.](https://github.com/apache/arrow-adbc/blob/ed18b8b221af23c7b32312411da10f6532eb3488/c/driver/postgresql/copy/reader.h)