The idea behind this project is to take inspiration from classic MapReduce, a still widely-used framework for processing very large amounts of data, but to simplify it by removing special framework features without loss of capability, ensuring that there is a "user-space" solution, i.e. the remaining features can still be assembled conveniently to achieve the same ends.
Where a MapReduce framework may cluster data by key without necessarily needing to fully sort them, here we are particularly interested in producing truly sorted output in Parquet, given its usefulness in other operations such as single-pass joins.
We are also interested in supporting incremental updating, such that the sorting workload per update is minimised.
In classic MapReduce an application processes very large amounts of data across multiple nodes by defining a pair of functions:
- `Map`: this accepts a single source key-value pair $(K_s, V_s)$ and outputs zero or more mapped pairs $(K_m, V_m)$, where the data types used in the source and mapped pairs needn't be the same.
- `Reduce`: this accepts a single $K_m$ and a list $[V_m]$, the values in the list all having been emitted by `Map` with the same key, and outputs zero or more $(K_r, V_r)$ pairs, again with no requirement that these match the types of the input or the mapped pairs.
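To make the shapes concrete, here is a minimal sketch of the two signatures as C# delegates; the generic parameter names simply mirror the notation above and are not part of any real MapReduce API.

```csharp
using System.Collections.Generic;

// Sketch only: delegate shapes matching the classic signatures.
// (Ks, Vs) = source pair, (Km, Vm) = mapped pair, (Kr, Vr) = reduced pair.
public delegate IEnumerable<(TKm Key, TVm Value)> Map<TKs, TVs, TKm, TVm>(
    TKs sourceKey, TVs sourceValue);

public delegate IEnumerable<(TKr Key, TVr Value)> Reduce<TKm, TVm, TKr, TVr>(
    TKm mappedKey, IEnumerable<TVm> mappedValues);
```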
The framework is responsible for sorting the output of `Map` by the mapped key $K_m$, so that all the values emitted with the same key can be gathered into a single list and passed to `Reduce`. The work of calling `Map` can easily be distributed across multiple nodes.

It is not explicitly stated whether the calls to `Reduce` can be distributed in the same way, but at minimum all the values for a given key must be brought together on a single node for some `Reduce` stage to work on it. But it is possible to configure the algorithm that chooses the reducer node for each key (a custom partitioner), although of course each `Reduce` call still only receives values with the same key.
Applications may involve several separate `Map`/`Reduce` stages, each stage working from the results of the previous. The classic framework does not support incremental updating; repeat processing is carried out from scratch.
If the signature of `Map` were modified to accept a list of source values associated with a source key, it would have essentially the same shape as `Reduce`. A single function called `Produce` could then take the place of both types of function: it accepts a key and a list of values, and emits target pairs, so the classic `Map`/`Reduce` pattern becomes a special case.

As before the framework provides the "glue" that makes producer stages fit together naturally. The output of a producer is sorted by target key, ready to be grouped by key and fed to the next stage.
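As a sketch of the idea (hypothetical C#, mirroring the delegates above), `Produce` collapses the two shapes into one: it receives a key plus a lazy list of values, and is free to emit pairs of entirely different types.

```csharp
using System.Collections.Generic;

// Sketch: a single Produce function replaces both Map and Reduce. It receives
// a source key and a lazy, forward-only sequence of every value grouped under
// that key, and emits zero or more target pairs of freely chosen types.
public delegate IEnumerable<(TKt Key, TVt Value)> Produce<TKs, TVs, TKt, TVt>(
    TKs sourceKey, IEnumerable<TVs> sourceValues);
```

Classic `Map` corresponds to a `Produce` that handles each value of the group independently; classic `Reduce` corresponds to one that aggregates the whole group.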
It is not strictly necessary to have a configurable algorithm for partitioning: partitioning by key is always sufficient. The user can always emit a compound key to achieve partitioning. Example: we want to know the average salary by age range for persons across an entire country. First solution:
- Producer that receives `person` records in arbitrary batches and emits `(age_range, salary)` per `person`
- Producer that receives an `age_range` key and a list of `salary` values and emits `(age_range, mean(salary))`
Stage 2 is guaranteed to receive all salaries for a given age-range (key-based partitioning), so it can safely compute the average of them. The downside is that there may be only ~10 distinct age-ranges while we have 100 nodes available to share the work. Second solution:
- Producer that receives `person` records in arbitrary batches and emits `((age_range, R), salary)` per `person`, where `R` provides the second part of a compound key `(age_range, R)`. `R` is simply a random number between 0 and 9.
- Producer that receives an `(age_range, R)` key and a list of `salary` values and emits `(age_range, (count(salary), sum(salary)))`, discarding `R`, but also including the count as well as the sum of salaries in the output value.
- Producer that receives an `age_range` key and a list of `(count_salary, summed_salary)` pairs, and emits `(age_range, sum(summed_salary)/sum(count_salary))`, thus finding the final average per age-range.
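A sketch of what the three producers of this second solution could look like, reusing the hypothetical `Produce` shape above; the `Person` record and the stage wiring are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Person(string Name, string AgeRange, decimal Salary);

public static class AverageSalaryExample
{
    static readonly Random Rng = new();

    // Stage 1: per person, emit a compound key (age_range, R) so that work
    // for one age range can be spread across up to 10 partitions.
    public static IEnumerable<((string AgeRange, int R) Key, decimal Value)> Stage1(
        int ignoredKey, IEnumerable<Person> people) =>     // source key is ignored
        people.Select(p => ((p.AgeRange, Rng.Next(0, 10)), p.Salary));

    // Stage 2: partial aggregation per (age_range, R); keep count and sum,
    // because averages of subset averages would be wrong without counts.
    public static IEnumerable<(string Key, (long Count, decimal Sum) Value)> Stage2(
        (string AgeRange, int R) key, IEnumerable<decimal> salaries)
    {
        long count = 0; decimal sum = 0;
        foreach (var s in salaries) { count++; sum += s; }
        yield return (key.AgeRange, (count, sum));
    }

    // Stage 3: combine the partial (count, sum) pairs into the final mean.
    public static IEnumerable<(string Key, decimal Value)> Stage3(
        string ageRange, IEnumerable<(long Count, decimal Sum)> partials)
    {
        long count = 0; decimal sum = 0;
        foreach (var (c, s) in partials) { count += c; sum += s; }
        yield return (ageRange, count == 0 ? 0 : sum / count);
    }
}
```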
Thus by optionally including a partitioning factor in the key, we retain the ability to split work between nodes. Note that the "tricky" part of solving this problem (realising that you can't simply average a set of averages obtained from subsets, and you must know the count per subset as well, i.e. knowing what an associative operator is) is ever-present in classic MapReduce. The concept of a custom partitioner does not solve that problem; it only adds feature-baggage to the framework without enabling any capability that cannot already be solved quite easily in user-space.
In classic MapReduce there is an input reader stage, which is responsible for presenting source key-value pairs to the `Map` function. No requirements are usually imposed on the ordering of this data.

Here we introduce a rule that this initial set of data must be ordered by key, just as the output of every `Produce` stage is.

At first glance this imposes a difficult responsibility on the user: pre-sorting the data set, which may be very large. But the framework can take care of this. The list provided to `Produce` is a single-use forward-iterable lazy sequence, and thus does not need to be all held in memory, so it can be arbitrarily large, i.e. it could be the entire input dataset, all associated with the same arbitrary key. The user can therefore pass into the framework a sequence of pairs `(1, person)`, i.e. the constant key `1` paired with every `person` record. This trivially meets the requirement of being sorted by key. The first `Produce` function will be called once, receiving the lazy sequence of all the `person`s, and can then emit with whatever key from the `person` would be most useful to sort by, and thus the framework takes care of all large-scale sorting responsibilities.
Example: we have an unsorted file of PII and we want to find all groups of persons who claim to have the same email address. This is an example of a problem where the entire list associated with a given key must be presented to a single `Produce` call.
- Application lazily streams `person` records from a file, combines each with the constant key $1$ on-the-fly, and presents the resulting stream of `(1, person)` pairs to the framework as input
- Producer receives a $K$ that it can ignore (it's always `1`) and a `[person]` list, and emits `(email, person)` per person
- Producer receives a $K$ that is the `email` and a list of `person`s with that email. If the list contains two or more persons it emits `(email, person)` per person, otherwise it emits nothing.
Step 2 uses the framework to take care of the sorting problem.
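A sketch of the two producers (the `PersonRecord` type and its `Email` property are invented for illustration; step 3's producer holds back only one record so the lazy input list is still consumed in a single forward pass).

```csharp
using System.Collections.Generic;
using System.Linq;

public record PersonRecord(string Name, string Email);

public static class SharedEmailExample
{
    // Stage 2: the key is the constant 1 and can be ignored; re-key every
    // person by email so the framework sorts the whole dataset by email.
    public static IEnumerable<(string Key, PersonRecord Value)> ByEmail(
        int ignoredKey, IEnumerable<PersonRecord> people) =>
        people.Select(p => (p.Email, p));

    // Stage 3: receives all persons claiming one email address; emits the
    // group only if it contains at least two persons.
    public static IEnumerable<(string Key, PersonRecord Value)> SharedOnly(
        string email, IEnumerable<PersonRecord> people)
    {
        PersonRecord? first = null;   // hold back the first person...
        var pairFound = false;        // ...until a second one proves a group
        foreach (var p in people)
        {
            if (!pairFound && first is null) { first = p; continue; }
            if (!pairFound) { yield return (email, first!); pairFound = true; }
            yield return (email, p);
        }
    }
}
```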
If two producers emit compatible (i.e. identical) key and value types, their sorted outputs can be merged on-the-fly in a single parallel scan and presented to a subsequent stage as if they were one source.

A join uses a `Produce` function that takes two or more source groups, so it accepts a key along with a tuple of value lists, one list per source. The `Produce` function described up to this point is simply a special case where there is only one source.

Note that as the lists of values provided to `Produce` are implemented as single-use forward-iterable lazy sequences, the only way to produce the cartesian product of two sets of values is to make a copy in memory of one of the lists first. This may not be much of a limitation in practice; often the problem to be solved involves a one-to-many relationship, in which case only one value from one list has to be cached for reuse across the other list. In any case, these concerns are left to the author of the `Produce` function.
Note: this section can be skipped.
The formal way to define a pipeline of chained transformations is through monads. A monad consists of three things:

- A type `M<T>` that "amplifies" an underlying type `T` (e.g. `T[]`, `Promise<T>`)
- A `unit` operation that converts a plain `T` into an `M<T>` (e.g. `[x]`, `Promise.resolve(x)`)
- A `bind` operation that takes a user-supplied function `F: T -> M<R>` and an instance of `M<T>` and returns an `M<R>`, where `R` is an underlying type that need not be the same as `T`.
The `bind` may perform characteristic computation of its own, e.g. a `T[]` can contain multiple `T`s, so once it has passed all the `T`s through `F` it has multiple `R[]` results, which it concatenates into a single `R[]` (this operation is called `flatMap` on several platforms).
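As an illustration, the array/list monad can be sketched in a few lines of C#, where `SelectMany` plays the role of `flatMap`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListMonad
{
    // unit: wrap a plain T in the "amplified" type IEnumerable<T>.
    public static IEnumerable<T> Unit<T>(T value)
    {
        yield return value;
    }

    // bind: apply F : T -> IEnumerable<R> to every element and flatten the
    // resulting sequences into one (i.e. flatMap / SelectMany).
    public static IEnumerable<R> Bind<T, R>(
        IEnumerable<T> source, Func<T, IEnumerable<R>> f) =>
        source.SelectMany(f);
}

// Usage: Bind(Unit(3), x => new[] { x, x * 10 }) yields 3, 30.
```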
Now perform this conceptual substitution:

- The `T` is a $(K_s, [V_s])$, i.e. a source key and all the source values associated with that key.
- The `M<T>` is the complete dataset of all the groups presented as input to the framework, which must be ordered by the groups' keys.
- The `unit` operation creates a dataset with a single group and an arbitrary $K_s$ (this is the initial input sorting trick discussed above.)
- The user-supplied function `F` is the `Produce` function (but see note below.)
- The `bind` operation is the framework-supplied logic that acts on each group with the `Produce` function, and sorts by the output keys to form new groups for the output `M`.
Note that for convenience the user doesn't have to return groups from their `Produce` function in the correct order by target key. The framework has to sort the combined output of many `Produce` calls anyway, so there is no advantage (and a huge disadvantage) in making life difficult for the author of a `Produce` function.
These two representations are clearly very easy to convert between on-the-fly:

- flat: a list of $(K, V)$ pairs sorted by $K$
- grouped: a list of groups, each having a $K$ and a $[V]$, the groups being sorted by $K$
They are so close to being the same thing, the difference between them is a matter of interpretation.
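A minimal sketch of the flat-to-grouped direction, relying only on the input being sorted by key; for simplicity it materialises each group into a list, whereas the framework exposes groups as lazy cursors (see below). The grouped-to-flat direction is just flattening.

```csharp
using System.Collections.Generic;

public static class Grouping
{
    // Turn a key-sorted flat sequence of (K, V) pairs into a sequence of
    // (K, [V]) groups. Because the input is sorted, one forward pass is
    // enough: adjacent equal keys form each group.
    public static IEnumerable<(TK Key, List<TV> Values)> ToGroups<TK, TV>(
        IEnumerable<(TK Key, TV Value)> sortedPairs)
    {
        var comparer = EqualityComparer<TK>.Default;
        var hasCurrent = false;
        TK currentKey = default!;
        var values = new List<TV>();

        foreach (var (key, value) in sortedPairs)
        {
            if (hasCurrent && !comparer.Equals(key, currentKey))
            {
                yield return (currentKey, values);
                values = new List<TV>();
            }
            currentKey = key;
            hasCurrent = true;
            values.Add(value);
        }
        if (hasCurrent) yield return (currentKey, values);
    }
}
```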
We use Parquet files with a schema divided into key and value columns, the rows being sorted by key.

The `Produce` function is only allowed to forward-scan the values of a single group, so the framework wraps the real cursor over the file in a restricted cursor that:

- only exposes the $V$ of each underlying pair, and
- claims to be exhausted when the $K$ changes.

This wrapped cursor can be passed to the `Produce` function along with the real cursor's current $K$. A single forward scan of the input can therefore call `Produce` multiple times, once for each distinct key.

The output pairs emitted by `Produce` can be buffered in memory up to some large number, sorted by target key and written out, the sorted chunks then being merged to form the stage's key-sorted output.
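A sketch of such a key-bounded wrapper; `IKeyValueCursor` is a hypothetical stand-in for the real forward-only Parquet cursor, and the driver loop that repositions it between groups is omitted.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a forward-only cursor over a key-sorted file.
public interface IKeyValueCursor<TK, TV>
{
    bool MoveNext();            // advance to the next pair; false at end of file
    TK CurrentKey { get; }
    TV CurrentValue { get; }
}

public static class KeyBoundedScan
{
    // Expose only the values of the current key run. The sequence claims to
    // be exhausted as soon as the underlying key changes, so a Produce call
    // can never read beyond its own group. Assumes the cursor is already
    // positioned on the first pair of the run.
    public static IEnumerable<TV> ValuesForCurrentKey<TK, TV>(
        IKeyValueCursor<TK, TV> cursor, TK key)
    {
        var comparer = EqualityComparer<TK>.Default;
        do
        {
            if (!comparer.Equals(cursor.CurrentKey, key)) yield break;
            yield return cursor.CurrentValue;
        }
        while (cursor.MoveNext());
    }
}
```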
An incremental update is a subsequent processing run that accepts a new set of key-sorted source pairs, typically covering only a subset of the keys.

It is only possible to replace all the values for a given key. That is, if the initial input included a million values with key `k` and an update later supplies two values with key `k`, then after the update the source dataset contains only those two values for `k`.
It follows that the first stage's source should have some level of key granularity or else incremental updating is impossible. Example: we receive daily batches of records. Therefore the date could serve as the key that is associated with all the records in the daily file, so no pre-sorting logic is necessary, but it is then possible to correct all the data for a specific day (or set of days) in a subsequent update.
We also need to be able to delete the values associated with a key, without providing new values. So we extend the input format so that every pair carries a change type, one of:

- `Add` - the key is one not present in the source dataset until now
- `Update` - the key is already present in the source dataset and its values are being replaced
- `Delete` - the key (and all its values) are being removed from the source dataset
Ideally we would say the type of an update is one of:

- $(Add, K, V)$
- $(Update, K, V)$
- $(Delete, K)$

i.e. the `Delete` carries no value. But our target format, Parquet, is columnar, so there will be a value (e.g. `null`) but it will be ignored by the consumer.
Note that when feeding updates to the framework, it is not necessary to specify `Add` or `Update` correctly; they have the same effect (the reason they exist as separate types of change will be dealt with below.) We can use either of them to describe an "upsert".
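In code, a change row might be modelled like this sketch (the names are illustrative; in the Parquet file the change type is just another column, and the value column is null for deletions):

```csharp
public enum ChangeType { Add, Update, Delete }

// One row of an incremental update: (change type, key, value). For a Delete
// the Value column is present but ignored, e.g. stored as null.
public record Change<TK, TV>(ChangeType Type, TK Key, TV? Value);
```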
To avoid ambiguity, we require that the source data for an incremental update must contain only one of the following per key `k`:

- a single deletion pair `(k, null)`
- any number of upsert pairs `(k, v1)`, `(k, v2)`, ...
That is, there is no mixing of deletions and upserts for the same key. There is never a need to mix them because the presence of any upsert for a key automatically implies the deletion of all previous values for that key. The only reason for sending a deletion pair for some key `k` is because there are now no values associated with `k`.
To implement this pattern, the framework needs to store more information. The challenge is that a source value may have contributed to target pairs under completely different keys, chosen arbitrarily at runtime by the `Produce` function, so when a source key's values change the framework must be able to find and remove the previous contributions of that source key.
The following scheme is adopted. The Parquet containing the ultimate result of a producer (known as the content) has the logical schema $(K_t, K_s, V_t)$, i.e. each target pair is stored along with the source key that produced it, sorted by target key.

In addition we store a second Parquet file known as the key-mappings, with logical schema $(K_s, K_t)$, sorted by source key.

The framework is thus able to perform a single-pass parallel scan of all the incoming update pairs (which are sorted by source key) alongside the key-mappings, discovering for each updated source key which target keys it previously produced.
The intermediate output of this discovery process, combined with the output of `Produce`, is a set of internal instructions for how to update the target data. It consists of specific deletions and "upserts" (inserts or updates) that refer not just to the target key, but to the combination of source key and target key.

It is this set of instructions that must be sorted, in contrast to the first implementation that sorted the entire dataset. Two different versions of the instruction set are created, one sorted by target key (to apply to the content) and one sorted by source key (to apply to the key-mappings).
This design is predicated on these assumptions:
- Performing forward scans through datasets is a comparatively fast operation
- Sorting datasets is a comparatively slow operation
- Incremental updates will affect a small subset of the keys in each stage, although they will be scattered unpredictably through the sort order
Therefore it only sorts the intermediate sets of instructions describing how the content and key-mappings are to be updated, which cover only the affected keys, and then applies them to the existing files with forward merging scans.
That is, suppose the existing data has these content records
(105, 6, "apple")
(105, 9, "banana")
(108, 4, "mango")
And therefore these key-mappings,
(4, 108)
(6, 105)
(9, 105)
The update replaces the values for source key `6`. As the update is scanned in parallel with the key-mappings, we discover that `6` previously produced a target key `105` and so we generate an intermediate instruction to delete `(6, 105)` from the key-mappings and to delete `(105, 6, *)` from the content. None of this has any effect on the record that matches `(105, 9, *)`.
A multi-stage update is like a transaction that involves updating every producing stage once. Producer stages form a directed acyclic graph, so the stage that accepts external updates is updated first, then the stages that feed from it, and so on through the whole graph. The states of the whole dataset (consisting of multiple producers) can therefore be thought of as a series of versions `1, 2, 3...`. In each version every producer has three Parquet files: content, key-mappings and update. To produce version `n+1`, each producer consumes the version `n+1` update of the stage(s) it feeds from (or the external update, for the first stage) together with its own version `n` content and key-mappings, and writes new versions of all three files.
It follows that any subsequent processing stages, which maintain their own content and key-mappings from previous updates, do not want to consume the entire dataset from a previous stage. Instead, they want an incremental update in the form of a sequence of changes, in which each affected key contributes either a single `Delete` change or one or more `Add`/`Update` changes. Also if there has been any change to the set of values for a key, then all the remaining values of that key must appear in the update.
Therefore, so that one producer can feed another, there is an additional output available from a producing stage, known as the update, which conforms to the same change format as the external input described above: (change type, key, value) rows sorted by key.
Although the ability of the type to distinguish between `Add` and `Update` is irrelevant to the framework when consuming these changes, when generating them into the update Parquet it specifies the type accurately. This makes the update Parquet useful for feeding external processes. Example: an RDBMS table must contain a row for every key in the dataset. This can be achieved by scanning the update Parquet looking for `Add` and `Delete` changes (ignoring `Update`s). In a given wave of changes, a key can only appear as one `Add` or one `Delete`, but never both, making it exceedingly simple to ship these instructions into a pair of SQL set-based operations. Also they will already be sorted, which can help speed up such inserts. If a run of values with the same key appears for the first time, only the first change will have the type `Add`.
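A sketch of such an external consumer, reusing the hypothetical `Change`/`ChangeType` shapes from earlier; it reads one wave of update rows (already key-sorted) and turns them into two set-based SQL statements. The `urls` table and column names are invented, and real code would bind the key values as parameters rather than only emitting placeholders.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class UrlTableSync
{
    // Turn one wave of update rows into two set-based SQL statements:
    // Adds insert brand-new keys, Deletes remove keys that no longer exist,
    // Updates are ignored because the row for that key already exists.
    public static (string InsertSql, string DeleteSql) BuildSql(
        IEnumerable<Change<string, string>> updateRows)
    {
        var adds = new List<string>();
        var deletes = new List<string>();

        foreach (var change in updateRows)        // rows arrive sorted by key
        {
            if (change.Type == ChangeType.Add) adds.Add(change.Key);
            else if (change.Type == ChangeType.Delete) deletes.Add(change.Key);
        }

        // Placeholders only; attach the collected keys as SQL parameters.
        var insertSql = adds.Count == 0 ? "" :
            "INSERT INTO urls (url) VALUES " +
            string.Join(", ", adds.Select((_, i) => $"(@add{i})"));
        var deleteSql = deletes.Count == 0 ? "" :
            "DELETE FROM urls WHERE url IN (" +
            string.Join(", ", deletes.Select((_, i) => $"@del{i}")) + ")";

        return (insertSql, deleteSql);
    }
}
```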
Note that this update is not identical to any previously described dataset such as the intermediate instructions. We noted above that if the source key `6` has changed, and the previous content was:
(105, 6, "apple")
(105, 9, "banana")
This only requires the intermediate instructions to refer to the row(s) matching `(105, 6, *)`, i.e. with the same source key.
Whereas a subsequent stage needs to be passed the first and third columns (target key and value) from all the rows matching `(105, *, *)`, because it doesn't care what the previous stage's source keys are (that is an internal detail of the previous stage). It requires the full set of key-value pairs for each affected key, i.e.:
(Update, 105, "apple")
(Update, 105, "banana")
We previously mentioned how simple it is to do a parallel merge across multiple prior stages with the same key-value types and present the merged data to a subsequent stage.
This is less simple under incremental merging. The solution has three layers to it. We refer to the prior stages as feeders:
- Produce the combined distinct list of all keys that were affected by the current update, across all feeders. This is the list of updated keys.
- In each feeder, set up a three-way parallel scan between the updated keys, the feeder's content and the feeder's update. For each key in the updated keys, if the feeder has updates for that key, emit them; otherwise emit all the feeder's content for that key as if they were updates. The resulting sequence is the augmented updates for the feeder (sketched in code below).
- Perform a parallel merge across the augmented updates from all the feeders, which is the effective merged update. This layer must take care to emit either a single deletion or one or more upserts per key.
Why does each feeder's augmented updates sequence need to potentially include a mixture of updates and static content? Because when the consuming `Produce` function is fed the values for a key, the list must include all the values for that key.
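A deliberately simplified sketch of layer 2 for a single feeder, reusing the hypothetical `Change`/`ChangeType` shapes from earlier; it uses in-memory lookups where the real implementation would run sorted parallel scans over the Parquet files.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class AugmentedUpdates
{
    // For each key affected anywhere in this wave, emit the feeder's own
    // updates if it has any for that key; otherwise emit the feeder's
    // existing content for that key, re-labelled as updates, so the
    // consumer always sees the full value list per affected key.
    public static IEnumerable<Change<TK, TV>> ForFeeder<TK, TV>(
        IEnumerable<TK> updatedKeys,                   // layer 1: distinct, sorted
        ILookup<TK, TV> feederContent,                 // feeder's current content
        ILookup<TK, Change<TK, TV>> feederUpdates)     // feeder's update this wave
    {
        foreach (var key in updatedKeys)
        {
            if (feederUpdates.Contains(key))
            {
                foreach (var change in feederUpdates[key])
                    yield return change;
            }
            else
            {
                foreach (var value in feederContent[key])
                    yield return new Change<TK, TV>(ChangeType.Update, key, value);
            }
        }
    }
}
```

Layer 3 would then be an ordinary k-way merge of these key-sorted sequences, dropping a deletion from one feeder whenever another feeder still supplies upserts for the same key.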
In this type of framework keys are usually "natural", being supplied externally. Sometimes it is useful to be able to allocate a surrogate key, such as an incrementing integer or a sequential GUID. This is particularly useful when preparing data for an RDBMS.
The challenge is to augment the source data with extra information, which means that the target data is no longer strictly a pure function of the source, and for that information to be preserved across updates. Specifically, the surrogate value allocated for a target key must be retained by that key in all subsequent versions.

The solution has to work outside of the `Produce` function, as that executes within a scan of the source groups, before the output has been regrouped by target key. Instead there is a `PreserveKeyValues` hook that executes later, while the updated target data is being written, and has the shape `Action<TV, TV?>`. The first parameter is the value about to be stored in the updated target data, and the second is some randomly chosen value from the previous state associated with the same target key, or else `null` if the key is novel. Example:
```csharp
(target, example) => target.Id = example?.Id ?? nextId++
```
The above copies an `Id` property or substitutes a new incremented value. (The matter of persisting the counter between updates is currently outside the scope of this framework.)

The upshot is that the affected property (`Id` in the above example) is repeated identically across all rows with the same target key.
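A sketch of how the hook might be applied while one target-key group is being written; apart from the `PreserveKeyValues` idea and the `Action<TV, TV?>` shape described above, the names here (`ApplyHook`, `UrlRow`) are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative target value type carrying a surrogate integer Id.
public class UrlRow
{
    public long Id;
    public string Url = "";
}

public static class SurrogateKeys
{
    // Apply the PreserveKeyValues hook to every new value in one target-key
    // group. 'previousExample' is some value stored for this target key in
    // the previous version, or null when the key is novel.
    public static IEnumerable<TV> ApplyHook<TV>(
        IEnumerable<TV> newValuesForKey,
        TV? previousExample,
        Action<TV, TV?> preserveKeyValues) where TV : class
    {
        foreach (var value in newValuesForKey)
        {
            preserveKeyValues(value, previousExample);  // e.g. copy or allocate Id
            yield return value;
        }
    }
}

// The hook from the text, applied to UrlRow:
//   long nextId = 1000;
//   Action<UrlRow, UrlRow?> hook =
//       (target, example) => target.Id = example?.Id ?? nextId++;
```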
This feature plays well with the idea of syncing keys to an external RDBMS. Suppose the key is some identifier of real-world significance (i.e. a natural key) such as a URL. The above technique may be used to automatically generate a compact integer ID for each distinct URL. Then the update stream can be read to find the `Add` and `Delete` changes that signal the need to create or remove records in the URL table, which can use the integer ID as its primary key.
It was noted above that a `Produce` stage can accept multiple sources with different value types, thus acting as a join. This concept also requires revisiting in the face of incremental updates.