Merge pull request #537 from ska-sa/NGC-573-multi-streams
Refactor fgpu to support multiple output streams
bmerry authored Mar 29, 2023
2 parents 282a278 + 12ee799 commit 30219b1
Showing 10 changed files with 895 additions and 761 deletions.
6 changes: 3 additions & 3 deletions doc/conf.py
@@ -119,9 +119,9 @@

todo_include_todos = True

# Adds \usetikzlibrary{...} to the latex preamble. We need "chains" for
# rendering flowcharts.
tikz_tikzlibraries = "chains"
# Adds \usetikzlibrary{...} to the latex preamble. We need "chains" and
# "fit" for rendering flowcharts.
tikz_tikzlibraries = "chains,fit"

# Force MathJax to render as SVG rather than CHTML, to work around
# https://github.com/mathjax/MathJax/issues/2701
63 changes: 46 additions & 17 deletions doc/engines.rst
@@ -81,42 +81,63 @@ The general operation of the DSP engines is illustrated in the diagram below:

.. tikz:: Data Flow. Double-headed arrows represent data passed through a
queue and returned via a free queue.
:libs: chains
:libs: chains, fit

\tikzset{proc/.style={draw, rounded corners, minimum width=4.5cm, minimum height=1cm},
pproc/.style={proc, minimum width=2cm},
pproc-base/.style={minimum width=2cm, minimum height=1cm},
pproc/.style={proc, pproc-base},
flow/.style={->, >=latex, thick},
queue/.style={flow, <->},
fqueue/.style={queue, color=blue}}
\node[proc, start chain=going below, on chain] (align) {Align, copy to GPU};
\begin{scope}[start chain=chain going below]
\node[proc, on chain] (align) {Align, copy to GPU};
\node[pproc, draw=none, anchor=west,
start chain=rx0 going above, on chain=rx0] (align0) at (align.west) {};
\node[pproc, draw=none, anchor=east,
start chain=rx1 going above, on chain=rx1] (align1) at (align.east) {};
\node[proc, on chain] (process) {GPU processing};
\node[proc, on chain] (download) {Copy from GPU};
\node[proc, on chain] (transmit) {Transmit};
\node[pproc, draw=none, anchor=west,
start chain=tx0 going below, on chain=tx0] (transmit0) at (transmit.west) {};
\node[pproc, draw=none, anchor=east,
start chain=tx1 going below, on chain=tx1] (transmit1) at (transmit.east) {};
\begin{scope}[start branch=stream0 going below]
\node[proc, on chain=going below left] (process0) {GPU processing};
\end{scope}
\begin{scope}[start branch=stream1 going below]
\node[proc, on chain=going below right] (process1) {GPU processing};
\end{scope}
\foreach \s in {0, 1} {
\begin{scope}[continue chain=chain/stream\s]
\node[proc, on chain] (download\s) {Copy from GPU};
\node[proc, on chain] (transmit\s) {Transmit};
\node[pproc, draw=none, anchor=west,
start chain=tx\s-0 going below, on chain=tx\s-0] (transmit\s-0) at (transmit\s.west) {};
\node[pproc, draw=none, anchor=east,
start chain=tx\s-1 going below, on chain=tx\s-1] (transmit\s-1) at (transmit\s.east) {};
\foreach \i in {0, 1} {
\node[pproc-base, on chain=tx\s-\i] (outstream\s-\i) {};
\draw[flow] (transmit\s-\i) -- (outstream\s-\i);
}
\draw[queue] (align) -- (process\s);
\draw[queue] (process\s) -- (download\s);
\draw[queue] (download\s) -- (transmit\s);
\end{scope}
}
\node[proc, fit=(outstream0-0) (outstream1-1), inner sep=0pt, outer sep=0pt] (outstream) {};
\node at (outstream.center) {Stream};
\foreach \i in {0, 1} {
\node[pproc, on chain=rx\i] (receive\i) {Receive};
\node[pproc, on chain=rx\i] (stream\i) {Stream};
\node[pproc, on chain=tx\i] (outstream\i) {Stream};
}
\foreach \i in {0, 1} {
\draw[flow] (stream\i) -- (receive\i);
\draw[queue] (receive\i) -- (align\i);
\draw[flow] (transmit\i) -- (outstream\i);
}
\draw[queue] (align) -- (process);
\draw[queue] (process) -- (download);
\draw[queue] (download) -- (transmit);
\end{scope}

The F-engine uses two input streams and aligns two incoming polarisations, but
in the XB-engine there is only one.

There might not always be multiple processing pipelines. When they exist, they
are to support multiple outputs generated from the same input, such as wide-
and narrow-band F-engines, or multiple beams. A single stream is used so that
all the outputs go through a single thread (and hence only one core is needed
for sending) and a single rate-limiter (preventing micro-bursts if each
pipeline sends data at the same time).
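
The micro-burst argument can be illustrated with a small, purely hypothetical
sketch. The helper names ``paced_send_times`` and ``simultaneous_sends`` and
the rate and packet size are assumptions made for illustration and are not
part of the engine code. It compares two pipelines that each pace themselves
independently with two pipelines that share one rate-limiter:

```python
def paced_send_times(n_packets, packet_bytes, rate):
    """Send times (in seconds) for packets spaced evenly by one rate-limiter."""
    interval = packet_bytes / rate
    return [i * interval for i in range(n_packets)]


def simultaneous_sends(per_pipeline_times):
    """Count the instants at which more than one pipeline transmits."""
    seen = {}
    for times in per_pipeline_times:
        for t in times:
            seen[t] = seen.get(t, 0) + 1
    return sum(1 for count in seen.values() if count > 1)


RATE = 1000.0  # bytes per second (illustrative value only)
PACKET = 1000  # bytes per packet (illustrative value only)

# Two independent limiters, each given half the rate: both pipelines
# transmit at exactly the same instants, doubling the instantaneous rate.
separate = [paced_send_times(3, PACKET, RATE / 2) for _ in range(2)]

# One shared limiter at the full rate, with the pipelines taking
# alternate slots: no two packets leave at the same instant.
shared_times = paced_send_times(6, PACKET, RATE)
shared = [shared_times[0::2], shared_times[1::2]]
```

Both arrangements have the same average rate; only the shared limiter keeps
the packets evenly spaced.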

Chunking
^^^^^^^^
GPUs have massive parallelism, and to exploit them fully requires large batch
@@ -144,6 +165,14 @@ cause back-pressure on up-stream components by not returning buffers through
the free queue fast enough. The number of buffers needs to be large enough to
smooth out jitter in processing times.

A special case is the split from the receiver into multiple processing
pipelines. In this case each processing pipeline has an incoming queue with new
data (and each buffer is placed in each of these queues), but a single queue
for returning free buffers. Since a buffer can only be placed on the free queue
once it has been processed by all the pipelines, a reference count is held with
the buffer to track how many usages it has. This should not be confused with
the Python interpreter's reference count, although the purpose is similar.
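
As a minimal sketch of this scheme (not the actual implementation), the
following uses ``asyncio`` queues. The ``Chunk`` class, its ``refs`` field and
the ``pipeline`` and ``run`` functions are hypothetical names chosen for
illustration. Each buffer is placed on every pipeline's incoming queue with
its count set to the number of pipelines, and whichever pipeline drops the
count to zero returns the buffer to the single free queue:

```python
import asyncio


class Chunk:
    """Hypothetical buffer with a usage count (not the Python refcount)."""

    def __init__(self, name):
        self.name = name
        self.refs = 0


async def pipeline(in_queue, free_queue, log):
    """Process chunks, freeing each one when no other pipeline still uses it."""
    while True:
        chunk = await in_queue.get()
        if chunk is None:  # sentinel: no more data
            break
        log.append(chunk.name)  # stand-in for the real processing
        chunk.refs -= 1
        if chunk.refs == 0:
            # Last user: only now may the buffer be reused by the receiver.
            free_queue.put_nowait(chunk)


async def run(n_pipelines=2, n_chunks=3):
    free_queue = asyncio.Queue()  # single queue for returning buffers
    in_queues = [asyncio.Queue() for _ in range(n_pipelines)]
    logs = [[] for _ in range(n_pipelines)]
    tasks = [
        asyncio.create_task(pipeline(q, free_queue, log))
        for q, log in zip(in_queues, logs)
    ]
    for i in range(n_chunks):
        chunk = Chunk(f"chunk{i}")
        chunk.refs = n_pipelines  # one usage per pipeline
        for q in in_queues:
            q.put_nowait(chunk)  # the same buffer goes on every queue
    for q in in_queues:
        q.put_nowait(None)
    await asyncio.gather(*tasks)
    freed = []
    while not free_queue.empty():
        freed.append(free_queue.get_nowait().name)
    return logs, freed
```

Calling ``asyncio.run(run())`` returns the per-pipeline processing logs and
the names of the freed buffers. Because all the coroutines share one event
loop and there is no ``await`` between the decrement and the test, no lock is
needed in this sketch; code that shared buffers between threads would need
one.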

Transfers and events
^^^^^^^^^^^^^^^^^^^^
To achieve the desired throughput it is necessary to overlap transfers to and