
Performance with Coniql

Will Rogers edited this page Nov 25, 2020 · 1 revision

In addition to the performance constraints on the React side, the application also needs to satisfy use cases which involve displaying large amounts of quickly updating data. There are sensors and detectors which produce relatively high quality images which update at around 10 Hz. To show this information in the browser requires moving it across the network from some kind of backend which communicates with EPICS. Currently, that backend is Coniql.

This page specifies the methodology and some results around the questions:

  • what is the maximum data transfer rate we can achieve with our GraphQL setup?
    • what is the biggest waveform we can send at an update rate of 10 Hz?
  • how many subscriptions can we support without dropping the update rate?
    • how does this change with the size of each subscription?
  • what data rate can the client handle and how does this compare to server performance?
    • where is our overall bottleneck likely to be?

There are many different axes to test along and neither this page nor our work is likely to cover all of them; it is important to understand what we are testing and how we are testing it so that we can improve and make decisions in the future.

Description of test clients

pyThroughputTest.py

Uses py-graphql-client for the web socket subscriptions.

Sets the subscription going and has a callback which increments a variable. Data loading is always performed within the py-graphql-client code, and the callback can be used to implement the decoding and verification steps.
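The counting-callback approach can be sketched in plain Python. The py-graphql-client wiring itself is omitted here; `MessageCounter` and its method names are illustrative, not part of any library API:

```python
import time

class MessageCounter:
    """Callback target: counts subscription messages and reports frequency."""

    def __init__(self):
        self.count = 0
        self.start = time.monotonic()

    def on_message(self, _id=None, data=None):
        # The subscription library invokes this once per message;
        # decoding/verification of `data` could be added here.
        self.count += 1

    def frequency(self):
        """Messages per second since the counter was created."""
        elapsed = time.monotonic() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0

counter = MessageCounter()
for _ in range(10):  # stand-in for 10 subscription messages arriving
    counter.on_message(data={"value": "..."})
print(counter.count)  # -> 10
```

In the real client, `counter.on_message` would be passed as the subscription callback, and `counter.frequency()` read at the end of the run.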

jsThroughputTest.js

This client runs as a node.js script, using the graphqurl package, which is a thin wrapper over the well-known Apollo client.

The subscription returns an Observable object, which invokes a handler each time the observable (subscription) produces a new value. We also create a promise which resolves after a specified timeout has elapsed; this promise then returns the calculated frequency.

Loading is always performed, decoding and verification can be enabled as needed.

asyncClient.py

This Python client uses some of the GraphQL constants specified in py-graphql-client, but uses asyncio to perform the subscriptions via the websockets package.

It is set up to measure the time taken for a specific number of messages to arrive, and then calculates the frequency from that time. This makes the runtime slightly unpredictable.
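The measurement strategy of the asyncio client can be sketched as follows, with the websocket stream replaced by a simulated async message source so the example is self-contained. All names, the message count, and the interval are illustrative assumptions:

```python
import asyncio
import time

async def simulated_messages(n, interval):
    # Stand-in for the websockets subscription stream: yields n messages,
    # one every `interval` seconds.
    for i in range(n):
        await asyncio.sleep(interval)
        yield {"seq": i}

async def measure(source, n_messages):
    """Time the arrival of a fixed number of messages, then derive the rate."""
    start = time.monotonic()
    received = 0
    async for _msg in source:
        received += 1
        if received >= n_messages:
            break
    elapsed = time.monotonic() - start
    return received / elapsed  # messages per second

freq = asyncio.run(measure(simulated_messages(20, 0.01), 20))
print(f"{freq:.1f} Hz")  # roughly 100 Hz for a 10 ms interval, minus overhead
```

The real client replaces `simulated_messages` with an `async for` over the websocket connection, which is why the total runtime depends on how quickly the fixed number of messages arrives.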

Limitations

Due to the different frameworks in use, the clients are all testing slightly different things. For example, the JS client always loads the JSON message into an object. For large waveforms, this may have a significant effect.

As we will see below, the decoding step adds significant computing cost to the process, so different decoding implementations may lead to different measurements.

As the server and client are both running on the same machine, the network connection is unlikely to be a source of delay, although competition for other computer resources could potentially be hindering the ability to process large waveforms. The network is another aspect which requires testing, as it will be a relevant concern upon deployment.

Improvements

Websocket subscriptions with GraphQL are apparently now better supported in the v3 alpha release of gql, a widely used Python client. It would be worth implementing a client with it and comparing it to the basic websockets implementation described above.

Ideally, all of the clients would have very similar structures, allowing the same control of parameters and functionality. At the moment this is not possible because of the packages being used to implement the GraphQL calls in the JS client, but perhaps a more direct approach is possible which would allow it.

Methodology for testing

Setup

  • Check out the git repository for Coniql
  • Check out the benchmarking or benchmarking-tartiflette branch depending on which version of the server you would like to test
  • Install the python environment - pipenv install --dev
  • Install node.js on your machine and install the necessary packages - npm install --dev

Run

  • Start the server running with pipenv run python -m coniql.server or pipenv run python -m coniql for tartiflette
  • Open a new terminal in which to run the clients
  • JS Client - node benchmark/jsThroughput.js
  • Python Client - pipenv run python benchmark/asyncClient.py

Measure

At the end of the run, the client will print the measured frequency of messages to the terminal. This can easily be imported into a spreadsheet program. Run the client 3 times and take the average, as this is a good proxy for our use case, although there are also good arguments for taking only the fastest time (highest frequency).
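The averaging step can also be done directly in Python if a spreadsheet is not to hand. The frequencies below are placeholder values, not real measurements:

```python
from statistics import mean

# Frequencies (Hz) printed by three client runs -- placeholder values.
runs = [9.8, 10.1, 9.9]

print(f"mean: {mean(runs):.2f} Hz")  # average across the three runs
print(f"best: {max(runs):.2f} Hz")   # the alternative: fastest run only
```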

It should be straightforward to change the client code to test different waveform sizes and frequencies.

Results

Initial Client Comparison

In this initial chart we see the three clients compared in the situation where loading of the waveform, decoding, and verification against known values are all performed. We can see that py-graphql-client falls away an order of magnitude earlier than the other clients, so we can dismiss it from further tests. This demonstrates that the choice of client can make a substantial difference to performance.

Initial Client Comparison, focussing on asyncio and JS

Using the same data as above but focussing on the JS and asyncio clients only, in the range of 10e5 to 10e6 elements, we see that the two clients are fairly evenly matched. This might lead us to believe that we have reached the limits of the transfer rate, or are certainly pushing them, since two independent clients are performing to a similar level.

However, the graph below, of the same data but with the verification step removed, clearly shows the asyncio client maintaining a much more consistent message frequency across the range. The JS client performs very similarly to the previous test. It therefore appears that the verification step was imposing a large computational load on the asyncio client, which interfered with its ability to receive new messages. This does not appear to be the case with the JS client.

No verification

Removing the decoding step provides very similar results to the previous test. This is not hugely surprising, as both JS and Python have efficient tools for encoding and decoding base64 arrays.
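The decoding step amounts to turning a base64 string back into a numeric array. A stdlib-only Python sketch of the round trip is below; the dtype (little-endian float64) and variable names are assumptions for illustration, not Coniql's confirmed wire format:

```python
import base64
import struct

# Simulate the server side: pack a small float64 waveform and base64-encode it.
waveform = [0.0, 1.5, -2.25, 3.0]
encoded = base64.b64encode(struct.pack(f"<{len(waveform)}d", *waveform)).decode()

# Client-side decoding, as performed in the "decoding" test step.
raw = base64.b64decode(encoded)
decoded = list(struct.unpack(f"<{len(raw) // 8}d", raw))

# Verification step: compare against the known values.
assert decoded == waveform
print(decoded)  # -> [0.0, 1.5, -2.25, 3.0]
```

In practice a numpy `frombuffer` call would be the more efficient route for large waveforms, which is the kind of "efficient tool" the comparison above relies on.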

No verification or decoding

Finally, although as noted above the JS client always loads the JSON string into an object, this is not the case in Python. We can therefore modify the client to wait for the message to be received and then do nothing with it. Given that we have already tested the connection with verification, we can be confident that the data being sent is correct, and removing computation on the incoming data should allow us to measure values closer to the maximum throughput of the server.

Asyncio no loading

This shows that the message rate begins to drop off above 2e6 elements; this is discussed further below in the comparison with Tartiflette.

Original vs Tartiflette

Tartiflette Comparison

Tartiflette is a different GraphQL server which has been implemented for Coniql. The above chart, using the no-verification, no-decoding test, demonstrates that there is little performance difference between the original and Tartiflette implementations. This suggests that in these tests, the limit is in the client, not the server.

We also replicate the no loading test from above:

Tartiflette Comparison

Again, this shows very similar results to the original implementation.

Interestingly, CPU load during the larger tests - around 3e6 elements and up - shows some of the cores on the machine maxing out. Memory usage also starts to increase dramatically at the same waveform sizes, and has been seen to max out, killing both the server and client processes. The exact cause of this is unclear; note that the server simulation is implemented with numpy.roll, which returns a new array rather than transforming in place, so each update allocates a fresh copy of the waveform. Perhaps the client should more clearly indicate that it is fine to delete the message once it has been sent?

This will require further investigation. It does indicate, however, that we are testing the internal limits of the machine rather than the protocol itself, which is a good thing. The next obvious limit to test is the transfer rate which can be achieved over the network. This can be tested by running the same tests as above on separate machines located on the same network.

CPU Usage

Memory Usage