# [TT-1741] performance comparison tool #1424
**`.github/workflows/generate-go-docs.yaml`**

```diff
@@ -31,7 +31,7 @@
   GOPRIVATE: github.com/smartcontractkit/generate-go-function-docs
 run: |
   git config --global url."https://x-access-token:${{ steps.setup-github-token-read.outputs.access-token }}@github.com/".insteadOf "https://github.com/"
-  go install github.com/smartcontractkit/[email protected].1
+  go install github.com/smartcontractkit/[email protected].2
   go install github.com/jmank88/[email protected]
   go install golang.org/x/tools/gopls@latest
@@ -111,7 +111,7 @@
   shell: bash
   env:
     OPENAI_API_KEY: ${{ secrets.OPENAI_DOC_GEN_API_KEY }}
   run: |
     # Add go binary to PATH
     PATH=$PATH:$(go env GOPATH)/bin
     export PATH
```

> GitHub Actions / actionlint reported a check failure on line 114 of `.github/workflows/generate-go-docs.yaml`.
**New workflow file** (`@@ -0,0 +1,27 @@`):

```yaml
name: WASP's BenchSpy Go Tests
on: [push]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  test:
    defaults:
      run:
        working-directory: wasp
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v3
        id: changes
        with:
          filters: |
            src:
              - 'wasp/benchspy/**'
      - uses: cachix/install-nix-action@08dcb3a5e62fa31e2da3d490afc4176ef55ecd72 # v30
        if: steps.changes.outputs.src == 'true'
        with:
          nix_path: nixpkgs=channel:nixos-unstable
      - name: Run tests
        if: steps.changes.outputs.src == 'true'
        run: |-
          nix develop -c make test_benchspy_race
```
# BenchSpy - Your First Test

Let's start with the simplest case, which doesn't require any part of the observability stack—only `WASP` and the application you are testing.
`BenchSpy` comes with built-in `QueryExecutors`, each of which also has predefined metrics that you can use. One of these executors is the `DirectQueryExecutor`, which fetches metrics directly from `WASP` generators, which means you can run it without Loki.

> [!NOTE]
> Not sure whether to use `Loki` or `Direct` query executors? [Read this!](./loki_dillema.md)

## Test Overview

Our first test will follow this logic:
- Run a simple load test.
- Generate a performance report and store it.
- Run the load test again.
- Generate a new report and compare it to the previous one.

We'll use very simplified assertions for this example and expect the performance to remain unchanged.

### Step 1: Define and Run a Generator

Let's start by defining and running a generator that uses a mocked service:

```go
gen, err := wasp.NewGenerator(&wasp.Config{
	T:           t,
	GenName:     "vu",
	CallTimeout: 100 * time.Millisecond,
	LoadType:    wasp.VU,
	Schedule:    wasp.Plain(10, 15*time.Second),
	VU: wasp.NewMockVU(&wasp.MockVirtualUserConfig{
		CallSleep: 50 * time.Millisecond,
	}),
})
require.NoError(t, err)
gen.Run(true)
```

### Step 2: Generate a Baseline Performance Report

With load data available, let's generate a baseline performance report and store it in local storage:

```go
baseLineReport, err := benchspy.NewStandardReport(
	// a placeholder; this should be the commit or tag of the Application Under Test (AUT)
	"v1.0.0",
	// use built-in queries for an executor that fetches data directly from the WASP generator
	benchspy.WithStandardQueries(benchspy.StandardQueryExecutor_Direct),
	// WASP generators
	benchspy.WithGenerators(gen),
)
require.NoError(t, err, "failed to create baseline report")

fetchCtx, cancelFn := context.WithTimeout(context.Background(), 60*time.Second)
defer cancelFn()

fetchErr := baseLineReport.FetchData(fetchCtx)
require.NoError(t, fetchErr, "failed to fetch data for baseline report")

path, storeErr := baseLineReport.Store()
require.NoError(t, storeErr, "failed to store baseline report", path)
```

> [!NOTE]
> There's a lot to unpack here, and you're encouraged to read more about the built-in `QueryExecutors` and the standard metrics they provide, as well as about the `StandardReport`, [here](./reports/standard_report.md).
>
> For now, it's enough to know that the standard metrics provided by `StandardQueryExecutor_Direct` include:
> - Median latency
> - P95 latency (95th percentile)
> - Max latency
> - Error rate

### Step 3: Run the Test Again and Compare Reports

With the baseline report ready, let's run the load test again. This time, we'll use a wrapper function to automatically load the previous report, generate a new one, and ensure the two are comparable.

```go
// define a new generator using the same config values
newGen, err := wasp.NewGenerator(&wasp.Config{
	T:           t,
	GenName:     "vu",
	CallTimeout: 100 * time.Millisecond,
	LoadType:    wasp.VU,
	Schedule:    wasp.Plain(10, 15*time.Second),
	VU: wasp.NewMockVU(&wasp.MockVirtualUserConfig{
		CallSleep: 50 * time.Millisecond,
	}),
})
require.NoError(t, err)

// run the load
newGen.Run(true)

fetchCtx, cancelFn = context.WithTimeout(context.Background(), 60*time.Second)
defer cancelFn()

// currentReport is the newly generated report; previousReport is the baseline (baseLineReport) loaded from storage
currentReport, previousReport, err := benchspy.FetchNewStandardReportAndLoadLatestPrevious(
	fetchCtx,
	// commit or tag of the new application version
	"v2.0.0",
	benchspy.WithStandardQueries(benchspy.StandardQueryExecutor_Direct),
	benchspy.WithGenerators(newGen),
)
require.NoError(t, err, "failed to fetch current report or load the previous one")
```

> [!NOTE]
> In a real-world case, once you've generated the first report, you should only need to use the `benchspy.FetchNewStandardReportAndLoadLatestPrevious` function.

### What's Next?

Now that we have two reports, how do we ensure that the application's performance meets expectations?
Find out in the [next chapter](./simplest_metrics.md).
# BenchSpy - Getting Started

The following examples assume you have access to these applications:
- Grafana
- Loki
- Prometheus

> [!NOTE]
> The easiest way to run these locally is by using CTFv2's [observability stack](../../../framework/observability/observability_stack.md).
> Be sure to install the `CTF CLI` first, as described in the [CTFv2 Getting Started](../../../framework/getting_started.md) guide.

Since BenchSpy is tightly coupled with WASP, we highly recommend that you [get familiar with it first](../overview.md) if you haven't already.

Ready? [Let's get started!](./first_test.md)
# BenchSpy - Custom Loki Metrics

In this chapter, we'll explore how to use custom `LogQL` queries in the performance report. For this more advanced use case, we'll manually compose the performance report.

The load generation part is the same as in the standard Loki metrics example and will be skipped.

## Defining Custom Metrics

Let's define two illustrative metrics:
- **`vu_over_time`**: The rate of virtual users generated by WASP, using a 10-second window.
- **`responses_over_time`**: The number of the AUT's responses, using a 1-second window.

```go
lokiQueryExecutor := benchspy.NewLokiQueryExecutor(
	map[string]string{
		"vu_over_time":        fmt.Sprintf("max_over_time({branch=~\"%s\", commit=~\"%s\", go_test_name=~\"%s\", test_data_type=~\"stats\", gen_name=~\"%s\"} | json | unwrap current_instances [10s]) by (node_id, go_test_name, gen_name)", label, label, t.Name(), gen.Cfg.GenName),
		"responses_over_time": fmt.Sprintf("sum(count_over_time({branch=~\"%s\", commit=~\"%s\", go_test_name=~\"%s\", test_data_type=~\"responses\", gen_name=~\"%s\"} [1s])) by (node_id, go_test_name, gen_name)", label, label, t.Name(), gen.Cfg.GenName),
	},
	gen.Cfg.LokiConfig,
)
```

> [!NOTE]
> These `LogQL` queries use the standard labels that `WASP` applies when sending data to Loki.

## Creating a `StandardReport` with Custom Queries

Now, let's create a `StandardReport` using our custom queries:

```go
baseLineReport, err := benchspy.NewStandardReport(
	"v1.0.0",
	// notice the different functional option used to pass the Loki executor with custom queries
	benchspy.WithQueryExecutors(lokiQueryExecutor),
	benchspy.WithGenerators(gen),
)
require.NoError(t, err, "failed to create baseline report")
```

## Wrapping Up

The rest of the code remains unchanged, except for the names of the metrics being asserted. You can find the full example [here](...).
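
As an illustration only, here is a rough sketch of what asserting on the custom metric names could look like. The helper below is hypothetical: it assumes you have already extracted the per-query values from the current and previous reports into `map[string][]string` keyed by the query names defined above (see the standard report chapter for the exact accessors).

```go
package benchspy_custom_metrics_example // hypothetical package name for this sketch

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// assertCustomMetrics is a hypothetical helper: it assumes the values for each
// custom Loki query have already been extracted from the current and previous
// reports into maps keyed by the query names used in lokiQueryExecutor.
func assertCustomMetrics(t *testing.T, current, previous map[string][]string) {
	for _, metric := range []string{"vu_over_time", "responses_over_time"} {
		require.NotEmpty(t, current[metric], "no current values for %s", metric)
		require.NotEmpty(t, previous[metric], "no previous values for %s", metric)
	}
}
```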

Now it's time to look at the last of the bundled `QueryExecutors`. Proceed to the [next chapter to read about Prometheus](./prometheus_std.md).

> [!NOTE]
> You can find the full example [here](https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/wasp/examples/benchspy/loki_query_executor/loki_query_executor_test.go).

**Review comment:** 404 https://github.com/smartcontractkit/chainlink-testing-framework/tree/main/wasp/examples/benchspy/loki_query_executor/loki_query_executor_test.go — probably because it's added here but you point to `main`, so it will work after this PR is merged?

**Reply:** Exactly, it will work only after this PR has been merged (this way I don't have to update it later).
# BenchSpy - To Loki or Not to Loki?

You might be wondering whether to use the `Loki` or `Direct` query executor if all you need are basic latency metrics.

## Rule of Thumb

You should opt for the `Direct` query executor if all you need is a single number, such as the median latency or error rate, and you're not interested in:
- Comparing time series directly,
- Examining minimum or maximum values over time, or
- Performing advanced calculations on raw data.

## Why Choose `Direct`?

The `Direct` executor returns a single value for each standard metric using the same raw data that Loki would use. It accesses data stored in the `WASP` generator, which is later pushed to Loki.

This means you can:
- Run your load test without a Loki instance.
- Avoid calculating metrics like the median, 95th percentile latency, or error ratio yourself.

By using `Direct`, you save resources and simplify the process when advanced analysis isn't required.

> [!WARNING]
> Metrics calculated by the two query executors may differ slightly due to differences in their data processing and calculation methods:
>
> - **`Direct` QueryExecutor**: This method processes all individual data points from the raw dataset, ensuring that every value is taken into account for calculations like averages, percentiles, or other statistics. It provides the most granular and precise results but may also be more sensitive to outliers and noise in the data.
> - **`Loki` QueryExecutor**: This method aggregates data using a default window size of 10 seconds. Within each window, multiple raw data points are combined (e.g., through averaging, summing, or other aggregation functions), which reduces the granularity of the dataset. While this approach can improve performance and reduce noise, it also smooths the data, which may obscure outliers or small-scale variability.
>
> #### Why This Matters for Percentiles
> Percentiles, such as the 95th percentile (p95), are particularly sensitive to the granularity of the input data:
> - In the **`Direct` QueryExecutor**, the p95 is calculated across all raw data points, capturing the true variability of the dataset, including any extreme values or spikes.
> - In the **`Loki` QueryExecutor**, the p95 is calculated over aggregated data (i.e., using the 10-second window). As a result, the raw values within each window are smoothed into a single representative value, potentially lowering or altering the calculated p95. For example, an outlier that would significantly affect the p95 in the `Direct` calculation might be averaged out in the `Loki` window, leading to a slightly lower percentile value.
>
> #### `Direct` Caveats
> - **Buffer limitations:** `WASP` generators use a [StringBuffer](https://github.com/smartcontractkit/chainlink-testing-framework/blob/main/wasp/buffer.go) with a fixed size to store the responses. Once full capacity is reached, the oldest entries are replaced with incoming ones. The size of the buffer can be set in the generator's config. By default, it is limited to 50k entries to lower resource consumption and avoid potential OOMs.
> - **Sampling:** `WASP` generators support optional sampling of successful responses. It is disabled by default, but if you do enable it, the calculations are no longer done over the full dataset.
>
> #### Key Takeaway
> The difference arises because `Direct` prioritizes precision by using raw data, while `Loki` prioritizes efficiency and scalability by using aggregated data. When interpreting results, it's essential to consider how the smoothing effect of `Loki` might impact the representation of variability or extremes in the dataset. This is especially important for metrics like percentiles, where such details can significantly influence the outcome.
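
To make the smoothing effect concrete, here is a small, self-contained Go sketch (an illustration only, not BenchSpy or WASP code). It computes a 95th percentile once over raw latencies and once over the same latencies averaged in 10-second windows; the short latency spike dominates the raw p95 but is largely averaged away in the windowed series. The sample values and window size are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// p95 returns the 95th percentile of the given latencies (nearest-rank method).
func p95(latencies []time.Duration) time.Duration {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted))*0.95) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// 60s of synthetic raw latencies, one sample per 100ms: mostly 50ms,
	// with a ~4-second burst of 500ms outliers around the 30-second mark.
	var raw []time.Duration
	for i := 0; i < 600; i++ {
		d := 50 * time.Millisecond
		if i >= 280 && i < 320 {
			d = 500 * time.Millisecond
		}
		raw = append(raw, d)
	}

	// "Loki-like" view: average the raw samples in 10-second windows
	// (100 samples each), then compute the percentile over the 6 window means.
	var windowed []time.Duration
	for start := 0; start < len(raw); start += 100 {
		var sum time.Duration
		for _, d := range raw[start : start+100] {
			sum += d
		}
		windowed = append(windowed, sum/100)
	}

	fmt.Println("p95 over raw samples:     ", p95(raw))      // the 500ms burst shows up
	fmt.Println("p95 over 10s window means:", p95(windowed)) // the burst is averaged down
}
```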
**Review comment:** Any reason it's not P99? E.g., do we consider p99 too noisy? It will show "worst cases".

**Reply:** I'd say let's better add max latency to get extreme outliers. 95th and 99th are usually product requirements, but you can rely on one or another and discuss it with stakeholders. MAX, on the other hand, shows you extreme outliers, and if you have a strict SLA that, for example, "0 transactions are delivered after 2 minutes", you'll detect it with MAX and can miss it even with the 99th.