Introduce a locally runnable matching simulator #6203

@taylanisikdemir taylanisikdemir commented Jul 31, 2024

What changed?
The partitioned tasklist matching logic needs improvement so that tasks are matched with pollers as soon as possible. We observed sub-optimal behavior that becomes a bottleneck for partitioned tasklists under heavy traffic, especially with the isolation groups feature enabled. We are therefore revisiting the matching partitioning, routing, forwarding, and isolation features.

This PR introduces a locally runnable matching simulation with various knobs. It collects and reports useful information via new structured event logging.

Simulator configuration looks like this:

matchingconfig:
  nummatchinghosts: 4
  simulationconfig:
    tasklistwritepartitions: 2
    tasklistreadpartitions: 2
    numpollers: 10
    numtaskgenerators: 2
    taskgeneratortickinterval: 50ms
    maxtasktogenerate: 1500
    polltimeout: 5s
    forwardermaxoutstandingpolls: 20
    forwardermaxoutstandingtasks: 1
    forwardermaxratepersecond: 10
    forwardermaxchildrenpernode: 20

Structured event logs are added throughout the matching code for better visibility. They are printed only when MATCHING_LOG_EVENTS=true, so for now they are available only in simulation runs. An example event looks like this:

{
  "DomainID": "2d84398e-1a5b-4436-86b3-4d6c86a0aa72",
  "WorkflowID": "test-workflow-id",
  "RunID": "ccb6ecce-e90c-455f-93d3-8564e6ff298d",
  "TaskID": 8,
  "ScheduleID": 21,
  "CreatedTime": "2024-07-30T22:48:55.539Z",
  "PartitionConfig": null,
  "TaskListName": "my-tasklist",
  "TaskListKind": "NORMAL",
  "TaskListType": 0,
  "EventTime": "2024-07-30T22:48:55.544875679Z",
  "EventName": "Matched Task (pollOrForward)",
  "Host": "",
  "Payload": {
    "FromIsolatedTaskC": false,
    "IsolationGroup": "",
    "SyncMatched": false,
    "TaskIsForwarded": false
  }
}
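
For ad-hoc analysis outside of jq, an event like the one above could be decoded in Go roughly as follows. This is a minimal sketch: the matchingEvent type is hypothetical, its field types are inferred from this single sample, and it is not the struct used inside the matching service.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// matchingEvent mirrors some fields of the example event above.
// Hypothetical type; field types are guesses based on one sample.
type matchingEvent struct {
	DomainID     string
	WorkflowID   string
	RunID        string
	TaskID       int64
	ScheduleID   int64
	TaskListName string
	TaskListKind string
	TaskListType int
	EventTime    time.Time
	EventName    string
	Host         string
	Payload      map[string]any // payload keys appear to vary per event
}

// parseEvent decodes one JSON-encoded event. Unknown fields are ignored.
func parseEvent(data []byte) (matchingEvent, error) {
	var e matchingEvent
	err := json.Unmarshal(data, &e)
	return e, err
}

func main() {
	data := []byte(`{
		"TaskID": 8,
		"ScheduleID": 21,
		"TaskListName": "my-tasklist",
		"EventTime": "2024-07-30T22:48:55.544875679Z",
		"EventName": "Matched Task (pollOrForward)",
		"Payload": {"SyncMatched": false, "TaskIsForwarded": false}
	}`)
	e, err := parseEvent(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.EventName, e.TaskID, e.Payload["SyncMatched"])
}
```

Keeping Payload as a map avoids committing to a fixed set of payload keys, since only one event type is shown here.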

The main simulator logic is in host/matching_simulation_test.go.
The rest of the changes update the integration test framework to meet the simulator's needs, such as:

  • Ability to run multiple matching services
  • Ability to mock history client

How did you test it?

./scripts/run_matching_simulator.sh

This script runs the test for 1 minute and prints a simulation summary by querying the event logs via jq.

Example simulation summary output: https://gist.github.com/taylanisikdemir/7cebe12fdf6aaf72e0005bd13bbec7e0

One odd thing (which I hope is a bug in my setup) is that sync matches only happen for tasks forwarded to the root partition. This needs some debugging to confirm that the setup is behaving correctly.

Next steps

  • Instrument other/missed critical places with event logs
  • Run simulation with different configurations such as
    • high vs low partition count
    • mismatching read/write partition counts
    • forwarding tasks enabled vs disabled
    • forwarding polls enabled vs disabled
    • generate more tasks and run more pollers
  • Some ideas to try
    • Sleep a bit to give sync match a chance before forwarding
    • Forward polls/tasks to a random partition instead of the root one

codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 75.16779% with 74 lines in your changes missing coverage. Please review.

Project coverage is 72.91%. Comparing base (38c295d) to head (c228efc).
Report is 1 commits behind head on master.

Files                                           Patch %   Lines
service/matching/tasklist/matcher.go            74.32%    38 Missing ⚠️
service/matching/tasklist/task_reader.go        66.66%    16 Missing ⚠️
service/matching/handler/engine.go              88.23%    8 Missing ⚠️
common/resource/resourceImpl.go                 0.00%     7 Missing ⚠️
service/matching/tasklist/task.go               50.00%    1 Missing and 1 partial ⚠️
service/matching/tasklist/task_list_manager.go  83.33%    2 Missing ⚠️
service/matching/tasklist/task_writer.go        90.90%    0 Missing and 1 partial ⚠️
Additional details and impacted files
Files                                           Coverage Δ
service/matching/tasklist/task_writer.go        78.47% <90.90%> (+1.02%) ⬆️
service/matching/tasklist/task.go               78.33% <50.00%> (-2.03%) ⬇️
service/matching/tasklist/task_list_manager.go  66.91% <83.33%> (+2.84%) ⬆️
common/resource/resourceImpl.go                 2.31% <0.00%> (-0.05%) ⬇️
service/matching/handler/engine.go              78.17% <88.23%> (+1.02%) ⬆️
service/matching/tasklist/task_reader.go        72.27% <66.66%> (-0.93%) ⬇️
service/matching/tasklist/matcher.go            78.60% <74.32%> (-2.72%) ⬇️

... and 6 files with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@taylanisikdemir taylanisikdemir enabled auto-merge (squash) July 31, 2024 22:31
@taylanisikdemir taylanisikdemir merged commit 06e5a6d into cadence-workflow:master Aug 1, 2024
18 checks passed
@taylanisikdemir taylanisikdemir deleted the taylan/matching_simulator branch August 1, 2024 19:36