
Optimize metadata merge protocol #44

Closed
vyzo opened this issue Oct 14, 2016 · 6 comments
Milestone

Comments

@vyzo
Contributor

vyzo commented Oct 14, 2016

Metadata merges as initially implemented in #41 fetch data batches synchronously, within the flow of the query stream.

This may be fine for small merges, but throughput will suffer in larger merges, and it can potentially hold up open query result sets at the source.

Instead, the data merge can be implemented with a background goroutine that fetches data batches as requested by the primary merge goroutine, communicating through a buffered channel.
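A minimal sketch of the proposed shape (names like `Batch` and `fetchBatches` are illustrative, not mcnode's actual API): a background goroutine feeds batches through a buffered channel, so fetching overlaps with merging and the source's query stream is drained promptly.

```go
package main

import "fmt"

// Batch stands in for a batch of statements fetched from the source peer.
type Batch struct{ stmts []string }

// fetchBatches is a hypothetical sketch: a background goroutine pulls
// batches from the source and sends them through a buffered channel,
// so the primary merge goroutine never runs the fetch synchronously.
func fetchBatches(batches []Batch, bufSize int) <-chan Batch {
	ch := make(chan Batch, bufSize)
	go func() {
		defer close(ch)
		for _, b := range batches {
			// Blocks only when the merge side falls bufSize batches behind.
			ch <- b
		}
	}()
	return ch
}

func main() {
	src := []Batch{{stmts: []string{"a", "b"}}, {stmts: []string{"c"}}}
	merged := 0
	// The primary merge goroutine consumes batches as they arrive.
	for b := range fetchBatches(src, 2) {
		merged += len(b.stmts)
	}
	fmt.Println(merged)
}
```

The buffer size bounds how far the fetcher can run ahead of the merge, trading memory for pipeline slack.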

@vyzo vyzo added this to the V2 milestone Oct 14, 2016
@parkan parkan modified the milestones: V3, V2 Oct 14, 2016
@vyzo vyzo mentioned this issue Oct 26, 2016
@vyzo
Contributor Author

vyzo commented Oct 27, 2016

Measurements showing the performance improvement over v1.0 with the optimizations in #68:

mcnode-v1.0:
1000 [0-1k]   0m10.233s
1000 [1k-2k]  0m10.287s
2000 [2k-4k]  0m17.673s
4000 [4k-8k]  0m33.720s
8000 [8k-16k] 1m4.431s

mcnode/parallel-fetch:
1000 [0-1k]   0m8.716s
1000 [1k-2k]  0m9.709s
2000 [2k-4k]  0m15.859s
4000 [4k-8k]  0m26.150s
8000 [8k-16k] 0m47.155s

mcnode/parallel-fetch+merge-batch:
1000 [0-1k]   0m5.598s
1000 [1k-2k]  0m6.325s
2000 [2k-4k]  0m8.548s
4000 [4k-8k]  0m18.896s
8000 [8k-16k] 0m34.640s

These measurements come from my laptop, with 4 vCPUs (2 cores, 4 threads) and pretty lousy bandwidth and latency to the peer node.
The test was performed with a clean database, with successive merges from QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ using the query SELECT * FROM images.dpla LIMIT $x.

@vyzo
Contributor Author

vyzo commented Oct 27, 2016

Measurements from an ec2 test node:

1000 [0-1k]   0m0.731s
1000 [1k-2k]  0m1.018s
2000 [2k-4k]  0m1.639s
4000 [4k-8k]  0m2.805s
8000 [8k-16k] 0m5.386s

@vyzo
Contributor Author

vyzo commented Oct 27, 2016

100k merge on ec2 in 35s:

$ time mcclient merge QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ "SELECT * FROM images.dpla LIMIT 100000"
merged 100000 statements and 100000 objects

real    0m35.354s
user    0m0.265s
sys     0m0.025s

@vyzo
Contributor Author

vyzo commented Oct 27, 2016

1MM merge on ec2:

$ time mcclient merge QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ "SELECT * FROM images.dpla LIMIT 1000000"
merged 1000000 statements and 1000000 objects

real    6m31.248s
user    0m0.253s
sys     0m0.037s

@parkan
Contributor

parkan commented Oct 27, 2016

👍

@vyzo
Contributor Author

vyzo commented Oct 28, 2016

A final measurement with an additional optimization (the occurs check is delayed until request time, so that it can happen in parallel with the merge):
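A rough sketch of what delaying the occurs check buys (hypothetical names throughout; `hasStmt` stands in for the local DB lookup): by moving the "do we already have this statement?" check into the background goroutine that prepares requests, the lookups overlap with merge work instead of serializing in front of it.

```go
package main

import "fmt"

// hasStmt stands in for the local occurs check: does the database
// already contain this statement? A map lookup for illustration.
func hasStmt(db map[string]bool, id string) bool { return db[id] }

// requestMissing sketches the optimization: the occurs check runs in
// the background goroutine that issues data requests, so it happens in
// parallel with the primary merge goroutine's work.
func requestMissing(db map[string]bool, ids []string) <-chan string {
	out := make(chan string, len(ids))
	go func() {
		defer close(out)
		for _, id := range ids {
			// Occurs check deferred to request time.
			if !hasStmt(db, id) {
				out <- id
			}
		}
	}()
	return out
}

func main() {
	db := map[string]bool{"s1": true}
	var toFetch []string
	// Only statements not already present get requested from the peer.
	for id := range requestMissing(db, []string{"s1", "s2", "s3"}) {
		toFetch = append(toFetch, id)
	}
	fmt.Println(toFetch)
}
```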

$ time mcclient merge QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ "SELECT * FROM images.dpla LIMIT 1000000"
merged 1000000 statements and 1000000 objects

real    6m23.390s
user    0m0.264s
sys     0m0.025s

So we are merging at a cool ~2.5K writes/s.

@vyzo vyzo closed this as completed in #68 Oct 28, 2016