
Optimize metadata merge protocol #44

Closed
vyzo opened this issue Oct 14, 2016 · 6 comments
Milestone

Comments

@vyzo
Contributor

vyzo commented Oct 14, 2016

Metadata merges as initially implemented in #41 fetch data batches synchronously, within the flow of the query stream.

This may be fine for small merges, but throughput will suffer in larger merges, and it can potentially hold up open query result sets at the source.

Instead, the data merge can be implemented with a background goroutine that fetches data batches as requested by the primary merge goroutine, communicating through a buffered channel.
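A minimal sketch of the proposed shape (names like `Batch` and `fetchBatches` are illustrative, not mcnode's actual API): a background goroutine feeds batches through a buffered channel, so fetching overlaps with merging and the source's query stream is drained promptly.

```go
package main

import "fmt"

// Batch stands in for a batch of statements fetched from the source peer.
type Batch struct{ stmts []string }

// fetchBatches is a hypothetical sketch: a background goroutine pulls
// batches from the source and sends them through a buffered channel,
// so the primary merge goroutine never runs the fetch synchronously.
func fetchBatches(batches []Batch, bufSize int) <-chan Batch {
	ch := make(chan Batch, bufSize)
	go func() {
		defer close(ch)
		for _, b := range batches {
			// Blocks only when the merge side falls bufSize batches behind.
			ch <- b
		}
	}()
	return ch
}

func main() {
	src := []Batch{{stmts: []string{"a", "b"}}, {stmts: []string{"c"}}}
	merged := 0
	// The primary merge goroutine consumes batches as they arrive.
	for b := range fetchBatches(src, 2) {
		merged += len(b.stmts)
	}
	fmt.Println(merged)
}
```

The buffer size bounds how far the fetcher can run ahead of the merge, trading memory for pipeline slack.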

@vyzo vyzo added this to the V2 milestone Oct 14, 2016
@parkan parkan modified the milestones: V3, V2 Oct 14, 2016
@vyzo vyzo mentioned this issue Oct 26, 2016
@vyzo
Contributor Author

vyzo commented Oct 27, 2016

Measurements showing the performance improvement over v1.0 with the optimizations in #68:

mcnode-v1.0:
1000 [0-1k]   0m10.233s
1000 [1k-2k]  0m10.287s
2000 [2k-4k]  0m17.673s
4000 [4k-8k]  0m33.720s
8000 [8k-16k] 1m4.431s

mcnode/parallel-fetch:
1000 [0-1k]   0m8.716s
1000 [1k-2k]  0m9.709s
2000 [2k-4k]  0m15.859s
4000 [4k-8k]  0m26.150s
8000 [8k-16k] 0m47.155s

mcnode/parallel-fetch+merge-batch:
1000 [0-1k]   0m5.598s
1000 [1k-2k]  0m6.325s
2000 [2k-4k]  0m8.548s
4000 [4k-8k]  0m18.896s
8000 [8k-16k] 0m34.640s

These measurements come from my laptop, with 4 vCPUs (2 cores, 4 threads) and pretty lousy bandwidth and latency to the peer node.
The test was performed with a clean database, with successive merges from QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ using the query SELECT * FROM images.dpla LIMIT $x.

@vyzo
Contributor Author

vyzo commented Oct 27, 2016

Measurements from an ec2 test node:

1000 [0-1k]   0m0.731s
1000 [1k-2k]  0m1.018s
2000 [2k-4k]  0m1.639s
4000 [4k-8k]  0m2.805s
8000 [8k-16k] 0m5.386s

@vyzo
Contributor Author

vyzo commented Oct 27, 2016

100k merge on ec2 in 35s:

$ time mcclient merge QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ "SELECT * FROM images.dpla LIMIT 100000"
merged 100000 statements and 100000 objects

real    0m35.354s
user    0m0.265s
sys     0m0.025s

@vyzo
Contributor Author

vyzo commented Oct 27, 2016

1MM merge on ec2:

$ time mcclient merge QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ "SELECT * FROM images.dpla LIMIT 1000000"
merged 1000000 statements and 1000000 objects

real    6m31.248s
user    0m0.253s
sys     0m0.037s

@parkan
Contributor

parkan commented Oct 27, 2016

👍

@vyzo
Contributor Author

vyzo commented Oct 28, 2016

A final measurement with an additional optimization (the occurs check is delayed until request time, so that it can happen in parallel with the merge):
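A rough sketch of what delaying the occurs check buys (hypothetical names throughout; `hasStmt` stands in for the local DB lookup): by moving the "do we already have this statement?" check into the background goroutine that prepares requests, the lookups overlap with merge work instead of serializing in front of it.

```go
package main

import "fmt"

// hasStmt stands in for the local occurs check: does the database
// already contain this statement? A map lookup for illustration.
func hasStmt(db map[string]bool, id string) bool { return db[id] }

// requestMissing sketches the optimization: the occurs check runs in
// the background goroutine that issues data requests, so it happens in
// parallel with the primary merge goroutine's work.
func requestMissing(db map[string]bool, ids []string) <-chan string {
	out := make(chan string, len(ids))
	go func() {
		defer close(out)
		for _, id := range ids {
			// Occurs check deferred to request time.
			if !hasStmt(db, id) {
				out <- id
			}
		}
	}()
	return out
}

func main() {
	db := map[string]bool{"s1": true}
	var toFetch []string
	// Only statements not already present get requested from the peer.
	for id := range requestMissing(db, []string{"s1", "s2", "s3"}) {
		toFetch = append(toFetch, id)
	}
	fmt.Println(toFetch)
}
```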

$ time mcclient merge QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ "SELECT * FROM images.dpla LIMIT 1000000"
merged 1000000 statements and 1000000 objects

real    6m23.390s
user    0m0.264s
sys     0m0.025s

So we are merging at a cool ~2.5K writes/s.

@vyzo vyzo closed this as completed in #68 Oct 28, 2016