-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add vdev property to bypass vdev queue #16591
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it might have sense for some very fast NVMe's with "unlimited" queue depth and I/O rate, since others may suffer from I/O aggregation lost as a side effect.
Also it seems that you are still obtaining the queue lock on completion, aren't you?
Also I don't see what would initialize io_queue_state
you check on completion if the queue is bypassed.
Also the property seems to have negative meaning. It is preferable to use positive ones to avoid double negation.
7739374
to
ca898ab
Compare
Changed to avoid obtaining the queue lock if nothing gets removed from the queue.
Changed to set the io_queue_state if not queueing the zio.
Changed property to use positive meaning. |
ca898ab
to
af04e9a
Compare
tests/zfs-tests/tests/functional/cli_root/zpool_set/vdev_set_queue_io.ksh
Outdated
Show resolved
Hide resolved
tests/zfs-tests/tests/functional/cli_root/zpool_set/vdev_set_queue_io.ksh
Outdated
Show resolved
Hide resolved
I am uncomfortable with this change. I agree that the So the main thing I know that this will "break" is that the IO is no longer tracked for the "slow IO" checks to find ( The other part I'm just nervous about is that I don't know how it will interact with IO that is on the queue. Seems like it could starve other high-priority IO. This sort of stuff is why I've spent so long on it without a conclusive path forward. I would weakly support this being done as a module parameter so it's at least available in the field for experimentation on specific customer workload, though it's a little tough with module parameters being scoped far wider than a single vdev instance. Though maybe that's fine to start for an experimental feature; the other For now though I'd be against having as a vdev property, because that is both more visible and more permanent, and I don't have the confidence to support it yet. If I were overruled here, I would at least make the scope far more narrow: maybe only limit it to Direct IO IOs, and definitely make the name clear that is more narrowly. I would suggest something like your original and make it "queue bypass" or similar (eg "directio queue bypass" if it were limited to those). (I appreciate the point about avoiding double-negatives, but I think its possible to claim that "bypass" is a specific, positive action. At least, for me, "queue io" is way too vague, and might cause me to say "why wouldn't I turn that off? Queues sound slow!". But I this is only a tiny part of my objection to this whole thing, so if it does stay in its current form, I'm unlikely to worry too much about naming). |
af04e9a
to
6da75aa
Compare
I have been working with @bwatkinson into improving IOPS performance with reads and we have also noticed that getting around this lock just seems to shift contention elsewhere. I am hoping that with further research and testing we can further reduce this contention, this bypassing the vdev queue idea is just one step.
As for implementing this as a module parameter, I am not too sure which way we should go. I am okay with it either way, maybe others (@amotin, @tonyhutter) can provide some insight into what should be done here?
I agree that this property should probably be limited to only O_DIRECT IO's, as from the testing that I have done, it doesn't seem to improve performance with non O_DIRECT workloads.
|
@behlendorf do you have an opinion on whether we should stick with a VDEV property that is limited to Direct I/O or would a module parameter make more sense at this point? I see a good argument for both. |
My preference would be for a vdev property, since you could set it NVMe vdevs, but not spinning disks. This assumes we're 100% comfortable with this feature though. @MigeljanImeri do you have any thoughts on @robn's comments regarding breaking the deadman IO checks and the IO priority starvation? For example, if you submit a un-queue'd IO, and the drive "eats" the IO with no response, how does this codepath handle that? |
If we restrict this to only o_direct reads and writes, the o_direct code path should propagate up any errors that occur. In terms of priority starvation, I don't foresee this being an issue with newer NVMe SSD's that are very fast. Although this could be an issue with slower devices, maybe we could include a warning in the property description that potential starvation could occur if the underlying disk can't keep up with the requests? |
@robn I was actually able to get this PR to trigger deadman events even with I set the ZFS params to:
I then created a scsi_debug device with these params set:
... then created the pool with the scsi_debug device. I then set these additional scsi_debug params:
Then I copied the |
I'm going to see if I can starve out a scrub and MMP write with |
@tonyhutter thanks for trying to test those cases; I'm mostly going from memory and gut feels and its good to have things confirmed, or not! From memory the deadman path takes the front (oldest) zio from the queue active list and uses it for the slow report, because we know its the oldest one outstanding. It therefore follows that that if the slow IO is not on that list, we won't report on it. The slow IO however will still be holding up the txg sync, and with no reporting, all the knockon "deadman fired" events (like suspending the pool, zed actions, etc) won't fire. That might be ok! It might be ok to just document, "you lose all these other things" or something, so long as we're aware of it. But this is part of why, for now, I'm against a vdev property: because there's knock-on effects that are to some extent tied up in implementation details that I definitely want to change, and I'm not aware of any policy or process around deprecating/removing a property in the future. To be clear, I'm not against having the option available. If it had started as a module parameter, I would not have flinched. I think it's crucial we have ways to change internal behaviour in the field to gain more experience without too much overhead. My only concern with any of this is that we can support it into the future. Properties feel to me like real "user interface", like buttons, switches or lights on the dashboard, while module parameters feel more like messing around under the hood. If I'm overblowing it, feel free to tell me to go away: I get it, the "best" way isn't always possible, and I'm not gonna be the person blocking this - it's important we can learn more about these use-cases. Other possible options:
I volunteer to make any/all of these possible if it would help (I'm not gonna request a bunch of unrelated work from someone, that's rude). |
(incidentally, I love how many subtle and weird and things properly fast IO is throwing up. It's nice to be solving hard problems!) |
@robn - some random thoughts: One alternative is to pull it in as a vdev property and then keep it in master until 2.4.x. So don't backport it to 2.3.x or 2.2.x. That will give it some soak time and will help expose any unknowns. One thing I didn't think of though is that you can set vdev properties on the root vdev, top-level-vdevs and leaf vdevs, which could make this complicated. Like, what happens if you have Another idea would be not have the vdev property at all and just overload |
There's currently no inheritance on vdev properties, so nothing happens. We've talked a little internally at Klara about what property inheritance should mean, and turns out it's kinda weird no matter how you slice it. But yeah, for now, nothing. I did think about getting it on master and then not backporting it, but I don't think it gets us much, and possible makes it worse. "Soak time" assumes that there's people running master and kicking the tyres on new things, and to me this feels too niche for many (any?) people to actually do that. If they don't, then we're just having to make the same decisions when its time to cut 2.4. If they do, then it's the same problem, in that we have a user-acessible lever tied to an implementation detail, with the bonus that it's already enabled on pools already out there. If we were doing a module parameter, I would just make a new one, not overload an existing one. Based on precedent, I don't consider module parameters to be stable between release series, and I'd have no problem shipping one in (say) 2.3 and then removing it in 2.4. We already do this with lots of different parameters (most notably ARC related). (Below follows an extended ramble where I ask a bunch of questions, sort of answer them along the way, and come to the same conclusion. The joys of vacation; got time on my hands). Maybe there's some clarity to be found in considering when and why an operator might want to enable this option, or maybe better, when not to. That's something for me too; I initially saw a mention of So, I'm an operator, and I've got my shiny new NVMe pool, and I feel like it's going a bit slow. I search around and find people talking about an option that makes it go faster. I turn it on and indeed, it goes faster. I am happy! Hurrah! Or, I turn it on, and it doesn't go faster because my workload (or pool topology or overall system architecture) isn't a fit for it, and I'm now annoyed, but I don't bother turning it off because eh, it doesn't do anything anyway. Alternatively, I'm not using an NVMe pool, just regular spinners, but I read about the option and wonder, and I turn it on. Everything suddenly goes slower, so I turn it off again, fine. But maybe it doesn't go slower, because I have some funky setup with tons of block cache "under" ZFS. (I dunno, I'm trying to invent a scenario I'm pretty sure isn't real). So! Is this is only a GO FAST / GO SLOW toggle? Or does it affect something deeper? Let's say I'm right for the moment, and it does have two knock-on effects: no slow IO warnings for bypassed IOs, and priority queued IO may not get serviced if bypassed IOs are hogging all the worker threads. As an operator, am I going to be surprised if I toggle this and other things stop working properly? What if I didn't know that they might beforehand? What if I did now but I didn't really understand it? What if it didn't affect something deeper? What if we "fixed" those? Like, maybe queue bypass only works when we're not scrubbing or other things that "naturally" degrade pool performance, though maybe we want to prioritise ZIL writes still, or MOS updates, or...? Maybe queue bypass only works when the pool is definitely in good health and responding well (ie when we can be more certain that slow IOs wouldn't have happened anyway). I hope it's obvious that these are examples; these two may not end up being problems at all. There might be different things we haven't thought of. New ones might come up as a result of future changes; how does that affect the property being enabled or not? What if there was a future where If it's only limited to direct IO, does that change the answers? Does it add any new problems? And then, if it was possible for it to be only a GO FAST / GO SLOW toggle, would we even need it? Is there ever a time where we wouldn't bypass the queue for "safe" IO, and so don't even need a toggle? What if it wasn't an explicit "bypass the queue", but more of a hint: "prioritise direct IO when safe to do so?". That would sidestep most of the questions, because it clearly says "idk, give it a try, I promise I won't mess up your pool". Is there any value in such a thing? But if we had that, then how is that any different in concept to the existing priority system? Note "in concept"; I know it would suck under the existing priority system because of how it's currently implemented, but that's not the point. So I don't have complete answers for all of this, and I'll admit that probably my partial answers are informing my questions a bit, so again, I am not assuming that I'm definitely right here. But, all this is why this seems like a module parameter to me, because those are things we currently have to influence implementation-specific behaviour. Lifting the lid to flip a "bypass queue" switch carries with it an acknowledgement that the operator knows they're doing something weird, and at least, they now know there is thing called "queue" and presumably it's there for a reason. So they flip it and everything goes wrong and at least maybe they know why, and if they ask for help we can say "ok, turn that off, and thanks for the feedback on you scenario". The one shortcoming here is about wanting to apply this to a specific vdev. I know that properties are the only comfortable way to do that right now, but that doesn't mean they're the right way necessarily, and the only reason we don't have some more transient way is that we haven't needed one. Is this the change that forces our hand? Ask yourself, if I had said at the start "this should be a vdev-specific module parameter" because that's a facility we aready had, would this discussion even had happened? If it's "I don't think so", then I will go and write that facility this weekend (and use the hell out of it myself; I have a patch to lift the vdev queue limit parameters into vdev properties that I finished but shelved for all the reasons above: I didn't want to commit to surfacing those implementation details yet). Conclusion: don't put a "bypass queue" button on the dashboard without first (a) defining "queue", (b) committing to "queue" being a thing in the long term and (c) making sure it does only that. But feel free to put whatever you want under the hood, because it's already expected that (a) there's loads of things in there you don't understand, (b) the next model could have a totally different engine and (c) you have no expectation that it won't do weird things to other things in there, because it's a whole engine and everything is there for a reason, right? (I genuinely did not start out with car analogies, but there's a reason they're so prevalent in engineering discplines!) |
I think you raise some good points on the vdev prop vs module param discussion. I'll have to think it over a little more.
Hmm.. could we do a hybrid approach where we allow O_DIRECT queue bypass if and only if the pending queue is empty ( zio_t *
vdev_queue_io(zio_t *zio)
...
if (!zio->io_vd->vdev_queue_io && vq->vq_active == 0) {
zio->io_queue_state = ZIO_QS_NONE;
return (zio);
} So if you had a workload that was 99% O_DIRECT writes and 1% queued writes, you'd get most (but not all) of your O_DIRECT writes to be bypass. That would solve the starvation problem but not the slow IO detection. I don't know if this would actually work, just thinking aloud. You could get stuck into a "pileup" problem where there's exists one queued write, and so the next 99 O_DIRECT writes get queued up too, and then you never get back into bypass mode since the queue is backed up. Maybe there's enough gaps where the queue goes back to zero that you can get into bypass mode fairly easily. I'd be curious to test this on real hardware.
If your disk was unhealthy and not responding to IOs, and you had at least a handful of queued-IOs (even just 1% of the writes), then presumably those would get reported as slow IOs, and you'd at least have somewhat of a canary in the coal mine, right? So maybe this slow IO detection isn't as much of an issue? |
@MigeljanImeri I've been trying to test/benchmark this PR, but in the process I've uncovered some unrelated bugs in ZFS (#16707, #16695) that are slowing down my testing. Can you describe your setup for getting those 3.5x gains you mentioned? Did you get those running |
I was running just a single pool with one vdev. Had recordsize set to 4k. Wrote out 256 GB over 256 files (each one gig) with o_direct. Then doing o_direct random 4k reads with fio, 256 jobs, each job reading from its own individual file (we saw some weird contention with rangelocks and this was how we got around it). |
I just tested with a single NVMe pool. Sure enough, turning off queuing does improve IOPS with O_DIRECT: -- IOPS -- fio, blocks=4k, randread, direct=1 |
I did get over 500k IOPS with direct=0, but I'm assuming that's due to caching. |
I'll try to run my "scrub while FIOing" test again, but this time using the NVMe drive, since there seems to be weird stuff going on with brd + ZFS. |
6da75aa
to
118abc4
Compare
I limited the queue bypass to only direct io. I think I've come to the conclusion that this should probably be a module parameter, rather than a vdev property, due to the potential issues that could arise with starvation and slow io detection. The only concern I have is that the current vdev module parameters aren't specific to a vdev. Ideally, we would be able to toggle this for individual vdevs. |
I'm having trouble seeing the benefits of the "Direct IOs can bypass the queue" special case. What does that get us over just having There's one other "hybrid" option we could consider:
hybrid: If this is a "normal IO" and the queue is empty, then do not queue the IO. That should prevent starvation of the important IOs. Assuming most of the IOs are "normal" then most of the time then you're not going to be queuing, which is what you want. Important IOs would be things like ZIO_FLAG_IO_REPAIR, ZIO_FLAG_SELF_HEAL, ZIO_FLAG_RESILVER, ZIO_FLAG_SCRUB, ZIO_FLAG_IO_RETRY... In those cases you would always queue. So if you're running as resilver, a bunch of resilver IOs are going to get queued, which means your "normal IOs" get queued as well, and so you don't starve out the resilver. Hybrid mode would look a little like this (untested): vdev_queue_io(zio_t *zio)
...
zio->io_flags |= ZIO_FLAG_DONT_QUEUE;
zio->io_timestamp = gethrtime();
if (!zio->io_vd->vdev_queue_io && ZIO_IS_NORMAL(zio)) {
int length = vdev_queue_class_length(zio->io_vd, zio->io_priority);
if (length == 0) {
/* Queue is empty - bypass queue */
zio->io_queue_state = ZIO_QS_NONE;
return (zio);
}
} I don't think we'd ever default I don't think this should be a module param because you can have big discrepancies in latency and bandwidth between disks in a pool. Imagine you have a pool comprised of spinning disks, and some NVMe drives for a special device. You would want |
Since I was the one arguing against a vdev property upthread, I'll just clarify that. I do think this is a per-vdev option. I just don't think it's a property, because those are saved, which implies a certain amount of confidence that we have it right and won't change it. It could be a property in the future once it's stabilised and/or we have some way to make a property "transient", or to mark it as deprecated if this doesn't work out. If you don't think it's a problem though, eg, if you're cool with removing it in the future, maybe even the near future, then I'm cool with that. Or you can just tell me to pull my head in :) |
We could make a class of properties "experimental" and then have to pass As far as the PR is now, I do see a lot of value in it due to the IOPS increase, even if it only implements |
The main reason for this is to limit the probability of not catching "Slow IO's" through the deadman path, along with the potential starvation issues. Although this "hybrid" approach looks like it could mitigate these issues. If we are not as worried about that though there is no reason to limit it to only Direct IO's. |
@MigeljanImeri I think there's a way we could do this and still get deadman detection to still work. We would basically do the "hybrid" method but add the bypass ZIOs to The new code would work like this:
Note that we would need a new mutex_enter(vq->vq_active_list_lock);
list_insert_tail(&vq->vq_active_list, zio);
mutex_exit(vq->vq_active_list_lock); The nice thing is that we never have to iterate on We could enable bypass by default on solid state drives ( |
You make this a threshold too, like "allow bypass if the normal queue is 10% full or less". The percentage could be adjustable with a module param, but you'd want to make it capped to something safe. The threshold might help make the use of bypass less bursty when there are occasional queued IOs. |
118abc4
to
90842f8
Compare
Okay, I've implemented this hybrid method and added the module parameter. Seems to be slightly less performant (from my limited testing), than the previous iteration, but we don't break |
Nice, I just retested and am only seeing a slight decrease as well:
I also ran a scrub while running I'll take a look at the code now |
@@ -249,6 +249,14 @@ typedef uint64_t zio_flag_t; | |||
#define ZIO_CHILD_BIT(x) (1U << (x)) | |||
#define ZIO_CHILD_BIT_IS_SET(val, x) ((val) & (1U << (x))) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add a comment here so it's understood why these flags are called out:
/*
* ZIOs that are ZIO_FLAG_IMPORTANT are always queued so that they never get
* starved out. This allows us to bypass the queue for "normal" reads and
* writes when the queues are low for better IOPS. If the queues get too high
* then we go back to queuing the "normal" reads/writes so as not to starve
* out more important IOs like scrub/resilver/retry. See
* zfs_vdev_queue_bypass_pct for details.
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, queue_bypass_pct, UINT, ZMOD_RW, | ||
"Queue bypass percentage per vdev"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll want to add an entry for zfs_vdev_queue_bypass_pct
to man/man4/zfs.4
, like:
zfs_vdev_queue_bypass_pct=10 (uint)
Allow bypassing the vdev's queue if the vdev queue is less than
zfs_vdev_queue_bypass_pct percent full. This only applies to
SSDs (non-rotational drives). You can use 0 to always queue IOs
and 100 to never queue IOs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
tests/runfiles/common.run
Outdated
@@ -544,7 +544,7 @@ tags = ['functional', 'cli_root', 'zpool_scrub'] | |||
[tests/functional/cli_root/zpool_set] | |||
tests = ['zpool_set_001_pos', 'zpool_set_002_neg', 'zpool_set_003_neg', | |||
'zpool_set_ashift', 'zpool_set_features', 'vdev_set_001_pos', | |||
'user_property_001_pos', 'user_property_002_neg', | |||
'user_property_001_pos', 'user_property_002_neg', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, removed.
module/zfs/vdev_queue.c
Outdated
/* | ||
* Bypass queue if certain conditions are met. Queue bypassing requires | ||
* a non-rotational device. Reads / writes will attempt to bypass queue, | ||
* depending on how full the queue is. Other operations will always | ||
* queue. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add to this comment: Bypassing the queue can lead to a 2x IOPS speed-ups on some benchmarks. If the queue is too full (due to a scrub or resilver) then go back to queuing normal reads/writes so as not to starve out the more important IOs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
Could you reword the commit message to call out the
|
90842f8
to
6dd8582
Compare
Reworded. |
@MigeljanImeri looks like there's a NULL pointer dereference happening in the test suite:
|
if (zio->io_vd->vdev_nonrot && ZIO_IS_NORMAL(zio)) { | ||
|
||
int bypass = vdev_queue_length(vq->vq_vdev) < | ||
(zfs_vdev_max_active * zfs_vdev_queue_bypass_pct) / 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zfs_vdev_max_active was supposed to be equal to a number of commands that drive can handle. Its current limit of 1000 effectively disables the functionality. Taking 10% of it makes no sense. Same time if you reduce it to 32 reasonable for SATA, then 10% of it will become 3 requests, which you reach immediately in most cases.
Same time I don't see vq_active
to be incremented for bypass commands, which makes me wonder how are you expecting the threshold be reached until we magically get something that we won't bypass?
Also to avoid lock contention on one per-vdev lock you introduce another per-vdev lock, adding even more atomic operations, that makes me wonder if this is an optimization at all.
Allow bypassing the vdev queue on SSDs if the vdev queue is less than zfs_vdev_queue_bypass_pct percent full. This can lead to an over 2x IOPS speed-up on some benchmarks. The intention behind this property is to improve performance when using O_DIRECT. Signed-off-by: MigeljanImeri <[email protected]>
6dd8582
to
e33957a
Compare
Added vdev to bypass vdev queue when reading / writing from / to vdev. The intention behind this property is to improve performance when using o_direct.
Motivation and Context
Bypassing the vdev queue yields significant performance improvements when doing o_direct random reads on NVMe SSD's. We noticed a contention point with the
vq->vq_lock
in a flamegraph and wondered if this contention could be reduced through bypassing the vdev queue.Description
An additional check is needed in
vdev_queue_io_done
to see if the ZIO was actually added to the queue, since the property may have been toggled in betweenvdev_queue_io
andvdev_queue_io_done
.Using a gen 5 NVMe SSD, we were able to get roughly 3.5x the IOPS (136k -> 469k), through bypassing the vdev queue when doing o_direct random reads.
How Has This Been Tested?
Added a test case to check if toggling the bypassqueue property will cause kernel panic. When creating this vdev property, I ran into a situation where if the property was toggled while a ZIO was going through the pipeline, it would cause a panic because it was expected to be in a queue when it was never added to the queue. We now check if the ZIO has been added to the vdev queue before removing it.
Types of changes
Checklist:
Signed-off-by
.