Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vdev property to bypass vdev queue #16591

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

MigeljanImeri
Copy link
Contributor

Added vdev to bypass vdev queue when reading / writing from / to vdev. The intention behind this property is to improve performance when using o_direct.

Motivation and Context

Bypassing the vdev queue yields significant performance improvements when doing o_direct random reads on NVMe SSD's. We noticed a contention point with the vq->vq_lock in a flamegraph and wondered if this contention could be reduced through bypassing the vdev queue.

Description

An additional check is needed in vdev_queue_io_done to see if the ZIO was actually added to the queue, since the property may have been toggled in between vdev_queue_io and vdev_queue_io_done.

Using a gen 5 NVMe SSD, we were able to get roughly 3.5x the IOPS (136k -> 469k), through bypassing the vdev queue when doing o_direct random reads.

How Has This Been Tested?

Added a test case to check if toggling the bypassqueue property will cause kernel panic. When creating this vdev property, I ran into a situation where if the property was toggled while a ZIO was going through the pipeline, it would cause a panic because it was expected to be in a queue when it was never added to the queue. We now check if the ZIO has been added to the vdev queue before removing it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it might have sense for some very fast NVMe's with "unlimited" queue depth and I/O rate, since others may suffer from I/O aggregation lost as a side effect.

Also it seems that you are still obtaining the queue lock on completion, aren't you?

Also I don't see what would initialize io_queue_state you check on completion if the queue is bypassed.

Also the property seems to have negative meaning. It is preferable to use positive ones to avoid double negation.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 1, 2024
@MigeljanImeri
Copy link
Contributor Author

I guess it might have sense for some very fast NVMe's with "unlimited" queue depth and I/O rate, since others may suffer from I/O aggregation lost as a side effect.

Also it seems that you are still obtaining the queue lock on completion, aren't you?

Changed to avoid obtaining the queue lock if nothing gets removed from the queue.

Also I don't see what would initialize io_queue_state you check on completion if the queue is bypassed.

Changed to set the io_queue_state if not queueing the zio.

Also the property seems to have negative meaning. It is preferable to use positive ones to avoid double negation.

Changed property to use positive meaning.

man/man7/vdevprops.7 Outdated Show resolved Hide resolved
@robn
Copy link
Member

robn commented Oct 7, 2024

I am uncomfortable with this change.

I agree that the vq_lock runs white-hot when the disks underneath are very very fast. I've spent many months trying to reorganise this lock, with mixed success, mostly because in all my testing it just moves the contention to other locks. That was all pre-direct-IO though, so that may be different now, though my own workloads are not O_DIRECT.

So the main thing I know that this will "break" is that the IO is no longer tracked for the "slow IO" checks to find (vdev_deadman()). That's actually the main reason this lock runs hot, because removing the IO from the list is the only reason we take it on the IO return path.

The other part I'm just nervous about is that I don't know how it will interact with IO that is on the queue. Seems like it could starve other high-priority IO. This sort of stuff is why I've spent so long on it without a conclusive path forward.

I would weakly support this being done as a module parameter so it's at least available in the field for experimentation on specific customer workload, though it's a little tough with module parameters being scoped far wider than a single vdev instance. Though maybe that's fine to start for an experimental feature; the other vdev_queue_* parameters have the same issue.

For now though I'd be against having as a vdev property, because that is both more visible and more permanent, and I don't have the confidence to support it yet.

If I were overruled here, I would at least make the scope far more narrow: maybe only limit it to Direct IO IOs, and definitely make the name clear that is more narrowly. I would suggest something like your original and make it "queue bypass" or similar (eg "directio queue bypass" if it were limited to those).

(I appreciate the point about avoiding double-negatives, but I think its possible to claim that "bypass" is a specific, positive action. At least, for me, "queue io" is way too vague, and might cause me to say "why wouldn't I turn that off? Queues sound slow!". But I this is only a tiny part of my objection to this whole thing, so if it does stay in its current form, I'm unlikely to worry too much about naming).

@MigeljanImeri
Copy link
Contributor Author

I am uncomfortable with this change.

I agree that the vq_lock runs white-hot when the disks underneath are very very fast. I've spent many months trying to reorganise this lock, with mixed success, mostly because in all my testing it just moves the contention to other locks. That was all pre-direct-IO though, so that may be different now, though my own workloads are not O_DIRECT.

I have been working with @bwatkinson into improving IOPS performance with reads and we have also noticed that getting around this lock just seems to shift contention elsewhere. I am hoping that with further research and testing we can further reduce this contention, this bypassing the vdev queue idea is just one step.

So the main thing I know that this will "break" is that the IO is no longer tracked for the "slow IO" checks to find (vdev_deadman()). That's actually the main reason this lock runs hot, because removing the IO from the list is the only reason we take it on the IO return path.

The other part I'm just nervous about is that I don't know how it will interact with IO that is on the queue. Seems like it could starve other high-priority IO. This sort of stuff is why I've spent so long on it without a conclusive path forward.

I would weakly support this being done as a module parameter so it's at least available in the field for experimentation on specific customer workload, though it's a little tough with module parameters being scoped far wider than a single vdev instance. Though maybe that's fine to start for an experimental feature; the other vdev_queue_* parameters have the same issue.

As for implementing this as a module parameter, I am not too sure which way we should go. I am okay with it either way, maybe others (@amotin, @tonyhutter) can provide some insight into what should be done here?

For now though I'd be against having as a vdev property, because that is both more visible and more permanent, and I don't have the confidence to support it yet.

If I were overruled here, I would at least make the scope far more narrow: maybe only limit it to Direct IO IOs, and definitely make the name clear that is more narrowly. I would suggest something like your original and make it "queue bypass" or similar (eg "directio queue bypass" if it were limited to those).

I agree that this property should probably be limited to only O_DIRECT IO's, as from the testing that I have done, it doesn't seem to improve performance with non O_DIRECT workloads.

(I appreciate the point about avoiding double-negatives, but I think its possible to claim that "bypass" is a specific, positive action. At least, for me, "queue io" is way too vague, and might cause me to say "why wouldn't I turn that off? Queues sound slow!". But I this is only a tiny part of my objection to this whole thing, so if it does stay in its current form, I'm unlikely to worry too much about naming).

@bwatkinson
Copy link
Contributor

bwatkinson commented Oct 23, 2024

@behlendorf do you have an opinion on whether we should stick with a VDEV property that is limited to Direct I/O or would a module parameter make more sense at this point? I see a good argument for both.

@bwatkinson bwatkinson requested a review from behlendorf October 23, 2024 00:34
@tonyhutter
Copy link
Contributor

My preference would be for a vdev property, since you could set it NVMe vdevs, but not spinning disks. This assumes we're 100% comfortable with this feature though.

@MigeljanImeri do you have any thoughts on @robn's comments regarding breaking the deadman IO checks and the IO priority starvation? For example, if you submit a un-queue'd IO, and the drive "eats" the IO with no response, how does this codepath handle that?

@MigeljanImeri
Copy link
Contributor Author

My preference would be for a vdev property, since you could set it NVMe vdevs, but not spinning disks. This assumes we're 100% comfortable with this feature though.

@MigeljanImeri do you have any thoughts on @robn's comments regarding breaking the deadman IO checks and the IO priority starvation? For example, if you submit a un-queue'd IO, and the drive "eats" the IO with no response, how does this codepath handle that?

If we restrict this to only o_direct reads and writes, the o_direct code path should propagate up any errors that occur. In terms of priority starvation, I don't foresee this being an issue with newer NVMe SSD's that are very fast. Although this could be an issue with slower devices, maybe we could include a warning in the property description that potential starvation could occur if the underlying disk can't keep up with the requests?

@tonyhutter
Copy link
Contributor

@robn I was actually able to get this PR to trigger deadman events even with queue_io=off.

I set the ZFS params to:

zfs_deadman_checktime_ms=100
zfs_deadman_ziotime_ms=500
zfs_deadman_enabled=1

I then created a scsi_debug device with these params set:

modprobe scsi_debug max_luns=1 num_tgts=1 add_host=1 dev_size_mb=1000

... then created the pool with the scsi_debug device. I then set these additional scsi_debug params:

delay=100
every_nth=1000

Then I copied the zfs/build directory from the zfs source workspace to the pool and saw deadman events generated. This was NOT using O_DIRECT writes - just normal writes.

@tonyhutter
Copy link
Contributor

I'm going to see if I can starve out a scrub and MMP write with queue_io=off.

@robn
Copy link
Member

robn commented Oct 24, 2024

@tonyhutter thanks for trying to test those cases; I'm mostly going from memory and gut feels and its good to have things confirmed, or not!

From memory the deadman path takes the front (oldest) zio from the queue active list and uses it for the slow report, because we know its the oldest one outstanding. It therefore follows that that if the slow IO is not on that list, we won't report on it. The slow IO however will still be holding up the txg sync, and with no reporting, all the knockon "deadman fired" events (like suspending the pool, zed actions, etc) won't fire.

That might be ok! It might be ok to just document, "you lose all these other things" or something, so long as we're aware of it.

But this is part of why, for now, I'm against a vdev property: because there's knock-on effects that are to some extent tied up in implementation details that I definitely want to change, and I'm not aware of any policy or process around deprecating/removing a property in the future.

To be clear, I'm not against having the option available. If it had started as a module parameter, I would not have flinched. I think it's crucial we have ways to change internal behaviour in the field to gain more experience without too much overhead. My only concern with any of this is that we can support it into the future. Properties feel to me like real "user interface", like buttons, switches or lights on the dashboard, while module parameters feel more like messing around under the hood.

If I'm overblowing it, feel free to tell me to go away: I get it, the "best" way isn't always possible, and I'm not gonna be the person blocking this - it's important we can learn more about these use-cases.

Other possible options:

  • make it a hidden property; a secret switch under the dashboard is perhaps a bit more defensible
  • add some helpers so we can target module parameters to specific objects (vdevs, datasets, pools)
  • add some sort "transient" property type, that can be set, but isn't stored
  • gate the property behind a module parameter, so it does nothing unless a parameter is also set

I volunteer to make any/all of these possible if it would help (I'm not gonna request a bunch of unrelated work from someone, that's rude).

@robn
Copy link
Member

robn commented Oct 24, 2024

(incidentally, I love how many subtle and weird and things properly fast IO is throwing up. It's nice to be solving hard problems!)

@tonyhutter
Copy link
Contributor

@robn - some random thoughts:

One alternative is to pull it in as a vdev property and then keep it in master until 2.4.x. So don't backport it to 2.3.x or 2.2.x. That will give it some soak time and will help expose any unknowns. One thing I didn't think of though is that you can set vdev properties on the root vdev, top-level-vdevs and leaf vdevs, which could make this complicated. Like, what happens if you have queue_io=off on the top level vdev, and queue_io=on on the mirror-0 vdev and queue_io=off on one of the leaf mirror vdevs.

Another idea would be not have the vdev property at all and just overload zfs_vdev_def_queue_depth=0 to mean "don't queue anything". That makes me curious what setting zfs_vdev_def_queue_depth=0 does without this PR...

@robn
Copy link
Member

robn commented Oct 24, 2024

There's currently no inheritance on vdev properties, so nothing happens. We've talked a little internally at Klara about what property inheritance should mean, and turns out it's kinda weird no matter how you slice it. But yeah, for now, nothing.

I did think about getting it on master and then not backporting it, but I don't think it gets us much, and possible makes it worse. "Soak time" assumes that there's people running master and kicking the tyres on new things, and to me this feels too niche for many (any?) people to actually do that. If they don't, then we're just having to make the same decisions when its time to cut 2.4. If they do, then it's the same problem, in that we have a user-acessible lever tied to an implementation detail, with the bonus that it's already enabled on pools already out there.

If we were doing a module parameter, I would just make a new one, not overload an existing one. Based on precedent, I don't consider module parameters to be stable between release series, and I'd have no problem shipping one in (say) 2.3 and then removing it in 2.4. We already do this with lots of different parameters (most notably ARC related).


(Below follows an extended ramble where I ask a bunch of questions, sort of answer them along the way, and come to the same conclusion. The joys of vacation; got time on my hands).

Maybe there's some clarity to be found in considering when and why an operator might want to enable this option, or maybe better, when not to. That's something for me too; I initially saw a mention of vq_lock and kinda rubbed my hands together because I've been training for this moment 💪 and maybe I haven't thought about it on its merits.

So, I'm an operator, and I've got my shiny new NVMe pool, and I feel like it's going a bit slow. I search around and find people talking about an option that makes it go faster. I turn it on and indeed, it goes faster. I am happy! Hurrah! Or, I turn it on, and it doesn't go faster because my workload (or pool topology or overall system architecture) isn't a fit for it, and I'm now annoyed, but I don't bother turning it off because eh, it doesn't do anything anyway.

Alternatively, I'm not using an NVMe pool, just regular spinners, but I read about the option and wonder, and I turn it on. Everything suddenly goes slower, so I turn it off again, fine. But maybe it doesn't go slower, because I have some funky setup with tons of block cache "under" ZFS. (I dunno, I'm trying to invent a scenario I'm pretty sure isn't real).

So! Is this is only a GO FAST / GO SLOW toggle? Or does it affect something deeper?

Let's say I'm right for the moment, and it does have two knock-on effects: no slow IO warnings for bypassed IOs, and priority queued IO may not get serviced if bypassed IOs are hogging all the worker threads. As an operator, am I going to be surprised if I toggle this and other things stop working properly? What if I didn't know that they might beforehand? What if I did now but I didn't really understand it?

What if it didn't affect something deeper? What if we "fixed" those? Like, maybe queue bypass only works when we're not scrubbing or other things that "naturally" degrade pool performance, though maybe we want to prioritise ZIL writes still, or MOS updates, or...? Maybe queue bypass only works when the pool is definitely in good health and responding well (ie when we can be more certain that slow IOs wouldn't have happened anyway).

I hope it's obvious that these are examples; these two may not end up being problems at all. There might be different things we haven't thought of. New ones might come up as a result of future changes; how does that affect the property being enabled or not? What if there was a future where vq_lock didn't exist (I'm working on it I swear). What's the user's expectation then?

If it's only limited to direct IO, does that change the answers? Does it add any new problems?

And then, if it was possible for it to be only a GO FAST / GO SLOW toggle, would we even need it? Is there ever a time where we wouldn't bypass the queue for "safe" IO, and so don't even need a toggle?

What if it wasn't an explicit "bypass the queue", but more of a hint: "prioritise direct IO when safe to do so?". That would sidestep most of the questions, because it clearly says "idk, give it a try, I promise I won't mess up your pool". Is there any value in such a thing?

But if we had that, then how is that any different in concept to the existing priority system? Note "in concept"; I know it would suck under the existing priority system because of how it's currently implemented, but that's not the point.

So I don't have complete answers for all of this, and I'll admit that probably my partial answers are informing my questions a bit, so again, I am not assuming that I'm definitely right here. But, all this is why this seems like a module parameter to me, because those are things we currently have to influence implementation-specific behaviour. Lifting the lid to flip a "bypass queue" switch carries with it an acknowledgement that the operator knows they're doing something weird, and at least, they now know there is thing called "queue" and presumably it's there for a reason. So they flip it and everything goes wrong and at least maybe they know why, and if they ask for help we can say "ok, turn that off, and thanks for the feedback on you scenario".

The one shortcoming here is about wanting to apply this to a specific vdev. I know that properties are the only comfortable way to do that right now, but that doesn't mean they're the right way necessarily, and the only reason we don't have some more transient way is that we haven't needed one. Is this the change that forces our hand? Ask yourself, if I had said at the start "this should be a vdev-specific module parameter" because that's a facility we aready had, would this discussion even had happened? If it's "I don't think so", then I will go and write that facility this weekend (and use the hell out of it myself; I have a patch to lift the vdev queue limit parameters into vdev properties that I finished but shelved for all the reasons above: I didn't want to commit to surfacing those implementation details yet).

Conclusion: don't put a "bypass queue" button on the dashboard without first (a) defining "queue", (b) committing to "queue" being a thing in the long term and (c) making sure it does only that. But feel free to put whatever you want under the hood, because it's already expected that (a) there's loads of things in there you don't understand, (b) the next model could have a totally different engine and (c) you have no expectation that it won't do weird things to other things in there, because it's a whole engine and everything is there for a reason, right?

(I genuinely did not start out with car analogies, but there's a reason they're so prevalent in engineering discplines!)

@tonyhutter
Copy link
Contributor

I think you raise some good points on the vdev prop vs module param discussion. I'll have to think it over a little more.

What if it didn't affect something deeper? What if we "fixed" those? Like, maybe queue bypass only works when we're not scrubbing or other things that "naturally" degrade pool performance, though maybe we want to prioritise ZIL writes still, or MOS updates, or...?

Hmm.. could we do a hybrid approach where we allow O_DIRECT queue bypass if and only if the pending queue is empty (vq->vq_active == 0)?:

zio_t *                                                                          
vdev_queue_io(zio_t *zio)   
...
       if (!zio->io_vd->vdev_queue_io && vq->vq_active == 0) {                   
              zio->io_queue_state = ZIO_QS_NONE;                                 
              return (zio);                                                      
       }  

So if you had a workload that was 99% O_DIRECT writes and 1% queued writes, you'd get most (but not all) of your O_DIRECT writes to be bypass. That would solve the starvation problem but not the slow IO detection. I don't know if this would actually work, just thinking aloud. You could get stuck into a "pileup" problem where there's exists one queued write, and so the next 99 O_DIRECT writes get queued up too, and then you never get back into bypass mode since the queue is backed up. Maybe there's enough gaps where the queue goes back to zero that you can get into bypass mode fairly easily. I'd be curious to test this on real hardware.

Maybe queue bypass only works when the pool is definitely in good health and responding well (ie when we can be more certain that slow IOs wouldn't have happened anyway).

If your disk was unhealthy and not responding to IOs, and you had at least a handful of queued-IOs (even just 1% of the writes), then presumably those would get reported as slow IOs, and you'd at least have somewhat of a canary in the coal mine, right? So maybe this slow IO detection isn't as much of an issue?

@tonyhutter
Copy link
Contributor

@MigeljanImeri I've been trying to test/benchmark this PR, but in the process I've uncovered some unrelated bugs in ZFS (#16707, #16695) that are slowing down my testing. Can you describe your setup for getting those 3.5x gains you mentioned? Did you get those running fio with O_DIRECT? What did your pool config and record size look like?

@MigeljanImeri
Copy link
Contributor Author

@MigeljanImeri I've been trying to test/benchmark this PR, but in the process I've uncovered some unrelated bugs in ZFS (#16707, #16695) that are slowing down my testing. Can you describe your setup for getting those 3.5x gains you mentioned? Did you get those running fio with O_DIRECT? What did your pool config and record size look like?

I was running just a single pool with one vdev. Had recordsize set to 4k. Wrote out 256 GB over 256 files (each one gig) with o_direct. Then doing o_direct random 4k reads with fio, 256 jobs, each job reading from its own individual file (we saw some weird contention with rangelocks and this was how we got around it).

@tonyhutter
Copy link
Contributor

tonyhutter commented Oct 30, 2024

I just tested with a single NVMe pool. Sure enough, turning off queuing does improve IOPS with O_DIRECT:

-- IOPS --
fio, blocks=4k, seq read, direct=1
queue on:150k
queue off: 280k

fio, blocks=4k, randread, direct=1
queue on: 136k
queue off: 297k

@tonyhutter
Copy link
Contributor

I did get over 500k IOPS with direct=0, but I'm assuming that's due to caching.

@tonyhutter
Copy link
Contributor

I'll try to run my "scrub while FIOing" test again, but this time using the NVMe drive, since there seems to be weird stuff going on with brd + ZFS.

module/zfs/vdev.c Outdated Show resolved Hide resolved
module/zfs/vdev_queue.c Outdated Show resolved Hide resolved
@MigeljanImeri
Copy link
Contributor Author

I limited the queue bypass to only direct io. I think I've come to the conclusion that this should probably be a module parameter, rather than a vdev property, due to the potential issues that could arise with starvation and slow io detection. The only concern I have is that the current vdev module parameters aren't specific to a vdev. Ideally, we would be able to toggle this for individual vdevs.

@tonyhutter
Copy link
Contributor

tonyhutter commented Nov 22, 2024

I'm having trouble seeing the benefits of the "Direct IOs can bypass the queue" special case. What does that get us over just having queue_io control all IOs? If the idea is "my workload runs Direct IO, and I know what I'm doing, so I'm fine with bypassing the queue and accept the risks that go along with it", that's fine, but keep in mind other users on your system may also be using Direct IO as well. In fact a user may be running an application that uses Direct IO without their knowledge (like a database). So they could be buying into the relaxed safety assumptions unwittingly. That's why I lean towards queue_io controlling all IOs. It's just simpler, and it allows the user to bypass the queue for normal, non-Direct IOs as well.

There's one other "hybrid" option we could consider:

queue_io=on|off|hybrid

hybrid: If this is a "normal IO" and the queue is empty, then do not queue the IO. That should prevent starvation of the important IOs. Assuming most of the IOs are "normal" then most of the time then you're not going to be queuing, which is what you want. Important IOs would be things like ZIO_FLAG_IO_REPAIR, ZIO_FLAG_SELF_HEAL, ZIO_FLAG_RESILVER, ZIO_FLAG_SCRUB, ZIO_FLAG_IO_RETRY... In those cases you would always queue. So if you're running as resilver, a bunch of resilver IOs are going to get queued, which means your "normal IOs" get queued as well, and so you don't starve out the resilver.

Hybrid mode would look a little like this (untested):

vdev_queue_io(zio_t *zio)
...
    zio->io_flags |= ZIO_FLAG_DONT_QUEUE;
    zio->io_timestamp = gethrtime();

    if (!zio->io_vd->vdev_queue_io && ZIO_IS_NORMAL(zio)) {
        int length = vdev_queue_class_length(zio->io_vd, zio->io_priority);
        if (length == 0) {
            /* Queue is empty - bypass queue */
            zio->io_queue_state = ZIO_QS_NONE;
            return (zio);
        }
    }

I don't think we'd ever default queue_io to anything other than "on". You would have to manually switch it to "off" or "hybrid" to signal buy-in for the relaxed protections.

I don't think this should be a module param because you can have big discrepancies in latency and bandwidth between disks in a pool. Imagine you have a pool comprised of spinning disks, and some NVMe drives for a special device. You would want queue_io=on for the spinning disks and queue_io=off|hybrid for the NVMe. This is also true since you'd want slow IO detection for the spinning disks, but probably wouldn't care for NVMe. Now consider the same pool with spinning disks and NVMe but you only had a single, monolithic zfs_queue_io module parameter. Would you set it to "on" or "off" or "hybrid?

@robn
Copy link
Member

robn commented Nov 24, 2024

Since I was the one arguing against a vdev property upthread, I'll just clarify that. I do think this is a per-vdev option. I just don't think it's a property, because those are saved, which implies a certain amount of confidence that we have it right and won't change it.

It could be a property in the future once it's stabilised and/or we have some way to make a property "transient", or to mark it as deprecated if this doesn't work out.

If you don't think it's a problem though, eg, if you're cool with removing it in the future, maybe even the near future, then I'm cool with that. Or you can just tell me to pull my head in :)

@tonyhutter
Copy link
Contributor

I do think this is a per-vdev option. I just don't think it's a property, because those are saved, which implies a certain amount of confidence that we have it right and won't change it.

It could be a property in the future once it's stabilised and/or we have some way to make a property "transient", or to mark it as deprecated if this doesn't work out.

We could make a class of properties "experimental" and then have to pass --experimental to zpool set|get to get/set them. They wouldn't be listed by default and there would be no guarantee they would be supported into the future. Future pool upgrades could ignore or remove them. That's an idea outside the scope of this PR though.

As far as the PR is now, I do see a lot of value in it due to the IOPS increase, even if it only implements queue_io=on|off. I'm not as nervous about merging it into master in case it doesn't work out, as master is still the wild west and this feature probably wouldn't go into 2.3.x. If history is any indication, it would have around a year of soak time before going into a future 2.4.x. I still haven't heard a compelling reason to make it only apply to Direct IOs though.

@MigeljanImeri
Copy link
Contributor Author

As far as the PR is now, I do see a lot of value in it due to the IOPS increase, even if it only implements queue_io=on|off. I'm not as nervous about merging it into master in case it doesn't work out, as master is still the wild west and this feature probably wouldn't go into 2.3.x. If history is any indication, it would have around a year of soak time before going into a future 2.4.x. I still haven't heard a compelling reason to make it only apply to Direct IOs though.

The main reason for this is to limit the probability of not catching "Slow IO's" through the deadman path, along with the potential starvation issues. Although this "hybrid" approach looks like it could mitigate these issues. If we are not as worried about that though there is no reason to limit it to only Direct IO's.

@tonyhutter
Copy link
Contributor

tonyhutter commented Dec 18, 2024

@MigeljanImeri I think there's a way we could do this and still get deadman detection to still work. We would basically do the "hybrid" method but add the bypass ZIOs to vq->vq_active_list just like the regular non-bypass ZIOs. vq->vq_active_list is what the deadman looks at in vdev_deadman().

The new code would work like this:

  1. Call vdev_queue_length() to see if the non-bypass queue is empty.
  2. If the non-bypass queue has entries, or the ZIO is not a regular read/write (like a scrub IO), then put the ZIO in the regular non-bypass queue.
  3. If the non-bypass queue is empty, then add it to vq->vq_active_list, and then issue the ZIO.

Note that we would need a new vq->vq_active_list_lock to protect vq->vq_active_list, since both the bypass and non-bypass codepaths would be accessing it. So you'd have to go though everywhere the uses vq->vq_active_list and protect it with the new lock:

mutex_enter(vq->vq_active_list_lock);
list_insert_tail(&vq->vq_active_list, zio);
mutex_exit(vq->vq_active_list_lock);

The nice thing is that we never have to iterate on vq->vq_active_list. All accesses are O(1) so lock contention would be minimal. We're also not doing time consuming stuff like aggregation or LBA lookups like we do while holding vq->vq_lock.

We could enable bypass by default on solid state drives (vd->vdev_nonrot) for normal reads/write which would do away with the need for a vdev property too.

@tonyhutter
Copy link
Contributor

Call vdev_queue_length() to see if the non-bypass queue is empty.

You make this a threshold too, like "allow bypass if the normal queue is 10% full or less". The percentage could be adjustable with a module param, but you'd want to make it capped to something safe. The threshold might help make the use of bypass less bursty when there are occasional queued IOs.

@MigeljanImeri
Copy link
Contributor Author

Okay, I've implemented this hybrid method and added the module parameter. Seems to be slightly less performant (from my limited testing), than the previous iteration, but we don't break zio_deadman this way. I used zfs_vdev_max_active to determine how full the queue is, but this might need tweaking as zfs_vdev_max_active is currently set to 1000, which seems significantly higher than the sum of all the max active queue numbers.

@tonyhutter
Copy link
Contributor

Nice, I just retested and am only seeing a slight decrease as well:

IOPS
fio, blocks=4k, seq read, direct=1
bypass=292k, previously 280k
queue=140k, previously 150k

fio, blocks=4k, randread, direct=1
bypass=289k, previously 297k
queue=131k, previously 136k

I also ran a scrub while running fio and it didn't starve the scrub. So functionally it looks good. I'm happy that our users won't need to tune anything to get this IOPS speed-up.

I'll take a look at the code now

@@ -249,6 +249,14 @@ typedef uint64_t zio_flag_t;
#define ZIO_CHILD_BIT(x) (1U << (x))
#define ZIO_CHILD_BIT_IS_SET(val, x) ((val) & (1U << (x)))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add a comment here so it's understood why these flags are called out:

/*
 * ZIOs that are ZIO_FLAG_IMPORTANT are always queued so that they never get
 * starved out.  This allows us to bypass the queue for "normal" reads and
 * writes when the queues are low for better IOPS.  If the queues get too high
 * then we go back to queuing the "normal" reads/writes so as not to starve
 * out more important IOs like scrub/resilver/retry.  See
 * zfs_vdev_queue_bypass_pct for details.
 */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Comment on lines +1208 to +1212
ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, queue_bypass_pct, UINT, ZMOD_RW,
"Queue bypass percentage per vdev");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to add an entry for zfs_vdev_queue_bypass_pct to man/man4/zfs.4, like:

     zfs_vdev_queue_bypass_pct=10 (uint)
             Allow bypassing the vdev's queue if the vdev queue is less than
             zfs_vdev_queue_bypass_pct percent full.  This only applies to
             SSDs (non-rotational drives).  You can use 0 to always queue IOs
             and 100 to never queue IOs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@@ -544,7 +544,7 @@ tags = ['functional', 'cli_root', 'zpool_scrub']
[tests/functional/cli_root/zpool_set]
tests = ['zpool_set_001_pos', 'zpool_set_002_neg', 'zpool_set_003_neg',
'zpool_set_ashift', 'zpool_set_features', 'vdev_set_001_pos',
'user_property_001_pos', 'user_property_002_neg',
'user_property_001_pos', 'user_property_002_neg',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed.

Comment on lines 961 to 969
/*
* Bypass queue if certain conditions are met. Queue bypassing requires
* a non-rotational device. Reads / writes will attempt to bypass queue,
* depending on how full the queue is. Other operations will always
* queue.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add to this comment: Bypassing the queue can lead to a 2x IOPS speed-ups on some benchmarks. If the queue is too full (due to a scrub or resilver) then go back to queuing normal reads/writes so as not to starve out the more important IOs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@tonyhutter
Copy link
Contributor

Could you reword the commit message to call out the zfs_vdev_queue_bypass_pct module param. Like maybe:

Allow bypassing the vdev queue on SSDs

Allow bypassing the vdev queue on SSDs if the vdev queue is less than
zfs_vdev_queue_bypass_pct percent full.  This can lead to an over 2x
IOPS speed-up on some benchmarks. The intention behind this property
is to improve performance when using O_DIRECT.

@MigeljanImeri
Copy link
Contributor Author

Could you reword the commit message to call out the zfs_vdev_queue_bypass_pct module param. Like maybe:

Allow bypassing the vdev queue on SSDs

Allow bypassing the vdev queue on SSDs if the vdev queue is less than
zfs_vdev_queue_bypass_pct percent full.  This can lead to an over 2x
IOPS speed-up on some benchmarks. The intention behind this property
is to improve performance when using O_DIRECT.

Reworded.

@tonyhutter
Copy link
Contributor

@MigeljanImeri looks like there's a NULL pointer dereference happening in the test suite:

   [ 1641.645984] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  [ 1641.649605] PGD 0 P4D 0 
  [ 1641.650972] Oops: 0000 [#1] SMP NOPTI
  [ 1641.652873] CPU: 0 PID: 116456 Comm: z_vdev_file Tainted: P           OE     -------- -  - 4.18.0-553.34.1.el8_10.x86_64 #1
  [ 1641.658384] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 1641.663036] RIP: 0010:abd_is_from_pages+0x0/0x10 [zfs]
  [ 1641.667700] Code: 11 de bc f3 66 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 e9 fe dd bc f3 0f 1f 00 8b 07 c1 e8 06 83 e0 01 e9 ee dd bc f3 0f 1f 00 <8b> 07 c1 e8 09 83 e0 01 e9 de dd bc f3 0f 1f 00 49 89 f8 48 89 f7
  [ 1641.676833] RSP: 0018:ffffa67301873d28 EFLAGS: 00010202
  [ 1641.679473] RAX: ffff998ee4050000 RBX: 0000000000000000 RCX: 000000003fffffff
  [ 1641.683024] RDX: ffffffffc0000001 RSI: 0000000000000200 RDI: 0000000000000000
  [ 1641.686603] RBP: 0000000000000200 R08: ffff998ee002b028 R09: ffff998ee002b0c8
  [ 1641.690166] R10: ffff998edfc7bcc0 R11: 0000000000000000 R12: 0000000000000200
  [ 1641.693789] R13: 0000000000400a00 R14: ffff998edfc7bf68 R15: ffff998edfc7bf00
  [ 1641.697379] FS:  0000000000000000(0000) GS:ffff999037c00000(0000) knlGS:0000000000000000
  [ 1641.701408] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1641.704331] CR2: 0000000000000000 CR3: 0000000036410004 CR4: 0000000000370ef0
  [ 1641.707895] Call Trace:
  [ 1641.709370]  ? __die_body+0x1a/0x60
  [ 1641.711278]  ? no_context+0x1ba/0x3f0
  [ 1641.713222]  ? __bad_area_nosemaphore+0x157/0x180
  [ 1641.715656]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 1641.718164]  ? do_page_fault+0x37/0x12d
  [ 1641.720162]  ? page_fault+0x1e/0x30
  [ 1641.722029]  ? abd_is_gang+0x10/0x10 [zfs]
  [ 1641.724310]  abd_verify+0x15/0x2b0 [zfs]
  [ 1641.726437]  abd_borrow_buf+0x14/0x60 [zfs]
  [ 1641.728829]  abd_borrow_buf_copy+0x14/0x50 [zfs]
  [ 1641.731316]  vdev_file_io_strategy+0x5a/0x130 [zfs]
  [ 1641.733964]  taskq_thread+0x28c/0x5c0 [spl]
  [ 1641.736267]  ? wake_up_q+0x60/0x60
  [ 1641.738085]  ? taskq_lowest_id+0xc0/0xc0 [spl]
  [ 1641.740427]  kthread+0x134/0x150
  [ 1641.742213]  ? set_kthread_struct+0x50/0x50
  [ 1641.744361]  ret_from_fork+0x35/0x40

if (zio->io_vd->vdev_nonrot && ZIO_IS_NORMAL(zio)) {

int bypass = vdev_queue_length(vq->vq_vdev) <
(zfs_vdev_max_active * zfs_vdev_queue_bypass_pct) / 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zfs_vdev_max_active was supposed to be equal to a number of commands that drive can handle. Its current limit of 1000 effectively disables the functionality. Taking 10% of it makes no sense. Same time if you reduce it to 32 reasonable for SATA, then 10% of it will become 3 requests, which you reach immediately in most cases.

Same time I don't see vq_active to be incremented for bypass commands, which makes me wonder how are you expecting the threshold be reached until we magically get something that we won't bypass?

Also to avoid lock contention on one per-vdev lock you introduce another per-vdev lock, adding even more atomic operations, that makes me wonder if this is an optimization at all.

Allow bypassing the vdev queue on SSDs if the vdev queue is less than
zfs_vdev_queue_bypass_pct percent full. This can lead to an over 2x
IOPS speed-up on some benchmarks. The intention behind this property
is to improve performance when using O_DIRECT.

Signed-off-by: MigeljanImeri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants