-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add module parameter to disable prefetch in zfs_readdir #16318
Conversation
Would it make sense to change this from a boolean to a scalar threshold value on the directory link count to determine whether or not to prefetch? |
Yes, it probably does make sense. I'm willing to try. (I didn't actually write the code, but I can probaby figure that out.) However before doing that it's worth asking whether we can solve the underlying problem. The problem occurs becasue NFS wants to know the name that the directory has in its parennt directory. There's a function to get the name of an inode. It's specific to the file system type. ZFS doesn't have it, so the backup is used. The backup ends up using zfs_readdir to search the directory looking for the inode. So the obvious queston is: can we create a zfs_getname that does it better. Is there somewhere in the ZFS data structures that we could cause the filename. that way when zfs_readdir (or our replacemennt) looks through the parent directory, it can cache the name of every inode as it goes by. Then when we come to look up that node, the name will already be there, and zfs_getname won't need to do a search at all. Since I'm not a zfs developer, I don't know the data structures. Is there a place the name can be put, and obviously cleared if there's a rename or delete? With a normal file, it can be in more than one directory, so a single entry wouldn't work. But for a directory, there's only one name that it has in its parent (i.e ".."), so we don't have that ambiguity. If someone could point me to what needs to be done, I'm willing to code it and test it. I do have experience doing kernel code. (I was one of the original Linux developers when Linus was first writing linux.) |
For clarify, the actual function is int get_name(const struct path *path, char *name, struct dentry *child);
The problem occurs when data is in cache on the client, but the dentry for the child directory isn't in the dentry cache anymolre. So the data can't be stored ni the dentry, since it may not exist for some of the directories whose names we'd like to cache. |
man/man4/zfs.4
Outdated
have a very large number of subdirectories. | ||
A reasonable value would be 20000. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be pedantic, please consider specifying what the units are.
I've changed it to have a limit rather than a boolean. If non-zero, any directory larger than the limit disables prefetch. Note that it fails checkstyle. Checkstyle demands that I use a posix type, replaceing ulong with ulong_t, but the macro doesn't support ulong_t. There are examples in other parts of the code using ulong. I have tested this with 0, a value smaller than my directory size, and a value larger than my directory size. The performance is sufficiently different that I can tell it works as intended. |
there's no unit. It's a count. Directory size. Unlike other file systems, in ZFS the size of the directory is a count of the files in it. |
Perhaps too pedantic, but in general directories have a size in bytes as well as a size in link count. However, if looks like ZFS sets the size to equal the link count. |
I looked a bit further into writing an alternative zfs_get_name. It looks like one could use zap_value_search. However the basic functions for iterating through the zap seem to be the same. I'm sure it would be faster, but it's not clear whether it would be enough to justify the extra code. |
I can take a look at this. Would you mind squashing the commits and take a look at the checkstyle errors? |
How do I squash?
On Jul 11, 2024, at 8:07 PM, Tony Hutter ***@***.***> wrote:
I can take a look at this. Would you mind squashing the commits and take a look at the checkstyle errors?
—
Reply to this email directly, view it on GitHub<#16318 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAORUCCRC2JXFIHGYJZMMBDZL4M2NAVCNFSM6AAAAABKGFOVRCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRUGE2DINJSG4>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
the complaint about non-posix definition was hard to understand. All lthe POSIX alternatives generate compile errors. And I found plenty of uses of ulong in other files. Here's my attempt to use ulong_t param_ops_ulong_t’ undeclared here Here's another file with the same type of declaration not using POSIX. in ./module/os/linux/zfs/zvol_os.c I was unable to figure out what to do. |
I see what you mean, that's bizarre. It's probably a bug in our cstyle.pl. I guess just make it a To squash commits run |
ulong is the one that compiles but fails style I haven't been able to figure out why mine fails but it works in other files. I suspect there's an include file I need. I think I've squashed it correctly. |
The original version, that allows one to disable all prefetch is running in production. I tried the current vetsion on a test server. I’ll retry on test tomorrow to make sure nothing has broken in the update, though I don’t see how that could have happened, since the only thing that changed was spacing.
The first chance to reboot production is mid August.
On Jul 11, 2024, at 9:00 PM, Tony Hutter ***@***.***> wrote:
I see what you mean, that's bizarre. It's probably a bug in our cstyle.pl. I guess just make it a ulong_t for now in module_param() as a workaround.
To squash commits run git rebase -i and squash all your commits into your first commit. After you've successfully squashed them, you can do a force push (git push --force) to update this PR with the single commit.
—
Reply to this email directly, view it on GitHub<#16318 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAORUCEI4Q6JT6UHYCQPFLLZL4TDDAVCNFSM6AAAAABKGFOVRCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRUGIZTGOJUG4>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
@clhedrick looks like a diff --git a/module/os/linux/zfs/zfs_vnops_os.c b/module/os/linux/zfs/zfs_vnops_os.c
index 11f665d22..efe2371fa 100644
--- a/module/os/linux/zfs/zfs_vnops_os.c
+++ b/module/os/linux/zfs/zfs_vnops_os.c
@@ -4257,6 +4257,7 @@ EXPORT_SYMBOL(zfs_map);
module_param(zfs_delete_blocks, ulong, 0644);
MODULE_PARM_DESC(zfs_delete_blocks, "Delete files larger than N blocks async");
+/* CSTYLED */
module_param(zfs_readdir_dnode_prefetch_limit, ulong, 0644);
MODULE_PARM_DESC(zfs_readdir_dnode_prefetch_limit,
"No zfs_readdir prefetch if non-zero and size > this"); Would you mind adding that in? Then we should be pretty much good to go. |
OK. I just tested on a 6.8 kernel with the limit 0, 20000 and 400000, with a directory of size 200000. I got the expected results. 4 min with 0 and 400000, 20 sec with 20000. |
This is tricky to test. The problem occurs when files are cached on the client, but metadata is not, and the dnoce is not cached on the server. This is realistic for us, because our clients have very large memory. Data can stay cached for days if not weeks, but the attribute cache time out after something like 90 sec. I believe the problem occurs when the client does a getattr to see if the data is still fresh. The getattr uses what is in effect a inode number. That triggers the reconnect code in the NFS server, which searches the parent directory for the directory, in order to find its name. So to duplicate the problem, clear cache on the server but not the cllient, and wait at least 120 sec on the client, for the attribute cache to timeout. Clearing cache on the client would cause it to do full lookups on all files, which would be fast. |
Just so I'm clear, diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..588bf9339 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1781,9 +1781,11 @@ When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with more entries than that value. For example, if set to 1000, then stop
+prefetching file metadata after 1000 file's metadata have been prefetched for
+the directory.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
A reasonable value would be 20000.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int |
The documentation is correct. I compare the directory size with the limit. What you describe could be implemented. I would use a counter in the loop. It would take a lot of work on lots of different situations to know whether this is better or not. It's not obvious that it is. A more complex fix would be to implement a zfs-specific get_name. That would basically disable prefetch only in this specific NFS operation. Whether that's worth doing or not depends partly upon whether the prefetch is a good idea in the first place for very large directories. I'm not so sure it is. Consider "ls -l". If information on all files is already in the ARC, the prefetches are wasted. If not, it will queue up a bunch of prefetches, in the order that they occur in the directory traversal. The best case seems to be"ls -l". But ls will sort the files alphabetically. I think it's just as likely that the first file will have to wait for a bench of prefetches are finished to get its data. Prefetch won't be any faster than a demand fetch. It's useful only if the demand comes later, so that by the time it comes the data is in memory, where it would have had to wait otherwise. But it probably doesn't matter for typical file access. The "ls -l" case seems like the best hope, but I'm skeptical because of the ordering issue. You'd need some pretty careful testing in quite a bunch of situations. But I'm skeptical about the usefulness of the prefetch. Certainly not for us, where metadata is always in NVMe. It you can suggest a way to get a directory's name without doing a traversal of the parent at all, that would be worth doing. I'd be willing to code it. But I don't see any obvious way to do that given the data structures. |
Ok, I misunderstood. So it's literally is the 'size' from stat(/directory). (https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/What_is_directory_size) May I suggest the following change?: diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..083ffeaa4 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,18 @@ that ends up not being needed, so it can't hurt performance.
.
.It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
Disable prefetches in readdir for large directories.
-(Normally off)
When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value. Directory size in this case is the
+size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
Disable QAT hardware acceleration for SHA256 checksums. |
To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not, as it is for something like ext4, the amount of space used by the directory structure itself. Obviously I don't actually call stat, although I'm prepared to believe that this is the number that stat uses. I just checked a few directories, and it seems close, off by one maybe. I found code incrementing or drecrementing it for actons with a single file. |
I double checked and that appears to be correct. On ZFS it's basically the number of files/subdirs in the directory, but on other filesystems it's like the metadata size needed for the directory. So it's not strictly defined 🤷♂️ How about: diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..a6a00e107 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,20 @@ that ends up not being needed, so it can't hurt performance.
.
.It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
Disable prefetches in readdir for large directories.
-(Normally off)
When readdir searches a directory, it normally prefetches metadata for
all objects in the directory it checks, even if it's just
looking for a single object.
Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
+Directory size in this case is the size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+On ZFS, this directory size value is approximately the number of files
+and subdirectories in the directory.
A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
.
.It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
Disable QAT hardware acceleration for SHA256 checksums. |
Can you edit the commit message to fix the commitcheck error?:
Other than that, it should be good to go.. |
ok. I also noted in the documentation that it only applies to Linux. I have no reason to think the BSD needs it. It's needed because of details of the NFS implementation. i would hope that BSD hasn't chosen the same approach. At any rate, the BSD readdir doesn't prefetch. |
what's up with this? Test failures can't be due to this patch, as it obvviously has no effect if not enabled. Is this going to be merged? |
I have resolved the conflict, and added advice on when to set the parameter in the man page. |
Add paramter zfs_readdir_dnode_prefetch_limit, defaulting to 0, to control whether zfs_readdir prefetched metadata for objects it look at when reading a directory. If zero, metadata is prefetched for all directory entries. If non-zero, metadata is prefetched only if directory has fewer entries than this. Setting it to non-0 can be important for NFS servers with directories containing many subdirectories. Signed-off-by: Charles Hedrick <[email protected]> Co-authored-by: Chris Siebenmann<[email protected]>
rebased I've built it and booted it Simple tests look ok. There was a comment about impact on performance. Note that it has no effect unless explicitly enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still object this change, since the limit on the directory size it introduces makes no sense. The large directories have the same or may be even higher chance for dnodes they reference to not be in cache and so benefit from prefetch. For smaller directories it might just be less of problem, being hidden by the network and NFS stack latency.
Anything that's worth doing is beyond what I can do.
That is not true if you think for a moment. All we need is to limit prefetch to only first directory read. We just need another flag in znode_t (or replace/modify z_zn_prefetch to save space) to represent that directory was read till the end and there is nothing to prefetch any more next time. Or we could save the position till which directory was previously read, the complication there though is that as I understand serialized ZAP cursor is not monotonically increasing, so we could not compare it for bigger/smaller, but we could save just zc_hash, which should be enough for quick and cheap comparison. And/or we could store the time of the last full read, and issue prefetches only if since previous time passed some predefined time that would justify another prefetch.
Any of that would be pretty trivial to implement, and would be more logical than what is proposed here. BTW, please keep the code in sync between OS'es. FreeBSD code is similar enough to update it also.
The reason I said doing more is beyond me is because I'm not a ZFS developer. One implication is that I don't know the lifetme of various data structures, so it's not clear to me where I can put information and how long it will live. |
znode in most cases lives until kernel under memory pressure will decide to evict inodes, that will evict respective znodes also, or until it is deleted. So I'd say it lives long enough. Just search sources for z_zn_prefetch. It is used only in 5 places on Linux and 4 on FreeBSD, and its logic is pretty simple. Just extend it. |
The ideal solution would be to provide a better way to go from inode to name. Is there someplace to stash the name for a file when readdir (or an equivalent in getname) reads past it, so it could be translated without a directory search? |
I don't see how it is related to this topic of prefetch optimization. inode might have many names due to hard links, or no name at all if all of them were deleted while it was open. It was discusses a lot on recent OpenZFS developer summit and even in last few weeks somewhere here on the GitHub. |
The problem is not, as far as I know, triggered by people doing "ls". You wouldn't expect someone to do "ls" for the same directory enough to cause this performance disaster.. Rather, it's triggered by NFS. If an existing file is being access, NFS sends a file handle from client to server. For Linux, the file handle is file system and inode number, effectively. The server wants to validate the file handle. As part of doing this, the NFS server calls a function gettname, to go from inode number to file name. It is able to find the files's conntaining directory fairly easily, but to find the filename it calls readdir to read the directory until if finds the file with that inode number. You may not think this is a sensible approach, but it's what the LInux NFS server does. Our clients have a large amount of memory, enough to keep the working set cached in memory, sometimes for days. It is common for them to access a file where the client has it cached but it's gone out of the server.s cache. that causes the client to send an inocde number for a file that is not longer in the servers's dnode cache. In that case, if the inode is a directory, the code is triggered that reads the containing directory to find the inode number. Some of our AI training jobs have directories with 200,000 - 500,000 subdirectories. If the client tries to access all of them, each access of a subdirectory will cause a read of on average half the containing directory to map it's inode number to its name. Hence the total number of prefetches will be something like 200,000 x 200000 / 2. This isn't a performance optimization. It's savign us from a disaster. It is possible to supply a zfs-specific implementation of getname. If there's a place to cache the name, then the first time you read the containing directory, the inode to name mapping could be cached. That could turn the problem from 200,000 squared to 200,000. I realize that the same inode number could appear more than once in a directory, but the problem occurs only for subdirectories, and I don't believe that's possible for a directory. Or if it is, the NFS code doesn't care. However youzhongyang may well have a different situation where that approach wouldn't help. He may indeed need to disable the prefetch. Before doing more coding I'm going to do some experiments to see if the prefetch is useful at all. I conjecture that it is not. In which case doing more coding to allow it in some cases would be unnecessary. I propose to try "ls -lU" in large directories with the cache cleared. That seems like the best case for prefetch to have an effect. |
@clhedrick A comment from one good man: "I do wonder if it's something as simple as NFS server misconfiguration where they didn't disable subtree checking." |
sorry, not a misconfiguration. I've looked at the code and used ebpf to watch the behavior. It's really doing what I said. |
Would you explain why then? Because handle to inode translations should not require name. |
It doesn't. The problem is in code to verify that the file handle refers to a file that is still valid. You can see a backtrace in #16254. If this is done with a file that has just been looked up, the necessary information is cached. The problem happens when the client has the data cached, but the dnodes are no longer cached on the server. The code looks weird to me. There ought bo be a better way. But that's how it currently works. |
In expfs, exportfs_decode_fh_raw, for the case of a directory, the relevant code is
acceptable is a test, which in this case is nfsd_acceptable. It does in fact return true immediately if subtree checking is off. The problem is that before it is called, if the dcache entry is disconnected (which it would be if it's new), reconnect_path is called. It ends up calling reconnect_one, which calls get_name. The default implementation of that enumerates the containing directory. I don't think ZFS has enough data to provided an alternative. |
I don't know why you're going down the non-optimized generic |
zpl_get_name is not in 2.2.6. The patch is 20232ec, from 2021. You'd think a 2021 patch would be in 2.2.6, but it isn't. The purpose is to support long names, but I believe it would fix this problem as well. People who are seeing this issue in 2.2.6 will still want to either apply the change in this PR or simply comment out the call to prefetch in zfs_readdir. 20232ec seems a bit much for us to retrofit into 2.2.6. zpl_getname still iterates over the directory. But it doesn't prefetch, and the iteration should be slightly more efficient than the one in zfs_readdir (but more efficient by a constant factor -- so reconnect_one is still going to be more CPU-intensive than it ought to be). |
Since a diff is more likely to survive version changes, here's a diff to add get_name to 2.2.6: https://services.cs.rutgers.edu/zfs.diff. It's the implementation from 2.3, but without a bunch of extra stuff to implement long file names. I've tested it, but not in production. But it might be useful for people having this issue in 2.2. We're going to stick with 2.2.x for Sping. I'm debating whether to use this, since the patch in this PR has so far kept us out of trouble. get_name is certainly a better approach. If the maintainers want to put it in 2.2.x, I'll change this PR to be relative to 2.2.7 when it is released (or the current draft if you want it for 2.2.7.). |
Could you open a PR with the proposed fix against the zfs-2.2.7-staging branch. This way we can run it through the CI before pulling it in for 2.2.7 (or 2.2.8). I agree, it sure does look like it will address the issue and is similar to what I was attempting to suggest in an early comment. I'd forgotten this hook got implemented already with the long name changes. As for this being authored in 2021 and not in 2.2.x it's not quite as bad as it seems. Long name support was started back in June 2021 but was only finalized and committed to the upstream on October 1st 2024, 2 months ago.
|
OK, I've done a PR against zfs-2.2.7-staging. Note that CI is unlikely to test this code, as noted in the explanation in the PR. It would be very hard to test, as it requires cooperation between the system and an NFS client. |
I wanted to make sure the code was doing the right thinig. I have little idea how I'd tell whether directories are getting reconnected, but I used bpftrace to check expfs.c: reconnect_one. exportfs_get_name can't be traced. But it should call zpl_get_name, so I traced it, including a kernel backtrace. it gets called from reconnect_one, and returns 0, which I assume is no error. It then calls lookup_one_len_unlocked. I traced it, and it does get called from the right place with valid directory names, which would have to have come from the zpl_get_name. When I look at the output of reconnect_one it is properly returning the parent directories. So as far as I can tell things are working. |
@clhedrick Shall we close this in favor of the backport, or you plan to continue working on improving prefetch aspect? |
Add paramter zfs_readdir_dnode_prefetch_enabled, defaulting to 1, to control whether zfs_readdir prefetched metadata for all objects it look at when reading a directory.
Setting it to 0 can be important for NFS servers with directories containing many subdirectories.
Motivation and Context
This improves #16254, although a more radical redesign is possible. We have an NFS servers with directories containing 200,000 to 500,000 subdirectories. These arise in AI training scenarios. Our NFS servers can have 20 or more threads running at 100% for hours due to this.
The` problem has been found to be an interaction between the NFS code and zfs_readdir. If information about a directory is cached on the client but not the server, when the server tries to valid the NFS file id, it will add an entry in the dnode cache for the directory. Because he new dnode is not connected to the rest of the directory hierarchy, the NFS (actually exportfs) code goes up the tree trying to connect it. The ends up calling zfs_readdir to find the directory in its parent Because zfs_readdir reads the parent looking for the directory, the amount of time it takes is proportional to the size of the parent. If every directory is processed, the amount of time will be N**2 in the size of the parent. Reading a directory would have only moderate overhead except that zfs_readdir does a prefetch on every item as it looks at it, even if the item is already in the ARC. Of course if it's in the ARC, no disk I/O will be done. But there's still a substantial amount of code.
We've found a factor of 10 to 20 speedup by skipping the prefetches. This is enough to move us from a painful situation to one we can live with.
Of course a linear speedup in an N**2 problem isn't a full solution, but it's enough to survive the size directories we see in practice. A full solution would almost certainly require help from the kernel developers, which is going to be hard to get, since directory searching is fast enough for other Linux file systems that the problem only comes up with ZFS.
Description
This change add a module parameter, zfs_readdir_dnode_prefetch_enabled, normally 1, which can be changed to 0 to disable the prefetch that happens in zfs_readdir.
How Has This Been Tested?
We have done testing with a directory containing 200,000 subdirectories, with the results discussed above. It has been running in production for several weeks.
Types of changes
Checklist:
Signed-off-by
.