Add optimized versions of isdir / isfile on Windows #101196
Comments
Yeah, this is very strange. Perhaps @eryksun has a better idea of what's happening here? By my read, But yes, the native implementation is gone, so we should remove the attempt to use it from |
The The reason that In principle, we could use Calling A possible exception is if a redirector for a remote filesystem locally caches file information (e.g. SMB uses caching). Footnotes
|
Actually, according to the documentation on supporting placeholders in a sync engine, most applications default to disguising placeholder reparse points. The documentation of
To confirm the "%SystemRoot%" behavior, I built a program that calls |
The OneDrive placeholders are... special... because I think traditional reparse points were used in an earlier implementation and it was found that they caused the MFT to fill up for some users. I suspect they virtualise all the files in a filter driver, and "generate" reparse points on the fly for anything inside the top-level directory (or possibly all directories?). Though since we install the all-users py.exe launcher into
Okay, good. I thought we should've been, but glanced quickly at the sources and didn't see it. In that case, perhaps backup semantics + relative path + no reparse points work out faster than GetFileAttributes? So this particular benchmark might be the ideal case, and things balance out (hopefully don't get worse) in more complex cases? Hopefully we get |
I get the result that I would expect when I compare versions 3.6 to 3.11. Starting with 3.8 it became significantly slower. In the following table, each row is normalized by the version that ran the test in the shortest time:
I'm not very concerned about this result because normally the work of checking thousands of files and directories would use
Using backup semantics in this case typically means that If SeBackupPrivilege or SeRestorePrivilege is enabled, backup semantics also affects the nature of and outcome of the discretionary access check. If both privileges are enabled, backup semantics allows
Using a relative path avoids most of the path parsing in the object namespace and filesystem(s) because the directory is opened relative to the handle for the current working directory. However, this would be a common factor across all tested versions.
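One rough way to check whether relative versus absolute paths matter on a given machine is a small timing harness like the one below. This is only a sketch and not the benchmark used in this thread: the helper name `time_isdir` and the call count are assumptions made for illustration, and results will vary with the filesystem, caching, and Python version.

```python
import os
import timeit


def time_isdir(path, number=10_000):
    """Return the average seconds per os.path.isdir(path) call.

    `number` is an arbitrary call count chosen for this sketch.
    """
    total = timeit.timeit(lambda: os.path.isdir(path), number=number)
    return total / number


if __name__ == "__main__":
    relative = "."
    absolute = os.path.abspath(relative)
    # Time the same directory through a relative and an absolute path.
    for label, path in (("relative", relative), ("absolute", absolute)):
        per_call = time_isdir(path)
        print(f"{label:>8}: {per_call * 1e6:.2f} us/call")
```

Running the same script under each interpreter version keeps the path form constant, so any difference between versions comes from the implementation rather than from path parsing.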
Yes, if the directories were symlinks or junctions, that would be to the advantage of
Either way, whether they're real or virtual reparse points, placeholders are seen as normal files by default unless either the process or thread is manually configured to expose them, or unless the executable is located in "%SystemRoot%".
The "py.exe" launcher doesn't matter. What matters is the location of the process executable (e.g. "python[w].exe", "pythonservice.exe"), which is unlikely to be installed in "%SystemRoot%". I copied "python.exe" to "%SystemRoot%", with the installation directory set in |
Your result is certainly less surprising than mine -- I'll have to figure out why my own benchmark runs show the opposite, but my best guess is my lack of Windows experience at the moment.
Agreed, but the use case that inspired this investigation was timing |
I wonder if an API like GLib's
def test(path, flags):
    methods = {
        EXISTS | NOT_IS_DIR: _test_exists_is_not_dir,
        EXISTS: _test_exists,
        # (...)
    }
    return methods.get(flags, _test_fallback)(path, flags)
So, for instance, on Unix, |
I performed some more profiling on the While it's true that if/when we get And in this case, we should allow |
That's a reasonable (and new) suggestion. Probably deserves a new issue, as removing the comment referenced earlier is still a legitimate PR someone can create. Eventually a majority of Windows users will have the new API. Our timescales for this stuff are long enough to allow for that. Apps can modify their algorithms for faster speedup, but the runtime doesn't have to be in such a hurry. |
Here's an implementation of
/*[clinic input]
os._isdir

    path: path_t(allow_fd=True)
    /

Return true if the pathname refers to an existing directory.
[clinic start generated code]*/

static PyObject *
os__isdir_impl(PyObject *module, path_t *path)
/*[clinic end generated code: output=75f56f32720836cb input=4aef81031b54999b]*/
{
    HANDLE hfile;
    BOOL close_file = TRUE;
    FILE_BASIC_INFO info = {0};

    Py_BEGIN_ALLOW_THREADS
    if (path->fd != -1) {
        /* Borrow the handle behind the file descriptor; do not close it. */
        hfile = _Py_get_osfhandle_noraise(path->fd);
        close_file = FALSE;
    } else {
        /* Backup semantics are required to open a directory. */
        hfile = CreateFileW(path->wide, FILE_READ_ATTRIBUTES, 0, NULL,
                            OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    }
    if (hfile != INVALID_HANDLE_VALUE) {
        /* On failure, info stays zeroed and the result is False. */
        GetFileInformationByHandleEx(hfile, FileBasicInfo, &info,
                                     sizeof(info));
        if (close_file) {
            CloseHandle(hfile);
        }
    }
    Py_END_ALLOW_THREADS

    if (info.FileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
        Py_RETURN_TRUE;
    } else {
        Py_RETURN_FALSE;
    }
}
Remember to run clinic and add This version supports testing a file descriptor, since we've had that capability on Windows since 3.8. It isn't a drop-in replacement for For example:
>>> import os
>>> os.path.isdir
<built-in function _isdir>
>>> os.path.isdir(0)
False
>>> O_OBTAIN_DIR = 0x2000
>>> fd = os.open('C:\\', O_OBTAIN_DIR)
>>> os.path.isdir(fd)
True
>>> os.close(fd)
>>> os.path.isdir(fd)
False
>>> os.symlink('spamdir', 'spamlink', target_is_directory=True)
>>> os.path.isdir('spamlink')
False
>>> os.mkdir('spamdir')
>>> os.path.isdir('spamlink')
True
>>> os.rmdir('spamdir')
>>> os.path.isdir('spamlink')
False
>>> genericpath.isdir('spam\0')
False
>>> os.path.isdir('spam\0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: _isdir: embedded null character in path
For me, this implementation takes about a third less time than using A similar function could be implemented for A new Somehow I compared the performance of These functions always open a reparse point, except for disguised placeholders. Traversing a reparse point requires a normal open such as |
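For comparison with the session above, here is a minimal sketch of the portable, `os.stat`-based check that the generic `isdir` relies on. It is written from memory of how `genericpath.isdir` behaves rather than copied from the source, and the name `isdir_via_stat` is made up for this example. It follows symlinks (so a link to a missing directory reports `False`) and returns `False` instead of raising on an embedded null, which is the behavioural difference shown in the last two calls of the session.

```python
import os
import stat


def isdir_via_stat(path):
    """Return True if path refers to an existing directory.

    Follows symlinks, and treats any OSError or ValueError (for example
    an embedded null character in the path) as "not a directory".
    """
    try:
        st = os.stat(path)
    except (OSError, ValueError):
        return False
    return stat.S_ISDIR(st.st_mode)


if __name__ == "__main__":
    print(isdir_via_stat("."))
    print(isdir_via_stat("spam\0"))
```

A native replacement would need to decide whether to keep that error-swallowing behaviour or raise as in the traceback shown in the session.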
@eryksun: I agree that some form of this optimized
@DefaultRyan: I agree we should also do this (and maybe there are other instances of this elsewhere). Do you want to file a separate issue for that? |
os.path.isdir
on Windows
In |
@mdboom and @zooba I've created the new issue here: #101358. |
Co-authored-by: Eryk Sun <[email protected]>
* main: (82 commits) pythongh-101670: typo fix in PyImport_ExtendInittab() (python#101723) pythonGH-99293: Document that `Py_TPFLAGS_VALID_VERSION_TAG` shouldn't be used. (#pythonGH-101736) no-issue: Add Dong-hee Na as the cjkcodecs codeowner (pythongh-101731) pythongh-101678: Merge math_1_to_whatever() and math_1() (python#101730) pythongh-101678: refactor the math module to use special functions from c11 (pythonGH-101679) pythongh-85984: Remove legacy Lib/pty.py code. (python#92365) pythongh-98831: Use opcode metadata for stack_effect() (python#101704) pythongh-101283: Version was just released, so should be changed in 3.11.3 (pythonGH-101719) pythongh-101283: Fix use of unbound variable (pythonGH-101712) pythongh-101283: Improved fallback logic for subprocess with shell=True on Windows (pythonGH-101286) pythongh-101277: Port more itertools static types to heap types (python#101304) pythongh-98831: Modernize CALL and family (python#101508) pythonGH-101696: invalidate type version tag in `_PyStaticType_Dealloc` (python#101697) pythongh-100221: Fix creating dirs in `make sharedinstall` (pythonGH-100329) pythongh-101670: typo fix in PyImport_AppendInittab() (pythonGH-101672) pythongh-101196: Make isdir/isfile/exists faster on Windows (pythonGH-101324) pythongh-101614: Don't treat python3_d.dll as a Python DLL when checking extension modules for incompatibility (pythonGH-101615) pythongh-100933: Improve `check_element` helper in `test_xml_etree` (python#100934) pythonGH-101578: Normalize the current exception (pythonGH-101607) pythongh-47937: Note that Popen attributes are read-only (python#93070) ...
It looks like this has been completed? The discussion on makedirs has a dedicated issue now. There was some discussion on the PR about whether to backport, but it's rare for us to backport a performance improvement (particularly so when the merge isn't trivial) |
Yeah, just forgot to hit close. |
…`is_*()` (python#118243) Suppress all `OSError` exceptions from `pathlib.Path.exists()` and `is_*()` rather than a selection of more common errors as we do presently. Also adjust the implementations to call `os.path.exists()` etc, which are much faster on Windows thanks to pythonGH-101196.
I went down this rabbit hole when someone mentioned that `isfile`/`isdir`/`exists` all make a rather expensive `os.stat` call on Windows (which is actually a long wrapper around a number of system calls on Windows), rather than the simpler and more direct call to `GetFileAttributesW`.
I noticed that at one point there was a version of `isdir` that does exactly this. At the time, this claimed a 2x speedup.
However, this C implementation of `isdir` was removed as part of a large set of changes in df2d4a6, and as a result, `isdir` got faster.
With the following benchmark:
isdir benchmark
I get the following with df2d4a6:
and with the prior commit:
So, from this, I'd conclude that the idea of replacing calls to `os.stat` with calls to `GetFileAttributesW` would not bear fruit, but @zooba should probably confirm I'm benchmarking the right thing and making sense.
In any event, we should probably remove the little vestige that imports this fast path that was removed: