Globbing

Here, globbing refers to using filesystem information (such as which files have a .cxx extension) to configure targets and other project properties, as opposed to explicitly listing each target's sources in CMakeLists.txt.
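To make the distinction concrete, here is a minimal sketch in stock CMake (it does not use Maud's glob() function). The target names, variable name, and file names are hypothetical and chosen only for illustration.

```cmake
# Explicit listing: every source file is named in CMakeLists.txt.
add_executable(hello_explicit main.cxx greeting.cxx)

# Globbing: the source list is derived from the filesystem.
# CONFIGURE_DEPENDS asks the generated build system to re-run the glob
# on each build and to reconfigure if the result has changed.
file(GLOB_RECURSE hello_sources CONFIGURE_DEPENDS
     "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cxx")
add_executable(hello_globbed ${hello_sources})
```

With the explicit form, adding a source file means editing CMakeLists.txt; with the globbed form, creating a matching file under src/ is enough.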

The case for globbing

Discussion of globbing in a CMake project must begin with discussion of the :cmake:`admonition against doing so <command/file.html#glob>` in CMake's own documentation (:ref:`skip to usage <glob-function>`). The main reasons cited for avoiding globs are:

  • Not all generators support glob-dependent reconfiguration.
  • There may be files which match a glob unintentionally (for example temporary files generated by a tool) which pessimize or invalidate build configuration.
  • If configuration depends on globs, then each build must check that their results have not changed, which introduces overhead.

If a project must support all generators, or must accommodate tools which introduce spurious glob matches, then globbing is not an option. No single choice of supported workflows is correct for all projects, so I think a blanket prohibition against a technique is less useful than a description of its relative merits.

In Maud's case the C++20 modular structure is central and every generator which :cmake:`supports C++20 modules <manual/cmake-cxxmodules.7.html#generator-support>` also supports glob-dependent reconfiguration, so avoiding globs would not expand Maud's generator support.

As for tools which touch the source tree: even in projects where globbing is not used I frequently have multiple worktrees associated with the repository to isolate those tools from (for example) a build which I don't want to invalidate. Perhaps some would find this unacceptably inelegant.

Globbing performance

One of the project tests is a benchmark of globbing overhead. On my machine, the output looks like:

$ ./test_.project --gtest_filter=*bench* | grep -E "^BENCHMARK" -A 10 -B 0
BENCHMARK
--      Writing:            ( mean=3602.278     min=3381.849    ) ms
--      New checking:       ( mean=848.554      min=826.814     ) ms
--      Globbing:           ( mean=929.903      min=909.522     ) ms
--      Globbing(fd):       ( mean=299.474      min=292.794     ) ms
--      Globbing(git):      ( mean=311.838      min=305.374     ) ms
--      Filtering:          ( mean=87.323       min=85.166      ) ms
--      Loading the cache:  ( mean=24.618       min=22.662      ) ms
--
    8 iterations with 160000 files

The parameters were chosen to approximate the llvm-project repository at the time of writing in number of files and directory depth (median=4). Writing serves as a baseline of the filesystem's speed: a simulated project with 160,000 empty files is generated, which takes a few seconds. New checking is another useful baseline: accessing the mtime of every file takes a little less than a second.

The benchmark's Globbing result shows that using :cmake:`file(GLOB_RECURSE) <command/file.html#glob-recurse>` to list all files and directories in the simulated project also takes a little less than a second. Delegating to a dedicated globbing utility, as in the Globbing(*) results, can reduce that time significantly for large projects. Maud's globbing aggressively caches results, and each new glob filters from those cached results. The overhead of actual filesystem access is therefore paid only once per rebuild; each new glob incurs less than a tenth of that overhead (compare Filtering with Globbing).
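The walk-once-then-filter idea can be sketched in stock CMake. This illustrates the technique only and is not Maud's implementation; the variable names all_files, cxx_sources, and headers are hypothetical.

```cmake
# Walk the filesystem once and keep the full listing in memory.
file(GLOB_RECURSE all_files
     LIST_DIRECTORIES true
     RELATIVE "${CMAKE_SOURCE_DIR}"
     "${CMAKE_SOURCE_DIR}/*")

# Each additional "glob" is then an in-memory filter of that listing,
# with no further filesystem access.
set(cxx_sources ${all_files})
list(FILTER cxx_sources INCLUDE REGEX "[.]cxx$")

set(headers ${all_files})
list(FILTER headers INCLUDE REGEX "[.]hxx$")
```

In the benchmark above, the single walk corresponds to the Globbing row and each subsequent filter to the Filtering row.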

Loading the cache is also once-per-build overhead. Maud stores glob results in ${CMAKE_BINARY_DIR}/CMakeCache.txt, which must be loaded by the CMake scripts that verify globs have not changed.
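For readers unfamiliar with reading a cache from a script, one stock CMake way to do so is load_cache() with READ_WITH_PREFIX, which is usable in script mode. The sketch below assumes a hypothetical script name, a hypothetical build_dir variable passed on the command line, and a hypothetical glob named SHADER_SOURCES; whether Maud's own scripts load the cache this way is not stated here.

```cmake
# read_glob.cmake (hypothetical), invoked as:
#   cmake -Dbuild_dir=<path to build directory> -P read_glob.cmake

# Read one cache entry into a prefixed variable without creating
# any cache entries of our own.
load_cache("${build_dir}" READ_WITH_PREFIX cached_ SHADER_SOURCES)

list(LENGTH cached_SHADER_SOURCES count)
message(STATUS "SHADER_SOURCES has ${count} entries")
```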

In testing on multiple machines and simulated project sizes, Globbing overhead remains comparable to New checking. The latter is an unavoidable once-per-build overhead even if globbing is not used, since each source file's mtime must be checked to determine if it must be recompiled. To me, adding this overhead again seems acceptable. There may be projects where that added overhead is unacceptable; in that case, I'm glad this benchmark was useful to decide that quantitatively... but I'd be more glad of a PR to increase Maud's globbing performance.

glob

glob(
  name
  [CONFIGURE_DEPENDS]
  [EXCLUDE_RENDERED]
  < inclusion_regex | ! exclusion_regex >...
)

Declare a glob. A list containing the absolute paths of matching files and directories will be stored in a CACHE variable with the provided name. All files in ${CMAKE_SOURCE_DIR} as well as generated files in ${MAUD_DIR}/rendered are examined for inclusion in the glob. Files and directories whose name begins with . are excluded from all globs.

Glob results are updated as part of the main build system check target, so during reconfiguration calls to glob() are a no-op (because the CACHE variable is already up-to-date). Scripts which load the cache can access the variable normally.
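A minimal usage sketch, assuming a project that wants to collect shader files: the variable name SHADER_SOURCES and the pattern are hypothetical, and only the glob() signature shown above is taken from this document.

```cmake
# Declare a glob and request regeneration when its results change.
glob(SHADER_SOURCES CONFIGURE_DEPENDS "[.]glsl$")

# SHADER_SOURCES is a CACHE variable holding absolute paths,
# so it can be consumed like any other list.
foreach(shader IN LISTS SHADER_SOURCES)
  message(STATUS "shader: ${shader}")
endforeach()
```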

CONFIGURE_DEPENDS
    If this flag is specified then in addition to updating the glob's results the check target will trigger regeneration if the results change.
EXCLUDE_RENDERED
    Generated files will be ignored if this flag is specified.
< inclusion_regex | ! exclusion_regex >...

Each pattern is a :cmake:`REGEX <command/string.html#regex-specification>` which is applied to each candidate file's path. Patterns are applied to relative paths: relative to ${CMAKE_SOURCE_DIR} for source files, or relative to ${MAUD_DIR}/rendered for generated files.

Patterns are evaluated in series, starting with an empty result set. Inclusion patterns are applied to all files and any matches are added to the result set. Exclusion patterns are applied to the result set and any matches are removed. So, for example, the patterns [.](cxx|hxx) !(^|/)_ !thirdparty would include hello.cxx and hello.hxx but would exclude _disabled.cxx and any files in world_thirdparty/.
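The in-series evaluation can be modeled with stock CMake list operations. This is a sketch of the semantics described above, not Maud's implementation; the candidate file names are hypothetical, and the script can be run on its own with cmake -P.

```cmake
# Stand-in for: glob(EXAMPLE "[.](cxx|hxx)" "!(^|/)_" "!thirdparty")
# Candidate paths are relative to the source root.
set(candidates
    hello.cxx
    hello.hxx
    _disabled.cxx
    world_thirdparty/lib.cxx
    README.md)

# Start with an empty result set.
set(result "")

# Inclusion pattern: add every candidate that matches.
set(matches ${candidates})
list(FILTER matches INCLUDE REGEX "[.](cxx|hxx)")
list(APPEND result ${matches})

# Exclusion patterns: remove matches from the result set, in order.
list(FILTER result EXCLUDE REGEX "(^|/)_")
list(FILTER result EXCLUDE REGEX "thirdparty")

# Leaves hello.cxx and hello.hxx in the result set.
message(STATUS "result: ${result}")
```

Because patterns are applied in order, an exclusion only affects files already in the result set; an inclusion pattern that follows it can add files back. The patterns are also unanchored in this example, so [.](cxx|hxx) matches anywhere in the path; appending $ would restrict it to the end of the path.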

Built-in globs

By default the extensions used to identify C++ source files are .cxx .cxxm .ixx .mxx .cpp .cppm .cc .ccm .c++ .c++m. These can be customized by setting the variable MAUD_CXX_SOURCE_EXTENSIONS.

Directories and files whose names start with . are excluded from all globs. Maud names build directories .build/ by default to ensure that they are excluded from globs in the common case where the build directory is nested in the source root. Maud relies on build directory files being excluded from globs of source files, so a build directory nested in the source root whose name does not start with . may be matched by globs and break configuration.