Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) #102613

Closed
ionite34 opened this issue Mar 11, 2023 · 7 comments
Labels
performance Performance or resource usage topic-pathlib type-bug An unexpected behavior, bug, or error

Comments

@ionite34
Copy link
Contributor

ionite34 commented Mar 11, 2023

Bug report

Pathlib.rglob can be orders of magnitudes slower than glob.glob(recursive=True)

With a 1000-deep nested directory, glob.glob and Path.glob both took under 1 second. Path.rglob took close to 1.5 minutes.

import glob
import os
from pathlib import Path

x = ""
for _ in range(1000):
    x += "a/"
    os.mkdir(x)
    
# ~ 0.5s
print(glob.glob("**/*", recursive=True))

# ~ 87s
print(list(Path(".").rglob("**/*")))

Linked PRs

@ionite34 ionite34 added the type-bug An unexpected behavior, bug, or error label Mar 11, 2023
@ionite34 ionite34 changed the title Pathlib.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) Mar 11, 2023
@AlexWaygood AlexWaygood added performance Performance or resource usage topic-pathlib labels Mar 12, 2023
@barneygale
Copy link
Contributor

barneygale commented Mar 12, 2023

At a depth of 400, on my system, 50% of the time is spent in PurePath.__hash__(), which is called because _RecursiveWildcardSelector uses a set to de-duplicate. We could drop the set if we handle symlinks uniformly - see #77609.

barneygale added a commit to barneygale/cpython that referenced this issue Mar 12, 2023
Add a keyword-only *follow_symlinks* parameter to `pathlib.Path.glob()` and
`rglob()`, defaulting to false. When set to true, symlinks to directories
are followed as if they were directories.

Previously these methods followed symlinks except when evaluating "`**`"
wildcards; on Windows they returned paths in filesystem casing except when
evaluating non-wildcard tokens. Both these problems are solved here. This
will allow us to address pythonGH-102613 and pythonGH-81079 in future commits.
@barneygale
Copy link
Contributor

barneygale commented Mar 12, 2023

@barneygale
Copy link
Contributor

barneygale commented Mar 21, 2023

It occurs to me that we only need to de-duplicate results if two or more ** tokens appear in the pattern. So we can improve the situation even without making symlink handling more consistent. I'll implement this soon, just need #102710 to land first I think.

Nevermind, I'm wrong :]

Nevernevermind, I was right!

@barneygale
Copy link
Contributor

My plan here is:

  1. Address Support recursive wildcards in pathlib.PurePath.match() #73435
  2. Address Add parameter @case_sensitive to glob and rglob in pathlib #81079
  3. Implement glob() using walk() to avoid hitting recursion errors (this issue.)
  4. Address pathlib.Path.glob does not follow symlinks #77609
  5. Address pathlib strips trailing slash #65238
  6. Implement **/ tokens using a "walk-and-filter" strategy (this issue.)

barneygale added a commit to barneygale/cpython that referenced this issue May 6, 2023
Stop de-duplicating results in `_RecursiveWildcardSelector`. A new
`_DoubleRecursiveWildcardSelector` class is introduced which performs
de-duplication, but this is used _only_ for patterns with multiple
non-adjacent `**` segments, such as `path.glob('**/foo/**')`. By avoiding
the use of a set, `PurePath.__hash__()` is not called, and so paths do not
need to be parsed and (case-) normalised.

Also merge adjacent '**' segments in patterns.
@barneygale
Copy link
Contributor

barneygale commented May 6, 2023

barneygale added a commit that referenced this issue May 7, 2023
Stop de-duplicating results in `_RecursiveWildcardSelector`. A new
`_DoubleRecursiveWildcardSelector` class is introduced which performs
de-duplication, but this is used _only_ for patterns with multiple
non-adjacent `**` segments, such as `path.glob('**/foo/**')`. By avoiding
the use of a set, `PurePath.__hash__()` is not called, and so paths do not
need to be stringified and case-normalised.

Also merge adjacent '**' segments in patterns.
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this issue May 8, 2023
…nGH-104244)

Stop de-duplicating results in `_RecursiveWildcardSelector`. A new
`_DoubleRecursiveWildcardSelector` class is introduced which performs
de-duplication, but this is used _only_ for patterns with multiple
non-adjacent `**` segments, such as `path.glob('**/foo/**')`. By avoiding
the use of a set, `PurePath.__hash__()` is not called, and so paths do not
need to be stringified and case-normalised.

Also merge adjacent '**' segments in patterns.
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
* main: (47 commits)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172)
  pythongh-103193: Improve `getattr_static` test coverage (python#104286)
  Trim trailing whitespace and test on CI (python#104275)
  pythongh-102500: Remove mention of bytes shorthand (python#104281)
  pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256)
  pythongh-99108: Replace SHA3 implementation HACL* version (python#103597)
  pythongh-104273: Remove redundant len() calls in argparse function (python#104274)
  pythongh-64660: Don't hardcode Argument Clinic return converter result variable name (python#104200)
  pythongh-104265 Disallow instantiation of `_csv.Reader` and `_csv.Writer` (python#104266)
  pythonGH-102613: Improve performance of `pathlib.Path.rglob()` (pythonGH-104244)
  pythongh-103650: Fix perf maps address format (python#103651)
  pythonGH-89812: Churn `pathlib.Path` methods (pythonGH-104243)
  ...
barneygale added a commit to barneygale/cpython that referenced this issue May 11, 2023
Use `Path.walk()` to implement the recursive wildcard `**`. This method
uses an iterative (rather than recursive) walk - see pythonGH-100282.
barneygale added a commit that referenced this issue May 15, 2023
Use `Path.walk()` to implement the recursive wildcard `**`. This method
uses an iterative (rather than recursive) walk - see GH-100282.
barneygale added a commit to barneygale/cpython that referenced this issue May 15, 2023
This commit replaces selector classes with selector functions. These
generators directly yield results rather calling through to their
successor. A new internal `Path._glob()` takes care to chain these
generators together, which simplifies the lazy algorithm and slightly
improves performance.
carljm added a commit to carljm/cpython that referenced this issue May 16, 2023
* main:
  pythonGH-104510: Fix refleaks in `_io` base types (python#104516)
  pythongh-104539: Fix indentation error in logging.config.rst (python#104545)
  pythongh-104050: Don't star-import 'types' in Argument Clinic (python#104543)
  pythongh-104050: Add basic typing to CConverter in clinic.py (python#104538)
  pythongh-64595: Fix write file logic in Argument Clinic (python#104507)
  pythongh-104523: Inline minimal PGO rules (python#104524)
  pythongh-103861: Fix Zip64 extensions not being properly applied in some cases (python#103863)
  pythongh-69152: add method get_proxy_response_headers to HTTPConnection class (python#104248)
  pythongh-103763: Implement PEP 695 (python#103764)
  pythongh-104461: Run tkinter test_configure_screen on X11 only (pythonGH-104462)
  pythongh-104469: Convert _testcapi/watchers.c to use Argument Clinic (python#104503)
  pythongh-104482: Fix error handling bugs in ast.c (python#104483)
  pythongh-104341: Adjust tstate_must_exit() to Respect Interpreter Finalization (pythongh-104437)
  pythonGH-102613: Fix recursion error from `pathlib.Path.glob()` (pythonGH-104373)
barneygale added a commit to barneygale/cpython that referenced this issue May 18, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 29, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 31, 2023
@barneygale
Copy link
Contributor

barneygale commented May 31, 2023

Final PR available:

As a result of these changes, for the particular repro case in the original post with 1000 nested a/ directories, pathlib is now about 25x faster than glob.glob(), rather than 150x slower! 🎉

barneygale added a commit that referenced this issue Jun 6, 2023
This commit introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal `**` wildcard, such as `**/*.py`. For this example, the previous implementation recursively walked directories using `os.scandir()` when it expanded the `**` component, and then **scanned those same directories again** when expanded the `*.py` component. This is wasteful.

In the new implementation, any components following a `**` wildcard are used to build a `re.Pattern` object, which is used to filter the results of the recursive walk. A pattern like `**/*.py` uses half the number of `os.scandir()` calls; a pattern like `**/*/*.py` a third, etc.

This new algorithm does not apply if either:

1. The *follow_symlinks* argument is set to `None` (its default), or
2. The pattern contains `..` components.

In these cases we fall back to the old implementation.

This commit also replaces selector classes with selector functions. These generators directly yield results rather calling through to their successors. A new internal `Path._glob()` method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.
@barneygale
Copy link
Contributor

Tis fixed! Most changes made it into 3.12, but a couple are 3.13-only.

ambv added a commit to ambv/cpython that referenced this issue Jun 13, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 13, 2023
…er Coverage (pythonGH-105744)

(cherry picked from commit 4e80082)

Co-authored-by: Łukasz Langa <[email protected]>
barneygale pushed a commit that referenced this issue Jun 13, 2023
…der Coverage (GH-105744) (#105749)

gh-102613: Bump recursion limit to fix running test_pathlib under Coverage (GH-105744)
(cherry picked from commit 4e80082)

Co-authored-by: Łukasz Langa <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

3 participants