Why bother now that there are newer package managers such as uv, which still have a strong lead in performance?
(For fellow JavaScript haters: https://archive.is/Hl4yJ; but that archive shows collapsed accordions with important content that, of course, won't expand. I caved and visited the original page. Seriously, people, the <details> tag is not deep magic.)
TFA documents work done for and incorporated into Pip about a year ago.
Improvements like this are still worth making because, among other things, tons of people still use Pip and are not even going to look at changing. They are, I can only assume, already running massive CI jobs that dynamically grab the latest version of Pip repeatedly and stuff it into containers, in ways that defeat Pip's own caching and forcibly check the Internet every time for new versions, because that's the easiest, laziest thing to write in many cases. This is the only plausible explanation I have for Pip being downloaded an average of 12 million times per day (https://pypistats.org/packages/pip).
They're also worth making exactly because Pip still has a very long way to go in terms of performance improvement, and because experiments like this show that the problem is very much with Pip rather than with Python. Tons of people hyping Uv assume that it must be "rocket emoji, blazing fast, sparkle emoji" because it's written in Rust. Its performance is not in question; but the lion's share of the improvement, in my analysis, is due to other factors.
Documenting past performance gains helps inform the search for future improvements. They aren't going to start over (although I am: https://github.com/zahlman/paper), so changes need to be incremental and constantly incorporated into the existing terrible design.
Showing off unexpected big-O issues is also enlightening. FTA:
> This was the code to sort installed packages just before the final print.
> There was a quadratic performance bug lurking in that code. The function `env.get_distribution(item)` to fetch the package version that was just installed was not constant time, it looped over all installed packages to find the requested package.
The user would not expect an installation of hundreds of packages to spend significant time just preparing to report which packages were installed. But Pip has been around since 2008 (https://pypi.org/project/pip/#history), and Ian Bicking may never have imagined environments with hundreds of installed packages, never mind installing hundreds at a time.
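To make the shape of that bug concrete, here is a minimal sketch (not Pip's actual code; the class and function names are invented for illustration) of how a per-item linear lookup turns the final "report what was installed" step quadratic, and how building a lookup table once avoids it:

    # Minimal sketch, not Pip's actual code; names invented for illustration.
    class Environment:
        def __init__(self, distributions):
            # distributions: list of (name, version) pairs for everything installed
            self._distributions = distributions

        def get_distribution(self, name):
            # O(n): scans every installed distribution to find one package
            for dist_name, version in self._distributions:
                if dist_name == name:
                    return version
            return None

    def report_installed_slow(env, just_installed):
        # One O(n) scan per installed item, so O(n^2) overall
        return sorted(f"{name}-{env.get_distribution(name)}" for name in just_installed)

    def report_installed_fast(env, just_installed):
        # Build the name -> version mapping once; each lookup is then O(1)
        versions = dict(env._distributions)
        return sorted(f"{name}-{versions[name]}" for name in just_installed)

With 500 packages the slow version does on the order of 250,000 comparisons just to print a summary; the fast version does 500 dictionary lookups.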
Finally, documentation like this helps highlight things that have improved in the Python packaging ecosystem, even outside of Pip. In particular:
> Investigation revealed the download is done during the dependency resolution. pip can only discover dependencies after it has downloaded a package, then it can download more packages and discover more dependencies, and repeat. The download and the dependency resolution are fully intertwined.
This is mostly no longer true. While of course the dependency metadata must be downloaded and cannot appear by magic, it is now available separately from the package artifact in a large fraction of cases. Specifically, there is a standard for package indices to provide that information separately (https://peps.python.org/pep-0658/), and per my discussion with Pip maintainers, PyPI does so for wheels. (Source distributions — called sdists — are still permitted to omit PKG-INFO, and dependency specifications in an sdist can still be dynamic since the system for conditional platform-dependent dependencies is apparently not adequate for everyone. But in principle, some projects could have that metadata supplied for their sdists, and nowadays it's relatively uncommon to be forced to install from source anyway.)
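As a rough illustration of what that standard buys a resolver, here is a hedged sketch (standard library only, no error handling, and emphatically not how Pip implements it) that asks PyPI's JSON Simple API (PEP 691) for a project's files and then fetches only the PEP 658 metadata file for a wheel, rather than the wheel itself:

    import json
    import urllib.request

    def wheel_requirements(project):
        # Hypothetical helper, not part of any real tool.
        request = urllib.request.Request(
            f"https://pypi.org/simple/{project}/",
            headers={"Accept": "application/vnd.pypi.simple.v1+json"},  # PEP 691 JSON index
        )
        with urllib.request.urlopen(request) as response:
            index = json.load(response)
        for file_info in index["files"]:
            # PEP 714 names the key "core-metadata"; older responses used "dist-info-metadata".
            has_metadata = file_info.get("core-metadata") or file_info.get("dist-info-metadata")
            # Picks the first wheel advertising separate metadata; a real resolver
            # would of course select a specific version and platform.
            if file_info["filename"].endswith(".whl") and has_metadata:
                # PEP 658: the core metadata lives at the file's URL with ".metadata" appended.
                with urllib.request.urlopen(file_info["url"] + ".metadata") as meta:
                    metadata = meta.read().decode("utf-8")
                return [line.split(":", 1)[1].strip()
                        for line in metadata.splitlines()
                        if line.startswith("Requires-Dist:")]
        return None

The point is that a resolver can learn a candidate's Requires-Dist entries from a few kilobytes of metadata instead of a potentially multi-megabyte artifact, which is a large part of why resolution no longer has to download every package it merely considers.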