Description
We had a user request about this:
Hey, we have some customers with tens of GBs of pip deps under external, even with BwoB. I'm wondering whether it would be possible to use the approach of rules_js: Download .zips or .whl files in a repo rule but don't extract them, then extract them into a declared directory as part of a build rule.
Users would only need to store the compressed files under external/ and the total number of files would be one per dep.
Is that something you have considered before?
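For illustration, the build-time extraction step described in the request could be as small as the following Python sketch. It assumes the build rule passes the wheel path and the declared output directory to a helper; `extract_wheel` is a hypothetical name, not an existing rules_python or rules_js API.

```python
import zipfile
from pathlib import Path


def extract_wheel(wheel_path: str, out_dir: str) -> int:
    """Extract every member of a wheel into out_dir and return the file count.

    Hypothetical helper: a build action would call this with the wheel stored
    under external/ and the directory artifact declared by the rule.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(wheel_path) as whl:
        # A wheel is a plain zip archive, so the standard library is enough.
        whl.extractall(out)
        return sum(1 for info in whl.infolist() if not info.is_dir())
```

With this shape, only the compressed wheel lives under `external/`; the extracted files exist only in the action's output directory.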
There is prior art for this: rules_pycross already does it.
I see the main motivation being:
- Fewer files in the action graph.
- Compression making the `external` folder smaller.
Pre-requisites for this to be considered:
- We need the `METADATA` to be fetched separately. This requires some smart (or not so) fetching in the bzlmod phase. Either we download the whole wheel, extract the `METADATA` file and throw the downloaded wheel away, or we download the `METADATA` file with some special tools (maybe uv could help).
- We need the Write pip extension metadata to the MODULE.bazel.lock file #2731 wiring in place so that we can store the downloaded `METADATA`.
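As a rough sketch of the first option (download the whole wheel, keep only `METADATA`), the file can be read straight out of the archive without extracting anything else. The function names below are hypothetical, and this deliberately ignores the smarter option of fetching `METADATA` without downloading the wheel.

```python
import zipfile
from email.parser import HeaderParser


def read_wheel_metadata(wheel_path: str) -> str:
    """Return the text of <name>-<version>.dist-info/METADATA from a wheel."""
    with zipfile.ZipFile(wheel_path) as whl:
        candidates = [
            name
            for name in whl.namelist()
            # Only the top-level dist-info directory, not vendored copies.
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA")
        ]
        if len(candidates) != 1:
            raise ValueError(f"expected one METADATA file, found {candidates}")
        return whl.read(candidates[0]).decode("utf-8")


def requires_dist(metadata_text: str) -> list[str]:
    """List the Requires-Dist entries; METADATA uses email-style headers."""
    headers = HeaderParser().parsestr(metadata_text)
    return headers.get_all("Requires-Dist") or []
```

The returned text is what would need to be stored in the lock file so that the dependency graph can be built without the wheel being extracted.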
Some points worth considering before implementing this:
- Header and other filegroups may no longer work as easily, because you would need to carry the whole wheel around just to extract a few headers from it.
- The output would have to be directory artifacts, and the current mechanism of implicit namespace pkgs would have to be changed (moved back to Python). This means that if we want to leverage Implement venv/site-packages based binaries #2156, we would have to extract the wheel a second time because of the changes in the configuration flags.
- We would have to set `tags = ["no-remote"]` to avoid uploading the inputs to the actions, which may easily time out. See slack.
- Right now the `venv` layout (Implement venv/site-packages based binaries #2156) selectively symlinks directories to create an actual virtual environment, and it relies on the depsets not being an opaque structure. I would love to understand how we can have this working with directory artifacts.
- If we don't want to do directory artifacts, we should also parse `RECORD` files and populate the `outputs` label of the rule based on the `RECORD` file. This would not have the drawbacks I described with directory artifacts and would still integrate well with the other rulesets. We could probably also easily support accessing header files, etc.
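To make the `RECORD` idea concrete, here is a minimal sketch that lists the files a wheel says it will install, which is the information a rule would need to declare individual outputs instead of one directory artifact. `declared_outputs` and `header_files` are hypothetical names; how the list would be fed into the rule is left open.

```python
import csv
import io
import zipfile


def declared_outputs(wheel_path: str) -> list[str]:
    """Return the installed file paths listed in the wheel's RECORD file."""
    with zipfile.ZipFile(wheel_path) as whl:
        records = [
            name
            for name in whl.namelist()
            if name.count("/") == 1 and name.endswith(".dist-info/RECORD")
        ]
        if len(records) != 1:
            raise ValueError(f"expected one RECORD file, found {records}")
        text = whl.read(records[0]).decode("utf-8")
    # RECORD is CSV with rows of: path, hash, size. Only the path is needed.
    rows = csv.reader(io.StringIO(text))
    return sorted(row[0] for row in rows if row)


def header_files(paths: list[str]) -> list[str]:
    """Pick out C/C++ headers, e.g. to build a headers filegroup."""
    return [p for p in paths if p.endswith((".h", ".hpp"))]
```

Because every output is known by name, header files and similar subsets can be selected without treating the extracted wheel as an opaque directory.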
This is a place to collect thoughts on how to do this or why this should not be done. I personally have no time to work on this, but I can advise people who come up with a design that would not break existing functionality.