Extract wheels in build action #3091

@aignas

We had a user request about this:

Hey, we have some customers with tens of GBs of pip deps under external, even with BwoB. I'm wondering whether it would be possible to use the approach of rules_js: Download .zips or .whl files in a repo rule but don't extract them, then extract them into a declared directory as part of a build rule.

Users would only need to store the compressed files under external/ and the total number of files would be one per dep.

Is that something you have considered before?

So far there is prior art of people doing this:

  • rules_pycross is doing this.

I see the main motivation being:

  • Fewer files in the action graph.
  • Compression making the external folder smaller.

Pre-requisites for this to be considered:

  • We need the METADATA to be fetched separately. This requires some smart (or not so smart) fetching in the bzlmod phase. Either we download the whole wheel, extract the METADATA file, and throw the downloaded wheel away, or we download the METADATA file with some special tools (maybe uv could help).
  • We need the wiring from "Write pip extension metadata to the MODULE.bazel.lock file" (#2731) in place so that we can store the downloaded METADATA.
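
A minimal sketch of the first option above (download the whole wheel, keep only the METADATA), written in plain Python rather than as a repository rule. The function names and the `demo` package are hypothetical and only illustrate the idea; this is not how rules_python fetches metadata today.

```python
import io
import zipfile


def read_wheel_metadata(wheel_bytes: bytes) -> str:
    """Return the text of the *.dist-info/METADATA file inside a wheel."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as whl:
        for name in whl.namelist():
            parts = name.split("/")
            # METADATA sits directly inside the top-level
            # <name>-<version>.dist-info/ directory of the archive.
            if (
                len(parts) == 2
                and parts[0].endswith(".dist-info")
                and parts[1] == "METADATA"
            ):
                return whl.read(name).decode("utf-8")
    raise ValueError("no .dist-info/METADATA entry found in wheel")


def build_example_wheel() -> bytes:
    """Build a tiny in-memory wheel-like archive (hypothetical 'demo' package)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as whl:
        whl.writestr("demo/__init__.py", "")
        whl.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n",
        )
    return buf.getvalue()


if __name__ == "__main__":
    print(read_wheel_metadata(build_example_wheel()))
```

Only the one archive member is read, so the wheel bytes can be discarded once the METADATA text has been stored.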

Some points worth considering before implementing this:

  • Header and other filegroups may no longer work as easily, because you would need to lug the whole wheel around just to extract the headers in it.
  • The output would have to be directory artifacts, and the current mechanism for implicit namespace packages would have to be changed (moved back to Python). This means that if we want to leverage "Implement venv/site-packages based binaries" (#2156), we would have to extract the wheel a second time because of the changes in the configuration flags.
  • We would have to set tags = ["no-remote"] to avoid uploading the action inputs, which may easily time out. See slack.
  • Right now the venv layout ("Implement venv/site-packages based binaries", #2156) selectively symlinks directories to create an actual virtual environment, and it relies on the depsets not being an opaque structure. I would love to understand how we can get this working with directory artifacts.
  • If we don't want to use directory artifacts, we should instead parse the RECORD files and populate the rule's outputs based on them. This would not have the drawbacks I described for directory artifacts and would still integrate well with the other rulesets; we could probably also easily support accessing header files, etc.
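
To illustrate the RECORD-based alternative in the last point, here is a rough sketch in plain Python of listing a wheel's files from its RECORD (a CSV of path, hash, size) and picking out headers. The helper names, the sample RECORD contents, and the placeholder hashes are assumptions for illustration only; a real implementation would live in Starlark or a repository rule helper.

```python
import csv
import io


def record_paths(record_text: str) -> list[str]:
    """Return the installed file paths listed in a wheel's RECORD file."""
    reader = csv.reader(io.StringIO(record_text))
    # Each row is "path,hash,size"; blank rows are skipped.
    return [row[0] for row in reader if row]


def header_paths(paths: list[str]) -> list[str]:
    """Pick out C/C++ headers, e.g. to back a headers filegroup."""
    return [p for p in paths if p.endswith((".h", ".hpp"))]


# Hypothetical RECORD contents; the hashes are placeholders, not real digests.
EXAMPLE_RECORD = """\
demo/__init__.py,sha256=abc,0
demo/include/demo.h,sha256=def,120
demo-1.0.dist-info/METADATA,sha256=ghi,52
demo-1.0.dist-info/RECORD,,
"""

if __name__ == "__main__":
    paths = record_paths(EXAMPLE_RECORD)
    print(paths)
    print(header_paths(paths))
```

Because the file list is known before extraction, each path could in principle become an individually declared output instead of one opaque directory.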

This is a place to collect thoughts on how to do this or why this should not be done. I personally have no time to work on this, but can consult people when they come up with a design that would not break existing functionality.

Metadata

Assignees

No one assigned

    Labels

    type: pip (pip/pypi integration)
