Add CLI for converting v2 metadata to v3 #3257

Open
wants to merge 29 commits into
base: main

@K-Meech K-Meech commented Jul 16, 2025

For #1798

Adds a CLI using typer to convert v2 metadata (.zarray / .zattrs...) to v3 metadata zarr.json.

To test, you will need to install the new optional cli dependency e.g.
pip install -e ".[remote,cli]"

This should make the zarr-converter command available e.g. try:

zarr-converter --help
zarr-converter convert --help
zarr-converter clear --help

convert adds a zarr.json file to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened with zarr.open, but will give a UserWarning: Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used. This can be avoided by passing zarr_format=3 to zarr.open, or by using the clear command to remove the v2 metadata.

clear can also remove v3 metadata. This is useful if the conversion fails partway through, e.g. if one of the arrays uses a codec with no v3 equivalent.
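To see which groups / arrays would trigger that warning, something like the following could be used. This is only a minimal sketch for a zarr stored in a local directory, using the standard library rather than zarr-python itself; metadata_formats and nodes_with_both are hypothetical helper names and are not part of this PR.

```python
import os
from pathlib import Path

# File names used for metadata by each Zarr format.
V2_METADATA = {".zarray", ".zgroup", ".zattrs", ".zmetadata"}
V3_METADATA = {"zarr.json"}


def metadata_formats(root):
    """Map each directory under root to the Zarr formats it has metadata for."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        formats = set()
        if names & V2_METADATA:
            formats.add(2)
        if names & V3_METADATA:
            formats.add(3)
        if formats:
            found[Path(dirpath)] = formats
    return found


def nodes_with_both(root):
    """Return the directories holding both v2 and v3 metadata."""
    return sorted(p for p, fmts in metadata_formats(root).items() if fmts == {2, 3})
```

Any directory returned by nodes_with_both is one where zarr.open would find both sets of metadata, so it is a candidate for either zarr_format=3 or the clear command.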

All code for the cli is in src/zarr/core/metadata/converter/cli.py, with the actual conversion functions in src/zarr/core/metadata/converter/converter_v2_v3.py. These functions can be called directly, for those who don't want to use the CLI (although currently they are part of /core which is considered private API, so it may be best to move them elsewhere in the package).

Some points to consider:

  • I had to modify set_path from test_dtype_registry.py and test_codec_entrypoints.py, as they were causing the CLI tests to fail whenever the CLI tests ran after them. This seems to be due to the lazy_load_list of the numcodecs codecs registries being cleared, meaning they were no longer available in my code which finds the numcodecs.zarr3 equivalent of a numcodecs codec.
  • I tested this on local zarr images, so it would be great if someone with access to s3 / google cloud etc., could try it out on some small example images there.
  • I'm happy to add docs about how to use the CLI, but wanted to get feedback on the general structure first

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 16, 2025

codecov bot commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 96.85039% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.69%. Comparing base (18f41d4) to head (3540434).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/zarr/core/metadata/converter/cli.py 90.00% 2 Missing ⚠️
...rc/zarr/core/metadata/converter/converter_v2_v3.py 98.13% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3257      +/-   ##
==========================================
+ Coverage   94.62%   94.69%   +0.06%     
==========================================
  Files          78       80       +2     
  Lines        8690     8817     +127     
==========================================
+ Hits         8223     8349     +126     
- Misses        467      468       +1     
Files with missing lines Coverage Δ
src/zarr/core/metadata/converter/cli.py 90.00% <90.00%> (ø)
...rc/zarr/core/metadata/converter/converter_v2_v3.py 98.13% <98.13%> (ø)

... and 1 file with indirect coverage changes


@d-v-b
Contributor

d-v-b commented Jul 16, 2025

this is awesome! A few high-level suggestions:

  • can we have the name of the cli be just zarr, where convert is a subcommand:

    $ zarr --help
    # prints info including available commands
    $ zarr convert --help
    # prints info about the convert operation
    
  • can the 2->3 migration just be a particular invocation of a more general command, like
    zarr convert source/data.zarr --in-place --zarr-format 3

    I'm not sure how to express the "clean up a broken conversion" process through this API, but it should be doable

  • is "convert" the best name here? "resave" also seems good. I'm not opposed to "convert", just interested in exploring the space of names a bit.
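For illustration, the suggested layout (a single zarr entry point with convert as a subcommand) could look roughly like this. The PR itself uses typer; this sketch uses argparse from the standard library so it runs without extra dependencies, and the flags simply mirror the example invocation above. None of it is the PR's actual implementation.

```python
import argparse


def build_parser():
    """Build a top-level `zarr` parser with a `convert` subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="zarr", description="Zarr command line tools")
    subcommands = parser.add_subparsers(dest="command", required=True)

    convert = subcommands.add_parser("convert", help="Convert metadata between Zarr formats")
    convert.add_argument("source", help="Path to the input zarr")
    convert.add_argument("--in-place", action="store_true", help="Write back to the source location")
    # Only v2 -> v3 is discussed in this thread, so 3 is the only choice here.
    convert.add_argument("--zarr-format", type=int, choices=[3], default=3)
    return parser


# Parse the example invocation from the comment above.
args = build_parser().parse_args(
    ["convert", "source/data.zarr", "--in-place", "--zarr-format", "3"]
)
```

Running the parser with --help or convert --help prints the top-level and per-command usage respectively.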

@K-Meech
Author

K-Meech commented Jul 18, 2025

Thanks for the comments @d-v-b !

  • Happy to change the top-level name to zarr

  • Just to check - are you suggesting it would be best to combine the convert and clear commands into one command? If so, then it should be possible with something like zarr convert source/data.zarr --in-place --allow-overwrite where:

    • --in-place would create v3 metadata + remove any v2 metadata (without this specified, both v2/v3 metadata would remain)
    • --allow-overwrite would remove any existing v3 metadata, then create new (otherwise it errors if an existing zarr.json is encountered)
    • I think the only situation this doesn't cover is if the conversion fails part way through + you just want to remove any zarr.json that were written. Perhaps it would be best to prevent this scenario from occurring in the first place? I could always run convert_v2_to_v3 inside a try / except, then run remove_metadata if any exceptions are encountered. This could be optional behaviour with something like --clear
  • As for the naming - I don't have a strong preference here. I'd probably prefer convert as I think it's a simpler / easier to understand term, but happy to change this.
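The try / except idea could be sketched as below. This is not the PR's code: write_v3_metadata_with_cleanup and build_metadata are hypothetical stand-ins for convert_v2_to_v3 and remove_metadata (whose real signatures are not shown in this thread), and it assumes a zarr stored in a local directory.

```python
import json
from pathlib import Path


def write_v3_metadata_with_cleanup(node_dirs, build_metadata):
    """Write a zarr.json into each directory; undo all writes if any node fails.

    build_metadata is a hypothetical callable that returns the v3 metadata
    document for a node, and may raise (e.g. for a codec with no v3 equivalent).
    """
    written = []
    try:
        for node in node_dirs:
            document = build_metadata(node)
            target = Path(node) / "zarr.json"
            target.write_text(json.dumps(document))
            written.append(target)
    except Exception:
        # Best-effort cleanup: remove only the files this call created,
        # leaving the v2 metadata untouched, then re-raise the original error.
        for target in written:
            target.unlink(missing_ok=True)
        raise
    return written
```

The cleanup here is best-effort only: it can itself be interrupted, so a separate way of removing leftover zarr.json files would still be needed.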

@d-v-b
Contributor

d-v-b commented Jul 18, 2025

I think the only situation this doesn't cover is if the conversion fails part way through + you just want to remove any zarr.json that were written. Perhaps it would be best to prevent this scenario from occurring in the first place? I could always run convert_v2_to_v3 inside a try / except, then run remove_metadata if any exceptions are encountered. This could be optional behaviour with something like --clear

Attempting to clean up on failure is a good idea, but that can also fail. It would still be possible for the program to be interrupted while remove_metadata is being run, so I don't think there's anything we can do to prevent conversion from failing. But as long as we are only creating json documents, and we aren't deleting the originals (the v2 metadata), then re-running should be cheap. This seems like a detail that we can iron out later on.

--in-place would create v3 metadata + remove any v2 metadata (without this specified, both v2/v3 metadata would remain)

There might actually be situations where people want the v2 and v3 metadata in the same hierarchy, and this is not against the rules of the spec. So I'd say that creating v3 metadata should be a separate (and precedent) operation from deleting the v2 metadata. --in-place should be considered shorthand for this:

$ zarr convert foo.zarr/bar foo.zarr/bar

i.e., taking a v2 hierarchy from foo.zarr/bar, and creating a v3 hierarchy at foo.zarr/bar. Thus I think

$ zarr convert foo.zarr baz.zarr

i.e., resaving the hierarchy somewhere else, should also be supported. For now users would be responsible for copying chunks in this case.

--allow-overwrite would remove any existing v3 metadata, then create new (otherwise it errors if an existing zarr.json is encountered)

that seems right, but instead of --allow-overwrite I would just do --overwrite.

And a broader request: I am sure that while working on this kind of low-level tool, you find tasks that should be direct, but are made difficult by design choices in zarr-python, or the lack of low-level features. If you have the patience for it, please keep track of the stuff we could change in the core library to make this kind of tool easier to write, and feel free to open issues to track these features.

@K-Meech
Author

K-Meech commented Jul 18, 2025

Ok - that all sounds good. How about the following for an updated CLI:

The base command is zarr convert path/to/input.zarr - this creates v3 metadata, writing it in-place back to input.zarr, leaving the v2 metadata as-is. There are then the following optional flags on top:

  • --output path/to/output.zarr - if provided, v3 metadata will be written to this separate output location (leaving input.zarr completely unchanged). Only the metadata will be created (no copying of chunks).
  • --overwrite: allows overwrite of v3 metadata at the output location (either path/to/input.zarr or path/to/output.zarr if provided)
  • --remove-v2: removes any v2 metadata at the output location
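To make the intended semantics of these three options concrete, here is a dry-run sketch that only works out which files would be written or removed. It assumes a zarr stored in a local directory; plan_conversion is a hypothetical helper, not a function from this PR, and it does not touch any files.

```python
import os
from pathlib import Path

V2_NODE_FILES = (".zarray", ".zgroup")           # mark a v2 array or group
V2_ALL_FILES = (".zarray", ".zgroup", ".zattrs")  # everything --remove-v2 would delete


def plan_conversion(input_path, output=None, overwrite=False, remove_v2=False):
    """Return (files_to_write, files_to_remove) without changing anything on disk."""
    source = Path(input_path)
    # Without --output, v3 metadata is written in-place next to the v2 metadata.
    target_root = Path(output) if output is not None else source
    to_write, to_remove = [], []
    for dirpath, _dirnames, filenames in os.walk(source):
        if not any(name in filenames for name in V2_NODE_FILES):
            continue  # not a v2 array or group
        target_dir = target_root / Path(dirpath).relative_to(source)
        target = target_dir / "zarr.json"
        if target.exists() and not overwrite:
            raise FileExistsError(f"{target} already exists; pass overwrite=True")
        to_write.append(target)
        if remove_v2:
            # v2 metadata is removed at the output location, as described above.
            to_remove.extend(
                target_dir / name for name in V2_ALL_FILES if (target_dir / name).exists()
            )
    return sorted(to_write), sorted(to_remove)
```

With output set, nothing under the input path appears in either list, which matches leaving input.zarr completely unchanged.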

I think it's probably worth also keeping the clear command to give maximum flexibility. I've certainly used it a lot during testing of the CLI e.g. when converting in-place, then doing some testing of the v3 output to check it looks ok, then later removing the v2 metadata when it's no longer needed.

@d-v-b
Contributor

d-v-b commented Jul 18, 2025

  • --output path/to/output.zarr - if provided, v3 metadata will be written to this separate output location (leaving input.zarr completely unchanged). Only the metadata will be created (no copying of chunks).

This design would make the in-place conversion the default, and writing the hierarchy to a new path would be an opt-in kind of thing. Speaking for myself, I think I would expect the convert-to-somewhere-else mode to be the default, and the in-place conversion to be opt-in. It might be helpful to sample opinions from other folks here (cc @zarr-developers/python-core-devs). In any case this is a minor thing.

I think it's probably worth also keeping the clear command to give maximum flexibility. I've certainly used it a lot during testing of the CLI e.g. when converting in-place, then doing some testing of the v3 output to check it looks ok, then later removing the v2 metadata when it's no longer needed.

That makes perfect sense. Since "clear" is kind of ambiguous on its own, and since the convert command is pretty narrowly scoped to converting from v2 to v3 data, I wonder if we should rename this command to something like convert-v3, migrate-v3, etc, something that makes it really obvious that all the options will be scoped to the process of moving from v2 data to v3 data. This would free up the convert command for later use, e.g. things like changing chunking, compression, etc.

@d-v-b
Contributor

d-v-b commented Jul 18, 2025

also, in case anyone wants to try this tool out, here's a 1-liner with uvx:

uvx --with typer --from git+https://github.com/K-Meech/zarr-python/@km/v2-v3-conversion zarr-converter --help
