Add CLI for converting v2 metadata to v3 #3257

K-Meech · 2025-07-16T15:10:16Z

Adds a CLI, built with typer, that converts v2 metadata (.zarray / .zattrs...) to v3 metadata (zarr.json).

To test, install the new optional cli dependency, e.g.
pip install -e ".[remote,cli]"

This should make the zarr-converter command available. For example, try:

zarr-converter --help
zarr-converter convert --help
zarr-converter clear --help

convert adds zarr.json files to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened with zarr.open, but it will emit the warning "UserWarning: Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used." This can be avoided by passing zarr_format=3 to zarr.open, or by using the clear command to remove the v2 metadata.

clear can also remove v3 metadata. This is useful if the conversion fails part way through e.g. if one of the arrays uses a codec with no v3 equivalent.
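To make the "both sets of metadata" state concrete, here is a minimal sketch that reports which metadata formats are present at each node of a local hierarchy. It only looks at file names (.zarray, .zgroup, .zattrs for v2, zarr.json for v3) and does not use zarr-python or the code in this PR; the function names metadata_formats and make_demo_hierarchy are hypothetical and exist only for illustration.

```python
import os
import tempfile

# Metadata file names defined by the Zarr v2 and v3 formats.
V2_FILES = {".zarray", ".zgroup", ".zattrs"}
V3_FILE = "zarr.json"


def metadata_formats(root):
    """Map each node (path relative to root) to the set of Zarr formats found there."""
    found = {}
    # os.walk is used rather than Path.walk, which only exists from Python 3.12.
    for dirpath, _dirnames, filenames in os.walk(root):
        formats = set()
        if V2_FILES & set(filenames):
            formats.add(2)
        if V3_FILE in filenames:
            formats.add(3)
        if formats:
            found[os.path.relpath(dirpath, root)] = formats
    return found


def make_demo_hierarchy(root):
    """Create a toy layout: a v2-only root group and an array holding v2 and v3 metadata."""
    os.makedirs(os.path.join(root, "arr"))
    for rel in (".zgroup", os.path.join("arr", ".zarray"), os.path.join("arr", "zarr.json")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("{}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_hierarchy(tmp)
        for node, formats in sorted(metadata_formats(tmp).items()):
            print(node, sorted(formats))
```

A node reported with both 2 and 3 is one that would trigger the UserWarning above when opened without zarr_format; a node with only 3 has had its v2 metadata cleared.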

All code for the cli is in src/zarr/core/metadata/converter/cli.py, with the actual conversion functions in src/zarr/core/metadata/converter/converter_v2_v3.py. These functions can be called directly, for those who don't want to use the CLI (although currently they are part of /core which is considered private API, so it may be best to move them elsewhere in the package).

Some points to consider:

I had to modify set_path in test_dtype_registry.py and test_codec_entrypoints.py, as they caused the CLI tests to fail whenever the CLI tests ran after them. This seems to be due to the lazy_load_list of the numcodecs codec registries being cleared, so the codecs were no longer available to my code that finds the numcodecs.zarr3 equivalent of a numcodecs codec.
I tested this on local zarr images, so it would be great if someone with access to s3 / google cloud etc. could try it out on some small example images there.
I'm happy to add docs about how to use the CLI, but wanted to get feedback on the general structure first.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

codecov · 2025-07-16T16:23:58Z

Codecov Report

Attention: Patch coverage is 96.85039% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.69%. Comparing base (18f41d4) to head (3540434).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/zarr/core/metadata/converter/cli.py	90.00%	2 Missing ⚠️
...rc/zarr/core/metadata/converter/converter_v2_v3.py	98.13%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3257      +/-   ##
==========================================
+ Coverage   94.62%   94.69%   +0.06%     
==========================================
  Files          78       80       +2     
  Lines        8690     8817     +127     
==========================================
+ Hits         8223     8349     +126     
- Misses        467      468       +1

Files with missing lines	Coverage Δ
src/zarr/core/metadata/converter/cli.py	`90.00% <90.00%> (ø)`
...rc/zarr/core/metadata/converter/converter_v2_v3.py	`98.13% <98.13%> (ø)`

... and 1 file with indirect coverage changes

d-v-b · 2025-07-16T17:04:33Z

this is awesome! A few high-level suggestions:

can we have the name of the cli be just zarr, where convert is a subcommand:

$ zarr --help
 # prints info including available commands
$ zarr convert --help
# info about the convert operation

can the 2->3 migration just be a particular invocation of a more general command, like
zarr convert source/data.zarr --in-place --zarr-format 3

I'm not sure how to express the "clean up a broken conversion" process through this API, but it should be doable
is "convert" the best name here? "resave" also seems good. I'm not opposed to "convert", just interested in exploring the space of names a bit.

K-Meech · 2025-07-18T08:24:20Z

Thanks for the comments @d-v-b !

Happy to change the top-level name to zarr
Just to check - are you suggesting it would be best to combine the convert and clear commands into one command? If so, then it should be possible with something like zarr convert source/data.zarr --in-place --allow-overwrite where:
- --in-place would create v3 metadata + remove any v2 metadata (without this specified, both v2/v3 metadata would remain)
- --allow-overwrite would remove any existing v3 metadata, then create new (otherwise it errors if an existing zarr.json is encountered)
- I think the only situation this doesn't cover is if the conversion fails part way through + you just want to remove any zarr.json that were written. Perhaps it would be best to prevent this scenario from occurring in the first place? I could always run convert_v2_to_v3 inside a try / except, then run remove_metadata if any exceptions are encountered. This could be optional behaviour with something like --clear
As for the naming - I don't have a strong preference here. I'd probably prefer convert as I think it's a simpler / easier to understand term, but happy to change this.

d-v-b · 2025-07-18T08:48:25Z

I think the only situation this doesn't cover is if the conversion fails part way through + you just want to remove any zarr.json that were written. Perhaps it would be best to prevent this scenario from occurring in the first place? I could always run convert_v2_to_v3 inside a try / except, then run remove_metadata if any exceptions are encountered. This could be optional behaviour with something like --clear

Attempting to clean up on failure is a good idea, but that can also fail. It would still be possible for the program to be interrupted while remove_metadata is being run, so I don't think there's anything we can do to prevent conversion from failing. But as long as we are only creating json documents, and we aren't deleting the originals (the v2 metadata), then re-running should be cheap. This seems like a detail that we can iron out later on.
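As a rough illustration of the try / except idea discussed here, the sketch below writes a placeholder zarr.json into each node and, if anything raises, removes only the files it created in that run. It is not the PR's convert_v2_to_v3 or remove_metadata: write_v3_metadata and its fail_on parameter are hypothetical stand-ins, and the placeholder "{}" content is not valid v3 metadata.

```python
import os


def write_v3_metadata(node_dirs, fail_on=None):
    """Write a placeholder zarr.json into each node; roll back this run's files on error.

    fail_on is a hypothetical hook that simulates a node whose codec has no
    v3 equivalent, so the failure path can be exercised.
    """
    written = []
    try:
        for node in node_dirs:
            if node == fail_on:
                raise ValueError(f"no v3 equivalent for a codec in {node}")
            path = os.path.join(node, "zarr.json")
            with open(path, "w") as f:
                f.write("{}")
            written.append(path)
    except Exception:
        # Best-effort rollback: delete only what this run created and never
        # touch the v2 metadata, so the conversion can simply be re-run.
        for path in written:
            try:
                os.remove(path)
            except OSError:
                pass
        raise
    return written
```

The rollback itself can be interrupted, which is the limitation raised above; because the v2 files are never modified, a leftover zarr.json costs no more than an --overwrite re-run or a clear.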

--in-place would create v3 metadata + remove any v2 metadata (without this specified, both v2/v3 metadata would remain)

There might actually be situations where people want the v2 and v3 metadata in the same hierarchy, and this is not against the rules of the spec. So I'd say that creating v3 metadata should be a separate (and preceding) operation from deleting the v2 metadata. --in-place should be considered shorthand for this:

$ zarr convert foo.zarr/bar foo.zarr/bar

i.e., taking a v2 hierarchy from foo.zarr/bar, and creating a v3 hierarchy at foo.zarr/bar. Thus I think

$ zarr convert foo.zarr baz.zarr

i.e., resaving the hierarchy somewhere else, should also be supported. For now users would be responsible for copying chunks in this case.

--allow-overwrite would remove any existing v3 metadata, then create new (otherwise it errors if an existing zarr.json is encountered)

that seems right, but instead of --allow-overwrite I would just do --overwrite.

And a broader request: I am sure that while working on this kind of low-level tool, you find tasks that should be direct, but are made difficult by design choices in zarr-python, or the lack of low-level features. If you have the patience for it, please keep track of the stuff we could change in the core library to make this kind of tool easier to write, and feel free to open issues to track these features.

K-Meech · 2025-07-18T09:58:02Z

Ok - that all sounds good. How about the following for an updated CLI:

The base command is zarr convert path/to/input.zarr - this creates v3 metadata, writing it in-place back to input.zarr, leaving the v2 metadata as-is. There are then the following options on top:

--output path/to/output.zarr - if provided, v3 metadata will be written to this separate output location (leaving input.zarr completely unchanged). Only the metadata will be created (no copying of chunks).
--overwrite: allows overwrite of v3 metadata at the output location (either path/to/input.zarr or path/to/output.zarr if provided)
--remove-v2: removes any v2 metadata at the output location

I think it's probably worth also keeping the clear command to give maximum flexibility. I've certainly used it a lot during testing of the CLI e.g. when converting in-place, then doing some testing of the v3 output to check it looks ok, then later removing the v2 metadata when it's no longer needed.
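The proposed interface can be sketched as an argument parser. This is only an illustration of the option layout described above: it uses the standard library's argparse instead of typer (which the PR actually uses) to stay dependency-free, it performs no conversion, and build_parser and resolve_output are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser mirroring the proposed `zarr convert` options."""
    parser = argparse.ArgumentParser(prog="zarr")
    subcommands = parser.add_subparsers(dest="command", required=True)

    convert = subcommands.add_parser(
        "convert", help="create v3 metadata from existing v2 metadata"
    )
    convert.add_argument("input", help="path to the v2 hierarchy")
    convert.add_argument(
        "--output",
        default=None,
        help="write v3 metadata here instead of in-place (chunks are not copied)",
    )
    convert.add_argument(
        "--overwrite",
        action="store_true",
        help="replace existing v3 metadata at the output location",
    )
    convert.add_argument(
        "--remove-v2",
        action="store_true",
        help="remove v2 metadata at the output location",
    )
    return parser


def resolve_output(args):
    """Return where v3 metadata would be written: --output if given, else the input path."""
    return args.output if args.output is not None else args.input


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(f"would write v3 metadata to {resolve_output(parsed)}")
```

With this layout, in-place conversion is the default and a separate output location is opt-in, matching the proposal as written; reversing that default would mean making the output path a required positional argument instead.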

d-v-b · 2025-07-18T10:09:05Z

--output path/to/output.zarr - if provided, v3 metadata will be written to this separate output location (leaving input.zarr completely unchanged). Only the metadata will be created (no copying of chunks).

This design would make the in-place conversion the default, and writing the hierarchy to a new path would be an opt-in kind of thing. Speaking for myself, I think I would expect the convert-to-somewhere-else mode to be the default, and the in-place conversion to be opt-in. It might be helpful to sample opinions from other folks here (cc @zarr-developers/python-core-devs). In any case this is a minor thing.

I think it's probably worth also keeping the clear command to give maximum flexibility. I've certainly used it a lot during testing of the CLI e.g. when converting in-place, then doing some testing of the v3 output to check it looks ok, then later removing the v2 metadata when it's no longer needed.

That makes perfect sense. Since "clear" is kind of ambiguous on its own, and since the convert command is pretty narrowly scoped to converting from v2 to v3 data, I wonder if we should rename this command to something like convert-v3, migrate-v3, etc, something that makes it really obvious that all the options will be scoped to the process of moving from v2 data to v3 data. This would free up the convert command for later use, e.g. things like changing chunking, compression, etc.

d-v-b · 2025-07-18T10:13:46Z

also, in case anyone wants to try this tool out, here's a 1-liner with uvx:

uvx --with typer --from git+https://github.com/K-Meech/zarr-python/@km/v2-v3-conversion zarr-converter --help

K-Meech added 27 commits July 1, 2025 11:08

add rough cli converter structure

45bb4e5

allow zstd, gzip and numcodecs zarr 3 compression

456c9e7

convert filters to v3

242a338

create BytesCodec with correct endian

1045c33

handle C vs F order in v2 metadata

4e2442f

save group and array metadata to file

c63f0b8

create overall conversion functions for store, array or group

2947ce4

add minimal typer cli

ba81755

add initial tests for converter

67f9580

add tests for conversion of groups and nested groups and arrays

0d7c2c8

add tests for conversion of compressors and filters

cf39580

test conversion of order and endianness

11499e7

add tests for edge cases of incorrect codecs

90b0996

add tests for / separator

85159bb

draft of metadata remover and add test for internal paths

53ba166

add clear command to cli with tests

d4cdc04

add test for metadata removal with path#

dfdc729

add verbose logging option

ad60991

add dry run option to cli

66bae0d

add test for dry-run

97df9bf

add zarr-converter script and enable cli dep in tests

42e0435

use v2 chunk key encoding type

9e20b39

Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…

6586e66

…onversion

update endianness of test data type

ce409a3

Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…

fb7136b

…onversion

check converted arrays can be accessed

6585f24

Merge branch 'main' of github.com:K-Meech/zarr-python into km/v2-v3-c…

46e958d

…onversion

github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Jul 16, 2025

K-Meech added 2 commits July 16, 2025 16:58

remove uses of pathlib walk, as it didn't exist in python 3.11

08fc138

include tags in checkout for gpu test, to avoid numcodecs.zarr3 reque…

3540434

…sting a zarr version greater than 3
