Skip to content

Resample RGB images in C++ #29453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

QuLogic
Copy link
Member

@QuLogic QuLogic commented Jan 11, 2025

PR summary

Agg already has RGB resampling with output to RGBA builtin, so we just need to correctly wire up the corresponding templates. With this RGB resampling mode, we save the extra copy from RGB to RGBA in NumPy land that was required for the previous always-RGBA resampling.

With the example from #29434, this saves 800MB of peak memory usage, which actually works out to two copies of 10000x10000x4 bytes; I'm not sure where the second copy comes from. I was thinking it was 4-byte RGBA, but it's actually 8-byte float64 input, so 1 copy removed as expected.

While this does pass tests, I think there may be some room for further improvement, as the input to the C++ code is A[..., :3], which should be a discontiguous view, but then the pybind11 code forces it to be contiguous, which would make another copy. There are some offset and step settings in Agg that I think we can use to avoid this copy as well, but it might require extra template expansions. I have implemented this to save another copy.

PR checklist

Agg already has RGB resampling with output to RGBA builtin, so we just
need to correctly wire up the corresponding templates. With this RGB
resampling mode, we save the extra copy from RGB to RGBA in NumPy land
that was required for the previous always-RGBA resampling.
In the case of RGBA, the RGB and A channels are resampled separately,
but they are created as a view on the original to pass to the C++ code.
The C++ code then copies it to a contiguous buffer, but Agg's RGB
resampler supports manually stepping the RGB input by a custom stride.
As this step is a template parameter, we can't handle any arbitraray
array, but can special case steps of 3 or 4 units, which should cover
the common cases of RGB or RGBA-viewed-as-RGB input.
@tacaswell tacaswell added this to the v3.11.0 milestone Jan 17, 2025
@QuLogic
Copy link
Member Author

QuLogic commented Jan 23, 2025

Using memray on the example from #29434, we can find 3 major memory contributors:

  1. A (800MB) copy made in ImageBase.set_data; this one is expected as we don't want to use or modify external data.
  2. A (3.2GB) copy made in Colormap._get_rgba_and_mask; this is part of the colour mapping process and expected (though for the old interpolation stage, this originally happened after resampling, so was smaller.)
  3. A (3.2GB) copy made in _rgb_to_rgba in order to make an RGBA array for resampling in C++ with just the RGB from point 2.

Using pyinstrument, we can find that most processing time is taken up by AxesImage._make_image, specifically _to_rgba (41.5%), _rgb_to_rgba (40.5%), and _resample (16.4%).

Screenshot 2025-01-23 at 06-12-16 22 8s - issue29434 py - pyinstrument

The first commit here removes entry 3 by applying Agg templates for RGB, but we still end up allocating a similar size one in _resample (to make a contiguous array), but without the alpha channel, so only 2.4GB, meaning a drop of 800MB. Instrumentation doesn't show much change in runtime.

The second commit removes the need for that latter copy as well by applying more Agg templates for step size of 3 or 4. Strangely though, memray now attributes more allocations to Colorizer.to_rgba 4.7GiB vs 3.0 GiB before. Insrtumentation does show a good improvement in runtime:

Screenshot 2025-01-23 at 06-23-50 14 7s - issue29434 py - pyinstrument

I think memray must only show the blocks that contribute to peak usage (which explains why the attribution changes around a bit after the commits). Reported peak memory usage shrank from 6.8 GiB to 5.6 GiB (17.6% decrease), and runtime (of Axes._make_image specifically) from 19.146s to 12.467s (34.9% decrease).

There are a few other copies that I can find by inspection:

  1. The initial data is 10000x10000xfloat=800M bytes, though since it's passed to imshow directly, it gets immediately discarded.
  2. A copy of the input in _Colormap._get_rgba_and_mask; this is so that it can be modified (with a byteswap or scaling from float to int).
  3. Three masks for over/under/bad, but these are booleans, not float64.
  4. A cast from float to int; this is used for the indexing into the lut.

Copies 1, 2, and 4 are 8 bytes, so 800M each, and copy 3 should be 3*byte = 300M. Many of these are replacing the previous version, or otherwise discarded, so I guess they don't count for memray.

I believe we can remove copies 2-4 with a port of _get_rgba_and_mask to C++, in order to directly output the final RGBA without any intermediaries. I didn't implement the alpha part yet, but a quick implementation shows this will cut down peak memory usage to 4.6 GiB (which is probably as low as we can go for original, RGBA, and resampled), and runtime of _get_rgba_and_mask itself from 9.757s to 1.822s.

I haven't committed that last thing, as I didn't finish alpha processing and wasn't sure if we wanted to move it to C++, but I think the numbers indicate this is a good idea.

@timhoffm
Copy link
Member

On a side note, Colormap._get_rgba_and_mask has the peculiar behavior that integers are used to directly index into the lut, whereas floats are scaled before lookup so that 0...1 covers the color range (#28198). We want to move away from this anyway, basically deprecate accepting ints here. But we have to decide whether we need this integer-lookup capability and if so, how to make a reasonable API for it. Thoughts welcome.

@anntzer
Copy link
Contributor

anntzer commented Mar 23, 2025

I think(?) #29776 would essentially remove the need for this PR (as it removes the use of _rgb_to_rgba).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants