-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@serge-sans-paille: Reformatted your C++ code and added 16-bit and 64-bit sorting. |
Added benchmark numbers. Need to fix the CI errors. |
0b5ef42
to
b2a5ff8
Compare
The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem? |
Doesn't look like it from logs. It was able to build and install fine. I am very confused with all the errors. Still working my way through them. |
Very odd: the error message says it's Next commit imports fine before hitting a segfault so the permissions failure seems to have gone away. Hopefully it stays that way. |
That's possible. What would be the easiest way to reproduce this error locally? Any pointers would be helpful :) |
This will mostly be following the CI workflow. On a 64-bit Windows machine (it's possible to adapt the directons for 32-bit windows, but not recommended):
Another option would be to mark the new functions This page has a brief description of how to use the |
@DWesl Thanks for that! And yup, its the same problem as you mentioned! Bunch of aligned mov instructions to unaligned stack pointer.
|
I'm trying to reproduce the Windows Python38-32bit-fast CI failure (followed the steps listed in
Any idea where I am going wrong? |
Somehow you are running the test with a 64-bit python ( |
Yes looks like it. Doesn't the CI use the python.exe downloaded from here:? Line 11 in 3cf2ca1
And that is a 64-bit one:
|
No, that is PyPy-only. Otherwise |
Might need some help reproducing and figuring out why the tests fail on WIN32 |
f353b83
to
6b3e914
Compare
The ever elusive "All checks have passed" 🤦♂️But I think the last CI failure has nothing to do with this patch. |
ping .. any idea why the two failures? one is on Arm64 and other one might also not be related (this patch is disabled on 32-bit windows). |
The aarch64 run is timing out at ~80% |
Added AVX-512 quicksort for float16. This was the tricky one and I have a feeling it can be improved further. Benchmarks:
|
@mattip The ARM64 failure on TravisCI is unlikely to be related to this patch (this patch is x86 code only). Looks like other PR's have the similar failures too: https://app.travis-ci.com/github/numpy/numpy/jobs/587139824 and https://app.travis-ci.com/github/numpy/numpy/jobs/587353553 |
@seiko2plus any thoughts? |
just a friendly ping :) |
@r-devulap, sorry, currently I suffer from lack of time, please just give me 2 or 3 days max to give a full review. |
@seiko2plus Didnt mean to rush you. Please take your time. |
disabled on Cygwin. |
This has become quite a large PR. If I understand correctly it adds new headers, new ways of dispatching, and a new dependency. Could we split this into two: one PR to enhance the dispatching mechanism, including documentation, and another to add AVX512 float16 sorting? The speed increase is really nice. Correct me if I am wrong, I would imagine much of it comes from having a native float16 sort. If that is true, then implementing a float16 sort without casting float16 to float32 would also speed up non-AVX512 architectures. |
the other way around, the code is reduced and became easier to read.
Yes, that's correct. mainly initial new road for using C++ that enhances the use of type deduced and overloading The new four headers are located in
I can move this patch into a separate pr if it's necessary.
Nothing changed, the same way.
This patch doesn't enable any native float16 instructions, except native conversion between single/half precision.
yes, indeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you
Thanks @r-devulap |
This broke MacOS CI with (I have a PR to fix it by ensuring the submodule is initalized so headers are there):
I presume we are on hardware there were AVX512 use (or compiling for it), is OK and that submodule initialization is right (see gh-23216). If not, then some guard seems not narrow enough. |
@r-devulap if you would like some visibility on this, do feel free to add a |
I wonder why the macOS CI failure didnt show up here. This PR already has a commit to update submodules in the macOS CI.
Will do, thanks! |
Ohhh weird, so now the submodule init is there twice :/. It seemed to have been a bit random. Maybe a weird caching issue? |
Doesn't this change break this documented promise:
From the Intel x86-simd-sort repo:
|
A quick heads-up that the x86-simd-sort code (I downloaded HEAD from Github) may have an issue sorting large arrays. I noticed incorrect results when integrating this into our vqsort benchmark. Happy to discuss in HN comments or here. |
I opened a milestoned issue. We should investigate this (correctness and maybe also NaN sorting stability) before the next release (which is plenty of time). (Happy if discussion continues here.) Another thing that I hope might happen is to have contributions to expand so this is not/less AVX512 specific. |
Sounds good 👍
Our vqsort works on S-SSE3/SSE4/AVX2/AVX-512, Arm NEON/SVE/SVE2, WASM SIMD and RISC-V V. The x86-simd-sort work cites previous work by one of the coauthors (Mark Blacher) of vqsort, but does not include its faster sorting network and more robust pivot sampling (note: some enhancements are not mentioned in our paper because they came later). |
The AVX-512 sorting routines does put all the nan's at the end of the array. What it doesn't do is preserve your NAN's bit by bit. Meaning, if you pass 0xFFFFFFFF, it will replace it with |
@jan-wassenberg Couldn't figure out the bug you encountered from that link (perhaps I didn't look deep enough). Could you elaborate? |
Ah, OK. I agree that NumPy should not have to worry about preserving the NaN payload, thanks for clarifying. (There are interesting use-cases for using NaN payload to identify NAs which are different from NaN, but that would be a problem to solve once someone wants to do that, it might be for R which does that IIRC.) |
Sure. If you patch in that pull request, but remove the
The problem is that the result doesn't match the expected sort order. But now I see what happened: x86-simd-sort doesn't seem to support reverse sorting, but our test also exercised that. Sorry I didn't notice that before - with only ascending order the test passes and it looks like all is well. I've updated our patch. |
@jan-wassenberg Sounds good. Thanks for clarifying! |
Curious if these changes might impact order for element subsets that are undetermined (i.e. identical values for fields)? |
Not sure what you are asking here. If you need identical elements to be taken in the same order, stable sorts are what you want, quicksort variants can differ in many ways. |
[Edit: verified that this is in fact the case. Editorially, having quicksort be the default sort seems to violate principle of least surprise, but I get that numpy wants to have the best perfromance OOTB.] Yep, that's what I'm asking. Of course it doesn't violate spec for quicksort to vary over identical fields, I was just wondering if you thought that was possible/likely with this particular change. In this case, we have regeressions tests that expect a certain ordering for underdetermined elements, even though that ordering is arbitrary, and I just need to adjust so tests don't have that expectation. (Normally I would assume implementations to stay stable wrt to algorithm behaviour on x.n releases, but it is clearly stated in release notes, so this isn't a critique, just an observation.) |
This patch adds AVX512 based 64-bit on AVX512-SKX and 16-bit sorting on AVX512-ICL. All the AVX512 sorting code has been reformatted as a separate header files and put in a separate folder. The AVX512 64-bit sorting is nearly 10x faster and AVX512 16-bit sorting is nearly 16x faster when compared to
std::sort
.Still working on running NumPy benchmarks to get exact benchmark numbers16-bit int sped up by 17x and float64 by nearly 10x for random arrays. Benchmarked on a 11th Gen Tigerlake i7-1165G7.