ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315

r-devulap · 2022-09-20T04:40:45Z

This patch adds AVX512 based 64-bit on AVX512-SKX and 16-bit sorting on AVX512-ICL. All the AVX512 sorting code has been reformatted as a separate header files and put in a separate folder. The AVX512 64-bit sorting is nearly 10x faster and AVX512 16-bit sorting is nearly 16x faster when compared to std::sort. ~~Still working on running NumPy benchmarks to get exact benchmark numbers~~

16-bit int sped up by 17x and float64 by nearly 10x for random arrays. Benchmarked on a 11th Gen Tigerlake i7-1165G7.

      before           after         ratio
     [41b6ac0e]       [2b384ac6]
     <main>           <avxsort> 
-        44.2±1μs       41.8±0.5μs     0.95  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))
-      39.7±0.1μs      37.1±0.02μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))
-        45.7±3μs       42.6±0.4μs     0.93  bench_function_base.Sort.time_sort('quick', 'float32', ('random',))
-      39.8±0.1μs       37.1±0.4μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('random',))
-     39.1±0.03μs      36.4±0.03μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
-      39.9±0.1μs       37.1±0.2μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))
-      39.4±0.2μs       36.3±0.2μs     0.92  bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))
-     40.6±0.03μs       37.1±0.3μs     0.91  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))
-      46.4±0.7μs       41.8±0.4μs     0.90  bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))
-      40.5±0.4μs       36.4±0.1μs     0.90  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))
-        42.5±1μs      37.5±0.08μs     0.88  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))
-        42.4±1μs       36.8±0.4μs     0.87  bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
-        43.4±3μs       37.4±0.2μs     0.86  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))
-        44.5±1μs         38.0±1μs     0.85  bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))
-        81.7±4μs       67.9±0.2μs     0.83  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
-        45.7±1μs       37.6±0.2μs     0.82  bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))
-       119±0.6μs       77.7±0.2μs     0.65  bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
-         136±5μs       67.1±0.3μs     0.49  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
-     70.6±0.03μs      33.9±0.07μs     0.48  bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))
-         113±1μs       34.8±0.1μs     0.31  bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))
-         325±9μs      80.9±0.07μs     0.25  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
-         452±8μs         85.7±4μs     0.19  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))
-         386±5μs       70.3±0.6μs     0.18  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
-         460±5μs         82.0±3μs     0.18  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))
-      94.5±0.5ms       14.2±0.4ms     0.15  bench_function_base.Sort.time_sort_worst
-        537±10μs         80.8±2μs     0.15  bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
-        518±10μs         73.9±3μs     0.14  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
-        533±20μs       74.5±0.3μs     0.14  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))
-        70.3±1μs       8.54±0.2μs     0.12  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
-        652±30μs         70.6±1μs     0.11  bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
-         317±3μs       32.5±0.2μs     0.10  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))
-         439±2μs         34.4±2μs     0.08  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))
-         133±6μs       9.99±0.4μs     0.07  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
-         452±1μs       33.4±0.1μs     0.07  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))
-      63.5±0.8μs       4.44±0.2μs     0.07  bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))
-         555±7μs       32.8±0.1μs     0.06  bench_function_base.Sort.time_sort('quick', 'int16', ('random',))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

r-devulap · 2022-09-20T04:56:01Z

@serge-sans-paille: Reformatted your C++ code and added 16-bit and 64-bit sorting.

r-devulap · 2022-09-21T05:16:37Z

Added benchmark numbers. Need to fix the CI errors.

mattip · 2022-09-29T10:40:32Z

The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem?

r-devulap · 2022-09-29T18:00:28Z

The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem?

Doesn't look like it from logs. It was able to build and install fine. I am very confused with all the errors. Still working my way through them.

DWesl · 2022-10-04T16:06:07Z

The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem?

Very odd: the error message says it's numpy.core._multiarray_umath that had trouble importing, but the permissions on the file providing that module indicate both read and execute permissions for all users, which is what is required for importing. Last time something failed to import despite good permissions, the problem was DLL dependencies not being on PATH, but I have a check for that now and it doesn't report missing dependencies, so I'm confused about what's going on.

Next commit imports fine before hitting a segfault so the permissions failure seems to have gone away. Hopefully it stays that way.
The segfault in SIMD reminds me of this problem, which was fixed in #18795; maybe that came up again?

r-devulap · 2022-10-04T17:52:02Z

The segfault in SIMD reminds me of this problem, which was fixed in #18795; maybe that came up again?

That's possible. What would be the easiest way to reproduce this error locally? Any pointers would be helpful :)

DWesl · 2022-10-04T20:43:34Z

The segfault in SIMD reminds me of this problem, which was fixed in #18795; maybe that came up again?

That's possible. What would be the easiest way to reproduce this error locally? Any pointers would be helpful :)

This will mostly be following the CI workflow. On a 64-bit Windows machine (it's possible to adapt the directons for 32-bit windows, but not recommended):

Get setup-86_64.exe from cygwin.com
Run the installer: when you get to the "Select Packages" page, install (this may be simpler if you change the view in the package picker from "Categories" to "Full" or "Not Installed":
- python38-devel
- python38-zipp
- python38-importlib-metadata
- python38-cython
- python38-pip
- python38-wheel
- python38-cffi
- python38-pytz
- python38-setuptools
- python38-pytest
- python38-hypothesis
- liblapack-devel
- gcc-fortran
- gcc-g++
- git
- dash
- Other packages you may find useful for debugging:
  - python38-debuginfo
  - cygwin-debuginfo
  - gdb
- libopenblas: Might speed up the linear algebra tests, but I don't think you're looking at those
Start Cygwin: by default it will give you a start menu item and a desktop shortcut, but you can also open a command prompt, navigate to the directory where you installed Cygwin (C:\cygwin64 by default) and run Cygwin.bat. You can run mintty - to get a different terminal, or just deal with blocky fonts and stay in the command prompt.
Set up development environment according to the build and test guidelines
- python3.8 -m pip install 'setuptools<49.2.0' pytest pytz cffi pickle5 importlib_metadata typing_extensions
- python3.8 -m pip install -r test_requirements.txt
Build NumPy (runtests.py should handle this fine, but the long version is less likely to run into fork failures)
- python3.8 setup.py bdist_wheel (The Cygwin CI job uploads the wheel file on failure, so you can download that and skip this step if you want)
- python3.8 -m pip install dist/numpy-*cp38*.whl (wheel file name will change with every patch update to Cygwin, so you may need to delete old wheels every so often)
- dash "tools/rebase_installed_dlls_cygwin.sh" 3.8
- python3.8 runtests.py -n -vv
Debug failures using whatever means you normally use. You might find it useful to install pytest-forked and run just the tests that are likely to fail with python3.8 runtests.py -n -- --forked, but forking before every test is too slow on Cygwin to be useful for more than a couple dozen tests.

Another option would be to mark the new functions NPY_FINLINE one at a time until the segfault goes away or moves.

This page has a brief description of how to use the CYGWIN environment variable to make programs (including the python process running the numpy tests) start gdb on segfault: I don't know how to set environment variables on WIndows, so I'm not sure how useful that is (some Cygwin options need to be set before any Cygwin process starts, but env CYGWIN='error_start:C:\cygwin64\bin\gdb.exe' python3.8 runtests.py -n -vv might still work)

r-devulap · 2022-10-05T00:20:38Z

@DWesl Thanks for that! And yup, its the same problem as you mentioned! Bunch of aligned mov instructions to unaligned stack pointer.

0x5af801157 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4279>        vmovaps %zmm1,0x120(%rsp)
0x5af801162 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4290>        vzeroupper
0x5af801165 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4293>        call   0x5af7f12b0 <sort_zmm_32bit<vector<float>, __vector(16) float>(float __attribute__ ((vector_size(16))))
0x5af80116a <sort_128_32bit<vector<float>, float>(float*, int32_t)+4298>        vmovups 0x20(%rsp),%zmm3
0x5af801175 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4309>        mov    %r13,%rdx
0x5af801178 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4312>        mov    %r12,%rcx
0x5af80117b <sort_128_32bit<vector<float>, float>(float*, int32_t)+4315>        vmovaps 0x160(%rsp),%zmm0
0x5af801186 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4326>        vmovaps %zmm3,0x120(%rsp)
0x5af801191 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4337>        vmovups %zmm0,0xa0(%rsp)
0x5af80119c <sort_128_32bit<vector<float>, float>(float*, int32_t)+4348>        vzeroupper

r-devulap · 2022-10-07T18:57:53Z

I'm trying to reproduce the Windows Python38-32bit-fast CI failure (followed the steps listed in azure-steps-windows.yml). I can't reproduce the failing tests but I see a different test fail:

FAILED numpy\distutils\tests\test_mingw32ccompiler.py::test_build_import - ValueError: 'nm.exe' found but it does not support 64-bit dlls when using 64-bit python. Supported formats: 'b'supported targets: pe-i386 pei-i386 elf32-i386 e...
1 failed, 22545 passed, 545 skipped, 1306 deselected, 35 xfailed, 28 xpassed in 703.17s (0:11:43)

Any idea where I am going wrong?

mattip · 2022-10-08T20:52:28Z

it does not support 64-bit dlls when using 64-bit python

Somehow you are running the test with a 64-bit python (sys.maxsize is used to check)

r-devulap · 2022-10-09T17:23:28Z

Yes looks like it. Doesn't the CI use the python.exe downloaded from here:?

numpy/azure-steps-windows.yml

Line 11 in 3cf2ca1

$url = "https://downloads.python.org/pypy/pypy3.8-v7.3.9-win64.zip"

And that is a 64-bit one:

PS C:\Users\rdevulap\numpy> ..\pypy3\python.exe --version
Python 3.8.12 (0089b4a7ab2306925a251b35912885d52ead1aba, Mar 16 2022, 13:51:04)
[PyPy 7.3.9 with MSC v.1929 64 bit (AMD64)]

mattip · 2022-10-10T18:29:28Z

No, that is PyPy-only. Otherwise $PYTHON_VERSION and $PYTHON_ARCH are used in the UsePythonVersion task.

r-devulap · 2022-10-11T05:20:07Z

Might need some help reproducing and figuring out why the tests fail on WIN32 ~~and MacOS~~. Disabling it for now. Will open an issue so we keep track of it. I am guessing the fix should be simple, but I am somehow struggling to reproduce the errors.

r-devulap · 2022-10-12T22:34:05Z

The ever elusive "All checks have passed" 🤦‍♂️But I think the last CI failure has nothing to do with this patch.

r-devulap · 2022-10-27T17:11:13Z

ping .. any idea why the two failures? one is on Arm64 and other one might also not be related (this patch is disabled on 32-bit windows).

mattip · 2022-10-31T11:57:24Z

The aarch64 run is timing out at ~80%

r-devulap · 2022-11-02T21:14:29Z

Added AVX-512 quicksort for float16. This was the tricky one and I have a feeling it can be improved further. Benchmarks:

       before           after         ratio
     [db7414b7]       [cf13cd63]
     <main>           <avxsort>
-       196±0.4μs        140±0.1μs     0.72  bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))
-         462±3μs       131±0.05μs     0.28  bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))
-         609±2μs        129±0.1μs     0.21  bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))
-         590±7μs        124±0.1μs     0.21  bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))
-        698±10μs        129±0.1μs     0.18  bench_function_base.Sort.time_sort('quick', 'float16', ('random',))
-       174±0.4μs      6.58±0.03μs     0.04  bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))
-       175±0.2μs      6.58±0.06μs     0.04  bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))

r-devulap · 2022-11-03T03:46:59Z

@mattip The ARM64 failure on TravisCI is unlikely to be related to this patch (this patch is x86 code only). Looks like other PR's have the similar failures too: https://app.travis-ci.com/github/numpy/numpy/jobs/587139824 and https://app.travis-ci.com/github/numpy/numpy/jobs/587353553

mattip · 2022-11-03T06:18:40Z

@seiko2plus any thoughts?

r-devulap · 2022-11-15T19:55:13Z

just a friendly ping :)

seiko2plus · 2022-11-16T01:02:33Z

@r-devulap, sorry, currently I suffer from lack of time, please just give me 2 or 3 days max to give a full review.

r-devulap · 2022-11-16T16:23:30Z

@seiko2plus Didnt mean to rush you. Please take your time.

r-devulap · 2023-02-07T22:51:41Z

It seems the following directive needs to be extended to cover cygwin

numpy/numpy/core/src/npysort/quicksort.cpp

Lines 72 to 76 in 472a47f

// Temporarily disable AVX512 sorting on WIN32 until we can figure

// out why it has test failures

#ifdef _MSC_VER

template<typename T>

inline bool quicksort_dispatch(T*, npy_intp)

disabled on Cygwin.

mattip · 2023-02-08T08:09:17Z

This has become quite a large PR. If I understand correctly it adds new headers, new ways of dispatching, and a new dependency. Could we split this into two: one PR to enhance the dispatching mechanism, including documentation, and another to add AVX512 float16 sorting?

The speed increase is really nice. Correct me if I am wrong, I would imagine much of it comes from having a native float16 sort. If that is true, then implementing a float16 sort without casting float16 to float32 would also speed up non-AVX512 architectures.

seiko2plus · 2023-02-08T16:03:03Z

This has become quite a large PR.

the other way around, the code is reduced and became easier to read.

adds new headers

Yes, that's correct. mainly initial new road for using C++ that enhances the use of type deduced and overloading
rather than using the current approach that is based on meta/tags e.g. float_tag.

The new four headers are located in src/common, and as I explained in the commit message b358ba4:

 ENH: Towards modern C++

 This patch initializes new C++ headers and also brings new
 namespace `np::` to break away from the current approach
 of using C++ which tends not to be drawn into modernity.

I can move this patch into a separate pr if it's necessary.

new ways of dispatching

Nothing changed, the same way.

Correct me if I am wrong, I would imagine much of it comes from having a native float16 sort

This patch doesn't enable any native float16 instructions, except native conversion between single/half precision.
Sort operation mainly based on 16-bit shuffle/permuting instructions has nothing to do with FP unit except for the part that related to comparison wouldn't affect that much on performance and it can be numerically emulated without even the need for cast between single/half precision.

If that is true, then implementing a float16 sort without casting float16 to float32 would also speed up non-AVX512 architectures.

yes, indeed.

seiko2plus

LGTM, Thank you

mattip · 2023-02-15T12:01:26Z

Thanks @r-devulap

seberg · 2023-02-15T13:30:19Z

This broke MacOS CI with (I have a PR to fix it by ensuring the submodule is initalized so headers are there):

In file included from build/src.macosx-11.7-x86_64-3.9/numpy/core/src/npysort/simd_qsort.dispatch.avx512_skx.cpp:21:
/Users/runner/work/1/s/numpy/core/src/npysort/simd_qsort.dispatch.cpp:11:14: fatal error: 'x86-simd-sort/src/avx512-32bit-qsort.hpp' file not found
    #include "x86-simd-sort/src/avx512-32bit-qsort.hpp"
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

I presume we are on hardware there were AVX512 use (or compiling for it), is OK and that submodule initialization is right (see gh-23216). If not, then some guard seems not narrow enough.

seberg · 2023-02-15T18:18:34Z

@r-devulap if you would like some visibility on this, do feel free to add a performance release note, it seems like that may be worthwhile.

r-devulap · 2023-02-15T18:29:51Z

This broke MacOS CI with (I have a PR to fix it by ensuring the submodule is initalized so headers are there):

I wonder why the macOS CI failure didnt show up here. This PR already has a commit to update submodules in the macOS CI.

@r-devulap if you would like some visibility on this, do feel free to add a performance release note, it seems like that may be worthwhile.

Will do, thanks!

seberg · 2023-02-15T18:48:58Z

Ohhh weird, so now the submodule init is there twice :/. It seemed to have been a bit random. Maybe a weird caching issue?

mvtec-bergdolll · 2023-02-17T09:44:58Z

Doesn't this change break this documented promise:

Previous to numpy 1.4.0 sorting real and complex arrays containing nan values led to undefined behaviour. In numpy versions >= 1.4.0 nan values are sorted to the end. The extended sort order is:

From the Intel x86-simd-sort repo:

If you expect your array to contain NANs, please be aware that the these routines do not preserve your NANs as you pass them.

jan-wassenberg · 2023-02-17T15:00:14Z

A quick heads-up that the x86-simd-sort code (I downloaded HEAD from Github) may have an issue sorting large arrays.

I noticed incorrect results when integrating this into our vqsort benchmark. Happy to discuss in HN comments or here.

seberg · 2023-02-17T15:07:35Z

I opened a milestoned issue. We should investigate this (correctness and maybe also NaN sorting stability) before the next release (which is plenty of time). (Happy if discussion continues here.)

Another thing that I hope might happen is to have contributions to expand so this is not/less AVX512 specific.

jan-wassenberg · 2023-02-17T15:22:01Z

Sounds good 👍

Another thing that I hope might happen is to have contributions to expand so this is not/less AVX512 specific.

Our vqsort works on S-SSE3/SSE4/AVX2/AVX-512, Arm NEON/SVE/SVE2, WASM SIMD and RISC-V V.

The x86-simd-sort work cites previous work by one of the coauthors (Mark Blacher) of vqsort, but does not include its faster sorting network and more robust pivot sampling (note: some enhancements are not mentioned in our paper because they came later).

r-devulap · 2023-02-17T16:18:57Z

Doesn't this change break this documented promise:

Previous to numpy 1.4.0 sorting real and complex arrays containing nan values led to undefined behaviour. In numpy versions >= 1.4.0 nan values are sorted to the end. The extended sort order is:

From the Intel x86-simd-sort repo:

If you expect your array to contain NANs, please be aware that the these routines do not preserve your NANs as you pass them.

The AVX-512 sorting routines does put all the nan's at the end of the array. What it doesn't do is preserve your NAN's bit by bit. Meaning, if you pass 0xFFFFFFFF, it will replace it with std::nanf("1"), which is still a NAN. AFAIK, NumPy doesnt distinguish between the two.

r-devulap · 2023-02-17T16:19:56Z

A quick heads-up that the x86-simd-sort code (I downloaded HEAD from Github) may have an issue sorting large arrays.

I noticed incorrect results when integrating this into our vqsort benchmark. Happy to discuss in HN comments or here.

@jan-wassenberg Couldn't figure out the bug you encountered from that link (perhaps I didn't look deep enough). Could you elaborate?

seberg · 2023-02-17T16:23:21Z

which is still a NAN. AFAIK, NumPy doesnt distinguish between the two.

Ah, OK. I agree that NumPy should not have to worry about preserving the NaN payload, thanks for clarifying. (There are interesting use-cases for using NaN payload to identify NAs which are different from NaN, but that would be a problem to solve once someone wants to do that, it might be for R which does that IIRC.)

jan-wassenberg · 2023-02-17T18:53:48Z

@jan-wassenberg Couldn't figure out the bug you encountered from that link (perhaps I didn't look deep enough). Could you elaborate?

Sure. If you patch in that pull request, but remove the algo == Algo::kIntel || change, running bench_sort will raise an error complaining that the result does not match expectations:

BenchSort: i=0 of 10000 lanes: N1=1 -2146707942 -2146801302 vs. -2146300295 -2146707942
Abort at third_party/highway/hwy/contrib/sort/result-inl.h:125: 32-bit sort is incorrect

The problem is that the result doesn't match the expected sort order. But now I see what happened: x86-simd-sort doesn't seem to support reverse sorting, but our test also exercised that. Sorry I didn't notice that before - with only ascending order the test passes and it looks like all is well. I've updated our patch.

r-devulap · 2023-02-17T18:55:41Z

@jan-wassenberg Sounds good. Thanks for clarifying!

MilesParker · 2023-12-06T23:15:15Z

Curious if these changes might impact order for element subsets that are undetermined (i.e. identical values for fields)?
Asking because I'm trying to sort out some regressions in internal tests.

charris · 2023-12-07T00:26:16Z

if these changes might impact order

Not sure what you are asking here. If you need identical elements to be taken in the same order, stable sorts are what you want, quicksort variants can differ in many ways.

MilesParker · 2023-12-07T00:31:50Z

[Edit: verified that this is in fact the case. Editorially, having quicksort be the default sort seems to violate principle of least surprise, but I get that numpy wants to have the best perfromance OOTB.]

Yep, that's what I'm asking. Of course it doesn't violate spec for quicksort to vary over identical fields, I was just wondering if you thought that was possible/likely with this particular change. In this case, we have regeressions tests that expect a certain ordering for underdetermined elements, even though that ordering is arbitrary, and I just need to adjust so tests don't have that expectation. (Normally I would assume implementations to stay stable wrt to algorithm behaviour on x.n releases, but it is clearly stated in release notes, so this isn't a critique, just an observation.)

github-actions bot added the 01 - Enhancement label Sep 20, 2022

seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Sep 26, 2022

r-devulap force-pushed the avxsort branch 3 times, most recently from 0b5ef42 to b2a5ff8 Compare September 29, 2022 05:07

r-devulap force-pushed the avxsort branch from 041c576 to afc334c Compare October 11, 2022 05:15

r-devulap force-pushed the avxsort branch 2 times, most recently from f353b83 to 6b3e914 Compare October 11, 2022 18:20

seiko2plus approved these changes Feb 8, 2023

View reviewed changes

mattip merged commit 0bd56e7 into numpy:main Feb 15, 2023

mattip mentioned this pull request Feb 15, 2023

ENH: sort 16-bit floats without converting to 32-bit floats #23215

Closed

seberg mentioned this pull request Feb 17, 2023

BUG?: New SIMD quicksort code may have some issues #23234

Closed

rgommers mentioned this pull request Mar 10, 2023

DOC: Promote to build with meson #23367

Closed

r-devulap mentioned this pull request May 18, 2023

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

Merged

pllim mentioned this pull request Jun 1, 2023

TST: test_table_group_by[True] and test_group_by_masked[True] failed with numpy 1.25rc1 astropy/astropy#14882

Closed

Bulat-Ziganshin mentioned this pull request Jun 2, 2023

Consider checking out vqsort performance ClickHouse/ClickHouse#46348

Open

taldcroft mentioned this pull request Jun 3, 2023

Specify stable sort in indexing astropy/astropy#14907

Merged

mfbalin mentioned this pull request Jun 29, 2024

[GraphBolt] modify logic for HeteroItemSet indexing dmlc/dgl#7428

Open

8 tasks

Uh oh!

ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315

ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315

Uh oh!

Conversation

r-devulap commented Sep 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

r-devulap commented Sep 20, 2022

Uh oh!

r-devulap commented Sep 21, 2022

Uh oh!

mattip commented Sep 29, 2022

Uh oh!

r-devulap commented Sep 29, 2022

Uh oh!

DWesl commented Oct 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

r-devulap commented Oct 4, 2022

Uh oh!

DWesl commented Oct 4, 2022

Uh oh!

r-devulap commented Oct 5, 2022

Uh oh!

r-devulap commented Oct 7, 2022

Uh oh!

mattip commented Oct 8, 2022

Uh oh!

r-devulap commented Oct 9, 2022

Uh oh!

mattip commented Oct 10, 2022

Uh oh!

r-devulap commented Oct 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

r-devulap commented Oct 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

r-devulap commented Oct 27, 2022

Uh oh!

mattip commented Oct 31, 2022

Uh oh!

r-devulap commented Nov 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

r-devulap commented Nov 3, 2022

Uh oh!

mattip commented Nov 3, 2022

Uh oh!

r-devulap commented Nov 15, 2022

Uh oh!

seiko2plus commented Nov 16, 2022

Uh oh!

r-devulap commented Nov 16, 2022

Uh oh!

r-devulap commented Feb 7, 2023

Uh oh!

mattip commented Feb 8, 2023

Uh oh!

seiko2plus commented Feb 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seiko2plus left a comment

Choose a reason for hiding this comment

Uh oh!

mattip commented Feb 15, 2023

Uh oh!

seberg commented Feb 15, 2023

Uh oh!

seberg commented Feb 15, 2023

Uh oh!

r-devulap commented Feb 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seberg commented Feb 15, 2023

Uh oh!

mvtec-bergdolll commented Feb 17, 2023

Uh oh!

jan-wassenberg commented Feb 17, 2023

r-devulap commented Sep 20, 2022 •

edited

Loading

DWesl commented Oct 4, 2022 •

edited

Loading

r-devulap commented Oct 11, 2022 •

edited

Loading

r-devulap commented Oct 12, 2022 •

edited

Loading

r-devulap commented Nov 2, 2022 •

edited

Loading

seiko2plus commented Feb 8, 2023 •

edited

Loading

r-devulap commented Feb 15, 2023 •

edited

Loading

MilesParker commented Dec 7, 2023 •

edited

Loading