Skip to content

ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Feb 15, 2023

Conversation

r-devulap
Copy link
Member

@r-devulap r-devulap commented Sep 20, 2022

This patch adds AVX512 based 64-bit on AVX512-SKX and 16-bit sorting on AVX512-ICL. All the AVX512 sorting code has been reformatted as a separate header files and put in a separate folder. The AVX512 64-bit sorting is nearly 10x faster and AVX512 16-bit sorting is nearly 16x faster when compared to std::sort. Still working on running NumPy benchmarks to get exact benchmark numbers

16-bit int sped up by 17x and float64 by nearly 10x for random arrays. Benchmarked on a 11th Gen Tigerlake i7-1165G7.

      before           after         ratio
     [41b6ac0e]       [2b384ac6]
     <main>           <avxsort> 
-        44.2±1μs       41.8±0.5μs     0.95  bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))
-      39.7±0.1μs      37.1±0.02μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))
-        45.7±3μs       42.6±0.4μs     0.93  bench_function_base.Sort.time_sort('quick', 'float32', ('random',))
-      39.8±0.1μs       37.1±0.4μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('random',))
-     39.1±0.03μs      36.4±0.03μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
-      39.9±0.1μs       37.1±0.2μs     0.93  bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))
-      39.4±0.2μs       36.3±0.2μs     0.92  bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))
-     40.6±0.03μs       37.1±0.3μs     0.91  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))
-      46.4±0.7μs       41.8±0.4μs     0.90  bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))
-      40.5±0.4μs       36.4±0.1μs     0.90  bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))
-        42.5±1μs      37.5±0.08μs     0.88  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))
-        42.4±1μs       36.8±0.4μs     0.87  bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
-        43.4±3μs       37.4±0.2μs     0.86  bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))
-        44.5±1μs         38.0±1μs     0.85  bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))
-        81.7±4μs       67.9±0.2μs     0.83  bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
-        45.7±1μs       37.6±0.2μs     0.82  bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))
-       119±0.6μs       77.7±0.2μs     0.65  bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
-         136±5μs       67.1±0.3μs     0.49  bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
-     70.6±0.03μs      33.9±0.07μs     0.48  bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))
-         113±1μs       34.8±0.1μs     0.31  bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))
-         325±9μs      80.9±0.07μs     0.25  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
-         452±8μs         85.7±4μs     0.19  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))
-         386±5μs       70.3±0.6μs     0.18  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
-         460±5μs         82.0±3μs     0.18  bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))
-      94.5±0.5ms       14.2±0.4ms     0.15  bench_function_base.Sort.time_sort_worst
-        537±10μs         80.8±2μs     0.15  bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
-        518±10μs         73.9±3μs     0.14  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
-        533±20μs       74.5±0.3μs     0.14  bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))
-        70.3±1μs       8.54±0.2μs     0.12  bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
-        652±30μs         70.6±1μs     0.11  bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
-         317±3μs       32.5±0.2μs     0.10  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))
-         439±2μs         34.4±2μs     0.08  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))
-         133±6μs       9.99±0.4μs     0.07  bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
-         452±1μs       33.4±0.1μs     0.07  bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))
-      63.5±0.8μs       4.44±0.2μs     0.07  bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))
-         555±7μs       32.8±0.1μs     0.06  bench_function_base.Sort.time_sort('quick', 'int16', ('random',))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@r-devulap
Copy link
Member Author

@serge-sans-paille: Reformatted your C++ code and added 16-bit and 64-bit sorting.

@r-devulap
Copy link
Member Author

Added benchmark numbers. Need to fix the CI errors.

@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Sep 26, 2022
@r-devulap r-devulap force-pushed the avxsort branch 3 times, most recently from 0b5ef42 to b2a5ff8 Compare September 29, 2022 05:07
@mattip
Copy link
Member

mattip commented Sep 29, 2022

The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem?

@r-devulap
Copy link
Member Author

The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem?

Doesn't look like it from logs. It was able to build and install fine. I am very confused with all the errors. Still working my way through them.

@DWesl
Copy link
Contributor

DWesl commented Oct 4, 2022

The cygwin failure is strange: "Permission denied" when importing the c-extension. Is there a compilation problem?

Very odd: the error message says it's numpy.core._multiarray_umath that had trouble importing, but the permissions on the file providing that module indicate both read and execute permissions for all users, which is what is required for importing. Last time something failed to import despite good permissions, the problem was DLL dependencies not being on PATH, but I have a check for that now and it doesn't report missing dependencies, so I'm confused about what's going on.

Next commit imports fine before hitting a segfault so the permissions failure seems to have gone away. Hopefully it stays that way.
The segfault in SIMD reminds me of this problem, which was fixed in #18795; maybe that came up again?

@r-devulap
Copy link
Member Author

The segfault in SIMD reminds me of this problem, which was fixed in #18795; maybe that came up again?

That's possible. What would be the easiest way to reproduce this error locally? Any pointers would be helpful :)

@DWesl
Copy link
Contributor

DWesl commented Oct 4, 2022

The segfault in SIMD reminds me of this problem, which was fixed in #18795; maybe that came up again?

That's possible. What would be the easiest way to reproduce this error locally? Any pointers would be helpful :)

This will mostly be following the CI workflow. On a 64-bit Windows machine (it's possible to adapt the directons for 32-bit windows, but not recommended):

  • Get setup-86_64.exe from cygwin.com
  • Run the installer: when you get to the "Select Packages" page, install (this may be simpler if you change the view in the package picker from "Categories" to "Full" or "Not Installed":
    • python38-devel
    • python38-zipp
    • python38-importlib-metadata
    • python38-cython
    • python38-pip
    • python38-wheel
    • python38-cffi
    • python38-pytz
    • python38-setuptools
    • python38-pytest
    • python38-hypothesis
    • liblapack-devel
    • gcc-fortran
    • gcc-g++
    • git
    • dash
    • Other packages you may find useful for debugging:
      • python38-debuginfo
      • cygwin-debuginfo
      • gdb
    • libopenblas: Might speed up the linear algebra tests, but I don't think you're looking at those
  • Start Cygwin: by default it will give you a start menu item and a desktop shortcut, but you can also open a command prompt, navigate to the directory where you installed Cygwin (C:\cygwin64 by default) and run Cygwin.bat. You can run mintty - to get a different terminal, or just deal with blocky fonts and stay in the command prompt.
  • Set up development environment according to the build and test guidelines
    • python3.8 -m pip install 'setuptools<49.2.0' pytest pytz cffi pickle5 importlib_metadata typing_extensions
    • python3.8 -m pip install -r test_requirements.txt
  • Build NumPy (runtests.py should handle this fine, but the long version is less likely to run into fork failures)
    • python3.8 setup.py bdist_wheel (The Cygwin CI job uploads the wheel file on failure, so you can download that and skip this step if you want)
    • python3.8 -m pip install dist/numpy-*cp38*.whl (wheel file name will change with every patch update to Cygwin, so you may need to delete old wheels every so often)
    • dash "tools/rebase_installed_dlls_cygwin.sh" 3.8
    • python3.8 runtests.py -n -vv
  • Debug failures using whatever means you normally use. You might find it useful to install pytest-forked and run just the tests that are likely to fail with python3.8 runtests.py -n -- --forked, but forking before every test is too slow on Cygwin to be useful for more than a couple dozen tests.

Another option would be to mark the new functions NPY_FINLINE one at a time until the segfault goes away or moves.

This page has a brief description of how to use the CYGWIN environment variable to make programs (including the python process running the numpy tests) start gdb on segfault: I don't know how to set environment variables on WIndows, so I'm not sure how useful that is (some Cygwin options need to be set before any Cygwin process starts, but env CYGWIN='error_start:C:\cygwin64\bin\gdb.exe' python3.8 runtests.py -n -vv might still work)

@r-devulap
Copy link
Member Author

@DWesl Thanks for that! And yup, its the same problem as you mentioned! Bunch of aligned mov instructions to unaligned stack pointer.

0x5af801157 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4279>        vmovaps %zmm1,0x120(%rsp)
0x5af801162 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4290>        vzeroupper
0x5af801165 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4293>        call   0x5af7f12b0 <sort_zmm_32bit<vector<float>, __vector(16) float>(float __attribute__ ((vector_size(16))))
0x5af80116a <sort_128_32bit<vector<float>, float>(float*, int32_t)+4298>        vmovups 0x20(%rsp),%zmm3
0x5af801175 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4309>        mov    %r13,%rdx
0x5af801178 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4312>        mov    %r12,%rcx
0x5af80117b <sort_128_32bit<vector<float>, float>(float*, int32_t)+4315>        vmovaps 0x160(%rsp),%zmm0
0x5af801186 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4326>        vmovaps %zmm3,0x120(%rsp)
0x5af801191 <sort_128_32bit<vector<float>, float>(float*, int32_t)+4337>        vmovups %zmm0,0xa0(%rsp)
0x5af80119c <sort_128_32bit<vector<float>, float>(float*, int32_t)+4348>        vzeroupper

@r-devulap
Copy link
Member Author

I'm trying to reproduce the Windows Python38-32bit-fast CI failure (followed the steps listed in azure-steps-windows.yml). I can't reproduce the failing tests but I see a different test fail:

FAILED numpy\distutils\tests\test_mingw32ccompiler.py::test_build_import - ValueError: 'nm.exe' found but it does not support 64-bit dlls when using 64-bit python. Supported formats: 'b'supported targets: pe-i386 pei-i386 elf32-i386 e...
1 failed, 22545 passed, 545 skipped, 1306 deselected, 35 xfailed, 28 xpassed in 703.17s (0:11:43)

Any idea where I am going wrong?

@mattip
Copy link
Member

mattip commented Oct 8, 2022

it does not support 64-bit dlls when using 64-bit python

Somehow you are running the test with a 64-bit python (sys.maxsize is used to check)

@r-devulap
Copy link
Member Author

Yes looks like it. Doesn't the CI use the python.exe downloaded from here:?

$url = "https://downloads.python.org/pypy/pypy3.8-v7.3.9-win64.zip"

And that is a 64-bit one:

PS C:\Users\rdevulap\numpy> ..\pypy3\python.exe --version
Python 3.8.12 (0089b4a7ab2306925a251b35912885d52ead1aba, Mar 16 2022, 13:51:04)
[PyPy 7.3.9 with MSC v.1929 64 bit (AMD64)]

@mattip
Copy link
Member

mattip commented Oct 10, 2022

No, that is PyPy-only. Otherwise $PYTHON_VERSION and $PYTHON_ARCH are used in the UsePythonVersion task.

@r-devulap
Copy link
Member Author

r-devulap commented Oct 11, 2022

Might need some help reproducing and figuring out why the tests fail on WIN32 and MacOS. Disabling it for now. Will open an issue so we keep track of it. I am guessing the fix should be simple, but I am somehow struggling to reproduce the errors.

@r-devulap r-devulap force-pushed the avxsort branch 2 times, most recently from f353b83 to 6b3e914 Compare October 11, 2022 18:20
@r-devulap
Copy link
Member Author

r-devulap commented Oct 12, 2022

The ever elusive "All checks have passed" 🤦‍♂️But I think the last CI failure has nothing to do with this patch.

@r-devulap
Copy link
Member Author

ping .. any idea why the two failures? one is on Arm64 and other one might also not be related (this patch is disabled on 32-bit windows).

@mattip
Copy link
Member

mattip commented Oct 31, 2022

The aarch64 run is timing out at ~80%

@r-devulap
Copy link
Member Author

r-devulap commented Nov 2, 2022

Added AVX-512 quicksort for float16. This was the tricky one and I have a feeling it can be improved further. Benchmarks:

       before           after         ratio
     [db7414b7]       [cf13cd63]
     <main>           <avxsort>
-       196±0.4μs        140±0.1μs     0.72  bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))
-         462±3μs       131±0.05μs     0.28  bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))
-         609±2μs        129±0.1μs     0.21  bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))
-         590±7μs        124±0.1μs     0.21  bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))
-        698±10μs        129±0.1μs     0.18  bench_function_base.Sort.time_sort('quick', 'float16', ('random',))
-       174±0.4μs      6.58±0.03μs     0.04  bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))
-       175±0.2μs      6.58±0.06μs     0.04  bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))

@r-devulap
Copy link
Member Author

@mattip The ARM64 failure on TravisCI is unlikely to be related to this patch (this patch is x86 code only). Looks like other PR's have the similar failures too: https://app.travis-ci.com/github/numpy/numpy/jobs/587139824 and https://app.travis-ci.com/github/numpy/numpy/jobs/587353553

@mattip
Copy link
Member

mattip commented Nov 3, 2022

@seiko2plus any thoughts?

@r-devulap
Copy link
Member Author

just a friendly ping :)

@seiko2plus
Copy link
Member

@r-devulap, sorry, currently I suffer from lack of time, please just give me 2 or 3 days max to give a full review.

@r-devulap
Copy link
Member Author

@seiko2plus Didnt mean to rush you. Please take your time.

@r-devulap
Copy link
Member Author

It seems the following directive needs to be extended to cover cygwin

// Temporarily disable AVX512 sorting on WIN32 until we can figure
// out why it has test failures
#ifdef _MSC_VER
template<typename T>
inline bool quicksort_dispatch(T*, npy_intp)

disabled on Cygwin.

@mattip
Copy link
Member

mattip commented Feb 8, 2023

This has become quite a large PR. If I understand correctly it adds new headers, new ways of dispatching, and a new dependency. Could we split this into two: one PR to enhance the dispatching mechanism, including documentation, and another to add AVX512 float16 sorting?

The speed increase is really nice. Correct me if I am wrong, I would imagine much of it comes from having a native float16 sort. If that is true, then implementing a float16 sort without casting float16 to float32 would also speed up non-AVX512 architectures.

@seiko2plus
Copy link
Member

seiko2plus commented Feb 8, 2023

This has become quite a large PR.

the other way around, the code is reduced and became easier to read.

adds new headers

Yes, that's correct. mainly initial new road for using C++ that enhances the use of type deduced and overloading
rather than using the current approach that is based on meta/tags e.g. float_tag.

The new four headers are located in src/common, and as I explained in the commit message b358ba4:

 ENH: Towards modern C++

 This patch initializes new C++ headers and also brings new
 namespace `np::` to break away from the current approach
 of using C++ which tends not to be drawn into modernity.

I can move this patch into a separate pr if it's necessary.

new ways of dispatching

Nothing changed, the same way.

Correct me if I am wrong, I would imagine much of it comes from having a native float16 sort

This patch doesn't enable any native float16 instructions, except native conversion between single/half precision.
Sort operation mainly based on 16-bit shuffle/permuting instructions has nothing to do with FP unit except for the part that related to comparison wouldn't affect that much on performance and it can be numerically emulated without even the need for cast between single/half precision.

If that is true, then implementing a float16 sort without casting float16 to float32 would also speed up non-AVX512 architectures.

yes, indeed.

Copy link
Member

@seiko2plus seiko2plus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you

@mattip mattip merged commit 0bd56e7 into numpy:main Feb 15, 2023
@mattip
Copy link
Member

mattip commented Feb 15, 2023

Thanks @r-devulap

@seberg
Copy link
Member

seberg commented Feb 15, 2023

This broke MacOS CI with (I have a PR to fix it by ensuring the submodule is initalized so headers are there):

In file included from build/src.macosx-11.7-x86_64-3.9/numpy/core/src/npysort/simd_qsort.dispatch.avx512_skx.cpp:21:
/Users/runner/work/1/s/numpy/core/src/npysort/simd_qsort.dispatch.cpp:11:14: fatal error: 'x86-simd-sort/src/avx512-32bit-qsort.hpp' file not found
    #include "x86-simd-sort/src/avx512-32bit-qsort.hpp"
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

I presume we are on hardware there were AVX512 use (or compiling for it), is OK and that submodule initialization is right (see gh-23216). If not, then some guard seems not narrow enough.

@seberg
Copy link
Member

seberg commented Feb 15, 2023

@r-devulap if you would like some visibility on this, do feel free to add a performance release note, it seems like that may be worthwhile.

@r-devulap
Copy link
Member Author

r-devulap commented Feb 15, 2023

This broke MacOS CI with (I have a PR to fix it by ensuring the submodule is initalized so headers are there):

I wonder why the macOS CI failure didnt show up here. This PR already has a commit to update submodules in the macOS CI.

@r-devulap if you would like some visibility on this, do feel free to add a performance release note, it seems like that may be worthwhile.

Will do, thanks!

@seberg
Copy link
Member

seberg commented Feb 15, 2023

Ohhh weird, so now the submodule init is there twice :/. It seemed to have been a bit random. Maybe a weird caching issue?

@mvtec-bergdolll
Copy link

Doesn't this change break this documented promise:

Previous to numpy 1.4.0 sorting real and complex arrays containing nan values led to undefined behaviour. In numpy versions >= 1.4.0 nan values are sorted to the end. The extended sort order is:

From the Intel x86-simd-sort repo:

If you expect your array to contain NANs, please be aware that the these routines do not preserve your NANs as you pass them.

@jan-wassenberg
Copy link
Contributor

A quick heads-up that the x86-simd-sort code (I downloaded HEAD from Github) may have an issue sorting large arrays.

I noticed incorrect results when integrating this into our vqsort benchmark. Happy to discuss in HN comments or here.

@seberg
Copy link
Member

seberg commented Feb 17, 2023

I opened a milestoned issue. We should investigate this (correctness and maybe also NaN sorting stability) before the next release (which is plenty of time). (Happy if discussion continues here.)

Another thing that I hope might happen is to have contributions to expand so this is not/less AVX512 specific.

@jan-wassenberg
Copy link
Contributor

Sounds good 👍

Another thing that I hope might happen is to have contributions to expand so this is not/less AVX512 specific.

Our vqsort works on S-SSE3/SSE4/AVX2/AVX-512, Arm NEON/SVE/SVE2, WASM SIMD and RISC-V V.

The x86-simd-sort work cites previous work by one of the coauthors (Mark Blacher) of vqsort, but does not include its faster sorting network and more robust pivot sampling (note: some enhancements are not mentioned in our paper because they came later).

@r-devulap
Copy link
Member Author

Doesn't this change break this documented promise:

Previous to numpy 1.4.0 sorting real and complex arrays containing nan values led to undefined behaviour. In numpy versions >= 1.4.0 nan values are sorted to the end. The extended sort order is:

From the Intel x86-simd-sort repo:

If you expect your array to contain NANs, please be aware that the these routines do not preserve your NANs as you pass them.

The AVX-512 sorting routines does put all the nan's at the end of the array. What it doesn't do is preserve your NAN's bit by bit. Meaning, if you pass 0xFFFFFFFF, it will replace it with std::nanf("1"), which is still a NAN. AFAIK, NumPy doesnt distinguish between the two.

@r-devulap
Copy link
Member Author

A quick heads-up that the x86-simd-sort code (I downloaded HEAD from Github) may have an issue sorting large arrays.

I noticed incorrect results when integrating this into our vqsort benchmark. Happy to discuss in HN comments or here.

@jan-wassenberg Couldn't figure out the bug you encountered from that link (perhaps I didn't look deep enough). Could you elaborate?

@seberg
Copy link
Member

seberg commented Feb 17, 2023

which is still a NAN. AFAIK, NumPy doesnt distinguish between the two.

Ah, OK. I agree that NumPy should not have to worry about preserving the NaN payload, thanks for clarifying. (There are interesting use-cases for using NaN payload to identify NAs which are different from NaN, but that would be a problem to solve once someone wants to do that, it might be for R which does that IIRC.)

@jan-wassenberg
Copy link
Contributor

@jan-wassenberg Couldn't figure out the bug you encountered from that link (perhaps I didn't look deep enough). Could you elaborate?

Sure. If you patch in that pull request, but remove the algo == Algo::kIntel || change, running bench_sort will raise an error complaining that the result does not match expectations:

BenchSort: i=0 of 10000 lanes: N1=1 -2146707942 -2146801302 vs. -2146300295 -2146707942
Abort at third_party/highway/hwy/contrib/sort/result-inl.h:125: 32-bit sort is incorrect

The problem is that the result doesn't match the expected sort order. But now I see what happened: x86-simd-sort doesn't seem to support reverse sorting, but our test also exercised that. Sorry I didn't notice that before - with only ascending order the test passes and it looks like all is well. I've updated our patch.

@r-devulap
Copy link
Member Author

@jan-wassenberg Sounds good. Thanks for clarifying!

@MilesParker
Copy link

Curious if these changes might impact order for element subsets that are undetermined (i.e. identical values for fields)?
Asking because I'm trying to sort out some regressions in internal tests.

@charris
Copy link
Member

charris commented Dec 7, 2023

if these changes might impact order

Not sure what you are asking here. If you need identical elements to be taken in the same order, stable sorts are what you want, quicksort variants can differ in many ways.

@MilesParker
Copy link

MilesParker commented Dec 7, 2023

[Edit: verified that this is in fact the case. Editorially, having quicksort be the default sort seems to violate principle of least surprise, but I get that numpy wants to have the best perfromance OOTB.]

Yep, that's what I'm asking. Of course it doesn't violate spec for quicksort to vary over identical fields, I was just wondering if you thought that was possible/likely with this particular change. In this case, we have regeressions tests that expect a certain ordering for underdetermined elements, even though that ordering is arbitrary, and I just need to adjust so tests don't have that expectation. (Normally I would assume implementations to stay stable wrt to algorithm behaviour on x.n releases, but it is clearly stated in release notes, so this isn't a critique, just an observation.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery triage review Issue/PR to be discussed at the next triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants