ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

r-devulap · 2023-05-03T22:28:04Z

Leverages latest optimizations to x86-simd-sort which provides AVX-512 routines for argsort. Algorithm details:

Broadly the same algorithm that was used to vectorize quicksort. Instead of using load instructions to load array into registers, this uses gather instructions to load the array elements via the index array. This allows sorting only the indices without modifying the original array or having to make a copy of it.
O(1) space
Up to 5x speed up on random arrays.
See PR: Add AVX-512 argsort for 32 and 64-bit data type intel/x86-simd-sort#34 for all the details of the algorithm.

r-devulap · 2023-05-04T19:02:25Z

PR intel/x86-simd-sort#38 should resolve some of the test failures.

charris · 2023-05-15T15:03:30Z

Needs rebase.

charris · 2023-05-16T20:02:55Z

What is the reasoning behind using intptr_t? It seems to be an optional type in the C++-11.

charris · 2023-05-16T21:10:53Z

Segfault on 32 bit windows. I think it is legitimate, but may be because of amergesort (timsort). See https://tinyurl.com/yrwhpffy.

r-devulap · 2023-05-18T04:36:26Z

What is the reasoning behind using intptr_t? It seems to be an optional type in the C++-11.

I should stick to using npy_intp. No reason to use intptr_t.

r-devulap · 2023-05-18T04:42:33Z

Segfault on 32 bit windows. I think it is legitimate, but may be because of amergesort (timsort). See https://tinyurl.com/yrwhpffy.

Yeah, I should have expected that. My AVX512 argsort implementations use i64gather instructions which means I can only use it when sizeof(npy_intp) is 8 bytes. 1d29c11 disables avx512_argsort in 32-bit mode.

r-devulap · 2023-05-18T04:57:19Z

Benchmark numbers:

      before           after         ratio
     [921a97e6]       [c03369a1]
     <main>           <argsort-avx>
+      63.9±0.1μs       98.6±0.4μs     1.54  bench_function_base.Sort.time_argsort('quick', 'uint32', ('ordered',))
+     73.5±0.02μs        110±0.2μs     1.50  bench_function_base.Sort.time_argsort('quick', 'int64', ('ordered',))
+     75.2±0.08μs       98.6±0.4μs     1.31  bench_function_base.Sort.time_argsort('quick', 'int32', ('ordered',))
+      93.1±0.1μs        106±0.2μs     1.14  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
+      92.5±0.7μs        104±0.3μs     1.12  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
-       109±0.7μs        100±0.6μs     0.92  bench_function_base.Sort.time_argsort('quick', 'int32', ('reversed',))
-       149±0.5μs        106±0.2μs     0.71  bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))
-       155±0.1μs        108±0.5μs     0.70  bench_function_base.Sort.time_argsort('quick', 'float32', ('reversed',))
-       341±0.4μs        105±0.4μs     0.31  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 1000))
-       389±0.3μs        119±0.5μs     0.31  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-       414±0.2μs        114±0.2μs     0.28  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 1000))
-       395±0.4μs        105±0.3μs     0.27  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 1000))
-       439±0.3μs        113±0.2μs     0.26  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 1000))
-       446±0.3μs        111±0.2μs     0.25  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 100))
-       511±0.3μs        127±0.4μs     0.25  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))
-       500±0.6μs        124±0.5μs     0.25  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))
-       457±0.4μs        109±0.3μs     0.24  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 10))
-       536±0.2μs        119±0.1μs     0.22  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 100))
-         504±1μs        112±0.2μs     0.22  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 100))
-       506±0.3μs        109±0.2μs     0.22  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 10))
-         561±1μs        121±0.3μs     0.22  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 10))
-       600±0.5μs        125±0.2μs     0.21  bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))
-       569±0.8μs        118±0.8μs     0.21  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 100))
-       545±0.2μs        110±0.3μs     0.20  bench_function_base.Sort.time_argsort('quick', 'uint32', ('random',))
-       586±0.5μs        117±0.2μs     0.20  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 10))
-       598±0.9μs        111±0.6μs     0.19  bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))
-         658±2μs        119±0.2μs     0.18  bench_function_base.Sort.time_argsort('quick', 'float64', ('random',))
-       698±0.3μs        118±0.2μs     0.17  bench_function_base.Sort.time_argsort('quick', 'float32', ('random',))
-     94.2±0.07μs      10.3±0.07μs     0.11  bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',))
-      99.1±0.2μs      10.4±0.06μs     0.10  bench_function_base.Sort.time_argsort('quick', 'int32', ('uniform',))
-      111±0.05μs      10.5±0.04μs     0.09  bench_function_base.Sort.time_argsort('quick', 'int64', ('uniform',))
-       177±0.2μs      11.4±0.02μs     0.06  bench_function_base.Sort.time_argsort('quick', 'float64', ('uniform',))
-       220±0.2μs      11.2±0.03μs     0.05  bench_function_base.Sort.time_argsort('quick', 'float32', ('uniform',))

charris · 2023-05-18T20:21:32Z

Let's give this a shot. Thanks @r-devulap .

r-devulap · 2023-05-18T20:49:04Z

Do you think this needs a release note? I would like to combine release notes for #22315 and this, if that is okay.

charris · 2023-05-18T21:34:51Z

I was thinking a release note would be nice, thanks for offering :) The release notes are linked to the PR, so it is probably simpler to just make two.

github-actions bot added the 01 - Enhancement label May 3, 2023

Raghuveer Devulapalli added 3 commits May 15, 2023 21:49

Bring x86-simd-sort ot latest

d9b737d

ENH: dispatch argsort to leverage avx512_argsort routines

42626c6

Update to latest commit: arg-sort bug fix

e9126d3

r-devulap force-pushed the argsort-avx branch from 9b94a15 to e9126d3 Compare May 16, 2023 05:00

Raghuveer Devulapalli added 2 commits May 16, 2023 12:22

Explicitly cast arg pointer from intptr_t* to int64_t*

54ef7ae

Update to latest: fix missing intrinsics in g++-7

2e408c6

Enable avx512_argsort only on 64-bit

1d29c11

Use npy_intp instead of intptr_t

c03369a

charris merged commit f2db090 into numpy:main May 18, 2023

r-devulap added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 22, 2023

Bulat-Ziganshin mentioned this pull request Jun 2, 2023

Consider checking out vqsort performance ClickHouse/ClickHouse#46348

Open

mroeschke mentioned this pull request Jun 7, 2023

CI/DEPS: Add xfail(strict=False) to related unstable sorting changes in Numpy 1.25 pandas-dev/pandas#53548

Merged

This was referenced Jun 27, 2023

TEST: Only use stable argsorts in PARREC tests nipy/nibabel#1234

Merged

BUG: Argsort can produce different output on different architectures #24064

Closed

eleftherioszisis mentioned this pull request Jul 7, 2023

Different results depending on availability of AVX-512 instruction set and numpy>1.25 's vectorized sort/argsort BlueBrain/NeuroR#115

Closed

charris mentioned this pull request Apr 15, 2025

BUG: Performance regression in argsort on sorted data #28714

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

Uh oh!

r-devulap commented May 3, 2023

Uh oh!

r-devulap commented May 4, 2023

Uh oh!

charris commented May 15, 2023

Uh oh!

charris commented May 16, 2023

Uh oh!

charris commented May 16, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

charris commented May 18, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

charris commented May 18, 2023

Uh oh!

Uh oh!

Uh oh!

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

Uh oh!

Conversation

r-devulap commented May 3, 2023

Uh oh!

r-devulap commented May 4, 2023

Uh oh!

charris commented May 15, 2023

Uh oh!

charris commented May 16, 2023

Uh oh!

charris commented May 16, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

charris commented May 18, 2023

Uh oh!

r-devulap commented May 18, 2023

Uh oh!

charris commented May 18, 2023

Uh oh!

Uh oh!