Skip to content

ENH: speed up 32-bit and 64-bit np.argsort by 5x with AVX-512 #23707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 18, 2023

Conversation

r-devulap
Copy link
Member

Leverages latest optimizations to x86-simd-sort which provides AVX-512 routines for argsort. Algorithm details:

  • Broadly the same algorithm that was used to vectorize quicksort. Instead of using load instructions to load array into registers, this uses gather instructions to load the array elements via the index array. This allows sorting only the indices without modifying the original array or having to make a copy of it.
  • O(1) space
  • Up to 5x speed up on random arrays.
  • See PR: Add AVX-512 argsort for 32 and 64-bit data type intel/x86-simd-sort#34 for all the details of the algorithm.

@r-devulap
Copy link
Member Author

PR intel/x86-simd-sort#38 should resolve some of the test failures.

@charris
Copy link
Member

charris commented May 15, 2023

Needs rebase.

@charris
Copy link
Member

charris commented May 16, 2023

What is the reasoning behind using intptr_t? It seems to be an optional type in the C++-11.

@charris
Copy link
Member

charris commented May 16, 2023

Segfault on 32 bit windows. I think it is legitimate, but may be because of amergesort (timsort). See https://tinyurl.com/yrwhpffy.

@r-devulap
Copy link
Member Author

What is the reasoning behind using intptr_t? It seems to be an optional type in the C++-11.

I should stick to using npy_intp. No reason to use intptr_t.

@r-devulap
Copy link
Member Author

Segfault on 32 bit windows. I think it is legitimate, but may be because of amergesort (timsort). See https://tinyurl.com/yrwhpffy.

Yeah, I should have expected that. My AVX512 argsort implementations use i64gather instructions which means I can only use it when sizeof(npy_intp) is 8 bytes. 1d29c11 disables avx512_argsort in 32-bit mode.

@r-devulap
Copy link
Member Author

Benchmark numbers:

      before           after         ratio
     [921a97e6]       [c03369a1]
     <main>           <argsort-avx>
+      63.9±0.1μs       98.6±0.4μs     1.54  bench_function_base.Sort.time_argsort('quick', 'uint32', ('ordered',))
+     73.5±0.02μs        110±0.2μs     1.50  bench_function_base.Sort.time_argsort('quick', 'int64', ('ordered',))
+     75.2±0.08μs       98.6±0.4μs     1.31  bench_function_base.Sort.time_argsort('quick', 'int32', ('ordered',))
+      93.1±0.1μs        106±0.2μs     1.14  bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
+      92.5±0.7μs        104±0.3μs     1.12  bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
-       109±0.7μs        100±0.6μs     0.92  bench_function_base.Sort.time_argsort('quick', 'int32', ('reversed',))
-       149±0.5μs        106±0.2μs     0.71  bench_function_base.Sort.time_argsort('quick', 'float64', ('reversed',))
-       155±0.1μs        108±0.5μs     0.70  bench_function_base.Sort.time_argsort('quick', 'float32', ('reversed',))
-       341±0.4μs        105±0.4μs     0.31  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 1000))
-       389±0.3μs        119±0.5μs     0.31  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))
-       414±0.2μs        114±0.2μs     0.28  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 1000))
-       395±0.4μs        105±0.3μs     0.27  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 1000))
-       439±0.3μs        113±0.2μs     0.26  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 1000))
-       446±0.3μs        111±0.2μs     0.25  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 100))
-       511±0.3μs        127±0.4μs     0.25  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))
-       500±0.6μs        124±0.5μs     0.25  bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))
-       457±0.4μs        109±0.3μs     0.24  bench_function_base.Sort.time_argsort('quick', 'uint32', ('sorted_block', 10))
-       536±0.2μs        119±0.1μs     0.22  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 100))
-         504±1μs        112±0.2μs     0.22  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 100))
-       506±0.3μs        109±0.2μs     0.22  bench_function_base.Sort.time_argsort('quick', 'int32', ('sorted_block', 10))
-         561±1μs        121±0.3μs     0.22  bench_function_base.Sort.time_argsort('quick', 'float64', ('sorted_block', 10))
-       600±0.5μs        125±0.2μs     0.21  bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))
-       569±0.8μs        118±0.8μs     0.21  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 100))
-       545±0.2μs        110±0.3μs     0.20  bench_function_base.Sort.time_argsort('quick', 'uint32', ('random',))
-       586±0.5μs        117±0.2μs     0.20  bench_function_base.Sort.time_argsort('quick', 'float32', ('sorted_block', 10))
-       598±0.9μs        111±0.6μs     0.19  bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))
-         658±2μs        119±0.2μs     0.18  bench_function_base.Sort.time_argsort('quick', 'float64', ('random',))
-       698±0.3μs        118±0.2μs     0.17  bench_function_base.Sort.time_argsort('quick', 'float32', ('random',))
-     94.2±0.07μs      10.3±0.07μs     0.11  bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',))
-      99.1±0.2μs      10.4±0.06μs     0.10  bench_function_base.Sort.time_argsort('quick', 'int32', ('uniform',))
-      111±0.05μs      10.5±0.04μs     0.09  bench_function_base.Sort.time_argsort('quick', 'int64', ('uniform',))
-       177±0.2μs      11.4±0.02μs     0.06  bench_function_base.Sort.time_argsort('quick', 'float64', ('uniform',))
-       220±0.2μs      11.2±0.03μs     0.05  bench_function_base.Sort.time_argsort('quick', 'float32', ('uniform',))

@charris charris merged commit f2db090 into numpy:main May 18, 2023
@charris
Copy link
Member

charris commented May 18, 2023

Let's give this a shot. Thanks @r-devulap .

@r-devulap
Copy link
Member Author

Do you think this needs a release note? I would like to combine release notes for #22315 and this, if that is okay.

@charris
Copy link
Member

charris commented May 18, 2023

I was thinking a release note would be nice, thanks for offering :) The release notes are linked to the PR, so it is probably simpler to just make two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants