Q2k interleaving implementation - x86/x64 SIMD #14373

Srihari-mcw · 2025-06-25T12:52:00Z

This PR adds a block-interleaving approach for Q2_K quantization on x86/x64 AVX2/AVX512 SIMD architectures.
AVX512 and AVX2 versions are implemented for the GEMM function, whereas GEMV is implemented with AVX2 intrinsics.
The existing quantize_q8_K_4x8 function quantizes the float values to the block_q8_Kx4 format.
The repack_q2_K_to_q2_K_8_bl function rearranges weights from the Q2_K format into the Q2_Kx8 format (block_q2_Kx8).

Block Interleaving Formats

block_q2_Kx8:

Holds the data of 8 Q2_K blocks in interleaved fashion.
uint8 scales[128] - Scales and mins taken from the source Q2_K blocks. Each group of 16 bytes is packed so that it contains the scales and mins for the corresponding sub-blocks of the source Q2_K blocks. There are 16 sub-blocks in the original Q2_K structure.
The d and dmin values from the source Q2_K blocks are stored together in an array.
Quant values from the source Q2_K blocks are sequentially extracted and interleaved in groups of eight bytes.
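
The layout described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the type and function names (q2k_block, q2k_block_x8, interleave_q2k_x8) are hypothetical, fp16 values are kept as raw 16-bit words, and the field order and the split of d and dmin into two arrays are assumptions. The packing order of scales[128] is not reproduced here, so the sketch leaves that field zeroed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK_K   256  /* values per Q2_K super-block */
#define NBLK   8    /* source blocks held by one interleaved block */
#define CHUNK  8    /* quant bytes per interleaved group */

typedef uint16_t half_bits;      /* raw fp16 bits (stand-in type, an assumption) */

typedef struct {                 /* one source Q2_K block (sketch) */
    uint8_t   scales[QK_K / 16]; /* 16 sub-blocks, 4-bit scale + 4-bit min each */
    uint8_t   qs[QK_K / 4];      /* 2-bit quants, four per byte (64 bytes) */
    half_bits d;
    half_bits dmin;
} q2k_block;

typedef struct {                 /* eight blocks interleaved (sketch) */
    half_bits d[NBLK];
    half_bits dmin[NBLK];
    uint8_t   scales[128];
    uint8_t   qs[NBLK * QK_K / 4]; /* 512 bytes */
} q2k_block_x8;

void interleave_q2k_x8(const q2k_block in[NBLK], q2k_block_x8 *out) {
    /* Gather the per-block d and dmin values. */
    for (int b = 0; b < NBLK; ++b) {
        out->d[b]    = in[b].d;
        out->dmin[b] = in[b].dmin;
    }

    /* Scale packing is intentionally NOT modelled; keep the field defined. */
    memset(out->scales, 0, sizeof out->scales);

    /* Round-robin over the source blocks, copying eight quant bytes at a time:
       chunk i comes from block (i % 8), byte offset (i / 8) * 8. */
    const int n_chunks = NBLK * (QK_K / 4) / CHUNK; /* 64 chunks */
    for (int i = 0; i < n_chunks; ++i) {
        const int src_block  = i % NBLK;
        const int src_offset = (i / NBLK) * CHUNK;
        memcpy(&out->qs[i * CHUNK], &in[src_block].qs[src_offset], CHUNK);
    }
}
```

With this ordering, the first 64 bytes of qs hold the first eight quant bytes of each of the eight source blocks, which is the kind of arrangement that lets a kernel load data for eight rows with contiguous reads.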

Performance Impact :

Gains of ~5.5% were seen with the AVX2 version and ~25.5% with the AVX512 version over the base commit with GCC on Linux.

GCC Linux :

Q2_K Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	84.64 ± 0.20		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	89.26 ± 0.21	5.45%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	106.27 ± 0.32	25.54%	ef03580 - AVX512 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.81 ± 0.02		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.80 ± 0.02	-0.03%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.64 ± 0.01	-0.46%	ef03580 - AVX512 Commit

GCC Version = 12.3
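
For reference, the speedup column appears to be the relative change in tokens per second over the base commit. A minimal sketch of that arithmetic, assuming this is how the figures were derived (the function name speedup_percent is hypothetical):

```c
#include <assert.h>

/* Relative change of candidate throughput over base throughput, in percent. */
double speedup_percent(double base_tps, double candidate_tps) {
    return (candidate_tps / base_tps - 1.0) * 100.0;
}
```

Applied to the rounded pp 512 means above (84.64, 89.26 and 106.27 t/s), this lands within a few hundredths of a percentage point of the tabulated values; small differences are expected if the table was computed from unrounded measurements.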

Clang Linux:

Larger gains of ~26.3% were seen with the AVX2 version and ~53.9% with the AVX512 version over the base commit with Clang on Linux.

Q2_K Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	92.33 ± 0.20		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	116.68 ± 0.40	26.37%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	142.13 ± 0.63	53.93%	ef03580 - AVX512 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	38.26 ± 0.00		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	38.11 ± 0.01	-0.38%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.98 ± 0.01	-0.71%	ef03580 - AVX512 Commit

Clang Version = 20.1.0

The model tested was https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF

The PR was tested on an AMD Ryzen 5 9600X, which supports the following flags by default:

CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Perplexity was also tested with the Q2_K model and found to be similar before and after the change.

The perplexity results are tabulated below:

model	perplexity (Final estimate PPL)	Commit id
phi3 3B Q2_K - Medium	9.5511 +/- 0.064212	38de3fb - Base Commit
phi3 3B Q2_K - Medium	9.5488 +/- 0.06419	ef03580 - Updated Commit

slaren · 2025-07-04T13:01:04Z

I tested this on a 13900k with gcc 13 and clang 19, but the improvement is not very significant. Repacking has a significant cost, since it increases load time and prevents usage of mmap, and as it is, I find this very hard to justify for AVX2. It may make sense for AVX512, but I cannot test that.

GCC-13:

Model	Threads	Test	t/s master	t/s q2k_interleaving_implementation	Speedup
llama 7B Q2_K_M	8	pp64	47.56	48.55	1.02
llama 7B Q2_K_M	8	tg32	19.92	19.38	0.97
llama 7B Q2_K_M	16	pp64	63.04	60.08	0.95
llama 7B Q2_K_M	16	tg32	21.22	20.44	0.96
llama 7B Q2_K_M	24	pp64	68.39	68.07	1.00
llama 7B Q2_K_M	24	tg32	19.72	19.76	1.00
llama 7B Q2_K_M	32	pp64	71.18	71.62	1.01
llama 7B Q2_K_M	32	tg32	17.87	17.51	0.98

Clang-19:

Model	Threads	Test	t/s master	t/s q2k_interleaving_implementation	Speedup
llama 7B Q2_K_M	8	pp64	48.28	52.27	1.08
llama 7B Q2_K_M	8	tg32	20.78	19.08	0.92
llama 7B Q2_K_M	16	pp64	65.23	61.42	0.94
llama 7B Q2_K_M	16	tg32	20.94	19.79	0.95
llama 7B Q2_K_M	24	pp64	69.69	71.17	1.02
llama 7B Q2_K_M	24	tg32	19.90	19.59	0.98
llama 7B Q2_K_M	32	pp64	71.04	75.26	1.06
llama 7B Q2_K_M	32	tg32	16.91	17.30	1.02

Srihari-mcw · 2025-07-11T16:08:32Z

Hi @slaren,
Thanks for the reply. Based on your feedback and further internal testing, we have updated the patch to enable repacking only on machines with AVX512 support, so that the patch can be considered as an optimization for AVX512-based machines. We will continue investigating AVX2 performance.

Thanks
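
The gating described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in the patch: the enum, the function name choose_q2k_layout and the row-count check (a multiple of 8, since eight blocks are interleaved together) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    LAYOUT_Q2_K_PLAIN, /* keep the original Q2_K tensor (mmap stays usable) */
    LAYOUT_Q2_K_X8     /* repack into the interleaved 8-block format */
} q2k_layout;

/* Repack only when the CPU reports AVX512 and the rows can be grouped by 8. */
q2k_layout choose_q2k_layout(bool cpu_has_avx512, int64_t n_rows) {
    if (cpu_has_avx512 && n_rows % 8 == 0) {
        return LAYOUT_Q2_K_X8;
    }
    return LAYOUT_Q2_K_PLAIN;
}
```

On AVX2-only machines this falls back to the plain layout, so they avoid the load-time and mmap costs raised in the review.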

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jun 25, 2025

Srihari-mcw changed the title from "Q2k interleaving implementation" to "Q2k interleaving implementation - x86/x64 SIMD" Jun 25, 2025

Srihari-mcw and others added 2 commits June 26, 2025 11:24

Initial Q2_K Block Interleaving Implementation

d82cdc2

Addressed review comments and clean up of the code

48b4d5a

Srihari-mcw force-pushed the q2k_interleaving_implementation branch 2 times, most recently from ba56a3c to 39ab344 Compare June 26, 2025 06:00

Post rebase fixes

c2c53bc

Srihari-mcw force-pushed the q2k_interleaving_implementation branch from 39ab344 to c2c53bc Compare June 26, 2025 06:02

Manogna-Sree added 3 commits June 30, 2025 06:35

Initial CI/CD fixes

6426ad5

Update declarations in arch-fallback.h

d017195

Changes for GEMV Q2_K in arch-fallback.h

eb7f9a3

Enable repacking only on AVX-512 machines

3f6c61d

Srihari-mcw force-pushed the q2k_interleaving_implementation branch from 75dd04b to 3f6c61d Compare July 11, 2025 11:28

slaren approved these changes Jul 17, 2025

ggerganov approved these changes Jul 17, 2025
