Q2k interleaving implementation - x86/x64 SIMD #14373
I tested this on a 13900K with GCC 13 and Clang 19, but the improvement is not very significant. Repacking has a significant cost, since it increases load time and prevents the use of mmap, and as it stands I find this very hard to justify for AVX2. It may make sense for AVX512, but I cannot test that.
GCC-13:
Clang-19:
Hi @slaren, Thanks
Block Interleaving Formats
Block_Q2_Kx8 :
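As an illustration of what an 8-row interleaved Q2_K container can look like, here is a minimal sketch. The struct names `BlockQ2K` and `BlockQ2Kx8`, the helper `interleave8`, and the 8-byte chunk size are assumptions made for this example and are not taken from the PR's code; fp16 values are kept as raw 16-bit patterns, and the scales are left row-contiguous for simplicity, whereas a real kernel would likely rearrange them as well.

```cpp
#include <cstdint>
#include <cstring>

constexpr int QK_K   = 256; // weights per Q2_K super-block
constexpr int kRows  = 8;   // rows packed together (the "x8")
constexpr int kChunk = 8;   // bytes taken from each row per step (assumed)

// Stand-in for one Q2_K super-block (hypothetical name).
struct BlockQ2K {
    uint8_t  scales[QK_K / 16]; // 4-bit scale and 4-bit min per 16 weights
    uint8_t  qs[QK_K / 4];      // 2-bit quants, four per byte
    uint16_t d;                 // fp16 bits: super-block scale
    uint16_t dmin;              // fp16 bits: super-block min scale
};

// Hypothetical interleaved container holding the same super-block of 8 rows.
struct BlockQ2Kx8 {
    uint16_t d[kRows];
    uint16_t dmin[kRows];
    uint8_t  scales[kRows * QK_K / 16];
    uint8_t  qs[kRows * QK_K / 4];
};

static_assert(sizeof(BlockQ2Kx8) == kRows * sizeof(BlockQ2K),
              "interleaving should not change the total size");

// Pack 8 row blocks so that consecutive 8-byte chunks of qs come from
// rows 0,1,...,7 in turn. A SIMD kernel can then load data for all
// 8 rows with contiguous reads.
BlockQ2Kx8 interleave8(const BlockQ2K (&in)[kRows]) {
    BlockQ2Kx8 out{};
    for (int r = 0; r < kRows; ++r) {
        out.d[r]    = in[r].d;
        out.dmin[r] = in[r].dmin;
        // Simplification: scales are copied row by row, not interleaved.
        std::memcpy(out.scales + r * (QK_K / 16), in[r].scales, QK_K / 16);
    }
    const int chunks = kRows * (QK_K / 4) / kChunk;
    for (int i = 0; i < chunks; ++i) {
        const int row = i % kRows;            // source row for this chunk
        const int src = (i / kRows) * kChunk; // byte offset inside that row
        std::memcpy(out.qs + i * kChunk, in[row].qs + src, kChunk);
    }
    return out;
}
```

Because the packed layout differs from the on-disk layout, a conversion like this has to run at load time, which is the source of the load-time and mmap cost raised in the review comment above.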
Performance Impact :
Gains of ~5.5% were seen with the AVX2 version and ~25.5% with the AVX512 version over the base commit with GCC on Linux.
GCC Linux :
Q2_K Model :
GCC Version = 12.3
Clang Linux:
Larger gains of ~26.3% were seen with the AVX2 version and ~53.9% with the AVX512 version over the base commit with Clang on Linux.
Q2_K Model :
Clang Version = 20.1.0
The model tested was https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF
The PR was tested on an AMD Ryzen 5 9600X, which supports the following flags by default:
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Perplexity was also tested with the Q2_K model and found to be similar to the base commit.
The perplexity results are tabulated as follows: