[AMDGPU] Add support for `v_exp_bf16` on gfx1250 #149229

shiltian · 2025-07-17T00:19:52Z

Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>

shiltian · 2025-07-17T00:20:13Z

[AMDGPU] Add support for v_sin_bf16 on gfx1250 #149241
[AMDGPU] Add support for v_exp_bf16 on gfx1250 #149229 (this pull request)
main

This stack of pull requests is managed by Graphite.

llvmbot · 2025-07-17T00:20:21Z

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-clang-codegen

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>

Patch is 78.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149229.diff

25 Files Affected:

(modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1)
(modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+1)
(modified) clang/test/CodeGenOpenCL/builtins-amdgcn-gfx1250.cl (+19)
(modified) llvm/lib/Target/AMDGPU/VOP1Instructions.td (+2)
(modified) llvm/test/CodeGen/AMDGPU/bf16-math.ll (+23)
(added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.bf16.ll (+33)
(added) llvm/test/CodeGen/AMDGPU/llvm.exp2.bfloat.ll (+240)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop1-fake16.s (+45)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop1.s (+48)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop1_dpp16-fake16.s (+56)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop1_dpp16.s (+60)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop1_dpp8-fake16.s (+12)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop1_dpp8.s (+16)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop3_from_vop1-fake16.s (+45)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop3_from_vop1.s (+48)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop3_from_vop1_dpp16-fake16.s (+56)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop3_from_vop1_dpp16.s (+60)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop3_from_vop1_dpp8-fake16.s (+16)
(modified) llvm/test/MC/AMDGPU/gfx1250_asm_vop3_from_vop1_dpp8.s (+20)
(modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop1.txt (+63)
(modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop1_dpp16.txt (+59)
(modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop1_dpp8.txt (+15)
(modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop3_from_vop1.txt (+64)
(modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop3_from_vop1_dpp16.txt (+60)
(modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop3_from_vop1_dpp8.txt (+20)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index eee0a94f6fc64..7eb5e2acc8b37 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -673,6 +673,7 @@ TARGET_BUILTIN(__builtin_amdgcn_tanh_bf16, "yy", "nc", "bf16-trans-insts")
 TARGET_BUILTIN(__builtin_amdgcn_rcp_bf16, "yy", "nc", "bf16-trans-insts")
 TARGET_BUILTIN(__builtin_amdgcn_rsq_bf16, "yy", "nc", "bf16-trans-insts")
 TARGET_BUILTIN(__builtin_amdgcn_log_bf16, "yy", "nc", "bf16-trans-insts")
+TARGET_BUILTIN(__builtin_amdgcn_exp2_bf16, "yy", "nc", "bf16-trans-insts")
 
 TARGET_BUILTIN(__builtin_amdgcn_cvt_f16_fp8, "hiIi", "nc", "gfx1250-insts")
 TARGET_BUILTIN(__builtin_amdgcn_cvt_f16_bf8, "hiIi", "nc", "gfx1250-insts")
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 0312205d4ff8d..f7450373d1309 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -439,6 +439,7 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_log_bf16:
     return emitBuiltinWithOneOverloadedType<1>(*this, E, Intrinsic::amdgcn_log);
   case AMDGPU::BI__builtin_amdgcn_exp2f:
+  case AMDGPU::BI__builtin_amdgcn_exp2_bf16:
     return emitBuiltinWithOneOverloadedType<1>(*this, E,
                                                Intrinsic::amdgcn_exp2);
   case AMDGPU::BI__builtin_amdgcn_log_clampf:
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx1250.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx1250.cl
index bdf169a1a97da..7b1fd8aefe5be 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx1250.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx1250.cl
@@ -118,6 +118,25 @@ void test_log_bf16(global __bf16* out, __bf16 a)
   *out = __builtin_amdgcn_log_bf16(a);
 }
 
+// CHECK-LABEL: @test_exp2_bf16(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[OUT_ADDR:%.*]] = alloca ptr addrspace(1), align 8, addrspace(5)
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca bfloat, align 2, addrspace(5)
+// CHECK-NEXT:    [[OUT_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[OUT_ADDR]] to ptr
+// CHECK-NEXT:    [[A_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[A_ADDR]] to ptr
+// CHECK-NEXT:    store ptr addrspace(1) [[OUT:%.*]], ptr [[OUT_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    store bfloat [[A:%.*]], ptr [[A_ADDR_ASCAST]], align 2
+// CHECK-NEXT:    [[TMP0:%.*]] = load bfloat, ptr [[A_ADDR_ASCAST]], align 2
+// CHECK-NEXT:    [[TMP1:%.*]] = call bfloat @llvm.amdgcn.exp2.bf16(bfloat [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = load ptr addrspace(1), ptr [[OUT_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    store bfloat [[TMP1]], ptr addrspace(1) [[TMP2]], align 2
+// CHECK-NEXT:    ret void
+//
+void test_exp2_bf16(global __bf16* out, __bf16 a)
+{
+  *out = __builtin_amdgcn_exp2_bf16(a);
+}
+
 // CHECK-LABEL: @test_cvt_f16_fp8(
 // CHECK-NEXT:  entry:
 // CHECK-NEXT:    [[OUT_ADDR:%.*]] = alloca ptr addrspace(1), align 8, addrspace(5)
diff --git a/llvm/lib/Target/AMDGPU/VOP1Instructions.td b/llvm/lib/Target/AMDGPU/VOP1Instructions.td
index e1bc39302e126..d93f5e5b81454 100644
--- a/llvm/lib/Target/AMDGPU/VOP1Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP1Instructions.td
@@ -533,6 +533,7 @@ defm V_RCP_BF16  : VOP1Inst_t16 <"v_rcp_bf16",  VOP_BF16_BF16, AMDGPUrcp>;
 defm V_SQRT_BF16 : VOP1Inst_t16 <"v_sqrt_bf16", VOP_BF16_BF16, any_amdgcn_sqrt>;
 defm V_RSQ_BF16  : VOP1Inst_t16 <"v_rsq_bf16",  VOP_BF16_BF16, AMDGPUrsq>;
 defm V_LOG_BF16  : VOP1Inst_t16 <"v_log_bf16",  VOP_BF16_BF16, AMDGPUlogf16>;
+defm V_EXP_BF16  : VOP1Inst_t16 <"v_exp_bf16",  VOP_BF16_BF16, AMDGPUexpf16>;
 }
 } // End TRANS = 1, SchedRW = [WriteTrans32]
 defm V_FREXP_MANT_F16 : VOP1Inst_t16 <"v_frexp_mant_f16", VOP_F16_F16, int_amdgcn_frexp_mant>;
@@ -1145,6 +1146,7 @@ defm V_RCP_BF16              : VOP1_Real_FULL_t16_and_fake16_gfx1250<0x079>;
 defm V_SQRT_BF16             : VOP1_Real_FULL_t16_and_fake16_gfx1250<0x07a>;
 defm V_RSQ_BF16              : VOP1_Real_FULL_t16_and_fake16_gfx1250<0x07b>;
 defm V_LOG_BF16              : VOP1_Real_FULL_t16_and_fake16_gfx1250<0x07c>;
+defm V_EXP_BF16              : VOP1_Real_FULL_t16_and_fake16_gfx1250<0x07d>;
 
 //===----------------------------------------------------------------------===//
 // GFX10.
diff --git a/llvm/test/CodeGen/AMDGPU/bf16-math.ll b/llvm/test/CodeGen/AMDGPU/bf16-math.ll
index 05eee2d4d549d..029604c2933a9 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16-math.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16-math.ll
@@ -25,4 +25,27 @@ define amdgpu_ps void @llvm_log2_bf16_s(ptr addrspace(1) %out, bfloat inreg %src
   ret void
 }
 
+define amdgpu_ps void @llvm_exp2_bf16_v(ptr addrspace(1) %out, bfloat %src) {
+; GCN-LABEL: llvm_exp2_bf16_v:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    v_exp_bf16_e32 v2, v2
+; GCN-NEXT:    global_store_b16 v[0:1], v2, off
+; GCN-NEXT:    s_endpgm
+  %exp = call bfloat @llvm.exp2.bf16(bfloat %src)
+  store bfloat %exp, ptr addrspace(1) %out, align 2
+  ret void
+}
+
+define amdgpu_ps void @llvm_exp2_bf16_s(ptr addrspace(1) %out, bfloat inreg %src) {
+; GCN-LABEL: llvm_exp2_bf16_s:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    v_exp_bf16_e32 v2, s0
+; GCN-NEXT:    global_store_b16 v[0:1], v2, off
+; GCN-NEXT:    s_endpgm
+  %exp = call bfloat @llvm.exp2.bf16(bfloat %src)
+  store bfloat %exp, ptr addrspace(1) %out, align 2
+  ret void
+}
+
 declare bfloat @llvm.log2.bf16(bfloat)
+declare bfloat @llvm.exp2.bf16(bfloat)
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.bf16.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.bf16.ll
new file mode 100644
index 0000000000000..6304923790ad5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.bf16.ll
@@ -0,0 +1,33 @@
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx1250 < %s | FileCheck -check-prefixes=GCN %s
+; xUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx1250 < %s | FileCheck -check-prefix=GCN %s
+
+; FIXME: GlobalISel does not work with bf16
+
+declare bfloat @llvm.amdgcn.exp2.bf16(bfloat) #0
+
+; GCN-LABEL: {{^}}exp_bf16:
+; GCN: v_exp_bf16_e32 {{v[0-9]+}}, {{s[0-9]+}}
+define amdgpu_kernel void @exp_bf16(ptr addrspace(1) %out, bfloat %src) #1 {
+  %exp = call bfloat @llvm.amdgcn.exp2.bf16(bfloat %src) #0
+  store bfloat %exp, ptr addrspace(1) %out, align 2
+  ret void
+}
+
+; GCN-LABEL: {{^}}exp_bf16_constant_4
+; GCN: v_exp_bf16_e32 v0, 4.0
+define amdgpu_kernel void @exp_bf16_constant_4(ptr addrspace(1) %out) #1 {
+  %exp = call bfloat @llvm.amdgcn.exp2.bf16(bfloat 4.0) #0
+  store bfloat %exp, ptr addrspace(1) %out, align 2
+  ret void
+}
+
+; GCN-LABEL: {{^}}exp_bf16_constant_100
+; GCN: v_exp_bf16_e32 {{v[0-9]+}}, 0x42c8
+define amdgpu_kernel void @exp_bf16_constant_100(ptr addrspace(1) %out) #1 {
+  %exp = call bfloat @llvm.amdgcn.exp2.bf16(bfloat 100.0) #0
+  store bfloat %exp, ptr addrspace(1) %out, align 2
+  ret void
+}
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { nounwind }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.exp2.bfloat.ll b/llvm/test/CodeGen/AMDGPU/llvm.exp2.bfloat.ll
new file mode 100644
index 0000000000000..a48e34ae7a8f8
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.exp2.bfloat.ll
@@ -0,0 +1,240 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -mattr=+real-true16 %s -o - | FileCheck %s -check-prefixes=SDAG-TRUE16
+; RUN: llc -global-isel=0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -mattr=-real-true16 %s -o - | FileCheck %s -check-prefixes=SDAG-FAKE16
+; xUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -mattr=+real-true16 %s -o - | FileCheck %s -check-prefixes=GI-TRUE16
+; xUN: llc -global-isel=1 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -mattr=-real-true16 %s -o - | FileCheck %s -check-prefixes=GI-FAKE16
+
+define bfloat @v_exp2_bf16(bfloat %in) {
+; SDAG-TRUE16-LABEL: v_exp2_bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.l, v0.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v0, v0
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %result = call bfloat @llvm.exp2.bf16(bfloat %in)
+  ret bfloat %result
+}
+
+define bfloat @v_exp2_fabs_bf16(bfloat %in) {
+; SDAG-TRUE16-LABEL: v_exp2_fabs_bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.l, |v0.l|
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_fabs_bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v0, |v0|
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %fabs = call bfloat @llvm.fabs.bf16(bfloat %in)
+  %result = call bfloat @llvm.exp2.bf16(bfloat %fabs)
+  ret bfloat %result
+}
+
+define bfloat @v_exp2_fneg_fabs_bf16(bfloat %in) {
+; SDAG-TRUE16-LABEL: v_exp2_fneg_fabs_bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.l, -|v0.l|
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_fneg_fabs_bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v0, -|v0|
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %fabs = call bfloat @llvm.fabs.bf16(bfloat %in)
+  %fneg.fabs = fneg bfloat %fabs
+  %result = call bfloat @llvm.exp2.bf16(bfloat %fneg.fabs)
+  ret bfloat %result
+}
+
+define bfloat @v_exp2_fneg_bf16(bfloat %in) {
+; SDAG-TRUE16-LABEL: v_exp2_fneg_bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.l, -v0.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_fneg_bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v0, -v0
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %fneg = fneg bfloat %in
+  %result = call bfloat @llvm.exp2.bf16(bfloat %fneg)
+  ret bfloat %result
+}
+
+define bfloat @v_exp2_bf16_fast(bfloat %in) {
+; SDAG-TRUE16-LABEL: v_exp2_bf16_fast:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.l, v0.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_bf16_fast:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v0, v0
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %result = call fast bfloat @llvm.exp2.bf16(bfloat %in)
+  ret bfloat %result
+}
+
+define <2 x bfloat> @v_exp2_v2bf16(<2 x bfloat> %in) {
+; SDAG-TRUE16-LABEL: v_exp2_v2bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.h, v0.h
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.l, v0.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_v2bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_lshrrev_b32_e32 v1, 16, v0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v0, v0
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v1, v1
+; SDAG-FAKE16-NEXT:    v_nop
+; SDAG-FAKE16-NEXT:    v_perm_b32 v0, v1, v0, 0x5040100
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %result = call <2 x bfloat> @llvm.exp2.v2bf16(<2 x bfloat> %in)
+  ret <2 x bfloat> %result
+}
+
+define <2 x bfloat> @v_exp2_fabs_v2bf16(<2 x bfloat> %in) {
+; SDAG-TRUE16-LABEL: v_exp2_fabs_v2bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_and_b32_e32 v1, 0x7fff7fff, v0
+; SDAG-TRUE16-NEXT:    v_bfe_u32 v2, v0, 16, 15
+; SDAG-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.l, v1.l
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.h, v2.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_fabs_v2bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_and_b32_e32 v1, 0x7fff7fff, v0
+; SDAG-FAKE16-NEXT:    v_bfe_u32 v0, v0, 16, 15
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v1, v1
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v0, v0
+; SDAG-FAKE16-NEXT:    v_nop
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(TRANS32_DEP_1)
+; SDAG-FAKE16-NEXT:    v_perm_b32 v0, v0, v1, 0x5040100
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %fabs = call <2 x bfloat> @llvm.fabs.v2bf16(<2 x bfloat> %in)
+  %result = call <2 x bfloat> @llvm.exp2.v2bf16(<2 x bfloat> %fabs)
+  ret <2 x bfloat> %result
+}
+
+define <2 x bfloat> @v_exp2_fneg_fabs_v2bf16(<2 x bfloat> %in) {
+; SDAG-TRUE16-LABEL: v_exp2_fneg_fabs_v2bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_and_b32_e32 v1, 0x7fff7fff, v0
+; SDAG-TRUE16-NEXT:    v_bfe_u32 v2, v0, 16, 15
+; SDAG-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.l, -v1.l
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.h, -v2.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_fneg_fabs_v2bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_and_b32_e32 v1, 0x7fff7fff, v0
+; SDAG-FAKE16-NEXT:    v_bfe_u32 v0, v0, 16, 15
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v1, -v1
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v0, -v0
+; SDAG-FAKE16-NEXT:    v_nop
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(TRANS32_DEP_1)
+; SDAG-FAKE16-NEXT:    v_perm_b32 v0, v0, v1, 0x5040100
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %fabs = call <2 x bfloat> @llvm.fabs.v2bf16(<2 x bfloat> %in)
+  %fneg.fabs = fneg <2 x bfloat> %fabs
+  %result = call <2 x bfloat> @llvm.exp2.v2bf16(<2 x bfloat> %fneg.fabs)
+  ret <2 x bfloat> %result
+}
+
+define <2 x bfloat> @v_exp2_fneg_v2bf16(<2 x bfloat> %in) {
+; SDAG-TRUE16-LABEL: v_exp2_fneg_v2bf16:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.h, -v0.h
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e64 v0.l, -v0.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_fneg_v2bf16:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_lshrrev_b32_e32 v1, 16, v0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v0, -v0
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e64 v1, -v1
+; SDAG-FAKE16-NEXT:    v_nop
+; SDAG-FAKE16-NEXT:    v_perm_b32 v0, v1, v0, 0x5040100
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %fneg = fneg <2 x bfloat> %in
+  %result = call <2 x bfloat> @llvm.exp2.v2bf16(<2 x bfloat> %fneg)
+  ret <2 x bfloat> %result
+}
+
+define <2 x bfloat> @v_exp2_v2bf16_fast(<2 x bfloat> %in) {
+; SDAG-TRUE16-LABEL: v_exp2_v2bf16_fast:
+; SDAG-TRUE16:       ; %bb.0:
+; SDAG-TRUE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-TRUE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.h, v0.h
+; SDAG-TRUE16-NEXT:    v_exp_bf16_e32 v0.l, v0.l
+; SDAG-TRUE16-NEXT:    s_set_pc_i64 s[30:31]
+;
+; SDAG-FAKE16-LABEL: v_exp2_v2bf16_fast:
+; SDAG-FAKE16:       ; %bb.0:
+; SDAG-FAKE16-NEXT:    s_wait_loadcnt_dscnt 0x0
+; SDAG-FAKE16-NEXT:    s_wait_kmcnt 0x0
+; SDAG-FAKE16-NEXT:    v_lshrrev_b32_e32 v1, 16, v0
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v0, v0
+; SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(TRANS32_DEP_1)
+; SDAG-FAKE16-NEXT:    v_exp_bf16_e32 v1, v1
+; SDAG-FAKE16-NEXT:    v_nop
+; SDAG-FAKE16-NEXT:    v_perm_b32 v0, v1, v0, 0x5040100
+; SDAG-FAKE16-NEXT:    s_set_pc_i64 s[30:31]
+  %result = call fast <2 x bfloat> @llvm.exp2.v2bf16(<2 x bfloat> %in)
+  ret <2 x bfloat> %result
+}
+
+declare bfloat @llvm.exp2.bf16(bfloat) #0
+declare <2 x bfloat> @llvm.exp2.v2bf16(<2 x bfloat>) #0
+declare bfloat @llvm.fabs.bf16(bfloat) #0
+declare <2 x bfloat> @llvm.fabs.v2bf16(<2 x bfloat>) #0
+
+attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
diff --git a/llvm/test/MC/AMDGPU/gfx1250_asm_vop1-fake16.s b/llvm/test/MC/AMDGPU/gfx1250_asm_vop1-fake16.s
index 0f5ce56f1a2cf..426f480200e4b 100644
--- a/llvm/test/MC/AMDGPU/gfx1250_asm_vop1-fake16.s
+++ b/llvm/test/MC/AMDGPU/gfx1250_asm_vop1-fake16.s
@@ -253,6 +253,51 @@ v_log_bf16 v5, src_scc
 v_log_bf16 v127, 0x8000
 // GFX1250: v_log_bf16_e32 v127, 0x8000             ; encoding: [0xff,0xf8,0xfe,0x7e,0x00,0x80,0x00,0x00]
 
+v_exp_bf16 v5, v1
+// GFX1250: v_exp_bf16_e32 v5, v1                   ; encoding: [0x01,0xfb,0x0a,0x7e]
+
+v_exp_bf16 v5, v127
+// GFX1250: v_exp_bf16_e32 v5, v127                 ; encoding: [0x7f,0xfb,0x0a,0x7e]
+
+v_exp_bf16 v5, s1
+// GFX1250: v_exp_bf16_e32 v5, s1                   ; encoding: [0x01,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, s105
+// GFX1250: v_exp_bf16_e32 v5, s105                 ; encoding: [0x69,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, vcc_lo
+// GFX1250: v_exp_bf16_e32 v5, vcc_lo               ; encoding: [0x6a,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, vcc_hi
+// GFX1250: v_exp_bf16_e32 v5, vcc_hi               ; encoding: [0x6b,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, ttmp15
+// GFX1250: v_exp_bf16_e32 v5, ttmp15               ; encoding: [0x7b,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, m0
+// GFX1250: v_exp_bf16_e32 v5, m0                   ; encoding: [0x7d,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, exec_lo
+// GFX1250: v_exp_bf16_e32 v5, exec_lo              ; encoding: [0x7e,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, exec_hi
+// GFX1250: v_exp_bf16_e32 v5, exec_hi              ; encoding: [0x7f,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, null
+// GFX1250: v_exp_bf16_e32 v5, null                 ; encoding: [0x7c,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, -1
+// GFX1250: v_exp_bf16_e32 v5, -1                   ; encoding: [0xc1,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, 0.5
+// GFX1250: v_exp_bf16_e32 v5, 0.5                  ; encoding: [0xf0,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, src_scc
+// GFX1250: v_exp_bf16_e32 v5, src_scc              ; encoding: [0xfd,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v127, 0x8000
+// GFX1250: v_exp_bf16_e32 v127, 0x8000             ; encoding: [0xff,0xfa,0xfe,0x7e,0x00,0x80,0x00,0x00]
+
 v_cvt_f32_bf16 v5, v1
 // GFX1250: v_cvt_f32_bf16_e32 v5, v1               ; encoding: [0x01,0xe5,0x0a,0x7e]
 
diff --git a/llvm/test/MC/AMDGPU/gfx1250_asm_vop1.s b/llvm/test/MC/AMDGPU/gfx1250_asm_vop1.s
index 9dd11e6249b27..93999043d0fb8 100644
--- a/llvm/test/MC/AMDGPU/gfx1250_asm_vop1.s
+++ b/llvm/test/MC/AMDGPU/gfx1250_asm_vop1.s
@@ -268,6 +268,54 @@ v_log_bf16 v127, 0x8000
 v_log_bf16 v5.h, v1.h
 // GFX1250: v_log_bf16_e32 v5.h, v1.h               ; encoding: [0x81,0xf9,0x0a,0x7f]
 
+v_exp_bf16 v5, v1
+// GFX1250: v_exp_bf16_e32 v5, v1                   ; encoding: [0x01,0xfb,0x0a,0x7e]
+
+v_exp_bf16 v5, v127
+// GFX1250: v_exp_bf16_e32 v5, v127                 ; encoding: [0x7f,0xfb,0x0a,0x7e]
+
+v_exp_bf16 v5, s1
+// GFX1250: v_exp_bf16_e32 v5, s1                   ; encoding: [0x01,0xfa,0x0a,0x7e]
+
+v_exp_bf16 v5, s105
+// GF...
[truncated]
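
The `encoding:` bytes in the MC tests can be cross-checked against the opcode `0x07d` assigned in VOP1Instructions.td. The sketch below is a minimal decoder, assuming the usual VOP1 field layout (bits 31:25 fixed at `0x3f`, `vdst` in 24:17, `op` in 16:9, `src0` in 8:0, with VGPR operands starting at 256). The PR does not state that layout; it is my reading of the format, though it is consistent with the bytes listed in the tests. `Vop1Fields`, `wordFromBytes`, and `decodeVop1` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Fields of one 32-bit VOP1 instruction word. The bit positions are an
// assumption (not taken from this PR), checked only against the test bytes.
struct Vop1Fields {
  uint32_t encoding; // bits 31:25, expected to be 0x3f for VOP1
  uint32_t vdst;     // bits 24:17, destination VGPR index
  uint32_t op;       // bits 16:9, opcode
  uint32_t src0;     // bits 8:0, 9-bit source operand selector
};

// Assemble the first four little-endian encoding bytes into one word.
constexpr uint32_t wordFromBytes(uint8_t b0, uint8_t b1, uint8_t b2,
                                 uint8_t b3) {
  return uint32_t(b0) | (uint32_t(b1) << 8) | (uint32_t(b2) << 16) |
         (uint32_t(b3) << 24);
}

// Split a VOP1 word into its fields.
constexpr Vop1Fields decodeVop1(uint32_t word) {
  return Vop1Fields{(word >> 25) & 0x7f, (word >> 17) & 0xff,
                    (word >> 9) & 0xff, word & 0x1ff};
}

// v_exp_bf16_e32 v5, v1 ; encoding: [0x01,0xfb,0x0a,0x7e]
constexpr Vop1Fields kExpV5V1 =
    decodeVop1(wordFromBytes(0x01, 0xfb, 0x0a, 0x7e));
static_assert(kExpV5V1.encoding == 0x3f, "VOP1 marker bits");
static_assert(kExpV5V1.op == 0x7d, "opcode from VOP1Instructions.td");
static_assert(kExpV5V1.vdst == 5, "destination v5");
static_assert(kExpV5V1.src0 == 256 + 1, "source v1 (VGPRs start at 256)");
```

Under the same assumptions, the `s1` form (`[0x01,0xfa,0x0a,0x7e]`) decodes to `src0 == 1`, and the neighbouring `v_log_bf16` encodings decode to opcode `0x7c`, one below `v_exp_bf16`.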

arsenm · 2025-07-17T00:45:10Z

llvm/test/CodeGen/AMDGPU/bf16-math.ll

+; GCN-NEXT:    v_exp_bf16_e32 v2, v2
+; GCN-NEXT:    global_store_b16 v[0:1], v2, off
+; GCN-NEXT:    s_endpgm
+  %exp = call bfloat @llvm.exp2.bf16(bfloat %src)


arsenm: If it's OK to directly select the generic intrinsic, why add the amdgcn one?

shiltian: They were added at different times in the past, and we do have a builtin for that.

arsenm: I don't follow.

shiltian: The target intrinsic int_amdgcn_exp2 was added a long time ago (along with the pattern AMDGPUexpf16). We didn't explicitly define a bf16 version of it. However, we did add a Clang builtin for it.

On the other hand, generic intrinsic support was only added about a month ago.

rampitec: Because it is direct access to the instruction without going through potential legalization.

arsenm: There's no legalization if the instruction works correctly. There's no reason to have the Clang builtin, though the codegen for the target intrinsic should still work with the new type if it is legal.
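
Both routes discussed in this thread, `llvm.exp2.bf16` and `llvm.amdgcn.exp2.bf16`, end at the same `v_exp_bf16` instruction in the tests above. For anyone who wants to sanity-check constants such as `0x42c8` for `bfloat 100.0`, here is a small software reference for a base-2 exponential on bf16 bit patterns. It is only a sketch: the PR says nothing about the hardware's rounding or precision, so the round-to-nearest-even narrowing and the use of `std::exp2` are assumptions, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Widen bf16 bits to float: bf16 is the upper 16 bits of an IEEE-754 binary32.
inline float bf16ToFloat(uint16_t bits) {
  uint32_t wide = uint32_t(bits) << 16;
  float f;
  std::memcpy(&f, &wide, sizeof f);
  return f;
}

// Narrow float to bf16 using round-to-nearest-even (an assumption; the
// instruction's actual rounding is not described in this PR).
inline uint16_t floatToBf16(float f) {
  uint32_t wide;
  std::memcpy(&wide, &f, sizeof wide);
  if (std::isnan(f))
    return uint16_t((wide >> 16) | 0x0040); // keep the result a quiet NaN
  uint32_t roundingBias = 0x7fff + ((wide >> 16) & 1);
  return uint16_t((wide + roundingBias) >> 16);
}

// Software reference for 2^x on a bf16 input; not a model of the hardware.
inline uint16_t exp2Bf16Reference(uint16_t x) {
  return floatToBf16(std::exp2(bf16ToFloat(x)));
}
```

With these helpers, `floatToBf16(100.0f)` gives `0x42c8`, matching the literal in `exp_bf16_constant_100`, and `exp2Bf16Reference(0x4080)` (input 4.0) gives the bit pattern of 16.0.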

llvm/test/CodeGen/AMDGPU/llvm.exp2.bfloat.ll

rampitec

LGTM.
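
A note for readers of the SDAG-FAKE16 checks in llvm.exp2.bfloat.ll: the `<2 x bfloat>` cases shift the high half down with `v_lshrrev_b32`, run `v_exp_bf16` on each half, and repack with `v_perm_b32 v0, v1, v0, 0x5040100`. The sketch below models that repacking step. It assumes that selector bytes 0 to 3 pick bytes of the second source and 4 to 7 pick bytes of the first source; that is my understanding of the instruction, not something this PR states, and other selector values are deliberately not modelled. `permB32` and `packLowHalves` are hypothetical helper names.

```cpp
#include <cstdint>

// Byte-select model of v_perm_b32, restricted to selector values 0..7.
// Assumption: bytes 0-3 of the combined value come from src1 and
// bytes 4-7 come from src0.
inline uint32_t permB32(uint32_t src0, uint32_t src1, uint32_t selector) {
  uint64_t combined = (uint64_t(src0) << 32) | src1;
  uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t sel = (selector >> (8 * i)) & 0xff;
    uint32_t byte = uint32_t(combined >> (8 * (sel & 7))) & 0xff;
    result |= byte << (8 * i);
  }
  return result;
}

// Repack two per-lane results the way the fake16 checks do:
// the low 16 bits of hiResult land in the upper half of the output.
inline uint32_t packLowHalves(uint32_t loResult, uint32_t hiResult) {
  return permB32(hiResult, loResult, 0x05040100u);
}
```

Under this model the selector `0x05040100` keeps the low 16 bits of each source, which is why the fake16 path needs the extra shift and permute while the true16 path writes `v0.l` and `v0.h` directly.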

rampitec · 2025-07-17T08:59:59Z

llvm/test/CodeGen/AMDGPU/bf16-math.ll

+; GCN-NEXT:    v_exp_bf16_e32 v2, v2
+; GCN-NEXT:    global_store_b16 v[0:1], v2, off
+; GCN-NEXT:    s_endpgm
+  %exp = call bfloat @llvm.exp2.bf16(bfloat %src)


Because it is direct access to the instruction without going through potential legalization.

shiltian · 2025-07-17T12:41:39Z

Merge activity

Jul 17, 12:41 PM UTC: A user started a stack merge that includes this pull request via Graphite.
Jul 17, 12:44 PM UTC: Graphite rebased this pull request as part of a merge.
Jul 17, 12:46 PM UTC: @shiltian merged this pull request with Graphite.

Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>

shiltian requested review from changpeng and rampitec July 17, 2025 00:19

llvmbot added the clang, backend:AMDGPU, clang:frontend, clang:codegen, and mc labels Jul 17, 2025

arsenm reviewed Jul 17, 2025

llvm/test/CodeGen/AMDGPU/llvm.exp2.bfloat.ll (outdated, resolved)

shiltian force-pushed the users/shiltian/v_exp_bf16 branch from ac1b5d3 to 4efddc2 Compare July 17, 2025 02:52

shiltian mentioned this pull request Jul 17, 2025

[AMDGPU] Add support for v_sin_bf16 on gfx1250 #149241

Merged

rampitec approved these changes Jul 17, 2025

[AMDGPU] Add support for v_exp_bf16 on gfx1250

62986e0

Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>

shiltian force-pushed the users/shiltian/v_exp_bf16 branch from 4efddc2 to 62986e0 Compare July 17, 2025 12:43

shiltian merged commit a6b5ece into main Jul 17, 2025
7 of 9 checks passed

shiltian deleted the users/shiltian/v_exp_bf16 branch July 17, 2025 12:46
