VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

Run Wang, Gamze Islamoglu, Andrea Belano, Viviane Potocnik, Francesco Conti, Angelo Garofalo, Luca Benini

January 2025

Abstract

While Transformers are dominated by floating-point matrix multiplications, the aggressive acceleration of these operations through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions such as Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph’s method. This block is integrated into the floating-point unit (FPU) of the RISC-V cores of a compute cluster through custom instruction set architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage this extension, we execute Softmax with 162.7x less latency and 74.3x less energy compared to the baseline cluster, achieving an 8.2x performance improvement and 4.1x higher energy efficiency for the FlashAttention-2 kernel in a GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3, and ViT, achieving up to 5.8x and 3.6x reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
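For readers unfamiliar with Schraudolph's method, the sketch below illustrates the classic software form of the trick: an approximate exponential is obtained by writing a scaled and shifted copy of the input directly into the bit pattern of an IEEE 754 float. This is not the algorithm or hardware block proposed in the paper. It uses float32 rather than Bfloat16, the names `schraudolph_exp` and `approx_softmax` are hypothetical, and the `SHIFT` constant is a commonly quoted error-balancing value that should be treated as an assumption.

```python
import math
import struct

LN2 = math.log(2.0)
SCALE = (1 << 23) / LN2   # maps x to units of the float32 mantissa LSB
BIAS = 127 << 23          # float32 exponent bias, shifted into the exponent field
SHIFT = 486411            # assumption: commonly quoted error-balancing constant


def schraudolph_exp(x: float) -> float:
    """Approximate exp(x) by constructing a float32 bit pattern directly."""
    i = int(SCALE * x) + BIAS - SHIFT
    # Clamp to the range of finite, non-negative float32 bit patterns.
    i = max(0, min(i, 0x7F7FFFFF))
    # Reinterpret the integer as a float32 value.
    return struct.unpack("<f", struct.pack("<I", i))[0]


def approx_softmax(xs):
    """Softmax with max-subtraction, using the approximate exponential."""
    m = max(xs)
    exps = [schraudolph_exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The integer part of the scaled input lands in the exponent field and the fractional part in the mantissa, so the result is a piecewise-linear interpolation of the exponential between powers of two. With the constant above, the relative error stays within a few percent, which conveys why the approach is attractive for a low-cost arithmetic block.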

Type

Conference paper

Publication

2025 IEEE 32nd Symposium on Computer Arithmetic (ARITH)


Run Wang

PhD Student at Integrated Systems Laboratory (IIS), ETH Zürich
