While Transformers are dominated by floating-point matrix multiplications, the aggressive acceleration of these operations through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions such as Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation that leverages a novel approximation algorithm based on Schraudolph’s method. This block is integrated into the floating-point unit (FPU) of the RISC-V cores of a compute cluster through custom instruction set architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage this extension, we execute Softmax with 162.7x lower latency and 74.3x lower energy than the baseline cluster, achieving an 8.2x performance improvement and 4.1x higher energy efficiency for the FlashAttention-2 kernel in the GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3, and ViT, achieving up to 5.8x and 3.6x reductions in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
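For readers unfamiliar with Schraudolph's method, the following is a minimal software sketch of the classical technique applied to the Bfloat16 layout (1 sign, 8 exponent, 7 mantissa bits). It is not the novel approximation algorithm or the hardware block proposed in this work. It only illustrates the underlying idea of writing a scaled and biased copy of the input directly into the exponent and mantissa fields. The function names `schraudolph_exp_bf16`, `bf16_bits_to_float`, and `softmax_approx` are hypothetical, and no error-correction term is applied.

```python
import math
import struct

MANT_BITS = 7                               # Bfloat16 mantissa width
BIAS = 127                                  # Bfloat16 exponent bias
SCALE = (1 << MANT_BITS) / math.log(2.0)    # 2^7 / ln(2)
OFFSET = BIAS << MANT_BITS                  # bit pattern of 1.0 (0x3F80)


def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit Bfloat16 pattern as the upper half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def schraudolph_exp_bf16(x: float) -> float:
    """Approximate exp(x) by building the Bfloat16 bit pattern directly.

    Classical Schraudolph scheme without a correction constant: the integer
    part of x / ln(2) lands in the exponent field, and the fractional part
    is used as a linear approximation of the mantissa.
    """
    i = math.floor(SCALE * x + OFFSET)
    # Clamp to the finite, non-negative Bfloat16 range (assumed policy).
    i = max(0, min(i, 0x7F7F))
    return bf16_bits_to_float(i)


def softmax_approx(xs):
    """Numerically stabilised Softmax using the approximate exponential."""
    m = max(xs)
    exps = [schraudolph_exp_bf16(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Without correction, the linear mantissa approximation overestimates the exponential by up to roughly 6%. Because Softmax normalises by the sum of the exponentials, part of this error cancels in the final probabilities.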