I was recently running some performance benchmarks across different platforms to understand the cost of simple mathematical functions such as sqrtf, and I quickly noticed that under certain circumstances compilers such as MSVC, Clang and GCC will not, by default, generate a single x86-64 sqrtss instruction. There is a little more to it…
I think discovering the following is a good example of why it’s always a good idea to inspect the generated assembly code when optimising performance to fully understand what is happening under the hood.
Source code:
#include <math.h>

float square(float num)
{
    return sqrtf(num);
}
MSVC v19.latest + /O2:
float square(float) PROC
xorps xmm1, xmm1
ucomiss xmm1, xmm0
ja SHORT $LN3@square
sqrtss xmm0, xmm0
ret 0
$LN3@square:
jmp sqrtf
float square(float) ENDP
So what’s going on here?
Isn’t the sqrtss instruction exactly what we need? We can see it generated in the assembly, but there is some additional logic in there, plus an actual call to the sqrtf function (the jmp sqrtf near the end). So let’s get a better understanding of what’s happening:
xorps xmm1, xmm1 – Here we are putting a value of zero into the xmm1 register. xorps is: “Bitwise Logical XOR of Packed Single Precision Floating-Point Values”. It’s common for compilers to generate a zero through a simple, fast XOR operation, because XORing any value with itself always gives zero.
ucomiss xmm1, xmm0 – Next we have a comparison operation. ucomiss: “Unordered Compare Scalar Single Precision Floating-Point Values and Set EFLAGS”. So now we are comparing the value passed into the function (xmm0) against zero (xmm1). This operation sets the EFLAGS based on the comparison of the values, which is then used by the following ja instruction.
ja SHORT $LN3@square – This is a jump instruction. ja jumps to the specified label if the first operand of the comparison is above the second. In our case, it will jump if the value in xmm1 is above xmm0, i.e. if zero is above the value passed to the function. Or, to flip this on its head, it’s checking whether the value we passed to our function is less than zero.
So if we pass a value less than zero, we jump to the label $LN3@square, which ends up calling the full sqrtf function. Otherwise we execute the sqrtss instruction and return.
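Written back out as C, the generated code behaves like this sketch (my own reading of the assembly, not compiler output; the _mm_sqrt_ss intrinsic stands in for the lone sqrtss instruction, more on it later):

#include <math.h>
#include <xmmintrin.h>

/* A sketch of the compiler's logic: negative inputs fall back to the
   library sqrtf so it can set errno; everything else takes the
   single-instruction path. */
float square_sketch(float num)
{
    if (0.0f > num)                /* xorps + ucomiss + ja        */
        return sqrtf(num);         /* jmp sqrtf: the library call */
    /* sqrtss xmm0, xmm0: hardware square root, no error handling */
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(num)));
}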
So why is the logic branching here based on whether a negative value is passed? Let’s start by taking a look at the documentation for the sqrtf function (https://linux.die.net/man/3/sqrtf):
Domain error: x less than -0
errno is set to EDOM. An invalid floating-point exception (FE_INVALID) is raised.
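Before digging into errno, we can observe the FE_INVALID half of that quote directly. A minimal sketch (my own example, not from the documentation; strictly this also wants #pragma STDC FENV_ACCESS ON, which many compilers don’t require in practice):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile float x = -1.0f;      // volatile stops constant folding
    feclearexcept(FE_ALL_EXCEPT);  // clear any pending FP exceptions
    float r = sqrtf(x);            // domain error: x < -0
    if (fetestexcept(FE_INVALID))
        printf("FE_INVALID raised, result = %f\n", r);
    return 0;
}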
Hmm… Let’s understand the error handling with these functions a little bit further…
Math Error
math_error is used for detecting errors from mathematical functions. It’s a feature I’ve not come across before in game development code; most of the time we sanitise the input before calling a math function, and we’re unlikely to be writing code that can gracefully handle the result of a bad value being passed to one (except perhaps in non-shipping builds).
You can use the math error functionality like below to check the result of a call such as sqrtf:
#include <errno.h>
#include <math.h>
#include <stdio.h>

float safe_sqrtf(float f)
{
    // Always clear errno by calling _set_errno(0) immediately
    // before a call that may set it,
    // and check it immediately after the call.
    errno = 0;
    float result = sqrtf(f);
    int error_value = errno;
    if (error_value != 0)
    {
        printf("sqrtf error %i for value: %f\n", error_value, f);
        return 0.0f;
    }
    return result;
}
So if we pass a negative value to our safe_sqrtf function, we can expect errno to be set to EDOM, and we can handle this however we like. The return value from sqrtf will be -nan(ind), so here we instead return 0.0f.
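A quick usage sketch building on safe_sqrtf above (the expected values follow from the behaviour just described):

int main(void)
{
    // Negative input: errno is set to EDOM, the wrapper prints the
    // error and returns 0.0f instead of -nan(ind).
    float a = safe_sqrtf(-1.0f);
    // Valid input: no error, returns 2.0f.
    float b = safe_sqrtf(4.0f);
    printf("a = %f, b = %f\n", a, b);
    return 0;
}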
Can we omit math errors?
Yes we can! With MSVC we are given different compiler options for floating point operations: /fp:strict, /fp:precise and /fp:fast. The default option is /fp:precise. It’s important to know that these floating point controls are global and affect more than just how error handling is performed, so it’s not always recommended to change this without careful scrutiny. You can read more about what these options do here: https://learn.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior
Below is the assembly code generated for each of the three options:
/fp:strict
float square(float) PROC
rex_jmp QWORD PTR __imp_sqrtf
float square(float) ENDP
Here we never use the sqrtss instruction; it always ends up calling the sqrtf implementation.
/fp:precise
float square(float) PROC
xorps xmm1, xmm1
ucomiss xmm1, xmm0
ja SHORT $LN3@square
sqrtss xmm0, xmm0
ret 0
$LN3@square:
jmp sqrtf
float square(float) ENDP
/fp:fast
float square(float) PROC
sqrtss xmm0, xmm0
ret 0
float square(float) ENDP
What about other compilers? With Clang or GCC we can get the same results by passing -ffast-math, but this enables a lot of other optimisations and assumptions which, in my experience in game development, can have a negative impact on gameplay features that rely on very precise floating point accuracy. The good news is that -ffast-math is really a big collection of other compiler options, and we can disable math error handling specifically with just one of them: -fno-math-errno.
Can we do it without compiler options?
Yes! We can use the intrinsic _mm_sqrt_ss, which translates to the exact sqrtss assembly instruction we’ve been aiming for. Let’s see how that could look:
Source code:
#include <xmmintrin.h>

__m128 square_mm(__m128 num)
{
    return _mm_sqrt_ss(num);
}
MSVC v19.latest + /O2:
__m128 square_mm(__m128) PROC
movups xmm0, XMMWORD PTR [rcx]
sqrtss xmm0, xmm0
ret 0
__m128 square_mm(__m128) ENDP
Hmm, that’s not as optimal as the assembly generated in the previous results! What’s going on here? Under MSVC, if we’re expecting to use SIMD registers directly, it’s best to add the __vectorcall calling convention to the function. Here is an extract from https://learn.microsoft.com/en-us/cpp/cpp/vectorcall:
The __vectorcall calling convention specifies that arguments to functions are to be passed in registers, when possible. __vectorcall uses more registers for arguments than __fastcall or the default x64 calling convention use. The __vectorcall calling convention is only supported in native code on x86 and x64 processors that include Streaming SIMD Extensions 2 (SSE2) and above.
So let’s try it again with __vectorcall added:
Source code:
#include <xmmintrin.h>

__m128 __vectorcall square_mm(__m128 num)
{
    return _mm_sqrt_ss(num);
}
MSVC v19.latest + /O2:
__m128 square_mm(__m128) PROC
sqrtss xmm0, xmm0
ret 0
__m128 square_mm(__m128) ENDP
So you can see how the explicit __vectorcall has now produced the same instructions as the /fp:fast variant, but we’ve done it without global compiler options. I expect there are other options you could explore with pragmas or function attributes to control math error functionality for specific sections of code, but I have not extensively explored this.
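If the calling code works with plain floats rather than __m128 values, you can load and extract the low lane around the call yourself. A hypothetical usage sketch (main and the constants are mine):

#include <stdio.h>
#include <xmmintrin.h>

__m128 __vectorcall square_mm(__m128 num)
{
    return _mm_sqrt_ss(num);
}

int main(void)
{
    // Load the float into the low lane, square root that lane, then
    // extract it again. Under /O2 the set/extract pair typically
    // folds away around the sqrtss.
    __m128 v = _mm_set_ss(2.0f);
    float r = _mm_cvtss_f32(square_mm(v));
    printf("sqrt(2) = %f\n", r);
    return 0;
}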
Further SIMD behaviour
If we run some more tests using SIMD, how will the compiler handle it?
4 Float Loop:
Here we have 4 floats passed to a function and a loop where we explicitly call sqrtf() on each of them. The interesting part is that even with the default /fp:precise behaviour, the compiler automatically omits the negative check and the call to the sqrtf C function, which is different to the behaviour with a single float.
__m128 __vectorcall m128_sqrtf(__m128 m)
{
    __m128 out;
    for (size_t i = 0; i < 4; i++)
    {
        out.m128_f32[i] = sqrtf(m.m128_f32[i]);
    }
    return out;
}
The assembly produced for the code above with standard /O2 on MSVC is our most optimal outcome:
__m128 m128_sqrtf(__m128) PROC
sqrtps xmm0, xmm0
ret 0
__m128 m128_sqrtf(__m128) ENDP
The /fp:strict variant here naturally produces a far more complex result, with the loop unrolled into 4 individual calls to sqrtf, while the /fp:fast variant produces the same code as /fp:precise.
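If you’d rather not rely on the autovectoriser spotting this pattern, the packed intrinsic requests the same sqrtps instruction explicitly. A short sketch (the function name is mine):

#include <xmmintrin.h>

// Ask for the packed square root directly instead of relying on the
// compiler vectorising a scalar loop.
__m128 __vectorcall m128_sqrt_explicit(__m128 m)
{
    return _mm_sqrt_ps(m);
}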
3 Float Loop:
This is the same as the previous test but with a loop over only 3 floats; the rest of the setup is identical. The differences in the code produced are stark:
__m128 __vectorcall m128_sqrtf(__m128 m)
{
    __m128 out;
    for (size_t i = 0; i < 3; i++)
    {
        out.m128_f32[i] = sqrtf(m.m128_f32[i]);
    }
    return out;
}
__m128 m128_sqrtf(__m128) PROC
$LN22:
sub rsp, 72
movaps XMMWORD PTR [rsp+48], xmm6
xorps xmm1, xmm1
movaps xmm6, xmm0
ucomiss xmm1, xmm6
ja SHORT $LN15@m128_sqrtf
xorps xmm0, xmm0
sqrtss xmm0, xmm6
jmp SHORT $LN16@m128_sqrtf
$LN15@m128_sqrtf:
call sqrtf
$LN16@m128_sqrtf:
movaps xmm1, xmm6
movss DWORD PTR out$[rsp], xmm0
shufps xmm1, xmm6, 85
xorps xmm0, xmm0
ucomiss xmm0, xmm1
ja SHORT $LN13@m128_sqrtf
xorps xmm0, xmm0
sqrtss xmm0, xmm1
jmp SHORT $LN14@m128_sqrtf
$LN13@m128_sqrtf:
movaps xmm0, xmm1
call sqrtf
$LN14@m128_sqrtf:
movss DWORD PTR out$[rsp+4], xmm0
xorps xmm0, xmm0
shufps xmm6, xmm6, 170
ucomiss xmm0, xmm6
ja SHORT $LN11@m128_sqrtf
xorps xmm1, xmm1
sqrtss xmm1, xmm6
jmp SHORT $LN20@m128_sqrtf
$LN11@m128_sqrtf:
movaps xmm0, xmm6
call sqrtf
movaps xmm1, xmm0
$LN20@m128_sqrtf:
movaps xmm0, XMMWORD PTR out$[rsp]
movaps xmm6, XMMWORD PTR [rsp+48]
shufps xmm0, xmm0, 210
movss xmm0, xmm1
shufps xmm0, xmm0, 201
add rsp, 72
ret 0
__m128 m128_sqrtf(__m128) ENDP
I also ran the tests above using a custom struct containing 4 sequential floats rather than the __m128 type, to see if __m128 was being handled as a special case, but I observed identical behaviour with my own structure.
Compiler Explorer Examples
View the comparisons of all compilers and their variations on godbolt.org here: https://godbolt.org/z/v3na8z91P
Performance
So let’s take a look at the performance difference of our variants… For these tests I have used Google Benchmark: https://github.com/google/benchmark – a great open source tool for benchmarking small sections of code in isolation.
When using a tool like Google Benchmark to test a very small section of code, like the single function here, it’s incredibly important to validate that the test you’ve written is actually being executed as expected. I’m always surprised by how clever compilers can be and how completely they can optimise code away.
The key part in the code below is to ensure the variable passed to sqrtf is volatile, so the compiler is forced to read the value from memory on every iteration and cannot make assumptions about its value. Without this, the compiler is clever enough to realise the work done in the loop never changes between iterations and, in certain situations, will only invoke sqrtf once. We also shouldn’t take the figures here as “the cost of calling sqrtf”; what we’re interested in is how much faster each variant is compared to the default.
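The do_not_optimise helper isn’t shown here; my assumption is that it’s a thin wrapper over Google Benchmark’s DoNotOptimize, something like this sketch:

#include <benchmark/benchmark.h>
#include <math.h>   // for the sqrtf used in the benchmark below

// Assumed implementation: forwards to benchmark::DoNotOptimize, which
// makes the compiler treat the value as read and so keeps the work
// that produced it alive.
template <typename T>
void do_not_optimise(T& value)
{
    benchmark::DoNotOptimize(value);
}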
void BM_sqrtf(benchmark::State& state)
{
    float A = 0.0f;
    volatile float B = 1.0f;
    do_not_optimise(A);
    do_not_optimise(B);
    for (auto _ : state)
    {
        A += sqrtf(B);
    }
    do_not_optimise(A);
    do_not_optimise(B);
}
// Register the benchmark so the framework runs it.
BENCHMARK(BM_sqrtf);
| Test Setup | /fp:strict | /fp:precise (default) | /fp:fast |
| AMD Ryzen 5 5600X | 2.76 ns | 2.63 ns | 2.63 ns |
| Intel i5-8265U | 3.90 ns | 2.44 ns | 2.44 ns |
So what’s really interesting here is that there’s no observable difference between /fp:precise and /fp:fast, despite us knowing /fp:fast is a single sqrtss instruction while /fp:precise has additional logic to first check whether the value is negative and branch on the result. I can’t fully explain this, but I suspect branch prediction is always taking the correct path because I’m running the same test over and over with the same values. It’s very interesting that the additional instructions we know are present do not affect the test results.
Comparing to the /fp:strict variant, we do see a performance difference, especially on my laptop’s Intel chip, which takes over a 50% performance hit; the hit on the AMD chip is far smaller.
Raspberry Pi (ARM64)
I also took the opportunity to run the test on a Raspberry Pi 3 Model B V.2. GCC does not have an equivalent of /fp:strict (which always generated a jump to the library function), so instead the test here is the default -O3 versus -O3 -fno-math-errno. The tests were compiled using GCC version 14.2.0.
-O3:
square_root(float):
fcmp s0, #0.0
bpl .L5
b sqrtf
.L5:
fsqrt s31, s0
fmov s0, s31
ret
-O3 -fno-math-errno:
square_root(float):
fsqrt s0, s0
ret
| Test Setup | -O3 | -O3 -fno-math-errno |
| Raspberry Pi 3 Model B V.2 | 20.1 ns | 15.9 ns |
The comparison on the ARM chip of -O3 vs -O3 -fno-math-errno is the equivalent of /fp:precise vs /fp:fast on the MSVC compiler, so it’s very interesting that we see a large performance difference here with GCC on ARM while on the x86-64 chips with MSVC we saw no difference at all. On the Raspberry Pi we observe over a 25% speed improvement by using -fno-math-errno.
3 Float Loop Performance
I also elected to run some further tests on the setup where we have a loop calling sqrtf three times, because this resulted in some large differences in the code produced. The function for this test looks like this:
struct Float3
{
    float f[3];
};

__declspec(noinline) Float3 __vectorcall sqrt_float3_no_inline(Float3 m)
{
    Float3 o;
    for (size_t i = 0; i < 3; i++)
    {
        o.f[i] = sqrtf(m.f[i]);
    }
    return o;
}
Below is the assembly produced for each of the results on MSVC and the performance results on AMD Ryzen 5 5600X. In this test case, we do observe a very small performance improvement between /fp:precise and /fp:fast.
/fp:strict (5.54 ns)
Float3 sqrt_float3_no_inline(Float3) PROC
$LN12:
push rbx
sub rsp, 80
movaps XMMWORD PTR [rsp+64], xmm6
movaps xmm6, xmm0
unpcklps xmm6, xmm1
movaps xmm0, xmm6
movaps XMMWORD PTR [rsp+48], xmm7
movd ebx, xmm2
call QWORD PTR __imp_sqrtf
movaps xmm7, xmm0
movaps xmm0, xmm6
shufps xmm0, xmm0, 85
call QWORD PTR __imp_sqrtf
movaps xmm6, xmm0
movd xmm0, ebx
call QWORD PTR __imp_sqrtf
movaps xmm2, xmm0
movaps xmm1, xmm6
movaps xmm6, XMMWORD PTR [rsp+64]
movaps xmm0, xmm7
movaps xmm7, XMMWORD PTR [rsp+48]
add rsp, 80
pop rbx
ret 0
Float3 sqrt_float3_no_inline(Float3) ENDP
/fp:precise (3.55 ns)
Float3 sqrt_float3_no_inline(Float3) PROC
$LN21:
sub rsp, 88
movaps XMMWORD PTR [rsp+64], xmm6
xorps xmm3, xmm3
ucomiss xmm3, xmm0
movaps XMMWORD PTR [rsp+48], xmm7
movaps xmm6, xmm1
movaps XMMWORD PTR [rsp+32], xmm8
movaps xmm7, xmm2
ja SHORT $LN15@sqrt_float
xorps xmm8, xmm8
sqrtss xmm8, xmm0
jmp SHORT $LN16@sqrt_float
$LN15@sqrt_float:
call sqrtf
movaps xmm8, xmm0
$LN16@sqrt_float:
xorps xmm0, xmm0
ucomiss xmm0, xmm6
ja SHORT $LN13@sqrt_float
sqrtss xmm6, xmm6
jmp SHORT $LN14@sqrt_float
$LN13@sqrt_float:
movaps xmm0, xmm6
call sqrtf
movaps xmm6, xmm0
$LN14@sqrt_float:
xorps xmm0, xmm0
ucomiss xmm0, xmm7
ja SHORT $LN11@sqrt_float
xorps xmm0, xmm0
sqrtss xmm0, xmm7
jmp SHORT $LN12@sqrt_float
$LN11@sqrt_float:
movaps xmm0, xmm7
call sqrtf
$LN12@sqrt_float:
movaps xmm7, XMMWORD PTR [rsp+48]
movaps xmm2, xmm0
movaps xmm0, xmm8
movaps xmm1, xmm6
movaps xmm8, XMMWORD PTR [rsp+32]
movaps xmm6, XMMWORD PTR [rsp+64]
add rsp, 88
ret 0
Float3 sqrt_float3_no_inline(Float3) ENDP
/fp:fast (3.36 ns)
Float3 sqrt_float3_no_inline(Float3) PROC
sqrtss xmm0, xmm0
sqrtss xmm1, xmm1
sqrtss xmm2, xmm2
ret 0
Float3 sqrt_float3_no_inline(Float3) ENDP
Closing thoughts
I started writing this post because I noticed additional complexity when calling sqrtf, and it’s led me down a bit of a rabbit hole! But it’s been an interesting journey. The things I’ve taken away from this are:
- Math.h functions can internally set errno to help diagnose when bad values are passed to them. Disabling this functionality can produce faster and simpler code.
- Compiler options such as -ffast-math are not all or nothing; they are collections of many individual options, so we can select the safe ones without unwanted side effects.
- This little experiment reinforces that you’ve always got to measure performance; you can’t make assumptions about how things will perform on different hardware or platforms. It’s also shown me that when using tools such as Google Benchmark, you need to be very careful that the compiler doesn’t optimise away the intent of your test code. Don’t underestimate the compiler!