Does calling sqrtf always result in a single asm instruction?

I was recently running some performance benchmarks across different platforms to understand the cost of some simple mathematical functions such as sqrtf, and I quickly noticed that under certain circumstances compilers such as MSVC, Clang and GCC will not, by default, generate a single x86-64 sqrtss instruction. There is a little more to it…

I think discovering the following is a good example of why it’s always a good idea to inspect the generated assembly code when optimising performance to fully understand what is happening under the hood.

Source code:

#include <math.h>
float square(float num) 
{
    return sqrtf(num);
}

MSVC v19.latest + /O2:

float square(float) PROC
        xorps   xmm1, xmm1
        ucomiss xmm1, xmm0
        ja      SHORT $LN3@square
        sqrtss  xmm0, xmm0
        ret     0
$LN3@square:
        jmp     sqrtf
float square(float) ENDP

So what’s going on here?

Isn’t the sqrtss instruction exactly what we need? We can see it generated in the assembly, but there is some additional logic in there, plus an actual call to the sqrtf function at the jmp sqrtf near the end. So let’s get a better understanding of what’s happening:

xorps xmm1, xmm1 – Here we are putting a value of zero into the xmm1 register. xorps is: “Bitwise Logical XOR of Packed Single Precision Floating-Point Values”. It’s common for compilers to generate a zero through a simple, fast XOR operation, because XORing any value with itself always produces zero.

ucomiss xmm1, xmm0 – Next we have a comparison operation. ucomiss: “Unordered Compare Scalar Single Precision Floating-Point Values and Set EFLAGS”. So now we are comparing the value passed to the function (xmm0) with zero (xmm1). This operation sets EFLAGS based on the comparison, and those flags are then used by the following ja instruction.

ja SHORT $LN3@square – This is a conditional jump. ja jumps to the specified label if the first operand of the comparison is above the second. In our case, it jumps if the value in xmm1 (zero) is above the value in xmm0. So it’s checking whether zero is above the value passed to the function. Or, to flip this on its head, it’s checking whether the value we passed to our function is less than zero.

So if we pass a value less than zero, we’re going to jump to the label $LN3@square:, which ends up calling the full sqrtf function. Otherwise we execute the sqrtss instruction and return.

So why is the logic branching here based on whether a negative value is passed? Let’s start by taking a look at the documentation for the sqrtf function (https://linux.die.net/man/3/sqrtf):

Domain error: x less than -0
errno is set to EDOM. An invalid floating-point exception (FE_INVALID) is raised.

Hmm… Let’s understand the error handling of these functions a little further…

Math Error

math_error is the mechanism for detecting errors from mathematical functions. It’s a feature I’ve not come across before in game development code; most of the time we sanitise the input before calling a math function, and we’re unlikely to write code that gracefully handles the result of a bad value passed to one (except perhaps in non-shipping builds).
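As a sketch of that sanitising approach (the function name and the clamp-to-zero policy are mine, and clamping is just one possible strategy):

```cpp
#include <cmath>

// A minimal sanitising wrapper: clamp negative inputs to zero before
// the call, so sqrtf never sees a domain error in the first place.
float sanitised_sqrtf(float f)
{
    return std::sqrt(f < 0.0f ? 0.0f : f);
}
```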

You can use errno as below to check the result of a call such as sqrtf:

#include <errno.h>
#include <math.h>
#include <stdio.h>

float safe_sqrtf(float f)
{
    // Always clear errno immediately before a call that may set it,
    // and check it immediately after the call.
    errno = 0;
    float result = sqrtf(f);
    int error_value = errno;
    if (error_value != 0)
    {
        printf("sqrtf error %i for value: %f\n", error_value, f);
        return 0.0f;
    }
    return result;
}

So if we passed a negative value to our safe_sqrtf function, we can expect errno to be set to EDOM and we can handle this however we like. The return value from sqrtf will be -nan(ind) so here we instead return 0.0f.

Can we omit math errors?

Yes we can! With MSVC we are given different compiler options for floating point operations: /fp:strict, /fp:precise and /fp:fast. The default option is /fp:precise. It’s important to know that these floating point controls are global and affect more than just how error handling is performed, so enabling them isn’t always recommended without careful scrutiny. You can read more about what these options do here: https://learn.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior

Below is the assembly code generated for each of the three options:

/fp:strict

float square(float) PROC
        rex_jmp QWORD PTR __imp_sqrtf
float square(float) ENDP

Here we never use the sqrtss instruction; it always ends up calling the sqrtf implementation.

/fp:precise

float square(float) PROC
        xorps   xmm1, xmm1
        ucomiss xmm1, xmm0
        ja      SHORT $LN3@square
        sqrtss  xmm0, xmm0
        ret     0
$LN3@square:
        jmp     sqrtf
float square(float) ENDP

/fp:fast

float square(float) PROC
        sqrtss  xmm0, xmm0
        ret     0
float square(float) ENDP

What about other compilers? With Clang or GCC we can get the same results by passing -ffast-math, but this enables many other optimisations and assumptions which, in my experience in game development, can negatively impact gameplay features that rely on very precise floating point behaviour. The good news is that -ffast-math is a big collection of other compiler options, so we can disable math error handling specifically with the following: -fno-math-errno.
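On GCC you can even scope this to a single function rather than a whole translation unit, via the optimize attribute. This is a sketch under the assumption that your GCC version honours it (GCC documents the attribute as intended mainly for debugging, and Clang ignores it with a warning), so verify the generated assembly before relying on it:

```cpp
#include <cmath>

// GCC-specific, hypothetical helper name: apply -fno-math-errno to
// this one function only, leaving the rest of the file at its
// default floating point behaviour.
__attribute__((optimize("-fno-math-errno")))
float sqrtf_no_errno(float f)
{
    return std::sqrt(f);
}
```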

Can we do it without compiler options?

Yes! We can use the intrinsic _mm_sqrt_ss, which translates to the exact sqrtss assembly instruction we’ve been aiming for. Let’s see how that could look:

Source code:

#include <xmmintrin.h>
__m128 square_mm(__m128 num) 
{
    return _mm_sqrt_ss(num);
}

MSVC v19.latest + /O2:

__m128 square_mm(__m128) PROC
        movups  xmm0, XMMWORD PTR [rcx]
        sqrtss  xmm0, xmm0
        ret     0
__m128 square_mm(__m128) ENDP

Hmm, that’s not as optimal as the previous assembly results! What’s going on here? Under MSVC, if we expect to use SIMD registers directly, it’s best to add the __vectorcall calling convention to the function. Here is an extract from https://learn.microsoft.com/en-us/cpp/cpp/vectorcall:

The __vectorcall calling convention specifies that arguments to functions are to be passed in registers, when possible. __vectorcall uses more registers for arguments than __fastcall or the default x64 calling convention use. The __vectorcall calling convention is only supported in native code on x86 and x64 processors that include Streaming SIMD Extensions 2 (SSE2) and above.

So let’s try it again with __vectorcall added:

Source code:

#include <xmmintrin.h>
__m128 __vectorcall square_mm(__m128 num) 
{
    return _mm_sqrt_ss(num);
}

MSVC v19.latest + /O2:

__m128 square_mm(__m128) PROC
        sqrtss  xmm0, xmm0
        ret     0
__m128 square_mm(__m128) ENDP

So you can see how the explicit __vectorcall has now produced the same instructions as the /fp:fast variant, but we’ve done it without global compiler options. I expect there are other options you could explore with pragmas or function attributes to control math error behaviour for specific sections of code, but I have not explored this extensively.
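If you would rather keep a plain float in/out interface instead of passing __m128 around, you can wrap the intrinsic yourself – a sketch (the function name is mine):

```cpp
#include <xmmintrin.h>

// Wrap the scalar sqrtss intrinsic behind a plain float interface.
// _mm_set_ss places the float in the low lane, _mm_sqrt_ss takes the
// square root of that lane, and _mm_cvtss_f32 extracts it again.
float intrinsic_sqrtf(float f)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(f)));
}
```

Whether the set/extract moves optimise away entirely depends on the compiler and calling convention, so as always it’s worth checking the generated assembly.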

Further SIMD behaviour

If we run some more tests using SIMD, how will the compiler handle it?

4 Float Loop:

Here we have 4 floats passed to a function and a loop where we explicitly call sqrtf() on each of them. The interesting part is that even with the default /fp:precise behaviour, the compiler omits the negative check and the call to the sqrtf C function, which differs from the behaviour with a single float.

__m128 __vectorcall m128_sqrtf(__m128 m)
{
    __m128 out;
    for (size_t i = 0; i < 4; i++)
    {
        out.m128_f32[i] = sqrtf(m.m128_f32[i]);
    }
    return out;
}

The assembly produced for the code above with standard /O2 on MSVC is our most optimal outcome:

__m128 m128_sqrtf(__m128) PROC
        sqrtps  xmm0, xmm0
        ret     0
__m128 m128_sqrtf(__m128) ENDP

The /fp:strict variant here naturally produces a far more complex result, with the loop unrolled into 4 individual calls to sqrtf, while the /fp:fast variant produces the same sqrtps code as above.
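If you want the packed sqrtps form regardless of floating point options, the packed intrinsic maps to it directly – a sketch (the function name is mine):

```cpp
#include <xmmintrin.h>

// _mm_sqrt_ps computes the square root of all four float lanes at
// once; it corresponds to the single sqrtps instruction the
// optimiser produced for the 4-float loop.
__m128 sqrt4(__m128 v)
{
    return _mm_sqrt_ps(v);
}
```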

3 Float Loop:

This is the same as the previous test but with a loop over only 3 floats; the rest of the setup is identical. The differences in the generated code are stark:

__m128 __vectorcall m128_sqrtf(__m128 m)
{
    __m128 out;
    for (size_t i = 0; i < 3; i++)
    {
        out.m128_f32[i] = sqrtf(m.m128_f32[i]);
    }
    return out;
}

__m128 m128_sqrtf(__m128) PROC             
$LN22:
        sub     rsp, 72                    
        movaps  XMMWORD PTR [rsp+48], xmm6
        xorps   xmm1, xmm1
        movaps  xmm6, xmm0
        ucomiss xmm1, xmm6
        ja      SHORT $LN15@m128_sqrtf
        xorps   xmm0, xmm0
        sqrtss  xmm0, xmm6
        jmp     SHORT $LN16@m128_sqrtf
$LN15@m128_sqrtf:
        call    sqrtf
$LN16@m128_sqrtf:
        movaps  xmm1, xmm6
        movss   DWORD PTR out$[rsp], xmm0
        shufps  xmm1, xmm6, 85             
        xorps   xmm0, xmm0
        ucomiss xmm0, xmm1
        ja      SHORT $LN13@m128_sqrtf
        xorps   xmm0, xmm0
        sqrtss  xmm0, xmm1
        jmp     SHORT $LN14@m128_sqrtf
$LN13@m128_sqrtf:
        movaps  xmm0, xmm1
        call    sqrtf
$LN14@m128_sqrtf:
        movss   DWORD PTR out$[rsp+4], xmm0
        xorps   xmm0, xmm0
        shufps  xmm6, xmm6, 170            
        ucomiss xmm0, xmm6
        ja      SHORT $LN11@m128_sqrtf
        xorps   xmm1, xmm1
        sqrtss  xmm1, xmm6
        jmp     SHORT $LN20@m128_sqrtf
$LN11@m128_sqrtf:
        movaps  xmm0, xmm6
        call    sqrtf
        movaps  xmm1, xmm0
$LN20@m128_sqrtf:
        movaps  xmm0, XMMWORD PTR out$[rsp]
        movaps  xmm6, XMMWORD PTR [rsp+48]
        shufps  xmm0, xmm0, 210            
        movss   xmm0, xmm1
        shufps  xmm0, xmm0, 201            
        add     rsp, 72                    
        ret     0
__m128 m128_sqrtf(__m128) ENDP 

I also ran the tests above using a custom struct containing 4 sequential floats rather than the __m128 type, to see whether __m128 was being handled as a special case, but I observed identical behaviour with my custom structure.

Compiler Explorer Examples

View the comparisons of all compilers and their variations on godbolt.org here: https://godbolt.org/z/v3na8z91P

Performance

So let’s take a look at the performance difference of our variants… For these tests I have used Google Benchmark: https://github.com/google/benchmark – a great open source tool for benchmarking small sections of code in isolation.

When using a tool like Google Benchmark to test a very small section of code, as I am here with a single function, it’s incredibly important to validate that the test you’ve written is actually being executed as expected. I’m always surprised by how clever modern compilers can be at optimising code away entirely.

The key part in the code below is to ensure the variable passed to sqrtf is volatile, so the compiler is forced to read the value from memory on each iteration and cannot make assumptions about its value. Without this, the compiler is clever enough to realise the work done in the loop never changes between iterations and may, in certain situations, invoke sqrtf only once. We also shouldn’t take the figures here as “the cost of calling sqrtf”; what we’re interested in is how much faster each variation is compared to the default.

void BM_sqrtf(benchmark::State& state)
{
	float A = 0.0f;
	volatile float B = 1.0f;

	do_not_optimise(A);
	do_not_optimise(B);

	for (auto _ : state)
	{
		A += sqrtf(B);
	}

	do_not_optimise(A);
	do_not_optimise(B);
}

Test Setup         | /fp:strict | /fp:precise (default) | /fp:fast
AMD Ryzen 5 5600X  | 2.76 ns    | 2.63 ns               | 2.63 ns
Intel i5-8265U     | 3.90 ns    | 2.44 ns               | 2.44 ns

So what’s really interesting here is that there is no observable difference between /fp:precise and /fp:fast, despite knowing /fp:fast is a single sqrtss instruction while /fp:precise has additional logic to check whether the value is negative and branch on that. I can’t fully explain the reasons for this, but I suspect branch prediction is always taking the correct path, since I’m running the same test over and over with the same values; still, it’s very interesting that the additional instructions we know are present do not affect the test results.

Comparing against the /fp:strict variant, we do see a performance difference, especially on my laptop’s Intel chip: over a 50% performance hit there, and a far smaller hit on the AMD chip.

Raspberry Pi (ARM64)

I also took the opportunity to run the test on a Raspberry Pi 3 Model B V.2. GCC here does not have an equivalent of /fp:strict (which always generated a jmp to the library function), so instead the test compares the default -O3 against -O3 -fno-math-errno. The tests were compiled using GCC version 14.2.0.

-O3:

square_root(float):
        fcmp    s0, #0.0
        bpl     .L5
        b       sqrtf
.L5:
        fsqrt   s31, s0
        fmov    s0, s31
        ret

-O3 -fno-math-errno:

square_root(float):
        fsqrt   s0, s0
        ret

Test Setup                 | -O3     | -O3 -fno-math-errno
Raspberry Pi 3 Model B V.2 | 20.1 ns | 15.9 ns

The comparison of -O3 vs -O3 -fno-math-errno on the ARM chip is the equivalent of /fp:precise vs /fp:fast on the MSVC compiler, so it’s very interesting that we see a large performance difference here on ARM with GCC, while on the x86-64 chips with MSVC we saw no difference at all. On the Raspberry Pi we observe over a 25% speed improvement by using -fno-math-errno.

3 Float Loop Performance

I also elected to run some further tests on the setup where we have a loop calling sqrtf three times, because this resulted in some large differences in the code produced. The function for this test looks like this:

struct Float3
{
    float f[3];
};
__declspec(noinline) Float3 __vectorcall sqrt_float3_no_inline(Float3 m)
{
    Float3 o;
    for (size_t i = 0; i < 3; i++)
    {
        o.f[i] = sqrtf(m.f[i]);
    }
    return o;
}

Below is the assembly produced for each of the results on MSVC and the performance results on AMD Ryzen 5 5600X. In this test case, we do observe a very small performance improvement between /fp:precise and /fp:fast.

/fp:strict (5.54 ns)

Float3 sqrt_float3_no_inline(Float3) PROC
$LN12:
        push    rbx
        sub     rsp, 80
        movaps  XMMWORD PTR [rsp+64], xmm6
        movaps  xmm6, xmm0
        unpcklps xmm6, xmm1
        movaps  xmm0, xmm6
        movaps  XMMWORD PTR [rsp+48], xmm7
        movd    ebx, xmm2
        call    QWORD PTR __imp_sqrtf
        movaps  xmm7, xmm0
        movaps  xmm0, xmm6
        shufps  xmm0, xmm0, 85
        call    QWORD PTR __imp_sqrtf
        movaps  xmm6, xmm0
        movd    xmm0, ebx
        call    QWORD PTR __imp_sqrtf
        movaps  xmm2, xmm0
        movaps  xmm1, xmm6
        movaps  xmm6, XMMWORD PTR [rsp+64]
        movaps  xmm0, xmm7
        movaps  xmm7, XMMWORD PTR [rsp+48]
        add     rsp, 80
        pop     rbx
        ret     0
Float3 sqrt_float3_no_inline(Float3) ENDP

/fp:precise (3.55 ns)

Float3 sqrt_float3_no_inline(Float3) PROC
$LN21:
        sub     rsp, 88       
        movaps  XMMWORD PTR [rsp+64], xmm6
        xorps   xmm3, xmm3
        ucomiss xmm3, xmm0
        movaps  XMMWORD PTR [rsp+48], xmm7
        movaps  xmm6, xmm1
        movaps  XMMWORD PTR [rsp+32], xmm8
        movaps  xmm7, xmm2
        ja      SHORT $LN15@sqrt_float
        xorps   xmm8, xmm8
        sqrtss  xmm8, xmm0
        jmp     SHORT $LN16@sqrt_float
$LN15@sqrt_float:
        call    sqrtf
        movaps  xmm8, xmm0
$LN16@sqrt_float:
        xorps   xmm0, xmm0
        ucomiss xmm0, xmm6
        ja      SHORT $LN13@sqrt_float
        sqrtss  xmm6, xmm6
        jmp     SHORT $LN14@sqrt_float
$LN13@sqrt_float:
        movaps  xmm0, xmm6
        call    sqrtf
        movaps  xmm6, xmm0
$LN14@sqrt_float:
        xorps   xmm0, xmm0
        ucomiss xmm0, xmm7
        ja      SHORT $LN11@sqrt_float
        xorps   xmm0, xmm0
        sqrtss  xmm0, xmm7
        jmp     SHORT $LN12@sqrt_float
$LN11@sqrt_float:
        movaps  xmm0, xmm7
        call    sqrtf
$LN12@sqrt_float:
        movaps  xmm7, XMMWORD PTR [rsp+48]
        movaps  xmm2, xmm0
        movaps  xmm0, xmm8
        movaps  xmm1, xmm6
        movaps  xmm8, XMMWORD PTR [rsp+32]
        movaps  xmm6, XMMWORD PTR [rsp+64]
        add     rsp, 88  
        ret     0
Float3 sqrt_float3_no_inline(Float3) ENDP

/fp:fast (3.36 ns)

Float3 sqrt_float3_no_inline(Float3) PROC
        sqrtss  xmm0, xmm0
        sqrtss  xmm1, xmm1
        sqrtss  xmm2, xmm2
        ret     0
Float3 sqrt_float3_no_inline(Float3) ENDP

Closing thoughts

I started writing this post because I noticed additional complexity when calling sqrtf and it’s led me down a bit of a rabbit hole! But it’s been an interesting journey. The things I’ve taken away from this are:

  • Math.h functions can internally set errno to help diagnose when bad values are passed to them. Disabling this functionality can produce faster and simpler code.
  • Compiler options such as -ffast-math are not all or nothing and contain many individual compiler options so we can select safe ones without unwanted side effects.
  • This little experiment reinforces that you’ve always got to measure performance; you can’t make assumptions about how things will perform on different hardware or platforms. It’s also shown me that when using tools such as Google Benchmark, you need to be very careful that the compiler doesn’t optimise away the code you intended to measure. Don’t underestimate the compiler!