Two years ago, I wrote a function that solves quadratic equations eight at a time using AVX. Lately, I have been looking into ARM NEON to see what performance can be had on mobile devices. This is what I came up with. The square root implementation is available from pmeerw (the link is in the comment below).
#include <arm_neon.h>

/*
 * Solve a*X^2 + b*X + c = 0.
 * Solves 4 instances at the same time, using NEON SIMD without any branching to avoid stalls.
 * Returns two solutions per equation in root0 and root1.
 * Returns FLT_UNDF in both roots if there is no real solution due to the discriminant being negative.
 * For sqrtv() see: https://pmeerw.net/blog/programming/neon1.html
 * I've put the reciprocal of (2*a) in the argument list as this one is fixed in my particular problem.
 */
inline void evaluate_quadratic4
(
	float32x4_t a,
	float32x4_t twoa_recip,
	float32x4_t b,
	float32x4_t c,
	float32x4_t* __restrict root0,
	float32x4_t* __restrict root1
)
{
	const float32x4_t four4  = vdupq_n_f32( 4.0f );
	const float32x4_t undf4  = vdupq_n_f32( FLT_UNDF );
	const float32x4_t minb   = vnegq_f32( b );             // -b
	const float32x4_t bb     = vmulq_f32( b, b );          // b*b
	const float32x4_t foura  = vmulq_f32( four4, a );      // 4*a
	const float32x4_t fourac = vmulq_f32( foura, c );      // 4*a*c
	const float32x4_t det    = vsubq_f32( bb, fourac );    // b*b - 4*a*c (the discriminant)
	// Real roots exist only where the discriminant is non-negative, i.e. 4*a*c <= b*b.
	const uint32x4_t dvalid  = vcleq_f32( fourac, bb );    // per-lane mask: all ones where valid
	const float32x4_t sr     = sqrtv( det );               // approximation of sqrt( b*b - 4*a*c )
	float32x4_t r0 = vaddq_f32( minb, sr );                // -b + sqrt( b*b - 4*a*c )
	float32x4_t r1 = vsubq_f32( minb, sr );                // -b - sqrt( b*b - 4*a*c )
	r0 = vmulq_f32( r0, twoa_recip );                      // ( -b + sqrt( b*b - 4*a*c ) ) / (2*a)
	r1 = vmulq_f32( r1, twoa_recip );                      // ( -b - sqrt( b*b - 4*a*c ) ) / (2*a)
	// Replace the results of lanes with a negative discriminant by FLT_UNDF.
	*root0 = vbslq_f32( dvalid, r0, undf4 );
	*root1 = vbslq_f32( dvalid, r1, undf4 );
}
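The compare-and-select pair at the end is what keeps the function free of branches: vcleq_f32 produces a per-lane bit mask, and vbslq_f32 uses that mask to pick bits from one of two inputs. As a rough illustration of those semantics, here is a plain C sketch of a single lane. The helper names cle_lane and bsl_lane are made up for this example, and the code says nothing about NEON performance.

```c
#include <stdint.h>
#include <string.h>

/* One lane of vcleq_f32: all bits set if x <= y, all bits clear otherwise. */
static uint32_t cle_lane(float x, float y)
{
    return (x <= y) ? 0xFFFFFFFFu : 0u;
}

/* One lane of vbslq_f32: take bits from a where mask is set, from b elsewhere. */
static float bsl_lane(uint32_t mask, float a, float b)
{
    uint32_t ua, ub, ur;
    float r;
    memcpy(&ua, &a, sizeof ua);       /* reinterpret the float bits as an integer */
    memcpy(&ub, &b, sizeof ub);
    ur = (ua & mask) | (ub & ~mask);  /* bitwise select */
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

With a mask from cle_lane( fourac, bb ), bsl_lane( mask, r0, undf ) yields r0 for a valid lane and undf otherwise, which mirrors what happens to each of the four lanes above.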
I benchmarked this code on a Galaxy Note 4 running Android, against a scalar version. The speed-up I measured was 3.7x, which I think is pretty good: it is close to the 4x that the four-lane width would suggest.
So, yeah, writing intrinsics is totally justified. Doubly so because I tried to have the compiler auto-vectorize the scalar code, but whichever flag I tried, the results hardly differed: -fno-vectorize, -ftree-vectorize and -fslp-vectorize-aggressive all showed the same performance.