A naive way of generating a random direction d(x,y,z) would to take random values for x,y,z and then normalize the resulting vector d. But this will lead to a skewed distribution where too many samples fall in the "cube corners" directions.

A proper way to generate a random direction d(x,y,z) is to do the same, but add rejection-sampling.

The downsides of the proper method are:

- It is slower.
- Freak occurrences of having to retry many times.
- It has branch instructions in it.

I want the fastest possible, 8x SIMD way of generating random vectors. That means, no branching. And that got me thinking about a direction generator that is fast, but less skewed than the naive way. We would tolerate a little bias at the benefit of speed.

An approach I came up with: just generate two sets of coordinates, yielding two candidates. Use the candidate with the shortest length. Picking that candidate can be achieved without any branching, by just using the _mm256_blendv_ps intrinsic.

// 8-way vector for the candidates a: __m256 cand_ax = [random values -1..1] __m256 cand_bx = [random values -1..1] __m256 cand_cx = [random values -1..1] // get lengths for candidates in a and b. __m256 len_a = length_8way( cand_ax, cand_ay, cand_az ); __m256 len_b = length_8way( cand_bx, cand_by, cand_bz ); // pick the shortest candidates in each lane. __m256 a_is_shorter = _mm256_cmp_ps( len_a, len_b, _CMP_LE_OS ); __m256 cand_x = _mm256_blend_ps( cand_bx, cand_az, a_is_shorter ); __m256 cand_y = _mm256_blend_ps( cand_by, cand_ay, a_is_shorter ); __m256 cand_z = _mm256_blend_ps( cand_bz, cand_az, a_is_shorter ); __m256 len = _mm256_blend_ps( len_a, len_b, a_is_shorter ); // normalize __m256 ilen = _mm256_rcp_ps(len); __m256 x = _mm256_mul_ps( cand_x, ilen ); __m256 y = _mm256_mul_ps( cand_y, ilen ); __m256 z = _mm256_mul_ps( cand_z, ilen );

What is this all good for? You can generate noise fields with random gradients and not having to resort to lookup tables. No pesky gathers either! Just create the gradient on-the-fly, each time you sample the field. There is no memory bottle neck, it is all plain computing, without any branching. Added bonus: no wrap-around of the field due to limited size of lookup table.

**Postscriptum**