rope: Micro optimize the creation of masks (#41132)
Adam Richardson
Using Compiler Explorer, I saw that the compiler wasn't clever enough to
optimise away the branches in the masking code. I thought the compiler
would have a better chance if we always branched, which [turned out to
be the case](https://godbolt.org/z/PM594Pz18).
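For readers without the diff open, here is a minimal sketch of the kind of mask construction being discussed. It is not the code from this PR: the function names, the `u128` bitmap width, and the use of `checked_shl` are all assumptions made for illustration. The first form special-cases the full-width mask; the second expresses every input through one uniform expression, which may be easier for the optimiser to handle.

```rust
/// Bit width of the hypothetical bitmap type used in this sketch.
const BITS: u32 = u128::BITS;

/// Special-cased form (hypothetical): a mask with the low `n` bits set.
/// The full-width case needs its own arm because `1u128 << 128` overflows
/// the shift amount and panics in debug builds.
pub fn low_mask_branchy(n: u32) -> u128 {
    debug_assert!(n <= BITS);
    if n == BITS {
        u128::MAX
    } else {
        (1u128 << n) - 1
    }
}

/// Uniform form (hypothetical): `checked_shl` returns `None` when
/// `n >= 128`, which is mapped to 0 here, and `0u128.wrapping_sub(1)`
/// is all ones, so the full-width case needs no explicit `if`.
pub fn low_mask_checked(n: u32) -> u128 {
    1u128.checked_shl(n).unwrap_or(0).wrapping_sub(1)
}
```

Both functions return the same value for every `n` in `0..=128`. Whether either one compiles to branch-free machine code depends on the target and compiler version, so it is worth checking the output in Compiler Explorer rather than assuming.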
Running the benchmarks, the biggest improvement I saw was:
```
push/65536              time:   [2.9067 ms 2.9243 ms 2.9417 ms]
                        thrpt:  [21.246 MiB/s 21.373 MiB/s 21.502 MiB/s]
                 change:
                        time:   [-8.3452% -7.2617% -6.2009%] (p = 0.00 < 0.05)
                        thrpt:  [+6.6108% +7.8303% +9.1050%]
                        Performance has improved.
```
But I did also see some regressions:
```
slice/4096              time:   [66.195 µs 66.815 µs 67.448 µs]
                        thrpt:  [57.915 MiB/s 58.464 MiB/s 59.012 MiB/s]
                 change:
                        time:   [+3.7131% +5.1698% +6.6971%] (p = 0.00 < 0.05)
                        thrpt:  [-6.2768% -4.9157% -3.5802%]
                        Performance has regressed.
```
Release Notes:
- N/A