In sponge-based constructions, the block size is not the same as
the state size. For practical purposes, it's the same as the rate.
Size this is a constant for a given type, we don't need to keep
a copy of that value in the state itself. Just use the constant
directly. This saves some bytes and may even be slightly faster.
More importantly:
Fixes#14128
The HKDF extract function uses HMAC under the hood, but requiring
applications to directly use HMAC functions reduces clarity and
feels like the wrong abstraction.
So, in order to get the PRK length, add a `prk_length` constant
that applications can use directly.
Also add an `extractInit()` function for cases where the keying
material has to be provided as multiple chunks.
Does what the name says: rejects generators of low-order groups.
`clearCofactor()` was previously used to do it, but for e.g.
cofactored signature verification, we don't need the result of an
actual multiplication. Only check that we didn't end up with a
low-order point, which is a faster operation.
If the noise parameter was null, we didn't use any noise at all.
We unconditionally generated random noise (`noise2`) but didn't use it.
Spotted by @cryptocode, thanks!
* Update the AEGIS specification URL to the current draft
* std.crypto.auth: add AEGIS MAC
The Pelican-based authentication function of the AEGIS construction
can be used independently from authenticated encryption, as a faster
and more secure alternative to GHASH/POLYVAL/Poly1305.
We already expose GHASH, POLYVAL and Poly1305 for use outside AES-GCM
and ChaChaPoly, so there are no reasons not to expose the MAC from AEGIS
as well.
Like other 128-bit hash functions, finding a collision only requires
~2^64 attempts or inputs, which may still be acceptable for many
practical applications.
Benchmark (Apple M1):
siphash128-1-3: 3222 MiB/s
ghash: 8682 MiB/s
aegis-128l mac: 12544 MiB/s
Benchmark (Zen 2):
siphash128-1-3: 4732 MiB/s
ghash: 5563 MiB/s
aegis-128l mac: 19270 MiB/s
POLYVAL is GHASH's little brother, required by the AES-GCM-SIV
construction. It's defined in RFC8452.
The irreducible polynomial is a mirror of GHASH's (which doesn't
change anything in our implementation that didn't reverse the raw
bits to start with).
But most importantly, POLYVAL encodes byte strings as little-endian
instead of big-endian, which makes it a little bit faster on the
vast majority of modern CPUs.
So, both share the same code, just with comptime magic to use the
correct endianness and only double the key for GHASH.
* std.crypto.onetimeauth.ghash: faster GHASH on modern CPUs
Carryless multiplication was slow on older Intel CPUs, justifying
the need for using Karatsuba multiplication.
This is not the case any more; using 4 multiplications to multiply
two 128-bit numbers is actually faster than 3 multiplications +
shifts and additions.
This is also true on aarch64.
Keep using Karatsuba only when targeting x86 (granted, this is a bit
of a brutal shortcut, we should really list all the CPU models that
had a slow clmul instruction).
Also remove useless agg_2 treshold and restore the ability to
precompute only H and H^2 in ReleaseSmall.
Finally, avoid using u256. Using 128-bit registers is actually faster.
* Use a switch, add some comments
In the process of 'remaining blocks',
the length of processed message can be from 1 to 79.
The value of 'n-1' is ranged from 0 to 3.
So, st.hx[i] must be initialized at least from st.hx[0] to st.hx[3]
These constants were read as a block count in initForBlockCount()
but at the same time, as a size in update().
The unit could be blocks or bytes, but we should use the same one
everywhere.
So, use blocks as intended.
Fixes#13506
* std.crypto: make ghash faster, esp. for small messages
Aggregated reduction requires 5 additional multiplications (to
precompute the powers of H), in order to save 2 multiplications
per batch.
So, only use large batches when it's actually interesting to do so.
For the last blocks, reuse the precomputations in order to perform
a single reduction.
Also, even in .ReleaseSmall, allow 2-block aggregation.
The speedup is worth it, and the code increase is reasonable.
And in .ReleaseFast, bump the upper batch size up to 16.
Leverage comptime by the way instead of duplicating code.
std/crypto/benchmark.zig on Apple M1:
Zig 0.10.0: 2769 MiB/s
Before: 6014 MiB/s
After: 7334 MiB/s
Normalize function names by the way.
* Change clmul() to accept the half to be processed
This avoids a bunch of truncate() calls.
* Add more ghash tests to check all code paths
* crypto.core.aes: process 6 block in parallel instead of 8 on aarch64
At least on Apple Silicon, this is slightly faster than 8 blocks.
* AES: add parallel blocks for tigerlake, rocketlake, alderlake, zen3
...instead of hard-coding it to 20.
- This is consistent with the ChaCha implementation
- NaCl and libsodium, that this API is designed to interop with,
also support 8 and 12 round variants. The 12 round variant, in
particular, provides the same security level as the 20 round variant,
but is obviously faster.
- scrypt currently uses its own non optimized version of Salsa, just
because it use 8 rounds instead of 20. This will help remove code
duplication.
No behavior nor public API changes. The Salsa20 and XSalsa20 still
represent the 20-round variant.