Does what the name says: rejects generators of low-order groups.
`clearCofactor()` was previously used for this, but for e.g.
cofactored signature verification, we don't need the result of an
actual multiplication; we only need to check that we didn't end up
with a low-order point, which is a faster operation.
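A minimal sketch of the intended usage; the `Edwards25519.rejectLowOrder()` name and `WeakPublicKey` error follow the std.crypto.ecc conventions, but check the current source for the exact API:
```
const std = @import("std");
const Edwards25519 = std.crypto.ecc.Edwards25519;

test "reject low-order points without a full multiplication" {
    // The identity element (y = 1) is a trivially low-order point.
    var bytes = [_]u8{0} ** 32;
    bytes[0] = 1;
    const p = try Edwards25519.fromBytes(bytes);
    // Cheaper than clearCofactor(): we never compute the result of an
    // actual scalar multiplication, we only check the point's order.
    try std.testing.expectError(error.WeakPublicKey, p.rejectLowOrder());
    // A proper generator passes the check.
    try Edwards25519.basePoint.rejectLowOrder();
}
```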
If the noise parameter was null, we didn't use any noise at all:
we unconditionally generated random noise (`noise2`) but then never used it.
Spotted by @cryptocode, thanks!
* Update the AEGIS specification URL to the current draft
* std.crypto.auth: add AEGIS MAC
The Pelican-based authentication function of the AEGIS construction
can be used independently from authenticated encryption, as a faster
and more secure alternative to GHASH/POLYVAL/Poly1305.
We already expose GHASH, POLYVAL and Poly1305 for use outside AES-GCM
and ChaChaPoly, so there is no reason not to expose the AEGIS MAC
as well.
As with other 128-bit hash functions, finding a collision only requires
~2^64 attempts or inputs, which may still be acceptable for many
practical applications.
Benchmark (Apple M1):
siphash128-1-3: 3222 MiB/s
ghash: 8682 MiB/s
aegis-128l mac: 12544 MiB/s
Benchmark (Zen 2):
siphash128-1-3: 4732 MiB/s
ghash: 5563 MiB/s
aegis-128l mac: 19270 MiB/s
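A short usage sketch, assuming the new type is exposed as `std.crypto.auth.aegis.Aegis128LMac` and follows the same one-shot `create()` convention as the other MACs:
```
const std = @import("std");
const Aegis128LMac = std.crypto.auth.aegis.Aegis128LMac;

test "one-shot AEGIS-128L MAC" {
    const key = [_]u8{0x42} ** Aegis128LMac.key_length;
    var tag: [Aegis128LMac.mac_length]u8 = undefined;
    Aegis128LMac.create(&tag, "message to authenticate", &key);
}
```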
POLYVAL is GHASH's little brother, required by the AES-GCM-SIV
construction. It's defined in RFC8452.
The irreducible polynomial is a mirror of GHASH's, which doesn't
change anything in our implementation, since it didn't reverse the
raw bits to start with.
But most importantly, POLYVAL encodes byte strings as little-endian
instead of big-endian, which makes it a little bit faster on the
vast majority of modern CPUs.
So, both share the same code, just with comptime magic to use the
correct endianness and only double the key for GHASH.
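Since both now share one implementation, their usage is identical; a sketch assuming the `std.crypto.onetimeauth.Ghash` and `Polyval` names:
```
const std = @import("std");
const Ghash = std.crypto.onetimeauth.Ghash;
const Polyval = std.crypto.onetimeauth.Polyval;

test "GHASH and POLYVAL expose the same API" {
    const key = [_]u8{0x17} ** Ghash.key_length;
    const msg = "16-byte blocks, different endianness";
    var tag_ghash: [Ghash.mac_length]u8 = undefined;
    var tag_polyval: [Polyval.mac_length]u8 = undefined;
    Ghash.create(&tag_ghash, msg, &key);
    Polyval.create(&tag_polyval, msg, &key);
    // Same code underneath; the byte order (and the key doubling that
    // only GHASH performs) makes the tags differ.
}
```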
* std.crypto.onetimeauth.ghash: faster GHASH on modern CPUs
Carryless multiplication was slow on older Intel CPUs, justifying
the use of Karatsuba multiplication.
This is not the case any more; using 4 multiplications to multiply
two 128-bit numbers is actually faster than 3 multiplications +
shifts and additions.
This is also true on aarch64.
Keep using Karatsuba only when targeting x86 (granted, this is a bit
of a brutal shortcut; we should really list all the CPU models that
had a slow clmul instruction).
Also remove the useless agg_2 threshold, and restore the ability to
precompute only H and H^2 in ReleaseSmall.
Finally, avoid using u256. Using 128-bit registers is actually faster.
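To make the 4-vs-3 multiplication trade-off concrete, here is a sketch of both decompositions, with a slow software `clmulSoft` standing in for the hardware carryless-multiply instruction (illustrative only; the real code works on vector registers):
```
const std = @import("std");

// Stand-in for pclmulqdq/pmull: carryless (polynomial) multiplication
// of two 64-bit halves into a 128-bit product.
fn clmulSoft(x: u64, y: u64) u128 {
    var r: u128 = 0;
    var i: u7 = 0;
    while (i < 64) : (i += 1) {
        if ((y >> @intCast(i)) & 1 == 1) r ^= @as(u128, x) << i;
    }
    return r;
}

// Schoolbook: 4 multiplications, minimal XOR fixup.
fn mulWide4(a_hi: u64, a_lo: u64, b_hi: u64, b_lo: u64) [2]u128 {
    const lo = clmulSoft(a_lo, b_lo);
    const hi = clmulSoft(a_hi, b_hi);
    const mid = clmulSoft(a_lo, b_hi) ^ clmulSoft(a_hi, b_lo);
    return .{ lo ^ (mid << 64), hi ^ (mid >> 64) };
}

// Karatsuba: 3 multiplications, at the cost of extra XORs and shifts,
// using (a_lo ^ a_hi) * (b_lo ^ b_hi) ^ lo ^ hi == mid in GF(2).
fn mulWide3(a_hi: u64, a_lo: u64, b_hi: u64, b_lo: u64) [2]u128 {
    const lo = clmulSoft(a_lo, b_lo);
    const hi = clmulSoft(a_hi, b_hi);
    const mid = clmulSoft(a_lo ^ a_hi, b_lo ^ b_hi) ^ lo ^ hi;
    return .{ lo ^ (mid << 64), hi ^ (mid >> 64) };
}

test "both decompositions compute the same 256-bit product" {
    const r4 = mulWide4(0xdead, 0xbeef, 0x1234, 0x5678);
    const r3 = mulWide3(0xdead, 0xbeef, 0x1234, 0x5678);
    try std.testing.expectEqual(r4, r3);
}
```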
* Use a switch, add some comments
When processing the 'remaining blocks', the length of the message
still to be processed can be from 1 to 79 bytes, so the value of
'n - 1' ranges from 0 to 3.
Therefore, st.hx[i] must be initialized at least for st.hx[0]
through st.hx[3].
These constants were read as a block count in initForBlockCount(),
but at the same time as a byte size in update().
The unit could be blocks or bytes, but we should use the same one
everywhere.
So, use blocks, as intended.
Fixes #13506
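A sketch of the intended usage after the fix, assuming a (key, block_count) parameter order for `initForBlockCount()` (check the current signature):
```
const std = @import("std");
const Ghash = std.crypto.onetimeauth.Ghash;

test "initForBlockCount takes a count in blocks, not bytes" {
    const key = [_]u8{0x2a} ** Ghash.key_length;
    const msg = [_]u8{0} ** (4 * Ghash.block_length);
    var tag: [Ghash.mac_length]u8 = undefined;
    // Pass msg.len / block_length (blocks), not msg.len (bytes).
    var st = Ghash.initForBlockCount(&key, msg.len / Ghash.block_length);
    st.update(&msg);
    st.final(&tag);
}
```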
* std.crypto: make ghash faster, esp. for small messages
Aggregated reduction requires 5 additional multiplications (to
precompute the powers of H) in order to save 2 multiplications
per batch.
So, only use large batches when doing so actually pays off.
For the last blocks, reuse the precomputations in order to perform
a single reduction.
Also, even in .ReleaseSmall, allow 2-block aggregation.
The speedup is worth it, and the code increase is reasonable.
And in .ReleaseFast, bump the upper batch size up to 16.
Along the way, leverage comptime instead of duplicating code.
std/crypto/benchmark.zig on Apple M1:
Zig 0.10.0: 2769 MiB/s
Before: 6014 MiB/s
After: 7334 MiB/s
Also normalize function names.
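To illustrate the aggregation: a sketch using a slow, compact GF(2^128) multiplication. Here every `gfmul` call reduces; in the real code the per-block products are accumulated unreduced and a single reduction is performed per batch, which is where the savings come from:
```
const std = @import("std");

// Bit-at-a-time multiplication in GF(2^128) with the GHASH polynomial
// x^128 + x^7 + x^2 + x + 1, in non-reversed bit order.
fn gfmul(a: u128, b: u128) u128 {
    var r: u128 = 0;
    var x = a;
    var y = b;
    while (y != 0) : (y >>= 1) {
        if (y & 1 == 1) r ^= x;
        const carry = x >> 127;
        x <<= 1;
        if (carry == 1) x ^= 0x87; // x^128 = x^7 + x^2 + x + 1
    }
    return r;
}

test "aggregated evaluation matches block-by-block Horner evaluation" {
    const h: u128 = 0x0123456789abcdef_fedcba9876543210;
    const m = [_]u128{ 1, 2, 3, 4 };
    // Horner: one multiplication (and one reduction) per block.
    var acc: u128 = 0;
    for (m) |block| acc = gfmul(acc ^ block, h);
    // Aggregated: precompute H^2..H^4 once, then evaluate whole batches.
    const h2 = gfmul(h, h);
    const h3 = gfmul(h2, h);
    const h4 = gfmul(h3, h);
    const agg = gfmul(m[0], h4) ^ gfmul(m[1], h3) ^
        gfmul(m[2], h2) ^ gfmul(m[3], h);
    try std.testing.expectEqual(acc, agg);
}
```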
* Change clmul() to accept the half to be processed
This avoids a bunch of truncate() calls.
* Add more ghash tests to check all code paths
* crypto.core.aes: process 6 blocks in parallel instead of 8 on aarch64
At least on Apple Silicon, this is slightly faster than 8 blocks.
* AES: add parallel blocks for tigerlake, rocketlake, alderlake, zen3
...instead of hard-coding it to 20.
- This is consistent with the ChaCha implementation
- NaCl and libsodium, which this API is designed to interoperate with,
also support 8- and 12-round variants. The 12-round variant, in
particular, provides the same security level as the 20-round variant,
but is obviously faster.
- scrypt currently uses its own non-optimized version of Salsa, just
because it uses 8 rounds instead of 20. This will help remove code
duplication (see the sketch below).
No behavior or public API changes. The Salsa20 and XSalsa20 types still
represent the 20-round variant.
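A sketch of what a comptime round-count parameter enables; illustrative only, the actual std.crypto implementation is organized differently:
```
const std = @import("std");
const rotl = std.math.rotl;

fn SalsaPermutation(comptime rounds: comptime_int) type {
    comptime std.debug.assert(rounds % 2 == 0);
    return struct {
        fn qr(x: *[16]u32, a: usize, b: usize, c: usize, d: usize) void {
            x[b] ^= rotl(u32, x[a] +% x[d], 7);
            x[c] ^= rotl(u32, x[b] +% x[a], 9);
            x[d] ^= rotl(u32, x[c] +% x[b], 13);
            x[a] ^= rotl(u32, x[d] +% x[c], 18);
        }
        pub fn permute(x: *[16]u32) void {
            var i: usize = 0;
            while (i < rounds) : (i += 2) {
                // Column round followed by row round: one double round.
                qr(x, 0, 4, 8, 12);
                qr(x, 5, 9, 13, 1);
                qr(x, 10, 14, 2, 6);
                qr(x, 15, 3, 7, 11);
                qr(x, 0, 1, 2, 3);
                qr(x, 5, 6, 7, 4);
                qr(x, 10, 11, 8, 9);
                qr(x, 15, 12, 13, 14);
            }
        }
    };
}

test "one core for the 8-, 12- and 20-round variants" {
    var st = [_]u32{0} ** 16;
    st[0] = 1;
    SalsaPermutation(8).permute(&st);
    SalsaPermutation(20).permute(&st);
}
```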
Rewrite GHASH to use 128-bit multiplication over non-reversed
integers, and up to 8 blocks aggregated reduction.
lib/std/crypto/benchmark.zig results:
Xeon E5:
Before: 1604 MiB/s
After: 4005 MiB/s
Apple M1:
Before: 2769 MiB/s
After: 6014 MiB/s
As a side effect, this also makes AES-GCM faster.
Comptime code can't execute assembly code, so we need some way to
force comptime code to use the generic path. This should be replaced
with whatever is implemented for #868, when that day comes.
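The workaround looks roughly like this; a sketch where the compression functions are placeholder stubs, and `@inComptime()` stands in for whatever #868 ends up providing:
```
const std = @import("std");
const builtin = @import("builtin");

// Placeholder stubs for the portable and inline-assembly round functions.
fn compressGeneric(st: *[8]u32) void {
    st[0] +%= 1;
}
fn compressAsm(st: *[8]u32) void {
    st[0] +%= 1;
}

fn compress(st: *[8]u32) void {
    // Comptime execution cannot run inline assembly, so always take the
    // generic path there.
    if (@inComptime()) return compressGeneric(st);
    const has_sha = comptime (builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .sha));
    if (has_sha) compressAsm(st) else compressGeneric(st);
}

test "usable both at comptime and at runtime" {
    comptime {
        var st = [_]u32{0} ** 8;
        compress(&st);
    }
    var st = [_]u32{0} ** 8;
    compress(&st);
}
```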
I am seeing that the result for the hash is incorrect in stage1 and
crashes stage2, so presumably this never worked correctly. I will follow
up on that soon.
This gets us most of the way back to the performance I had when
I was using the LLVM intrinsics:
- Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz:
190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s
- AMD EPYC 7763 (VM) @ 2.45 GHz:
240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s
- Apple M1:
216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s
Minor changes to this source can swing performance from 400 MB/s to
1400 MB/s or... 20 MB/s, depending on how it interacts with the
optimizer. I have a sneaking suspicion that despite LLVM inheriting
GCC's extremely strict inline assembly semantics, its passes are
rather skittish around inline assembly (and almost certainly, its
instruction cost models can assume nothing).
There's probably plenty of room to optimize these further in the
future, but for the moment this gives ~3x improvement on Intel
x86-64 processors, ~5x on AMD, and ~10x on M1 Macs.
These extensions are very new: most processors prior to 2020 do
not support them.
AVX-512 is a slightly older alternative that we could use on Intel
for a much bigger performance bump, but it's been fused off on
Intel's latest hybrid architectures and it relies on computing
independent SHA hashes in parallel. In contrast, these SHA intrinsics
provide the usual single-threaded, single-stream interface, and should
continue working on new processors.
AArch64 also has SHA-512 intrinsics that we could take advantage
of in the future.
Similar to what was done for EdDSA, allow incremental creation
and verification of ECDSA signatures.
Doing so for ECDSA is trivial, and can be useful for TLS as well
as the future package manager.
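A sketch of the incremental flow, assuming it mirrors the EdDSA Signer/Verifier API (exact names and signatures vary across Zig versions):
```
const std = @import("std");
const Ecdsa = std.crypto.sign.ecdsa.EcdsaP256Sha256;

test "incremental ECDSA signing and verification" {
    const kp = try Ecdsa.KeyPair.create(null);
    var signer = try kp.signer(null);
    signer.update("jumbo ");
    signer.update("payload");
    const sig = try signer.finalize();
    var verifier = try sig.verifier(kp.public_key);
    verifier.update("jumbo ");
    verifier.update("payload");
    try verifier.verify();
}
```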
This commit accepts unusual parameter combinations like EcdsaP384Sha256.
Some certificates (the certs below are in /etc/ssl/certs/ca-certificates.crt on Ubuntu 22.04) use EcdsaP384Sha256 to sign themselves:
- Subject: C=GR, L=Athens, O=Hellenic Academic and Research Institutions Cert. Authority, CN=Hellenic Academic and Research Institutions ECC RootCA 2015
- Subject: C=US, ST=Texas, L=Houston, O=SSL Corporation, CN=SSL.com EV Root Certification Authority ECC
- Subject: C=US, ST=Texas, L=Houston, O=SSL Corporation, CN=SSL.com Root Certification Authority ECC
In verify(), the hash array `h` is allocated to be no smaller than scalar.encoded_length.
The array is treated as big-endian: the hash bytes fill the back of the array, and the remaining bytes at the front are set to zero.
In sign(), the hash array is allocated and filled the same way as in verify().
In deterministicScalar(), the hash bytes may be insufficient to generate `k`.
To generate `k` without narrowing its value range, this commit uses step h of "Section 3.2 Generation of k" in RFC 6979.
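A sketch of the padding described above, with concrete sizes matching EcdsaP384Sha256 (48-byte scalars, 32-byte digests):
```
const std = @import("std");

test "right-align a short digest in a big-endian scalar-sized buffer" {
    const scalar_len = 48; // P-384 scalar length
    var digest: [32]u8 = undefined; // SHA-256 output
    std.crypto.hash.sha2.Sha256.hash("message", &digest, .{});
    // Zero-fill the front, copy the digest to the back: read as a
    // big-endian integer, the buffer has the same value as the digest.
    var h = [_]u8{0} ** scalar_len;
    @memcpy(h[scalar_len - digest.len ..], &digest);
}
```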
This reverts commit 7cbd586ace.
This is causing a failure to build from source:
```
./lib/std/fmt.zig:492:17: error: cannot format optional without a specifier (i.e. {?} or {any})
@compileError("cannot format optional without a specifier (i.e. {?} or {any})");
^
./src/link/MachO/Atom.zig:544:26: note: called from here
log.debug(" RELA({s}) @ {x} => %{d} in object({d})", .{
^
```
I looked at the code to fix it but none of those args are optionals.
Key blinding allows public keys to be augmented with a secret
scalar, making multiple signatures from the same signer unlinkable.
https://datatracker.ietf.org/doc/draft-dew-cfrg-signature-key-blinding/
This is required by privacy-preserving applications such as Tor
onion services and the PrivacyPass protocol.
Gimli was a game changer. A permutation that is large enough to be
used in sponge-like constructions, yet small enough to be compact
to implement and fast on a wide range of platforms.
And Gimli being part of the Zig standard library was awesome.
But since then, Gimli entered the NIST Lightweight Cryptography
Competition, competing against other candidates sharing a similar set
of properties.
Unfortunately, Gimli didn't pass the 3rd round.
There are no practical attacks against Gimli when used correctly, but
NIST's decision means that Gimli is unlikely to ever get any traction.
So, maybe the time has come to move Gimli from the standard library
to another repository.
We shouldn't do it without providing an alternative, though.
And the best candidate for this is probably Xoodoo.
Xoodoo is the core function of Xoodyak, one of the finalists of the
NIST LWC competition, and the most direct competitor to Gimli. It is
also a 384-bit permutation, so it can easily be used everywhere Gimli
was used with no parameter changes.
It is the building block of Xoodyak (for actual encryption and hashing)
as well as Charm, which some Zig applications are already using.
Like Gimli, by which it was heavily inspired, it is compact and
suitable for constrained environments.
This change adds the Xoodoo permutation to std.crypto.core.
The set of public functions includes everything required to later
implement existing Xoodoo-based constructions.
In order to prepare for the Gimli deprecation, the default
CSPRNG was changed to a Xoodoo-based one that works exactly the same way.
A hash function cascade was a common way to avoid length-extension
attacks with traditional hash functions such as the SHA-2 family.
Add `std.crypto.hash.composition` to do exactly that using arbitrary
hash functions, and pre-define the common SHA2-based ones.
With this, we can now sign and verify Bitcoin signatures in pure Zig.
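A usage sketch, assuming the predefined cascade is exposed as `std.crypto.hash.composition.Sha256oSha256`:
```
const std = @import("std");
const Sha256oSha256 = std.crypto.hash.composition.Sha256oSha256;

test "SHA-256(SHA-256()) one-shot hashing" {
    var out: [Sha256oSha256.digest_length]u8 = undefined;
    // The outer hash hides the inner state, preventing length extension.
    Sha256oSha256.hash("hello, Bitcoin", &out, .{});
}
```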
* std.crypto.ecc: add support for the secp256k1 curve
Usage of the secp256k1 elliptic curve recently grew exponentially,
since this is the curve used by Bitcoin and other popular blockchains
such as Ethereum.
With this, Zig has support for all the widely deployed elliptic curves
today.
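A sketch of basic use of the new curve; `mul` and `toCompressedSec1` follow the conventions of the other Weierstrass curves in std.crypto.ecc (endian enum casing varies across Zig versions):
```
const std = @import("std");
const Secp256k1 = std.crypto.ecc.Secp256k1;

test "secp256k1 public-key derivation" {
    // A fixed, nonzero big-endian scalar below the group order.
    const sk = [_]u8{1} ** 32;
    const pk = try Secp256k1.basePoint.mul(sk, .big);
    const sec1 = pk.toCompressedSec1(); // 33-byte SEC1 encoding
    try std.testing.expect(sec1.len == 33);
}
```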
For 25519, it's very unlikely that applications would ever need the
serialized representation. Expose the value as an integer, as in
other curves. Rename the internal representation from `field_size`
to `field_order` for consistency.
Also fix a common typo in `scalar.sub()`.