* std.crypto.chacha: support larger vectors on AVX2 and AVX512 targets
Ryzen 7 7700, ChaCha20/8 stream, long outputs:
Generic: 3268 MiB/s
AVX2 : 6023 MiB/s
AVX512 : 8086 MiB/s
Bump the rand.chacha buffer a tiny bit to take advantage of this.
More than 8 blocks doesn't seem to make any measurable difference.
ChaChaPoly also gets a small performance boost from this, albeit
Poly1305 remains the bottleneck.
Generic: 707 MiB/s
AVX2 : 981 MiB/s
AVX512 : 1202 MiB/s
aarch64 appears to generally benefit from 4-way vectorization.
Verified on Apple Silicon, but also on a Cortex A72.
A minimal set of simple, safe functions for Montgomery arithmetic,
designed for cryptographic primitives.
Also update the current RSA cert validation to use it, getting rid
of the FixedBuffer hack and the previous limitations.
Make the check of the RSA public key a little bit more strict by
the way.
This commit removes the `field_call_bind` and `field_call_bind_named` ZIR
instructions, replacing them with a `field_call` instruction which does the bind
and call in one.
`field_call_bind` is an unfortunate instruction. It's tied into one very
specific usage pattern - its result can only be used as a callee. This means
that it creates a value of a "pseudo-type" of sorts, `bound_fn` - this type used
to exist in Zig, but now we just hide it from the user and have AstGen ensure
it's only used in one way. This is quite silly - `Type` and `Value` should, as
much as possible, reflect real Zig types and values.
It makes sense to instead encode the `a.b()` syntax as its own ZIR instruction,
so that's what we do here. This commit introduces a new instruction,
`field_call`. It's like `call`, but rather than a callee ref, it contains a ref
to the object pointer (`&a` in `a.b()`) and the string field name (`b`). This
eliminates `bound_fn` from the language, and slightly decreases the size of
generated ZIR - stats below.
This commit does remove a few usages which used to be allowed:
- `@field(a, "b")()`
- `@call(.auto, a.b, .{})`
- `@call(.auto, @field(a, "b"), .{})`
These forms used to work just like `a.b()`, but are no longer allowed. I believe
this is the correct choice for a few reasons:
- `a.b()` is a purely *syntactic* form; for instance, `(a.b)()` is not valid.
This means it is *not* inconsistent to not allow it in these cases; the
special case here isn't "a field access as a callee", but rather this exact
syntactic form.
- The second argument to `@call` looks much more visually distinct from the
callee in standard call syntax. To me, this makes it seem strange for that
argument to not work like a normal expression in this context.
- A more practical argument: it's confusing! `@field` and `@call` are used in
very different contexts to standard function calls: the former normally hints
at some comptime machinery, and the latter that you want more precise control
over parts of a function call. In these contexts, you don't want implicit
arguments adding extra confusion: you want to be very explicit about what
you're doing.
Lastly, some stats. I mentioned before that this change slightly reduces the
size of ZIR - this is due to two instructions (`field_call_bind` then `call`)
being replaced with one (`field_call`). Here are some numbers:
+--------------+----------+----------+--------+
| File | Before | After | Change |
+--------------+----------+----------+--------+
| Sema.zig | 4.72M | 4.53M | -4% |
| AstGen.zig | 1.52M | 1.48M | -3% |
| hash_map.zig | 283.9K | 276.2K | -3% |
| math.zig | 312.6K | 305.3K | -2% |
+--------------+----------+----------+--------+
* When there is buffered cleartext, return it without calling the
underlying read function. This prevents buffer overflow due to space
used up by cleartext.
* Avoid clearing the buffer when the buffered cleartext could not be
completely given to the result read buffer, and there is some
buffered ciphertext left.
* Instead of rounding up the amount of bytes to ask for to the nearest
TLS record size, round down, with a minimum of 1. This prevents the
code path from being taken which requires extra memory copies.
* Avoid calling `@memcpy` with overlapping arguments.
closes#15590
The majority of these are in comments, some in doc comments which might
affect the generated documentation, and a few in parameter names -
nothing that should be breaking, however.
* Sema: upgrade operands to array pointers if possible when emitting
AIR.
* Implement safety checks for length mismatch and aliasing.
* AIR: make ptrtoint support slice operands. Implement in LLVM backend.
* C backend: implement new `@memset` semantics. `@memcpy` is not done
yet.
Now they use slices or array pointers with any element type instead of
requiring byte pointers.
This is a breaking enhancement to the language.
The safety check for overlapping pointers will be implemented in a
future commit.
closes#14040
Tests that only reference decls for the purpose of analyzing more tests
should be unnamed, otherwise trying to filter for just a referenced test
can become impossible depending on the names.
On CPUs without AES support, ChaCha is always faster and safer than
software AES.
Add `crypto.core.aes.has_hardware_support` to represent whether
AES acceleration is available or not, and in `tls.Client`, favor
AES-based ciphers only if hardware support is available.
This matches what BoringSSL is doing.
HMAC supports arbitrary key sizes, and there are no practical reasons
to use more than 256 bit keys.
It still makes sense to match the security level, though, especially
since a distinction between the block size and the key size can be
confusing.
Using HMAC.key_size instead of HMAC.mac_size caused our TLS
implementation to compute wrong shared secrets when SHA-384 was
used. So, fix it directly in `crypto.hmac` in order to prevent
other misuses.
These are great permutations, and there's nothing wrong with them
from a practical security perspective.
However, both were competing in the NIST lightweight crypto
competition.
Gimli didn't pass the 3rd selection round, and is not much used
in the wild besides Zig and libhydrogen. It will never be
standardized and is unlikely to get more traction in the future.
Xoodyak, that Xoodoo is the permutation of, was a finalist.
It has a lot of advantages and *might* be standardized without NIST.
But this is too early to tell, and too risky to commit to it
in a standard library.
For lightweight crypto, Ascon is the one that we know NIST will
standardize and that we can safely rely on from a usage perspective.
Switch to a traditional ChaCha-based CSPRNG, with an Ascon-based one
as an option for constrained systems.
Add a RNG benchmark by the way.
Gimli and Xoodoo served us well. Their code will be maintained,
but outside the standard library.
Implementation of the IND-CCA2 post-quantum secure key encapsulation
mechanism (KEM) CRYSTALS-Kyber, as submitted to the third round of the NIST
Post-Quantum Cryptography (v3.02/"draft00"), and selected for standardisation.
Co-authored-by: Frank Denis <124872+jedisct1@users.noreply.github.com>
* Add configurable side channels mitigations; enable them on soft AES
Our software AES implementation doesn't have any mitigations against
side channels.
Go's generic implementation is not protected at all either, and even
OpenSSL only has minimal mitigations.
Full mitigations against cache-based attacks (bitslicing, fixslicing)
come at a huge performance cost, making AES-based primitives pretty
much useless for many applications. They also don't offer any
protection against other classes of side channel attacks.
In practice, partially protected, or even unprotected implementations
are not as bad as it sounds. Exploiting these side channels requires
an attacker that is able to submit many plaintexts/ciphertexts and
perform accurate measurements. Noisy measurements can still be
exploited, but require a significant amount of attempts. Wether this
is exploitable or not depends on the platform, application and the
attacker's proximity.
So, some libraries made the choice of minimal mitigations and some
use better mitigations in spite of the performance hit. It's a
tradeoff (security vs performance), and there's no one-size-fits all
implementation.
What applies to AES applies to other cryptographic primitives.
For example, RSA signatures are very sensible to fault attacks,
regardless of them using the CRT or not. A mitigation is to verify
every produced signature. That also comes with a performance cost.
Wether to do it or not depends on wether fault attacks are part of
the threat model or not.
Thanks to Zig's comptime, we can try to address these different
requirements.
This PR adds a `side_channels_protection` global, that can later
be complemented with `fault_attacks_protection` and possibly other
knobs.
It can have 4 different values:
- `none`: which doesn't enable additional mitigations.
"Additional", because it only disables mitigations that don't have
a big performance cost. For example, checking authentication tags
will still be done in constant time.
- `basic`: which enables mitigations protecting against attacks in
a common scenario, where an attacker doesn't have physical access to
the device, cannot run arbitrary code on the same thread, and cannot
conduct brute-force attacks without being throttled.
- `medium`: which enables additional mitigations, offering practical
protection in a shared environement.
- `full`: which enables all the mitigations we have.
The tradeoff is that the more mitigations we enable, the bigger the
performance hit will be. But this let applications choose what's
best for their use case.
`medium` is the default.
Currently, this only affects software AES, but that setting can
later be used by other primitives.
For AES, our implementation is a traditional table-based, with 4
32-bit tables and a sbox.
Lookups in that table have been replaced by function calls. These
functions can add a configurable noise level, making cache-based
attacks more difficult to conduct.
In the `none` mitigation level, the behavior is exactly the same
as before. Performance also remains the same.
In other levels, we compress the T tables into a single one, and
read data from multiple cache lines (all of them in `full` mode),
for all bytes in parallel. More precise measurements and way more
attempts become necessary in order to find correlations.
In addition, we use distinct copies of the sbox for key expansion
and encryption, so that they don't share the same L1 cache entries.
The best known attacks target the first two AES round, or the last
one.
While future attacks may improve on this, AES achieves full
diffusion after 4 rounds. So, we can relax the mitigations after
that. This is what this implementation does, enabling mitigations
again for the last two rounds.
In `full` mode, all the rounds are protected.
The protection assumes that lookups within a cache line are secret.
The cachebleed attack showed that it can be circumvented, but
that requires an attacker to be able to abuse hyperthreading and
run code on the same core as the encryption, which is rarely a
practical scenario.
Still, the current AES API allows us to transparently switch to
using fixslicing/bitslicing later when the `full` mitigation level
is enabled.
* Software AES: use little-endian representation.
Virtually all platforms are little-endian these days, so optimizing
for big-endian CPUs doesn't make sense any more.
Apple M1/M2 have an EOR3 instruction that can XOR 2 operands with
another one, and LLVM knows how to take advantage of it.
However, two EOR can't be automatically combined into an EOR3 if
one of them is in an assembly block.
That simple change speeds up ciphers doing an AES round immediately
followed by a XOR operation on Apple Silicon.
Before:
aegis-128l mac: 12534 MiB/s
aegis-256 mac: 6722 MiB/s
aegis-128l: 10634 MiB/s
aegis-256: 6133 MiB/s
aes128-gcm: 3890 MiB/s
aes256-gcm: 3122 MiB/s
aes128-ocb: 2832 MiB/s
aes256-ocb: 2057 MiB/s
After:
aegis-128l mac: 15667 MiB/s
aegis-256 mac: 8240 MiB/s
aegis-128l: 12656 MiB/s
aegis-256: 7214 MiB/s
aes128-gcm: 3976 MiB/s
aes256-gcm: 3202 MiB/s
aes128-ocb: 2835 MiB/s
aes256-ocb: 2118 MiB/s
The TurboSHAKE paper just got published:
https://eprint.iacr.org/2023/342.pdf
and unlike the previous K12 paper, suggests 0x1F instead of 0x01
as the default value for "D".
* Fix SHA3 with streaming
Leftover bytes should be added to the buffer, not to the state.
(or, always to the state; we can and probably should eventually get
rid of the buffer)
Fixes#14851
* Add a test for SHA-3 with streaming
Make the Keccak permutation public, as it's useful for more than
SHA-3 (kMAC, SHAKE, TurboSHAKE, TupleHash, etc).
Our Keccak implementation was accepting f as a comptime parameter,
but always used 64-bit words and 200 byte states, so it actually
didn't work with anything besides f=1600.
That has been fixed. The ability to use reduced-round versions
was also added in order to support M14 and K12.
The state was constantly converted back and forth between bytes
and words, even though only a part of the state is actually used
for absorbing and squeezing bytes. It was changed to something
similar to the other permutations we have, so we can avoid extra
copies, and eventually add vectorized implementations.
In addition, the SHAKE extendable output function (XOF) was
added (SHAKE128, SHAKE256). It is required by newer schemes,
such as the Kyber post-quantum key exchange mechanism, whose
implementation is currently blocked by SHAKE missing from our
standard library.
Breaking change: `Keccak_256` and `Keccak_512` were renamed to
`Keccak256` and `Keccak512` for consistency with all other
hash functions.