* crypto.core.aes: process 6 block in parallel instead of 8 on aarch64
At least on Apple Silicon, this is slightly faster than 8 blocks.
* AES: add parallel blocks for tigerlake, rocketlake, alderlake, zen3
...instead of hard-coding it to 20.
- This is consistent with the ChaCha implementation
- NaCl and libsodium, that this API is designed to interop with,
also support 8 and 12 round variants. The 12 round variant, in
particular, provides the same security level as the 20 round variant,
but is obviously faster.
- scrypt currently uses its own non optimized version of Salsa, just
because it use 8 rounds instead of 20. This will help remove code
duplication.
No behavior nor public API changes. The Salsa20 and XSalsa20 still
represent the 20-round variant.
PR #13101 recently renamed the "i386" architecture to "x86", and it
seems the specific CPU model got swept up in that. "x86" is an umbrella
term that describes a family of CPUs, and the "i386" is the oldest
supported model under that umbrella.
Rewrite GHASH to use 128-bit multiplication over non-reversed
integers, and up to 8 blocks aggregated reduction.
lib/std/crypto/benchmark.zig results:
Xeon E5:
Before: 1604 MiB/s
After: 4005 MiB/s
Apple M1:
Before: 2769 MiB/s
After: 6014 MiB/s
This also makes AES-GCM faster by the way.
Instead of making the memory alignment functions more complicated, I
added more API documentation for their existing semantics.
closes#12118closes#12135
* std.os.uefi: integer backed structs, add tests to catch regressions
device_path_protocol now uses extern structs with align(1) fields because
the transition to integer backed packed struct broke alignment
added comptime asserts that device_path_protocol structs do not violate
alignment and size specifications
Make the test use the minimum length and set MAX_NAME_BYTES to the maximum so that:
- the test will work on any host platform
- *and* the MAX_NAME_BYTES will be able to hold the max file name component on any host platform
Each u16 within a file name component can be encoded as up to 3 UTF-8 bytes, so we need to use MAX_NAME_BYTES to account for all possible UTF-8 encoded names.
Fixes#8268
Comptime code can't execute assembly code, so we need some way to
force comptime code to use the generic path. This should be replaced
with whatever is implemented for #868, when that day comes.
I am seeing that the result for the hash is incorrect in stage1 and
crashes stage2, so presumably this never worked correctly. I will follow
up on that soon.
This gets us most of the way back to the performance I had when
I was using the LLVM intrinsics:
- Intel Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz:
190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s
- AMD EPYC 7763 (VM) @ 2.45 GHz:
240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s
- Apple M1:
216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s
Minor changes to this source can swing performance from 400 MB/s to
1400 MB/s or... 20 MB/s, depending on how it interacts with the
optimizer. I have a sneaking suspicion that despite LLVM inheriting
GCC's extremely strict inline assembly semantics, its passes are
rather skittish around inline assembly (and almost certainly, its
instruction cost models can assume nothing)