added adapter to AnyWriter and GenericWriter to help bridge the gap
between old and new API
make std.testing.expectFmt work at compile-time
std.fmt no longer has a dependency on std.unicode. Formatted printing
was never properly unicode-aware. Now it no longer pretends to be.
Breakage/deprecations:
* std.fs.File.reader -> std.fs.File.deprecatedReader
* std.fs.File.writer -> std.fs.File.deprecatedWriter
* std.io.GenericReader -> std.io.Reader
* std.io.GenericWriter -> std.io.Writer
* std.io.AnyReader -> std.io.Reader
* std.io.AnyWriter -> std.io.Writer
* std.fmt.format -> std.fmt.deprecatedFormat
* std.fmt.fmtSliceEscapeLower -> std.ascii.hexEscape
* std.fmt.fmtSliceEscapeUpper -> std.ascii.hexEscape
* std.fmt.fmtSliceHexLower -> {x}
* std.fmt.fmtSliceHexUpper -> {X}
* std.fmt.fmtIntSizeDec -> {B}
* std.fmt.fmtIntSizeBin -> {Bi}
* std.fmt.fmtDuration -> {D}
* std.fmt.fmtDurationSigned -> {D}
* {} -> {f} when there is a format method
* format method signature
- anytype -> *std.io.Writer
- inferred error set -> error{WriteFailed}
- options -> (deleted)
* std.fmt.Formatted
- now takes context type explicitly
- no fmt string
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon
inspection, I was dissatisfied with the implementation. This commit
removes several mechanisms:
* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always
occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.
Removing UTF-8 validation is a regression of #663, however, the existing
implementation was already buggy. When adding this functionality back,
it must be fuzz-tested while checking the property that it matches an
independent Unicode validation implementation on the same file. While
we're at it, fuzzing should check the other properties of that proposal,
such as no ASCII control characters existing inside the source code.
Other changes included in this commit:
* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This
function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a
runtime assertion in favor of using the type system.
After this commit, the crash found by fuzzing, which was
"\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea"
no longer causes a crash. However, I did not feel the need to add this
test case because the simplified logic eradicates most crashes of this
nature.
Before this commit, the WTF-16 command line string would be converted to WTF-8 in `init`, and then a second buffer of the WTF-8 size + 1 would be allocated to store the parsed arguments. The converted WTF-8 command line would then be parsed and the relevant bytes would be copied into the argument buffer before being returned.
After this commit, only the WTF-8 size of the WTF-16 string is calculated (without conversion) which is then used to allocate the buffer for the parsed arguments. Parsing is then done on the WTF-16 slice directly, with the arguments being converted to WTF-8 on-the-fly.
This has a few (minor) benefits:
- Cuts the amount of memory allocated by ArgIteratorWindows in half (or better)
- Makes the total amount of memory allocated by ArgIteratorWindows predictable, since, before, the upfront `wtf16LeToWtf8Alloc` call could end up allocating more-memory-than-necessary temporarily due to its internal use of an ArrayList. Now, the amount of memory allocated is always exactly `calcWtf8Len(cmd_line) + 1`.
Deprecated aliases that are now compile errors:
- `std.fs.MAX_PATH_BYTES` (renamed to `std.fs.max_path_bytes`)
- `std.mem.tokenize` (split into `tokenizeAny`, `tokenizeSequence`, `tokenizeScalar`)
- `std.mem.split` (split into `splitSequence`, `splitAny`, `splitScalar`)
- `std.mem.splitBackwards` (split into `splitBackwardsSequence`, `splitBackwardsAny`, `splitBackwardsScalar`)
- `std.unicode`
+ `utf16leToUtf8Alloc`, `utf16leToUtf8AllocZ`, `utf16leToUtf8`, `fmtUtf16le` (all renamed to have capitalized `Le`)
+ `utf8ToUtf16LeWithNull` (renamed to `utf8ToUtf16LeAllocZ`)
- `std.zig.CrossTarget` (moved to `std.Target.Query`)
Deprecated `lib/std/std.zig` decls were deleted instead of made a `@compileError` because the `refAllDecls` in the test block would trigger the `@compileError`. The deleted top-level `std` namespaces are:
- `std.rand` (renamed to `std.Random`)
- `std.TailQueue` (renamed to `std.DoublyLinkedList`)
- `std.ChildProcess` (renamed/moved to `std.process.Child`)
This is not exhaustive. Deprecated aliases that I didn't touch:
+ `std.io.*`
+ `std.Build.*`
+ `std.builtin.Mode`
+ `std.zig.c_translation.CIntLiteralRadix`
+ anything in `src/`
ensureTotalCapacityPrecise only satisfies the assumptions made in the ArrayListImpl functions (that there's already enough capacity for the entire converted string if it's all ASCII) when the ArrayList has no items, otherwise it would hit illegal behavior.
This commit changes how we represent comptime-mutable memory
(`comptime var`) in the compiler in order to implement the intended
behavior that references to such memory can only exist at comptime.
It does *not* clean up the representation of mutable values, improve the
representation of comptime-known pointers, or fix the many bugs in the
comptime pointer access code. These will be future enhancements.
Comptime memory lives for the duration of a single Sema, and is not
permitted to escape that one analysis, either by becoming runtime-known
or by becoming comptime-known to other analyses. These restrictions mean
that we can represent comptime allocations not via Decl, but with state
local to Sema - specifically, the new `Sema.comptime_allocs` field. All
comptime-mutable allocations, as well as any comptime-known const allocs
containing references to such memory, live in here. This allows for
relatively fast checking of whether a value references any
comptime-mtuable memory, since we need only traverse values up to
pointers: pointers to Decls can never reference comptime-mutable memory,
and pointers into `Sema.comptime_allocs` always do.
This change exposed some faulty pointer access logic in `Value.zig`.
I've fixed the important cases, but there are some TODOs I've put in
which are definitely possible to hit with sufficiently esoteric code. I
plan to resolve these by auditing all direct accesses to pointers (most
of them ought to use Sema to perform the pointer access!), but for now
this is sufficient for all realistic code and to get tests passing.
This change eliminates `Zcu.tmp_hack_arena`, instead using the Sema
arena for comptime memory mutations, which is possible since comptime
memory is now local to the current Sema.
This change should allow `Decl` to store only an `InternPool.Index`
rather than a full-blown `ty: Type, val: Value`. This commit does not
perform this refactor.
Windows paths now use WTF-16 <-> WTF-8 conversion everywhere, which is lossless. Previously, conversion of ill-formed UTF-16 paths would either fail or invoke illegal behavior.
WASI paths must be valid UTF-8, and the relevant function calls have been updated to handle the possibility of failure due to paths not being encoded/encodable as valid UTF-8.
Closes#18694Closes#1774Closes#2565
Ill-formed UTF-8 byte sequences are replaced by the replacement character (U+FFFD) according to "U+FFFD Substitution of Maximal Subparts" from Chapter 3 of the Unicode standard, and as specified by https://encoding.spec.whatwg.org/#utf-8-decoder
The function returns the vector length, not the byte size of the vector or the bit size of individual elements. This distinction is very important and some usages of this function in the stdlib operated under these incorrect assumptions.
Use inline to vastly simplify the exposed API. This allows a
comptime-known endian parameter to be propogated, making extra functions
for a specific endianness completely unnecessary.
This reverts commit 0c99ba1eab, reversing
changes made to 5f92b070bf.
This caused a CI failure when it landed in master branch due to a
128-bit `@byteSwap` in std.mem.
Originally inspired by Go's `utf8.Valid` function. Includes some test cases from Go's test suite.
Further optimized to be faster in all tested cases (short/long ascii/UTF8), in all release modes.
Takes advantage of SIMD for the ASCII fast path.
* std.unicode: Add more UTF-16 decoding functions
This mostly makes parts of Utf16LeIterator reusable
* std.json: Fix decoding of UTF-16 surrogate pairs
Before this commit, there were 524,288 codepoints that would get decoded improperly. After this commit, there are 0.
Fixes#16828
Most of this migration was performed automatically with `zig fmt`. There
were a few exceptions which I had to manually fix:
* `@alignCast` and `@addrSpaceCast` cannot be automatically rewritten
* `@truncate`'s fixup is incorrect for vectors
* Test cases are not formatted, and their error locations change
There are still a few occurrences of "stage1" in the standard library
and self-hosted compiler source, however, these instances need a bit
more careful inspection to ensure no breakage.
I hit the "quotes in an RSP file" issue when trying to compile gRPC using
"zig cc". As a fun exercise, I decided to see if I could fix it myself.
I'm fully open to this code being flat-out rejected. Or I can take feedback
to fix it up.
This modifies (and renames) _ArgIteratorWindows_ in process.zig such that
it works with arbitrary strings (or the contents of an RSP file).
In main.zig, this new _ArgIteratorGeneral_ is used to address the "TODO"
listed in _ClangArgIterator_.
This change closes#4833.
**Pros:**
- It has the nice attribute of handling "RSP file" arguments in the same way it
handles "cmd_line" arguments.
- High Performance, minimal allocations
- Fixed bug in previous _ArgIteratorWindows_, where final trailing backslashes
in a command line were entirely dropped
- Added a test case for the above bug
- Harmonized the _ArgIteratorXxxx._initWithAllocator()_ and _next()_ interface
across Windows/Posix/Wasi (Moved Windows errors to _initWithAllocator()_
rather than _next()_)
- Likely perf benefit on Windows by doing _utf16leToUtf8AllocZ()_ only once
for the entire cmd_line
**Cons:**
- Breaking Change in std library on Windows: Call
_ArgIterator.initWithAllocator()_ instead of _ArgIterator.init()_
- PhaseMage is new with contributions to Zig, might need a lot of hand-holding
- PhaseMage is a Windows person, non-Windows stuff will need to be double-checked
**Testing Done:**
- Wrote a few new test cases in process.zig
- zig.exe build test -Dskip-release (no new failures seen)
- zig cc now builds gRPC without error
Because ArrayList.initCapacity uses 'precise' capacity allocation, this should save memory on average, and definitely will save memory in cases where ArrayList is used where a regular allocated slice could have also be used.