mirrors/zig - "Borealis" Git by INX: Hosted by INX "Xenon".

mirror of https://codeberg.org/ziglang/zig.git synced 2025-12-08 14:54:42 +00:00

Author	SHA1	Message	Date
Veikka Tuominen	fd77f2cfed	std: update usage of std.testing	2021-05-08 15:15:30 +03:00
Andrew Kelley	8e6c2b7a47	Merge remote-tracking branch 'origin/master' into ast-memory-layout	2021-02-24 15:08:23 -07:00
Josh Wolfe	8b9434871e	Avoid concept of a "Unicode character" in documentation and error messages (#8059)	2021-02-24 08:26:13 -05:00
Isaac Freund	f3ee10b454	zig fmt: fix comments ending with EOF after decls Achieve this by reducing the amount of special casing to handle EOF so that the already correct logic for normal comments does not need to be duplicated.	2021-02-22 18:32:37 +01:00
Veikka Tuominen	e2289961c6	snake_case Token.Tag	2021-02-12 02:12:00 +02:00
Andrew Kelley	272a0ab359	zig fmt: implement "line comment followed by top-level comptime"	2021-02-01 20:11:55 -07:00
Andrew Kelley	20554d32c0	zig fmt: start reworking with new memory layout * start implementation of ast.Tree.firstToken and lastToken * clarify some ast.Node doc comments * reimplement renderToken	2021-02-01 17:23:49 -07:00
Andrew Kelley	bf8fafc37d	stage2: tokenizer does not emit line comments anymore only std.zig.render cares about these, and it can find them in the original source easily enough.	2021-01-31 21:57:48 -07:00
Andrew Kelley	4dca99d3f6	stage2: rework AST memory layout
This is a proof-of-concept of switching to a new memory layout for tokens and AST nodes. The goal is threefold:
* smaller memory footprint
* faster performance for tokenization and parsing
* most importantly, a proof-of-concept that can be also applied to ZIR and TZIR to improve the entire compiler pipeline in this way.
I had a few key insights here:
* Underlying premise: using less memory will make things faster, because of fewer allocations and better cache utilization. Also using less memory is valuable in and of itself.
* Using a Struct-Of-Arrays for tokens and AST nodes, saves the bytes of padding between the enum tag (which kind of token is it; which kind of AST node is it) and the next fields in the struct. It also improves cache coherence, since one can peek ahead in the tokens array without having to load the source locations of tokens.
* Token memory can be conserved by only having the tag (1 byte) and byte offset (4 bytes) for a total of 5 bytes per token. It is not necessary to store the token ending byte offset because one can always re-tokenize later, but also most tokens the length can be trivially determined from the tag alone, and for ones where it doesn't, string literals for example, one must parse the string literal again later anyway in astgen, making it free to re-tokenize.
* AST nodes do not actually need to store more than 1 token index because one can poke left and right in the tokens array very cheaply.
So far we are left with one big problem though: how can we put AST nodes into an array, since different AST nodes are different sizes? This is where my key observation comes in: one can have a hash table for the extra data for the less common AST nodes!
But it gets even better than that: I defined this data that is always present for every AST Node:
* tag (1 byte) - which AST node is it
* main_token (4 bytes, index into tokens array) - the tag determines which token this points to
* struct{lhs: u32, rhs: u32} - enough to store 2 indexes to other AST nodes, the tag determines how to interpret this data
You can see how a binary operation, such as `a * b` would fit into this structure perfectly. A unary operation, such as `a` would also fit, and leave `rhs` unused. So this is a total of 13 bytes per AST node. And again, we don't have to pay for the padding to round up to 16 because we store in struct-of-arrays format.
I made a further observation: the only kind of data AST nodes need to store other than the main_token is indexes to sub-expressions. That's it. The only purpose of an AST is to bring a tree structure to a list of tokens. This observation means all the data that nodes store are only sets of u32 indexes to other nodes. The other tokens can be found later by the compiler, by poking around in the tokens array, which again is super fast because it is struct-of-arrays, so you often only need to look at the token tags array, which is an array of bytes, very cache friendly.
So for nearly every kind of AST node, you can store it in 13 bytes. For the rarer AST nodes that have 3 or more indexes to other nodes to store, either the lhs or the rhs will be repurposed to be an index into an extra_data array which contains the extra AST node indexes. In other words, no hash table needed, it's just 1 big ArrayList with the extra data for AST Nodes.
Final observation, no need to have a canonical tag for a given AST. For example: The expression `foo(bar)` is a function call. Function calls can have any number of parameters. However in this example, we can encode the function call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs for the function expr and rhs for the only parameter expr.
Meanwhile if the code was `foo(bar, baz)` then the AST node would have to be `FunctionCall` with lhs still being the function expr, but rhs being the index into `extra_data`. Then because the tag is `FunctionCall` it means `extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end". Now the range `extra_data[start..end]` describes the list of parameters to the function.
Point being, you only have to pay for the extra bytes if the AST actually requires it. There's no limit to the number of different AST tag encodings.
Preliminary results:
* 15% improvement on cache-misses
* 28% improvement on total instructions executed
* 26% improvement on total CPU cycles
* 22% improvement on wall clock time
This is 1/4 items on the checklist before this can actually be merged:
* [x] parser
* [ ] render (zig fmt)
* [ ] astgen
* [ ] translate-c	2021-01-30 20:16:59 -07:00
LemonBoy	dd973fb365	std: Use {s} instead of {} when printing strings	2021-01-02 17:12:57 -07:00
Frank Denis	6c2e0c2046	Year++	2020-12-31 15:45:24 -08:00
Travis	d7f9128b5d	add error message to zig side of tokenizing/parsing	2020-10-29 12:03:45 -05:00
Travis	960b5b518f	updated zig tokenizer to handle .*** and added tests	2020-10-29 12:03:45 -05:00
Tadeo Kondrak	069fbb3c01	Add opaque type syntax	2020-10-06 22:08:24 -06:00
LemonBoy	5c6cd5e2c9	stage{1,2}: Fix parsing of range literals stage1 was unable to parse ranges whose starting point was written in binary/octal as the first dot in '...' was incorrectly interpreted as decimal point. stage2 forgot to reset the literal type to IntegerLiteral when it discovered the dot was not a decimal point. I've only stumbled across this bug because zig fmt keeps formatting the ranges without any space around the ...	2020-09-28 14:16:26 -04:00
Vexu	1174cb1517	stage2: fix tokenizer float bug	2020-09-03 15:05:47 +03:00
Andrew Kelley	4a69b11e74	add license header to all std lib files add SPDX license identifier copyright ownership is zig contributors	2020-08-20 16:07:04 -04:00
Vexu	c2fb4bfff3	add 'anytype' to self-hosted parser	2020-07-11 17:41:16 +03:00
Alexandros Naskos	aa1a727284	Allow carriage return in comments	2020-06-02 00:56:05 -04:00
Vexu	a47257d9b0	fix std.zig rejecting literal tabs in comments	2020-06-01 14:37:36 +03:00
Ryan Liptak	a9dd79d293	Use std.ComptimeStringMap in zig's tokenizer Should have no performance change, the implementation was just made generic/re-usable	2020-05-26 21:36:58 -07:00
Andrew Kelley	ba41a9d5d7	different strategy for tokenizing keywords throughput: 279 MiB/s => 347 MiB/s	2020-05-26 18:59:49 -04:00
Andrew Kelley	8df0841d6a	stage2 parser: token ids in their own array To prevent cache misses, token ids go in their own array, and the start/end offsets go in a different one. perf measurement before: 2,667,914 cache-misses:u 2,139,139,935 instructions:u 894,167,331 cycles:u perf measurement after: 1,757,723 cache-misses:u 2,069,932,298 instructions:u 858,105,570 cycles:u	2020-05-22 12:34:12 -04:00
Vexu	2296906e2a	modernize std.zig.tokenizer	2020-05-13 23:08:42 +03:00
Tadeo Kondrak	7ada59f873	remove nakedcc/stdcallcc/async fn/extern fn fnproto	2020-05-05 09:37:28 -06:00
Tadeo Kondrak	8d5636ebe4	Rename noasync to nosuspend in self-hosted, add rewriter	2020-05-05 05:17:33 -06:00
LemonBoy	0a2519fafb	stage2: Allow \t in string literals Lift the ban on literal tab chars in string literals as they have nothing to do with code formatting.	2020-04-22 12:48:45 -04:00
momumi	2d18178c27	minor fixes and more tests for _ separators * Make the tokenizer spit out an Invalid token on the first invalid character found in the number literal. * More parsing and tokenizer tests for number literals * fix invalid switch statement in ir.zig	2020-03-23 09:21:34 +10:00
momumi	29324e6f39	fix formatting in tokenizer tests	2020-03-22 12:41:11 +10:00
momumi	138dab4524	add number `_` separators for stage 2 tokenizer	2020-03-22 10:35:19 +10:00
LemonBoy	59a243ce24	std: Remove now-superfluous hack	2020-02-16 19:53:53 +01:00
LemonBoy	563d9ebfe5	Implement the callconv() annotation	2020-01-02 18:53:16 +01:00
Lachlan Easton	fe0e8c87b7	Tokenizer: Copy optional tokens prior to being set to null #3737 (#3910) * Tokenizer: Copy optional tokens prior to being set to null #3737 * Add TODO comments, reminder to audit copying optional pattern.	2019-12-16 11:01:02 -05:00
Vexu	1f84586836	translate-c-2 avoid collisions with zig keywords	2019-12-14 17:00:10 +02:00
Andrew Kelley	8b2622cdd5	std.fmt.format: tuple parameter instead of var args	2019-12-08 22:53:51 -05:00
Andrew Kelley	bf3ac66150	remove type coercion from array values to references * Implements #3768. This is a sweeping breaking change that requires many (trivial) edits to Zig source code. Array values no longer coerced to slices; however one may use `&` to obtain a reference to an array value, which may then be coerced to a slice. * Adds `IrInstruction::dump`, for debugging purposes. It's useful to call to inspect the instruction when debugging Zig IR. * Fixes bugs with result location semantics. See the new behavior test cases, and compile error test cases. * Fixes bugs with `@typeInfo` not properly resolving const values. * Behavior tests are passing but std lib tests are not yet. There is more work to do before merging this branch.	2019-11-27 03:37:50 -05:00
Andrew Kelley	00878a15d7	zig fmt: support sentinel-terminated pointer syntax	2019-11-23 22:21:25 -05:00
Andrew Kelley	fd6020c4e2	update tests, better error messages, update self-hosted tokenizer	2019-11-21 20:43:41 -05:00
Andrew Kelley	874b34a30f	self-hosted tokenizer handles \r in string literals	2019-11-20 21:03:17 -05:00
Andrew Kelley	ed619245e9	remove dead code shebang support from self-hosted tokenizer	2019-11-20 20:18:19 -05:00
Vexu	b92f42d1f4	implemented container doc comments in stage 2	2019-11-14 16:25:06 +02:00
Andrew Kelley	ae0a219d1f	stop accepting deprecated `use` keyword closes #2591	2019-11-11 13:02:39 -05:00
Andrew Kelley	e0db54e89d	update the codebase to use `@as`	2019-11-08 15:57:24 -05:00
LemonBoy	cf6fb89ced	Add small tokenizer test case for #3468	2019-11-07 02:46:10 -05:00
Brendan Hansknecht	c1e8fdf812	add token for parsing pointer dereference	2019-11-06 14:03:21 -05:00
Nick Erdmann	ae7392e504	unicode character literals	2019-10-07 08:18:16 +02:00
Andrew Kelley	ed36dbbd9c	mv std/ lib/ that's all this commit does. further commits will fix cli flags and such. see #2221	2019-09-25 23:35:41 -04:00
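
Note on the performance figures quoted in commits 8df0841d6a and ba41a9d5d7: the messages give raw before/after numbers without relative changes. The short Python sketch below only redoes that arithmetic from the numbers in the log; the helper name `improvement` is made up for this note. It works out to roughly 34% fewer cache misses, about 3% fewer instructions, about 4% fewer cycles, and about a 24% throughput gain for the keyword tokenizing change.

```python
def improvement(before, after):
    """Relative reduction, as a percentage of the 'before' value."""
    return (before - after) / before * 100


# perf counters quoted in commit 8df0841d6a: (before, after)
counters = {
    "cache-misses": (2_667_914, 1_757_723),
    "instructions": (2_139_139_935, 2_069_932_298),
    "cycles": (894_167_331, 858_105_570),
}
reductions = {name: improvement(b, a) for name, (b, a) in counters.items()}

# throughput quoted in commit ba41a9d5d7: 279 MiB/s => 347 MiB/s
throughput_gain = (347 - 279) / 279 * 100

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
print(f"throughput: {throughput_gain:.1f}% higher")
```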

47 commits
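
Note on commit 4dca99d3f6: the node layout described in that message can be sketched in a few lines. This is an illustration in Python, not the Zig implementation. The class `Ast`, its methods, the numeric tag values and the token indexes are all assumptions made up for this note; only the field names (tag, main_token, lhs, rhs, extra_data) and the two call encodings come from the commit message.

```python
import struct
from array import array

# Hypothetical tag values. The commit names only the two call tags.
IDENTIFIER = 0
CALL_ONE_PARAM = 1  # "FunctionCallOnlyOneParam": lhs = callee, rhs = the parameter
CALL = 2            # "FunctionCall": lhs = callee, rhs = index into extra_data


class Ast:
    """Struct-of-arrays storage: one array per field, no per-node padding."""

    def __init__(self):
        self.tags = array("B")         # 1 byte per node
        self.main_tokens = array("I")  # index into the tokens array
        self.lhs = array("I")          # meaning depends on the tag
        self.rhs = array("I")          # meaning depends on the tag
        self.extra_data = array("I")   # one shared list of extra indexes

    def add_node(self, tag, main_token, lhs=0, rhs=0):
        self.tags.append(tag)
        self.main_tokens.append(main_token)
        self.lhs.append(lhs)
        self.rhs.append(rhs)
        return len(self.tags) - 1

    def add_call(self, main_token, callee, params):
        if len(params) == 1:
            # Common case: everything fits in the fixed-size fields.
            return self.add_node(CALL_ONE_PARAM, main_token, callee, params[0])
        start = len(self.extra_data)
        self.extra_data.extend(params)
        end = len(self.extra_data)
        span = len(self.extra_data)    # where the (start, end) pair lives
        self.extra_data.extend([start, end])
        return self.add_node(CALL, main_token, callee, span)

    def call_params(self, node):
        if self.tags[node] == CALL_ONE_PARAM:
            return [self.rhs[node]]
        if self.tags[node] == CALL:
            i = self.rhs[node]
            start, end = self.extra_data[i], self.extra_data[i + 1]
            return list(self.extra_data[start:end])
        raise ValueError("not a call node")


ast = Ast()
foo = ast.add_node(IDENTIFIER, 0)
bar = ast.add_node(IDENTIFIER, 2)
baz = ast.add_node(IDENTIFIER, 4)
one = ast.add_call(1, foo, [bar])       # foo(bar)
two = ast.add_call(1, foo, [bar, baz])  # foo(bar, baz)

# Size of tag + main_token + lhs + rhs with standard 1- and 4-byte fields:
packed_size = struct.calcsize("<BIII")   # no padding
aligned_size = struct.calcsize("@BIII")  # native alignment, typically 16
```

Here `foo(bar)` costs nothing beyond the fixed fields, while `foo(bar, baz)` adds four entries to `extra_data` (two parameters plus the start/end pair), which matches the message's point that extra bytes are paid only when a node needs them.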