Commit graph

328 commits

Author SHA1 Message Date
Isaac Freund
0f3fa4d654 zig fmt: array types 2021-02-05 11:36:19 -08:00
Isaac Freund
6f3b93e2e8 zig fmt: struct and anon array initialization 2021-02-05 10:51:45 -08:00
Isaac Freund
3e960cfffe zig fmt: float literal with exponent 2021-02-05 10:51:45 -08:00
Andrew Kelley
7069459a76 zig fmt: implement struct init 2021-02-04 19:59:06 -07:00
Andrew Kelley
8e46d06650 zig fmt: implement fn protos and defers 2021-02-04 16:38:29 -07:00
Asherah Connor
4428acf0f7 zig fmt: deref, unwrap optional 2021-02-04 10:49:45 -08:00
Andrew Kelley
725adf8332 zig fmt: builtin calls and array access 2021-02-03 22:12:11 -07:00
Andrew Kelley
f5279cbada zig fmt: implement top-level fields 2021-02-03 17:02:12 -07:00
Andrew Kelley
1a83b29bea zig fmt: implement if, call, field access, assignment 2021-02-02 21:05:53 -07:00
Andrew Kelley
0c6b98b825 zig fmt: implement simple test with doc comments 2021-02-01 21:31:41 -07:00
Andrew Kelley
272a0ab359 zig fmt: implement "line comment followed by top-level comptime" 2021-02-01 20:11:55 -07:00
Andrew Kelley
20554d32c0 zig fmt: start reworking with new memory layout
 * start implementation of ast.Tree.firstToken and lastToken
 * clarify some ast.Node doc comments
 * reimplement renderToken
2021-02-01 17:23:49 -07:00
Andrew Kelley
4dca99d3f6 stage2: rework AST memory layout
This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:

 * smaller memory footprint
 * faster performance for tokenization and parsing
 * most importantly, a proof-of-concept that can be also applied to ZIR
   and TZIR to improve the entire compiler pipeline in this way.

I had a few key insights here:

 * Underlying premise: using less memory will make things faster, because
   of fewer allocations and better cache utilization. Also using less
   memory is valuable in and of itself.
 * Using a Struct-Of-Arrays for tokens and AST nodes saves the bytes of
   padding between the enum tag (which kind of token is it; which kind
   of AST node is it) and the next fields in the struct. It also improves
   cache locality, since one can peek ahead in the tokens array without
   having to load the source locations of tokens.
 * Token memory can be conserved by storing only the tag (1 byte) and byte
   offset (4 bytes), for a total of 5 bytes per token (see the sketch after
   this list). It is not necessary to store the token's ending byte offset
   because one can always re-tokenize later; besides, for most tokens the
   length can be determined trivially from the tag alone, and for the ones
   where it cannot, string literals for example, the literal must be parsed
   again later in astgen anyway, making re-tokenization effectively free.
 * AST nodes do not actually need to store more than 1 token index because
   one can poke left and right in the tokens array very cheaply.
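
To make that concrete, here is a minimal sketch of such a token layout
(illustrative only, not the actual std.zig tokenizer definitions; it assumes
std.MultiArrayList as the struct-of-arrays container and invents the tag names):

```zig
const std = @import("std");

/// Illustrative token: a 1-byte tag plus a 4-byte start offset, i.e. 5 bytes
/// of payload per token. Stored struct-of-arrays style, the padding that an
/// array-of-structs would add never reaches memory.
const Token = struct {
    tag: Tag,
    start: u32,

    const Tag = enum(u8) {
        identifier,
        string_literal,
        asterisk,
        l_paren,
        r_paren,
        // ...the real tokenizer has many more tags
    };
};

test "tokens live in parallel arrays" {
    const gpa = std.testing.allocator;
    var tokens: std.MultiArrayList(Token) = .{};
    defer tokens.deinit(gpa);

    try tokens.append(gpa, .{ .tag = .identifier, .start = 0 });
    try tokens.append(gpa, .{ .tag = .l_paren, .start = 3 });

    // Peeking ahead only touches the tag array (1 byte per token); the
    // start offsets stay out of the cache until they are actually needed.
    try std.testing.expectEqual(Token.Tag.l_paren, tokens.items(.tag)[1]);
}
```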

So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:

I defined this data that is always present for every AST Node:

 * tag (1 byte)
   - which AST node is it
 * main_token (4 bytes, index into tokens array)
   - the tag determines which token this points to
 * struct{lhs: u32, rhs: u32}
   - enough to store 2 indexes to other AST nodes, the tag determines
     how to interpret this data

You can see how a binary operation, such as `a * b`, would fit into this
structure perfectly. A unary operation, such as `*a`, would also fit
and leave `rhs` unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.
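
Sketched out (with invented tag names, not the actual std.zig.ast
declarations), that per-node data looks roughly like this:

```zig
const std = @import("std");

/// Illustrative node layout: tag + main_token + data is 1 + 4 + 8 = 13
/// bytes of payload per node.
const Node = struct {
    tag: Tag,
    main_token: u32, // index into the tokens array; meaning depends on tag
    data: Data,

    const Tag = enum(u8) {
        mul, // `a * b`: lhs = a, rhs = b
        negation, // `-a`: lhs = a, rhs unused
        // ...one tag per kind of node
    };

    const Data = struct {
        lhs: u32,
        rhs: u32,
    };
};

comptime {
    // An array-of-structs would pad each node out past 13 bytes; the
    // struct-of-arrays layout only ever stores the 13 payload bytes.
    std.debug.assert(@sizeOf(Node.Tag) + @sizeOf(u32) + @sizeOf(Node.Data) == 13);
    std.debug.assert(@sizeOf(Node) > 13);
}
```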

I made a further observation: the only kind of data AST nodes need to
store other than the main_token is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This observation means that all the data nodes store consists only of u32
indexes to other nodes. Any other tokens can be found later by the compiler
by poking around in the tokens array, which again is super fast because it
is struct-of-arrays: you often only need to look at the token tags array,
which is an array of bytes and therefore very cache friendly.

So nearly every kind of AST node can be stored in 13 bytes. For the
rarer AST nodes that need to store 3 or more indexes to other nodes, either
the lhs or the rhs is repurposed as an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table is
needed; it's just 1 big ArrayList with the extra data for AST nodes.
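
Put together, the parse result could then be held in something like this
(a sketch only; every field name here is invented for illustration):

```zig
/// Hypothetical parse result: flat arrays only, with one shared u32 array
/// for the overflow node data.
const Tree = struct {
    source: []const u8,

    // Tokens, struct-of-arrays style.
    token_tags: []const u8,
    token_starts: []const u32,

    // AST nodes, struct-of-arrays style.
    node_tags: []const u8,
    node_main_tokens: []const u32,
    node_datas: []const Data,

    /// "Less common" nodes repurpose lhs or rhs to point in here, so no hash
    /// table is needed: just one big array of extra node indexes.
    extra_data: []const u32,

    const Data = struct { lhs: u32, rhs: u32 };
};
```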

Final observation: there is no need to have a canonical tag for a given AST node. For example:
The expression `foo(bar)` is a function call. Function calls can have any
number of parameters. However in this example, we can encode the function
call into the AST with a tag called `FunctionCallOnlyOneParam`, and use lhs
for the function expr and rhs for the only parameter expr. Meanwhile if the
code was `foo(bar, baz)` then the AST node would have to be `FunctionCall`
with lhs still being the function expr, but rhs being the index into
`extra_data`. Then because the tag is `FunctionCall` it means
`extra_data[rhs]` is the "start" and `extra_data[rhs+1]` is the "end".
Now the range `extra_data[start..end]` describes the list of parameters
to the function.

Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.
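
Here is a sketch of how decoding such a call could look (the tag names and
the helper are invented for illustration; the real encoding in the compiler
may differ):

```zig
const std = @import("std");

const NodeTag = enum(u8) { call_one_arg, call_many_args };
const Data = struct { lhs: u32, rhs: u32 };

/// Returns the parameter node indexes of a call, reading extra_data only
/// when the tag says this node actually needed it.
fn callParams(
    tag: NodeTag,
    data: Data,
    extra_data: []const u32,
    one_param_buf: *[1]u32,
) []const u32 {
    switch (tag) {
        .call_one_arg => {
            // lhs = callee expr, rhs = the single parameter node index.
            one_param_buf[0] = data.rhs;
            return one_param_buf;
        },
        .call_many_args => {
            // lhs = callee expr, rhs = index into extra_data, where the
            // "start" and "end" of the parameter list are stored.
            const start = extra_data[data.rhs];
            const end = extra_data[data.rhs + 1];
            return extra_data[start..end];
        },
    }
}

test "foo(bar) stays in the node, foo(bar, baz) pays for extra_data" {
    var buf: [1]u32 = undefined;

    // `foo(bar)`: callee is node 4, the only parameter is node 5.
    const one = callParams(.call_one_arg, .{ .lhs = 4, .rhs = 5 }, &[_]u32{}, &buf);
    try std.testing.expectEqualSlices(u32, &[_]u32{5}, one);

    // `foo(bar, baz)`: rhs = 0 points at extra_data, where start = 2 and
    // end = 4 bracket the two parameter node indexes 5 and 6.
    const extra = [_]u32{ 2, 4, 5, 6 };
    const many = callParams(.call_many_args, .{ .lhs = 4, .rhs = 0 }, &extra, &buf);
    try std.testing.expectEqualSlices(u32, &[_]u32{ 5, 6 }, many);
}
```

The single-argument case never touches extra_data; only calls with two or
more arguments pay for the extra u32s.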

Preliminary results:

 * 15% improvement on cache-misses
 * 28% improvement on total instructions executed
 * 26% improvement on total CPU cycles
 * 22% improvement on wall clock time

This completes 1 of the 4 items on the checklist before this can actually be merged:

 * [x] parser
 * [ ] render (zig fmt)
 * [ ] astgen
 * [ ] translate-c
2021-01-30 20:16:59 -07:00
Tadeo Kondrak
0b5f3c2ef9 Replace @TagType uses, mostly with std.meta.Tag 2021-01-30 22:26:44 +02:00
Jay Petacat
a9b505fa77 Reduce use of deprecated IO types
Related: #4917
2021-01-07 23:48:58 -08:00
Andrew Kelley
974c008a0e convert more {} to {d} and {s} 2021-01-02 19:03:14 -07:00
LemonBoy
dd973fb365 std: Use {s} instead of {} when printing strings 2021-01-02 17:12:57 -07:00
Frank Denis
6c2e0c2046 Year++ 2020-12-31 15:45:24 -08:00
LemonBoy
fa6449dac0 zig fmt: Fix alignment of initializer elements
Resetting `column_counter` is not needed as the effective column number
is calculated by taking that value modulo `row_size`.

Closes #7289
2020-12-11 02:34:44 -05:00
Vexu
be71994fb1 zig fmt: improve var decl initializer formatting 2020-12-09 13:47:22 +02:00
Vexu
a63fd34c50 return a valid node even if invalid deref was used 2020-10-29 19:20:15 +02:00
Travis
d7f9128b5d add error message to zig side of tokenizing/parsing 2020-10-29 12:03:45 -05:00
Lachlan Easton
4496a6c9cc zig fmt: Special case un-indent comma after multiline string in param list 2020-09-18 20:34:00 +10:00
Lachlan Easton
1aacedf6e1 zig fmt: Fix regression in ArrayInitializers 2020-09-18 20:34:00 +10:00
Lachlan Easton
40b6e86a99 zig fmt: fix #6171 2020-09-18 20:34:00 +10:00
Lachlan Easton
206a8cf670 zig fmt: fix comments and multiline literals in function args 2020-09-18 20:34:00 +10:00
Lachlan Easton
291482a031 zig fmt: Don't consider width of expressions containing multiline string literals when calculating padding for array initializers. fixes #3739
Changes some of the special casing for multiline string literals.
2020-09-18 20:34:00 +10:00
Lachlan Easton
e1bd271192 zig fmt: Allow trailing comments to do manual array formatting. close #5948 2020-09-18 20:34:00 +10:00
Lachlan Easton
9f0821e688 zig fmt: Fix erroneously commented out code, add passing test case to close #5722 2020-09-18 20:34:00 +10:00
Lachlan Easton
ea6181aaf6 zig fmt: Add test for nesting if expressions 2020-09-18 20:34:00 +10:00
Lachlan Easton
601331833a Add passing test. close #5343 2020-09-09 21:54:42 +10:00
Lachlan Easton
283d441c19 zig fmt: fix #3978, fix #2748 2020-09-09 21:54:42 +10:00
Lachlan Easton
bb848dbeee zig fmt: Patch rename stream to ais (auto indenting stream) & other small refactors 2020-09-02 20:16:28 +10:00
Lachlan Easton
bc24b86d82 zig fmt: Fix regression not covered by testing 2020-09-01 13:19:34 +10:00
Lachlan Easton
029ec456bc zig fmt: Set indent_delta to 2 when rendering inline asm 2020-08-31 23:39:50 +10:00
Lachlan Easton
5aca3baea6 zig fmt: Remove dynamic stack from auto-indenting-stream 2020-08-31 23:39:50 +10:00
Lachlan Easton
a72b9d403d Refactor zig fmt indentation. Remove indent from rendering code and have a stream handle automatic indentation 2020-08-29 13:35:00 +10:00
Andrew Kelley
4a69b11e74 add license header to all std lib files
add SPDX license identifier
copyright ownership is zig contributors
2020-08-20 16:07:04 -04:00
Vexu
f962315363 fix missing parser error for missing comma before eof
Closes #5952
2020-07-30 13:10:55 +03:00
Andrew Kelley
804b51b179 stage2: VarDecl and FnProto take advantage of TrailerFlags API
These AST nodes now have a flags field and then a bunch of optional
trailing objects. The end result is lower memory usage and consequently
better performance. This is part of an ongoing effort to reduce the
amount of memory parsed ASTs take up.
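
Conceptually the layout works like this (an invented stand-in to illustrate
the idea; this is not the actual std.meta.TrailerFlags API):

```zig
const std = @import("std");

/// Invented stand-in for the idea: one bit per optional field, and only the
/// fields whose bit is set are actually stored after the node.
const VarDeclFlags = packed struct {
    has_type_node: bool = false,
    has_align_node: bool = false,
    has_init_node: bool = false,
};

/// How many trailing objects this VarDecl would need space for.
fn trailerCount(flags: VarDeclFlags) usize {
    var n: usize = 0;
    if (flags.has_type_node) n += 1;
    if (flags.has_align_node) n += 1;
    if (flags.has_init_node) n += 1;
    return n;
}

test "absent optional fields cost nothing" {
    // `var x = foo();` has an initializer but no explicit type or alignment,
    // so only one trailing object follows the node.
    try std.testing.expectEqual(@as(usize, 1), trailerCount(.{ .has_init_node = true }));
}
```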

Running `zig fmt` on the std lib:
 * cache-misses: 2,554,321 => 2,534,745
 * instructions: 3,293,220,119 => 3,302,479,874
 * peak memory: 74.0 MiB => 73.0 MiB

Holding the entire std lib AST in memory at the same time:

  93.9 MiB => 88.5 MiB
2020-07-15 02:07:30 -07:00
Vexu
1a989ba39d fix parser tests and add test for anytype conversion 2020-07-11 21:20:50 +03:00
Vexu
010c58e303 fix zig fmt out of bounds on empty file 2020-05-30 23:07:51 +03:00
Jakub Konka
e61e8c94be Reenable zig parser tests disabled targeting Wasm
I'm not sure why I disabled them when landing extended Wasm/WASI
support, but they pass the parser tests just fine now, so I'm gonna
go ahead and re-enable them.
2020-05-26 21:01:54 -04:00
Vexu
e07b467c7c fix missing compile error on while/for missing block 2020-05-25 23:25:06 +03:00
Andrew Kelley
8df0841d6a stage2 parser: token ids in their own array
To reduce cache misses, token ids go in their own array, and the
start/end offsets go in a different one.

perf measurement before:
         2,667,914      cache-misses:u
     2,139,139,935      instructions:u
       894,167,331      cycles:u

perf measurement after:
         1,757,723      cache-misses:u
     2,069,932,298      instructions:u
       858,105,570      cycles:u
2020-05-22 12:34:12 -04:00
Andrew Kelley
93384f7428 use singly linked lists for std.zig.parse
std.ast uses a singly linked list for lists of things. This is a
breaking change to the self-hosted parser API.

std.ast.Tree has been separated into a private "Parser" type which
represents in-progress parsing, and std.ast.Tree which has only
"output" data. This means cleaner, but breaking, API for parse results.
Specifically, `tokens` and `errors` are no longer SegmentedList but a
slice.

The way to iterate over AST nodes has necessarily changed since lists of
nodes are now singly linked lists rather than SegmentedList.
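
For illustration only (invented types, not the actual std.ast API), walking
such a list looks like:

```zig
const std = @import("std");

/// Invented illustration: each child keeps a pointer to its next sibling,
/// so a list of children is a head pointer plus one small link per node.
const Node = struct {
    next: ?*const Node = null,
};

fn countChildren(first: ?*const Node) usize {
    var n: usize = 0;
    var it = first;
    while (it) |node| : (it = node.next) n += 1;
    return n;
}

test "walking a singly linked child list" {
    const c = Node{};
    const b = Node{ .next = &c };
    const a = Node{ .next = &b };
    try std.testing.expectEqual(@as(usize, 3), countChildren(&a));
}
```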

From these changes, I observe the following on the
self-hosted-parser benchmark from ziglang/gotta-go-fast:

throughput: 45.6 MiB/s => 55.6 MiB/s
maxrss: 359 KB => 342 KB

This commit breaks the build; more updates are necessary to fix API
usage of the self-hosted parser.
2020-05-19 21:22:52 -04:00
Jakub Konka
3d267bab71 Re-enable refAllDecls gen and check in std.zig 2020-05-18 21:05:29 +02:00
Vexu
28d449b38d fix zig fmt regression 2020-05-17 15:13:19 +03:00
Vexu
081ffe24cf fix infinite loop with invalid comptime 2020-05-16 19:23:59 +03:00
Vexu
ed62081d38 recover from missing semicolon after if stmt 2020-05-16 12:29:01 +03:00