Commit graph

47 commits

Author SHA1 Message Date
Veikka Tuominen
fd77f2cfed std: update usage of std.testing 2021-05-08 15:15:30 +03:00
Andrew Kelley
8e6c2b7a47 Merge remote-tracking branch 'origin/master' into ast-memory-layout 2021-02-24 15:08:23 -07:00
Josh Wolfe
8b9434871e
Avoid concept of a "Unicode character" in documentation and error messages (#8059) 2021-02-24 08:26:13 -05:00
Isaac Freund
f3ee10b454
zig fmt: fix comments ending with EOF after decls
Achieve this by reducing the amount of special casing to handle EOF so
that the already correct logic for normal comments does not need to be
duplicated.
2021-02-22 18:32:37 +01:00
Veikka Tuominen
e2289961c6
snake_case Token.Tag 2021-02-12 02:12:00 +02:00
Andrew Kelley
272a0ab359 zig fmt: implement "line comment followed by top-level comptime" 2021-02-01 20:11:55 -07:00
Andrew Kelley
20554d32c0 zig fmt: start reworking with new memory layout
 * start implementation of ast.Tree.firstToken and lastToken
 * clarify some ast.Node doc comments
 * reimplement renderToken
2021-02-01 17:23:49 -07:00
Andrew Kelley
bf8fafc37d stage2: tokenizer does not emit line comments anymore
only std.zig.render cares about these, and it can find them in the
original source easily enough.
2021-01-31 21:57:48 -07:00
Andrew Kelley
4dca99d3f6 stage2: rework AST memory layout
This is a proof-of-concept of switching to a new memory layout for
tokens and AST nodes. The goal is threefold:

 * smaller memory footprint
 * faster performance for tokenization and parsing
 * most importantly, a proof-of-concept that can be also applied to ZIR
   and TZIR to improve the entire compiler pipeline in this way.

I had a few key insights here:

 * Underlying premise: using less memory will make things faster, because
   of fewer allocations and better cache utilization. Also using less
   memory is valuable in and of itself.
 * Using a struct-of-arrays for tokens and AST nodes saves the bytes of
   padding between the enum tag (which kind of token is it; which kind
   of AST node is it) and the next fields in the struct. It also improves
   cache coherence, since one can peek ahead in the tokens array without
   having to load the source locations of tokens.
 * Token memory can be conserved by storing only the tag (1 byte) and
   starting byte offset (4 bytes), for a total of 5 bytes per token. It is
   not necessary to store the ending byte offset: one can always
   re-tokenize later, and for most tokens the length can be trivially
   determined from the tag alone. For the ones where it can't, string
   literals for example, astgen must parse the string literal again later
   anyway, making it effectively free to re-tokenize. (A sketch of this
   token layout follows this list.)
 * AST nodes do not actually need to store more than 1 token index because
   one can poke left and right in the tokens array very cheaply.
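
A minimal sketch of that token layout, assuming std.MultiArrayList for the
struct-of-arrays storage and illustrative tag names:

```zig
const std = @import("std");

// Only the tag and the starting byte offset are stored: 5 bytes per token.
// MultiArrayList keeps each field in its own array, so no padding is paid
// between the 1-byte tag and the 4-byte offset.
const Token = struct {
    tag: Tag,
    start: u32,

    pub const Tag = enum(u8) { identifier, string_literal, l_paren, r_paren, eof };
};

const TokenList = std.MultiArrayList(Token);
```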

So far we are left with one big problem though: how can we put AST nodes
into an array, since different AST nodes are different sizes?

This is where my key observation comes in: one can have a hash table for
the extra data for the less common AST nodes! But it gets even better than
that:

I defined this data that is always present for every AST Node:

 * tag (1 byte)
   - which AST node is it
 * main_token (4 bytes, index into tokens array)
   - the tag determines which token this points to
 * struct{lhs: u32, rhs: u32}
   - enough to store 2 indexes to other AST nodes, the tag determines
     how to interpret this data

You can see how a binary operation, such as `a * b`, would fit into this
structure perfectly. A unary operation, such as `*a`, would also fit,
leaving `rhs` unused. So this is a total of 13 bytes per AST node.
And again, we don't have to pay for the padding to round up to 16 because
we store in struct-of-arrays format.
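
A minimal sketch of that per-node layout (the tag names here are
illustrative, not the final set):

```zig
const std = @import("std");

const Node = struct {
    tag: Tag, // 1 byte: which kind of AST node this is
    main_token: u32, // 4 bytes: index into the tokens array
    data: Data, // 8 bytes: two u32 slots, interpreted according to the tag

    pub const Tag = enum(u8) { mul, negation, call_one, call };
    pub const Data = struct { lhs: u32, rhs: u32 };
};

// Struct-of-arrays storage, so the 13 bytes per node are not padded to 16.
const NodeList = std.MultiArrayList(Node);
```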

I made a further observation: the only data AST nodes need to store,
other than the main_token, is indexes to sub-expressions. That's it.
The only purpose of an AST is to bring a tree structure to a list of tokens.
This means all the data a node stores is just a set of u32 indexes to
other nodes. Any other tokens can be found later by the compiler by poking
around in the tokens array, which again is super fast because it is
struct-of-arrays: often you only need to look at the token tags array,
which is an array of bytes, very cache friendly.

So nearly every kind of AST node can be stored in 13 bytes. For the
rarer AST nodes that need to store 3 or more indexes to other nodes, either
the lhs or the rhs is repurposed as an index into an extra_data array
which contains the extra AST node indexes. In other words, no hash table is
needed; it's just one big ArrayList with the extra data for AST nodes.

Final observation: there is no need to have a canonical tag for a given
construct. For example, the expression `foo(bar)` is a function call.
Function calls can have any number of parameters. However, in this example
we can encode the function call into the AST with a tag called
`FunctionCallOnlyOneParam`, and use lhs for the function expr and rhs for
the only parameter expr. Meanwhile, if the code were `foo(bar, baz)`, the
AST node would have to be `FunctionCall`, with lhs still being the function
expr but rhs being an index into `extra_data`. Then, because the tag is
`FunctionCall`, `extra_data[rhs]` is the "start" and `extra_data[rhs+1]`
is the "end". Now the range `extra_data[start..end]` describes the list of
parameters to the function.

Point being, you only have to pay for the extra bytes if the AST actually
requires it. There's no limit to the number of different AST tag encodings.
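
A minimal sketch of decoding such a `FunctionCall` node's parameter list,
assuming `extra_data` is one flat array of u32 as described (the function
name is illustrative):

```zig
// data.rhs of the node is an index into extra_data, where extra_data[rhs]
// is the "start" and extra_data[rhs + 1] is the "end" of the argument range.
fn callParams(extra_data: []const u32, rhs: u32) []const u32 {
    const start = extra_data[rhs];
    const end = extra_data[rhs + 1];
    return extra_data[start..end];
}
```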

Preliminary results:

 * 15% improvement on cache-misses
 * 28% improvement on total instructions executed
 * 26% improvement on total CPU cycles
 * 22% improvement on wall clock time

This is 1 of the 4 items on the checklist that must be done before this can actually be merged:

 * [x] parser
 * [ ] render (zig fmt)
 * [ ] astgen
 * [ ] translate-c
2021-01-30 20:16:59 -07:00
LemonBoy
dd973fb365 std: Use {s} instead of {} when printing strings 2021-01-02 17:12:57 -07:00
Frank Denis
6c2e0c2046 Year++ 2020-12-31 15:45:24 -08:00
Travis
d7f9128b5d add error message to zig side of tokenizing/parsing 2020-10-29 12:03:45 -05:00
Travis
960b5b518f updated zig tokenizer to handle .*** and added tests 2020-10-29 12:03:45 -05:00
Tadeo Kondrak
069fbb3c01
Add opaque type syntax 2020-10-06 22:08:24 -06:00
LemonBoy
5c6cd5e2c9 stage{1,2}: Fix parsing of range literals
stage1 was unable to parse ranges whose starting point was written in
binary/octal, as the first dot of '...' was incorrectly interpreted as a
decimal point.
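
A minimal example of the kind of range that was misparsed (illustrative):

```zig
fn isLowTwoBits(x: u8) bool {
    return switch (x) {
        // stage1 read the first '.' of '...' here as a decimal point,
        // because the range starts with a binary literal.
        0b00...0b11 => true,
        else => false,
    };
}
```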

stage2 forgot to reset the literal type to IntegerLiteral when it
discovered the dot was not a decimal point.

I only stumbled across this bug because zig fmt keeps formatting these
ranges without any space around the `...`.
2020-09-28 14:16:26 -04:00
Vexu
1174cb1517
stage2: fix tokenizer float bug 2020-09-03 15:05:47 +03:00
Andrew Kelley
4a69b11e74 add license header to all std lib files
add SPDX license identifier
copyright ownership is zig contributors
2020-08-20 16:07:04 -04:00
Vexu
c2fb4bfff3
add 'anytype' to self-hosted parser 2020-07-11 17:41:16 +03:00
Alexandros Naskos
aa1a727284 Allow carriage return in comments 2020-06-02 00:56:05 -04:00
Vexu
a47257d9b0
fix std.zig rejecting literal tabs in comments 2020-06-01 14:37:36 +03:00
Ryan Liptak
a9dd79d293 Use std.ComptimeStringMap in zig's tokenizer
Should have no performance change; the implementation was just made generic/re-usable
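
A minimal sketch of the pattern, with illustrative tag names; ComptimeStringMap
builds the lookup table at compile time from a list of key/value pairs:

```zig
const std = @import("std");

const Tag = enum { keyword_fn, keyword_return, identifier };

const keywords = std.ComptimeStringMap(Tag, .{
    .{ "fn", .keyword_fn },
    .{ "return", .keyword_return },
});

fn getKeyword(bytes: []const u8) ?Tag {
    return keywords.get(bytes);
}
```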
2020-05-26 21:36:58 -07:00
Andrew Kelley
ba41a9d5d7 different strategy for tokenizing keywords
throughput: 279 MiB/s => 347 MiB/s
2020-05-26 18:59:49 -04:00
Andrew Kelley
8df0841d6a stage2 parser: token ids in their own array
To prevent cache misses, token ids go in their own array, and the
start/end offsets go in a different one.
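
A minimal sketch of the split, with illustrative names:

```zig
const std = @import("std");

const TokenId = enum { identifier, l_paren, r_paren, eof };
const TokenLoc = struct { start: usize, end: usize };

// The hot data (ids) and the colder source offsets live in separate lists,
// so the parser can scan ids without pulling offsets into cache.
const Tokens = struct {
    ids: std.ArrayList(TokenId),
    locs: std.ArrayList(TokenLoc),
};
```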

perf measurement before:
         2,667,914      cache-misses:u
     2,139,139,935      instructions:u
       894,167,331      cycles:u

perf measurement after:
         1,757,723      cache-misses:u
     2,069,932,298      instructions:u
       858,105,570      cycles:u
2020-05-22 12:34:12 -04:00
Vexu
2296906e2a
modernize std.zig.tokenizer 2020-05-13 23:08:42 +03:00
Tadeo Kondrak
7ada59f873
remove nakedcc/stdcallcc/async fn/extern fn fnproto 2020-05-05 09:37:28 -06:00
Tadeo Kondrak
8d5636ebe4
Rename noasync to nosuspend in self-hosted, add rewriter 2020-05-05 05:17:33 -06:00
LemonBoy
0a2519fafb stage2: Allow \t in string literals
Lift the ban on literal tab chars in string literals as they have
nothing to do with code formatting.
2020-04-22 12:48:45 -04:00
momumi
2d18178c27 minor fixes and more tests for _ separators
* Make the tokenizer spit out an Invalid token on the first invalid
  character found in the number literal.
* More parsing and tokenizer tests for number literals
* fix invalid switch statement in ir.zig
2020-03-23 09:21:34 +10:00
momumi
29324e6f39 fix formatting in tokenizer tests 2020-03-22 12:41:11 +10:00
momumi
138dab4524 add number _ separators for stage 2 tokenizer 2020-03-22 10:35:19 +10:00
LemonBoy
59a243ce24 std: Remove now-superfluous hack 2020-02-16 19:53:53 +01:00
LemonBoy
563d9ebfe5 Implement the callconv() annotation 2020-01-02 18:53:16 +01:00
Lachlan Easton
fe0e8c87b7 Tokenizer: Copy optional tokens prior to being set to null #3737 (#3910)
* Tokenizer: Copy optional tokens prior to being set to null #3737

* Add TODO comments, reminder to audit copying optional pattern.
2019-12-16 11:01:02 -05:00
Vexu
1f84586836
translate-c-2 avoid collisions with zig keywords 2019-12-14 17:00:10 +02:00
Andrew Kelley
8b2622cdd5
std.fmt.format: tuple parameter instead of var args 2019-12-08 22:53:51 -05:00
Andrew Kelley
bf3ac66150
remove type coercion from array values to references
 * Implements #3768. This is a sweeping breaking change that requires
   many (trivial) edits to Zig source code. Array values are no longer
   coerced to slices; however, one may use `&` to obtain a reference to
   an array value, which may then be coerced to a slice (see the sketch
   after this list).

 * Adds `IrInstruction::dump`, for debugging purposes. It's useful for
   inspecting an instruction when debugging Zig IR.

 * Fixes bugs with result location semantics. See the new behavior test
   cases, and compile error test cases.

 * Fixes bugs with `@typeInfo` not properly resolving const values.

 * Behavior tests are passing but std lib tests are not yet. There
   is more work to do before merging this branch.
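
A minimal illustration of the coercion change described in the first item above:

```zig
const array = [_]u8{ 1, 2, 3 };

// const slice: []const u8 = array; // no longer allowed: array values do not coerce to slices
const slice: []const u8 = &array; // take a reference; the reference coerces to a slice
```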
2019-11-27 03:37:50 -05:00
Andrew Kelley
00878a15d7
zig fmt: support sentinel-terminated pointer syntax 2019-11-23 22:21:25 -05:00
Andrew Kelley
fd6020c4e2
update tests, better error messages, update self-hosted tokenizer 2019-11-21 20:43:41 -05:00
Andrew Kelley
874b34a30f
self-hosted tokenizer handles \r in string literals 2019-11-20 21:03:17 -05:00
Andrew Kelley
ed619245e9
remove dead code shebang support from self-hosted tokenizer 2019-11-20 20:18:19 -05:00
Vexu
b92f42d1f4
implemented container doc comments in stage 2 2019-11-14 16:25:06 +02:00
Andrew Kelley
ae0a219d1f
stop accepting deprecated use keyword
closes #2591
2019-11-11 13:02:39 -05:00
Andrew Kelley
e0db54e89d
update the codebase to use @as 2019-11-08 15:57:24 -05:00
LemonBoy
cf6fb89ced
Add small tokenizer test case for #3468 2019-11-07 02:46:10 -05:00
Brendan Hansknecht
c1e8fdf812 add token for parsing pointer dereference 2019-11-06 14:03:21 -05:00
Nick Erdmann
ae7392e504
unicode character literals 2019-10-07 08:18:16 +02:00
Andrew Kelley
ed36dbbd9c
mv std/ lib/
that's all this commit does. further commits will fix cli flags and
such.

see #2221
2019-09-25 23:35:41 -04:00
Renamed from std/zig/tokenizer.zig