Commit adab5f11 authored by James R. Wilcox

start writing S04 notes

parent 61170a21
# Fuzzing Trefoil v2
As we've discussed before, the Trefoil v2 interpreter processes its input in
several phases:
- tokenization
- PST parsing
- PST->AST conversion/parsing
- interpretation
The idea with fuzzing is to generate random inputs in an attempt to discover
errors. We can generate random inputs at different levels of structure, for
example, generating a random string character by character, or generating a
random sequence of tokens, etc. These levels of structure roughly correspond to
the phases of the interpreter's processing. For example, if we generate the
input character by character, we are unlikely to be lucky enough to write a long
program, so the later phases are rarely exercised. On the other hand, if we use
a very structured fuzzer that only generates syntactically correct programs,
then we will not be testing the error cases of the parser very deeply. These
differences in coverage mean that "more structured" fuzzers are not simply
"better". Instead, it is useful to have multiple fuzzers at different levels of
structure, so that all the phases of our interpreter are covered.
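To make the difference between levels of structure concrete, here is a minimal sketch in Python (used here only for brevity) of a character-level generator next to a token-level one. The alphabet, the token list, and the function names are assumptions made for illustration; the real Trefoil v2 lexer may accept different characters and keywords.

```python
import random

# Hypothetical alphabet and token set, chosen only for illustration.
ALPHABET = "()+-*=; \nabcxyz0123456789"
TOKENS = ["(", ")", "+", "-", "*", "=", "if", "let", "x", "y", "0", "1", "42"]


def random_chars(rng: random.Random, length: int) -> str:
    """Character level: every character is chosen independently."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def random_tokens(rng: random.Random, count: int) -> str:
    """Token level: every token is well formed, but the order is random."""
    return " ".join(rng.choice(TOKENS) for _ in range(count))


if __name__ == "__main__":
    rng = random.Random(341)  # fixed seed so runs are reproducible
    print(repr(random_chars(rng, 20)))
    print(repr(random_tokens(rng, 8)))
```

The character-level output mostly stresses the tokenizer, while the token-level output always tokenizes cleanly and so pushes more inputs into the parser.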
This week we will look at four fuzzers:
- character fuzzer
- token fuzzer
- token fuzzer with balanced parentheses
- grammar fuzzer
## Character fuzzer
Similar to one of our Trefoil v1 fuzzers, this fuzzer generates inputs character
by character. Running `test01CharFuzz` from the provided code for this week
results in a report whose last line looks like this (split across multiple lines
here for readability):
```
Paren:26419
Abstract:153665
Unbound Variables:810541
Unbound Functions:4947
Other Runtime:0
StackOverflows:0
Programs:4428
Total:1000000
```
The report indicates at the bottom that 1 million strings were generated. The
other lines describe what happened when those 1 million "programs" were
executed.
- About 81% of the results are unbound variable errors ("`Unbound Variables`").
This happens any time the program starts with a character that is a valid
variable name (not a parenthesis, semicolon, or digit), which happens a lot.
- About 2.6% of the results are parenthesized syntax errors ("`Paren`"), due to
unbalanced parentheses (either closing a paren that was never open, or reaching
end of input without closing all parens).
- About 0.5% of the results are unbound functions ("`Unbound Functions`"). This
happens when the fuzzer generates an open parenthesis as the first character and
then generates a non-keyword function name, which is in all likelihood not found.
- About 15% of the results have abstract syntax errors ("`Abstract`").
In this fuzzer, this primarily happens when the first character of the input
is a built-in operator (e.g. `+`, `=`, etc.). The staff solution requires
variable names not to conflict with built-in operators, so it reports this as
an abstract syntax error. (This is described in the spec, but was missing from
the starter code, so your solution is free to not do this and may report these
as unbound variables instead, which is totally fine.)
- The remaining 0.4% of results are valid programs that run to completion
("`Programs`"). Almost all of these succeed in large part due to being mostly
commented out.
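The two kinds of unbalanced-parenthesis error mentioned in the list above can be told apart with a single depth counter. The following is a minimal sketch in Python, not the staff parser: the function name and error strings are hypothetical, and comments are not treated specially.

```python
from typing import Optional


def paren_error(source: str) -> Optional[str]:
    """Return a description of the first paren error, or None if balanced.

    Illustrative sketch only: it ignores comments and every other token.
    """
    depth = 0
    for ch in source:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                # Closing a paren that was never opened.
                return "unexpected ')'"
            depth -= 1
    if depth > 0:
        # Reached end of input with parens still open.
        return "unclosed '('"
    return None


if __name__ == "__main__":
    for example in ["(+ 1 2)", ")", "((x)"]:
        print(repr(example), "->", paren_error(example))
```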
All in all, this fuzzer does not get very deep into the interpreter.
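A character fuzzer of this kind is essentially a loop that builds a random string, runs it, and tallies the outcome. The sketch below is in Python for brevity and is not the course's `test01CharFuzz`. The alphabet, the exception classes, and `toy_run` are hypothetical stand-ins for the real interpreter, so the counts it produces say nothing about Trefoil itself.

```python
import random
from collections import Counter
from typing import Callable

# Hypothetical alphabet; the provided fuzzer may draw from a different set.
ALPHABET = "()+=; \nabc0123456789"


class ParenError(Exception):
    """Hypothetical stand-in for a parenthesized syntax error."""


class AbstractSyntaxError(Exception):
    """Hypothetical stand-in for an abstract syntax error."""


def char_fuzz(run: Callable[[str], None], trials: int,
              max_len: int, seed: int = 0) -> Counter:
    """Generate `trials` random strings, run each, and tally the outcomes."""
    rng = random.Random(seed)
    tally: Counter = Counter()
    for _ in range(trials):
        length = rng.randint(0, max_len)
        program = "".join(rng.choice(ALPHABET) for _ in range(length))
        try:
            run(program)
            outcome = "Programs"
        except ParenError:
            outcome = "Paren"
        except AbstractSyntaxError:
            outcome = "Abstract"
        except RecursionError:
            outcome = "StackOverflows"
        except Exception:
            outcome = "Other Runtime"
        tally[outcome] += 1
    tally["Total"] = trials
    return tally


def toy_run(program: str) -> None:
    """Toy 'interpreter' that only checks paren balance (assumption)."""
    depth = 0
    for ch in program:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ParenError("unexpected ')'")
    if depth != 0:
        raise ParenError("unclosed '('")


if __name__ == "__main__":
    print(char_fuzz(toy_run, trials=1000, max_len=10, seed=341))
```

Passing the interpreter in as the `run` argument keeps the harness independent of any particular implementation, so the same loop could drive a different generator or a different interpreter.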
## Token fuzzer