[OnGoing] Crafting Interpreters: Notes

QiyueW - Nov 22, 2024 - Coding / Java / C
Posted on: Nov 22, 2024 | Last edited: Nov 24, 2024
Feels like every winter I make some attempt at coding and writing my “own little things”. Funny enough, I found this book last year and never had time to read it. And here I am again…
 
These notes are mostly written so that I can review later. I’ll only write down things I didn’t know, plus some logs of this “project”, but they might help you as well?

Intro

Brief & My Aim

The book Crafting Interpreters teaches you how to write two complete interpreters, one in Java and the other in C.
 
Now you might ask, why would I do this? Well, first of all, I personally find learning how these lines of code work pretty enjoyable. And the engineer’s way of figuring out how something works is to build one.
 
Okay, let’s spend less time doubting ourselves and stop constantly asking why we’re doing this.
 

A “Language” Explained

Just have a read of this chapter of the book, so you know what you’re doing.
 

A Simple Summary:

Front End:
Scanning → Parsing → Static analysis. Static analysis resolves each identifier (name) and binds it to the variable it refers to, and type-checks if the language is statically typed. The results are stored back as attributes on the syntax tree nodes, or in a symbol table (where the value associated with each key tells what that identifier refers to). The book also suggests that we could transform the tree (from parsing) into an entirely new data structure; that is introduced in the next section.
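To make the symbol table idea concrete, here is a minimal sketch (my own illustration, not code from the book): each key is an identifier’s name, and the value records what that name refers to.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical symbol table: maps an identifier's name to some
    // information about what the name refers to.
    class SymbolTable {
        private final Map<String, String> symbols = new HashMap<>();

        void define(String name, String info) {
            symbols.put(name, info);
        }

        String resolve(String name) {
            return symbols.get(name); // null means the identifier is undeclared
        }
    }

During static analysis you would define("language", "global variable") when the declaration is seen, and resolve("language") wherever the name is used.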
 
Middle End:
An intermediate representation (IR) acts as an interface between the source and destination forms.
 
This lets you write one front end for each source language and one back end for each target architecture, then mix and match them, instead of writing a compiler for every combination.
 
Optimization:
If some expression always evaluates to the exact same value, we can do the evaluation at compile time and replace the code for the expression with its result (constant folding).
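For instance, 1 + 2 can be replaced with 3 before the program ever runs. A rough sketch of constant folding over a toy expression tree (my own illustration in Java, not the book’s code; the node classes here are hypothetical):

    // Hypothetical expression tree nodes, just enough to show the idea.
    abstract class Expr {}

    class Literal extends Expr {
        final double value;
        Literal(double value) { this.value = value; }
    }

    class Add extends Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    class Folder {
        // If both operands of an addition are constants, evaluate it now
        // and replace the whole node with the result.
        static Expr fold(Expr expr) {
            if (expr instanceof Add) {
                Add add = (Add) expr;
                Expr l = fold(add.left);
                Expr r = fold(add.right);
                if (l instanceof Literal && r instanceof Literal) {
                    return new Literal(((Literal) l).value + ((Literal) r).value);
                }
                return new Add(l, r);
            }
            return expr;
        }
    }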
 
Back End:
Generates assembly-like instructions for the CPU to run.
 
Generate instructions for a real CPU or a virtual one?
 
Virtual machine: The compiler produces “bytecode”. Once you have bytecode, either write a little mini compiler for each target architecture that converts the bytecode to native code for that specific machine, OR write a virtual machine that emulates the virtual architecture at runtime, which is slower but more portable.
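A minimal sketch of such a virtual machine’s core loop (my own simplification; the book builds a real one, in C, in its second half): decode one instruction at a time and execute it on a stack.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical stack-based bytecode VM with three opcodes.
    class TinyVM {
        static final int OP_PUSH = 0; // next int in the code is the value to push
        static final int OP_ADD  = 1; // pop two values, push their sum
        static final int OP_HALT = 2; // stop and return the top of the stack

        static int run(int[] code) {
            Deque<Integer> stack = new ArrayDeque<>();
            int ip = 0; // instruction pointer
            while (true) {
                switch (code[ip++]) {
                    case OP_PUSH: stack.push(code[ip++]); break;
                    case OP_ADD:  stack.push(stack.pop() + stack.pop()); break;
                    case OP_HALT: return stack.pop();
                }
            }
        }
    }

For example, run(new int[]{OP_PUSH, 1, OP_PUSH, 2, OP_ADD, OP_HALT}) returns 3.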
 
Runtime:
The program needs services while running, e.g. a garbage collector to reclaim unused memory.
 
Shortcuts and Alternate Routes:
  • Single-pass compilers: Without allocating any syntax trees or other IRs, produce output code directly in the parser. There are no intermediate data structures to store global information, and no previously parsed part of the code is ever revisited. (Pascal & C)
  • Tree-walk interpreters: Execute code right after parsing it into a tree. To run the program, the interpreter traverses the syntax tree one branch and leaf at a time, evaluating each node as it goes (see the sketch after this list).
  • Transpilers (source-to-source compilers): After writing the front end of the language, produce a string of valid source code for some other high-level language, then use that language’s existing compilation tools to execute it.
  • Just-in-time compilation: When the program is loaded on the user’s machine, compile it to native code for their computer’s architecture.
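Here is the promised tree-walk sketch (my own illustration, reusing the hypothetical Expr / Literal / Add node classes from the constant-folding sketch above): evaluating a branch means recursively evaluating its children first.

    // Sketch of a tree-walk interpreter: walk the syntax tree and
    // evaluate each node as you reach it.
    class TreeWalker {
        static double evaluate(Expr expr) {
            if (expr instanceof Literal) {
                return ((Literal) expr).value;   // a leaf: just its value
            }
            Add add = (Add) expr;                // a branch: recurse first
            return evaluate(add.left) + evaluate(add.right);
        }
    }

So evaluate(new Add(new Literal(1), new Literal(2))) returns 3.0.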

Okay that was A LOT to go through…
 

Lox

Dynamic typing

Variables can store values of any type, and a single variable can even store values of different types at different times.
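In a Java-hosted interpreter like jlox, one way to implement this (and, as far as I can tell, what the book does) is to represent every Lox value as a plain java.lang.Object and check its actual class at runtime:

    // Sketch: a dynamically typed variable is just an Object slot, so the
    // same slot can hold different types at different times.
    class DynamicValueDemo {
        public static void main(String[] args) {
            Object value = 123.0;            // holding a number
            value = "now I am a string";     // same variable, now a string

            // The "type check" happens at runtime, not compile time:
            if (value instanceof String) {
                System.out.println(((String) value).length());
            }
        }
    }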
 

Automatic memory management

Two main techniques for managing memory:
  • Reference counting
  • Tracing garbage collection
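A rough sketch of the reference-counting idea (my own illustration; this is not how Java manages memory, and not code from the book): every object keeps a count of references to it and is reclaimed when the count drops to zero.

    // Hypothetical reference-counted object.
    class RefCounted {
        private int refCount = 0;

        void retain() { refCount++; }    // someone new points at this object

        void release() {                 // someone stopped pointing at it
            if (--refCount == 0) {
                // Nothing references this object anymore: reclaim it.
                System.out.println("freeing object");
            }
        }
    }

Tracing garbage collection instead starts from the roots, follows every reference, and frees whatever was never reached.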
 

Objects: Classes or prototypes

  • Class: “instances and classes. Instances store the state for each object and have a reference to the instance’s class. Classes contain the methods and inheritance chain. To call a method on an instance, there is always a level of indirection.”
  • “Prototype-based languages merge these two concepts. There are only objects—no classes—and each individual object may contain state and methods. Objects can directly inherit from each other (or “delegate to” in prototypal lingo)”
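A minimal sketch of that class-based level of indirection (my own simplification, loosely in the spirit of how a Java-hosted interpreter might do it): the instance holds the state plus a reference to its class, and method calls are looked up through the class.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: classes hold the methods, instances hold the state.
    class Klass {
        final Map<String, Runnable> methods = new HashMap<>();
    }

    class Instance {
        final Klass klass;                            // the level of indirection
        final Map<String, Object> fields = new HashMap<>();

        Instance(Klass klass) { this.klass = klass; }

        void call(String name) {
            klass.methods.get(name).run();            // look the method up on the class
        }
    }

In a prototype-based language the methods map would live directly on the object, and lookup would walk the chain of objects it delegates to.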

 

A Tree-Walk Interpreter

Finally! Coding! Yay!

Scanning

Update 23/11/2024
Finished writing the framework. It was a bit tricky to figure out what some lines are for, but thanks to Stack Overflow I worked it out and wrote loads of comments in my code. AND you won’t believe it, I had never come across Java before; I had used Python and learnt a bit about Kotlin and C… Therefore, a tough time for me reading these documents and searching Stack Overflow.

 
Update 24/11/2024
TokenType.java (stores the token types); Token.java (the structure of a token, plus converting it to a string).
Scan through the list of characters and group them together into the smallest meaningful sequences.
e.g. [var] [language] [=] ["lox"] [;]
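Roughly, Token.java follows the book’s structure: a token records its type (the enum from TokenType.java), the raw lexeme, any literal value, and the line it was found on.

    // Token.java, roughly as in the book.
    class Token {
        final TokenType type;    // e.g. VAR, IDENTIFIER, EQUAL, STRING, SEMICOLON
        final String lexeme;     // the raw text, e.g. "language"
        final Object literal;    // the runtime value for literals, e.g. "lox"
        final int line;          // for error reporting

        Token(TokenType type, String lexeme, Object literal, int line) {
            this.type = type;
            this.lexeme = lexeme;
            this.literal = literal;
            this.line = line;
        }

        public String toString() {
            return type + " " + lexeme + " " + literal;
        }
    }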
 
Chomsky Hierarchy: Classifies formal languages and grammars based on their complexity and generative power. I was a bit confused by this, so I searched around on the Internet.
 
You can think of the Chomsky hierarchy as a classification of games for making sentences. Each game is as follows: you have symbols and text, and rules for rewriting something you've written into some other string of symbols and text. From a starting symbol, you apply the rules however you like until you end up with only text, and that is one of the sentences produced by this system (called a grammar).
For example, let's say I have the symbols A and B, and the pieces of text are a and b. My starting symbol is A, and the rules are (1) A -> Ba, (2) B -> Ba, and (3) B -> b. So let's try to make some sentences: A -> Ba -> ba using rules (1) then (3), or A -> Ba -> Baa -> baa using rules (1), (2), then (3). Every sentence this grammar produces is a b followed by one or more a's.
Here's a slightly more complex set of rules: same symbols and text, but now the rules are (1) A -> aA, (2) A -> B, (3) aB -> b. I can do something like this: A -> aA -> aaA -> aaB -> ab, where the last step rewrites the substring aB into b. Notice that rule (3) consumes an a along with the B, so the sentences are zero or more a's followed by a single b.
I hope these examples helped a little bit by showing you how broad the rules are - whenever you have a partially derived sentence, you can pick any bit of it that matches the left hand side of a rule, and transform it into the right hand side of a rule. It can take some work to figure out all the possible sentences that some set of rules can produce.
 
The Chomsky hierarchy is one way to classify sets of these rules depending on their relative power. I'll try to give an intuitive explanation of each level based on the wiki, based on looking at these rules as capabilities that you have to construct sentences.
 
Type-3 grammars only know what's going on at the rightmost end of your derivation. You could never use these, for example, to describe palindromes, because to build a palindrome from left to right you need to keep track of what was going on on the left hand side, and type-3 grammars only know about the current symbol. However, you can use them to produce strings whose length is a multiple of 5, by keeping track of every chunk of 5 you create (e.g. S -> aaaaaS, S -> aaaaa).
 
Type-2 grammars are more flexible - they know what's going on at every symbol that hasn't yet been turned into a piece of text, which can be on the right, left, or in the middle. You can use them to build a palindrome, by keeping your symbol in the middle and emitting mirrored text to either side (e.g. A -> aAa, B -> bBb, A -> B, B -> A). However, it still doesn't know about stuff that's happening far away, so you can't use it to describe even something like abc, aabbcc, aaabbbccc, ... - the sentences with three matching parts. This is because whatever you use to match up a's and b's will lose track of whatever you're using to match up b's and c's.
 
Type-1 grammars are even more flexible - instead of only the symbol, the rule can also 'look at' some of the surrounding text and symbols. Since you can look at context, you can now actually do that aaabbbccc example above, by leaving behind context while matching up a and b that allows you to further match up c. It's honestly somewhat hard, when you're thinking of sentences, to make something that you can't do with a type-1 grammar - however, note that you cannot delete any text with this grammar, so if you have short sentences which nevertheless take a huge amount of work to verify, you won't be able to do this in a type-1 grammar. The most natural examples are often facts about arbitrarily large games (e.g. best first move on a chessboard of any size), and if you look at sentence production as a game, this includes facts about Chomsky grammars themselves (e.g. printing all the type-3 grammars that produce the same sentences can't be done by a type-1 grammar).
 
Type-0 grammars have no restrictions, meaning they can also rewrite existing text. At this point you essentially have a computer - you have memory that you can access and change at will to produce some end result. Type-0 grammars are as powerful as any kind of computation we know about (at least if you don't consider complexity - we don't have anything that can produce more kinds of languages, but we do have systems that are more efficient at doing so). There are things type-0 grammars can't do, and they are similar in nature to the things type-1 grammars can't do. Type-1 grammars can't print small sentences that take a huge amount of work, and type-0 grammars can't print sentences that take an infinite amount of work. For example, while you can now solve the previous problem of printing all the type-3 grammars that produce the same sentences, the next step of this (all the type-2 grammars that produce the same sentences) can't even be done here, because (very loosely speaking) there's no theoretical shortcut to keeping track of the infinite strings that each grammar is producing.
 
Scanner.java (the scanner itself)
After writing the single-character lexemes and their token types in private void scanToken(), I’m tired and done for today… It’s not the code itself that’s hard to understand, but the theories behind it…
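For reference, the single-character part of scanToken() looks roughly like this (following the book; advance() consumes the next character and addToken() records the token, both being helpers in Scanner.java):

    // Inside Scanner.java: recognize the single-character lexemes.
    private void scanToken() {
        char c = advance();
        switch (c) {
            case '(': addToken(TokenType.LEFT_PAREN); break;
            case ')': addToken(TokenType.RIGHT_PAREN); break;
            case '{': addToken(TokenType.LEFT_BRACE); break;
            case '}': addToken(TokenType.RIGHT_BRACE); break;
            case ',': addToken(TokenType.COMMA); break;
            case '.': addToken(TokenType.DOT); break;
            case '-': addToken(TokenType.MINUS); break;
            case '+': addToken(TokenType.PLUS); break;
            case ';': addToken(TokenType.SEMICOLON); break;
            case '*': addToken(TokenType.STAR); break;
        }
    }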
 