verona Typed AST C++

The compiler currently uses an "untyped" AST, which is essentially an S-expression. This makes working on it very painful and code hard to maintain, as there is no central definition of what the shape of the tree needs to be.

We should instead replace it with an explicit and typed AST, akin to what the original prototype had.

The plan for now is: 1. Define the typed AST 2. Add a translation pass from the untyped AST to the typed one 3. Change the AST->MLIR lowering to operate on the typed AST. 4. Incrementally shift existing passes from the untyped AST to the typed one. This needs to happen in reverse compared to the execution order. 5. Make the parser produce a typed AST. There are different ways we can achieve this, either by using peglib's semantic actions, or by writing the parser by hand, or we never do this step and keep the untyped->typed conversion "forever" (until the next time we rewrite the compiler). 6. Delete all untyped AST code.

There's a couple of open questions on the design of the AST itself:

  • How are variants of expressions, types, class members, ..., represented?
    • Abstract class with concrete subclasses.
    • typedef std::variant<FooExpr, BarExpr, ...> Expr;
      • This is what the prototype's IR used.
  • How are variants traversed, aka the expression problem:
    • Add virtual methods to each variant.
    • Use a visitor pattern
    • std::visit (assuming we chose std::variant)
    • LLVM TypeSwitch
    • dynamic_cast
    • Use a cast + visitor pattern. This allows the visitor to be templated over additional arguments and a return type.
  • How are source locations represented? Do all nodes carry a location, or only select ones?
  • How are symbol tables and symbol references represented?
    • In particular, I think that after a dedicated resolution pass, all symbol references have direct pointers to the defining node. Ulterior passes can directly access that pointer and use it as a simple unique identifier, rather than reimplementing their own lookup tables.
  • Are strings in the AST interned?
  • Can we reuse any of the original prototype's AST, visitor infrastructure and name resolution pass?

cc @rengolin @sylvanc @mjp41

Asked Oct 08 '21 14:10
avatar plietar
plietar

6 Answer:

Hi Paul, thanks for the detailed description, it's an excellent starting point.

You can see how the MLIR generator uses the AST here. I have extracted all functionality there exactly for this reason, so we can design an easier-to-use AST library (or a fully typed AST).

Answer to some questions:

* Should we set up [LLVM-style RTTI](https://llvm.org/docs/HowToSetUpLLVMStyleRTTI.html)?

Probably. I'd like to get rid of exceptions and dynamic RTTI from the compiler and that's a good and well documented way of doing so. But I'm biased by having worked with LLVM for too long, so I'm open to new ideas.

We'll need to traverse the AST from both the parser and the MLIR generator, which may be completely different infrastructures. Making them similar could simplify the traversal algorithms, but we need to be careful on how much that'll make other infrastructure more complex.

  • How are source locations represented? Do all nodes carry a location, or only select ones?

Every node has location that was set by the PEG parser. Both AST and MLIR require source locations to be propagated.

New nodes are created "from" old ones, taking their source location.

Ast foo = ast::node(bar, "name");
assert(foo.path == bar.path && foo.line == bar.line && foo.column == bar.column);

Those locations are passed to MLIR via:

auto path = AST::getPath(ast);
return builder.getFileLineColLoc(Identifier::get(path.file, context), path.line, path.column);

We must maintain the same requirements to avoid losing the location half-way through.

  • In particular, I think that after a dedicated resolution pass, all symbol references have direct pointers to the defining node. Ulterior passes can directly access that pointer and use it as a simple unique identifier, rather than reimplementing their own lookup tables.

I'm not sure what you mean here. Class types have a get_def function to retrieve their declared types, but I'm not sure variables can point to their declarations.

But in the MLIR we can't use the declarations in a nested symbol table, which will need to update the SSA values that correspond to the latest scoped value for each variable.

  • Can we reuse any of the original prototype's AST, visitor infrastructure and name resolution pass?

That'll depend if the AST we use ends up similar to that one, than most likely yes. But we'll have to introduce it slowly, so probably not all of it will be relevant initially.

1
Answered Nov 10 '20 at 14:54
avatar  of rengolin
rengolin

I am not yet very familiar with this project, but I wonder if it would be possible to represent the AST as an MLIR dialect? Or is there any technical reason that would make this difficult?

1
Answered Nov 10 '20 at 16:39
avatar  of tobiasgrosser
tobiasgrosser

@tobiasgrosser we considered this a while ago, #101. @rengolin summarised our thinking at the time in: https://github.com/microsoft/verona/issues/101#issuecomment-629149425

1
Answered Nov 10 '20 at 16:55
avatar  of mjp41
mjp41

In particular, I think that after a dedicated resolution pass, all symbol references have direct pointers to the defining node. Ulterior passes can directly access that pointer and use it as a simple unique identifier, rather than reimplementing their own lookup tables.

I'm not sure what you mean here. Class types have a get_def function to retrieve their declared types, but I'm not sure variables can point to their declarations.

But in the MLIR we can't use the declarations in a nested symbol table, which will need to update the SSA values that correspond to the latest scoped value for each variable.

Take expressions for example, there will be a LocalRef node which references an existing variable declaration. When manipulating the AST (eg. when lowering to MLIR), we often need to find the corresponding declaration.

Currently in the generator we have a series of String -> Value maps to find the alloca from the variable's name, one per scope. Using names is pretty inefficient, full of footguns, and more complicated than it needs to be. In particular you need to implement all the non-trivial scoping rules. We already have a pass which implements the name resolution logic (ref.cc), but we're throwing all of that information away.

If the LocalRef had a pointer to the variable declaration, we could hash the pointer and have a single LocalDecl* -> Value map. There is no more problem shadowing or multiple declarations with the same name anymore, since the pointer's value is unique to each declaration. There is no need to worry about scoping rules and nested symbol tables. All this information is directly available to you.

In Clang's AST for example, the LocalRef node holds a pointer to a ValueDecl.

1
Answered Nov 12 '20 at 14:40
avatar  of plietar
plietar

If the LocalRef had a pointer to the variable declaration, we could hash the pointer and have a single LocalDecl* -> Value map. There is no more problem shadowing or multiple declarations with the same name anymore, since the pointer's value is unique to each declaration. There is no need to worry about scoping rules and nested symbol tables. All this information is directly available to you.

As discussed in other issues, this only works for things parsed or generated by the AST. Nodes, variables and functions generated on the MLIR side (lambdas, literals, behaviours) will all need to interact with this namespace and won't have AST nodes to point to.

Furthermore, scope is not only used to find the declaration of a variable, but also updates, which depend on the CFG and bring the correct SSA value in the local context. If we only look at the declaration we'll be ignoring any updates to variables and always using the original declared value.

Finally, updates on a local context will shadow the original declaration regardless if the map uses a string or a LocalDecl* to track updates.

1
Answered Nov 12 '20 at 15:00
avatar  of rengolin
rengolin

Nodes, variables and functions generated on the MLIR side (lambdas, literals, behaviours) will all need to interact with this namespace and won't have AST nodes to point to.

I don't know what "this namespace" is. An AST node cannot reference an MLIR concept, so there is no issue with making AST nodes have pointers to other AST nodes.

If we only look at the declaration we'll be ignoring any updates to variables and always using the original declared value.

Of course. But we're only using the definition pointer as an identifier in the map. We can still update the map when we need to.

Finally, updates on a local context will shadow the original declaration regardless if the map uses a string or a LocalDecl* to track updates.

Sorry I don't understand what this means.

1
Answered Nov 12 '20 at 15:16
avatar  of plietar
plietar