verona: Type representation, sizeof, structure layout

The MLIR representation is taking shape, but we're still using mostly opaque instructions and types, because the Verona types are not complete yet (in MLIR). However, to start lowering classes and members, we need to know how to represent them at a lower level than just an opaque MLIR dialect.

Right now, we don't seem to have any particular type defined. But in the grammar, we have some constants:

* Int (which will be either U or S, with bit-size from 8 to 128)
* Float (with bit-size from 8 to 128?)
* Octal/Hex, which I'm assuming will default to Int (but could also be a float or pointer type)
* String (multi-byte characters? which? all?)

We'll also need to represent pointers/references, which will have additional capability/range tags in them, so I'm assuming they will be at least 128 bits wide (like CHERI).

There are also tuples and structures, which will bring ABI issues to light. We need to decide how we pack them, what the optimal padding between elements is, how derived-class storage is laid out, the vtable layout and offset, etc. If we want to control packing, we'll need a special keyword for that; otherwise we must guarantee correct behaviour at all times. For example, if we don't pack anything, it would be interesting to have tuples and structs with the same layout, so we can instantiate one from the other with a simple memcpy.
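
To make that last point concrete, here's a minimal C++ sketch (the types are hypothetical) of what "same layout" buys us: if the compiler guarantees that a struct and the equivalent tuple agree on field order, padding, and alignment, converting one into the other is a single memcpy.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical analogue: a named struct and a positional "tuple"
// with the same field order, padding, and alignment.
struct Point  { int32_t x;  int32_t y;  double w;  };
struct Triple { int32_t f0; int32_t f1; double f2; };

static_assert(sizeof(Point) == sizeof(Triple), "layouts must match");
static_assert(alignof(Point) == alignof(Triple), "alignment must match");

// Legal precisely because the two layouts are guaranteed to agree.
Triple to_triple(const Point &p) {
    Triple t;
    std::memcpy(&t, &p, sizeof t);
    return t;
}
```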

Finally, there are type unions and intersections. Depending on what they mean when formed from two different concrete types (like "int and float", "int or float"), we'll need to create storage for the largest value and pad the smaller one so that reads and writes are aligned, etc. Capabilities would probably not have a concrete representation in memory, other than pointers (which already count on that space being used), so this may not be an issue for numeric type representation.
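
For the numeric case, a minimal C++ sketch of that storage scheme (the names are illustrative): the payload is sized and aligned for the largest alternative, so reads and writes of either member stay aligned.

```cpp
#include <cstdint>
#include <cstring>

// Sketch of storage for (I32 | F64): the payload is sized and aligned
// for the largest alternative (the F64), and the smaller one is padded.
// The discriminator lives outside the payload.
struct I32orF64 {
    enum class Tag : uint8_t { I32, F64 } tag;
    alignas(double) unsigned char payload[sizeof(double)];

    void set(int32_t v) { tag = Tag::I32; std::memcpy(payload, &v, sizeof v); }
    void set(double v)  { tag = Tag::F64; std::memcpy(payload, &v, sizeof v); }
};
```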

It would be nice to write down the set of concrete decisions on type sizes, padding, and alignment for all types we support in Verona, so that issues on the MLIR / LLVM layers can take that into account when implementing the lowering of load/store patterns.

@davidchisnall @sylvanc @mjp41 @Theodus @plietar

Ref: #259.

Asked Oct 08 '21 14:10
rengolin

4 Answers:

It may be worth reviewing some of the intersection/constraint-based layout work from the BigBang project at JHU; https://pl.cs.jhu.edu/projects/big-bang/papers/types-for-flexible-objects.pdf (paper) and https://pl.cs.jhu.edu/projects/big-bang/dissertations/safe-fast-and-easy--towards-scalable-scripting-languages.pdf (thesis) are probably the right things to rummage through for ideas.

Answered Sep 07 '20 at 18:39
nwf

For union types, we wanted to explicitly not make those part of the public ABI, to enable dense tagged unions. For example, on CHERI a union of a pointer and an integer may use the tag bit as a discriminator; on conventional 64-bit architectures it may use one of the low bits if it can guarantee alignment of the pointer. A union of a 64-bit float and a 32-bit integer should automatically use NaN boxing, and so on. The rules for this will end up being both quite complex and quite dependent on the target's native types.
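
A rough C++ sketch of the NaN-boxing case, to pin down the idea (the tag pattern is one common choice, not a proposal for the actual ABI; a real scheme would also have to normalise canonical NaNs):

```cpp
#include <cstdint>
#include <cstring>

// NaN-boxing sketch for (F64 | I32): any bit pattern with these top 13
// bits set is a non-canonical NaN, so the low 32 bits can carry an
// integer with no separate tag word. Hardware-produced NaNs are the
// canonical 0x7FF8... pattern, so they don't collide with boxed values.
constexpr uint64_t kBoxTag = 0xFFF8'0000'0000'0000ULL;

uint64_t box_f64(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;                      // ordinary doubles are stored as-is
}

uint64_t box_i32(int32_t i) {
    return kBoxTag | static_cast<uint32_t>(i);
}

bool is_i32(uint64_t v)       { return (v & kBoxTag) == kBoxTag; }
int32_t unbox_i32(uint64_t v) { return static_cast<int32_t>(v); }
```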

I'd also like to be able to use the zero value. For example, Foo* | None, where None is a singleton, is how you represent the equivalent of a nullable type in Pony. Pony will use the address of the None singleton, but this requires every pattern match to be a register-register compare and requires additional instructions to materialise the address of the None singleton (on CHERI, that will require the loader to construct a capability and will then be the equivalent of a GOT-relative access). In contrast, comparison to 0 is a single instruction with a single data dependency on most architectures (even RISC-V has a single-instruction branch if [not] equal to zero), because C functions with if (x == NULL) guards are one of the most common data-dependent branch structures. So, for example, Foo* | Bar* | None would ideally be represented as: 0 indicates None; low bit 0 with some of the high bits non-zero indicates that the remaining bits are a pointer to Foo; and low bit 1 indicates that the remaining bits are a pointer to Bar.
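
That encoding, sketched in C++ (assuming Foo and Bar are at least 2-byte aligned, so the low pointer bit is always free):

```cpp
#include <cstdint>

struct Foo; struct Bar;

// (Foo* | Bar* | None) in one word:
//   0                      -> None (a single compare against zero)
//   low bit 0, value != 0  -> the bits are a Foo*
//   low bit 1              -> the remaining bits are a Bar*
using Packed = std::uintptr_t;

Packed pack_none()      { return 0; }
Packed pack_foo(Foo *p) { return reinterpret_cast<std::uintptr_t>(p); }
Packed pack_bar(Bar *p) { return reinterpret_cast<std::uintptr_t>(p) | 1u; }

bool is_none(Packed v) { return v == 0; }  // single-instruction test
bool is_bar(Packed v)  { return (v & 1u) != 0; }
bool is_foo(Packed v)  { return v != 0 && (v & 1u) == 0; }

Foo *unpack_foo(Packed v) { return reinterpret_cast<Foo *>(v); }
Bar *unpack_bar(Packed v) { return reinterpret_cast<Bar *>(v & ~std::uintptr_t{1}); }
```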

Ideally, I'd like to lower this to the target ABI as late as possible, because we're going to end up with a load of bit twiddling that is likely to make analyses harder.

I'm not sure what an intersection type on concrete types would be. I32 & F64, for example, should give the empty set.

In terms of constants in the source, I think we probably need a longer discussion. I would like for these to be treated as a distinct family of types that can be used only for initialisation. For example, 42 can be used to initialise an I64 or a U8 (or an F32 or whatever) but it is not that type until after type inference.

String literals are more important to get right here. Baking a specific string type into the language is usually a mistake. I would like a compile-time iterable sequence of UTF-8 characters (probably) that can be used to initialise a standard-library string type, but which must be explicitly converted when used with something that expects a concrete type; that's a longer discussion, though. In the standard library, given our intended set of uses, we are going to want a lightweight ASCII-and-no-complex-collation string and a rich unicode string, with the default being selected per project.

Answered Sep 08 '20 at 09:02
davidchisnall

For union types, we wanted to explicitly not make those part of the public ABI, to enable dense tagged unions. (...) I'd also like to be able to use the zero value.

These make sense to me.

Ideally, I'd like to lower this to the target ABI as late as possible, because we're going to end up with a load of bit twiddling that is likely to make analyses harder.

How late?

We can:

1. Do everything we need to in MLIR before we lower to LLVM and let the compiler deal with the IR.
2. Keep some pieces in the Verona dialect (type handling, for example) and lower directly to LLVM IR with special treatment.
3. Add late passes to LLVM to do that lowering at the LLVM IR level.

(1) is still pretty high level, while (3) will add complexities we may not want or need. (2) may be about "just right".

I'm not sure what an intersection type on concrete types would be. I32 & F64, for example, should give the empty set.

In that sense, an object with that type would be unassignable and useless, so maybe we should forbid it in the semantics and add it to the checks the type checker performs?

In terms of constants in the source, I think we probably need a longer discussion. I would like for these to be treated as a distinct family of types that can be used only for initialisation. For example, 42 can be used to initialise an I64 or a U8 (or an F32 or whatever) but it is not that type until after type inference.

@sylvanc was discussing implementing constants as structures with special behaviour. IIUC, Int could have a getU8, getU16... getS128, and be used accordingly. Float would have a similar structure. Octal and Hex would probably have to have both functionalities, and maybe even addresses.
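
If I understand the idea, something like this C++ sketch (the getU8...getS128 names come from the paragraph above; everything else is illustrative):

```cpp
#include <cstdint>

// An integer literal carried as an untyped value with per-width
// accessors; it only becomes a concrete type when one is called.
struct IntLiteral {
    unsigned long long value;  // wide enough for the literal's digits

    uint8_t  getU8()  const { return static_cast<uint8_t>(value); }
    uint16_t getU16() const { return static_cast<uint16_t>(value); }
    int32_t  getS32() const { return static_cast<int32_t>(value); }
    int64_t  getS64() const { return static_cast<int64_t>(value); }
    // ... up to getU128/getS128, presumably with range checks that
    // reject literals that don't fit the requested width.
};
```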

String literals are more important to get right here. Baking a specific string type into the language is usually a mistake. I would like a compile-time iterable sequence of UTF-8 characters (probably) that can be used to initialise a standard-library string type, but which must be explicitly converted when used with something that expects a concrete type; that's a longer discussion, though. In the standard library, given our intended set of uses, we are going to want a lightweight ASCII-and-no-complex-collation string and a rich unicode string, with the default being selected per project.

I agree strings should be implemented in the library, mostly because getting it right is really hard, and baking that into the compiler would be a nightmare. But we still need a "compile-time character sequence" of some sort inside the compiler. That's the one I was referring to.

Do we want literal strings to be a type of their own, independent of character types? Or do we want them to be a list of characters?

What would be the impact, in real terms, of not having an ASCII string representation and going straight to UTF-8 for everything?

Most performance-critical programs don't need to output strings in hot code, and most programs that do output strings in hot code produce them for user consumption (e.g. PHP), which then needs to be wide.

Answered Sep 08 '20 at 11:13
rengolin

How late?

I'm not sure.

We can:

1. Do everything we need to in MLIR before we lower to LLVM and let the compiler deal with the IR.
2. Keep some pieces in the Verona dialect (type handling, for example) and lower directly to LLVM IR with special treatment.
3. Add late passes to LLVM to do that lowering at the LLVM IR level.

(1) is still pretty high level, while (3) will add complexities we may not want or need. (2) may be about "just right".

I think 3 will probably cause pain, but either 1 or 2 sounds plausible.

Do we want literal strings to be a type of their own, independent of character types? Or do we want them to be a list of characters?

My feeling is that "foo" is a thing that has a length and iterators that give you characters and can be used to construct a string type.
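
Something like this C++ sketch, perhaps (names illustrative):

```cpp
#include <cstddef>

// A compile-time view over the bytes of a source literal: it has a
// length and iterators, and any concrete string type can be built
// from it.
struct LiteralView {
    const char *data;
    std::size_t len;

    constexpr const char *begin() const { return data; }
    constexpr const char *end()   const { return data + len; }
    constexpr std::size_t size()  const { return len; }
};

template <std::size_t N>
constexpr LiteralView lit(const char (&s)[N]) { return {s, N - 1}; }

// A library string type would then provide a constructor taking a
// LiteralView.
```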

What would be the impact, in real terms, of not having an ASCII string representation and going straight to UTF-8 for everything?

Everything that deals with strings needs to be able to parse UTF-8 and deal with variable-width characters. For kernel code, that's a lot of complexity in anything that handles strings, which should be delegated to userspace code.

Additionally, baking UTF-8 in as the default will annoy around 2/3 of the world's population. UTF-8 is both less space-efficient and less time-efficient than UTF-16 for most non-alphabetic languages. For the rich standard-library string, we want an interface that understands unicode code points and glyphs, but concrete implementations that can handle UTF-8 or UTF-16 (or UTF-32) encodings.
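
A rough C++ sketch of that split (purely illustrative; a real design would also cover glyphs and collation):

```cpp
#include <cstddef>
#include <vector>

// One interface in terms of unicode code points...
struct UnicodeString {
    virtual ~UnicodeString() = default;
    virtual std::size_t codePointCount() const = 0;
    virtual char32_t codePointAt(std::size_t i) const = 0;
};

// ...with the encoding chosen by the concrete implementation. A UTF-32
// store is shown for brevity; a Utf8String/Utf16String would keep the
// encoded bytes and decode on access.
struct Utf32String : UnicodeString {
    std::vector<char32_t> cps;
    std::size_t codePointCount() const override { return cps.size(); }
    char32_t codePointAt(std::size_t i) const override { return cps[i]; }
};
```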

Answered Sep 08 '20 at 12:01
davidchisnall