verona: Type representation, sizeof, structure layout
The MLIR representation is taking shape, but we're still using mostly opaque instructions and types, because the Verona types are not complete yet (in MLIR). However, to start lowering classes and members, we need to know how to represent them at a lower level than just an opaque MLIR dialect.
Right now, we don't seem to have any particular type defined. But in the grammar, we have some constants:
* `Int` (which will be either `U` or `S`, with bit-size from 8 to 128)
* `Float` (with bit-size from 8 to 128?)
* Octal/Hex, which I'm assuming will default to `Int` (but could also be float or pointer type)
* String (multi-byte characters? which? all?)
We'll also need to represent pointers/references, which will have additional capability/range tags in them, so I'm assuming they will be at least 128 bits wide (like CHERI).
There are also tuples and structures, which will bring ABI issues to light. We need to decide how we pack them, what the optimal padding between elements is, how derived-class storage works, vtable layout and offset, etc. If we want to control packing, we'll need a special keyword for that; otherwise we must guarantee correct behaviour at all times. For example, if we don't pack anything, it would be interesting to have tuples and structs with the same layout, so we can instantiate one from the other with a simple `memcpy`.
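A minimal sketch of that layout-compatibility idea, in C++ terms (`PairClass`/`PairTuple` are hypothetical analogues of a Verona class and tuple with the same fields, not anything the language guarantees today): if the two forms have identical size and field offsets, conversion is a plain `memcpy`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical analogues: a tuple (U32, F64) and a class with the same
// fields, assumed (not guaranteed!) to share one layout.
struct PairClass { uint32_t a; double b; };
struct PairTuple { uint32_t first; double second; };

static_assert(sizeof(PairClass) == sizeof(PairTuple),
              "conversion by memcpy requires identical layouts");

PairTuple to_tuple(const PairClass& p) {
  PairTuple t;
  std::memcpy(&t, &p, sizeof(t)); // valid only because the layouts match
  return t;
}
```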
Finally, there are type unions and intersections. Depending on what they mean when applied to two different concrete types (like `int` and `float`, `int` or `float`), we'll need to create storage for the largest value and pad the smallest so reads and writes are aligned, etc. Capabilities would probably not have a concrete representation in memory, other than pointers (which already count on that space being used), so this may not be an issue for numeric type representation.
It would be nice to write down the set of concrete decisions on type sizes, padding, and alignment for all types we support in Verona, so that issues on the MLIR/LLVM layers can take that into account when implementing lowering of load/store patterns.
@davidchisnall @sylvanc @mjp41 @Theodus @plietar
Ref: #259.
4 Answers:
It may be worth reviewing some of the intersection/constraint-based layout work from the BigBang project at JHU; https://pl.cs.jhu.edu/projects/big-bang/papers/types-for-flexible-objects.pdf (paper) and https://pl.cs.jhu.edu/projects/big-bang/dissertations/safe-fast-and-easy--towards-scalable-scripting-languages.pdf (thesis) are probably the right things to rummage through for ideas.
For union types, we wanted to explicitly not make those part of the public ABI, to enable dense tagged unions. For example, on CHERI a union of a pointer and an integer may use the tag bit as a discriminator; on conventional 64-bit architectures it may use one of the low bits if it can guarantee alignment of the pointer. A union of a 64-bit float and a 32-bit integer should automatically use NaN boxing, and so on. The rules for this will end up being both quite complex and quite dependent on the target's native types.
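To make the NaN-boxing case concrete, here is a hedged sketch (hypothetical helper names and tag value, not a committed Verona ABI; it assumes boxed doubles are canonicalised to one standard quiet NaN so payloads never collide, and uses C++20's `std::bit_cast`): an `F64 | U32` union fits in 64 bits by reserving one NaN bit pattern as the integer tag.

```cpp
#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

// Hypothetical tag: sign bit + all-ones exponent + quiet bit, plus a payload
// pattern that canonicalised hardware NaNs never produce.
constexpr uint64_t kU32Tag  = 0xFFF8'0001'0000'0000ULL;
constexpr uint64_t kTagMask = 0xFFFF'FFFF'0000'0000ULL;

uint64_t box_f64(double d)   { return std::bit_cast<uint64_t>(d); }
uint64_t box_u32(uint32_t i) { return kU32Tag | i; }

bool     is_u32(uint64_t v)    { return (v & kTagMask) == kU32Tag; }
double   unbox_f64(uint64_t v) { return std::bit_cast<double>(v); }
uint32_t unbox_u32(uint64_t v) { return static_cast<uint32_t>(v); }
```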
I'd also like to be able to use the zero value. For example, `Foo* | None`, where `None` is a singleton, is how you represent the equivalent of a nullable type in Pony. Pony will use the address of the `None` singleton, but this requires every pattern match to be a register-register compare and requires additional instructions to materialise the address of the `None` singleton (on CHERI, that will require the loader to construct a capability and will then be the equivalent of a GOT-relative access). In contrast, comparison to 0 is a single instruction with a single data dependency on most architectures (even RISC-V has a single-instruction branch if [not] equal to zero) because C functions with `if (x == NULL)` guards are one of the most common data-dependent branch structures. So, for example, `Foo* | Bar* | None` would ideally be represented as: 0 indicates `None`; low bit 0 but some of the high bits non-zero indicates that the remaining bits are a pointer to `Foo`; and low bit 1 indicates the remaining bits are a pointer to `Bar`.
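A minimal sketch of that `Foo* | Bar* | None` encoding (hypothetical helper names; it assumes both pointee types are at least 2-byte aligned so the low pointer bit is free):

```cpp
#include <cstdint>

struct Foo;
struct Bar;

// 0 -> None; low bit 0 and non-zero -> Foo*; low bit 1 -> Bar*.
using FooBarNone = uintptr_t;

FooBarNone pack_none()      { return 0; }
FooBarNone pack_foo(Foo* p) { return reinterpret_cast<uintptr_t>(p); }
FooBarNone pack_bar(Bar* p) { return reinterpret_cast<uintptr_t>(p) | 1u; }

bool is_none(FooBarNone v) { return v == 0; }  // single compare-to-zero
bool is_bar(FooBarNone v)  { return (v & 1u) != 0; }
Foo* unpack_foo(FooBarNone v) { return reinterpret_cast<Foo*>(v); }
Bar* unpack_bar(FooBarNone v) { return reinterpret_cast<Bar*>(v & ~uintptr_t{1}); }
```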
Ideally, I'd like to lower this to the target ABI as late as possible, because we're going to end up with a load of bit twiddling that is likely to make analyses harder.
I'm not sure what an intersection type on concrete types would be. `I32 & F64`, for example, should give the empty set.
In terms of constants in the source, I think we probably need a longer discussion. I would like for these to be treated as a distinct family of types that can be used only for initialisation. For example, 42 can be used to initialise an `I64` or a `U8` (or an `F32` or whatever) but it is not that type until after type inference.
String literals are more important to get right here. Baking a specific string type into the language is usually a mistake. I would like a compile-time iterable sequence of UTF-8 characters (probably) that can be used to initialise a standard-library string type, but which must be explicitly converted when used with something that expects a concrete type; that's a longer discussion, though. In the standard library, given our intended set of uses, we are going to want a lightweight ASCII-and-no-complex-collation string and a rich unicode string, with the default being selected per project.
> For union types, we wanted to explicitly not make those part of the public ABI, to enable dense tagged unions. (...) I'd also like to be able to use the zero value.
These make sense to me.
> Ideally, I'd like to lower this to the target ABI as late as possible, because we're going to end up with a load of bit twiddling that is likely to make analyses harder.
How late?
We can:
1. Do everything we need to in MLIR before we lower to LLVM and let the compiler deal with the IR.
2. Keep some pieces in the Verona dialect (type handling, for ex.) and lower directly to LLVM IR with special treatment.
3. Add late passes to LLVM to do that lowering at the LLVM IR level.

(1) is still pretty high level, while (3) will add complexities we may not want or need. (2) may be about "just right".
> I'm not sure what an intersection type on concrete types would be. `I32 & F64`, for example, should give the empty set.
In that sense, an object with that type would be unassignable and useless, so maybe we should forbid it in the semantics and add it to the list of things the type checker rejects?
> In terms of constants in the source, I think we probably need a longer discussion. I would like for these to be treated as a distinct family of types that can be used only for initialisation. For example, 42 can be used to initialise an `I64` or a `U8` (or an `F32` or whatever) but it is not that type until after type inference.
@sylvanc was discussing implementing constants like structures with special behaviour. IIUC, `Int` could have a `getU8`, `getU16`, ... `getS128`, and be used accordingly. `Float` would have a similar structure. `Octal` and `Hex` would probably have to have both functionalities, and maybe even addresses.
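A rough sketch of what that could look like, in C++ terms (`IntLiteral` and its getters are hypothetical names; a real design would need 128-bit storage and compile-time range diagnostics rather than the exceptions used here for brevity):

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical literal structure: the constant keeps its full value and
// only commits to a concrete width when a getter is called.
struct IntLiteral {
  uint64_t magnitude;      // a real implementation would need 128 bits
  bool negative = false;

  uint8_t getU8() const {
    if (negative || magnitude > std::numeric_limits<uint8_t>::max())
      throw std::range_error("literal does not fit in U8");
    return static_cast<uint8_t>(magnitude);
  }
  uint16_t getU16() const {
    if (negative || magnitude > std::numeric_limits<uint16_t>::max())
      throw std::range_error("literal does not fit in U16");
    return static_cast<uint16_t>(magnitude);
  }
  // ... getU32/getU64 and the signed getS8 ... getS128 follow the same shape.
};
```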
> String literals are more important to get right here. Baking a specific string type into the language is usually a mistake. I would like a compile-time iterable sequence of UTF-8 characters (probably) that can be used to initialise a standard-library string type, but which must be explicitly converted when used with something that expects a concrete type; that's a longer discussion, though. In the standard library, given our intended set of uses, we are going to want a lightweight ASCII-and-no-complex-collation string and a rich unicode string, with the default being selected per project.
I agree strings should be implemented in the library, mostly because getting them right is really hard, and baking that into the compiler would be a nightmare. But we still need a "compile-time character sequence" of some sort inside the compiler. That's the one I was referring to.
Do we want literal strings to be a type on its own, independent of character types? Or do we want them to be a list of characters?
What would be the impact, in real terms, of not having an ASCII string representation and going straight to UTF-8 for everything?
Most performance-critical programs don't need to output strings in hot code, and most programs that do need to output strings in hot code produce them for user consumption (e.g. PHP), where the strings then need to be wide.
> How late?
I'm not sure.
> We can:
> 1. Do everything we need to in MLIR before we lower to LLVM and let the compiler deal with the IR.
> 2. Keep some pieces in the Verona dialect (type handling, for ex.) and lower directly to LLVM IR with special treatment.
> 3. Add late passes to LLVM to do that lowering at the LLVM IR level.
>
> (1) is still pretty high level, while (3) will add complexities we may not want or need. (2) may be about "just right".
I think 3 will probably cause pain, but either 1 or 2 sounds plausible.
> Do we want literal strings to be a type on its own, independent of character types? Or do we want them to be a list of characters?
My feeling is that `"foo"` is a thing that has a length and iterators that give you characters and can be used to construct a string type.
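In C++ terms, a minimal sketch of such a literal type might look like this (`StrLiteral` and `make_string` are hypothetical names; the point is that the literal has a length and iteration but is not itself the string type):

```cpp
#include <cstddef>
#include <string>

// Hypothetical "compile-time character sequence": a length plus iterators
// over characters; concrete string types decide how to consume it.
struct StrLiteral {
  const char* data;
  std::size_t len;

  constexpr const char* begin() const { return data; }
  constexpr const char* end() const { return data + len; }
  constexpr std::size_t length() const { return len; }
};

// A library string type constructs itself explicitly from the literal.
std::string make_string(StrLiteral lit) {
  return std::string(lit.begin(), lit.end());
}
```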
> What would be the impact, in real terms, of not having an ASCII string representation and going straight to UTF-8 for everything?
Everything that deals with strings needs to be able to parse UTF-8 and deal with variable-width characters. For kernel code, that's a lot of complexity in anything that handles strings, which should be delegated to userspace code.
Additionally, baking UTF-8 in as the default will annoy around 2/3 of the world's population. UTF-8 is both less space-efficient and less time-efficient than UTF-16 for most non-alphabetic languages. For the rich standard library string, we want an interface that understands unicode code points and glyphs but concrete implementations that can handle UTF-8 or UTF-16 (or UTF-32) code.
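To make the variable-width point concrete, here is a minimal, illustrative UTF-8 code-point decoder (not production-grade: a real one must also reject overlong encodings and surrogates); kernel-style string handling would carry this kind of logic everywhere:

```cpp
#include <cstddef>
#include <cstdint>

// Decode one UTF-8 code point starting at s[0] into cp; returns the number
// of bytes consumed, or 0 on invalid input.
std::size_t decode_utf8(const unsigned char* s, std::size_t len, uint32_t& cp) {
  if (len == 0) return 0;
  if (s[0] < 0x80) { cp = s[0]; return 1; }            // 1-byte (ASCII)
  std::size_t n = (s[0] & 0xE0) == 0xC0 ? 2
                : (s[0] & 0xF0) == 0xE0 ? 3
                : (s[0] & 0xF8) == 0xF0 ? 4 : 0;
  if (n == 0 || len < n) return 0;
  cp = s[0] & (0x7F >> n);                             // leading-byte payload
  for (std::size_t i = 1; i < n; ++i) {
    if ((s[i] & 0xC0) != 0x80) return 0;               // bad continuation byte
    cp = (cp << 6) | (s[i] & 0x3F);
  }
  return n;
}
```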