docs: ADR for string as alias or type #3

Open
wants to merge 4 commits into
base: main
@@ -1,4 +1,4 @@
# Architecture Decision Record - Primitive bytes and strings
# Architecture Decision Record - Primitive bytes

- **Status**: Draft
- **Owner:** Tristan Menzel
@@ -18,8 +18,7 @@ Algorand Python has specific [Bytes and String types](https://algorandfoundation

## Requirements

- Support bytes AVM type and a string type that supports ASCII UTF-8 strings
- Use idiomatic TypeScript expressions for string expressions
- Support bytes AVM type
- Semantic compatibility between AVM execution and TypeScript execution (e.g. in unit tests)

## Principles
@@ -50,7 +49,7 @@ const b3 = b1 + b1

Whilst binary data is often a representation of a utf-8 string, it is not always - so direct use of the string type is not a natural fit. It doesn't allow us to represent alternative encodings (b16/b64) and the existing api surface is very 'string' centric. Much of the api would also be expensive to implement on the AVM leading to a bunch of 'dead' methods hanging off the type (or a significant amount of work implementing all the methods). The signatures of these methods also use `number` which is [not a semantically relevant type](./2024-05-21_primitive-integer-types.md).

Achieving semantic compatibility with ECMAScript's `String` type would also be very expensive as it uses utf-16 encoding underneath whilst an ABI string is utf-8 encoded. A significant number of ops (and program size) would be required to convert between the two. If we were to ignore this and use utf-8 at runtime, apis such as `.length` would return different results. For example `"😄".length` in ES returns `2` whilst utf-8 encoding would yield `1` codepoint or `4` bytes; similarly, indexing and slicing would yield different results.
Achieving semantic compatibility with ECMAScript's `String` type would also be very expensive as it uses utf-16 encoding underneath whilst an ABI string is utf-8 encoded. A significant number of ops (and program size) would be required to convert between the two. If we were to ignore this and use utf-8 at runtime, apis such as `.length` would return different results. For example `"😄".length` in ES returns `2` whilst utf-8 encoding would yield `1` codepoint or `4` bytes; similarly, indexing and slicing would yield different results. We would also need a way to specify non-utf-8 bytes values, e.g. from base16 or base64.
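
The mismatch described above can be checked in any ECMAScript runtime. The following is a minimal sketch that uses the standard `TextEncoder` API to obtain the utf-8 bytes; the variable names are illustrative only.

```ts
const smiley = "😄"

// ECMAScript strings are sequences of UTF-16 code units; U+1F604 needs a surrogate pair.
const utf16CodeUnits = smiley.length

// Iterating a string yields whole code points rather than code units.
const codePoints = Array.from(smiley).length

// TextEncoder always produces UTF-8, which is how an ABI string is encoded.
const utf8Bytes = new TextEncoder().encode(smiley).length

console.log(utf16CodeUnits, codePoints, utf8Bytes) // 2 1 4

// Indexing works on code units, so it returns half of the surrogate pair.
const firstUnit = smiley[0]
console.log(firstUnit === smiley) // false
```

Any API that counts or indexes (`length`, `slice`, `[]`) would therefore have to pick one of these three units, and only the first matches what ECMAScript developers already expect from `string`.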

The Uint8Array type is fit for purpose as an encoding mechanism but the API is not as friendly as it could be for writing declarative contracts. The `new` keyword feels unnatural for something that is ostensibly a primitive type. The fact that it is mutable also complicates the implementation the compiler produces for the AVM.

@@ -72,13 +71,12 @@ To differentiate between ABI `string` and AVM `byteslice`, a branded type, `byte

Additional functions can be used when wanting to have string literals of a specific encoding represent a string or byteslice.


The downsides of using `string` are listed in Option 1.


### Option 3 - Define a class to represent Bytes

A `Bytes` class and `Str` (Name TBD) class are defined with a very specific API tailored to operations which are available on the AVM:
A `Bytes` class is defined with a very specific API tailored to operations which are available on the AVM:

```ts
class Bytes {
@@ -93,13 +91,19 @@ class Bytes {
  at(x: uint64): Bytes {
    return new Bytes(this.v[x])
  }

  static fromHex(v: string): Bytes {

  }

  static fromBase64(v: string): Bytes {

  }


  /* etc */
}

class Str {
  /* implementation */
}

```

@@ -108,7 +112,6 @@ This solution provides great type safety and requires no transpilation to run _c
```ts
const a = new Bytes("Hello")
const b = new Bytes("World")
const c = new Str("Example string")
const ab = a.concat(b)

function testValue(x: Bytes) {
```
@@ -127,7 +130,7 @@ To have equality checks behave as expected we would need a transpilation step to

### Option 4 - Implement bytes as a class but define it as a type + factory

We can iron out some of the rough edges of using a class by only exposing a factory method for `Bytes`/`Str` and a resulting type `bytes`/`str`. This removes the need for the `new` keyword and lets us use a 'primitive looking' type alias (`bytes` versus `Bytes`, `str` versus `Str` - much like `string` and `String`). We can use [tagged templates](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#tagged_templates) to improve the user experience of multipart concat expressions in lieu of having the `+` operator.
We can iron out some of the rough edges of using a class by only exposing a factory method for `Bytes` and a resulting type `bytes`. This removes the need for the `new` keyword and lets us use a 'primitive looking' type alias (`bytes` versus `Bytes` - much like `string` and `String`). We can use [tagged templates](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#tagged_templates) to improve the user experience of multipart concat expressions in lieu of having the `+` operator.

```ts

@@ -156,13 +159,10 @@ function testValue(x: bytes, y: bytes): bytes {
  return Bytes`${x} and ${y}`
}

const f = Str`Example string`

```

Whilst we still can't accept string literals on their own, the tagged template is almost as concise.

Having `bytes` and `str` behave like a primitive value type (value equality) whilst not _actually_ being a primitive is not strictly semantically compatible with ECMAScript. However, the lowercase type names (plus a factory with no `new` keyword) communicate the intention of them being primitive value types, and there is an existing precedent for introducing new value types to the language in a similar pattern (`bigint` and `BigInt`). Essentially, if ECMAScript were to have a primitive bytes type, this is most likely what it would look like.
Having `bytes` behave like a primitive value type (value equality) whilst not _actually_ being a primitive is not strictly semantically compatible with ECMAScript. However, the lowercase type name (plus a factory with no `new` keyword) communicates the intention of it being a primitive value type, and there is an existing precedent for introducing new value types to the language in a similar pattern (`bigint` and `BigInt`). Essentially, if ECMAScript were to have a primitive bytes type, this is most likely what it would look like.
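
To illustrate why value equality is not free for a class-backed type, here is a minimal sketch. `BytesSketch` is a hypothetical stand-in invented for this example and is not the proposed implementation; it only shows how `===` behaves on objects in plain ECMAScript compared with an explicit `equals` method.

```ts
// Hypothetical stand-in for a class-backed bytes value; not the actual library code.
class BytesSketch {
  constructor(readonly value: Uint8Array) {}

  // Value equality: same length and the same byte at every position.
  equals(other: BytesSketch): boolean {
    return (
      this.value.length === other.value.length &&
      this.value.every((byte, i) => byte === other.value[i])
    )
  }
}

const encoder = new TextEncoder()
const first = new BytesSketch(encoder.encode("Hello"))
const second = new BytesSketch(encoder.encode("Hello"))

// `===` on objects compares references, so equal contents are still "not equal".
const sameReference = first === second

// An explicit method (or a transpilation step rewriting `===`) is needed for value equality.
const sameValue = first.equals(second)

console.log(sameReference, sameValue) // false true
```

This is the gap a transpilation step would need to close if `===` on `bytes` is expected to compare contents when contract code runs outside the AVM.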

## Preferred option

@@ -172,25 +172,6 @@ Option 1 and 2 are not preferred as they make maintaining semantic compatability

Option 4 gives us the most natural feeling api whilst still giving us full control over the api surface. It doesn't support the `+` operator, but supports interpolation and `.concat` which gives us most of what `+` provides other than augmented assignment (ie. `+=`).

We should select an appropriate name for the type representing an AVM string. It should not conflict with the semantically incompatible EcmaScript type `string`.
- `str`/`Str`:
  - ✅ Short
  - ✅ obvious what it is
  - ✅ obvious equivalent in ABI types
  - ❌ NOT obvious how it differs from EcmaScript `string`
- `utf8`/`Utf8`:
  - ✅ Short
  - ✅ reasonably obvious what it is
  - 🤔 less obvious equivalent in ABI types
  - ✅ obvious how it differs to `string`
- `utf8string`/`Utf8String`
  - ❌ Verbose
  - ✅ obvious equivalent in ABI types
  - ✅ very obvious what it is
  - ✅ obvious how it differs to `string`



## Selected option

Option 4 has been selected as the best option
115 changes: 115 additions & 0 deletions docs/architecture-decisions/2024-06-05_string-alias-or-type.md
@@ -0,0 +1,115 @@
# Architecture Decision Record - String as alias or type

- **Status**: Draft
- **Owner:** Tristan Menzel
- **Deciders**: Alessandro Cappellato (Algorand Foundation), Bruno Martins (Algorand Foundation), Rob Moore (MakerX)
- **Date created**: 2024-06-05
- **Date decided**: N/A
- **Date updated**: 2024-06-05

## Context

A byte array (`byte[]`) on the AVM can be one of three things:
- A big-endian number larger than uint64 (which we represent with biguint)
- The bytes of a utf8 encoded string
- General binary data (e.g. a hash or an address) which may not be valid if interpreted as utf8 code points

ARC4 specifies `byte` as just an alias for `uint8` and `string` as an alias for `byte[]`. These aliases have no effect on the encoding of data but do communicate intent to consumers of the contract. Existing client generators expose an argument aliased with `string` as a string type native to the client platform (`string` for JavaScript and `str` for Python), whereas an argument defined as `byte[]` will be exposed as the most appropriate binary type for the client platform (`Uint8Array` for JavaScript and `bytes` for Python). Although documented as just an alias, generated clients consider `string` and `byte[]` to be distinct types.

Algorand Python uses `algopy.Bytes` to represent a byte array and `algopy.String` to represent a utf8 string. The string type has a `.bytes` property to retrieve the underlying byte array and a byte array can be re-interpreted as a string with `String.from_bytes(...)`. In addition to these types, Algorand Python also has arc4 encoded equivalents `algopy.arc4.DynamicBytes` and `algopy.arc4.String` which represent data encoded as per the arc4 spec.

The purpose of this ADR is to decide how strings are represented in Algorand TS.


## Requirements

- It must be possible to indicate that an argument or return value of an arc4 method is a utf8 encoded string (i.e. not just any array of bytes)

## Principles

- **[AlgoKit Guiding Principles](https://github.com/algorandfoundation/algokit-cli/blob/main/docs/algokit.md#guiding-principles)** - specifically Seamless onramp, Leverage existing ecosystem, Meet devs where they are
- **[Algorand Python Principles](https://algorandfoundation.github.io/puya/principles.html#principles)**
- **[Algorand TypeScript Guiding Principles](../README.md#guiding-principals)**

## Options

### Option 1 - Alias the bytes type

Introduce an alias for the existing bytes type. The alias is nothing but an alternative name and the types of `str` and `bytes` are interchangeable. This is loosely the approach taken by TealScript currently.

```ts
type str = bytes

function myFunction(value: str): bytes {
  return value // this is fine because str and bytes are the same thing
}

```

Pros:
- A value of type `str` will be assignable/comparable to any property/parameter of type `bytes` (eg. asset.name)

Cons:
- It would not be possible to evolve the api of a `str` independently from that of `bytes`. (eg. ability to index/slice/iterate chars instead of bytes)
- No semantic separation of data that is 'a utf8 encoded string' and data that is general binary
- Type aliases are not typically semantically significant in TypeScript; aliased symbols are followed back to the declaring type. Supporting this option would mean adding an exception to this behaviour.
  - For example: Adding `type Banana = uint64` and then using `Banana` everywhere instead of `uint64` should not affect the compiler output or type checking.
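
The alias transparency mentioned in the last point can be seen in a small sketch. The `uint64` declaration below is a stand-in (modelled as `number` purely so the snippet is self-contained) and is an assumption of this example, not the real Algorand TS type.

```ts
// Stand-in declaration so the sketch is self-contained; not the real Algorand TS type.
type uint64 = number

// A plain alias: TypeScript resolves Banana straight back to uint64.
type Banana = uint64

function double(value: uint64): uint64 {
  return value * 2
}

const bananas: Banana = 21

// Accepted without any conversion, because Banana and uint64 are the same type.
const doubled: Banana = double(bananas)

// At runtime there is no trace of the alias at all.
console.log(doubled, typeof doubled) // 42 number
```

Treating `str` differently from `bytes` while declaring it as `type str = bytes` would require the compiler to attach meaning to a name that the TypeScript type checker itself discards.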

### Option 2 - str is its own type

Introduce a new type not directly interchangeable with bytes.

```ts
export type str = {
  readonly bytes: bytes

  startsWith(searchString: StringCompat): boolean
  endsWith(searchString: StringCompat): boolean
}

function myFunction(value: str): bytes {
  return value.bytes // Use bytes property to access underlying byte array
}
```

Pros:
- Comparisons/assignments between `bytes` and `str` encourage deliberate consideration of the implications (eg. I can't accidentally return the result of a sha_256 call from a method that should return a utf8 compliant `str`. I would need to consciously use `Str.fromBytes(...)`)
- Option to evolve `str` api as required (eg. add char indexing etc, or even validation of utf8 chars)

Cons:
- Have to add `.bytes` when assigning/comparing to `bytes`, or use `Str.fromBytes(...)` when converting a byte array to a string
- The `bytes` and `str` types might look the same for now, which raises the question of what the point of a separate type is, and we may _never_ add additional string specific functionality.
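
As a rough illustration of the trade-off above, the sketch below models a distinct string type whose only entry point from raw bytes is an explicit factory. `StrSketch`, `fromLiteral`, and the `Uint8Array`-backed `bytes` alias are hypothetical names and choices made for this example; they are not part of any proposed API.

```ts
// Hypothetical sketch: `bytes` is modelled as Uint8Array and StrSketch is an invented name.
type bytes = Uint8Array

class StrSketch {
  private constructor(readonly bytes: bytes) {}

  // The only way in from raw bytes, so every conversion is a visible, deliberate call.
  static fromBytes(value: bytes): StrSketch {
    return new StrSketch(value)
  }

  static fromLiteral(value: string): StrSketch {
    return new StrSketch(new TextEncoder().encode(value))
  }

  startsWith(prefix: StrSketch): boolean {
    if (prefix.bytes.length > this.bytes.length) return false
    return prefix.bytes.every((byte, i) => byte === this.bytes[i])
  }
}

function rawName(value: StrSketch): bytes {
  return value.bytes // explicit step from string to byte array
}

const greeting = StrSketch.fromLiteral("Hello World")
const startsWithHello = greeting.startsWith(StrSketch.fromLiteral("Hello"))
const byteCount = rawName(greeting).length

console.log(startsWithHello, byteCount) // true 11

// const wrong: StrSketch = new Uint8Array(32)
// ^ rejected by the type checker: a byte array has no `bytes` property or `startsWith` method
```

The extra `.bytes` and `fromBytes` calls are exactly the verbosity listed under Cons; in exchange, a hash or other arbitrary binary value cannot be returned as a string by accident.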

### Option 3 - native string

Use native javascript strings for string values with the option to go from bytes to string via `.toString()` and string to bytes via `Bytes(yourString)`. Native APIs which expect bytes can take a union of `string | bytes` to cut back on boilerplate conversions. Explicit conversion would be required when comparing `bytes` to `string`. We would not touch `String.prototype` and using most methods on the `string` type will be a compiler error. If we need to add more string utility methods at a later date, these would be static methods (eg. `someUtility(myString, arg1, arg2)`) rather than instance methods added to the prototype.

```ts

function myFunction(value: string, other: bytes) {
  value.startsWith("") // allowed
  value[4] // not allowed (can't index by char)
  value.slice(3, 4) // not allowed
  log(value)
  log(`Interpolated ${value}`)
  return value === other.toString() && Bytes(value) === other && other.equals(value)
}

```

Pros:
- Meets developer expectations - a string is a string
- Can use literals
- Can use `+` and `+=`

Cons:
- Some operations will feel more verbose eg. getting the byte length will require `Bytes(value).length` since `value.length` should return the number of utf-16 code units to be semantically correct.
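
The length difference noted above can be sketched with a free function in the "static utility" style this option describes. `utf8ByteLength` is a hypothetical helper standing in for `Bytes(value).length`; it relies on the standard `TextEncoder` API, and the string contents are written with escapes so the counts do not depend on how the source file is normalised.

```ts
// Hypothetical stand-in for `Bytes(value).length`; TextEncoder produces the UTF-8 bytes.
function utf8ByteLength(value: string): number {
  return new TextEncoder().encode(value).length
}

// Native strings keep `+` and `+=`.
let label = "caf\u00e9" // "café" with a precomposed é
label += " \u2615"      // a space followed by the hot beverage symbol

// `.length` counts UTF-16 code units, as ECMAScript semantics require.
const codeUnits = label.length

// The byte length has to be requested explicitly.
const byteLength = utf8ByteLength(label)

console.log(codeUnits, byteLength) // 6 9
```

Keeping such helpers as plain functions rather than prototype additions leaves `String.prototype` untouched, in line with the constraint stated for this option.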


## Preferred options

TBD

## Selected option

TBD