CREXX

REXX Language implementation

View the Project on GitHub adesutherland/CREXX

cREXX Assembler Architecture (rxas)

The rxas assembler is responsible for translating human-readable Intermediate Representation (IR) assembly (.rxas) into packed executable bytecode (.rxbin) consumed by the rxvm interpreter.

1. Assembler Pipeline

The assembler processes source files through a pipelined, pseudo-two-pass architecture:

  1. Lexical Analysis (re2c):
    • Source defined in assembler/rxasscan.re.
    • Tokenizes input into REXX Assembly primitives: registers (RREG, GREG, AREG), literal types (STRING, INT, FLOAT, DECIMAL, HEX), symbols (ID, LABEL, FUNC), and assembler directives (e.g., .locals, .globals, .expose, .meta).
    • The lexer also recognizes the interface metadata keywords used at the end of .meta records: .interface, .implements, and .member.
  2. Parsing (Lemon):
    • Grammar defined in assembler/rxasgrmr.y.
    • Enforces the structural integrity of the .rxas file (headers, function definitions, variable declarations, and instruction sequences).
    • The parser actions invoke Builder API functions directly (e.g., rxasgen*, rxaslabl, rxasproc), translating syntax rules into buffered internal data structures.
    • Instruction names are derived from binutils/include/rxops.h, so new VM opcodes become assembler mnemonics through that shared table. Recent core runtime additions include the object/interface instructions and the sock* TCP socket instructions.
  3. In-Memory Buffering & Constant Pooling:
    • Handled in assembler/rxasassm.c.
    • Instructions and operand slots are appended sequentially into a dynamic array of bin_code elements.
    • Complex and pooled literal types (Strings, Floats, Decimals, Procedure Headers, Metadata) are deduplicated via AVL Trees and injected into a variable-length Constant Pool.
  4. Backpatching & Optimization (Second Pass):
    • Forward references (branches to undefined labels, calls to undefined procedures) are logged as a linked list of references.
    • At the end of the parse, backptch() traverses all trees, resolves symbol addresses, updates the binary stream, and applies peephole jump optimizations.
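The constant-pooling step (stage 3 above) can be sketched as follows. This is a minimal illustration, not the code in rxasassm.c: it uses a plain unbalanced binary search tree where the assembler uses AVL trees, and every name prefixed demo_ is hypothetical. Each distinct string is assigned a pool offset once, and a repeated string gets the same offset back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tree node: maps a string to its offset in the pool. */
struct demo_node {
    const char *text;            /* key; the caller keeps the storage alive */
    size_t pool_offset;          /* where the string would live in the pool */
    struct demo_node *left, *right;
};

struct demo_pool {
    struct demo_node *root;      /* unbalanced BST standing in for an AVL tree */
    size_t next_offset;          /* next free byte in the pool */
};

/* Return the pool offset for text, adding it only if it is not present.
   Returns -1 if memory runs out. */
long demo_intern(struct demo_pool *pool, const char *text) {
    struct demo_node **slot = &pool->root;
    assert(text != NULL);
    while (*slot != NULL) {
        int cmp = strcmp(text, (*slot)->text);
        if (cmp == 0) return (long)(*slot)->pool_offset;   /* deduplicated */
        slot = cmp < 0 ? &(*slot)->left : &(*slot)->right;
    }
    struct demo_node *node = malloc(sizeof *node);
    if (node == NULL) return -1;
    node->text = text;
    node->pool_offset = pool->next_offset;
    node->left = node->right = NULL;
    pool->next_offset += strlen(text) + 1;   /* string plus terminator */
    *slot = node;
    return (long)node->pool_offset;
}

/* Release every node of the tree. */
void demo_free(struct demo_node *node) {
    if (node == NULL) return;
    demo_free(node->left);
    demo_free(node->right);
    free(node);
}
```

Interning "hello", "world", and "hello" again in that order consumes pool space only twice; the third call finds the existing node and returns its offset.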

2. Core Internal Data Structures

The state of the assembler is held within the Assembler_Context struct, specifically within the context->binary object, which mirrors the layout of the final .rxbin file.

Instruction Stream

Instructions are flattened into an array of unions called bin_code (defined in binutils/include/rxdefs.h). An instruction is represented by an opcode element, immediately followed by elements representing its operands.

// Opcode element, as written in gen_instr(): both fields go into the same slot,
// so inst_size is incremented only on the second assignment.
context->binary.binary[context->binary.inst_size].instruction.opcode = opcode;
context->binary.binary[context->binary.inst_size++].instruction.no_ops = operands;

// Operand elements: each operand occupies one further slot
context->binary.binary[context->binary.inst_size++].index = get_reg_number(...);
context->binary.binary[context->binary.inst_size++].iconst = token->integer;
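A self-contained sketch of the same layout may help. The union below is a hypothetical stand-in for bin_code (the real definition in rxdefs.h may carry more members and different types), and it uses a fixed-capacity array where the assembler uses a dynamic one. It shows one opcode element followed by one element per operand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for bin_code. */
union demo_code {
    struct { int opcode; int no_ops; } instruction;  /* opcode element */
    int index;                                       /* register operand */
    long iconst;                                     /* integer operand */
};

#define DEMO_CAPACITY 64

struct demo_stream {
    union demo_code slots[DEMO_CAPACITY];
    size_t inst_size;            /* number of slots in use */
};

/* Append the opcode element and return the slot it was written to. */
size_t demo_emit_opcode(struct demo_stream *s, int opcode, int no_ops) {
    assert(s->inst_size < DEMO_CAPACITY);
    s->slots[s->inst_size].instruction.opcode = opcode;
    s->slots[s->inst_size].instruction.no_ops = no_ops;
    return s->inst_size++;
}

/* Append a register-number operand. */
void demo_emit_reg(struct demo_stream *s, int reg) {
    assert(s->inst_size < DEMO_CAPACITY);
    s->slots[s->inst_size++].index = reg;
}

/* Append an integer-literal operand. */
void demo_emit_int(struct demo_stream *s, long value) {
    assert(s->inst_size < DEMO_CAPACITY);
    s->slots[s->inst_size++].iconst = value;
}
```

Emitting an opcode with two operands therefore advances inst_size by three slots.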

Constant Pool

Strings, pooled float literals, procedure mappings, debug metadata, and exported symbols are packed into const_pool. This is a sequential buffer of dynamically sized records. Every record starts with a chameleon_constant header dictating its type and byte size. Types include: STRING_CONST, FLOAT_CONST, PROC_CONST, EXPOSE_REG_CONST, EXPOSE_PROC_CONST, META_FUNC, META_INLINE, META_REG, etc.

The serialized expose_head chain includes both EXPOSE_REG_CONST and EXPOSE_PROC_CONST records. Runtime linking and other module-local walkers now rely on that chain instead of scanning the whole constant pool.
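The difference between scanning the whole pool and following the expose chain can be sketched as below. The header layout is an assumption made for illustration: the real chameleon_constant header and the expose records may place the size, type, and link fields differently, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_type { DEMO_STRING_CONST = 1, DEMO_EXPOSE_PROC_CONST = 2 };

/* Hypothetical record header; every record starts with one. */
struct demo_header {
    size_t size;   /* total record size in bytes, header included */
    int type;      /* one of enum demo_type */
    long next;     /* offset of the next expose record, or -1 */
};

struct demo_pool {
    unsigned char bytes[256];
    size_t used;
    long expose_head;   /* first expose record, or -1 */
    long expose_tail;   /* last expose record, or -1 */
};

void demo_pool_init(struct demo_pool *p) {
    memset(p, 0, sizeof *p);
    p->expose_head = p->expose_tail = -1;
}

/* Append a record; expose records are also linked into the chain.
   Returns the record offset, or -1 if the pool is full. */
long demo_append(struct demo_pool *p, int type, const char *payload) {
    struct demo_header h;
    size_t len = strlen(payload) + 1;
    memset(&h, 0, sizeof h);
    h.size = sizeof h + len;
    h.type = type;
    h.next = -1;
    if (p->used + h.size > sizeof p->bytes) return -1;
    long offset = (long)p->used;
    memcpy(p->bytes + p->used, &h, sizeof h);
    memcpy(p->bytes + p->used + sizeof h, payload, len);
    p->used += h.size;
    if (type == DEMO_EXPOSE_PROC_CONST) {
        if (p->expose_tail < 0) {
            p->expose_head = offset;
        } else {                       /* patch the previous tail's link */
            struct demo_header prev;
            memcpy(&prev, p->bytes + p->expose_tail, sizeof prev);
            prev.next = offset;
            memcpy(p->bytes + p->expose_tail, &prev, sizeof prev);
        }
        p->expose_tail = offset;
    }
    return offset;
}

/* Full scan: step from header to header using the size field. */
size_t demo_count_all(const struct demo_pool *p) {
    size_t n = 0, at = 0;
    while (at < p->used) {
        struct demo_header h;
        memcpy(&h, p->bytes + at, sizeof h);
        assert(h.size >= sizeof h);
        at += h.size;
        n++;
    }
    return n;
}

/* Chain walk: visit only the expose records. */
size_t demo_count_exposed(const struct demo_pool *p) {
    size_t n = 0;
    long at = p->expose_head;
    while (at >= 0) {
        struct demo_header h;
        memcpy(&h, p->bytes + at, sizeof h);
        at = h.next;
        n++;
    }
    return n;
}
```

In this sketch the chain walk touches only as many records as there are exposed symbols, whereas the full scan has to visit every record in the pool.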

The interface/callable-contract work extends that same metadata path rather than introducing a second binary header mechanism. In addition to META_CLASS and META_ATTR, the assembler now serializes:

Cross-file compiler inlining also uses the metadata path. Callable signatures remain in META_FUNC; inline-body templates are carried separately in META_INLINE, emitted in RXAS as:

.meta "fully.qualified.callable"=".inline" "I4;..."

The I4 payload is the compiler-owned inline transport described in compiler/docs/inlining_design.md. rxas stores it as META_INLINE, and rxdas must emit it back to the same logical .meta ... ".inline" "I4;..." spelling so source, RXAS, and binary import paths do not drift. Linked final images normally strip META_INLINE; library artifacts preserve it for downstream rxc optimisation.

Metadata-only modules are valid. For example, an interface contract file may compile to .rxas containing .meta records and no function bodies; rxas must still emit a .rxbin so import and runtime factory resolution can load the contract metadata.

For interface methods, the member-kind string now distinguishes:

Symbol Tracking (AVL Trees)

To deduplicate constants and resolve identifiers in O(log N) time, the assembler leverages a custom AVL tree implementation (avl_tree.h). Active trees include:

3. Two-Pass Resolution (Backpatching)

A jump or call can reference a label or procedure that is defined later in the file, so an operand naming such a symbol cannot be finalised while the file is still being parsed. rxasassm.c handles this by keeping a struct backpatching header for every symbol encountered:

struct backpatching {
    int defined;          // 1 if definition encountered, 0 if only referenced
    size_t index;         // Final resolved target index in the binary or const_pool
    struct backpatching_references *refs; // Linked list of unresolved usages
};

struct backpatching_references {
    size_t index;         // The instruction operand index in the bin_code array
    Assembler_Token *token; 
    struct backpatching_references *link;
};
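How such a header is used can be sketched with simplified types. This is an illustration under assumptions, not the rxasassm.c code: the token pointer is left out, the instruction stream is reduced to an array of size_t slots, and the demo_ names are hypothetical. Each use of a symbol records the slot that needs patching; once the symbol is defined, the recorded slots are overwritten with the resolved index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_ref {
    size_t slot;                 /* operand slot that needs patching */
    struct demo_ref *link;
};

struct demo_symbol {
    int defined;                 /* 1 once the definition has been seen */
    size_t index;                /* resolved target, valid when defined */
    struct demo_ref *refs;       /* pending usages */
};

/* Record a use of the symbol at the given slot; the slot holds a
   placeholder until backpatching. Returns 0 on success, -1 on failure. */
int demo_use(struct demo_symbol *sym, size_t *code, size_t slot) {
    struct demo_ref *ref = malloc(sizeof *ref);
    if (ref == NULL) return -1;
    ref->slot = slot;
    ref->link = sym->refs;
    sym->refs = ref;
    code[slot] = 0;              /* placeholder value */
    return 0;
}

/* Record the definition of the symbol. */
void demo_define(struct demo_symbol *sym, size_t target) {
    sym->defined = 1;
    sym->index = target;
}

/* Patch every recorded slot. Returns the number of slots patched,
   or -1 if the symbol was referenced but never defined. */
int demo_backpatch(struct demo_symbol *sym, size_t *code) {
    int patched = 0;
    assert(sym != NULL && code != NULL);
    if (!sym->defined) return -1;
    while (sym->refs != NULL) {
        struct demo_ref *ref = sym->refs;
        code[ref->slot] = sym->index;
        sym->refs = ref->link;
        free(ref);
        patched++;
    }
    return patched;
}
```

The -1 return corresponds to the error case in which a symbol is used but has no definition by the end of the input.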

When the end of the input is reached, backptch(Assembler_Context *context) runs the final resolution in three steps:

  1. optimise_labels(): Performs a peephole optimization on branch targets. If a label points at an unconditional branch (br), the label's target is recursively collapsed to that branch's final destination, so the VM takes one jump at run time instead of following a chain.
  2. backpatch_procedures(): Walks the proc_constants_tree, raises an error for any procedure that is referenced but never defined, and resolves the rest.
  3. backpatch_labels(): Walks the label_constants_tree and overwrites the placeholder bin_code slots recorded in each refs linked list with the actual instruction offsets.
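The label-collapsing idea in step 1 can be sketched as follows. The instruction representation and the demo_ names are hypothetical, and the sketch follows the chain iteratively rather than recursively; it also assumes a hop limit is an acceptable guard against a branch cycle, which the description above does not specify.

```c
#include <assert.h>
#include <stddef.h>

enum demo_op { DEMO_NOP, DEMO_BR, DEMO_RET };

struct demo_instr {
    enum demo_op op;
    size_t target;               /* used only when op is DEMO_BR */
};

/* Follow unconditional branches starting at target and return the
   final destination. The hop limit stops the loop on a cycle. */
size_t demo_collapse(const struct demo_instr *code, size_t count,
                     size_t target) {
    size_t hops = 0;
    assert(target < count);
    while (code[target].op == DEMO_BR && hops < count) {
        target = code[target].target;
        assert(target < count);
        hops++;
    }
    return target;
}
```

With a label at index 0 that branches to index 2, which in turn branches to index 4, the collapsed target is 4, so a jump to the label can be patched to go there directly.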

4. Binary Emission (.rxbin)

Once parsing and backpatching conclude without errors, the assembler flushes the Assembler_Context state into a packed .rxbin file.

The structural output consists of:

  1. Magic Header & Version: Validates the file format.
  2. Global Counters and Section Sizes: for example the number of global registers (context->binary.globals), together with expanded section sizes, stored section sizes, and section flags.
  3. The Constant Pool: The assembled context->binary.const_pool, optionally blob-compressed on disk if that reduces size.
  4. The Bytecode Stream: The resolved context->binary.binary instruction slots, optionally packed on disk as a logical opcode/operand token stream.
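The relationship between expanded size, stored size, and section flags can be sketched like this. The struct, the flag value, and the function are hypothetical and do not reproduce the .rxbin header; the sketch only shows the "store the packed form only if it is smaller" decision, with the packed size supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SECTION_PACKED 0x1u   /* hypothetical flag bit */

struct demo_section_info {
    size_t expanded_size;          /* size once loaded into memory */
    size_t stored_size;            /* size of the bytes written to disk */
    unsigned flags;                /* DEMO_SECTION_PACKED if packed */
};

/* Decide which form of a section to store, given the size of a
   candidate packed encoding. */
struct demo_section_info demo_choose_stored_form(size_t expanded_size,
                                                 size_t packed_size) {
    struct demo_section_info info;
    info.expanded_size = expanded_size;
    if (packed_size < expanded_size) {
        info.stored_size = packed_size;
        info.flags = DEMO_SECTION_PACKED;
    } else {
        info.stored_size = expanded_size;   /* packing did not help */
        info.flags = 0;
    }
    assert(info.stored_size <= info.expanded_size);
    return info;
}
```

A reader of such a header can tell from the flag whether the stored bytes must be unpacked, and from the expanded size how much memory the section needs afterwards.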

From format version 002 onward, float operands are no longer stored inline as raw double payloads in operand slots. Instead, the operand slot carries an index to a deduplicated FLOAT_CONST record in the constant pool.
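A minimal sketch of that indirection, under stated assumptions: the pool here is a small fixed array searched linearly, the returned index is an array position rather than a constant-pool offset, and values are compared by bit pattern so that 0.0 and -0.0 stay distinct. None of the demo_ names come from the cREXX sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FLOATS 16

struct demo_float_pool {
    double values[DEMO_MAX_FLOATS];
    size_t count;
};

/* Return the index of v in the pool, adding it if it is not present.
   Returns -1 if the pool is full. */
int demo_intern_float(struct demo_float_pool *pool, double v) {
    assert(pool->count <= DEMO_MAX_FLOATS);
    for (size_t i = 0; i < pool->count; i++) {
        /* bitwise comparison: identical encodings share one record */
        if (memcmp(&pool->values[i], &v, sizeof v) == 0) return (int)i;
    }
    if (pool->count == DEMO_MAX_FLOATS) return -1;
    pool->values[pool->count] = v;
    return (int)pool->count++;
}
```

The operand slot would then hold the returned index, and two instructions that use the same literal would refer to the same pooled record.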

As of the current 003 layout, rxas still builds the normal in-memory bin_code[] and raw constant pool first. The section compaction step happens when write_module() serializes the module:

5. Current Interface-Dispatch Additions

The current interface runtime slice relies on three assembler-visible opcodes and the metadata records above:

srcfproc now supports both the default * surface and named factory selectors. Provider selection is a VM concern: the assembler simply emits the opcode and the interface/class metadata needed for runtime lookup.

For call ... , rArgs, dcall, and srcfproc ... , rArgs, the trailing register operand names the argument-count register. The actual argument values are taken from the contiguous registers immediately after that count register. That hidden contiguous argument block is semantically part of the instruction use set, so optimiser passes must treat those opcodes as barriers unless they fully model the implicit register consumption.
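The hidden argument block can be made concrete with a small sketch. The register file is modelled as a plain array of long, and the function names are hypothetical. The sketch evaluates the count at run time; an optimiser working on the instruction stream generally does not know that value, which is consistent with treating these opcodes as barriers.

```c
#include <assert.h>
#include <stddef.h>

/* Value of argument i (0-based) for a call whose argument-count
   register is r_args: the arguments follow the count contiguously. */
long demo_arg(const long *regs, size_t nregs, size_t r_args, size_t i) {
    assert(r_args < nregs);
    assert(regs[r_args] >= 0);
    size_t count = (size_t)regs[r_args];
    assert(i < count && r_args + 1 + i < nregs);
    return regs[r_args + 1 + i];
}

/* Does the call implicitly read register r? True for the count
   register and for every register in the argument block after it. */
int demo_call_reads(const long *regs, size_t nregs, size_t r_args, size_t r) {
    assert(r_args < nregs);
    assert(regs[r_args] >= 0);
    size_t count = (size_t)regs[r_args];
    assert(r_args + count < nregs);
    return r >= r_args && r <= r_args + count;
}
```

With the count register at index 3 holding the value 2, the call reads registers 3, 4, and 5, even though only register 3 appears as an operand.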

The peephole optimiser also consumes instruction metadata from binutils/include/rxops.h. Instructions marked FLOW_JUMP, FLOW_COND, or FLOW_TERM automatically stop a NO_HAZARD rule from skipping over them. FLG_OPT_BARRIER covers FLOW_NEXT instructions that still must not be skipped, such as calls, signal handler configuration, explicit signal/check instructions, and opaque argument-block operations. FLG_IMPLICIT_REG_USE covers linear instructions whose register effects are not represented as normal operands. For example, inc0, dec0, inc1, and similar instructions still execute linearly, but the optimiser treats them as using the corresponding fixed local register when checking whether an intervening instruction is relevant to a rule.
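The skip decision described above can be sketched as a single predicate. The enum values, flag bits, and struct are hypothetical stand-ins for the metadata in rxops.h, and the sketch assumes a rule is concerned with one register at a time.

```c
#include <assert.h>
#include <stddef.h>

enum demo_flow { DEMO_FLOW_NEXT, DEMO_FLOW_JUMP, DEMO_FLOW_COND, DEMO_FLOW_TERM };

#define DEMO_FLG_OPT_BARRIER      0x1u
#define DEMO_FLG_IMPLICIT_REG_USE 0x2u

struct demo_opinfo {
    enum demo_flow flow;
    unsigned flags;
    int implicit_reg;   /* fixed local register, if IMPLICIT_REG_USE is set */
};

/* May a rule that cares about local register rule_reg skip over an
   intervening instruction described by op? Returns 1 if it may. */
int demo_can_skip(const struct demo_opinfo *op, int rule_reg) {
    assert(op != NULL);
    if (op->flow != DEMO_FLOW_NEXT) return 0;             /* jump, cond, term */
    if (op->flags & DEMO_FLG_OPT_BARRIER) return 0;       /* explicit barrier */
    if ((op->flags & DEMO_FLG_IMPLICIT_REG_USE) && op->implicit_reg == rule_reg)
        return 0;                                         /* hidden register use */
    return 1;
}
```

Under these assumptions an inc0-style instruction blocks a rule that involves register 0 but can be skipped by a rule that involves only register 3.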

Signal support adds action-aware and dynamic-name forms:

Unknown dynamic names raise INVALID_SIGNAL_CODE. Literal signal "NAME" forms are still assembled directly against the static signal table.

Optimiser rule operands use lowercase r for a captured register. Uppercase R, G, and A match literal local/global/argument register numbers. The assembler uses this to express rules such as inc r0 -> inc0 without adding mnemonic-specific C code to the optimiser engine.
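The distinction between a captured register and a literal register can be sketched as a matcher. The rule syntax itself is not reproduced here; patterns are given as structs, the demo_ names are hypothetical, and two behaviours are assumptions: a capture only accepts local registers, and a capture slot that is already bound must see the same register again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_pat_kind { DEMO_CAPTURE, DEMO_LITERAL };

struct demo_reg { char cls; int number; };   /* cls: 'r', 'g' or 'a' */

struct demo_pat {
    enum demo_pat_kind kind;
    char cls;      /* register class for a literal pattern */
    int number;    /* literal register number, or capture slot */
};

#define DEMO_SLOTS 4

struct demo_bindings {
    int bound[DEMO_SLOTS];
    struct demo_reg reg[DEMO_SLOTS];
};

void demo_bindings_reset(struct demo_bindings *b) {
    memset(b, 0, sizeof *b);
}

/* Match one operand against one pattern. Returns 1 on a match. */
int demo_match(const struct demo_pat *p, struct demo_reg actual,
               struct demo_bindings *b) {
    if (p->kind == DEMO_LITERAL)       /* uppercase form: exact register */
        return p->cls == actual.cls && p->number == actual.number;
    assert(p->number >= 0 && p->number < DEMO_SLOTS);
    if (actual.cls != 'r') return 0;   /* assumption: captures are local */
    if (!b->bound[p->number]) {        /* first use binds the slot */
        b->bound[p->number] = 1;
        b->reg[p->number] = actual;
        return 1;
    }
    return b->reg[p->number].number == actual.number;
}
```

A literal pattern for local register 0 is what lets a rule rewrite inc on that specific register to inc0 without any mnemonic-specific code in the matcher.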

typeof, istype, and asserttype are object-contract operations. Compiler-generated code uses them for object casts, tests, and introspection; scalar typeof/is cases are folded earlier by rxc.

Interface default methods use that same path. The assembler does not introduce new opcodes for them; it simply carries META_MEMBER kind method final and emits the interface method body as an ordinary procedure. The VM method registry then decides whether srcmethod should bind class.member directly or fall back to the interface’s emitted default-body procedure.

At the current Level B stage, the runtime selection rule is:

6. Core Socket Instructions

Core TCP sockets are exposed as normal RXAS mnemonics generated from binutils/include/rxops.h. They use VM-managed integer handles rather than OS file descriptors or pointers, so programs cannot accidentally retain a native socket after the VM context is freed. All socket instructions are FLG_OPT_BARRIER because they perform external I/O or mutate context-owned handle state.

The current instruction surface is intentionally raw TCP with optional client TLS connect support:

sockconnect, sockconnecttls, and sockbind keep the socket handle in their first operand and record the result in the handle’s status slot; use sockstatus or sockerror immediately after those operations. The higher-level rxsocket.rexx wrapper turns these into function return codes for ordinary Level B code.
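The handle and status-slot behaviour can be sketched with a small table owned by the VM context. Everything here is hypothetical (names, table size, 1-based handles, and the -1 returned for an invalid handle); no real socket is opened. The point is that a program only ever holds an integer, and an operation records its result against the handle instead of replacing it.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SOCKETS 8

struct demo_socket { int in_use; int status; };

struct demo_vm { struct demo_socket sockets[DEMO_MAX_SOCKETS]; };

void demo_vm_init(struct demo_vm *vm) {
    for (int i = 0; i < DEMO_MAX_SOCKETS; i++) {
        vm->sockets[i].in_use = 0;
        vm->sockets[i].status = 0;
    }
}

/* Allocate a handle (1-based), or return -1 if the table is full. */
int demo_sock_open(struct demo_vm *vm) {
    assert(vm != NULL);
    for (int i = 0; i < DEMO_MAX_SOCKETS; i++) {
        if (!vm->sockets[i].in_use) {
            vm->sockets[i].in_use = 1;
            vm->sockets[i].status = 0;
            return i + 1;
        }
    }
    return -1;
}

/* Look up a handle; NULL for out-of-range or closed handles. */
struct demo_socket *demo_sock_get(struct demo_vm *vm, int handle) {
    if (handle < 1 || handle > DEMO_MAX_SOCKETS) return NULL;
    struct demo_socket *s = &vm->sockets[handle - 1];
    return s->in_use ? s : NULL;
}

/* A connect-like operation keeps the handle and records its result. */
void demo_sock_record(struct demo_vm *vm, int handle, int result) {
    struct demo_socket *s = demo_sock_get(vm, handle);
    if (s != NULL) s->status = result;
}

/* Read back the status slot; -1 if the handle is not valid. */
int demo_sock_status(struct demo_vm *vm, int handle) {
    struct demo_socket *s = demo_sock_get(vm, handle);
    return s != NULL ? s->status : -1;
}

void demo_sock_close(struct demo_vm *vm, int handle) {
    struct demo_socket *s = demo_sock_get(vm, handle);
    if (s != NULL) s->in_use = 0;
}
```

Because every access goes through the table lookup, a handle that has been closed simply stops resolving, and nothing native is left in the program's registers.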

sockconnecttls is the portable client TLS connect primitive. It connects to rHost:rPort, starts TLS before any application bytes are exchanged, and uses rHost for SNI and certificate name verification. sockstarttls remains the lower-level true STARTTLS primitive for protocols that exchange clear-text bytes before TLS; backends that cannot upgrade an existing connection in place return a negative unsupported status instead of reconnecting behind the caller.

Both TLS instructions are present in all builds. When no TLS backend is compiled in, they record a negative socket status; sockstarttls also returns that code in rRc. Backends are selected at CMake configure time. Fresh builds default to CREXX_ENABLE_TLS=NETWORK on Apple platforms, CREXX_ENABLE_TLS=OPENSSL on non-Windows Unix-like platforms, and CREXX_ENABLE_TLS=SCHANNEL on Windows. The Network backend uses Network.framework, Security.framework, CoreFoundation.framework, and the system trust store for sockconnecttls. The OpenSSL backend supports both direct TLS connect and true STARTTLS. The SChannel backend uses Windows SChannel/SSPI and the Windows trust store, and supports both direct TLS connect and true STARTTLS.