Skip to content
Snippets Groups Projects
demo.c 49.4 KiB
Newer Older
  • Learn to ignore specific revisions
  • Michal Vlasák's avatar
    Michal Vlasák committed
    // Michal Vlasák, FIT CTU, 2023
    //
    // An example of a practical use of DynASM, the dynamic assembler from the
    // LuaJIT project by Mike Pall:
    //
    //         https://luajit.org/dynasm.html
    //
    // An indespensable resource for use of DynASM, except for the official sources
    // is the unofficial documentation by Peter Cawley and includes a tutorial and
    // a reference for the DynASM API and the x86/x86-64 instructions:
    //
    //         https://corsix.github.io/dynasm-doc/
    //
    // A great blog post introducing JITs and featuring DynASM has been written by
    // Josh Haberman:
    //
    //         https://blog.reverberate.org/2012/12/hello-jit-world-joy-of-simple-jits.html
    //
    //
    // This example demonstrates a template JIT compilation of a program for a very
    // simple stack machine. The bytecode design, an example program and the
    // original interpreter implemenation is due to Martin Dørum:
    //
    //         https://mort.coffee/home/fast-interpreters/
    
    
    // Because of glibc versions before 2.37 by mistake didn't expose
    // `MAP_ANONYMOUS` by default, and needed at least _BSD_SOURCE. But since 2.20,
    // _BSD_SOURCE is deprecated and _DEFAULT_SOURCE should be used instead. Not
    // ideal. See also feature_test_macros(7).
    #define _BSD_SOURCE
    #define _DEFAULT_SOURCE
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    
    // First, some generic includes.
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <assert.h>
    
    // These defines are very not not portable. But the rest of our program depends
    // on bigger details of the x86-64 architecture anyways, so assuming that int is
    // 32 bit signed integer (which dynasm also assumes) is least of our concerns.
    typedef unsigned char u8;
    typedef int i32;
    typedef unsigned int u32;
    
    // We need mmap and mprotect (on POSIX systems) or VirtualAlloc and
    // VirtualProtect (on Windows). See their later use in this file.
    #if _WIN32
    #include <windows.h>
    #else
    #include <sys/mman.h>
    #if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
    #define MAP_ANONYMOUS MAP_ANON
    #endif
    #endif
    
    
    // For Macs we need to use MAP_JIT, we wrap it in `MAP_JIT_VALUE`, so we can be
    // platform independent in the rest of the code. Also on Macs we have to
    // explicitly invalidate the instruction cache for the processor to see the new
    // executable code. This is a noop on x86-64 (otherwise we would also need to
    // consider the issue on other operating systems), but let's hope that just this
    
    // is enough to work on ARM64 based Macs with Rosetta 2.
    
    #if defined(__APPLE__) && defined(__MACH__)
    #define MAP_JIT_VALUE MAP_JIT
    void sys_icache_invalidate(void *start, size_t len);
    #else
    #define MAP_JIT_VALUE 0
    #endif
    
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    // Before we include dynasm includes, we need to (or rather want to) define a
    // few macros that these headers use. Skip ahead until you need to / want to
    // understand this. They allow us to customize the signatures and bodies of all
    // dynasm functions. Dynasm will translate assembly snippets in our C source by
    // replacing them with calls to `dasm_put`, with `Dst` as the first argument,
    // and then some other arguments we don't care about now.
    //
    //         dasm_put(Dst, index, varargs...)
    //
    // Dynasm functions have signatures and bodies which look something like this:
    //
    //         void
    //         dasm_X(Dst_DECL, ...)
    //         {
    //                 Dst_REF = ...;
    //                 ...
    //                 dasm_State *D = Dst_REF;
    //                 ...
    //                 dasm_Y(Dst);
    //                 ...
    //         }
    //
    // I.e. two more macros (`Dst_DECL` and `Dst_REF`) are used. The macro
    // `Dst_DECL` will become part of the function signature and `Dst_REF` needs to
    // be able to evaluate to an lvalue of `dasm_State *`, since dynasm function may
    // assign to it.
    //
    // 1) Our dynasm state will be local to only one function, so a local variable
    // holding the `dasm_State *` is sufficient there.
    //
    // 2) Because the dynasm functions need lvalue of `dasm_State *` we will need a
    // level of indirection. For convenience we will have another local variable
    // which will hold `dasm_State **`, i.e. a double pointer, whose dereference
    // evaluates to an lvalue of `dasm_State *`.
    //
    // 3) Since dasm functions also use `Dst` internally, `Dst`, `Dst_DECL` and
    // `Dst_REF` will necessarily have to use the same variable name. We pick the
    // name `ds` to snad for "DynASM state".
    //
    // 4) To ensure consistent use, all our functions needing `dasm_State` will also
    // use `Dst_DECL` in their signature and `Dst` to refer to the state, i.e. for
    // example to call other dynasm functions which expect the same parameter. The
    // canonical DynASM functions will then be able to extract `dasm_State *` and
    // operate on that structure.
    //
    // This seemingly complicated scheme introduced by DynASM has a nice property -
    // the state passed to DynASM functions is controlled by us. Other macros allow
    // more customization of e.g. allocation in DynASM and by providing our state,
    // we can customize how DynASM allocates memory. Also in real system, we would
    // have more state than `dasm_State *` anyways, so our use could look like this
    // instead:
    //
    //         typedef struct {
    //                 Heap heap;
    //                 Program program;
    //                 dasm_State *ds;
    //         } State;
    //
    //         #define Dst       state
    //         #define Dst_DECL  State *state
    //         #define Dst_REF   (sate->ds)
    //
    #define Dst       ds
    #define Dst_DECL  dasm_State **ds
    #define Dst_REF   (*ds)
    
    // Also last chance before we include the DynASM headers is to optionally enable
    // some checks from DynASM. We'll have them enabled unconditionally, in practice
    // we wouldn't have them enabled in a (sufficiently tested) release build.
    #define DASM_CHECKS
    
    // Now we can finally include the DynASM headers. The first one contains just
    // the declarations of functions we can use or fallback definitions of macros,
    // some of which we chose to override above. This header file is generic and
    // applies to all dynasm backends regardless of architecture.
    #include "dasm_proto.h"
    
    // The second header file we include is specific for our architecture. Since we
    // aim the x86-64 we need to include that. While the preprocessor has a separate
    // file for the 64 bit version of x86, there is only one header common to both.
    #include "dasm_x86.h"
    
    // The `.arch` line tells the dynasm.lua prepreprocessor to load dasm_x64.lua,
    // which knows the details of the x64 (x86-64) architecture. For this reason it
    // has has to one of the first thing the preprocessor sees. But, this directive
    // will also translate directly to some output: a check that the version of the
    // preprocessor (dynasm.lua) is consistent with the header file (dasm_proto.h).
    // So it can't be before the includes above.
    //|.arch x64
    
    // The above directive translates to something like this:
    //
    //         #if DASM_VERSION != 10500
    //         #error "Version mismatch between DynASM and included encoding engine"
    //         #endif
    
    // Dynasm allows us to write snippets of assembly after `|` ("pipe"), or in our
    // version after `//|`. These snippets are preprocessed into calls to the
    // `dasm_put` function which appends the code to an internal dasm buffer, which
    // after several passess (done by functions which will be introduced in due
    // time), we can extract and have an executable series of bytes, i.e. code
    // generated at runtime. The biggest devises of dynasm are:
    //
    //  1) We write the assembly snippets directly into our C source, which makes it
    //     quite readable.
    //
    //  2) We can mix the assembly with regular C expressions. These expressions are
    //     written statically in the source (i.e. `4 * offset`), but their values
    //     can be different at different times `dynasm_put` is called (that is, when
    //     the C execution gets to our code between `//|`). This is nice when we
    //     have things with static _shapes_, but that can take advantage of having
    //     different constants being embedded in the assembly. For an example a
    //     bytecode `OP_GET_LOCAL index` instruction that sets the local at index
    //     `index` to the value at the top of the stack, probably has an index
    //     following it in the instruction stream. From the perspective of the
    //     interpreter, the index has to be loaded everytime any `OP_GET_LOCAL`
    //     instruction is executed, since different `GET_LOCAL` instructions can
    //     have different indices. But if we were to translate that instruction into
    //     assembly, we are always translating a concrete `GET_LOCAL` with a
    //     concrete index, so what we want to generate generally will be the same
    //     code, but with a different constant.
    //
    //     Suppose that locals are 8 bytes and that base pointer to locals is in
    //     the register rbx, and that we want to push the value to the stack:
    //
    //   if (op == OP_GET_LOCAL) {
    //           int index = READ_INDEX(ip + 1);
    //           | mov rax, [rbx + (index * 8)]
    //           | push rax
    //   }
    //
    //   // Here `(index * 8)` is a regular C expression which will become the
    //   argument to `dasm_put`, i.e. `dasm_put(..., (index * 8))`, other arguments
    //   to `dasm_put` need to somehow encode the rest of the snippet, i.e. the
    //   instruction bytes:
    //
    //           mov REG, [REG + CONST]
    //           push REG
    //
    //
    //
    // `.actionlist` results in static array of unsigned chars (i.e. bytes), which
    // constitute a bytecode for the runtime portion of dynasm which performs the
    // puts, links and encodes at runtime. Different `dasm_put` calls use different
    // portions of this buffer. There can be only one `.actionlist` per file.
    //
    // The bytecode contains bytes that are translated literally into the output,
    // these are bytes that directly correspond to encodings of instructions.
    //
    // The "encoding time constants", i.e. the C expressions are not put into the
    // action list. First of all, these are C expressions, that dynasm doesn't know
    
    // much about (it is a simple text based preprocessor), but also naturally these
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    // expressions evaluate to different values at runtime. Instead, the bytecode
    // contains also dynasm instructions, which tell dynasm that for example, now it
    // should take the next expression (passed to `dasm_put` as argument) or that it
    // is now encoding a forward jump with a 4 byte offset, which should be reserved
    // in the first pass, but resolved in a subsequent pass.
    //
    //|.actionlist our_dasm_actions
    //
    // The above directive translates to something like this:
    //
    //
    //         static const unsigned char our_dasm_actions[85] = {
    //                 83,85,72,137,229,72,187,237,237,255,249,204,255,104,237,255,
    //                 89,88,72,1,200, 80,255,72,184,237,237,72,191,237,237,94,252,
    //                 ...
    //         };
    //
    //  Generally most bytes correspond to instruction bytes directly, while the
    //  high bytes (233+ at the moment) are DASM instructions.
    //
    
    // Labels are a well known to anybody who has seen any assembly code. They allow
    // us to refer to positions in the code (or data, ...) by human readable names,
    // instead of the relative offsets or absolute positions the machine code needs.
    // DynASM allows us to have a fixed set of global labels, and give us an array
    // of pointers (`void *`) to them. So not only can we refer to these labels in
    // DynASM snippets and DynASM will translate the use of those labels to
    // probably relative offsets of jumps / calls, but we will also have absolute
    // addresses of these labels and can e.g. call them directly, by casting these
    // pointers to function pointers and calling them.
    //
    // For example we can define an "identity" function that just returns its first
    // (and only argument). According to the AMD 64 System V ABI, the first
    // arguments is passed in register rdi and the return value is supposed to be in
    // the register rax, so our function marked with the global "MY_IDENTITY" label
    // looks like this:
    //
    //         |->MY_IDENTITY:
    //         |
    //         |  mov rax, rdi
    //         |  ret
    //
    // We can refer this function from any other snippet and DynASM will translate
    // the relative call offset as appropriate:
    //
    //         | mov rdi, 5
    //         | call ->MY_IDENTITY
    //
    // The situation with offsets is not that easy though (they are only 32 bit,
    // even on x86-64 where the address space is 64 bit), but we'll come back to
    // that later.
    //
    // Or after processing it, we can access take the address of the function
    // (`void *`), cast it appropriately and call it. The addresses of all globals
    // are stored in a single globals array, which we need to allocate ourselves.
    // The global functions are referred to by the means of their symbolic names
    // through a C enum, with a prefix. We set this prefix by use of the `.globals`
    // dynasm preprocessor directive, which will translate to the enum with all
    // global label names (with the prefix).
    //
    //         void *our_dasm_labels[DASM_LBL__MAX];
    //         dasm_setupglobal(Dst, our_dasm_labels, DASM_LBL__MAX);
    //         // processing the snippet with the global label, as well as linking
    //         // and encoding of the code are omitted, but after that we could do:
    //         long (*identity)(long argument) = (long (*)(long)) dasm_labels[DASM_LBL_MY_IDENTITY];
    //         long five = identity(5);
    //
    // Above, `identity` is a pointer to function taking one argument of the type
    // `long` and returning a `long`. This is the type of the identity function we
    // defined above (since `long` is 64 bit on x86-64). We cast the `void *` from
    // the globals array and store it in `identity` so that we can then call it.
    //
    // (In standard C casting between data pointers and function pointers is not
    // allowed, and indeed it doesn't work on some architectures, but we are fine
    // on x86-64.)
    //
    // Here we tell the dynasm preprocessor what prefix we want to use for the enum,
    // and it will also translate to that enum:
    //
    //|.globals DASM_LBL_
    //
    // The above translates to something like:
    //         enum {
    //                 DASM_LBL_MY_IDENTITY,
    //                 DASM_LBL__MAX
    //         };
    
    
    // DynASM support multiple sectinons, though they are all currently limited to
    // executable code, and "data" or "bss" sections are not supported. We will do
    // with just one section for "code" we declare it here, which causes the
    // preprocessor to define a couple of section related macros -- one for each
    // section to give it an index, and one for the total number of sections that
    // will came handy later, when we call `dasm_init` which needs to know the
    // maximal number of sections we want to use.
    //|.section code
    //
    // The above translates to something like:
    //
    //         #define DASM_SECTION_CODE   0
    //         #define DASM_MAXSECTION     1
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    
    // In these days of virtual memory, most of the memory we can allocate won't be
    // executable. So we need to allocate our own pages of memory, where we will put
    
    // our code and make them executable. However, security-wise, having memory
    // which is both writable and executable is far from ideal, so it is better to
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    // allocate writable pages, put our code into them and then make them
    // executable, but not writable. That way the pages are never both writable and
    // executable at the same time (this is known as W^X and even strictly enforced
    // by some operating systems).
    //
    
    // Apart from the checks the below function is mostly a copy of the one from
    // Peter Cawley's DynASM tutorial, we just do more checking with asserts and try
    // to support Macs.
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    //
    // What the function does, is that it receives a dynasm state, with some
    // snippets already put into it with `dasm_put` which is essentially a first
    // pass over the code, where we are just pasting together snippets and DynASM is
    // interpolating them with the evaluations of the used expressions. `dasm_link`
    // is the second pass we need to do on the pending code. The call to the
    // function essentially tells dynasm that here our assembly pasting ends. It
    // will do a pass over the code determining the relative offsets of labels, etc.
    // and most importantly it will already figure out the final length of the
    // code and return it to use through an output parameter. We use that size to
    // allocate enough writable pages from the operating system and tell DynASM to
    // encode the code there. This will not only allow us to mark those pages
    // executable and have the code in an executable area, but also allows DynASM to
    // know the final destination of the code, so it can finalize relative offsets
    // to e.g. global labels, which (in general) are not in the pending piece of
    // code. After we have the code in our buffer, we ask the operating to change
    // the protection of the pages there to make the area non-writable, but
    // executable. Finally we return a pointer (`void *`) to the start of the
    // buffer. This leaves our function generic and we don't have to limit our
    // linking and encoding to a single function signature -- the caller can cast in
    // any way they like.
    static void *
    
    our_dasm_link_and_encode(Dst_DECL)
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    {
    
    	size_t size;
    	void* code;
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    	assert(dasm_checkstep(Dst, 0) == DASM_S_OK);
    
    	assert(dasm_link(Dst, &size) == DASM_S_OK);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    #ifdef _WIN32
    
    	code = VirtualAlloc(0, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    #else
    
    	code = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT_VALUE, -1, 0);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    #endif
    
    	assert(dasm_encode(Dst, code) == DASM_S_OK);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    #ifdef _WIN32
    
    	DWORD original;
    	VirtualProtect(code, size, PAGE_EXECUTE_READ, &original);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    #else
    
    	mprotect(code, size, PROT_READ | PROT_EXEC);
    #endif
    #if defined(__APPLE__) && defined(__MACH__)
    	sys_icache_invalidate(code, size);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    #endif
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    }
    
    
    // All instructions implicitly bump the instruction pointer by their length
    // after they are executed, unless stated otherwise.
    enum op {
    	// Take 4 bytes from the instruction stream, interpret them as little
    	// endian two's complement signed integer and push them on to the stack.
    	OP_CONSTANT,
    
    	// Pop `b` and `a` respectively from the stack, push `a + b` on to the
    	// stack.
    	OP_ADD,
    
    	// Pop a value from top of the stack and print it.
    	OP_PRINT,
    
    	// Take a value from the input stream and advance the input position.
    	OP_INPUT,
    
    	// Pop a value from top of the stack and don't use it for anything.
    	OP_DISCARD,
    
    	// Take 4 bytes from the instruction stream, interpret them as little
    	// endian two's complement signed integer to be used as an offset
    	// relative to the top of the stack (i.e. offset 0 corresponds to the
    	// 32 bit value on top of the stack, offset 1 corresponds to the 32 bit
    	// value 1 below the top of the stack). Push the value found on that
    	// offset (relative to the top of the stack) on top of the stack.
    	OP_GET,
    
    	// Take 4 bytes from the instruction stream, interpret them as little
    	// endian two's complement signed integer to be used as an offset
    	// relative to the top of the stack (i.e. offset 0 corresponds to the
    	// 32 bit value on top of the stack, offset 1 corresponds to the 32 bit
    	// value 1 below the top of the stack). Pop a value from top of the
    	// stack. Assign the value to the slot at the offset (relative to the
    	// top of the stack).
    	OP_SET,
    
    	// Pop `b` and `a` respectively from top of the stack. Compare them and
    	// push a positive value to the top of the stack if `a` is bigger then
    	// `b`, negative value if `b` is bigger than `a` and 0 if `a` is equal
    	// to `b`.
    	OP_CMP,
    
    	// Take 4 bytes from the instruction stream, interpret them as little
    	// endian two's complement signed integer to be used as an offset in the
    	// instruction stream. Pop a value from top of the stack. If the value
    	// is positive, then apply the offset (in bytes) to the current
    	// instruction pointer. Note that the offset is relative to the start
    	// of the `OP_JGT` instruction, i.e. the instruction pointer is only
    	// bumped by the length of the instruction in case the conditional
    	// jump isn't executed.
    	OP_JGT,
    
    	// Halt the execution of the program.
    	OP_HALT,
    };
    
    static void *
    compile(u8 *program, size_t program_len)
    {
    	// Here is the promised local variable holding the `dasm_State *`
    	// itself. Though as defined with the macros above, we and other
    	// functions will generally use it with the name `ds` and expect a
    	// double pointer.
    	dasm_State *dasm_state;
    	dasm_State **ds = &dasm_state;
    
    	// Each state has to be initialized with a call to `dasm_init`. As
    	// promised, we use the macro `Dst`, instead of referring to `ds`
    	// directly. This gives us more consistency with rest of the functions
    	// and calls and is flexible. In any case the name `Dst` is referred to
    	// by the `dasm_put` calls generated by the preprocessor. The second
    	// argument tells DynASM the number of sections we will be using. We
    
    	// only need 1, the section for all our code, but we also diligently used
    	// the `.section` directive above and told DynASM what names we wanted
    	// to give to our sections (we chose "code" for our single one) and thus
    	// now have available the macro DASM_MAXSECTION generated by the
    	// preprocessor.
    	dasm_init(Dst, DASM_MAXSECTION);
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    
    	// We are not using global labels and in a simple template JIT likely
    	// won't find the need, but we sill need to setup the globals, because
    	// local labels can only be used when global labels are setup. And local
    	// labels come handy pretty quickly (our snippets may need to contain
    	// loops for example).
    	void *our_dasm_labels[DASM_LBL__MAX];
    	dasm_setupglobal(Dst, our_dasm_labels, DASM_LBL__MAX);
    
    	// Now that we have our dynasm state initialized, we potentially want to
    	// reuse it to assemble multiple pastings of templates and not just one.
    	// Calling `dasm_init` and `dasm_free` everytime is an option, but we
    	// can just reuse the same state. We just have to initialize each
    	// "trace" with a call to `dasm_setup`, which we have to do even if we
    	// want to process a single "trace" like we are doing in this example.
    	// So here it is. We have to provide the byte array prepared by the
    	// preprocessor. There can only be one `.actions` directive per file,
    	// which means that all `dasm_put` calls in one file are based on that
    	// single actions array (called `our_dasm_actions` in this case), so we
    	// really don't want to pass anything other here. (One could
    	// probably have different files with different `.actions`, but
    	// "templates" from different files couldn't be mixed since DASM stores
    	// the current actions in its internal state and it can't be changed in
    	// any other way than `dasm_setup`, which resets the state.)
    	dasm_setup(Dst, our_dasm_actions);
    
    	// We want to translate jumps from the bytecode to jumps in the assembly
    	// code and want dynasm to figure out the (relative) offsets as needed
    	// by the instructions. For example our `OP_JGT` instruction has an
    	// immediate operand which is the (byte) offset to apply in case a jump
    	// should be taken. The machine code encodes similarly a relative offset
    	// for a (conditional) jump, but it will be a different one than the one
    	// in the bytecode. Since there is a potentially arbitrary number of
    	// instructions in the bytecode we want to compile, we can't possibly
    	// use local labels (1:, ..., 9:), because there are only 9 of them and
    	// already want to be able to use them for local control flow (in one or
    	// possible more templates pasted together) such as conditional
    	// execution or loops. Global labels are also not suitable, because
    	// these have symbolic names and we would have a hard time encoding
    	// things like "instruction 5 needs to jump 8 bytes back, which is a
    	// beginning of another instruction".
    	//
    	// DynASM has us covered and offers "pc labels", also called "dynamic
    	// labels", which are just positive integers (internally used as
    	// indices). These labels are dynamic, because the integers are
    	// determined at the time of `dasm_put`, i.e. they can be arbitrary C
    	// expressions that can evaluate to different values at different times
    	// the snippet is pasted (also called "encoding-time constants"). So
    	// again same shape (snippet), but with different values. We will be
    	// using these labels in a really straightforward way - we'll want to
    	// have a dynamic label for each instruction. When we encounter a jump
    	// instruction, we can just point the jump to the dynamic label of the
    	// destination. This works fine for backward as well as forward jumps,
    	// as with all labels, DynASM is able to resolve relative offsets,
    	// because it does multiple passes on the code. This is even simpler
    	// than our compiler, which had to "fixup" forward jumps!
    	//
    	// Because the dynamic labels are indices to a DynASM internal array, we
    	// need to "preallocate" the ones we'll need, and if we ever need more,
    	// we'll have to allocate more. This is done with call to `dasm_growpc`.
    	// After call to `dasm_growpc(Dst, N)`, dynamic labels 0 through at
    	// least N - 1 will be available to us.
    	//
    	// Our bytecode is a compact serialization - instructions have different
    	// lengths and jumps are based on byte offsets. We will go for a simple
    	// route and instead of somehow numbering the instructions from 0 to
    	// the number of instructions (and then figuring out the forward jump
    	// destinations in a second pass), we'll just have a dynamic label for
    
    	// each _byte_. Then even for forward jumps we are able to calculate
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    	// their byte offset in the instruction stream, where we will be put an
    	// appropriate dynamic label later, and DynASM will resolve it in its
    	// second pass. Having a label for each _byte_ is potentially
    	// really wasteful as a lot of instructions have lengthy immediates, but
    	// it's really simple. (The dynamic labels and the backing array are
    	// also reused for multiple runs, if we don't `dasm_free` immediately,
    	// but just `dasm_setup` before each run).
    	dasm_growpc(Dst, program_len);
    
    	// Now we have a fully initialized DynASM state for this round of
    	// pasting together some assembly snippets. Remember that the lines with
    	// assembly preceded with `//|` will be translated to calls to
    	// `dasm_put(...)`.  So indeed what we are just doing is by controlling
    	// the C execution we determine which assembly snippets we want to paste
    	// together. With DynASM we can form arbitrary pieces of code, but we
    	// somehow need to execute the code. We of course know how to execute
    	// some other code in assembly - jump to it or call it. Both essentially
    	// do a similar thing - through a relative offset or absolute address
    	// stored in a register they change the instruction pointer (and call
    	// additionally pushes to the stack the previous instruction pointer
    	// which pointed to the _next_ instruction, so after we return from the
    	// call, we can just resume the execution).
    
    	// But we want to also somehow jump to our code from C. Unlike the
    	// popular belief, the semantics of "goto statements" in C are not such
    	// low level as a jumps in assembly, so we'll have to do with function
    	// calls. All functions in C respect something that is called the ABI
    	// of the platform, the "Application Binary Interface". The set of rules
    	// for low level code, so that e.g. functions compiled with one compiler
    	// can call functions compiled with another compiler. The ABI is usually
    	// specific to both the current processor architecture (e.g. x86-64),
    	// because the instruction set is really constraining us in what our
    	// "binaries" may look like, as well as the operating system (e.g.
    	// Linux), because not only we somehow need to communicate with the
    	// operating system, but they historically evolved differently and e.g.
    	// have different needs, so the ABI differs. Importantly for us, the ABI
    	// defines sizes of C integer types (e.g. `long` is 64-bit on x86-64
    	// Linux, but 32-bit on x86-64 Windows) as well as how parameters and
    	// return values are passed. For simplicity we will care only about
    	// the System V x86-64 ABI, used notably by Linux and other operating
    	// systems.
    	//
    	//     https://gitlab.com/x86-psABIs/x86-64-ABI
    	//     https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build
    	//
    	// Notably, Windows has a custom x86-64 ABI:
    	//
    	//     https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170
    	//
    	// See also a blog post by Eli Bendersky:
    	//
    	//     https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64
    
    	// Here is the System V ABI 101 (imprecise, but enough to get going):
    	//
    	// First 6 arguments are passed in registers (left-to-right):
    	//
    	//     1. rdi
    	//     2. rsi
    	//     3. rdx
    	//     4. rcx
    	//     5. r8
    	//     6. r9
    	//
    	// The return value is expected to be in the register rax. If there are
    	// more than 6 arguments they pushed on to the stack right-to-left. Even
    	// arguments that are smaller than 64 bits use the full registers to be
    	// passed. Small-ish structs are passed as if all the struct members
    	// were passed as arguments separately.
    	//
    	// These registers are _callee_ saved (i.!e. the function that called us
    	// expects the values in these registers to be preserved, so we, the
    	// callee, need to preserve the registers by not using them or restoring
    
    	// them to their original values):
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    	//
    	//     rbp ("base pointer")
    	//     rsp ("stack pointer")
    	//     rbx
    	//     r12
    	//     r13
    	//     r14
    	//     r15
    	//
    	// These registers are _caller_ saved (i.e. we as the callee can use
    	// them freely, since if the caller needs the their values they need to
    	// save the values anyway, on the other hand if we call some other
    	// function, we are the caller and the function can possibly change any
    	// of the registers, so we have to save the values if we need them):
    	//
    	//     rdi
    	//     rsi
    	//     rdx
    	//     rcx
    	//     r8
    	//     r9
    	//
    	//     rax
    	//     r10
    	//     r11
    	//
    	// Notably all registers used for parameter passing are caller saved.
    	// Obviously the return value register is also caller saved, but note
    	// that the callee is free to change it also in the case there is no
    	// return value ("the functions returns void").
    
    	// Now is a good time to decide on how and where to store the data we
    	// need. We won't need the program at runtime, since it will be
    	// hardcoded into the code (including the immediates, etc.), but we will
    	// need the stack (since we will just translate the bytecode for stack
    	// machine into assembly, not change its workings substantially) and
    	// also a pointer to the current input position.
    	//
    	// The pointer to inputs will be passed to us as an argument - this is
    	// the runtime data that our program will process. The `OP_INPUT`
    	// instructions take one 32 bit number from the input and advance the
    	// input pointer (by 4 bytes). We assume that the caller provided
    	// sufficient number of inputs for the program at hand, so we don't
    	// necessarily need the length of the input nor check it at runtime.
    	// Since it is our only argument it will be in rdi, but that is a caller
    	// saved register and when we later execute `OP_PRINT`, we are the
    	// caller, so we need to save the register. We could either `push` and
    	// `pop` rdi around each call, or move the argument to a callee saved
    	// register, which we can save once in the entry to the function and
    	// restore once at the end of the function. This makes for a much
    	// simpler book keeping, though as always, if we save multiple callee
    	// saved registers with `push`, we have to restore them with `pop` in
    	// the _reverse_ order.
    	//
    	// For stack, we'll just use the x86-64 (call) stack. `OP_PUSH` will
    	// translate directly to `push` and `OP_POP` directly to `pop`. These
    	// instructions change the value of `rsp`, which though a "general
    	// purpose register" is usually used as the stack pointer exactly
    	// because some instructions can manipulate it compactly (`call` is
    	// another example of an instruction that implicitly changes the stack
    	// pointer by pushing the return address). So for us `rsp` will
    	// represent pointer to the top of the stack, and we will have to
    	// embrace its constraints. In particular `rsp` points to the position
    	// where the current top of the stack _is_, not one past it like other
    	// schemes do, and the stack grows downwards. So top of the stack is at
    	// offset zero from top of the stack, and if we want to push a value to
    	// the stack, we first need to decrement the stack pointer and then put
    	// the value at the stack pointer position, while to pop we need to
    	// read the value from the current stack pointer and then increment it.
    
    	// I.e. pre-decrement push and post-increment pop (we would get another
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    	// combination if the stack pointer pointed one past the top of the
    	// stack, or the stack grew upwards). While we have to keep that in
    	// mind, this is what the `push` / `pop` instructions will do for us,
    	// and we won't have to do it manually). See also another blog post by
    	// Eli Bendersky, which has illustrations:
    	//
    	//     https://eli.thegreenplace.net/2011/02/04/where-the-top-of-the-stack-is-on-x86/
    	//
    	// (Note that the blog post concerns x86 whose stack operations are 32
    	// bit, while on x86-64 stack operations are 64 bit, e.g. push
    	// instructions decrement the stack by 8).
    
    	// Here comes the entry to our function, the "prologue". Here we usually
    	// want to save all the callee saved registers that we intend to use.
    	// Usually prologue starts with "push rbp" and "mov rbp, rsp", which
    	// establish the "stack frame" (also called "call frame"), see the
    	// linked blog posts by Eli Bendersky for details. Instead we start by
    	// pushing rbx. We will use this callee save register to store the input
    	// pointer -- while the caller passes us the input pointer as the first
    	// argument in rdi, it would get clobbered once we call other function
    	// ourselves. Storing it in a callee saved register and
    	// storing/restoring the register on entry/exit keeps the management
    	// simple. Callee The thing is that the execution of the program may
    	// leave some values on the stack. By resetting the rbp we could get
    	// easily rid of them, though it would need for us to address
    	// relatively to the rsp to restore rbx if it were pushed after rbp.
    	// So we simply push it before (and this restore it with pop after we
    	// restore the base pointer).
    
    	//| push rbx
    	//| push rbp
    	//| mov rbp, rsp
    	//| mov rbx, rdi
    
    	// Now we will go through all instructions and translate them one by
    	// one. For each instruction we have a snippet of assembly, that will be
    	// interpolated with value computed from embedded C expressions
    	// ("encoding-time constants").
    	u8 *instrptr = program;
    	u8 *end = program + program_len;
    
    
    	// This macro is a helper, which reads a 4 byte immediate operand from
    	// a position one byte behind the instruction pointer (which skips the
    	// opcode).
    	#define OPERAND() ((i32) ( \
    		((u32)instrptr[1] << 0) | \
    		((u32)instrptr[2] << 8) | \
    		((u32)instrptr[3] << 16) | \
    		((u32)instrptr[4] << 24)))
    
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    	while (instrptr < end) {
    
    		// Read the current opcode, which distinguishes the current
    		// instruction.
    		enum op op = (enum op)*instrptr;
    
    		// Put here a label corresponding to this instructions byte
    		// offset. Any preceding or following jumps to this instruction
    		// will find this label. Note that we cast the offset to an
    		// `int`. This is because DynASM expects all arguments to
    		// `dasm_put` to `int`s. As mentioned above, in this ABI `int`
    		// is 32 bit, so all values passed to DynASM are limited to the
    		// 32 bit range. This is mostly OK, since in the x86-64 most
    		// immediates encoded in the instruction stream are also limited
    		// to 4 bytes. This 32 bit limitation is thus a limitation of
    		// our target architecture, not necessarily of DynASM, and we
    		// have to somehow deal with that.
    		//
    		// Note that in this case, it's unlikely that we would have
    		// programs larger than 4 gigabytes, i.e. offset would not fit
    		// into 32 bits. But even shorter programs could produce x86-64
    		// executable code larger than 4 gigabytes, and there 32 bit
    		// relative offsets for jumps and calls would also not be
    		// enough.
    		//
    		// Also at the start of the instruction, there is a (commented
    		// out) `int3` instruction. Comment with `//!` (i.e. two forward
    		// slashes and exclamation mark) is not a special syntax
    		// understood by DynASM, just something different than `//|`,
    		// but still a C comment. The `int3` instruction is a "trap to
    		// the debugger". So if you were to uncomment it and run this
    		// program in debugger, the debugger would stop before each
    		// executed (jitted) bytecode instruction. This can be very
    		// useful for debugging, also if paired with a print of some
    		// debug information before/after each trap.
    
    		int offset = (int) (instrptr - program);
    		//|=> offset:
    		//! int3
    
    		// The above  will be translated to roughly:
    		//
    		//         dasm_put(..., offset);
    		//
    		// We could have also said:
    		//
    		//         |=> (int) (instrptr - program)
    		//
    		// Which would translate to:
    		//
    		//         dasm_put(..., (int) (instrptr - program));
    		//
    		// There is nothing much more magic about the embedded C
    		// expressions - DynASM handles them textually, and they are
    		// evaluated at runtime like ordinary C expressions and passed
    		// as arguments to `dasm_put`. Note that officially the
    		// evaluation order of arguments is undefined in C, so it might
    		// a good idea to explicitly evaluate the expressions before
    		// using them in assembly snippets, if they can have side
    		// effects, and even if they don't.
    
    		switch (op) {
    		case OP_CONSTANT: {
    			// If the next instruction is `OP_CONSTANT` we need to
    			// append the code for it, we do it by having an
    			// assembly snippet here, which translates to
    			// `dasm_put`. The instruction has an immediate operand
    			// which we read with a macro and store it in a variable
    			// before passing it to the snippet. While the
    			// `OP_CONSTANT` instruction is generic and an
    			// interpreter has to load the constant from bytecode
    			// everytime, this particular `OP_CONSTANT` will always
    			// have this one same operand, so we can just
    
    			// embed it in the code. The below translates to `push 4`
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    			// if the operand in the instruction stream is
    			// the number 4. This will be pattern matched by DynASM
    			// to be a push of 32 bit immediate (see the unofficial
    			// DynASM instruction reference or other x86-64
    			// instruction references). This is perfect for us,
    			// since we want to push a 32 bit value. If we wanted
    			// to push 64 bit value, we would first have to somehow
    			// get the value to a register (see later) and then use
    			// `push REG`, because there just isn't an instruction
    			// that is able to push 64 bit immediates. DynASM won't
    			// do that for us automatically. What's worse the
    			// argument (passed as vararg) is expected to be an int,
    			// so passing larger values will only result in silent
    			// truncation. Beware!
    
    			i32 operand = OPERAND();
    			//| push operand
    
    			// This instruction is 5 bytes long, we have to advance
    			// the instruction pointer by 5 bytes and switch on the
    			// opcode to decide on what to compile next.
    
    			instrptr += 5; break;
    		}
    		case OP_ADD: {
    			// Pop the second argument, pop the first argument, add
    			// the second to the first and push their result. The
    			// operations here are 64 bit. Since stacks work with 64
    			// bit values, the `pop`s and `push`es have to be 64 bit,
    			// but the arithmetic doesn't -- we could just as well
    			// have written `add eax, ecx` to operate on 32 bit
    
    			// parts of the registers (which will zero out the
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    			// upper 32 bits). It would have save us a byte in the
    			// encoding of the instructions and be slightly faster
    			// in runtime (adding 32 bits is faster than adding 64),
    			// but we don't care about such small improvements.
    
    			//| pop rcx
    			//| pop rax
    			//| add rax, rcx
    			//| push rax
    
    			instrptr += 1; break;
    		}
    		case OP_PRINT: {
    			// Printing is a little tricky. Barring system calls, we
    			// have to call an external function to do the print.
    			// This could be either our own function, or
    			// even one from external library, like `printf`.
    			// There is one problem though normal calls are based on
    			// signed 32 bit relative offsets, while the x86-64
    			// space is 64 bit, so call targets can be much further
    			// than +-2 GiB. Here we find ourselves at runtime,
    			// where both the static and dynamic loader/linker
    			// have already figured out where everything in the
    			// memory is, after all, our program is already
    			// running. So for example the (64 bit) address of the
    			// `printf` function is already known. But since we have
    			// not yet mmaped pages for our code, we don't know
    			// whether it will be close to `printf`!. DynASM has the
    			// capability to handle external names through a
    			// callback which will let us handle that in the
    
    			// encoding stage where we already know the
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    			// destination of our code, but let's not go that far
    			// right now. What we can do is use an _indirect_
    			// through a register - we will store the 64 bit address
    			// of `printf` in a register, and then call it
    			// with `call REG`. Most instructions don't allow 64 bit
    			// immediate operands, but one does, DynASM calls it
    			// fittingly `mov64`, but other assemblers call it
    			// `movabs`. It can store a 64 bit immediate in a
    			// register. Perfect for us. Though there is another
    			// problem, since the parameters to `dasm_put` are all
    			// 32 bit integers (`int`), how are we going to pass it
    			// to DynASM? Well, DynASM preprocessor knows that the
    			// instruction operand is special and handles it by
    			// passing the lower 32 bits and upper 32 bits
    			// separately. Since the argument to `dasm_put` is
    			// determined by a simple text substitution, and here
    			// the argument will be evaluated twice, we should be
    			// even more careful about complex expressions and side
    			// effects.
    			//
    			// To printf, we also have to pass the format string as
    			// a second argument, we also take its address and move
    			// it into a register as 64 bit immediate. The second
    			// argument should be the number to print -- we pop that
    			// from the stack into a register. Finally we call the
    			// function with an indirect call instruction.
    			// Note that we cast the addresses to integers to avoid
    			// compiler warnings (because dynasm follows it with
    			// truncation to 32 bits or extraction of upper 32 bits
    			// respectively.
    			//
    			// It is important to note, that printf respects the
    			// ABI, so we have to presume it destroys the
    			// values in caller saved registers, we only store our
    			// state in rbx, so we our fine.
    
    			//| mov64 rax, ((uintptr_t) printf)
    			//| mov64 rdi, ((uintptr_t) "%zd\n")
    			//| pop rsi
    			//| call rax
    
    			// The above is translated to roughly the following:
    			//
    			//     dasm_put(...,
    			//         (unsigned int)(((uintptr_t) printf)),
    			//         (unsigned int)((((uintptr_t) printf))>>32),
    			//         (unsigned int)(((uintptr_t) "%zd\n")),
    			//         (unsigned int)((((uintptr_t) "%zd\n"))>>32)
    			//     )
    
    			instrptr += 1; break;
    		}
    		case OP_INPUT: {
    			// We have to read 32 bits from the place where rbx (the
    			// input pointer) points to and advance it by
    			// those 32 bits so next time we read the next input
    			// value.
    			//
    			// For reading a 32 bit value (aka dword) from memory,
    			// we could just issue a read into some 32 bit
    			// register, like `eax`, i.e. `mov eax, [rbx]`, which
    			// says "read a 32 bit value from a 64 bit address
    			// stored in the rbx register. The size of the read is
    			// determined implicitly from the (in this case
    			// destination) register size. In cases involving
    			// immediates, we usually have to specify whether we are
    			// storing an 8/16/32/64 bit immediate by including the
    			// right keyword for the size (`byte`, `word`, `dword`,
    
    			// `qword` respectively). But we can also include these
    			// keywords when they are not necessary, if we prefer
    			// that, as we do below by specifying "dword read", even
    			// though it would have been implicit from the use of
    			// `eax` as the destination.
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    
    			//| mov eax, dword [rbx]
    			//| push rax
    			//| add rbx, 4
    
    			instrptr += 1; break;
    		}
    		case OP_DISCARD: {
    			// We just `pop` into a register and don't use the
    
    			// value. That is what `OP_DISCARD` really is. Okay,
    			// since we don't do anything interesting with the
    			// value, we could just increment the stack pointer with
    			// `add rsp, 8`, which would have saved a load to `rax`,
    			// but since we don't care about preserving `rax`, and
    			// use `push` and `pop` instructions in other places, so
    			// we do it for consistency.
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    
    			//| pop rax
    
    			instrptr += 1; break;
    		}
    		case OP_GET: {
    			// For `OP_GET` we have a 32 bit immediate operand which
    			// we will have to use as an offset to the top of the
    
    			// stack. The offset is in number of elements, and in
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    			// our case each elements is 8 bytes. Since we load the
    			// both the constant 8 and the operand as well as the
    			// result of their multiplication are runtime known
    			// values, but constants from the point of view of the
    			// instruction - the multiplication is performed now and
    			// the instruction will only contain the offset
    			// as an immediate, for example if operand is 5, the
    			// instruction will have the immediate value 40 encoded
    			// in it. The way we do that below is slightly risky, we
    			// put the multiplication and read of the operand in the
    			// assembly snippet. The syntax may seem similar to the
    			// well known complex addressing modes of x86-64, where
    			// for example `[rsp + 8 * rax]` would mean of the
    			// possible addressings. The trick here is that DynASM
    			// expects to see `[rsp + rax * 8]`, so it isn't
    			// confused by the leading `8 *`. Since `8 * OPERAND()`
    
    			// is an encoding time constant, an entirely different
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    			// addressing mode will be used `[REG + IMMEDIATE]`.
    
    			// Evaluating the "constant" before the assembly snippet
    
    Michal Vlasák's avatar
    Michal Vlasák committed
    			// and storing it in a variable would have probably made
    			// it much clearer.
    
    			//| mov rax, [rsp + 8 * OPERAND()]
    			//| push rax
    
    			// The above translates roughly to:
    			//
    			//     dasm_put(..., 8 * OPERAND());
    
    			instrptr += 5; break;
    		}
    		case OP_SET: {
    			// There isn't anything interesting to `OP_SET`, we
    			// pop a value from top of the stack and then store it
    			// to an offset from (the new) top of the stack.
    
    			//| pop rax
    			//| mov [rsp + 8 * OPERAND()], rax
    
    			instrptr += 5; break;
    		}
    		case OP_CMP: {