[CSE211 Reading] MLIRSynth framework

  1. Motivation:
    • Problem Statement
      • The current compilation phase for heterogeneous devices like CPU GPU or TPU is too divergent and not high performance because of the lack of semantic translation when lowering the IR.
      • image-20231112081221616
      • MLIR is an infrastructure for developing domain-specific compilers. To aid this, MLIR provides reusable building blocks, especially the abstraction of dialects with a bunch of operators that have knowledge of cross-device memory communication and predefined and shared tools that allow us to define domain-specific languages and their compilation pipelines.
    • SoTA
      • Polygest has already filled what's done for multi-dimensional dialects like Affine IR, where we normally do auto-CGRA/GPU/TPU code generation.
      • Google's HLO can do XLA just like Jax or Chris Latner's Mojo is doing
      • LLVM Polly can do backend compilation with very good performance insight for a single machine.
      • Linalg IR (By the way Linear Algebra extensions have been accepted from the C++26 community that maps the header to this primitive IR) has the insight from the mathematical view to transform the matmul and transpose to be only one-time transpose. (together with many other mathematical optimizations) and has the best insight to clear away the linear algebra residual dead primitive.
    • Motivation
    • image-20231112080422281
      • LLVM IR/Affine IR/Linalg IR are too heterogeneous in different ways. Sa HLO is a better way of raising from C++ to ML DSLs that is super useful for TPUs. Taking from the uniformed IR to a divergent but idompediency in terms of dataflow (especially IO) and semantic to easily codegen to different dialects are super useful for current development for the compiler to TPU/GPU/CPU extensions.
      • For raising and lowering, it is actually impossible to embed the same logic with no information loss. Say, I'm writing the predefined functions for an application, For cross-platform optimization in MLIR is good for memory translation and compliance to different targets' views from a data movement perspective if you are lowering to XLA.
      • For the impossible dimensions for compatibility, debug information and performance insight, Say, a library that's been optimized may be nonsense from a lowered IR perspective. If you don't have the knowledge of both IR, it's stupid abstraction.
      • DataFlow may be completely wrong, so we need a residual IO spec generator to maintain the idompediency.
      • Compared with HAILE and Fortran MLIR, we require a lot of functionality wise upgrades.
  2. Compiler Solution: The MLIRSynth Framework - A virtual Compilation Phase abstraction:
image-20231112081336639

Heuristics: candidate set for getting the phi instruction out to match the set between target dialect and source dialect.

Soundiness: CBMC Z3 for determining the correctness statically.

To extract φ (phi) with the candidate set, the algorithm follows a bottom-up synthesis approach. Here is a summary of the process:

  1. Initialization: The algorithm starts by creating a candidate set (C) with valid candidates that produce distinct values from each other. This set includes candidates that return the arguments of the reference function (f) and simple constants.
  2. Enumeration: The algorithm iterates through the set of operations in the grammar. For each operation, it identifies sets of possible operands, attributes, and regions based on the operation signature.
  3. Candidate Generation: The algorithm generates possible candidates by taking the Cartesian product of sets of operands, attributes, and regions.
  4. Candidate Checking: Each candidate in the set is validated using a series of static checks, ordered by complexity. These checks include type correctness and additional checks via dialects verification interface.
  5. Equivalence Pruning and Validation: If the static checks succeed, the algorithm uses MLIR's execution engine to compile the candidate. It then checks φobsn by executing the candidate program (f') on a set of inputs and comparing the output value with the output value produced by the reference function (f).
  6. Specification Checking: The algorithm checks if the candidate satisfies the specifications φ_{obsn} and φ_{obsN} by comparing the outputs of the candidate and the reference function on a small finite set of inputs (In) and a large finite set of inputs (IN), respectively.
  7. Illustrative Example:
image-20231116104246652

The above is from one dialect to the other

  1. Key Results:
image-20231116105913960
image-20231116110116097

The evaluation over PollyBench on 8700k and 3990x summary tells us about the performance and effectiveness of the mlirSynth algorithm in raising programs to higher-level dialects within MLIR. The TPU performs well over LLVM for all(because LLVM is not a good IR for heterogenous accelerator) It provides information on the synthesis time, validity checks, and the impact of type information and candidate pruning on the synthesis process. The summary also mentions the performance improvement achieved by the raised programs compared to existing compilation flows, as well as the potential for further improvements and future work.

  1. Discussion and Future Directions:
  • Benefits:
    • The bottom-up enumerative synthesis approach in MLIR allows for raising dialect levels within the MLIR framework.
    • The retargetable approach is applied to Affine IR, raising it to the Linalg and HLO IRs.
    • The raised IR code, when compiled to different platforms, outperforms existing compilation flows in terms of performance.
    Implications:
    • The use of polyhedral analysis in the compilation community has been extensively explored, but MLIR-Synth offers a different approach by using polyhedral analysis to raise dialect levels instead of lowering code.
    • The synthesis process in MLIR-Synth involves type filtering, candidate evaluation, and equivalence checking, which significantly reduces synthesis time compared to a naive algorithm.
    Future Work:
    • The authors plan to raise programs to multiple target dialects and improve the synthesis search by reusing previous program space explorations.
    • They also aim to integrate model checking into the synthesis process and evaluate raising to new and emerging dialects of MLIR.
    • The scalability of the synthesis algorithm will be improved to handle larger benchmark suites.
  • The middle IR is always there for sure that is easier been developed from different angle but it's not the killer app for giving a new tool. The speed up from the tool is basically the backend that already has.

Reliable and Fast DWARF-Based Stack Unwinding @OOPSLA19

Dwarf is a bytecode format for leaving runtime debugging info based on the symbolic register and memory location, which gives a recoverable last instruction and call frame info. Given the current unwind is slow and Google traces will use frame pointer to accelerate the production fast unwind, the author provides the fix point control flow analysis based validation and synthesis.

On running every line of code, the symbolic value will be eval to locate the stack frame, it will recursively walk stack to unwind for every call frame.

By architectural advantage, we can leverage offset based on un-updated varaibles during computation like %rip or %

Continue reading "Reliable and Fast DWARF-Based Stack Unwinding @OOPSLA19"

How to do c++ reflection with pointer.

The serialization struct can be a lot, but like CRIU ways of interpreting the c struct is tedious; you need to dump and restore the Linux state, which seems too tedious; rather, I need to write a flat struct that can be reconstructed and use concept trait to require the dump and restore things.

In CRIU, they performs like this:

// SPDX-License-Identifier: MIT

syntax = "proto2";

message bpfmap_data_entry {
	required uint32	map_id			= 1;
	required uint32	keys_bytes		= 2;	/* Bytes required to store keys */
	required uint32	values_bytes		= 3;	/* Bytes required to store values */
	required uint32 count			= 4;	/* Number of key-value pairs stored */
}

In dump and restore process:

int do_collect_bpfmap_data(struct bpfmap_data_rst *r, ProtobufCMessage *msg, struct cr_img *img,
			   struct bpfmap_data_rst **bpf_hash_table)
{
	int ret;
	int table_index;

	r->bde = pb_msg(msg, BpfmapDataEntry);
	ret = bpfmap_data_read(img, r);
	if (ret < 0)
		return ret;

	table_index = r->bde->map_id & BPFMAP_DATA_HASH_MASK;
	r->next = bpf_hash_table[table_index];
	bpf_hash_table[table_index] = r;

	pr_info("Collected bpfmap data for %#x\n", r->bde->map_id);
	return 0;
}
int restore_bpfmap_data(int map_fd, uint32_t map_id, struct bpfmap_data_rst **bpf_hash_table)
{
	struct bpfmap_data_rst *map_data;
	BpfmapDataEntry *bde;
	void *keys = NULL;
	void *values = NULL;
	unsigned int count;
	LIBBPF_OPTS(bpf_map_batch_opts, opts);

	for (map_data = bpf_hash_table[map_id & BPFMAP_DATA_HASH_MASK]; map_data != NULL; map_data = map_data->next) {
		if (map_data->bde->map_id == map_id)
			break;
	}

	if (!map_data || map_data->bde->count == 0) {
		pr_info("No data for BPF map %#x\n", map_id);
		return 0;
	}

	bde = map_data->bde;
	count = bde->count;

	keys = mmap(NULL, bde->keys_bytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (keys == MAP_FAILED) {
		pr_perror("Can't map memory for BPF map keys");
		goto err;
	}
	memcpy(keys, map_data->data, bde->keys_bytes);

	values = mmap(NULL, bde->values_bytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (values == MAP_FAILED) {
		pr_perror("Can't map memory for BPF map values");
		goto err;
	}
	memcpy(values, map_data->data + bde->keys_bytes, bde->values_bytes);

	if (bpf_map_update_batch(map_fd, keys, values, &count, &opts)) {
		pr_perror("Can't load key-value pairs to BPF map");
		goto err;
	}
	munmap(keys, bde->keys_bytes);
	munmap(values, bde->values_bytes);
	return 0;

err:
	munmap(keys, bde->keys_bytes);
	munmap(values, bde->values_bytes);
	return -1;
}

The problem is the above process does not leverage the cpp feature that uses the static compilation capability to generate visitor member function for every struct that passes to protobuf automatically. However, this is not perfect because in C, stack will have some type like this

union {
    uint64 _make_it_8_byte_aligned_;
    WASMMemoryInstance memory_instances[1];
    uint8 bytes[1];
} global_table_data;

BlockAddr block_addr_cache[BLOCK_ADDR_CACHE_SIZE][BLOCK_ADDR_CONFLICT_SIZE];

/* Heap data base address */
DefPointer(uint8 *, heap_data);
/* Heap data end address */
DefPointer(uint8 *, heap_data_end);
/* The heap created */
DefPointer(void *, heap_handle);

It sounds like can not automatically pass to struct_pack. So still need a stub cpp struct for protobuf, but this time, no need to manually write protobuf, every metadata generation is compile time, and runtime only do the serialization. The above struct can be defined below in c++.

struct WAMRMemoryInstance {
    /* Module type */
    uint32 module_type;
    /* Shared memory flag */
    bool is_shared;
    /* Number bytes per page */
    uint32 num_bytes_per_page;
    /* Current page count */
    uint32 cur_page_count;
    /* Maximum page count */
    uint32 max_page_count;
    /*
     * Memory data begin address, Note:
     *   the app-heap might be inserted in to the linear memory,
     *   when memory is re-allocated, the heap data and memory data
     *   must be copied to new memory also
     */
    std::vector<uint8> memory_data;

    /* Heap data base address */
    std::vector<uint8> heap_data;
}
WAMRMemoryInstance memory_instances;

std::array<std::array<WAMRBlockAddr, BLOCK_ADDR_CACHE_SIZE>, BLOCK_ADDR_CONFLICT_SIZE> block_addr_cache;

/* Heap data base address */
std::vector<uint8> heap_data;

Then define the concept to illustrate the trait

template <typename T, typename K>
concept SerializerTrait = requires(T &t, K k) {
    { t->dump(k) } -> std::same_as<void>;
    { t->restore(k) } -> std::same_as<void>;
};

impl for every stub struct.

void dump(WASMMemoryInstance *env) {
    module_type = env->module_type;
    is_shared = env->is_shared;
    num_bytes_per_page = env->num_bytes_per_page;
    cur_page_count = env->cur_page_count;
    max_page_count = env->max_page_count;
    memory_data.resize(env->memory_data_size);
    memcpy(memory_data.data(), env->memory_data, env->memory_data_size);
    heap_data = std::vector<uint8>(env->heap_data, env->heap_data_end);
};
void restore(WASMMemoryInstance *env) {
    env->module_type = module_type;
    env->is_shared = is_shared;
    env->num_bytes_per_page = num_bytes_per_page;
    env->cur_page_count = cur_page_count;
    env->max_page_count = max_page_count;
    env->memory_data_size = memory_data.size();
    env->memory_data = (uint8 *)malloc(env->memory_data_size);
    memcpy(env->memory_data, memory_data.data(), env->memory_data_size);
    env->heap_data = (uint8 *)malloc(heap_data.size());
    memcpy(env->heap_data, heap_data.data(), heap_data.size());
    env->heap_data_end = env->heap_data + heap_data.size();
};

Reference

  1. https://github.com/alibaba/yalantinglibs/pull/122
  2. https://www.youtube.com/watch?v=myhB8ZlwOlE

Encountering `::signbit` stuff not passing to `math.h` in MacOS 12.4

TL;DR

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:317:9: error: no member named 'signbit' in the global namespace
using ::signbit;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:318:9: error: no member named 'fpclassify' in the global namespace
using ::fpclassify;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:319:9: error: no member named 'isfinite' in the global namespace
using ::isfinite;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:320:9: error: no member named 'isinf' in the global namespace
using ::isinf;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:321:9: error: no member named 'isnan' in the global namespace
using ::isnan;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:322:9: error: no member named 'isnormal' in the global namespace
using ::isnormal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:323:9: error: no member named 'isgreater' in the global namespace
using ::isgreater;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:324:9: error: no member named 'isgreaterequal' in the global namespace
using ::isgreaterequal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:325:9: error: no member named 'isless' in the global namespace
using ::isless;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:326:9: error: no member named 'islessequal' in the global namespace
using ::islessequal;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:327:9: error: no member named 'islessgreater' in the global namespace
using ::islessgreater;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:328:9: error: no member named 'isunordered' in the global namespace
using ::isunordered;
      ~~^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/cmath:329:9: error: no member named 'isunordered' in the global namespace
using ::isunordered;
      ~~^

When I was compiling LLVM recently I found this, it may be because my CommandLineTool is outdated as described in stackoverflow. And I reinstalled it with following code added.

using ::signbit _LIBCPP_USING_IF_EXISTS;
using ::fpclassify _LIBCPP_USING_IF_EXISTS;
using ::isfinite _LIBCPP_USING_IF_EXISTS;
using ::isinf _LIBCPP_USING_IF_EXISTS;
using ::isnan _LIBCPP_USING_IF_EXISTS;
using ::isnormal _LIBCPP_USING_IF_EXISTS;
using ::isgreater _LIBCPP_USING_IF_EXISTS;
using ::isgreaterequal _LIBCPP_USING_IF_EXISTS;
using ::isless _LIBCPP_USING_IF_EXISTS;
using ::islessequal _LIBCPP_USING_IF_EXISTS;
using ::islessgreater _LIBCPP_USING_IF_EXISTS;
using ::isunordered _LIBCPP_USING_IF_EXISTS;
using ::isunordered _LIBCPP_USING_IF_EXISTS;

_LIBCPP_USING_IF_EXISTS is defined as # define _LIBCPP_USING_IF_EXISTS __attribute__((using_if_exists)), simply pass if no defined in the global namespace.

Then the following code output error

using _Lim = numeric_limits<_IntT>;

add another header in

#include <limits>

Then comes to the std::isnan using bypassing no definition error in llvm/lib/Support/NativeFormatting.cpp.

error: expected unqualified-id for std::isnan(N)

just drop the std::

The full formula for riscv-rvv-llvm is located in https://github.com/victoryang00/homebrew-riscv, if anything above happens, do as the above specifies.

A command line string to the class constructor.

There's the demand of passing a name and doing the class construction of this name. I don't want to make the switch case on the construct. Let's do the hack!

I found the base implementation in stackoverflow.

template <class T> void* constructor() { return (void*)new T(); }

struct factory
{
   typedef void*(*constructor_t)();
   typedef std::map<std::string, constructor_t> map_type;
   map_type m_classes;

   template <class T>
   void register_class(std::string const& n)
   { m_classes.insert(std::make_pair(n, &constructor<T>)); }

   void* construct(std::string const& n)
   {
      map_type::iterator i = m_classes.find(n);
      if (i == m_classes.end()) return 0; // or throw or whatever you want
      return i->second();
   }
};

factory g_factory;

#define REGISTER_CLASS(n) g_factory.register_class<n>(#n)

The problem is it does not allow the arg passing in construction. My class accepts the arguments module.

template <class T, typename M_> void *constructor(M_ *module_) {
    return (void *)new T{reinterpret_cast<M_ *>(module_)};
}
template <typename M_> struct arg_to_pass {
    typedef void *(*constructor_t)(M_ *);
    typedef std::map<std::string, constructor_t> map_type;
    map_type m_classes;
    M_ *module;

    template <class T> void register_class(std::string const &n, M_ *&module_) {
        module = module_;
        m_classes.insert(std::make_pair(n, &constructor<T, M_>));
    }

    void *construct(std::string const &n) {
        auto i = m_classes.find(n);
        if (i == m_classes.end())
            return nullptr; // or throw or whatever you want
        return i->second(module);
    }
};

arg_to_pass<Module> pass_factory;

#define REGISTER_CLASS(n, m_) pass_factory.register_class<n>(#n, m_)

This will resolve all the problem.

高级语言 to LLVM 的解释层

最近在做编译原理课程设计的设计,看了很多到 LLVM 的编译器的想法,同时发现 Rust 类型体操作为黑魔法合集也能带给社区很多新鲜玩意,就把之前设计 Chocopy LLVM 层的一些小想法放在这,上科大的同学想玩可以加个piazza,invite code: CHOCOPY。有一部分参考 High Level Constructs to LLVM_IR, 范型的设计更多参考 rust 和 c。

Continue reading "高级语言 to LLVM 的解释层"

A weird thing in arm64 of operator << in gcc-11

I'm trying to do some log stuff in a Compiler project. When I'm trying to use the fmt::format library.

It was safe and sound to run with apple-clang 13, but when it comes to gcc-11 for the following line:

if ((x.second)->is_list_type()) {
    LOG(INFO) << fmt::format("{} : [{}]", x.first,
            ((ClassValueType *)((ListValueType *)x.second)->elementType)->className);
}

LogStream is something like:

class LogStream {
public:
    LogStream() { sstream_ = new std::stringstream(); }
    ~LogStream() = default;

    template <typename T> LogStream &operator<<(const T &val) noexcept {
        (*sstream_) << val;
        return *this;
    }

    friend class LogWriter;

private:
    std::stringstream *sstream_;
};


The operator << gets error reading the memory byte from the fmt byte, possibly because the author of GCC is not aware the pointer passed do not fit in the following ldur style of stream out. On x86 OSX machine, the GCC have some _M_is_leaked() check in the same line and on Windows MSVC, the line has reported the memory leakage for doubly linked pointer.

The compiled code is:

There's trick to maintain a compiler that have a universal error code output.