A weird thing with operator<< on arm64 in gcc-11

I'm doing some logging work in a compiler project, using the fmt::format library.

Everything ran safe and sound with apple-clang 13, but with gcc-11 it breaks on the following line:

if ((x.second)->is_list_type()) {
    LOG(INFO) << fmt::format("{} : [{}]", x.first,
            ((ClassValueType *)((ListValueType *)x.second)->elementType)->className);
}

LogStream is something like:

#include <sstream>

class LogStream {
public:
    LogStream() { sstream_ = new std::stringstream(); }
    // Note: the defaulted destructor never deletes sstream_, so the stream
    // leaks; presumably LogWriter is meant to own its lifetime.
    ~LogStream() = default;

    // Forward anything streamable into the underlying stringstream.
    template <typename T> LogStream &operator<<(const T &val) noexcept {
        (*sstream_) << val;
        return *this;
    }

    friend class LogWriter;

private:
    std::stringstream *sstream_;
};
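
To rule out the wrapper itself, here is a minimal sketch (my own, not the project's logger) that keeps the same streaming interface but owns the stringstream by value, so no raw pointer is involved at all:

#include <sstream>
#include <string>

// Minimal sketch: same streaming interface, but the stringstream is owned
// by value, so there is nothing to leak or to misread through a pointer.
class LogStreamByValue {
public:
    template <typename T>
    LogStreamByValue &operator<<(const T &val) noexcept {
        sstream_ << val;
        return *this;
    }

    std::string str() const { return sstream_.str(); }

private:
    std::stringstream sstream_;
};

If the crash goes away with this variant, that would point at the pointer-based wrapper rather than at fmt.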


The operator<< faults while reading the bytes of the fmt result, possibly because GCC's codegen does not realize that the pointer being passed doesn't fit the ldur-style load it emits for streaming out. On an x86 macOS machine, GCC hits an _M_is_leaked() check on the same line, and on Windows, MSVC reports a memory leak for the doubly linked pointer on that line.

The compiled code is:

It takes some tricks to maintain a compiler that produces uniform error output across platforms.

Using the Lustre file system

Recently I've been helping a senior student run experiments (also the experiments for his thesis) on Lustre, and re-read the old PLFS and PMFS papers along the way. The machine is an AMD supercomputing cluster with up to 512 nodes.

[scb5090@ln131%bscc-a6 ~]$ lfs quota -h -u scb5090 /public1
Disk quotas for usr scb5090 (uid 6171):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
       /public1  19.13G    450G    500G       -   50865       0       0       -
uid 6171 is using default file quota setting
[scb5090@ln131%bscc-a6 ~]$ lfs quota -h -u scb5090 /public2
lfs quota: cannot resolve path '/public2': No such file or directory (2)
[scb5090@ln131%bscc-a6 ~]$ lfs quota -h -u scb5090 /public3
Disk quotas for usr scb5090 (uid 6171):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
       /public3      0k      0k      0k       -       0       0       0       -
uid 6171 is using default block quota setting
uid 6171 is using default file quota setting

Roughly, the goal is to measure the scalability of an HDF5-reading program on the parallel file system.
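
As a starting point, here is a rough sketch of the measurement I have in mind, assuming a parallel HDF5 build with MPI-IO; the file name data.h5 and dataset name /dset are placeholders:

#include <hdf5.h>
#include <mpi.h>
#include <cstdio>
#include <vector>

// Rough sketch: every rank opens the same HDF5 file on Lustre through MPI-IO
// and reads one dataset; wall time vs. rank count gives a scalability curve.
// "data.h5" and "/dset" are placeholder names.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);    // MPI-IO file access

    double t0 = MPI_Wtime();
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, fapl);
    hid_t dset = H5Dopen2(file, "/dset", H5P_DEFAULT);

    hsize_t n = H5Dget_storage_size(dset) / sizeof(double);   // assume a contiguous 1-D double dataset
    std::vector<double> buf(n);
    H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf.data());

    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::printf("%d ranks: %.3f s\n", nprocs, t1 - t0);
    MPI_Finalize();
    return 0;
}

Running this at 1, 2, 4, ..., 512 nodes against /public1 should give the curve I'm after.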

TwinVisor: Hardware-isolated Confidential Virtual Machines for ARM @SOSP2021

The foundation of TrustZone


Here's the graph extracted from [1]; it essentially shows the root of trust. A secure system depends on every part of the system cooperating. For SGX, the Trusted Computing Base (trusted counter / RDRAND / hardware SHA / ECDSA) works on a memory region allocated from reserved DRAM called the Enclave Page Cache (EPC), which is initialized at boot time. The EPC is currently limited to 128 MB (raised to 1 TB in Ice Lake, with weakened HW support); only 96 MB (24K 4 KB pages) can be used, as 32 MB is for various metadata. To prevent disruption by physical attacks or privileged-software attacks that modify memory at cacheline granularity, every cacheline can be associated with a Message Authentication Code (MAC), but this alone does not prevent replay attacks. To extend the trusted memory region without introducing huge overheads, one solution is to build a Merkle tree, where every leaf cacheline is covered by a MAC and the root MAC is stored at the EPC. Transactional-memory aborts combined with SGX can be leveraged for a page-fault side channel; the transactional-memory page-fault attack on persistent memory is still under research.
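
To make the replay-protection idea concrete, here is a toy sketch (not the real SGX memory-encryption engine; std::hash stands in for a keyed MAC) of a Merkle tree whose only trusted state is the root:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy illustration of the replay-protection idea: hash every "cacheline",
// then hash pairs upward until one root remains. Only the root needs to
// live in trusted storage; std::hash stands in for a keyed MAC.
static std::size_t combine(std::size_t a, std::size_t b) {
    return std::hash<std::string>{}(std::to_string(a) + ":" + std::to_string(b));
}

std::size_t merkle_root(const std::vector<std::string> &cachelines) {
    std::vector<std::size_t> level;
    for (const auto &line : cachelines)
        level.push_back(std::hash<std::string>{}(line));     // per-line "MAC"
    while (level.size() > 1) {
        std::vector<std::size_t> next;
        for (std::size_t i = 0; i < level.size(); i += 2) {
            std::size_t right = (i + 1 < level.size()) ? level[i + 1] : level[i];
            next.push_back(combine(level[i], right));
        }
        level.swap(next);
    }
    return level.empty() ? 0 : level[0];   // the only value that must stay in trusted storage
}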

For RISC-V, we currently have two enclave proposals, Keystone and Penglai, and every vendor has a different implementation. Keystone essentially uses the limited set of M-mode PMP registers to control U-mode and S-mode access permissions to specified memory regions. The number/priority of PMP entries can be pre-configured, and the addressing modes are naturally aligned power-of-2 regions (NAPOT) and a base-and-bound strategy. Going through machine mode unavoidably introduces physical memory fragmentation and waste: every time you enter another enclave, you have to trap into M-mode once. The good side is that S/U-mode are both enclaved by M-mode, with easy shared buffers and enclave operations across all modes. Penglai has improved a lot since its debut (from the first commit in '19 on Xinlai's SoC to OSDI '21). The point of sPMP is to reduce the TCB in machine mode; it also provides guarded page tables (locked cachelines), a Mountable Merkle Tree, and Shadow Fork to speed things up. However, it introduces double PMPs for the OS to handle, and the page-table-walk overhead can still be high, which makes it hard to be universal.
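
For reference, the NAPOT encoding is simple enough to show directly; this is a small sketch of how a pmpaddr value is formed for a power-of-two region, per the RISC-V privileged spec:

#include <cassert>
#include <cstdint>

// NAPOT pmpaddr encoding (RISC-V privileged spec): for a region of `size`
// bytes (a power of two, >= 8) at a `size`-aligned `base`, the register value
// is the base shifted right by 2 with trailing ones appended; the number of
// trailing ones determines the size (2^(ones+3) bytes).
uint64_t pmpaddr_napot(uint64_t base, uint64_t size) {
    assert(size >= 8 && (size & (size - 1)) == 0);  // power of two, at least 8 bytes
    assert((base & (size - 1)) == 0);               // base is naturally aligned
    return (base >> 2) | ((size >> 3) - 1);
}

// Example: a 64 KiB enclave region at 0x80200000 ->
// pmpaddr = 0x20080000 | 0x1FFF = 0x20081FFF, with pmpcfg.A set to NAPOT.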

Starting from Penglai, IPADS has continuously focused on S-mode enclaves. One application is the dual hypervisor in the secure/non-secure S-mode. Armv8.4 introduced hypervisors in both the secure and non-secure worlds, originally to support cloud-native secure hypervisors. TwinVisor runs unmodified VM images as both normal and confidential VMs. Armv9 introduces the Confidential Compute Architecture (CCA), a similar technology; TwinVisor can be seen as an early open-source take on it.

TrustZone extensions have been supported since Armv7:

  1. AMBA AXI bus extension: adds secure read/write flags on the address lines AWPROT and ARPROT.
  2. Controller (master) extension: adds the SCR.NS bit inside the ARM core, so operations initiated by the core can be marked as issued from the secure or the non-secure state.
  3. TZPC extension: a TZPC is added on the AXI-to-APB side to configure APB controller/peripheral security privileges.
  4. TZASC extension: a memory filter added on top of the DDRC (DMC).
  5. MMU support for the security extensions:
    1. TTBRx_EL0/TTBRx_EL1 extension: in Armv7 these registers are banked between the secure and non-secure states, i.e. each world has its own set, so Linux and the TEE can each maintain their own page tables. The secure OS and the monitor can share a page table if both are 64-bit.
    2. Cache extension: cache lines gain (non-)secure attributes.
    3. VSTTBR_EL2 extension: since Armv8.4, when the non-secure world uses TTBR_EL2 to translate an address, the entry's secure attribute is checked and the translation handled accordingly.
  6. GIC security extensions: interrupts are divided into Group 0, Secure Group 1, and Non-secure Group 1; Group 0 and Secure Group 1 do not trap to Linux.

Proposed Attack Model

The authors consider physical attacks and privileged-software attacks from an N-VM against an S-VM; these can be prevented by controlling the transmission channel.

A TOCTTOU attack enabled by the shared pages used for general-purpose registers; it is defended against in a check-after-load way [50], reading register values before checking them.

Design

  • Horizontal trap: modifies the N-visor to logically deprivilege it without sharing data. Exception Return (ERET) is the only sensitive instruction affecting the trust chain; it is intercepted by TZASC and reported to the S-visor.

  • Shadow S2PT: a shadow page table for VSTTBR_EL2, the same technique used in KVM; page faults carry different status depending on which world they occur in.

  • Split Contiguous Memory Allocation: tricks to improve utilization and speed up memory management in TwinVisor. In Linux, the buddy allocator decides at boot whether a contiguous region is big enough and CMA is carved out there, for better performance of IOMMU paths that require physically contiguous memory. (Such a deterministic algorithm makes memory probing and memory dumps easier, e.g. via rowhammer/DRAMA.)

  • Efficient world switch: flip the NS bit of the SCR_EL3 register at EL3; side-core polling and shared memory avoid context switches (see the sketch after this list).

  • Shadow PV I/O: use shadow I/O rings and a shadow DMA buffer to stay transparent to S-VMs; reduce ring overhead by raising IRQs only on WFx instructions.
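
Roughly, the EL3 side of such a world switch boils down to flipping SCR_EL3.NS before the ERET; the following is only an illustrative sketch (it must run at EL3, e.g. inside an SMC handler, and is not TwinVisor's actual code):

#include <cstdint>

// Sketch of the EL3 side of a world switch: flip SCR_EL3.NS so that the next
// ERET lands in the other world. Only valid at EL3 (e.g. in an SMC handler).
static inline uint64_t read_scr_el3() {
    uint64_t v;
    asm volatile("mrs %0, scr_el3" : "=r"(v));
    return v;
}

static inline void write_scr_el3(uint64_t v) {
    asm volatile("msr scr_el3, %0" :: "r"(v));
    asm volatile("isb");
}

constexpr uint64_t SCR_NS = 1u << 0;   // bit 0: lower ELs are in the non-secure state

void switch_world(bool to_nonsecure) {
    uint64_t scr = read_scr_el3();
    scr = to_nonsecure ? (scr | SCR_NS) : (scr & ~SCR_NS);
    write_scr_el3(scr);
    // ...save/restore the lower-EL context here, then ERET into the other world.
}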

Experiment

Assumption

World switches do not happen very frequently.

Hardware

Kirin 990. (Not scalable to big machines: the Kunpeng 920 does not yet implement Armv8.4, so the scalability claim is not convincing.)

Reference

  1. A Survey on RISC-V Security: Hardware and Architecture, Tao Lu, Marvell Semiconductor Ltd., USA
  2. MIT 6.888
  3. ShieldStore: Shielded In-memory Key-value Storage with SGX
  4. Improving the Performance and Endurance of Encrypted Non-volatile Main Memory through Deduplicating Writes
  5. RISC-V Spec 1.11
  6. Armv7 TrustZone
  7. LWN: CMA and IOMMU

Phosphor - My Pitfalls Writing the Dependency Tainter

Currently, I'm busy writing emails for my Ph.D. applications, taking the TOEFL, and handling the Quantum ESPRESSO library changes and the MadFS optimization, so this may take a while. For now, I need to apply Phosphor's dynamic taint analysis (DTA) tool to the Java test-order-dependency project.

About Surefire integration into normal tests:

  • Maven extension
    • Integrate into Maven and add the redirector
      • Insert the Phosphor plugin class by class.
      • Configuration for Phosphor
      • Class Visitor, Method Visitor, adapter-mode visitors
    • Mutable fields in the dependency tainter
      • Start the taint somewhere and attach the tainted check after the test
      • Assert the JUnit stuff in the check/comparison.
      • Brittle assertions in check(Taint), recursively.
    • Output the tainted version into the Surefire executable folder
  • Debug
    • mvn install -Dmaven.surefire.debug -f /Volumes/DataCorrupted/project/UIUC/bramble/integration-tests/pom.xml and attach the trace point.
      • Start from the Maven compilation.

Brittle Assertion

This outputs only the dependency for one test, as introduced in Oracle Polish JPF. For the dependency between test1 and test2:

For an NPE, get the pair from iDFlakies first.

JVM ASM

Reference

  1. https://www.kingkk.com/2020/08/ASM%E5%8E%86%E9%99%A9%E8%AE%B0/

NVOverlay: Enabling Efficient and Scalable High-Frequency Snapshotting to NVM

NVOverlay is a technique for taking fast snapshots of DRAM or cache contents and making them persistent. It relies on tracking techniques similar to what the commercially available VMware or VirtualBox do for storage. In addition, it uses NVM mapping to reduce write amplification compared with state-of-the-art log-based snapshotting (undo logging writes data to NVM before it is updated, and redo logging likewise adds write amplification; this is not the XPBuffer write amplification, but the extra data written for the log).

So-called high-frequency snapshotting copies all the relevant data at millisecond-scale intervals as the CPU loads from and stores to DRAM. A microservice thread may issue many random accesses to MVCC-managed data, especially time-series data. To make these loads/stores easier to debug, the copying process should be fast and scalable.


In the paper's architecture figure, OMC means Overlay Memory Controller.

Cache coherence is considered carefully. For scalability to 4U or 8U chassis, they add a tag walk that stores the local LLC tags. All LLC slices are VIPT because they are shared, and for the same reason the tags can be shared yet remain unique within one shared space.

For the distributed, system-wide problem of synchronizing epoch counters between VDs, they use a Lamport clock to maintain the integrity of dirty cache lines.
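
The epoch synchronization is essentially the textbook Lamport clock rule; here is a minimal sketch of that rule (my own illustration, not NVOverlay's code):

#include <algorithm>
#include <cstdint>

// Textbook Lamport clock, as used conceptually to order epochs across VDs:
// bump on every local event, and on receive take max(local, remote) + 1.
struct LamportClock {
    uint64_t time = 0;

    uint64_t tick() {                       // local event (e.g. a local epoch advance)
        return ++time;
    }
    uint64_t on_send() {                    // attach the current value to an outgoing message
        return tick();
    }
    uint64_t on_receive(uint64_t remote) {  // message carrying another VD's epoch
        time = std::max(time, remote) + 1;
        return time;
    }
};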

Continue reading "NVOverlay: Enabling Efficient and Scalable High-Frequency Snapshotting to NVM"

Master Alchemist: Convex Optimization Review

The final project this time was the Federated Learning one I posted about before; it's a recent hot topic in convex optimization. During the presentation, Yuanming Shi asked us whether claiming that the flaw in their proof is a real problem would count as a major contribution. We later found that this constant doesn't really affect the bound, which is probably how the Google folks glossed over it, though we still managed to call it into question.

In any case, the professor teaching this course is a genius; he is now involved in building the theory for 6G.

Continue reading "炼丹大师-凸优化复习"

Proposal for *An Online Systematic Scheduling Algorithm over Distributed IO Systems*

In the resource-allocation problem for distributed systems on high-performance computers, we don't really know which device (a disk, a NIC) is more likely to be worn out, or is currently off duty, which may delay getting the data ready. The current solutions are random or round-robin scheduling to even out wear, plus dynamic routing for the fastest path. We can use the collected data to automate this.

A seasoned system administrator may know the patterns of the parameters to tweak, such as the stride on the distributed file system, network MTUs for the InfiniBand card, and the route used to fetch the data. Nowadays, eBPF (extended Berkeley Packet Filter) can record this information, such as IO latency on the storage nodes and network latency across the topology, as time-series data. We can use these data to predict which topology, stride, and other parameters will be the best way to reach the data.

The data arrive online, and the prediction function can be online reinforcement learning. Just like a k-armed bandit, the reward can be a function of latency gains and device-wear parameters, and the updates can come from the real-time latency of disks and networks. The information given to the RL agent can be where the data sit on disk, which data are requested most frequently (DBMS queries or random small files), and how often each disk fails.
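
As a first cut, the scheduler could literally be an epsilon-greedy k-armed bandit over candidate devices or routes, with the reward being negative latency minus a wear penalty; the following sketch assumes exactly that reward shape:

#include <cstddef>
#include <random>
#include <vector>

// Epsilon-greedy k-armed bandit over candidate devices/routes.
// Reward is assumed to be -(observed latency) minus a wear penalty, both
// supplied by the monitoring layer (e.g. eBPF-collected latency series).
class IoBandit {
public:
    IoBandit(std::size_t arms, double epsilon)
        : value_(arms, 0.0), count_(arms, 0), epsilon_(epsilon), rng_(std::random_device{}()) {}

    std::size_t choose() {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng_) < epsilon_) {                          // explore
            std::uniform_int_distribution<std::size_t> pick(0, value_.size() - 1);
            return pick(rng_);
        }
        std::size_t best = 0;                                 // exploit: highest running mean
        for (std::size_t i = 1; i < value_.size(); ++i)
            if (value_[i] > value_[best]) best = i;
        return best;
    }

    // e.g. reward = -(latency) - wear_weight * wear_estimate
    void update(std::size_t arm, double reward) {
        ++count_[arm];
        value_[arm] += (reward - value_[arm]) / count_[arm];  // incremental mean
    }

private:
    std::vector<double> value_;
    std::vector<std::size_t> count_;
    double epsilon_;
    std::mt19937 rng_;
};

The reward shaping, i.e. how to weigh latency against wear, is the part that would need tuning against the collected traces.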

For benchmarks and evaluation, we can measure the statistical gain in our system's latency and the overall disk wear after stress tests.