[CSE231 Paper Reading] Log Structure Filesystem

Motivation

Computer hardware performance began to explode: CPUs were getting faster and RAM was getting bigger; however, while sequential read and write speeds of hard disks were increasing, random read and write speeds, which were limited by physical seek times, were hardly shorter than 10 ms. On the other hand, file systems at the time, whether Unix File System or FFS, had a large number of random reads and writes (at least 5 random writes are required to create a new file in FFS), thus becoming a performance bottleneck for the whole system. At the same time, because of Page cache, the authors argue that random reads are not the main problem: with more and more memory, most of the reads can be cached, so the main problem of LFS is to reduce random writes to the hard disk.

Implementation

  1. File System as a Log
  2. Segment-based bulk writing solves the random write problem, but how does LFS implement read operations? Similar to UFS/FFS, LFS stores the contents of a file within a segment, and also stores the index of the file. Specifically: in Segment, the file contents are stored in a fixed-size data block. Segment0 stores the two data blocks of file2, and the subsequent inode2 stores the indexes of these two data blocks. However, unlike UFS/FFS, LFS inodes are dynamically allocated, so LFS stores an index to the inode at the end of each Segment, called the inode map. in LFS, all inode map contents are cached in the contents, which speeds up reads.
  3. Garbage collection: As mentioned earlier, LFS requires a garbage collection mechanism designed to remove old data. In LFS, multiple segments containing obsolete data blocks are compacted into new data segments, and the old data in them is deleted.
  4. Failure recovery is obviously important for any file system to be able to recover data from a failure. Unlike UFS, which uses the fsck command to recover from a failure, LFS stores a checkpoint for the entire drive: because each Segment of LFS stores the address of the next Segment, the entire file system is organized like a chain. In Checkpoint, LFS stores the address of the first Segment and the last Segment of this chain, so the entire file system can be recovered by reading Checkpoint. lfs updates the data in Checkpoint every 30 seconds.

Experiments

Figure 3/4 shows the disk capacity utilization and fraction alive the segment cleaned with write cost are better than FFS today for uniform data and 10% hot data and 90% cold data. Figure 5 shows that in terms of segment utilization, the hot-and-cold data are better than uniform. Figure 6 shows that the segment distribution occurs when the cost-benefit policy is used to select segments to clean and live blocks grouped by age before being re-written.

Evaluation effectiveness

This evaluation is designed sophistically.

Novelty

Rosenblum is a co-founder of Vmware and Ousterhout is one of the authors of Raft. I don't have any comments on their novelty, just they are tooooo ahead of time.

SoTA Connection

The Segment-based design of LFS coincides with the physical characteristics of SSDs and is therefore widely used in SSD firmware; the memory table/compaction of LSM is in line with the memory buffer and GC of LFS, and new file systems such as btrfs are based on the append-only feature of LSM to implement copy-on-write or multiversion. newer file systems such as btrfs also implement copy-on-write or multi-version features based on the LSM append-only feature. The EXT3/4 also has redo logging with journal/ordered/writeback mode.

Also, the Parallel Log-Structure FS follows the LFS to make parallel FS tailored to N-N write which speeds up the MPI write.

Reference

  1. http://www.eecs.harvard.edu/~cs161/notes/lfs.pdf
  2. https://lwn.net/Articles/353411/
  3. https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf
  4. https://pages.cs.wisc.edu/~remzi/OSTEP/file-lfs.pdf
  5. https://www.usenix.org/legacy/publications/library/proceedings/sd93/seltzer.pdf