Making Checkpointing Fast

Since I officially began working for turso in May of this year (relevant post), I had spent most of my time learning the inner workings of the cloud platform and familiarizing myself with the other codebases and libraries. When I was told a few weeks ago that I could begin spending 80% of my time on the database itself, I was more than thrilled: for the past 10 months, this was the project I would work on for fun, to give myself a break from work.

One of my first tasks was performance work, specifically around our async I/O implementation (io_uring). Benchmarks and flamegraphs showed that under write-heavy workloads, performance degraded sharply after enough writes. Digging in, this wasn't just an I/O problem: our Write-Ahead Log (WAL) was in dire need of attention. The early WAL implementation was sufficient for bootstrapping correctness, but it was now time for a full revamp.

One of several design choices turso made is to use the WAL journal mode exclusively. Until MVCC support is finished, this will remain our primary method of transaction isolation, in addition to providing the usual benefit of journaling: durability.

A checkpoint is the process of moving committed frames from the log back into the main database file. There are different 'modes' of checkpointing, a few of which are: Restart and Truncate, which both restart the log from the beginning (with Truncate going on to physically truncate the WAL file afterwards), and Passive, the default mode, which is triggered periodically during the pager's cache flush.
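
To make the modes concrete, here is a minimal sketch of how they might be represented. The names mirror SQLite's checkpoint modes; the actual enum in tursodb may differ (Full is included for completeness even though it is not discussed here).

```rust
// Hypothetical sketch of the checkpoint modes described above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckpointMode {
    /// Backfill what is safe without blocking readers; the periodic default.
    Passive,
    /// Wait as needed so that every committed frame gets backfilled.
    Full,
    /// Backfill everything, then restart the log from the beginning.
    Restart,
    /// Like Restart, but also physically truncate the WAL file afterwards.
    Truncate,
}

fn main() {
    // The periodically triggered checkpoint runs in the default mode.
    let auto = CheckpointMode::Passive;
    println!("auto-checkpoint mode: {:?}", auto);
}
```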

There are a number of locks shared between all connections. They must be held either exclusively, by a writer or a checkpointer, or by readers to signify that there is an active read transaction and what the maximum frame for that read transaction is.
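
As a rough illustration of that shared state, here is a sketch assuming a fixed number of read-mark slots; the names (`WalLocks`, `read_marks`, and so on) are illustrative, not the actual tursodb definitions.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};

// A rough sketch of the shared WAL lock state described above.
struct WalLocks {
    /// Held exclusively by the single active writer (or by checkpoint modes
    /// that need to block writers).
    write_lock: Mutex<()>,
    /// Held exclusively by the connection running a checkpoint.
    checkpoint_lock: Mutex<()>,
    /// Read marks: an active read transaction claims a slot and records the
    /// maximum WAL frame in its snapshot (0 = slot free).
    read_marks: [AtomicU64; 5],
}

fn main() {
    let locks = WalLocks {
        write_lock: Mutex::new(()),
        checkpoint_lock: Mutex::new(()),
        read_marks: std::array::from_fn(|_| AtomicU64::new(0)),
    };

    // A reader starting a transaction pins its snapshot in a free slot.
    locks.read_marks[1].store(42, Ordering::Release);

    // A checkpointer takes checkpoint_lock (and, for some modes, write_lock)
    // before touching the log; readers' pinned frames bound how far it may go.
    let _ckpt = locks.checkpoint_lock.lock().unwrap();
    println!(
        "reader slot 1 pinned at frame {}",
        locks.read_marks[1].load(Ordering::Acquire)
    );
}
```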

The checkpoint operation, somewhat like a read transaction, also has its own minimum and maximum frames that it needs to backfill (backfilling is the process of writing pages from the log to the database file). All current readers are evaluated, and depending on the checkpoint's mode, either the checkpoint's max_frame is decreased or a Busy error is returned. This prevents moving frames that are not in a reader's snapshot to the database file, where the reader might see them before it should.
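
Here is a hedged sketch of that clamping decision, with hypothetical names (`safe_max_frame`, `must_take_all`) standing in for the real implementation:

```rust
// Illustrative only: clamping a checkpoint's max_frame against active readers.
#[derive(Debug, PartialEq)]
enum CheckpointError {
    Busy,
}

/// Returns the (min_frame, max_frame) range this checkpoint may backfill.
fn safe_max_frame(
    wal_max_frame: u64,       // last committed frame in the log
    n_backfill: u64,          // frames already moved by previous checkpoints
    reader_snapshots: &[u64], // max frame pinned by each active reader
    must_take_all: bool,      // Restart/Truncate-style modes need every frame
) -> Result<(u64, u64), CheckpointError> {
    let min_frame = n_backfill + 1;
    let mut max_frame = wal_max_frame;
    for &snapshot in reader_snapshots {
        if snapshot < max_frame {
            if must_take_all {
                // A reader still pins an older snapshot: back off with Busy.
                return Err(CheckpointError::Busy);
            }
            // Otherwise shrink the range so the reader never sees frames
            // from beyond its snapshot in the database file.
            max_frame = snapshot;
        }
    }
    Ok((min_frame, max_frame))
}

fn main() {
    // Two readers pinned at frames 90 and 120; the log has 150 frames and 40
    // are already backfilled.
    let readers = [90, 120];
    println!("{:?}", safe_max_frame(150, 40, &readers, false)); // Ok((41, 90))
    println!("{:?}", safe_max_frame(150, 40, &readers, true)); // Err(Busy)
}
```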

The (simplified) checkpoint flow

Trigger:

Either: a. At the end of a write transaction, during the cacheflush process, the log will check whether the number of frames in the cache is greater than a specific threshold (typically 1000 frames) and auto-trigger a PASSIVE checkpoint (a minimal sketch of this check follows the list).
b. An explicit PRAGMA wal_checkpoint(mode); is executed.
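
A minimal sketch of the auto-trigger check from (a), assuming an illustrative threshold constant rather than the real configuration knob:

```rust
// Hypothetical names; the real threshold is configurable in practice.
const AUTO_CHECKPOINT_THRESHOLD: u64 = 1000;

fn should_auto_checkpoint(frames_in_wal: u64) -> bool {
    frames_in_wal >= AUTO_CHECKPOINT_THRESHOLD
}

fn main() {
    // Called at the end of a write transaction, during the pager's cacheflush.
    for frames in [120, 999, 1000, 4096] {
        println!(
            "{frames:>4} frames in the log -> passive checkpoint? {}",
            should_auto_checkpoint(frames)
        );
    }
}
```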

Process outline:

  1. Locking: Acquire the checkpoint locks needed for the chosen mode (e.g., checkpoint_lock and read-mark 0 for all modes; modes that require “all frames” also hold the exclusive writer lock).
  2. Safe bound: Compute the checkpoint's max_frame by examining active readers, capping it to the lowest pinned frame among them. The previous checkpoint's max is stored as nBackfill, which becomes the current checkpoint's min_frame.
  3. Worklist: Build the list of (page_id → latest_frame ≤ max_frame) candidates to backfill (see the sketch after this list).
  4. Copy: For each selected frame, issue a read from the WAL and write the page into the main DB file at its offset.
  5. Finalize: If all eligible frames were backfilled, fsync and truncate the DB file as needed and, depending on mode, restart/truncate the WAL; advance nBackfill. Release locks and return counts.
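
Here is a rough sketch of step 3, assuming a hypothetical frame-cache layout of page_id → frames; it keeps, for each page, only the latest frame within the checkpoint's bounds, and sorts by page id so later writes can be grouped into contiguous runs:

```rust
use std::collections::HashMap;

// Illustrative sketch of the worklist-building step. The frame-cache layout
// here (page_id -> ascending list of frames) is hypothetical.
fn build_worklist(
    frame_cache: &HashMap<u64, Vec<u64>>,
    min_frame: u64,
    max_frame: u64,
) -> Vec<(u64, u64)> {
    let mut worklist: Vec<(u64, u64)> = frame_cache
        .iter()
        .filter_map(|(&page_id, frames)| {
            frames
                .iter()
                .rev()
                .find(|&&f| f >= min_frame && f <= max_frame)
                .map(|&f| (page_id, f))
        })
        .collect();
    // Sorting by page id lets the later writes to the DB file be grouped into
    // contiguous runs (see the vectored-write discussion further down).
    worklist.sort_by_key(|&(page_id, _)| page_id);
    worklist
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert(3, vec![2, 7, 11]); // page 3 was last written at frame 11
    cache.insert(1, vec![5]);
    cache.insert(9, vec![4, 9]);
    println!("{:?}", build_worklist(&cache, 1, 11)); // [(1, 5), (3, 11), (9, 9)]
}
```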

So what are the performance issues here?

- Copied frames one at a time, with strictly serialized reads and writes.
- Issued writes in no particular page order (whatever order they came out of the frame cache in).
- Never looked at the in-memory page cache, so even already-hot pages were re-read from the WAL every time.

Symptoms:

- Lots of syscalls, poor locality.
- Average checkpoint time was roughly 8-13 ms on a representative test on my machine.
- CPU stalled on I/O and cache misses, even for warm pages.

Correct, but not ideal.

One of the most exciting features we are introducing in tursodb that is not found in SQLite is async I/O (via io_uring). This lets us do things concurrently, so the previous setup, where all writes were fully serialized behind the reads, is not exactly ideal.

I had recently introduced vectored I/O for any contiguous pages written to the DB file: we read pages in batches and then submit a single pwritev syscall with an array of buffers for any group of sequential page IDs. This helped performance, but the serialization of the reads was still a bottleneck.
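
To illustrate, here is a simplified sketch of the grouping step, with the actual syscall submission reduced to a comment; `contiguous_runs` and the fixed PAGE_SIZE are illustrative, not the real code:

```rust
const PAGE_SIZE: u64 = 4096; // assumed fixed here for simplicity

/// Split a page-id-sorted batch into runs of consecutive page ids. Each run
/// can then be written with a single pwritev at the run's starting offset.
fn contiguous_runs(pages: &[(u64, Vec<u8>)]) -> Vec<&[(u64, Vec<u8>)]> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=pages.len() {
        let is_break = i == pages.len() || pages[i].0 != pages[i - 1].0 + 1;
        if is_break {
            runs.push(&pages[start..i]);
            start = i;
        }
    }
    runs
}

fn main() {
    let page = |id: u64| (id, vec![0u8; PAGE_SIZE as usize]);
    let batch = vec![page(2), page(3), page(4), page(7), page(10), page(11)];
    for run in contiguous_runs(&batch) {
        // DB pages are 1-indexed, so page N lives at offset (N - 1) * PAGE_SIZE.
        let offset = (run[0].0 - 1) * PAGE_SIZE;
        // Here we would collect the run's buffers into IoSlices and issue a
        // single pwritev on the DB file at `offset` for the whole run.
        println!("pwritev of {} page(s) at offset {}", run.len(), offset);
    }
}
```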

Old flow:

  Start
  ├─ acquire proper locks
  ├─ compute mxSafeFrame
  ├─ build (page_id, latest_frame≤mxSafeFrame) list
  └─ → ReadPage

ReadPage
  ├─ Issue `pread` for current page
  └─ → WaitReadPage

WaitReadPage
  ├─ Check if current page is still locked (awaiting I/O)
  ├─ Still locked? return I/O
  └─ Otherwise: batch full (pages >= batch_len)? → FlushBatch : → ReadPage

FlushBatch
  ├─ Collect buffers from current batch of read pages
  ├─ Issue a `pwritev` for each contiguous run.
  ├─ Store marker to track completion of run.
  └─ → WaitFlushBatch

WaitFlushBatch
  ├─ If batch complete:
  ├─── More reads? → ReadPage : → Done
  └─ Else: return I/O

Done
  ├─ fsync db file
  ├─ advance nBackfill and increment checkpoint_seq
  ├─ (Restart/Truncate modes) if all frames backfilled, reset the log
  └─ release locks, return counts

New proposed flow (a runnable sketch of the Processing state follows the diagram):

  Start
  ├─ acquire proper locks
  ├─ compute mxSafeFrame
  ├─ build (page_id, latest_frame≤mxSafeFrame) list
  └─ → Processing

Processing
  ├─ drain completed reads into a WriteBatch
  ├─ flush WriteBatch when “good enough” (based on run-length heuristics)
  ├─ issue more reads until limits are hit
  ├─ loop until nothing is left
  └─ → Done

Done
  ├─ fsync db file
  ├─ advance nBackfill and increment checkpoint_seq
  ├─ (Restart/Truncate modes) if all frames backfilled, reset the log
  └─ release locks, return counts
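
The sketch below mimics the Processing state in plain, synchronous Rust so the control flow is runnable on its own; a queue of completed reads stands in for io_uring completions, and the constants, names, and "good enough" heuristic are illustrative.

```rust
use std::collections::VecDeque;

const MAX_INFLIGHT_READS: usize = 8; // illustrative limits, not the real ones
const FLUSH_THRESHOLD: usize = 8;

struct CompletedRead {
    page_id: u64,
    buf: Vec<u8>,
}

fn process(worklist: &mut VecDeque<u64>, completions: &mut VecDeque<CompletedRead>) {
    let mut inflight = 0usize;
    let mut write_batch: Vec<CompletedRead> = Vec::new();

    loop {
        // 1. Drain whatever reads have completed into the pending write batch.
        while let Some(done) = completions.pop_front() {
            inflight -= 1;
            write_batch.push(done);
        }

        // 2. Flush when the batch is "good enough" (here: simply big enough;
        //    the real heuristic also considers contiguous run lengths).
        if write_batch.len() >= FLUSH_THRESHOLD || (worklist.is_empty() && inflight == 0) {
            if !write_batch.is_empty() {
                write_batch.sort_by_key(|r| r.page_id);
                println!("flushing {} pages to the DB file", write_batch.len());
                write_batch.clear();
            }
        }

        // 3. Keep the read queue fed until limits are hit.
        while inflight < MAX_INFLIGHT_READS {
            let Some(page_id) = worklist.pop_front() else { break };
            inflight += 1;
            // Stand-in for submitting an async WAL read; it "completes" instantly.
            completions.push_back(CompletedRead { page_id, buf: vec![0u8; 4096] });
        }

        // 4. Done once there is nothing left to read, wait for, or write.
        if worklist.is_empty() && inflight == 0 && write_batch.is_empty() && completions.is_empty() {
            break;
        }
    }
}

fn main() {
    let mut worklist: VecDeque<u64> = (1..=20).collect();
    let mut completions = VecDeque::new();
    process(&mut worklist, &mut completions);
}
```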

How can we trust that the cached pages are the exact frames we need? SQLite reads each page from the log during checkpoint, so why diverge from this?

We encode an 'epoch' and store it on the page object in memory: a packed combination of the frame_id and the checkpoint_seq, an integer stored in the WAL header that is incremented on each successful checkpoint. So in addition to ensuring that the page is not dirty, we make sure it matches the exact frame and generation. To be even safer, when debug assertions are enabled we still read the page from the log and assert that its contents match any page we would have accepted from the cache.
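
A small sketch of what such an epoch tag could look like, assuming a 32/32-bit split; the exact layout and widths in tursodb may differ.

```rust
// Illustrative packing of (checkpoint_seq, frame_id) into a single u64.
fn encode_epoch(checkpoint_seq: u32, frame_id: u32) -> u64 {
    ((checkpoint_seq as u64) << 32) | frame_id as u64
}

fn decode_epoch(epoch: u64) -> (u32, u32) {
    ((epoch >> 32) as u32, epoch as u32)
}

/// A cached page may be used for backfilling only if it is clean and its
/// epoch matches the exact frame and checkpoint generation we want to copy.
fn can_use_cached_page(page_epoch: u64, dirty: bool, checkpoint_seq: u32, frame_id: u32) -> bool {
    !dirty && page_epoch == encode_epoch(checkpoint_seq, frame_id)
}

fn main() {
    let epoch = encode_epoch(7, 1042);
    assert_eq!(decode_epoch(epoch), (7, 1042));
    assert!(can_use_cached_page(epoch, false, 7, 1042));
    assert!(!can_use_cached_page(epoch, false, 8, 1042)); // stale generation
    println!("epoch checks passed");
}
```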

This results in an excellent cache-hit rate for write benchmarks or any write-heavy load. It is likely to be lower during read-heavy workloads, as we won't have all the relevant frames in memory (they will have been evicted to make room for the read pages), but it is still a tremendous win compared to a guaranteed two syscalls per page.

So why doesn't SQLite do this?

SQLite tends to be conservative about performance optimizations and is more likely to keep things simple. Also, since SQLite allows connections from multiple processes, there are many cases where the page cache will either be empty or not shared between connections, making this optimization far less useful.

New benchmarks (average time spent checkpointing during the write benchmark):

Before: 7-12 ms

After: 2-5 ms