
WAL & Durability

Crash recovery and write-ahead logging

The Write-Ahead Log (WAL) ensures that committed transactions survive crashes. Before any data is considered committed, it must be written to the WAL and flushed to disk.

The WAL Principle

Rule: Log before you do

1. Write all changes to the WAL
2. fsync() the WAL to disk (data is now durable)
3. Update the in-memory delta (data is now visible)
4. Return success to the caller

Crash after step 2: replay the WAL on restart → changes recovered
Crash before step 2: changes lost, but that's OK (the transaction never committed)
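
As a sketch in code (all names here are illustrative, not KiteDB's actual internals):

    // "Log before you do": the commit path in sketch form.
    interface Change { op: string; payload: Uint8Array }
    interface Tx { id: bigint; changes: Change[] }
    declare function walAppend(record: Uint8Array): void;
    declare function walFsync(): Promise<void>;
    declare function applyToDelta(changes: Change[]): void;
    declare function encodeRecord(txId: bigint, c: Change): Uint8Array;
    declare function encodeCommit(txId: bigint): Uint8Array;

    async function commit(tx: Tx): Promise<void> {
      for (const c of tx.changes) walAppend(encodeRecord(tx.id, c)); // 1. log changes
      walAppend(encodeCommit(tx.id));
      await walFsync();           // 2. durable from this point on
      applyToDelta(tx.changes);   // 3. visible from this point on
      // 4. returning normally signals success to the caller
    }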

WAL Record Format

Each operation is stored as a framed record:

Field              Size
Length             4 bytes
Type               1 byte
Flags              1 byte
Reserved           2 bytes
TxID               8 bytes
Payload            variable
CRC32C + Padding   align to 8 bytes

Record Types:

BEGIN, COMMIT, ROLLBACK, CREATE_NODE, DELETE_NODE, ADD_EDGE, DELETE_EDGE, SET_NODE_PROP, DEL_NODE_PROP
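
In code, a frame might be assembled like this (the crc32c helper and the exact field semantics, such as what Length covers, are assumptions):

    // Sketch of the record framing from the table above.
    declare function crc32c(bytes: Uint8Array): number;

    function frameRecord(type: number, flags: number, txId: bigint, payload: Uint8Array): Uint8Array {
      const header = 4 + 1 + 1 + 2 + 8;              // Length, Type, Flags, Reserved, TxID
      const unpadded = header + payload.length + 4;  // + CRC32C
      const total = (unpadded + 7) & ~7;             // pad to 8-byte alignment
      const buf = new Uint8Array(total);
      const view = new DataView(buf.buffer);
      view.setUint32(0, total, true);                // Length (assumed: whole frame)
      view.setUint8(4, type);                        // Type (BEGIN, COMMIT, ...)
      view.setUint8(5, flags);                       // Flags
      view.setUint16(6, 0, true);                    // Reserved
      view.setBigUint64(8, txId, true);              // TxID
      buf.set(payload, header);                      // Payload
      view.setUint32(header + payload.length,
        crc32c(buf.subarray(0, header + payload.length)), true); // CRC32C
      return buf;                                    // trailing bytes stay zero (padding)
    }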

Circular Buffer

The WAL is a fixed-size circular buffer. When it fills up, old (already checkpointed) data is overwritten:

[Diagram: fixed-size ring (e.g. 64 MB); reclaimed space sits behind TAIL, free space ahead of HEAD]

HEAD: where new records are written
TAIL: start of unprocessed records (for replay)

When HEAD catches up to TAIL → Trigger checkpoint to free space
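
In offset terms (sizes and names are illustrative):

    // Sketch of HEAD/TAIL bookkeeping in a fixed-size ring.
    const WAL_SIZE = 64 * 1024 * 1024;  // e.g. 64 MB

    // Bytes available before HEAD would catch up to TAIL.
    function freeSpace(head: number, tail: number): number {
      return head >= tail
        ? WAL_SIZE - (head - tail)  // contiguous case (treats head === tail as empty)
        : tail - head;              // HEAD has wrapped past the end
    }

    function needsCheckpoint(head: number, tail: number, recordLen: number): boolean {
      // A write that would make HEAD catch up to TAIL triggers a checkpoint.
      return freeSpace(head, tail) <= recordLen;
    }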

Dual-Region Design

The WAL is split into primary (75%) and secondary (25%) regions:

Why Two Regions?

During checkpoint:

  • The primary region is being READ to build the new snapshot
  • New transactions need somewhere to WRITE

Primary (75%): being read for checkpoint
Secondary (25%): new writes go here

After checkpoint completes:

  • Primary is cleared (data is in new snapshot)
  • Secondary becomes the new primary
  • Writes continue without interruption
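
The swap itself can be as simple as this (structure and names are illustrative):

    // Sketch of promoting the secondary region after a checkpoint.
    interface Region { head: number; tail: number; base: number }
    interface Wal { primary: Region; secondary: Region }

    function finishCheckpoint(wal: Wal): void {
      // Primary's contents are now captured by the new snapshot: clear it.
      wal.primary.head = wal.primary.tail = wal.primary.base;
      // Secondary, which absorbed writes during the checkpoint, is promoted.
      [wal.primary, wal.secondary] = [wal.secondary, wal.primary];
      // Writers keep appending to the (new) primary with no interruption.
    }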

Durability Guarantees

KiteDB provides configurable durability:

Sync Modes

Mode             fsync behavior                   Trade-off
full (default)   fsync on every commit            Safest, slower writes
batch            fsync every N commits or T ms    Better throughput, small loss window
off (danger)     no fsync (OS decides)            Fastest, data loss on crash

For most applications, full is the right choice. Use batch for high write throughput with acceptable risk.
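
As a configuration sketch (the open() signature and the option names beyond syncMode are illustrative; check the API reference for the exact mode identifiers, since the fast-write profile below also mentions a Normal mode):

    // Hypothetical open-time configuration; option names are illustrative.
    declare function open(path: string, opts: {
      syncMode?: 'full' | 'batch' | 'off';
      batchCommits?: number;   // assumed knob: fsync every N commits...
      batchWindowMs?: number;  // ...or after T milliseconds
    }): Promise<unknown>;

    const db = await open('graph.kdb', {
      syncMode: 'batch',
      batchCommits: 64,
      batchWindowMs: 5,
    });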

Fast Writes (Single-File)

Recommended profile for high write throughput:

  • syncMode = Normal
  • groupCommitEnabled = true
  • groupCommitWindowMs = 2
  • beginBulk() + batch APIs for ingest (MVCC disabled)
  • Optional: increase walSizeMb (e.g., 64 MB) for heavy ingest to reduce checkpoint frequency

Durability note: Normal mode does not fsync on every commit. An OS crash can lose recent commits, but application crashes are recovered via WAL replay.
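
Put together, the profile might look like this (the options object shape and the bulk API methods are sketched from the bullets above, not a verbatim API):

    // Sketch of the fast-write profile; treat option and method names
    // as illustrative of the settings listed above.
    declare function open(path: string, opts: Record<string, unknown>): Promise<any>;
    declare const edges: Array<[number, number]>;

    const db = await open('graph.kdb', {
      syncMode: 'Normal',         // no fsync per commit (see note above)
      groupCommitEnabled: true,
      groupCommitWindowMs: 2,
      walSizeMb: 64,              // optional: larger WAL for heavy ingest
    });

    const bulk = db.beginBulk();  // batch ingest path, MVCC disabled
    for (const [src, dst] of edges) bulk.addEdge(src, dst);
    await bulk.commit();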

Crash Recovery

On database open, the WAL is replayed to rebuild the delta:

Recovery Process

1. Read the header to find the WAL boundaries
2. Scan from TAIL to HEAD
3. For each record, validate its CRC32C
4. If valid → apply to the delta; if invalid → stop (incomplete write)
5. Handle incomplete transactions: BEGIN without COMMIT → discard

Recovery time: O(WAL size), typically < 1 second
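
The replay loop might look like this; the parsing helpers are stand-ins, and buffering operations until their COMMIT is seen is one way to realize step 5:

    // Sketch of WAL replay; helpers and the WalRecord shape are
    // illustrative, not KiteDB's actual internals.
    interface WalRecord { type: string; txId: bigint; crcOk: boolean }
    declare function readRecordAt(offset: number): { rec: WalRecord; next: number } | null;
    declare function applyToDelta(rec: WalRecord): void;

    function replay(tail: number, head: number): void {
      const pending = new Map<bigint, WalRecord[]>(); // ops seen since BEGIN
      let offset = tail;
      while (offset !== head) {
        const r = readRecordAt(offset);
        if (r === null || !r.rec.crcOk) break;        // bad CRC: incomplete write, stop
        const { rec, next } = r;
        if (rec.type === 'BEGIN') pending.set(rec.txId, []);
        else if (rec.type === 'COMMIT') {
          for (const op of pending.get(rec.txId) ?? []) applyToDelta(op);
          pending.delete(rec.txId);
        } else if (rec.type === 'ROLLBACK') pending.delete(rec.txId);
        else pending.get(rec.txId)?.push(rec);        // buffer until COMMIT
        offset = next;
      }
      // Whatever remains in `pending` is BEGIN without COMMIT: discarded.
    }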

When Checkpoint Happens

Automatic triggers:

1. WAL reaches 75% capacity
2. Configured time interval elapses (e.g., every 5 minutes)
3. Graceful shutdown
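
The trigger logic amounts to something like this (thresholds and helper names are illustrative):

    // Sketch of the automatic checkpoint triggers listed above.
    declare function walUsedFraction(): number;        // 0..1
    declare function msSinceLastCheckpoint(): number;
    declare function startCheckpoint(): Promise<void>;

    const CAPACITY_TRIGGER = 0.75;           // 1. WAL at 75% capacity
    const INTERVAL_MS = 5 * 60_000;          // 2. e.g. every 5 minutes

    async function maybeCheckpoint(): Promise<void> {
      // 3. a graceful shutdown would call startCheckpoint() unconditionally
      if (walUsedFraction() >= CAPACITY_TRIGGER ||
          msSinceLastCheckpoint() >= INTERVAL_MS) {
        await startCheckpoint();
      }
    }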

Manual checkpoint:

await db.optimize();

During checkpoint:

  • Reads continue (from old snapshot + delta)
  • Writes continue (to secondary WAL region)
  • No downtime

Avoiding WAL Overflow

The WAL has a fixed size once the file is created. For large ingests, use resizeWal (offline) to grow it, or rebuild into a new file. To prevent a single transaction from overfilling the active WAL region, split work into smaller commits (see bulkWrite or chunked beginBulk() sessions, sketched below) and consider disabling background checkpoints during ingest.
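
For example, a chunked ingest might look like this (the chunk size and the bulk API shape are illustrative):

    // Sketch: split a large ingest into bounded commits so no single
    // transaction can overfill the active WAL region.
    declare const db: any;
    declare const rows: Array<{ src: number; dst: number }>;

    const CHUNK = 10_000;                    // tune to your WAL size
    for (let i = 0; i < rows.length; i += CHUNK) {
      const bulk = db.beginBulk();           // one bulk session per chunk
      for (const { src, dst } of rows.slice(i, i + CHUNK)) bulk.addEdge(src, dst);
      await bulk.commit();                   // bounded WAL footprint per commit
    }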
