
WAL & Durability

Crash recovery and write-ahead logging

The Write-Ahead Log (WAL) ensures that committed transactions survive crashes. Before any data is considered committed, it must be written to the WAL and flushed to disk.

The WAL Principle

Rule: Log before you do

1. Write all changes to the WAL
2. fsync() the WAL to disk (data is now durable)
3. Update the in-memory delta (data is now visible)
4. Return success to the caller

Crash after step 2: replay the WAL on restart → changes recovered
Crash before step 2: changes lost, but that's OK (the transaction never committed)
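
As a sketch in code (all names here are illustrative, not KiteDB's actual internals):

    // "Log before you do": the commit path in sketch form.
    interface Change { op: string; payload: Uint8Array }
    interface Tx { id: bigint; changes: Change[] }
    declare function walAppend(record: Uint8Array): void;
    declare function walFsync(): Promise<void>;
    declare function applyToDelta(changes: Change[]): void;
    declare function encodeRecord(txId: bigint, c: Change): Uint8Array;
    declare function encodeCommit(txId: bigint): Uint8Array;

    async function commit(tx: Tx): Promise<void> {
      for (const c of tx.changes) walAppend(encodeRecord(tx.id, c)); // 1. log changes
      walAppend(encodeCommit(tx.id));
      await walFsync();           // 2. durable from this point on
      applyToDelta(tx.changes);   // 3. visible from this point on
      // 4. returning normally signals success to the caller
    }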

WAL Record Format

Each operation is stored as a framed record:

Field              Size
Length             4 bytes
Type               1 byte
Flags              1 byte
Reserved           2 bytes
TxID               8 bytes
Payload            variable
CRC32C + Padding   align to 8 bytes

Record Types:

BEGIN, COMMIT, ROLLBACK, CREATE_NODE, DELETE_NODE, ADD_EDGE, DELETE_EDGE, SET_NODE_PROP, DEL_NODE_PROP
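
In code, a frame might be assembled like this (the crc32c helper and the exact field semantics, such as what Length covers, are assumptions):

    // Sketch of the record framing from the table above.
    declare function crc32c(bytes: Uint8Array): number;

    function frameRecord(type: number, flags: number, txId: bigint, payload: Uint8Array): Uint8Array {
      const header = 4 + 1 + 1 + 2 + 8;              // Length, Type, Flags, Reserved, TxID
      const unpadded = header + payload.length + 4;  // + CRC32C
      const total = (unpadded + 7) & ~7;             // pad to 8-byte alignment
      const buf = new Uint8Array(total);
      const view = new DataView(buf.buffer);
      view.setUint32(0, total, true);                // Length (assumed: whole frame)
      view.setUint8(4, type);                        // Type (BEGIN, COMMIT, ...)
      view.setUint8(5, flags);                       // Flags
      view.setUint16(6, 0, true);                    // Reserved
      view.setBigUint64(8, txId, true);              // TxID
      buf.set(payload, header);                      // Payload
      view.setUint32(header + payload.length,
        crc32c(buf.subarray(0, header + payload.length)), true); // CRC32C
      return buf;                                    // trailing bytes stay zero (padding)
    }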

Circular Buffer

The WAL is a fixed-size circular buffer. When it fills up, old (already checkpointed) data is overwritten:

[Diagram: fixed-size ring (e.g. 64 MB); reclaimed space sits behind TAIL, free space ahead of HEAD]

HEAD: where new records are written
TAIL: start of unprocessed records (for replay)

When HEAD catches up to TAIL → Trigger checkpoint to free space
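
In offset terms (sizes and names are illustrative):

    // Sketch of HEAD/TAIL bookkeeping in a fixed-size ring.
    const WAL_SIZE = 64 * 1024 * 1024;  // e.g. 64 MB

    // Bytes available before HEAD would catch up to TAIL.
    function freeSpace(head: number, tail: number): number {
      return head >= tail
        ? WAL_SIZE - (head - tail)  // contiguous case (treats head === tail as empty)
        : tail - head;              // HEAD has wrapped past the end
    }

    function needsCheckpoint(head: number, tail: number, recordLen: number): boolean {
      // A write that would make HEAD catch up to TAIL triggers a checkpoint.
      return freeSpace(head, tail) <= recordLen;
    }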

Dual-Region Design

The WAL is split into primary (75%) and secondary (25%) regions:

Why Two Regions?

During checkpoint:

  • The primary region is being READ to build the new snapshot
  • New transactions need somewhere to WRITE

Primary (75%): being read for checkpoint
Secondary (25%): new writes go here

After checkpoint completes:

  • Primary is cleared (data is in new snapshot)
  • Secondary becomes the new primary
  • Writes continue without interruption
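
The swap itself can be as simple as this (structure and names are illustrative):

    // Sketch of promoting the secondary region after a checkpoint.
    interface Region { head: number; tail: number; base: number }
    interface Wal { primary: Region; secondary: Region }

    function finishCheckpoint(wal: Wal): void {
      // Primary's contents are now captured by the new snapshot: clear it.
      wal.primary.head = wal.primary.tail = wal.primary.base;
      // Secondary, which absorbed writes during the checkpoint, is promoted.
      [wal.primary, wal.secondary] = [wal.secondary, wal.primary];
      // Writers keep appending to the (new) primary with no interruption.
    }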

Durability Guarantees

KiteDB provides configurable durability:

Sync Modes

Mode             fsync behavior                   Trade-off
full (default)   fsync on every commit            Safest, slower writes
batch            fsync every N commits or T ms    Better throughput, small loss window
off (danger)     no fsync (OS decides)            Fastest, data loss on crash

For most applications, full is the right choice. Use batch for high write throughput with acceptable risk.
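
As a configuration sketch (the open() signature and the option names beyond syncMode are illustrative; check the API reference for the exact mode identifiers, since the fast-write profile below also mentions a Normal mode):

    // Hypothetical open-time configuration; option names are illustrative.
    declare function open(path: string, opts: {
      syncMode?: 'full' | 'batch' | 'off';
      batchCommits?: number;   // assumed knob: fsync every N commits...
      batchWindowMs?: number;  // ...or after T milliseconds
    }): Promise<unknown>;

    const db = await open('graph.kdb', {
      syncMode: 'batch',
      batchCommits: 64,
      batchWindowMs: 5,
    });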

Fast Writes (Single-File)

Recommended profile for high write throughput:

  • syncMode = Normal
  • groupCommitEnabled = true
  • groupCommitWindowMs = 2
  • beginBulk() + batch APIs for ingest (MVCC disabled)
  • Optional: increase walSizeMb (e.g., 64 MB) for heavy ingest to reduce checkpoint frequency

Durability note: Normal mode does not fsync on every commit. An OS crash can lose recent commits, but application crashes are recovered via WAL replay.
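
Put together, the profile might look like this (the options object shape and the bulk API methods are sketched from the bullets above, not a verbatim API):

    // Sketch of the fast-write profile; treat option and method names
    // as illustrative of the settings listed above.
    declare function open(path: string, opts: Record<string, unknown>): Promise<any>;
    declare const edges: Array<[number, number]>;

    const db = await open('graph.kdb', {
      syncMode: 'Normal',         // no fsync per commit (see note above)
      groupCommitEnabled: true,
      groupCommitWindowMs: 2,
      walSizeMb: 64,              // optional: larger WAL for heavy ingest
    });

    const bulk = db.beginBulk();  // batch ingest path, MVCC disabled
    for (const [src, dst] of edges) bulk.addEdge(src, dst);
    await bulk.commit();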

Crash Recovery

On database open, the WAL is replayed to rebuild the delta:

Recovery Process

1. Read the header to find the WAL boundaries
2. Scan from TAIL to HEAD
3. For each record, validate its CRC32C
4. If valid → apply to the delta; if invalid → stop (incomplete write)
5. Handle incomplete transactions: BEGIN without COMMIT → discard

Recovery time: O(WAL size), typically < 1 second
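
The replay loop might look like this; the parsing helpers are stand-ins, and buffering operations until their COMMIT is seen is one way to realize step 5:

    // Sketch of WAL replay; helpers and the WalRecord shape are
    // illustrative, not KiteDB's actual internals.
    interface WalRecord { type: string; txId: bigint; crcOk: boolean }
    declare function readRecordAt(offset: number): { rec: WalRecord; next: number } | null;
    declare function applyToDelta(rec: WalRecord): void;

    function replay(tail: number, head: number): void {
      const pending = new Map<bigint, WalRecord[]>(); // ops seen since BEGIN
      let offset = tail;
      while (offset !== head) {
        const r = readRecordAt(offset);
        if (r === null || !r.rec.crcOk) break;        // bad CRC: incomplete write, stop
        const { rec, next } = r;
        if (rec.type === 'BEGIN') pending.set(rec.txId, []);
        else if (rec.type === 'COMMIT') {
          for (const op of pending.get(rec.txId) ?? []) applyToDelta(op);
          pending.delete(rec.txId);
        } else if (rec.type === 'ROLLBACK') pending.delete(rec.txId);
        else pending.get(rec.txId)?.push(rec);        // buffer until COMMIT
        offset = next;
      }
      // Whatever remains in `pending` is BEGIN without COMMIT: discarded.
    }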

When Checkpoint Happens

Automatic triggers:

1. WAL reaches 75% capacity
2. Configured time interval elapses (e.g., every 5 minutes)
3. Graceful shutdown
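
The trigger logic amounts to something like this (thresholds and helper names are illustrative):

    // Sketch of the automatic checkpoint triggers listed above.
    declare function walUsedFraction(): number;        // 0..1
    declare function msSinceLastCheckpoint(): number;
    declare function startCheckpoint(): Promise<void>;

    const CAPACITY_TRIGGER = 0.75;           // 1. WAL at 75% capacity
    const INTERVAL_MS = 5 * 60_000;          // 2. e.g. every 5 minutes

    async function maybeCheckpoint(): Promise<void> {
      // 3. a graceful shutdown would call startCheckpoint() unconditionally
      if (walUsedFraction() >= CAPACITY_TRIGGER ||
          msSinceLastCheckpoint() >= INTERVAL_MS) {
        await startCheckpoint();
      }
    }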

Manual checkpoint:

await db.optimize();

During checkpoint:

  • Reads continue (from old snapshot + delta)
  • Writes continue (to secondary WAL region)
  • No downtime

Avoiding WAL Overflow

The WAL has a fixed size once the file is created. For large ingests, use resizeWal (offline) to grow it, or rebuild into a new file. To prevent a single transaction from overfilling the active WAL region, split work into smaller commits (see bulkWrite or chunked beginBulk() sessions, sketched below) and consider disabling background checkpoints during ingest.
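
For example, a chunked ingest might look like this (the chunk size and the bulk API shape are illustrative):

    // Sketch: split a large ingest into bounded commits so no single
    // transaction can overfill the active WAL region.
    declare const db: any;
    declare const rows: Array<{ src: number; dst: number }>;

    const CHUNK = 10_000;                    // tune to your WAL size
    for (let i = 0; i < rows.length; i += CHUNK) {
      const bulk = db.beginBulk();           // one bulk session per chunk
      for (const { src, dst } of rows.slice(i, i + CHUNK)) bulk.addEdge(src, dst);
      await bulk.commit();                   // bounded WAL footprint per commit
    }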
