
Key Index

Fast node lookups by key

Every node in KiteDB can have a string key for lookup. The key index provides O(1) average-case lookups from key to NodeID.

The Problem

Without an index, finding "user:alice" means scanning every node: O(n). At 1M nodes × 1μs per node, that is ~1 second.

With a hash index, the same lookup is a single hash probe: O(1). At 1M nodes, ~100ns.

Index Structure

The key index uses hash buckets with linear probing:

The bucket array (n buckets) holds start offsets into the entry array: bucketArray[i] marks where bucket i's entries begin, and bucketArray[i + 1] marks where they end.

The entry array is sorted by bucket, then by hash. Each entry is a {hash64, stringId, nodeId} triple.

hash64: xxHash64 of the key string

stringId: Index into string table (for collision resolution)

nodeId: The NodeID this key maps to
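The layout above can be sketched as TypeScript types. The names here (KeyIndexEntry, KeyIndexSnapshot) are illustrative, not KiteDB's actual API:

```typescript
// Illustrative types for the on-disk layout described above.
interface KeyIndexEntry {
  hash64: bigint;   // xxHash64 of the key string
  stringId: number; // index into the string table (for collision resolution)
  nodeId: number;   // the NodeID this key maps to
}

interface KeyIndexSnapshot {
  bucketArray: Uint32Array;  // numBuckets + 1 start offsets into entries
  entries: KeyIndexEntry[];  // sorted by bucket, then by hash64
}

// A tiny two-bucket example: bucket i spans
// entries[bucketArray[i] .. bucketArray[i + 1]).
const snapshot: KeyIndexSnapshot = {
  bucketArray: new Uint32Array([0, 1, 2]),
  entries: [
    { hash64: 0x1234n, stringId: 0, nodeId: 42 },
    { hash64: 0x5678n, stringId: 1, nodeId: 7 },
  ],
};
```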

Lookup Process

1. Check the delta first (recent in-memory changes):
   - If key is in delta.keyIndexDeleted → return null
   - If key is in delta.keyIndex → return its NodeID

2. Search the snapshot index (on disk):
   - hash = xxHash64(key)
   - bucket = hash % numBuckets
   - start = bucketArray[bucket]
   - end = bucketArray[bucket + 1]

3. Binary search within the bucket:
   - Find the entry where entry.hash64 == hash
   - Verify: stringTable[entry.stringId] == key
   - Return entry.nodeId
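The snapshot search (steps 2-3) can be sketched as follows. This is a reconstruction from the description above, not KiteDB source; the xxHash64 function is passed in as a parameter since the real implementation lives elsewhere:

```typescript
type Entry = { hash64: bigint; stringId: number; nodeId: number };

// Sketch of the snapshot lookup: hash → bucket → binary search → verify.
function lookupSnapshot(
  key: string,
  bucketArray: Uint32Array,          // numBuckets + 1 start offsets
  entries: Entry[],                  // sorted by bucket, then hash64
  stringTable: string[],
  xxHash64: (s: string) => bigint,
): number | null {
  const hash = xxHash64(key);
  const numBuckets = bucketArray.length - 1;
  const bucket = Number(hash % BigInt(numBuckets));
  const start = bucketArray[bucket];
  const end = bucketArray[bucket + 1];

  // Binary search in [start, end) for the first entry with hash64 >= hash.
  let lo = start;
  let hi = end;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (entries[mid].hash64 < hash) lo = mid + 1;
    else hi = mid;
  }

  // Walk the (usually single) run of equal hashes; verify the real string.
  for (let i = lo; i < end && entries[i].hash64 === hash; i++) {
    if (stringTable[entries[i].stringId] === key) return entries[i].nodeId;
  }
  return null; // no entry for this key in the snapshot
}
```

Because entries within a bucket are sorted by hash, equal hashes form a contiguous run, so the verification loop stays O(k).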

Two-Level Lookup

The key index is split between delta (memory) and snapshot (disk):

Delta (in-memory):
  keyIndex: Map<string, NodeID>
  keyIndexDeleted: Set<string>

Snapshot (on disk):
  bucketArray: u32[]
  entries: {hash64, stringId, nodeId}[]

Lookup order:

1. delta.keyIndexDeleted → if found, return null
2. delta.keyIndex → if found, return NodeID
3. snapshot index → search the hash buckets

This order ensures recent changes override old data.
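The ordering can be sketched as a small wrapper. The delta shape follows the Map/Set fields listed above; the snapshotLookup callback is a hypothetical stand-in for the on-disk search:

```typescript
type NodeID = number;

// Shapes taken from the section above; this wrapper is a sketch,
// not KiteDB's actual API.
interface Delta {
  keyIndex: Map<string, NodeID>;
  keyIndexDeleted: Set<string>;
}

function lookupKey(
  key: string,
  delta: Delta,
  snapshotLookup: (key: string) => NodeID | null, // stand-in for disk search
): NodeID | null {
  if (delta.keyIndexDeleted.has(key)) return null; // 1. deletion overrides
  const recent = delta.keyIndex.get(key);
  if (recent !== undefined) return recent;         // 2. recent write wins
  return snapshotLookup(key);                      // 3. fall back to snapshot
}
```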

Why xxHash64

- Fast: called on every key lookup
- Good distribution: minimizes bucket collisions
- Deterministic: the same key always produces the same hash

For typical key lengths (10-100 bytes):

xxHash64: ~50-100ns per hash, ~10 GB/s throughput
SHA-256: ~500-1000ns per hash, cryptographic (overkill here)

xxHash64 is roughly 10x faster, and key lookups don't need cryptographic security.

Handling Collisions

Scenario: two keys land in the same bucket (or, rarely, produce the same 64-bit hash):

"user:alice" → hash: 0x1234...

"user:alfred" → hash: 0x1234... (collision!)

1. Both entries are stored in the same bucket
2. On lookup, the hash matches both entries
3. Each candidate's stringId is resolved through the string table
4. An actual string comparison confirms the match

Cost: O(k) string comparisons, where k = the number of entries with the same hash. With a 64-bit hash, k ≈ 1 in practice.
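The tie-break can be illustrated with two fabricated colliding entries; the hash value and NodeIDs here are made up for the example:

```typescript
type E = { hash64: bigint; stringId: number; nodeId: number };

// Two keys sharing one 64-bit hash sit adjacent in the entry array.
const stringTable = ["user:alice", "user:alfred"];
const bucketEntries: E[] = [
  { hash64: 0x1234n, stringId: 0, nodeId: 10 },
  { hash64: 0x1234n, stringId: 1, nodeId: 11 },
];

// The hash matches both entries; the string-table comparison
// picks the right one.
function resolve(key: string, hash: bigint, entries: E[]): number | null {
  for (const e of entries) {
    if (e.hash64 === hash && stringTable[e.stringId] === key) return e.nodeId;
  }
  return null;
}
```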

Load Factor

Load factor = entries / buckets

KiteDB uses a ~50% load factor (twice as many buckets as entries), which gives:

- Low collision rate
- Reasonable memory usage
- Fast lookups

With 1M keys:

Buckets: 2M × 4 bytes = 8 MB
Entries: 1M × 24 bytes = 24 MB
Total index: ~32 MB

Lookup cost: 1 bucket read + 1-2 entry reads ≈ 100ns
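The memory math above can be checked in a few lines, using the sizes assumed in this section (4-byte bucket offsets, 24-byte entries, 2x buckets, decimal megabytes):

```typescript
// Back-of-envelope index sizing at a ~50% load factor.
function indexSizeBytes(numKeys: number) {
  const numBuckets = numKeys * 2;  // twice as many buckets as entries
  const buckets = numBuckets * 4;  // u32 start offsets
  const entries = numKeys * 24;    // {hash64, stringId, nodeId} per entry
  return { buckets, entries, total: buckets + entries };
}

const size = indexSizeBytes(1_000_000);
// buckets: 8 MB, entries: 24 MB, total: 32 MB (decimal megabytes)
```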

Next Steps