
Key Index

Fast node lookups by key

Every node in KiteDB can have a string key for lookup. The key index provides O(1) average-case lookups from key to NodeID.

The Problem

Without an index, finding "user:alice" means scanning every node: O(n). At 1M nodes × 1μs per node, that is ~1 second.

With a hash index, the same lookup is a single hash probe: O(1). At 1M nodes, ~100ns.

Index Structure

The key index uses hash buckets with linear probing:

The bucket array (n buckets) holds start offsets into the entry array: bucketArray[i] marks where bucket i's entries begin, and bucketArray[i + 1] marks where they end.

The entry array is sorted by bucket, then by hash. Each entry is a {hash64, stringId, nodeId} triple.

hash64: xxHash64 of the key string

stringId: Index into string table (for collision resolution)

nodeId: The NodeID this key maps to
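The layout above can be sketched as TypeScript types. The names here (KeyIndexEntry, KeyIndexSnapshot) are illustrative, not KiteDB's actual API:

```typescript
// Illustrative types for the on-disk layout described above.
interface KeyIndexEntry {
  hash64: bigint;   // xxHash64 of the key string
  stringId: number; // index into the string table (for collision resolution)
  nodeId: number;   // the NodeID this key maps to
}

interface KeyIndexSnapshot {
  bucketArray: Uint32Array;  // numBuckets + 1 start offsets into entries
  entries: KeyIndexEntry[];  // sorted by bucket, then by hash64
}

// A tiny two-bucket example: bucket i spans
// entries[bucketArray[i] .. bucketArray[i + 1]).
const snapshot: KeyIndexSnapshot = {
  bucketArray: new Uint32Array([0, 1, 2]),
  entries: [
    { hash64: 0x1234n, stringId: 0, nodeId: 42 },
    { hash64: 0x5678n, stringId: 1, nodeId: 7 },
  ],
};
```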

Lookup Process

1. Check the delta first (recent in-memory changes):
   - If key is in delta.keyIndexDeleted → return null
   - If key is in delta.keyIndex → return its NodeID

2. Search the snapshot index (on disk):
   - hash = xxHash64(key)
   - bucket = hash % numBuckets
   - start = bucketArray[bucket]
   - end = bucketArray[bucket + 1]

3. Binary search within the bucket:
   - Find the entry where entry.hash64 == hash
   - Verify: stringTable[entry.stringId] == key
   - Return entry.nodeId
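The snapshot search (steps 2-3) can be sketched as follows. This is a reconstruction from the description above, not KiteDB source; the xxHash64 function is passed in as a parameter since the real implementation lives elsewhere:

```typescript
type Entry = { hash64: bigint; stringId: number; nodeId: number };

// Sketch of the snapshot lookup: hash → bucket → binary search → verify.
function lookupSnapshot(
  key: string,
  bucketArray: Uint32Array,          // numBuckets + 1 start offsets
  entries: Entry[],                  // sorted by bucket, then hash64
  stringTable: string[],
  xxHash64: (s: string) => bigint,
): number | null {
  const hash = xxHash64(key);
  const numBuckets = bucketArray.length - 1;
  const bucket = Number(hash % BigInt(numBuckets));
  const start = bucketArray[bucket];
  const end = bucketArray[bucket + 1];

  // Binary search in [start, end) for the first entry with hash64 >= hash.
  let lo = start;
  let hi = end;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (entries[mid].hash64 < hash) lo = mid + 1;
    else hi = mid;
  }

  // Walk the (usually single) run of equal hashes; verify the real string.
  for (let i = lo; i < end && entries[i].hash64 === hash; i++) {
    if (stringTable[entries[i].stringId] === key) return entries[i].nodeId;
  }
  return null; // no entry for this key in the snapshot
}
```

Because entries within a bucket are sorted by hash, equal hashes form a contiguous run, so the verification loop stays O(k).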

Two-Level Lookup

The key index is split between delta (memory) and snapshot (disk):

Delta (in-memory):
  keyIndex: Map<string, NodeID>
  keyIndexDeleted: Set<string>

Snapshot (on disk):
  bucketArray: u32[]
  entries: {hash64, stringId, nodeId}[]

Lookup order:

1. delta.keyIndexDeleted → if found, return null
2. delta.keyIndex → if found, return NodeID
3. snapshot index → search the hash buckets

This order ensures recent changes override old data.
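The ordering can be sketched as a small wrapper. The delta shape follows the Map/Set fields listed above; the snapshotLookup callback is a hypothetical stand-in for the on-disk search:

```typescript
type NodeID = number;

// Shapes taken from the section above; this wrapper is a sketch,
// not KiteDB's actual API.
interface Delta {
  keyIndex: Map<string, NodeID>;
  keyIndexDeleted: Set<string>;
}

function lookupKey(
  key: string,
  delta: Delta,
  snapshotLookup: (key: string) => NodeID | null, // stand-in for disk search
): NodeID | null {
  if (delta.keyIndexDeleted.has(key)) return null; // 1. deletion overrides
  const recent = delta.keyIndex.get(key);
  if (recent !== undefined) return recent;         // 2. recent write wins
  return snapshotLookup(key);                      // 3. fall back to snapshot
}
```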

Why xxHash64

- Fast: called on every key lookup
- Good distribution: minimizes bucket collisions
- Deterministic: the same key always produces the same hash

For typical key lengths (10-100 bytes):

xxHash64: ~50-100ns per hash, ~10 GB/s throughput
SHA-256: ~500-1000ns per hash, cryptographic (overkill here)

xxHash64 is roughly 10x faster, and key lookups don't need cryptographic security.

Handling Collisions

Scenario: two keys land in the same bucket (or, rarely, produce the same 64-bit hash):

"user:alice" → hash: 0x1234...

"user:alfred" → hash: 0x1234... (collision!)

1. Both entries are stored in the same bucket
2. On lookup, the hash matches both entries
3. Each candidate's stringId is resolved through the string table
4. An actual string comparison confirms the match

Cost: O(k) string comparisons, where k = the number of entries with the same hash. With a 64-bit hash, k ≈ 1 in practice.
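The tie-break can be illustrated with two fabricated colliding entries; the hash value and NodeIDs here are made up for the example:

```typescript
type E = { hash64: bigint; stringId: number; nodeId: number };

// Two keys sharing one 64-bit hash sit adjacent in the entry array.
const stringTable = ["user:alice", "user:alfred"];
const bucketEntries: E[] = [
  { hash64: 0x1234n, stringId: 0, nodeId: 10 },
  { hash64: 0x1234n, stringId: 1, nodeId: 11 },
];

// The hash matches both entries; the string-table comparison
// picks the right one.
function resolve(key: string, hash: bigint, entries: E[]): number | null {
  for (const e of entries) {
    if (e.hash64 === hash && stringTable[e.stringId] === key) return e.nodeId;
  }
  return null;
}
```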

Load Factor

Load factor = entries / buckets

KiteDB uses a ~50% load factor (twice as many buckets as entries), which gives:

- Low collision rate
- Reasonable memory usage
- Fast lookups

With 1M keys:

Buckets: 2M × 4 bytes = 8 MB
Entries: 1M × 24 bytes = 24 MB
Total index: ~32 MB

Lookup cost: 1 bucket read + 1-2 entry reads ≈ 100ns
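The memory math above can be checked in a few lines, using the sizes assumed in this section (4-byte bucket offsets, 24-byte entries, 2x buckets, decimal megabytes):

```typescript
// Back-of-envelope index sizing at a ~50% load factor.
function indexSizeBytes(numKeys: number) {
  const numBuckets = numKeys * 2;  // twice as many buckets as entries
  const buckets = numBuckets * 4;  // u32 start offsets
  const entries = numKeys * 24;    // {hash64, stringId, nodeId} per entry
  return { buckets, entries, total: buckets + entries };
}

const size = indexSizeBytes(1_000_000);
// buckets: 8 MB, entries: 24 MB, total: 32 MB (decimal megabytes)
```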

Next Steps