~/docs/internals/csr

DOCUMENTATION

CSR Format

How edges are stored for fast traversal

KiteDB uses Compressed Sparse Row (CSR) format to store graph edges. CSR is a standard format for sparse matrices that provides fast traversal with minimal memory overhead.

The Problem with Naive Edge Storage

Consider a graph with 100,000 nodes and 1 million edges.

Adjacency Matrix

A B C D

A [ 0 1 1 0 ]

B [ 0 0 0 1 ]

C [ 1 0 0 0 ]

D [ 0 0 0 0 ]

100K × 100K=10 billionentries

Actual edges:1M(0.01% used)

Wastes 99.99% of space

Linked Adjacency Lists

A→B→C→null

B→D→null

Problem: Pointer chasing. Each lookup goes to random memory.

Cache miss:~100ns

Cache hit:~1ns

1000 edges × 100ns = 100μs wasted waiting for RAM

The CSR Solution

CSR stores all edges in two flat arrays: offsets and destinations. No pointers, no wasted space.

CSR Solution

Graph:

A → B, CB → DC → AD → (none)

1

Concatenate all destinations:

destinations=

B

C

D

A

B

C

2

Record where each node's edges start:

offsets=

0

2

3

4

↑

A

B

C

D

end

How Traversal Works

Finding a node's neighbors is two array lookups:

"Who does A connect to?"

start = offsets[0] = 0

end = offsets[1] = 2

destinations[0:2] = [B, C] ✓

"Who does D connect to?"

start = offsets[3] = 4

end = offsets[4] = 4

destinations[4:4] = [] (no edges) ✓

Algorithm:

start = offsets[node]

end = offsets[node + 1]

return destinations[start:end]

Why It's Fast: Memory Layout

LINKED LIST—scattered

B

C

D

0x1000

0x5F00

0x2A00

↑ Random locations = cache misses

CSR—contiguous

B

C

D

A

0x1000

+4

+8

+C

↑ Sequential = CPU prefetcher works

After the first access, B/C/D/A are already in CPU cache.

Bidirectional Edges

KiteDB stores edges in both directions for fast traversal either way:

Out-edges (A → B)

out_offsets = [0, 2, 3, 4, 4]

out_dst = [B, C, D, A]

"Who does Alice follow?"

In-edges (A ← C)

in_offsets = [0, 1, 2, 3, 4]

in_src = [C, A, A, B]

"Who follows Alice?"

Trade-off: 2× storage, but O(1) traversal in both directions

Edge Types and Sorting

Real graphs have different edge types (follows, likes, knows). KiteDB stores edge types in a parallel array, sorted within each node:

out_dst = [B, C, D, A]

out_etype = [0, 1, 0, 0]

0 = KNOWS

1 = LIKES

Sorted by (etype, dst) within each node:

1Binary search to find specific edge type

2Early termination when past desired type

3"Get A's KNOWS edges" doesn't scan all

Edge Existence Check

To check if edge A→B exists with type KNOWS:

typescript

function hasEdge(src: NodeID, etype: EdgeType, dst: NodeID): boolean {
  const start = offsets[src];
  const end = offsets[src + 1];
  
  // Binary search for etype within [start, end)
  const typeStart = binarySearchStart(etypes, start, end, etype);
  const typeEnd = binarySearchEnd(etypes, start, end, etype);
  
  // Binary search for dst within type range
  return binarySearch(destinations, typeStart, typeEnd, dst);
}

// Complexity: O(log k) where k = number of edges from src

Performance Numbers

Operation	CSR	Linked List
Start traversal	O(1) – two array lookups	O(1) – follow pointer
Iterate k neighbors	O(k) – sequential read	O(k) – but cache misses
Edge existence	O(log k) – binary search	O(k) – linear scan
Cache behavior	Excellent – prefetcher works	Poor – random access

Next Steps

Snapshot + Delta – How CSR fits into the storage model
Key Index – How node lookups work
Performance – Optimization techniques

./edit --remote