Skip to content
~/docs/internals/csr
DOCUMENTATION

CSR Format

How edges are stored for fast traversal

KiteDB uses Compressed Sparse Row (CSR) format to store graph edges. CSR is a standard format for sparse matrices that provides fast traversal with minimal memory overhead.

The Problem with Naive Edge Storage

Consider a graph with 100,000 nodes and 1 million edges.

Adjacency Matrix

A B C D
A [ 0 1 1 0 ]
B [ 0 0 0 1 ]
C [ 1 0 0 0 ]
D [ 0 0 0 0 ]
100K × 100K=10 billionentries
Actual edges:1M(0.01% used)
Wastes 99.99% of space

Linked Adjacency Lists

ABCnull
BDnull

Problem: Pointer chasing. Each lookup goes to random memory.

Cache miss:~100ns
Cache hit:~1ns
1000 edges × 100ns = 100μs wasted waiting for RAM

The CSR Solution

CSR stores all edges in two flat arrays: offsets and destinations. No pointers, no wasted space.

CSR Solution

Graph:
A B, CB DC AD (none)
1
Concatenate all destinations:
destinations=
B
C
D
A
A
B
C
2
Record where each node's edges start:
offsets=
0
2
3
4
4
A
B
C
D
end

How Traversal Works

Finding a node's neighbors is two array lookups:

"Who does A connect to?"
start = offsets[0] = 0
end = offsets[1] = 2
destinations[0:2] = [B, C]
"Who does D connect to?"
start = offsets[3] = 4
end = offsets[4] = 4
destinations[4:4] = [] (no edges)
Algorithm:
start = offsets[node]
end = offsets[node + 1]
return destinations[start:end]

Why It's Fast: Memory Layout

LINKED LISTscattered
B
C
D
0x1000
0x5F00
0x2A00
↑ Random locations = cache misses
CSRcontiguous
B
C
D
A
0x1000
+4
+8
+C
↑ Sequential = CPU prefetcher works

After the first access, B/C/D/A are already in CPU cache.

Bidirectional Edges

KiteDB stores edges in both directions for fast traversal either way:

Out-edges (A → B)
out_offsets = [0, 2, 3, 4, 4]
out_dst = [B, C, D, A]
"Who does Alice follow?"
In-edges (A ← C)
in_offsets = [0, 1, 2, 3, 4]
in_src = [C, A, A, B]
"Who follows Alice?"
Trade-off: 2× storage, but O(1) traversal in both directions

Edge Types and Sorting

Real graphs have different edge types (follows, likes, knows). KiteDB stores edge types in a parallel array, sorted within each node:

out_dst = [B, C, D, A]
out_etype = [0, 1, 0, 0]
0 = KNOWS
1 = LIKES
Sorted by (etype, dst) within each node:
1Binary search to find specific edge type
2Early termination when past desired type
3"Get A's KNOWS edges" doesn't scan all

Edge Existence Check

To check if edge A→B exists with type KNOWS:

typescript
function hasEdge(src: NodeID, etype: EdgeType, dst: NodeID): boolean {
  const start = offsets[src];
  const end = offsets[src + 1];
  
  // Binary search for etype within [start, end)
  const typeStart = binarySearchStart(etypes, start, end, etype);
  const typeEnd = binarySearchEnd(etypes, start, end, etype);
  
  // Binary search for dst within type range
  return binarySearch(destinations, typeStart, typeEnd, dst);
}

// Complexity: O(log k) where k = number of edges from src

Performance Numbers

OperationCSRLinked List
Start traversalO(1) – two array lookupsO(1) – follow pointer
Iterate k neighborsO(k) – sequential readO(k) – but cache misses
Edge existenceO(log k) – binary searchO(k) – linear scan
Cache behaviorExcellent – prefetcher worksPoor – random access

Next Steps