November 2019
The details on how tables and indexes work. Rows,
pages, the heap and indexes are all covered, both on-disk layout and
querying.
SQL (or relational) databases are some of oldest and probably still are the
single most commonly used database technology – both in database server form
(such as postgresql) or in library form (sqlite).
Despite this, SQL databases’ internal operations can be mysterious to a
surprisingly large number of people – even those who work with them daily.
Unlike other database technologies, such key-value stores or graph
databases, relational databases are not based on a single data structure but a
combination of several different data structures. Understanding this
combination is not difficult and can help a lot when working with a SQL
database.
The relation
Programmers most often work with just two primary data types: arrays and
associative arrays.
Arrays allow for lists of objects, providing a way of
grouping multiple objects together.
Array implementation details vary widely but in general: an item can be
added to an array at the end in a constant amount of time no matter the size of
the array already. Searching for an given item is different: each entry in the
array must be checked and this takes a length of time proportional to the size
of the array at present (deletion has similar performance characteristics).
Operation | Time (worst-case) |
---|---|
Insert | Constant / O(1) |
Search/Delete | Linear / O(n) |
Associative arrays (often called “hash-tables” after the
most common implementation) allow for reference to items via a “key”.
Example hash table holding the same cats, “keyed” by their
registration number.
This key must be unique for each item but associate arrays typically allow
for fast insert, search and delete so long as the key is being used.
If the key isn’t being used for the operation, the entire data structure must
be traversed, just as with arrays.
Operation | Time (worst-case) |
---|---|
Insert | Constant / O(1) |
Search/Delete (by key) | Constant / O(1) |
Search/Delete (otherwise) | Linear / O(n) |
SQL databases use a different type: the relation.
A relation is a generalisation of both of these data types. A relation is a
set of tuples. Each tuple can hold the details of a single item but the
individual fields are split out and accessible on the relation level. Within a
relation, the tuple form is exactly the same for each tuple.
Example relation holding the same cats.
Note that properties of the cats are broken out and available at the
relation level.
Relations are useful because they expose the internal fields of what was
previously an opaque object. Fields like age and breed are now open to
operations on the relation level itself. This means that more complicated
operations are possible on the relation without having to write specific
functions to ask questions of objects.
Being able to operate on fields directly also allows for easy definition of
“indexes”, which speed up access and search within the database.
SQL databases’ implementation of relations varies a little but are mostly a
combination of two different underlying data structures: the heap
file and b-trees. Consequently the operation times
are generally logarithmic – meaning that they grow slowly as the size of the
data-set grows.
Operation | Time (worst-case) |
---|---|
Insert | Logarithmic / O(log n) |
Search/Delete | Logarithmic / O(log n) |
The rest of this discussion will be about how these two underlying data
structures work.
The row
SQL databases aren’t a perfect representation of relations. Instead of
relations and tuples they represent “tables” and “rows”. Rows are functionally
the same as tuples but tables have a quirk – duplicate rows are allowed. This
does matter for writing queries but doesn’t change the fundamentals of how
things work so I’ll treat them as the same and will use SQL terminology from
now on.
Rows are typically represented, both on-disk and in memory, as a short
header followed a concatenation of row values.
Example row holding data on Alfred.
If we assume that 4 bytes (32 bits) are used to store an int and
strings use one byte per character (as is the case in ASCII) and are
terminated in a “NUL” byte the length of row data here is 4 7 4
18 = 33 bytes. Rows are typically much longer in practice.
The size of the header varies by database system but is usually kept
as small as possible (27 bytes in postgresql).
The header will often contain information about the total length of the row.
Depending on the specific implementation it might also contain other
information, especially information about concurrency (ie: which transactions a
given row is visible in).
The rest of the row is just the values, one after another.
There are two kinds of row values: those that are fixed length and those
that are variable length. A 32-bit integer or a 64 bit float is an example of a
fixed length value those can be represented directly. Strings, binary blobs,
JSON, XML and other types can vary in length can be handled in one of two
ways.
One common method is to use a special “terminator” character (often NULL or