Storage Engine

Storage Backends Overview

Every table in Absolute DB uses one of four storage backends. The choice is made at table creation time and can be changed online. All backends share the same MVCC transaction layer, WAL, and LIRS buffer pool — allowing a single transaction to span tables on different backends.

Backend	Best For	Page Size	Syntax
B+Tree	OLTP, point lookups, range scans	4 KB	Default (no keyword needed)
LSM-Tree	Write-heavy, append-dominant	Variable	`USING LSM`
PAX Columnar	Analytics, OLAP, aggregations	64 KB	`USING COLUMNAR`
HTAP Dual-Store	Mixed OLTP + OLAP on same data	Both	`USING HTAP`

B+Tree (OLTP Default)

The default storage backend for transactional workloads. B+Tree pages are 4 KB, copy-on-write (for MVCC), and write-ahead-logged for durability. Point lookups reach 0.14 µs on the direct API path.

Bloom Filters on Leaf Pages

Every B+Tree leaf page carries a compact Bloom filter. Before reading a page from disk, the filter is checked: if the key is definitely absent, the I/O is skipped entirely. This dramatically reduces read amplification for point lookups on tables with high fill rates.

Partial Indexes

Indexes can be restricted to a subset of rows, reducing index size and improving write throughput on tables where only a fraction of rows are queried by index.

sql — B+Tree index examples

-- Standard B+Tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Partial index: only index rows matching the predicate
CREATE INDEX idx_orders_pending
    ON orders(id)
    WHERE status = 'pending';

-- Expression index
CREATE INDEX idx_users_lower_email
    ON users(lower(email));

-- Composite index
CREATE INDEX idx_logs_host_ts
    ON logs(host, created_at DESC);

BRIN Index (Block Range Index)

BRIN stores only the minimum and maximum value for each range of 128 pages. For monotonically increasing columns (timestamps, auto-increment IDs), a BRIN index can be up to 1,000x smaller than a B+Tree index while still providing excellent selectivity for range queries.

sql — BRIN index

-- BRIN index for a time-series table (extremely compact)
CREATE INDEX idx_events_ts
    ON events USING BRIN(created_at);

-- Range query benefits from BRIN automatically
SELECT * FROM events
WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31';

LSM-Tree (Write-Optimised)

The Log-Structured Merge-Tree backend is optimised for workloads where write throughput is the primary concern — sensor ingestion, event sourcing, audit logs, and append-dominant time-series. Writes go first to an in-memory MemTable, which is flushed to immutable SSTables (Level 0) and merged downward through levels by a background compaction thread.

sql — LSM-Tree table

-- Create a table backed by LSM-Tree
CREATE TABLE sensor_readings (
    device_id  UUID        NOT NULL,
    ts         TIMESTAMPTZ NOT NULL,
    value      DOUBLE PRECISION,
    PRIMARY KEY (device_id, ts)
) USING LSM;

-- Compaction happens automatically in the background.
-- No manual intervention required.

LSM compaction strategy is leveled: when Level N exceeds its size target, files are merged into Level N+1. Read amplification is bounded. Bloom filters on SSTables skip irrelevant files during point lookups.

PAX Columnar Storage

Partition Attributes aXross (PAX) layout stores each column contiguously within 64 KB pages. This gives SIMD-friendly memory access patterns for analytical queries — aggregations, GROUP BY, window functions — while still allowing efficient per-row access. Zone maps in each page header record per-column min/max values, enabling entire pages to be skipped when a query predicate cannot match.

Encoding	Best For	Typical Ratio
RLE (Run-Length)	Low-cardinality columns, sorted data	10–100×
Bit-packing	Small integers, flags, enum values	2–8×
Dictionary	Repeated strings (up to 256 distinct values)	4–32×
Delta + RLE	Timestamps, monotonic counters	5–20×
Gorilla delta-delta	Float time-series values	~10×

sql — columnar storage

-- Create a columnar table (optimised for analytics)
CREATE TABLE metrics (
    ts    TIMESTAMPTZ     NOT NULL,
    host  TEXT            NOT NULL,
    value DOUBLE PRECISION
) USING COLUMNAR PAGE_SIZE 65536;

-- The query planner automatically chooses columnar scans
-- for aggregation and GROUP BY queries
SELECT host, avg(value), max(value)
FROM metrics
WHERE ts >= now() - INTERVAL '24 hours'
GROUP BY host
ORDER BY avg DESC;

HTAP Dual-Store

HTAP (Hybrid Transactional/Analytical Processing) maintains both a row-store and a PAX columnar store simultaneously for the same table. A zero-copy lock-free replication ring with 4,096 entries propagates row inserts to the columnar store in real time — typical replication lag is ≤ 5 ms. No ETL pipeline is required.

The query planner routes automatically: point lookups and short-range scans go to the row-store; full-table scans and aggregations go to the columnar store. A single SQL statement can read from both stores in the same execution plan.

sql — HTAP table

CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    customer_id INTEGER,
    total       NUMERIC(12,2),
    status      TEXT,
    created_at  TIMESTAMPTZ DEFAULT now()
) USING HTAP;

-- OLTP path: single-row lookup (row-store, ~0.14 µs)
SELECT * FROM orders WHERE id = 12345;

-- OLAP path: aggregation (columnar store, SIMD-accelerated)
SELECT status, count(*), sum(total)
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY status;

Pluggable Page Sizes

Page size is configurable per table to match the I/O characteristics of the workload. Smaller pages reduce read amplification for random point lookups; larger pages amortise I/O overhead for sequential scans.

Page Size	Best For
4 KB	OLTP — random point lookups, short-range scans
64 KB	Analytics — sequential scans, columnar storage
2 MB	Bulk data — data warehouse tables, large object storage

sql — page size selection

-- OLTP table (default 4 KB)
CREATE TABLE users (id BIGSERIAL PRIMARY KEY, name TEXT);

-- Analytics table (64 KB pages)
CREATE TABLE page_views (
    ts   TIMESTAMPTZ, url TEXT, user_id BIGINT
) USING COLUMNAR PAGE_SIZE 65536;

-- Bulk data warehouse table (2 MB pages)
CREATE TABLE raw_logs (
    ts TEXT, line TEXT
) PAGE_SIZE 2097152;

LIRS Buffer Pool (Patent-Free)

Absolute DB uses the LIRS (Low Inter-Reference Recency Set) algorithm exclusively for buffer pool management. ARC (Adaptive Replacement Cache) is covered by IBM US Patent 6,996,676 and is never used.

LIRS classifies pages into hot and cold tiers based on inter-reference recency, keeping frequently-reused pages resident and evicting cold pages with O(1) hit, miss, and eviction operations. The cold-tier ratio (default ~2%) is tunable via configuration.

bash — buffer pool configuration

# Set buffer pool to 4 GB (typical production server)
absdb-server --buffer-pool-mb 4096

# Embedded / edge minimal configuration
absdb-lite --buffer-pool-mb 64

# Micro-embedded (Raspberry Pi Zero)
absdb-lite --buffer-pool-mb 1

Configuration	Recommended Buffer Pool
Embedded / IoT (Raspberry Pi Zero)	1–16 MB
Developer laptop	256–512 MB
Small production server (8 GB RAM)	4–6 GB
Large production server (256 GB RAM)	180–220 GB

Direct I/O & io_uring

On Linux, Absolute DB uses io_uring for asynchronous I/O — bypassing the kernel page cache for predictable latency and enabling batched I/O submission (up to 64 reads + 64 writes per tick). When io_uring is unavailable (older kernels, macOS), the engine falls back transparently to POSIX pread/pwrite.

Direct I/O is used for data pages to bypass the OS page cache and avoid double-buffering, since Absolute DB manages its own LIRS buffer pool. WAL writes use durable sync mode when io_uring is unavailable to guarantee durability on all platforms.

bash — I/O backend selection

# io_uring is auto-selected on Linux kernel 5.1+
# Force POSIX I/O (useful for containers without io_uring permissions)
absdb-server --io-backend posix

# Check which I/O backend is active
absdb-server --collect-env | grep io_backend

TRIM support is available on SSDs: BLKDISCARD ioctl on Linux, fallocate(PUNCH_HOLE) fallback, and F_PUNCHHOLE on macOS. This allows freed pages to be returned to the SSD's garbage collector, maintaining write performance over time.

Indexes

Absolute DB supports a comprehensive set of index types. All index types are maintained transactionally — index updates are part of the same WAL record as the data write.

Index Type	Use Case	Size vs B+Tree
B+Tree (default)	Equality, range, ORDER BY, JOIN	1×
Hash	Equality only; O(1) lookup	~0.5×
GIN	JSONB, full-text, array containment	Variable
BRIN	Monotonic columns (ts, ID); huge tables	~0.001×
HNSW	Vector similarity search	Varies by dims
Partial	Index subset of rows (WHERE predicate)	<1×
Expression	Index computed expressions	1×

Online index creation is supported — background thread builds the index while the table remains fully accessible for reads and writes. Progress is visible via SELECT * FROM absdb_active_queries.

WAL & Crash Recovery

Every write in Absolute DB is recorded in the Write-Ahead Log before the data page is modified. On crash, the WAL is replayed from the last checkpoint to restore the database to a consistent state.

CRC-32C integrity on every WAL record — detects torn writes and media errors
Group commit: up to 64 records are batched per fsync, reducing I/O overhead by orders of magnitude under high write concurrency
Re-Read Before Shutdown: WAL is re-scanned on clean shutdown to ensure no committed records are lost
Point-in-Time Recovery (PITR): WAL can be streamed to S3, GCS, or Azure Blob for continuous backup
Incremental backup: only changed pages since the last checkpoint are included

sql — WAL and recovery

-- Check current WAL position (LSN = Log Sequence Number)
SELECT absdb_current_lsn();

-- Force a checkpoint (flush dirty pages, advance WAL)
CHECKPOINT;

-- Stream WAL to S3 for continuous backup
SELECT absdb_wal_archive_enable(
    target => 's3://my-bucket/absdb-wal/',
    region => 'ap-southeast-2'
);

MVCC Snapshot Isolation

Absolute DB implements Multi-Version Concurrency Control (MVCC). Each transaction sees a consistent snapshot of the database as of its start System Change Number (SCN). Writers never block readers; readers never block writers.

Up to 4,096 concurrent active transactions are supported. The undo log enables savepoints and multi-level rollback within a transaction. Old versions are cleaned up by a background VACUUM process that honours configured PITR retention windows.

sql — MVCC and snapshot queries

-- Every SELECT automatically uses a consistent snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM orders;  -- snapshot taken here
-- ... other work ...
SELECT count(*) FROM orders;  -- same count, even if rows were inserted
COMMIT;

-- Time-travel: read table as of a historical SCN
SELECT * FROM orders AS OF SCN 1048576;

-- Check isolation level
SHOW transaction_isolation;

For more on transactions, isolation levels, and advisory locks, see the Transactions & MVCC documentation.

Contents