Documentation

Storage Engine

Absolute DB provides four interoperable storage backends — B+Tree, LSM-Tree, PAX columnar, and HTAP dual-store — all sharing a unified MVCC layer, WAL, and patent-free LIRS buffer pool.

Storage Backends Overview

Every table in Absolute DB uses one of four storage backends. The choice is made at table creation time and can be changed online. All backends share the same MVCC transaction layer, WAL, and LIRS buffer pool — allowing a single transaction to span tables on different backends.

BackendBest ForPage SizeSyntax
B+TreeOLTP, point lookups, range scans4 KBDefault (no keyword needed)
LSM-TreeWrite-heavy, append-dominantVariableUSING LSM
PAX ColumnarAnalytics, OLAP, aggregations64 KBUSING COLUMNAR
HTAP Dual-StoreMixed OLTP + OLAP on same dataBothUSING HTAP

B+Tree (OLTP Default)

The default storage backend for transactional workloads. B+Tree pages are 4 KB, copy-on-write (for MVCC), and write-ahead-logged for durability. Point lookups reach 0.14 µs on the direct API path.

Bloom Filters on Leaf Pages

Every B+Tree leaf page carries a compact Bloom filter. Before reading a page from disk, the filter is checked: if the key is definitely absent, the I/O is skipped entirely. This dramatically reduces read amplification for point lookups on tables with high fill rates.

Partial Indexes

Indexes can be restricted to a subset of rows, reducing index size and improving write throughput on tables where only a fraction of rows are queried by index.

sql — B+Tree index examples
-- Standard B+Tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Partial index: only index rows matching the predicate
CREATE INDEX idx_orders_pending
    ON orders(id)
    WHERE status = 'pending';

-- Expression index
CREATE INDEX idx_users_lower_email
    ON users(lower(email));

-- Composite index
CREATE INDEX idx_logs_host_ts
    ON logs(host, created_at DESC);

BRIN Index (Block Range Index)

BRIN stores only the minimum and maximum value for each range of 128 pages. For monotonically increasing columns (timestamps, auto-increment IDs), a BRIN index can be up to 1,000x smaller than a B+Tree index while still providing excellent selectivity for range queries.

sql — BRIN index
-- BRIN index for a time-series table (extremely compact)
CREATE INDEX idx_events_ts
    ON events USING BRIN(created_at);

-- Range query benefits from BRIN automatically
SELECT * FROM events
WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31';

LSM-Tree (Write-Optimised)

The Log-Structured Merge-Tree backend is optimised for workloads where write throughput is the primary concern — sensor ingestion, event sourcing, audit logs, and append-dominant time-series. Writes go first to an in-memory MemTable, which is flushed to immutable SSTables (Level 0) and merged downward through levels by a background compaction thread.

sql — LSM-Tree table
-- Create a table backed by LSM-Tree
CREATE TABLE sensor_readings (
    device_id  UUID        NOT NULL,
    ts         TIMESTAMPTZ NOT NULL,
    value      DOUBLE PRECISION,
    PRIMARY KEY (device_id, ts)
) USING LSM;

-- Compaction happens automatically in the background.
-- No manual intervention required.

LSM compaction strategy is leveled: when Level N exceeds its size target, files are merged into Level N+1. Read amplification is bounded. Bloom filters on SSTables skip irrelevant files during point lookups.

PAX Columnar Storage

Partition Attributes aXross (PAX) layout stores each column contiguously within 64 KB pages. This gives SIMD-friendly memory access patterns for analytical queries — aggregations, GROUP BY, window functions — while still allowing efficient per-row access. Zone maps in each page header record per-column min/max values, enabling entire pages to be skipped when a query predicate cannot match.

EncodingBest ForTypical Ratio
RLE (Run-Length)Low-cardinality columns, sorted data10–100×
Bit-packingSmall integers, flags, enum values2–8×
DictionaryRepeated strings (up to 256 distinct values)4–32×
Delta + RLETimestamps, monotonic counters5–20×
Gorilla delta-deltaFloat time-series values~10×
sql — columnar storage
-- Create a columnar table (optimised for analytics)
CREATE TABLE metrics (
    ts    TIMESTAMPTZ     NOT NULL,
    host  TEXT            NOT NULL,
    value DOUBLE PRECISION
) USING COLUMNAR PAGE_SIZE 65536;

-- The query planner automatically chooses columnar scans
-- for aggregation and GROUP BY queries
SELECT host, avg(value), max(value)
FROM metrics
WHERE ts >= now() - INTERVAL '24 hours'
GROUP BY host
ORDER BY avg DESC;

HTAP Dual-Store

HTAP (Hybrid Transactional/Analytical Processing) maintains both a row-store and a PAX columnar store simultaneously for the same table. A zero-copy lock-free replication ring with 4,096 entries propagates row inserts to the columnar store in real time — typical replication lag is ≤ 5 ms. No ETL pipeline is required.

The query planner routes automatically: point lookups and short-range scans go to the row-store; full-table scans and aggregations go to the columnar store. A single SQL statement can read from both stores in the same execution plan.

sql — HTAP table
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    customer_id INTEGER,
    total       NUMERIC(12,2),
    status      TEXT,
    created_at  TIMESTAMPTZ DEFAULT now()
) USING HTAP;

-- OLTP path: single-row lookup (row-store, ~0.14 µs)
SELECT * FROM orders WHERE id = 12345;

-- OLAP path: aggregation (columnar store, SIMD-accelerated)
SELECT status, count(*), sum(total)
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY status;

Pluggable Page Sizes

Page size is configurable per table to match the I/O characteristics of the workload. Smaller pages reduce read amplification for random point lookups; larger pages amortise I/O overhead for sequential scans.

Page SizeBest For
4 KBOLTP — random point lookups, short-range scans
64 KBAnalytics — sequential scans, columnar storage
2 MBBulk data — data warehouse tables, large object storage
sql — page size selection
-- OLTP table (default 4 KB)
CREATE TABLE users (id BIGSERIAL PRIMARY KEY, name TEXT);

-- Analytics table (64 KB pages)
CREATE TABLE page_views (
    ts   TIMESTAMPTZ, url TEXT, user_id BIGINT
) USING COLUMNAR PAGE_SIZE 65536;

-- Bulk data warehouse table (2 MB pages)
CREATE TABLE raw_logs (
    ts TEXT, line TEXT
) PAGE_SIZE 2097152;

LIRS Buffer Pool (Patent-Free)

Absolute DB uses the LIRS (Low Inter-Reference Recency Set) algorithm exclusively for buffer pool management. ARC (Adaptive Replacement Cache) is covered by IBM US Patent 6,996,676 and is never used.

LIRS classifies pages into hot and cold tiers based on inter-reference recency, keeping frequently-reused pages resident and evicting cold pages with O(1) hit, miss, and eviction operations. The cold-tier ratio (default ~2%) is tunable via configuration.

bash — buffer pool configuration
# Set buffer pool to 4 GB (typical production server)
absdb-server --buffer-pool-mb 4096

# Embedded / edge minimal configuration
absdb-lite --buffer-pool-mb 64

# Micro-embedded (Raspberry Pi Zero)
absdb-lite --buffer-pool-mb 1
ConfigurationRecommended Buffer Pool
Embedded / IoT (Raspberry Pi Zero)1–16 MB
Developer laptop256–512 MB
Small production server (8 GB RAM)4–6 GB
Large production server (256 GB RAM)180–220 GB

Direct I/O & io_uring

On Linux, Absolute DB uses io_uring for asynchronous I/O — bypassing the kernel page cache for predictable latency and enabling batched I/O submission (up to 64 reads + 64 writes per tick). When io_uring is unavailable (older kernels, macOS), the engine falls back transparently to POSIX pread/pwrite.

Direct I/O is used for data pages to bypass the OS page cache and avoid double-buffering, since Absolute DB manages its own LIRS buffer pool. WAL writes use durable sync mode when io_uring is unavailable to guarantee durability on all platforms.

bash — I/O backend selection
# io_uring is auto-selected on Linux kernel 5.1+
# Force POSIX I/O (useful for containers without io_uring permissions)
absdb-server --io-backend posix

# Check which I/O backend is active
absdb-server --collect-env | grep io_backend

TRIM support is available on SSDs: BLKDISCARD ioctl on Linux, fallocate(PUNCH_HOLE) fallback, and F_PUNCHHOLE on macOS. This allows freed pages to be returned to the SSD's garbage collector, maintaining write performance over time.

Indexes

Absolute DB supports a comprehensive set of index types. All index types are maintained transactionally — index updates are part of the same WAL record as the data write.

Index TypeUse CaseSize vs B+Tree
B+Tree (default)Equality, range, ORDER BY, JOIN
HashEquality only; O(1) lookup~0.5×
GINJSONB, full-text, array containmentVariable
BRINMonotonic columns (ts, ID); huge tables~0.001×
HNSWVector similarity searchVaries by dims
PartialIndex subset of rows (WHERE predicate)<1×
ExpressionIndex computed expressions

Online index creation is supported — background thread builds the index while the table remains fully accessible for reads and writes. Progress is visible via SELECT * FROM absdb_active_queries.

WAL & Crash Recovery

Every write in Absolute DB is recorded in the Write-Ahead Log before the data page is modified. On crash, the WAL is replayed from the last checkpoint to restore the database to a consistent state.

  • CRC-32C integrity on every WAL record — detects torn writes and media errors
  • Group commit: up to 64 records are batched per fsync, reducing I/O overhead by orders of magnitude under high write concurrency
  • Re-Read Before Shutdown: WAL is re-scanned on clean shutdown to ensure no committed records are lost
  • Point-in-Time Recovery (PITR): WAL can be streamed to S3, GCS, or Azure Blob for continuous backup
  • Incremental backup: only changed pages since the last checkpoint are included
sql — WAL and recovery
-- Check current WAL position (LSN = Log Sequence Number)
SELECT absdb_current_lsn();

-- Force a checkpoint (flush dirty pages, advance WAL)
CHECKPOINT;

-- Stream WAL to S3 for continuous backup
SELECT absdb_wal_archive_enable(
    target => 's3://my-bucket/absdb-wal/',
    region => 'ap-southeast-2'
);

MVCC Snapshot Isolation

Absolute DB implements Multi-Version Concurrency Control (MVCC). Each transaction sees a consistent snapshot of the database as of its start System Change Number (SCN). Writers never block readers; readers never block writers.

Up to 4,096 concurrent active transactions are supported. The undo log enables savepoints and multi-level rollback within a transaction. Old versions are cleaned up by a background VACUUM process that honours configured PITR retention windows.

sql — MVCC and snapshot queries
-- Every SELECT automatically uses a consistent snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM orders;  -- snapshot taken here
-- ... other work ...
SELECT count(*) FROM orders;  -- same count, even if rows were inserted
COMMIT;

-- Time-travel: read table as of a historical SCN
SELECT * FROM orders AS OF SCN 1048576;

-- Check isolation level
SHOW transaction_isolation;

For more on transactions, isolation levels, and advisory locks, see the Transactions & MVCC documentation.

Continue Reading

WAL Storage Format Indexing

Ready to run Absolute DB?

~154 KB binary  ·  zero external dependencies  ·  2,737 tests passing  ·  SQL:2023 100%

Download Free → View Pricing All Docs