2025's Wave of Database Innovation

You can thank AI.

Not another Vector DB article

Here is the decision rule I wish I had used earlier:

If your data can be rebuilt from files and users mostly read it, try an object-storage database first. If users are writing to it all day, start with a real database and stop trying to make S3 cosplay as one.

That is the useful line. Not “serverless is the future.” Not “vector databases changed everything.” Those sentences have already been printed on enough conference lanyards.

AI did change the shape of a lot of search problems. Suddenly small teams wanted semantic search, hybrid ranking, document chat, multimodal lookup, and analytics over files sitting in object storage. The old answer was “run Postgres with pgvector” or “operate OpenSearch/Elasticsearch” or “buy a managed search service.” Those are still good answers when the workload deserves them.

But many workloads do not. They are read-heavy, rebuildable, and tolerant of a short delay between content changing and search catching up. Documentation. Catalog snapshots. Static exports. Internal knowledge bases. Local analytics. Prototype RAG systems. For those, a new class of tools has made the boring architecture unusually powerful: build an index, store it as files, serve it over HTTP.

Snapshot note: the ecosystem is moving quickly. The star counts, feature labels, and performance numbers below are a September 2025 snapshot, not a timeless scoreboard. Treat them as orientation, then check the current docs before betting a production migration on any one cell.

A database by any other name

These serverless and CDN-capable datastores are useful for mid-scale cases, roughly 1,000 to 1,000,000 records or a few GB, where traditional database infrastructure can be more ceremony than value:

Pagefind (2022, ~4.5K ⭐): Pure static approach - compile once, search forever, zero backend requirements
Orama (2023, ~8K ⭐): Universal solution running everywhere from browsers to serverless functions
Chroma (2022, ~14K ⭐): AI-native, purpose-built for RAG applications
LanceDB (2023, ~4K ⭐): Enterprise multimodal capabilities with disk-based architecture
DuckDB-WASM (2019, ~23K ⭐): Full SQL analytics database running in browsers via WebAssembly

The common move is simple: keep the durable data in files or object storage, then query it from a browser, edge function, worker, or lightweight service. That does not eliminate complexity. It moves the complexity into build pipelines, index freshness, cache invalidation, and client capabilities. Which is a perfectly good trade when reads dominate.

Battle of the Checkboxes

Feature	Pagefind	Orama	Chroma	LanceDB	DuckDB-WASM
Full-Text Search	✅ Advanced stemming	✅ BM25, 30 languages	✅ SQLite FTS	✅ Tantivy	✅ Full SQL
Vector Search	❌	✅ Cosine similarity	✅ HNSW	✅ IVF_PQ, HNSW, GPU	⚠️ Extensions
AI/RAG Integrations	None	✅ Built-in pipeline	✅ LangChain, LlamaIndex	✅ Advanced reranking	⚠️ Manual setup
Storage	Static JSON/WASM	Memory + S3 plugins	Server-based*	S3-compatible Lance	WASM + S3/HTTP
Write Support	Build-time only	Full CRUD	Full CRUD	Full CRUD	Full SQL CRUD
Performance	Sub-100ms	0.0001ms - 100ms	Sub-100ms	3-5ms vector, 50ms FTS	10ms-1s (complex SQL)

*September 2025 snapshot: Chroma requires a server runtime and does not support direct S3 object storage in the way the object-file tools do (issue #1736).

Implementation examples

The syntax differences reveal the real split: build-time search, in-memory search, vector-native storage, multimodal tables, and browser SQL are not the same product category just because they all appear in AI demos.

Static site search with Pagefind

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>new PagefindUI({ element: "#search" });</script>

Enterprise-grade multimodal with LanceDB

Code to create a LanceDB table with automatic OpenAI embeddings:

import * as lancedb from "@lancedb/lancedb";
import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry } from "@lancedb/lancedb/embedding";
import { Utf8 } from "apache-arrow";

const db = await lancedb.connect("data/multimodal-db");
const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-ada-002" });

// Schema with automatic embedding generation
const documentsSchema = LanceSchema({
  text: func.sourceField(new Utf8()),
  vector: func.vectorField(),
  category: new Utf8()
});

const table = await db.createEmptyTable("documents", documentsSchema);
await table.add([
  { text: "machine learning concepts", category: "research" },
  { text: "deep learning fundamentals", category: "research" }
]);

Example of querying a LanceDB table:

import * as lancedb from "@lancedb/lancedb";
import "@lancedb/lancedb/embedding/openai";
// "Connect" to a URL path
const db = await lancedb.connect("data/multimodal-db");
const table = db.getTable("documents");

// SQL + vector search combination
const results = await table.search("machine learning concepts")
  .where("category = 'research'")
  .limit(10)
  .toArray();

console.log(results);

Universal search with Orama

import { create, insert, search } from '@orama/orama'

const db = create({
  schema: {
    title: 'string',
    content: 'string',
    embedding: 'vector[1536]'
  }
})

await insert(db, {
  title: 'Getting Started',
  content: 'Learn the basics',
  embedding: await generateEmbedding('Learn the basics')
})

const results = await search(db, {
  term: 'basics',
  mode: 'hybrid' // Combines text + vector search
})

DuckDB-WASM:

import * as duckdb from "https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@latest/dist/duckdb-browser.mjs";
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const worker = new Worker(bundle.mainWorker);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

const conn = await db.connect();
await conn.query(`create table t as select * from (values (1,'hybrid search'),(2,'edge sql')) as v(id,txt);`);
// Optional full-text:
await conn.query(`install fts; load fts; select * from t where match_bm25(txt, 'hybrid');`);

AI-native search with Chroma

import { ChromaClient } from "chromadb";

const client = new ChromaClient();
const collection = await client.createCollection({ name: "knowledge-base" });

await collection.add({
  documents: ["AI will transform software development"],
  metadatas: [{ source: "tech-blog", category: "AI" }],
  ids: ["doc1"]
});

const results = await collection.query({
  queryTexts: ["future of programming"],
  where: { category: "AI" },
  nResults: 5
});

Use Cases Guide

Choose Pagefind when:

Building documentation, blogs, or knowledge bases
Content updates weekly or less
Need zero operational overhead and perfect CDN caching
Example: Company docs with 10K+ pages updating monthly

Choose Orama when:

Building dashboards, e-commerce, or dynamic applications
Need real-time updates and sub-100ms performance
Want deployment flexibility from browsers to edge functions
Example: SaaS with dynamic product catalogs

Choose Chroma when:

Building RAG applications or AI knowledge bases
Need LangChain/LlamaIndex integrations
Semantic search is core functionality
Example: AI customer support bot

Choose LanceDB when:

Working with multimodal data (images, audio, video)
Need enterprise performance at massive scale
Complex analytics and reranking required
Example: Media platform with semantic video search

Choose DuckDB-WASM when:

Need full SQL capabilities in browsers or edge functions
Working with analytical workloads and complex queries
Want to process CSV/Parquet files directly from S3
Example: Business intelligence dashboard with ad-hoc SQL queries

The Decision Rule

The practical question is not “which database is best?”

The practical question is: what kind of change must the system absorb?

Rebuildable content: Pagefind, Orama snapshots, Lance files, DuckDB over Parquet. Keep it static until it hurts.
Frequent writes: Postgres, Chroma server, a managed search service, or a queue-backed indexing pipeline. You need coordination, not vibes.
User-specific results: use a real backend. Object storage is not an authorization model.
Analytics over files: DuckDB is absurdly useful. Let SQL do SQL things.
Multimodal or vector-heavy search: LanceDB and Chroma are worth testing against your actual data, not against a README benchmark.

The happy path is cheap. The edge cases decide the architecture.

The Bigger Picture

These tools reduce the minimum viable infrastructure for useful search. That matters. In 2020, “semantic search” often implied a pile of services, a lot of glue code, and someone explaining vector indexes in a meeting where half the room wanted lunch. In 2025, a small team can prototype the same product idea with files, embeddings, and a weekend.

That does not mean every search box should become a RAG system. It means the first version no longer has to inherit production infrastructure before it has production evidence.

Even AWS has been moving in this direction with S3-adjacent vector search work, which is a useful signal: object storage is no longer just the attic where old files go. It is becoming a query surface.

Start experimenting

Pick the update pattern first: build-time, hourly batch, live writes, or per-user results.
Prototype with the smallest honest tool: Pagefind for static HTML, DuckDB for analytical files, Orama for lightweight app search, LanceDB or Chroma for vector-heavy work.
Measure the ugly part: indexing time, freshness, bundle size, permissions, and the first query after a cold start.
Promote only when pain is real: a managed database is easier to justify after the file-based version shows exactly where it bends.

Check out my practical Pagefind guide for hands-on implementation, or explore the growing ecosystem of edge-native databases reshaping data at scale.

Disclaimer: I’ve used Pagefind for years and became a contributor in 2025. I’ve experimented with Orama and Chroma for smaller projects and am exploring LanceDB for larger AI applications. No financial ties to these projects—just keen interest in the evolving database landscape.