Building AI-Powered Museum Search Without Starting Over

The client's teams were drowning in scattered databases with tools built for isolation, not cooperation. We built Lynceus — a unified AI-powered search layer that sits on top of existing infrastructure, reverse-engineers undocumented APIs, and replaces expensive LLM calls with vector geometry.

Cultural Heritage · AI Integration · API Reverse Engineering · Platform Architecture

The Brief

The client's teams work across scattered databases — each museum collection siloed behind its own platform, its own search interface, its own data schema. Their existing digital toolset was functional but solitary: a researcher could query one database at a time, but cross-referencing artists across collections meant opening tabs, copying results, and stitching insights together manually. Cooperation between teams was frictional. The tools weren't built for it.

They came to us with a clear problem: the infrastructure works in isolation and fails in coordination. They couldn't afford to throw it away and rebuild from scratch: years of data, established workflows, and institutional muscle memory were tied to the existing systems. What they needed was a layer on top: something that unifies access, understands what researchers are looking for, and makes the fragmented whole feel like a single intelligent tool.

We set out to build Lynceus — a unified search engine powered by AI that sits above the existing infrastructure without replacing it. The requirements:

Unified access — one search across all collections, regardless of which platform holds the data
AI-powered query understanding — recognize artist names, extract intent, route queries to the right source automatically
No ground-up rebuild — existing databases and scrapers stay in place. Lynceus is an integration layer, not a replacement.
Built to absorb — more museum sources are coming. Adding a new collection should be a configuration change, not an engineering project.

What follows is the story of how we built it — the dead ends, the pivots, and the decisions that turned out to matter more than the code.

No Documentation? No Problem.

The first source was Navigart, a platform powering the collections of several French regional art funds (FRAC). The public-facing site is a fully JavaScript-rendered single-page application. No API documentation. No public endpoints listed. No developer portal.

Our approach: open the browser dev tools and watch.

Within minutes, the Network tab revealed the engine underneath: every search on the frontend hit api.navigart.fr/123/artworks — a clean REST API returning structured JSON. The SPA was just a skin.

But discovering the endpoint was the easy part. The API supported filtering — collection, artist name, artwork type, acquisition year, acquisition mode — and each filter had its own encoding. We didn't know which formats the API accepted, and we couldn't guess.

So we tested them. One by one.

Using a programmatic HTTP client, we systematically tested each filter as a key:value pair against the live API:

Filter	Format	Result
collection	collection:Frac Bretagne	✅ Simple text match
mode_acq	mode_acq:achat en salon	✅ Simple text match
tree_domain_all	tree_domain_all:Architecture	✅ Top-level key match
year	year:1952__1959	✅ Range with double underscore
acquisition_year	acquisition_year:MIN__MAX	✅ Range
authors	authors:CORONA Maurizio	❌ Silent failure

Every filter worked as a clean key:value pair. Every filter except one.
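
The encodings in the table can be captured in a small helper. This is a minimal sketch rather than the project's code: `format_filter` and `format_range` are hypothetical names, and because the article does not say how the strings are attached to a request (parameter name, separator between several filters), the sketch stops at building the strings.

```python
from typing import Tuple, Union

# Double underscore, as observed for the year and acquisition_year filters.
RANGE_SEPARATOR = "__"

FilterValue = Union[str, Tuple[int, int]]


def format_range(low: int, high: int) -> str:
    """Encode an inclusive range as LOW__HIGH."""
    if low > high:
        raise ValueError(f"range start {low} is after range end {high}")
    return f"{low}{RANGE_SEPARATOR}{high}"


def format_filter(key: str, value: FilterValue) -> str:
    """Build one key:value filter string in the format the table lists."""
    if isinstance(value, tuple):
        low, high = value
        return f"{key}:{format_range(low, high)}"
    return f"{key}:{value}"


# Hypothetical usage with values taken from the table above.
filters = [
    format_filter("collection", "Frac Bretagne"),
    format_filter("year", (1952, 1959)),
]
```

Keeping the range encoding in one place means a newly discovered format touches a single function instead of every call site.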

When the "Right" Solution Is the Wrong One

The authors filter should have been the most important one. When a user searches for "CORONA atlantico," the system needs to know that CORONA refers to an artist — specifically CORONA Maurizio — and filter accordingly.

But the Navigart API's authors aggregation doesn't use human-readable names. It uses internal keys separated by a tab character:

authors:CORONA Maurizio↹CORONA Maurizio
authors:MONK Jonathan, NANNUCCI Maurizio↹MONK Jonathan, NANNUCCI Maurizio

These keys are auto-generated, non-obvious, and impossible for an LLM (or any external system) to predict from a user's query. You can't derive CORONA Maurizio↹CORONA Maurizio from "corona maurizio" — you'd need the autocomplete endpoint to return the exact key, which means an extra API call per query, which means latency.

Sometimes the feature you can't implement directly becomes the constraint that forces a better architecture.

The decision: remove the authors filter entirely. Instead, we used the q parameter (free-text search), which already searches across author names. For the AI-powered artist detection layer we were building, we'd handle artist identification before the query reached the API, routing the search differently once we knew who the user was looking for.

If the authors filter had worked, we'd have relied on it forever. Its failure pushed us toward a solution that ended up being faster, cheaper, and more accurate.

The AI Problem: When Machine Learning Makes Things Worse

With the API figured out, we turned to the AI layer. The client wanted the search engine to understand natural language — to recognize that "Picasso paintings 1950s" means artist=Picasso, type=painting, year=1950-1959.

The initial architecture used an LLM for artist detection. Every search query was sent to a language model with a prompt like "Is there an artist name in this query? If so, extract it." This worked — technically. In practice, it had three fatal flaws:

01 Cost: every query burned tokens, even simple ones like "Kandinsky"
02 Latency: the LLM call added 1-3 seconds to every search
03 Reliability: the model sometimes hallucinated artist names or missed obvious ones

For a search engine that needs to feel instant, this was untenable.

Trading GPUs for Geometry

The insight came from looking at what we already had: a known, finite set of artist names.

The Navigart FRAC collection contains exactly 7,100 artists. Centre Pompidou contains 7,200. These aren't unbounded — they're stable lists that change slowly. We don't need a language model to "understand" whether a query contains an artist name. We need a lookup.

The solution: load every artist name into a vector database (ChromaDB). Embed each name. For each incoming query, embed the query text and compute cosine similarity against the known artists. If the closest match exceeds a threshold, it's an artist hit.

The implementation:

sync_authors() fetches all artist names from the Navigart API on startup
ChromaDB embeds them using a local model — no API call
At query time, lookup_artist(query, source_db) returns the nearest match above threshold 0.40
If hit: strip the artist name from the query, inject the correct internal key as a filter
If no hit: pass the query as-is
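
The lookup steps above can be sketched without any external service. This is an illustration under loud assumptions: the project embeds names with ChromaDB's local model, whereas this sketch swaps in a toy embedding (character trigram counts over accent-folded text) so that it runs with the standard library alone. Similarity scores from the toy embedding are not comparable to the project's, so the 0.40 default here is illustrative, and `lookup_artist` takes a plain list of names instead of the `source_db` argument mentioned above.

```python
import math
import unicodedata
from collections import Counter
from typing import Iterable, Optional


def fold(text: str) -> str:
    """Lowercase and strip accents, so 'HÄUSERMANN' and 'hausermann' agree."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def embed(text: str) -> Counter:
    """Toy embedding (assumption): character trigram counts, not a real model."""
    padded = f"  {fold(text)} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup_artist(query: str, known_artists: Iterable[str],
                  threshold: float = 0.40) -> Optional[str]:
    """Return the closest known artist, or None if nothing clears the threshold."""
    query_vec = embed(query)
    best_name, best_score = None, 0.0
    for name in known_artists:
        score = cosine(query_vec, embed(name))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Names taken from the examples in this article.
KNOWN_ARTISTS = ["CORONA Maurizio", "HÄUSERMANN Pascal", "PARENT Claude"]
```

The shape is the point: a closed set of names, one similarity score per candidate, and a threshold that turns "nearest neighbour" into "hit or no hit". A vector database replaces the linear scan with an index, but the decision logic stays the same.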

The results, tested against real queries:

Query	Match	Result
CORONA atlantico	CORONA Maurizio	✅ Artist detected, search narrowed
häusermann	HÄUSERMANN Pascal	✅ Accent-insensitive match
hausermann	HÄUSERMANN Pascal	✅ Works without accents too
PARENT Claude	PARENT Claude	✅ Exact match
photographie	none	✅ Correctly rejected — not an artist

Zero LLM calls. Zero per-query cost. Added latency measured in milliseconds rather than seconds. And the false-positive rejection is critical: "photographie" could have matched an artist name by coincidence. The 0.40 threshold was empirically tuned to prevent this.

This is the kind of trade-off that defines good AI engineering: using ML only where it adds value, and using simpler tools everywhere else. Embeddings are the right tool for closed-set matching. Language models are the right tool for open-ended reasoning. We needed the former.

Building the Platform, Not the Product

With the FRAC source working, the client dropped the next requirement: a second collection, Centre Pompidou's, also on Navigart, at collection.centrepompidou.fr. More museums on Navigart would follow.

The existing code was a single hardcoded module — one scraper, one parser, one set of constants. Adding Pompidou meant either duplicating the entire module or restructuring.

We chose restructuring. The roadmap was five epics, seventeen tasks:

Epic	Tasks	Content
1 — Rename & Extract	1.1–1.5	Shared code package with parameterized scraper, parser, and factory function
2 — Migrate	2.1–2.3	Existing source rewritten to use factory (15 lines instead of duplicated code)
3 — Add Pompidou	3.1–3.4	New source as config.json + 15-line init, zero new code
4 — Artist Tracking	4.1–4.3	SQLite cross-reference: which artists appear in which collections
5 — Verify	5.1–5.2	End-to-end validation that both sources work independently and together

The key architectural decision: a factory function that reads a config file and returns a fully-wired module with search, filters, context, and author extraction. The config specifies the collection ID, API base URL, detail URL pattern, and filter schema.

Adding a new Navigart museum source now requires:

01 One config file with collection-specific parameters
02 One init file: 15 lines of boilerplate
03 Zero new code

The search handler automatically picks up new modules through the registry — no code changes, no configuration beyond the module directory.
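
The factory-and-registry arrangement described above might look roughly like the following. Treat it as a sketch under assumptions: `SourceModule`, `create_source`, `REGISTRY`, the config field names, and the example.org URLs are all hypothetical, and the real modules also wire in search, context, and author extraction, which are omitted here.

```python
import json
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SourceModule:
    """One museum source, fully described by its configuration."""
    name: str
    collection_id: str
    api_base_url: str
    detail_url_pattern: str
    filters: Tuple[str, ...]

    def detail_url(self, artwork_id: str) -> str:
        """Fill the per-source URL pattern for a single artwork."""
        return self.detail_url_pattern.format(artwork_id=artwork_id)

    def supports(self, filter_key: str) -> bool:
        """Check a filter key against this source's filter schema."""
        return filter_key in self.filters


# The search handler would iterate over this registry (assumed structure).
REGISTRY: Dict[str, SourceModule] = {}


def create_source(name: str, config_text: str) -> SourceModule:
    """Factory: parse a JSON config and register the resulting module."""
    raw = json.loads(config_text)
    module = SourceModule(
        name=name,
        collection_id=str(raw["collection_id"]),
        api_base_url=raw["api_base_url"].rstrip("/"),
        detail_url_pattern=raw["detail_url_pattern"],
        filters=tuple(raw["filters"]),
    )
    REGISTRY[name] = module
    return module


# Hypothetical config; every value below is a placeholder.
EXAMPLE_CONFIG = """
{
  "collection_id": "demo",
  "api_base_url": "https://api.example.org/",
  "detail_url_pattern": "https://collection.example.org/artwork/{artwork_id}",
  "filters": ["collection", "year", "acquisition_year"]
}
"""

source = create_source("example_museum", EXAMPLE_CONFIG)
```

Because everything source-specific lives in the config, a second museum is another `create_source` call with a different file, and the shared code never branches on the source name.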

From Monolith to Pipeline: The Sync Problem Nobody Sees

Users never see the sync pipeline. But it's the infrastructure that makes everything else possible — it's what loads 14,300 artist names into ChromaDB so that vector lookups work.

The initial sync was a monolithic function: fetch 7,100 authors from the API, embed all of them into ChromaDB, write them to the artists database. One function call. One timeout clock.

ChromaDB embeddings for 7,000 names take approximately 500 seconds. The job queue's default timeout was 30 seconds.

The first fix was a timeout hack: timeout_ms=1,200,000 — twenty minutes. It worked. And like all hacks, it worked just well enough to become dangerous. A stuck job would block the queue for twenty minutes before timing out. No retry granularity. No progress visibility. No way to resume a partial sync.

The proper fix: decompose the monolith.

BEFORE (one giant job per module):
sync_authors("navigart_frac")
  → fetch 7,100 authors from API         (~4s)
  → upsert 15 batches to ChromaDB        (~500s)
  → bulk insert to artists.db            (~10s)
  Total: ~515s in one job, one timeout clock

AFTER (coordinator + batch workers):
sync_authors("navigart_frac")            ← completes in ~5s
  → fetch 7,100 authors from API
  → stage in artists.db (fast — ~10s for 7k)
  → enqueue 15 × chroma_batch jobs

chroma_batch(offset=0, batch_size=500)   ← ~33s each, independent
chroma_batch(offset=500, batch_size=500)
  ...
  → upsert one batch to ChromaDB

The coordinator finishes in seconds. Each batch job handles 500 authors and completes in ~33 seconds. If a batch fails, only that batch retries — not the entire module. The queue gets retry, dead-letter, and per-job observability for free.

This is the executor pattern: a fast coordinator that stages data and dispatches idempotent workers. Each worker is small enough to complete within any reasonable timeout. Each worker can be retried independently. And the coordinator never holds a long-lived connection or blocks the queue.
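
A stripped-down version of the coordinator and batch workers, with in-memory stand-ins, shows why retries become safe. The assumptions are significant: a list stands in for the staging table in artists.db, a dict for the ChromaDB collection, and a deque for the job queue. The function names follow the diagram above, but the bodies and the `author-N` ID scheme are illustrative.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

BATCH_SIZE = 500

staged_authors: List[str] = []               # stand-in for the staging table
vector_store: Dict[str, str] = {}            # stand-in for the ChromaDB collection
job_queue: Deque[Tuple[int, int]] = deque()  # stand-in for the job queue


def sync_authors(authors: List[str], batch_size: int = BATCH_SIZE) -> int:
    """Coordinator: stage every name, then enqueue one small job per batch."""
    staged_authors.clear()
    staged_authors.extend(authors)
    enqueued = 0
    for offset in range(0, len(staged_authors), batch_size):
        job_queue.append((offset, batch_size))
        enqueued += 1
    return enqueued


def chroma_batch(offset: int, batch_size: int) -> int:
    """Worker: upsert one slice. Writes are keyed by ID, so a retry is harmless."""
    batch = staged_authors[offset:offset + batch_size]
    for position, name in enumerate(batch, start=offset):
        vector_store[f"author-{position}"] = name
    return len(batch)


# 7,100 placeholder names split into batches of 500 gives 15 jobs.
authors = [f"ARTIST {i}" for i in range(7100)]
jobs_enqueued = sync_authors(authors)

while job_queue:
    chroma_batch(*job_queue.popleft())

# Simulate a retry of the first batch: the store does not grow.
chroma_batch(0, BATCH_SIZE)
```

At roughly 33 seconds per batch, fifteen batches account for about 495 seconds, in line with the ~500 seconds the monolith spent on embeddings. The work is the same; it is simply cut into pieces that each fit inside an ordinary timeout.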

The Methodology: Smart Conception, Dumb Execution

Looking back, the consistent thread across every decision was a single principle: invest heavily in understanding the problem, then execute mechanically.

We spent more time testing Navigart filter encodings than writing the scraper code. That testing revealed the authors filter dead end before we built architecture around it.
We wrote a 17-task roadmap for the multi-source refactor before writing a single line of code. Each task was atomic and verifiable: "run this command, expect this output."
We empirically tuned the ChromaDB similarity threshold (0.40) against real queries rather than guessing.
We decomposed the sync pipeline only after the monolith proved problematic in production — not preemptively.

The "AI-powered" part of this project isn't the flashiest component. It's a vector lookup that costs nothing and runs in milliseconds. The actual intelligence went into the architecture: knowing where to apply AI, where to avoid it, and how to structure the system so that adding the next museum source is a configuration change, not an engineering project.

What Shipped

Component	What it does
Multi-source architecture	Factory pattern — new Navigart museums are config-only additions
Vector artist detection	ChromaDB lookup, 0.40 threshold, zero LLM calls, accent-insensitive
LLM filter extraction	Remaining NLP for structured filter extraction from natural language
Executor-pattern sync	Coordinator + batch workers, ~33s per batch, independent retries
Artist cross-referencing	SQLite tracking of which artists appear in which collections
Integration validation	Automated checks: module output, ChromaDB counts, database completeness

Scale: 2 museum sources, 14,300 indexed artists, 128,000+ artworks searchable. Architecture ready for the next source without engineering work.

Next engagement

Every integration is different. The methodology isn't.

If your organization is sitting on fragmented data sources that need unified access — without rebuilding everything — we should talk.
