Search Speed Optimization
Postgres full text search performance.
I've been exploring different methods for search engines off and on over the years. Apache's CouchDB and MongoDB both offer forms of natural language search (for Mongo I think you have to pay for the good one). The issue is that when you are dealing with a search engine, NoSQL gets too messy too fast, and you want FAST results.
Postgres is nice because it's fast and free. And has a lot of extensions.
After getting the crawler working at a semi-decent pace, the index of newly crawled sites grew to millions of rows instead of just ~50K.
The previous method was to match against the search vector and then run ts_rank on the returned results. This worked really well with a smaller index, and even with a larger one it did return the best results in terms of what you are actually looking for. But it's old and crusty and doesn't scale well at all.
```sql
-- This innocent-looking line was killing performance
ts_rank(search_vector, plainto_tsquery('english', $1)) * (0.5 + (page_rank * 0.5))
```
- Typical search: 5-15 seconds
- Complex queries: 20-30+ seconds
Understanding ts_rank: Why It's Slow
PostgreSQL's ts_rank function calculates text relevance scores by analyzing:
- Word frequency
- Term proximity
- Document length
- Position weights
For a search term matching 10,000+ documents, ts_rank was:
- Finding all matching documents ✅ (fast)
- Calculating individual relevance scores for each match ❌ (very slow)
- Sorting by those calculated scores 💀💀💀💀
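In query form, the slow pattern looks roughly like this (a sketch; table and column names are assumptions, not the exact production query):

```sql
-- Sketch of the slow pattern: every matching row gets a ts_rank score
-- before anything can be sorted or limited.
SELECT url,
       ts_rank(search_vector, plainto_tsquery('english', $1)) AS rank
FROM search_index
WHERE search_vector @@ plainto_tsquery('english', $1)
ORDER BY rank DESC   -- forces ts_rank over the entire match set
LIMIT 20;
```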
Imagine you type the word "page". Most sites are going to have that word somewhere in their content, which means those sorts of queries ended up ranking millions of rows. That isn't performant. You can grab a subsection and then rank just those, which weirdly worked quite well, but it always ran the risk of returning different results for the same search term.
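That subsection-then-rank approach looks roughly like this (again a sketch with assumed names), where only a capped candidate set ever gets scored:

```sql
-- Sketch: only rank a capped candidate set instead of every match.
-- The LIMIT on the inner query is also what makes results non-deterministic.
SELECT url,
       ts_rank(search_vector, plainto_tsquery('english', $1)) AS rank
FROM (
    SELECT url, search_vector
    FROM search_index
    WHERE search_vector @@ plainto_tsquery('english', $1)
    LIMIT 10000  -- cap how many rows ever get scored
) AS candidates
ORDER BY rank DESC
LIMIT 20;
```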
The current solution is custom SQL that took a bit of tweaking, but overall I think it hits a good ratio of performance to relevance. No one will use a site that returns the best result in 30 seconds; a slightly worse result in half a second is better.
How the algorithm works
- ~40% weight: Pre-computed page_rank (quality signal) <-- The scrapers actually help with this one. Too many ads and you get scored less. All the usual things.
- ~40% bonus: Exact title matches (usually most relevant)
- ~20% bonus: Description matches
- ~10% bonus: Recent content boost
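Roughly what that weighting looks like in SQL (a sketch only; title, description, published_at and the exact boosts are assumptions, not the production query):

```sql
-- Sketch of the weighting: cheap comparisons instead of per-row ts_rank.
-- Column names title, description, published_at are assumed.
SELECT url, title,
       (page_rank * 0.4)
       + CASE WHEN title ILIKE '%' || $1 || '%' THEN 0.4 ELSE 0 END
       + CASE WHEN description ILIKE '%' || $1 || '%' THEN 0.2 ELSE 0 END
       + CASE WHEN published_at > now() - interval '90 days' THEN 0.1 ELSE 0 END
       AS score
FROM search_index
WHERE search_vector @@ plainto_tsquery('english', $1)
ORDER BY score DESC
LIMIT 10;
```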
Performance Results
| Query Type | Before | After | Improvement |
|---|---|---|---|
| Simple queries | 5-12s | 500ms-1s | 90%+ faster |
| Complex queries | 20-30s | 1-2s | 95%+ faster |
| Worst case | 30s+ | ~5s max | 85%+ faster |
2. Parallel Query Execution
The second, but less important, thing slowing things down was sequential processing, especially for writes. Each search kicks off multiple queries, and they were running one after another.
Writes currently go to a master node in the EU, while reads are mostly served from North America. Eventually they will be load balanced between them, but writes do take a bit of time.
```js
// Run search and count queries in parallel instead of sequentially
const [searchResults, countResults] = await Promise.all([
  searchQuery(query),
  countQuery(query)
]);
```
3. Smart Pagination Limits
Limiting the actual number of pages you can see stops the unnecessary counts. Typically this comes up in simpler queries with lots of results.
```sql
-- Limit count queries to prevent expensive full table scans
SELECT COUNT(*) FROM (
  SELECT 1 FROM search_index
  WHERE conditions
  LIMIT 50 -- Max 5 pages
) t
```