Search Speed Optimization
Postgres full text search performance.
I've been exploring different methods for search engines off and on over the years. Apache's CouchDB and MongoDB both offer forms of natural language search (for Mongo I think you have to pay for the good one). The issue is that when you are dealing with a search engine, NoSQL gets too messy too fast, and you want FAST results.
Postgres is nice because it's fast and free. And has a lot of extensions.
After getting the crawler working at a semi-decent pace, the index of newly crawled sites grew to millions of rows instead of just ~50K.
The previous method was to match against the search vector and then run ts_rank on the returned results. This worked really well with a smaller index, and even with a larger one it did return the best results in terms of what you are actually looking for. But it's old and crusty and doesn't scale well at all.
```sql
-- This innocent-looking line was killing performance
ts_rank(search_vector, plainto_tsquery('english', $1)) * (0.5 + (page_rank * 0.5))
```
- Typical search: 5-15 seconds
- Complex queries: 20-30+ seconds
Understanding ts_rank: Why It's Slow
PostgreSQL's ts_rank function calculates text relevance scores by analyzing:
- Word frequency
- Term proximity
- Document length
- Position weights
For a search term matching 10,000+ documents, ts_rank was:
- Finding all matching documents ✅ (fast)
- Calculating individual relevance scores for each match ❌ (very slow)
- Sorting by those calculated scores 💀💀💀💀
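In query form, the slow pattern looks roughly like this (a sketch; table and column names are assumptions, not the exact production query):

```sql
-- Sketch of the slow pattern: every matching row gets a ts_rank score
-- before anything can be sorted or limited.
SELECT url,
       ts_rank(search_vector, plainto_tsquery('english', $1)) AS rank
FROM search_index
WHERE search_vector @@ plainto_tsquery('english', $1)
ORDER BY rank DESC   -- forces ts_rank over the entire match set
LIMIT 20;
```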
Imagine you type the word "page". Most sites are going to have that word somewhere in their content, which means those sorts of queries ended up ranking millions of rows. That isn't performant. You can grab a subsection and then rank just those, which weirdly worked quite well, but it always ran the risk of returning different results for the same search term.
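That subsection-then-rank approach looks roughly like this (again a sketch with assumed names), where only a capped candidate set ever gets scored:

```sql
-- Sketch: only rank a capped candidate set instead of every match.
-- The LIMIT on the inner query is also what makes results non-deterministic.
SELECT url,
       ts_rank(search_vector, plainto_tsquery('english', $1)) AS rank
FROM (
    SELECT url, search_vector
    FROM search_index
    WHERE search_vector @@ plainto_tsquery('english', $1)
    LIMIT 10000  -- cap how many rows ever get scored
) AS candidates
ORDER BY rank DESC
LIMIT 20;
```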
The current solution is custom SQL that took a bit of tweaking, but overall I think it hits a good ratio of performance to relevance. No one will use a site that returns the best result in 30 seconds; a slightly worse result in half a second is better.
How the algorithm works
- ~40% weight: Pre-computed page_rank (quality signal) <-- The scrapers actually help with this one. Too many ads and you get scored less. All the usual things.
- ~40% bonus: Exact title matches (usually most relevant)
- ~20% bonus: Description matches
- ~10% bonus: Recent content boost
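Roughly what that weighting looks like in SQL (a sketch only; title, description, published_at and the exact boosts are assumptions, not the production query):

```sql
-- Sketch of the weighting: cheap comparisons instead of per-row ts_rank.
-- Column names title, description, published_at are assumed.
SELECT url, title,
       (page_rank * 0.4)
       + CASE WHEN title ILIKE '%' || $1 || '%' THEN 0.4 ELSE 0 END
       + CASE WHEN description ILIKE '%' || $1 || '%' THEN 0.2 ELSE 0 END
       + CASE WHEN published_at > now() - interval '90 days' THEN 0.1 ELSE 0 END
       AS score
FROM search_index
WHERE search_vector @@ plainto_tsquery('english', $1)
ORDER BY score DESC
LIMIT 10;
```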
Performance Results
| Query Type | Before | After | Improvement |
|---|---|---|---|
| Simple queries | 5-12s | 500ms-1s | 90%+ faster |
| Complex queries | 20-30s | 1-2s | 95%+ faster |
| Worst case | 30s+ | ~5s max | 85%+ faster |
2. Parallel Query Execution
The second, but less important, thing slowing things down was sequential processing, especially for writes. Each search kicks off multiple queries, and they were running one after another.
Writes currently go to a master node in the EU, while reads are mostly served from North America. Eventually they will be load balanced between them, but writes do take a bit of time.
```js
// Run search and count queries in parallel instead of sequentially
const [searchResults, countResults] = await Promise.all([
  searchQuery(query),
  countQuery(query)
]);
```
3. Smart Pagination Limits
Limiting the actual number of pages you can see stops the unnecessary counts. Typically this comes up in simpler queries with lots of results.
```sql
-- Limit count queries to prevent expensive full table scans
SELECT COUNT(*) FROM (
  SELECT 1 FROM search_index
  WHERE conditions
  LIMIT 50 -- Max 5 pages
) t
```