docs: document fuzzy search similarity threshold strategy

Explain the two-tier matching approach:
- % operator with server-wide threshold (0.3) for fast index scans
- similarity functions with configurable threshold (0.2) for edge cases
Add rationale for threshold value based on German name testing
Moritz 2025-12-11 13:37:49 +01:00
parent add855c8cb
commit 12f95c1998
Signed by: moritz
GPG key ID: 1020A035E5DD0824


@@ -42,6 +42,21 @@ defmodule Mv.Membership.Member do
# Module constants
@member_search_limit 10
# Similarity threshold for fuzzy name/address matching.
# Lower value = more results but less accurate (0.1-0.9)
#
# Fuzzy matching uses two complementary strategies:
# 1. % operator: Fast GIN-index-based matching using server-wide threshold (default 0.3)
# - Catches exact trigram matches quickly via index
# 2. similarity/word_similarity functions: Precise matching with this configurable threshold
# - Catches partial matches that % operator might miss
#
# Value 0.2 chosen based on testing with typical German names:
# - "Müller" vs "Mueller": similarity ~0.65 ✓
# - "Schmidt" vs "Schmitt": similarity ~0.75 ✓
# - "Wagner" vs "Wegner": similarity ~0.55 ✓
# - Random unrelated names: similarity ~0.15 ✗
@default_similarity_threshold 0.2
# Use constants from Mv.Constants for member fields
@@ -539,9 +554,16 @@ defmodule Mv.Membership.Member do
)
end
# Builds fuzzy/trigram matching filter for name and street fields.
# Uses pg_trgm extension with GIN indexes for performance.
#
# Two-tier matching strategy:
# - % operator: Uses server-wide pg_trgm.similarity_threshold (typically 0.3)
# for fast index-based initial filtering
# - similarity/word_similarity: Uses @default_similarity_threshold (0.2)
# for more lenient matching to catch edge cases
#
# Note: Requires trigram GIN indexes on first_name, last_name, street.
defp build_fuzzy_filter(query, threshold) do
expr(
fragment("? % first_name", ^query) or