docs: document fuzzy search similarity threshold strategy

Explain the two-tier matching approach:
- % operator with server-wide threshold (0.3) for fast index scans
- similarity functions with configurable threshold (0.2) for edge cases

Add rationale for threshold value based on German name testing
2025-12-11 13:37:49 +01:00
commit 12f95c1998
parent add855c8cb
1 changed file with 25 additions and 3 deletions
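The two-tier approach described in the commit message can be modelled as a plain predicate over a precomputed similarity score. This is only a sketch: the module and function names are hypothetical, the thresholds are the values quoted in the commit message, and the exact comparison used in the second tier is an assumption, since only part of the real filter is visible in the diff.

```elixir
defmodule TwoTierSketch do
  # Hypothetical module for illustration; not part of Mv.Membership.Member.
  # Values quoted in the commit message. The server-wide value is whatever
  # pg_trgm.similarity_threshold is set to on the database (0.3 by default).
  @server_threshold 0.3
  @configurable_threshold 0.2

  # Tier 1 models `query % column`: true once the score reaches the
  # server-wide threshold. Boundary handling here is an approximation.
  def operator_match?(score, server_threshold \\ @server_threshold),
    do: score >= server_threshold

  # Tier 2 models an explicit comparison of a similarity function result
  # against the configurable threshold (assumed form of the comparison).
  def function_match?(score, threshold \\ @configurable_threshold),
    do: score >= threshold

  # A row is kept when either tier accepts it.
  def fuzzy_match?(score),
    do: operator_match?(score) or function_match?(score)
end
```

Under these assumptions a score of 0.25 is rejected by the first tier but accepted by the second, which is the kind of edge case the lower threshold is meant to catch.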
--- a/lib/membership/member.ex
+++ b/lib/membership/member.ex
@@ -42,6 +42,21 @@ defmodule Mv.Membership.Member do
  # Module constants
  @member_search_limit 10
+  # Similarity threshold for fuzzy name/address matching.
+  # Lower values return more results but more false positives (range 0.1-0.9).
+  #
+  # Fuzzy matching uses two complementary strategies:
+  # 1. % operator: Fast GIN-index-based matching using server-wide threshold (default 0.3)
+  #    - Finds rows that meet that threshold quickly via the index
+  # 2. similarity/word_similarity functions: Precise matching with this configurable threshold
+  #    - Catches partial matches that the % operator might miss
+  #
+  # Value 0.2 chosen based on testing with typical German names:
+  # - "Müller" vs "Mueller": similarity ~0.65 ✓
+  # - "Schmidt" vs "Schmitt": similarity ~0.75 ✓
+  # - "Wagner" vs "Wegner": similarity ~0.55 ✓
+  # - Random unrelated names: similarity ~0.15 ✗
  @default_similarity_threshold 0.2
  # Use constants from Mv.Constants for member fields
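To sanity-check a threshold against name pairs like the ones listed in the comment, a trigram similarity can be approximated outside the database. The sketch below assumes the usual pg_trgm scheme (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space, compare trigram sets as intersection over union). The module name is hypothetical, and results may differ from what a given PostgreSQL installation reports, so values used in documentation are best confirmed with `SELECT similarity(...)` on the real server.

```elixir
defmodule TrigramSketch do
  # Hypothetical helper for illustration; an approximation of pg_trgm,
  # not the extension's actual implementation.

  # Builds the set of trigrams for a string.
  def trigrams(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.flat_map(fn word ->
      # Pad with two leading spaces and one trailing space, then take
      # every window of three consecutive characters.
      ("  " <> word <> " ")
      |> String.graphemes()
      |> Enum.chunk_every(3, 1, :discard)
      |> Enum.map(&Enum.join/1)
    end)
    |> MapSet.new()
  end

  # Shared trigrams divided by the total number of distinct trigrams.
  def similarity(a, b) do
    ta = trigrams(a)
    tb = trigrams(b)
    union = MapSet.size(MapSet.union(ta, tb))

    if union == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(ta, tb)) / union
    end
  end
end
```

Calling `TrigramSketch.similarity("Wagner", "Wegner")` scores a pair under this approximation. Comparing such scores with the values the database returns is a quick way to see how much headroom a threshold of 0.2 leaves.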
@@ -539,9 +554,16 @@ defmodule Mv.Membership.Member do
    )
  end
-  # Builds fuzzy/trigram matching filter for name and street fields
+  # Builds fuzzy/trigram matching filter for name and street fields.
-  # Uses pg_trgm extension with GIN indexes for performance
+  # Uses pg_trgm extension with GIN indexes for performance.
-  # Note: Requires trigram indexes on first_name, last_name, street
+  #
+  # Two-tier matching strategy:
+  # - % operator: Uses server-wide pg_trgm.similarity_threshold (typically 0.3)
+  #   for fast index-based initial filtering
+  # - similarity/word_similarity: Uses @default_similarity_threshold (0.2)
+  #   for more lenient matching to catch edge cases
+  #
+  # Note: Requires trigram GIN indexes on first_name, last_name, street.
  defp build_fuzzy_filter(query, threshold) do
    expr(
      fragment("? % first_name", ^query) or