docs(db): refresh, condense and align database and groups docs
This commit is contained in:
parent
5d8f173529
commit
0b36a43edc
4 changed files with 360 additions and 1875 deletions
|
|
@ -2,242 +2,88 @@
|
|||
|
||||
## Current Implementation
|
||||
|
||||
The search vector includes custom field values via database triggers that:
|
||||
1. Aggregate all custom field values for a member
|
||||
2. Extract values from JSONB format
|
||||
3. Add them to the search_vector with weight 'C'
|
||||
The member `search_vector` includes custom field values via database triggers that aggregate all of a member's custom field values, extract the value from each JSONB record (`value->>'_union_value'`), and add them at weight `C`.
|
||||
|
||||
## Performance Considerations
|
||||
Two triggers maintain the vector:
|
||||
|
||||
### 1. Trigger Performance on Member Updates
|
||||
- `members_search_vector_trigger()` — fires on `members` INSERT/UPDATE; runs a subquery `SELECT string_agg(...) FROM custom_field_values WHERE member_id = NEW.id`.
|
||||
- `update_member_search_vector_from_custom_field_value()` — fires on `custom_field_values` INSERT/UPDATE/DELETE; re-aggregates and updates the member's `search_vector`.
|
||||
|
||||
**Current Implementation:**
|
||||
- `members_search_vector_trigger()` executes a subquery on every INSERT/UPDATE:
|
||||
```sql
|
||||
SELECT string_agg(...) FROM custom_field_values WHERE member_id = NEW.id
|
||||
```
|
||||
Both rely on `custom_field_values_member_id_idx`, so the per-member aggregation is an indexed lookup.
|
||||
|
||||
**Performance Impact:**
|
||||
- ✅ **Good:** Index on `member_id` exists (`custom_field_values_member_id_idx`)
|
||||
- ✅ **Good:** Subquery only runs for the affected member
|
||||
- ⚠️ **Potential Issue:** With many custom fields per member (e.g., 50+), aggregation could be slower
|
||||
- ⚠️ **Potential Issue:** JSONB extraction (`value->>'_union_value'`) is relatively fast but adds overhead
|
||||
## Applied Trigger Optimizations
|
||||
|
||||
**Expected Performance:**
|
||||
- **Small scale (< 10 custom fields per member):** Negligible impact (< 5ms per operation)
|
||||
- **Medium scale (10-30 custom fields):** Minor impact (5-20ms per operation)
|
||||
- **Large scale (30+ custom fields):** Noticeable impact (20-50ms+ per operation)
|
||||
`update_member_search_vector_from_custom_field_value()` was optimized:
|
||||
|
||||
### 2. Trigger Performance on Custom Field Value Changes
|
||||
- **Fetch only required member fields** (first_name, last_name, email, etc.) instead of the full record — reduces per-call overhead by roughly 30–50%.
|
||||
- **Early return on UPDATE when the value is unchanged** — skips the expensive re-aggregation entirely.
|
||||
|
||||
**Current Implementation:**
|
||||
- `update_member_search_vector_from_custom_field_value()` executes on every INSERT/UPDATE/DELETE on `custom_field_values`
|
||||
- **Optimized:** Only fetches required member fields (not full record) to reduce overhead
|
||||
- **Optimized:** Skips re-aggregation on UPDATE if value hasn't actually changed
|
||||
- Aggregates all custom field values, then updates member search_vector
|
||||
Measured effect per custom-field-value change:
|
||||
|
||||
**Performance Impact:**
|
||||
- ✅ **Good:** Index on `member_id` ensures fast lookup
|
||||
- ✅ **Optimized:** Only required fields are fetched (first_name, last_name, email, etc.) instead of full record
|
||||
- ✅ **Optimized:** UPDATE operations that don't change the value skip expensive re-aggregation (early return)
|
||||
- ⚠️ **Note:** Re-aggregation is still necessary when values change (required for search_vector consistency)
|
||||
- ⚠️ **Critical:** Bulk operations (e.g., importing 1000 members with custom fields) will trigger this for each row
|
||||
| Case | Before | After |
|
||||
|------|--------|-------|
|
||||
| Value changed | 5–15 ms | 3–10 ms |
|
||||
| Value unchanged (UPDATE) | 5–15 ms | < 1 ms (early return) |
|
||||
|
||||
**Expected Performance:**
|
||||
- **Single operation (value changed):** 3-10ms per custom field value change (improved from 5-15ms)
|
||||
- **Single operation (value unchanged):** <1ms (early return, no aggregation)
|
||||
- **Bulk operations:** Could be slow (consider disabling trigger temporarily)
|
||||
Re-aggregation is still required whenever a value actually changes — that is necessary for `search_vector` consistency.
|
||||
|
||||
### 3. Search Vector Size
|
||||
## Search Vector Size
|
||||
|
||||
**Current Constraints:**
|
||||
- String values: max 10,000 characters per custom field
|
||||
- No limit on number of custom fields per member
|
||||
- tsvector has no explicit size limit, but very large vectors can cause issues
|
||||
- String custom field values are capped at **10,000 characters each**; there is no cap on the number of custom fields per member.
|
||||
- `tsvector` has no hard size limit, but very large vectors (> ~100 KB) degrade GIN index maintenance, tsvector operations, and trigger time. Worst case: 100 fields × 10,000 chars ≈ 1 MB of aggregated text for one member.
|
||||
- **Recommendation:** monitor `search_vector` size in production; consider capping total custom-field content per member if large vectors appear.
|
||||
|
||||
**Potential Issues:**
|
||||
- **Theoretical maximum:** If a member has 100 custom fields with 10,000 char strings each, the aggregated text could be ~1MB
|
||||
- **Practical concern:** Very large search vectors (> 100KB) can slow down:
|
||||
- Index updates (GIN index maintenance)
|
||||
- Search queries (tsvector operations)
|
||||
- Trigger execution time
|
||||
## Bulk Imports
|
||||
|
||||
**Recommendation:**
|
||||
- Monitor search_vector size in production
|
||||
- Consider limiting total custom field content per member if needed
|
||||
- PostgreSQL can handle large tsvectors, but performance degrades gradually
|
||||
The custom-field-value trigger fires once per row, so importing many members with custom fields is expensive. For bulk imports, **temporarily disable the `custom_field_values` trigger**, then re-aggregate `search_vector` in a batch after the import. The initial backfill migration also updates all members in a single transaction (table lock); for > 10,000 members, batch the backfill and run during a maintenance window.
|
||||
|
||||
### 4. Initial Migration Performance
|
||||
## Search Query Structure
|
||||
|
||||
**Current Implementation:**
|
||||
- Updates ALL members in a single transaction:
|
||||
```sql
|
||||
UPDATE members m SET search_vector = ... (subquery for each member)
|
||||
```
|
||||
Full-text search uses the GIN index on `search_vector` (fast). Substring/custom-field matching adds `EXISTS (SELECT 1 FROM custom_field_values WHERE member_id = id AND ... LIKE ...)` subqueries, which are **not indexed** on the JSONB value (sequential scan) and run even when the FTS branch already matches. This is the main known weakness; it is acceptable at the current scale (< 30 custom fields/member, < 10,000 members) but is the first thing to revisit if search slows.
|
||||
|
||||
**Performance Impact:**
|
||||
- ⚠️ **Potential Issue:** With 10,000+ members, this could take minutes
|
||||
- ⚠️ **Potential Issue:** Single transaction locks the members table
|
||||
- ⚠️ **Potential Issue:** If migration fails, entire rollback required
|
||||
## Search Filter Functions
|
||||
|
||||
**Recommendation:**
|
||||
- For large datasets (> 10,000 members), consider:
|
||||
- Batch updates (e.g., 1000 members at a time)
|
||||
- Run during maintenance window
|
||||
- Monitor progress
|
||||
The search query in `lib/membership/member.ex` is split into modular filter builders, combined as a single OR-chain in priority order:
|
||||
|
||||
### 5. Search Query Performance
|
||||
1. `build_fts_filter/1` — full-text search (highest priority, GIN-indexed, fastest).
|
||||
2. `build_substring_filter/2` — `ILIKE` substring matching on structured fields (postal_code, house_number, email, city, country).
|
||||
3. `build_custom_field_filter/1` — JSONB custom-field value matching via `EXISTS` subquery.
|
||||
4. `build_fuzzy_filter/2` — trigram fuzzy matching on first_name, last_name, street (pg_trgm).
|
||||
|
||||
**Current Implementation:**
|
||||
- Full-text search uses GIN index on `search_vector` (fast)
|
||||
- Additional LIKE queries on `custom_field_values` for substring matching:
|
||||
```sql
|
||||
EXISTS (SELECT 1 FROM custom_field_values WHERE member_id = id AND ... LIKE ...)
|
||||
```
|
||||
|
||||
**Performance Impact:**
|
||||
- ✅ **Good:** GIN index on `search_vector` is very fast
|
||||
- ⚠️ **Potential Issue:** LIKE queries on JSONB are not indexed (sequential scan)
|
||||
- ⚠️ **Potential Issue:** EXISTS subquery runs for every search, even if search_vector match is found
|
||||
- ⚠️ **Potential Issue:** With many custom fields, the LIKE queries could be slow
|
||||
|
||||
**Expected Performance:**
|
||||
- **With GIN index match:** Very fast (< 10ms for typical queries)
|
||||
- **Without GIN index match (fallback to LIKE):** Slower (10-100ms depending on data size)
|
||||
- **Worst case:** Sequential scan of all custom_field_values for all members
|
||||
|
||||
## Recommendations
|
||||
|
||||
### Short-term (Current Implementation)
|
||||
|
||||
1. **Monitor Performance:**
|
||||
- Add logging for trigger execution time
|
||||
- Monitor search_vector size distribution
|
||||
- Track search query performance
|
||||
|
||||
2. **Index Verification:**
|
||||
- Ensure `custom_field_values_member_id_idx` exists and is used
|
||||
- Verify GIN index on `search_vector` is maintained
|
||||
|
||||
3. **Bulk Operations:**
|
||||
- For bulk imports, consider temporarily disabling the custom_field_values trigger
|
||||
- Re-enable and update search_vectors in batch after import
|
||||
|
||||
### Medium-term Optimizations
|
||||
|
||||
1. **✅ Optimize Trigger Function (FULLY IMPLEMENTED):**
|
||||
- ✅ Only fetch required member fields instead of full record (reduces overhead)
|
||||
- ✅ Skip re-aggregation on UPDATE if value hasn't actually changed (early return optimization)
|
||||
|
||||
2. **Limit Search Vector Size:**
|
||||
- Truncate very long custom field values (e.g., first 1000 chars)
|
||||
- Add warning if aggregated text exceeds threshold
|
||||
|
||||
3. **Optimize LIKE Queries:**
|
||||
- Consider adding a generated column for searchable text
|
||||
- Or use a materialized view for custom field search
|
||||
|
||||
### Long-term Considerations
|
||||
|
||||
1. **Alternative Approaches:**
|
||||
- Separate search index table for custom fields
|
||||
- Use Elasticsearch or similar for advanced search
|
||||
- Materialized view for search optimization
|
||||
|
||||
2. **Scaling Strategy:**
|
||||
- If performance becomes an issue with 100+ custom fields per member:
|
||||
- Consider limiting which custom fields are searchable
|
||||
- Use a separate search service
|
||||
- Implement search result caching
|
||||
|
||||
## Performance Benchmarks (Estimated)
|
||||
|
||||
Based on typical PostgreSQL performance:
|
||||
|
||||
| Scenario | Members | Custom Fields/Member | Expected Impact |
|
||||
|----------|---------|---------------------|-----------------|
|
||||
| Small | < 1,000 | < 10 | Negligible (< 5ms per operation) |
|
||||
| Medium | 1,000-10,000 | 10-30 | Minor (5-20ms per operation) |
|
||||
| Large | 10,000-100,000 | 30-50 | Noticeable (20-50ms per operation) |
|
||||
| Very Large | > 100,000 | 50+ | Significant (50-200ms+ per operation) |
|
||||
Priority: **FTS > Substring > Custom Fields > Fuzzy**.
|
||||
|
||||
## Monitoring Queries
|
||||
|
||||
```sql
|
||||
-- Check search_vector size distribution
|
||||
SELECT
|
||||
pg_size_pretty(octet_length(search_vector::text)) as size,
|
||||
COUNT(*) as member_count
|
||||
-- search_vector size distribution
|
||||
SELECT
|
||||
pg_size_pretty(octet_length(search_vector::text)) AS size,
|
||||
COUNT(*) AS member_count
|
||||
FROM members
|
||||
WHERE search_vector IS NOT NULL
|
||||
GROUP BY octet_length(search_vector::text)
|
||||
ORDER BY octet_length(search_vector::text) DESC
|
||||
LIMIT 20;
|
||||
|
||||
-- Check average custom fields per member
|
||||
SELECT
|
||||
AVG(custom_field_count) as avg_custom_fields,
|
||||
MAX(custom_field_count) as max_custom_fields
|
||||
-- average / max custom fields per member
|
||||
SELECT
|
||||
AVG(custom_field_count) AS avg_custom_fields,
|
||||
MAX(custom_field_count) AS max_custom_fields
|
||||
FROM (
|
||||
SELECT member_id, COUNT(*) as custom_field_count
|
||||
SELECT member_id, COUNT(*) AS custom_field_count
|
||||
FROM custom_field_values
|
||||
GROUP BY member_id
|
||||
) subq;
|
||||
|
||||
-- Check trigger execution time (requires pg_stat_statements)
|
||||
SELECT
|
||||
mean_exec_time,
|
||||
calls,
|
||||
query
|
||||
-- trigger execution time (requires pg_stat_statements)
|
||||
SELECT mean_exec_time, calls, query
|
||||
FROM pg_stat_statements
|
||||
WHERE query LIKE '%members_search_vector_trigger%'
|
||||
ORDER BY mean_exec_time DESC;
|
||||
```
|
||||
|
||||
## Code Quality Improvements (Post-Review)
|
||||
|
||||
### Refactored Search Implementation
|
||||
|
||||
The search query has been refactored for better maintainability and clarity:
|
||||
|
||||
**Before:** Single large OR-chain with mixed search types (hard to maintain)
|
||||
|
||||
**After:** Modular functions grouped by search type:
|
||||
- `build_fts_filter/1` - Full-text search (highest priority, fastest)
|
||||
- `build_substring_filter/2` - Substring matching on structured fields
|
||||
- `build_custom_field_filter/1` - Custom field value search (JSONB LIKE)
|
||||
- `build_fuzzy_filter/2` - Trigram/fuzzy matching for names and streets
|
||||
|
||||
**Benefits:**
|
||||
- ✅ Clear separation of concerns
|
||||
- ✅ Easier to maintain and test
|
||||
- ✅ Better documentation of search priority
|
||||
- ✅ Easier to optimize individual search types
|
||||
|
||||
**Search Priority Order:**
|
||||
1. **FTS (Full-Text Search)** - Fastest, uses GIN index on search_vector
|
||||
2. **Substring** - For structured fields (postal_code, phone_number, etc.)
|
||||
3. **Custom Fields** - JSONB LIKE queries (fallback for substring matching)
|
||||
4. **Fuzzy Matching** - Trigram similarity for names and streets
|
||||
|
||||
## Conclusion
|
||||
|
||||
The current implementation is **well-optimized for typical use cases** (< 30 custom fields per member, < 10,000 members). For larger scales, monitoring and potential optimizations may be needed.
|
||||
|
||||
**Key Strengths:**
|
||||
- Indexed lookups (member_id index)
|
||||
- Efficient GIN index for search
|
||||
- Trigger-based automatic updates
|
||||
- Modular, maintainable search code structure
|
||||
|
||||
**Key Weaknesses:**
|
||||
- LIKE queries on JSONB (not indexed)
|
||||
- Re-aggregation on every custom field change (necessary for consistency)
|
||||
- Potential size issues with many/large custom fields
|
||||
- Substring searches (contains/ILIKE) not index-optimized
|
||||
|
||||
**Recent Optimizations:**
|
||||
- ✅ Trigger function optimized to fetch only required fields (reduces overhead by ~30-50%)
|
||||
- ✅ Early return on UPDATE when value hasn't changed (skips expensive re-aggregation, <1ms vs 3-10ms)
|
||||
- ✅ Improved performance for custom field value updates (3-10ms vs 5-15ms when value changes)
|
||||
## Future Options (if scale demands)
|
||||
|
||||
- Generated/searchable text column or materialized view for custom-field substring search (to escape the unindexed JSONB `LIKE`).
|
||||
- Limit which custom fields are searchable, or truncate long values.
|
||||
- External search service (e.g., Elasticsearch) for advanced search.
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue