Source: Everything here comes from EDA Rounds 01-07, not guesswork.
Rule: Every feature must trace back to a source insight. No feature is added without evidence.
📊 STATUS
| Status | Count |
|---|---|
| 💡 Proposed | 25 |
| ✅ Implemented | 0 |
| ❌ Rejected | 0 |
🏠 USER FEATURES (from fact_user_events)
F-001: user_event_count
- Insight: INS-002 (161M events, power law activity)
- Formula: `count(*) WHERE user_id = X`
- Priority: HIGH — Basic engagement signal
F-002: user_contact_rate
- Insight: INS-006, INS-008 (is_contact=1 breakdown)
- Formula: `sum(is_contact) / count(*)`
- Priority: HIGH — Conversion propensity
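The two engagement formulas above (F-001 and F-002) can be sketched in plain Python. This is a minimal illustration, not the project's pipeline: it assumes events arrive as dicts with `user_id` and `is_contact` keys, and the function name `user_engagement` is hypothetical.

```python
from collections import defaultdict

def user_engagement(events):
    """Per-user event count (F-001) and contact rate (F-002).

    Assumption: each event is a dict with 'user_id' and 'is_contact' (0 or 1).
    """
    counts = defaultdict(int)
    contacts = defaultdict(int)
    for e in events:
        counts[e["user_id"]] += 1
        contacts[e["user_id"]] += e["is_contact"]
    return {
        u: {"user_event_count": n, "user_contact_rate": contacts[u] / n}
        for u, n in counts.items()
    }

sample_events = [
    {"user_id": "u1", "is_contact": 0},
    {"user_id": "u1", "is_contact": 1},
    {"user_id": "u2", "is_contact": 0},
]
engagement = user_engagement(sample_events)
```

Because every user in the result has at least one event, the division never hits zero.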
F-003: user_category_preference
- Insight: INS-009 (category contact rates differ)
- Formula: `mode(category) WHERE user_id = X`
- Priority: HIGH — Category matching
F-004: user_city_preference
- Insight: INS-017 (HCM+HN = 81%)
- Formula: `mode(city_name) WHERE user_id = X`
- Priority: HIGH — Geographic matching
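A per-user mode, as used by F-003 and F-004, might look like the sketch below. The helper name `user_mode` is hypothetical, and the tie-break (first value seen wins) is an assumption; the source does not say how ties should be resolved.

```python
from collections import Counter, defaultdict

def user_mode(events, field):
    """Most frequent non-null value of `field` per user.

    Assumption: ties are broken by first-seen order, which is how
    Counter.most_common orders equal counts.
    """
    seen = defaultdict(Counter)
    for e in events:
        value = e.get(field)
        if value is not None:
            seen[e["user_id"]][value] += 1
    return {u: c.most_common(1)[0][0] for u, c in seen.items()}

sample_events = [
    {"user_id": "u1", "category": 1010, "city_name": "HCM"},
    {"user_id": "u1", "category": 1010, "city_name": "HN"},
    {"user_id": "u1", "category": 1050, "city_name": "HCM"},
    {"user_id": "u2", "category": None, "city_name": "HN"},
]
category_pref = user_mode(sample_events, "category")
city_pref = user_mode(sample_events, "city_name")
```

Users whose values are all null (here `u2` for `category`) are simply absent from the result, so downstream code needs a default for them.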
F-005: user_device_primary
- Insight: Round 03 (Desktop 32%, MSite 27%, iOS 25%, Android 16%)
- Formula: `mode(device)`
- Priority: LOW
F-006: user_avg_dwell_sec
- Insight: INS-005 (dwell is stored in milliseconds despite the `dwell_time_sec` column name; median 17.9 s)
- Formula: `avg(dwell_time_sec / 1000)`
- Priority: MEDIUM — Browse depth
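Since the column name and its unit disagree, a short sketch of F-006 helps make the conversion explicit. It assumes `dwell_time_sec` holds milliseconds (per INS-005) and that events without a dwell value should be skipped; the function name is hypothetical.

```python
from collections import defaultdict

def user_avg_dwell_sec(events):
    """Average dwell per user in seconds.

    Assumption: 'dwell_time_sec' is stored in milliseconds, so we divide by 1000.
    Events with a missing dwell value are ignored rather than counted as zero.
    """
    total_ms = defaultdict(float)
    n = defaultdict(int)
    for e in events:
        ms = e.get("dwell_time_sec")
        if ms is None:
            continue
        total_ms[e["user_id"]] += ms
        n[e["user_id"]] += 1
    return {u: total_ms[u] / n[u] / 1000 for u in total_ms}

sample_events = [
    {"user_id": "u1", "dwell_time_sec": 17900},
    {"user_id": "u1", "dwell_time_sec": 2100},
    {"user_id": "u1", "dwell_time_sec": None},
]
avg_dwell = user_avg_dwell_sec(sample_events)
```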
F-007: user_session_count
- Insight: Round 04 (session analysis)
- Formula: `nunique(session_id)`
- Priority: MEDIUM
F-008: user_recency_days
- Insight: Round 03 (temporal patterns)
- Formula: `(cutoff - max(event_ts)).days`
- Priority: HIGH — Recency matters for recommendations
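F-007 and F-008 can be computed in one pass, as in the sketch below. Dropping events later than the cutoff is an assumption added here to avoid leaking future data; the source formula only states `cutoff - max(event_ts)`. The function name and sample dates are illustrative.

```python
from datetime import datetime

def user_sessions_and_recency(events, cutoff):
    """Return {user_id: (session_count, recency_days)}.

    Assumption: events after `cutoff` are ignored to avoid leakage.
    """
    sessions = {}
    last_seen = {}
    for e in events:
        if e["event_ts"] > cutoff:
            continue
        u = e["user_id"]
        sessions.setdefault(u, set()).add(e["session_id"])
        if u not in last_seen or e["event_ts"] > last_seen[u]:
            last_seen[u] = e["event_ts"]
    return {u: (len(sessions[u]), (cutoff - last_seen[u]).days) for u in last_seen}

cutoff = datetime(2024, 1, 31)
sample_events = [
    {"user_id": "u1", "session_id": "s1", "event_ts": datetime(2024, 1, 10, 9, 0)},
    {"user_id": "u1", "session_id": "s1", "event_ts": datetime(2024, 1, 10, 9, 5)},
    {"user_id": "u1", "session_id": "s2", "event_ts": datetime(2024, 1, 21, 12, 0)},
    {"user_id": "u1", "session_id": "s3", "event_ts": datetime(2024, 2, 2, 8, 0)},
]
sessions_recency = user_sessions_and_recency(sample_events, cutoff)
```

Note that `.days` truncates: a last event 9.5 days before the cutoff yields a recency of 9.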
F-009: user_is_cold_start
- Insight: INS-004 (64% cold-start!!!)
- Formula: `user_id NOT IN train_events`
- Priority: CRITICAL — 64% of test users
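The cold-start flag is a set-membership check. A minimal sketch, assuming test users come as a list of ids and training events as dicts with a `user_id` key (the function name is hypothetical):

```python
def cold_start_flags(test_user_ids, train_events):
    """Map each test user to True if they never appear in the training events."""
    train_users = {e["user_id"] for e in train_events}
    return {u: u not in train_users for u in test_user_ids}

flags = cold_start_flags(
    ["u1", "u2", "u3"],
    [{"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u9"}],
)
cold_share = sum(flags.values()) / len(flags)
```

Tracking `cold_share` on the real test set is a cheap sanity check against the 64% figure from INS-004.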
📦 ITEM FEATURES (from dim_listing + fact_listing_snapshot)
F-010: item_images_count
- Insight: INS-010 (16+ images = 14.56 avg leads vs 5.04 for null)
- Formula: Direct column
- Priority: MEDIUM
F-011: item_completeness_score
- Insight: INS-012 (more filled fields → more leads)
- Formula: `sum(is_not_null(area, beds, baths, direction, legal, furnishing, house_type, floors))`
- Priority: MEDIUM
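The completeness score counts how many of the eight listed fields are filled. The sketch below treats only `None` (or a missing key) as null; whether empty strings or zeros should also count as missing is not stated in the source and would need checking against `dim_listing`.

```python
COMPLETENESS_FIELDS = (
    "area", "beds", "baths", "direction",
    "legal", "furnishing", "house_type", "floors",
)

def item_completeness_score(listing):
    """Number of completeness fields that are not null (range 0 to 8).

    Assumption: a field is null when it is None or absent from the dict.
    """
    return sum(listing.get(field) is not None for field in COMPLETENESS_FIELDS)

sparse_listing = {"area": 52.0, "beds": 2, "baths": None}
full_listing = {field: 1 for field in COMPLETENESS_FIELDS}
```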
F-012: item_avg_views_24h
- Insight: Round 05 (listing performance)
- Formula: `mean(views_24h)` from snapshot
- Priority: HIGH — Popularity signal
F-013: item_listing_age
- Insight: INS-011 (performance decays after 2 weeks)
- Formula: `max(listing_age_days)`
- Priority: HIGH — Freshness decay
F-014: item_is_zombie
- Insight: INS-014 (8.4% zombie listings)
- Formula: `age>60 & views<5 & contacts=0`
- Priority: HIGH — Exclusion flag
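The zombie rule is a conjunction of three thresholds. The sketch below encodes it literally; which aggregation window `views` and `contacts` refer to (lifetime totals or a recent window) is not specified in the source, so the arguments are assumed to be whatever totals the snapshot provides.

```python
def item_is_zombie(listing_age_days, views, contacts):
    """True when a listing is old, barely viewed, and never contacted.

    Thresholds come from the F-014 formula: age > 60, views < 5, contacts == 0.
    Assumption: `views` and `contacts` are totals over the same window.
    """
    return listing_age_days > 60 and views < 5 and contacts == 0

zombie = item_is_zombie(listing_age_days=90, views=2, contacts=0)
fresh = item_is_zombie(listing_age_days=10, views=2, contacts=0)
```

Because this is an exclusion flag, it is probably best applied as a candidate filter before ranking rather than as a model input.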
F-015: item_seller_type
- Insight: INS-003 (83.5% agent)
- Formula: Direct column (binary encode)
- Priority: MEDIUM — Fairness metric
🤝 INTERACTION FEATURES
F-016: user_item_pageview_count
- Formula: `count WHERE user=X, item=Y, event_type=pageview`
- Priority: HIGH
F-017: user_item_dwell_total
- Formula: `sum(dwell_time_sec/1000) WHERE user=X, item=Y`
- Priority: MEDIUM
F-018: user_item_category_match
- Formula: `user_pref_cat == item_cat`
- Priority: HIGH
F-019: user_item_city_match
- Formula: `user_pref_city == item_city`
- Priority: HIGH
F-020: user_item_recency
- Formula: days since the last (user, item) interaction
- Priority: HIGH
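The pair-level aggregates (F-016, F-017, F-020) can be built in a single pass over the events. This sketch assumes an `item_id` key on each event (the source only says "item"), treats `dwell_time_sec` as milliseconds as elsewhere in this document, and counts a missing dwell as zero. The function name is hypothetical.

```python
from datetime import datetime

def user_item_features(events, cutoff):
    """Aggregate pageview count, total dwell (seconds), and recency per (user, item)."""
    acc = {}
    for e in events:
        key = (e["user_id"], e["item_id"])  # 'item_id' is an assumed column name
        f = acc.setdefault(key, {"pageviews": 0, "dwell_sec": 0.0, "last_ts": e["event_ts"]})
        if e["event_type"] == "pageview":
            f["pageviews"] += 1
        f["dwell_sec"] += (e.get("dwell_time_sec") or 0) / 1000  # ms to seconds
        if e["event_ts"] > f["last_ts"]:
            f["last_ts"] = e["event_ts"]
    return {
        key: {
            "user_item_pageview_count": f["pageviews"],
            "user_item_dwell_total": f["dwell_sec"],
            "user_item_recency": (cutoff - f["last_ts"]).days,
        }
        for key, f in acc.items()
    }

sample_events = [
    {"user_id": "u1", "item_id": "i1", "event_type": "pageview",
     "dwell_time_sec": 3000, "event_ts": datetime(2024, 1, 20)},
    {"user_id": "u1", "item_id": "i1", "event_type": "pageview",
     "dwell_time_sec": 1500, "event_ts": datetime(2024, 1, 25)},
    {"user_id": "u1", "item_id": "i1", "event_type": "contact",
     "dwell_time_sec": None, "event_ts": datetime(2024, 1, 26)},
]
pair_features = user_item_features(sample_events, cutoff=datetime(2024, 1, 31))
```

The two match features (F-018, F-019) are plain equality checks between the user preference and the item attribute, so they are omitted here.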
🌡️ TEMPORAL FEATURES
F-021: time_since_tet
- Insight: INS-018 (Tết -40% drop)
- Formula: `abs(event_date - tet_date).days`
- Priority: LOW
F-022: is_weekend
- Insight: H-005 VERIFIED (weekdays > weekends)
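- Formula: `dow >= 6` (valid for ISO weekdays, Monday=1 to Sunday=7; if `dow` is zero-based with Monday=0, as in pandas `dayofweek`, use `dow >= 5`)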
- Priority: LOW
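Both temporal features reduce to small date computations. The sketch below uses `isoweekday()` so that `>= 6` matches Saturday and Sunday; the `TET_2024` constant is only an example value, and a real pipeline would need one Tết date per year covered by the data.

```python
from datetime import date

TET_2024 = date(2024, 2, 10)  # Lunar New Year's Day 2024, used here as an example

def time_since_tet(event_date, tet_date):
    """Absolute distance in days between an event and Tết (F-021)."""
    return abs((event_date - tet_date).days)

def is_weekend(event_date):
    """True for Saturday or Sunday (F-022).

    isoweekday() numbers Monday=1 ... Sunday=7, so the weekend is >= 6.
    """
    return event_date.isoweekday() >= 6

days_before = time_since_tet(date(2024, 2, 3), TET_2024)
days_after = time_since_tet(date(2024, 2, 17), TET_2024)
```

Because of the `abs`, the feature cannot distinguish the run-up to Tết from the recovery after it; a signed variant may be worth testing if that distinction matters.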
📊 COLD-START FEATURES (for 64% of test users)
F-023: popular_items_by_city_category
- Insight: INS-004, INS-017
- Formula: Top-K most contacted items per (city, category) in last 30 days
- Priority: CRITICAL — This IS the cold-start strategy
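A possible shape for this cold-start lookup is sketched below: count contacts per item within each (city, category) segment over the last 30 days, then keep the top K. It assumes contact events carry `city_name`, `category`, `item_id`, and `event_ts`; the function name, the default `k`, and the half-open window are illustrative choices.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def popular_items_by_city_category(events, cutoff, k=10, window_days=30):
    """Top-k most contacted items per (city_name, category) in the window.

    Assumption: the window is (cutoff - window_days, cutoff].
    """
    start = cutoff - timedelta(days=window_days)
    counts = defaultdict(Counter)
    for e in events:
        if e["is_contact"] == 1 and start < e["event_ts"] <= cutoff:
            counts[(e["city_name"], e["category"])][e["item_id"]] += 1
    return {seg: [item for item, _ in c.most_common(k)] for seg, c in counts.items()}

def _contact(item_id, ts, city="HCM", category=1010):
    return {"item_id": item_id, "event_ts": ts, "city_name": city,
            "category": category, "is_contact": 1}

sample_events = (
    [_contact("a", datetime(2024, 1, 15))] * 2
    + [_contact("b", datetime(2024, 1, 20))]
    + [_contact("c", datetime(2023, 12, 15))] * 3  # outside the 30-day window
)
top_items = popular_items_by_city_category(sample_events, datetime(2024, 1, 31), k=2)
```

A cold user with a known or inferred (city, category) can then be served `top_items[(city, category)]`, with a global fallback for segments that have no recent contacts.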
F-024: trending_items_recent
- Insight: INS-011 (freshness matters)
- Formula: Items with highest contact growth in last 7 days
- Priority: HIGH
F-025: city_category_popularity_score
- Insight: INS-017 (HCM+HN = 81%)
- Formula: Normalized contact count per (city, category)
- Priority: HIGH
🏥 HEALTH METRIC FEATURES (from R09, new)
F-026: item_contact_burst_score
- Insight: INS-024 (cold users get homogeneous trending → need burst signals)
- Formula: `contacts_last_7d / (contacts_8d_to_30d / 3 + ε)`. Items with `burst_score > 1.5` are "heating up".
- Implemented in: `src/models/baselines/trending.py` → `BurstTrendingRecommender`
- Priority: HIGH ✅ IMPLEMENTED
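To make the burst formula concrete, here is a standalone sketch of the ratio. It is not the code in `BurstTrendingRecommender`; the `eps` default and the function names are assumptions made for illustration.

```python
def burst_score(contacts_last_7d, contacts_8d_to_30d, eps=1e-6):
    """Recent weekly contacts divided by the approximate prior weekly average.

    Days 8 to 30 span roughly three weeks, hence the division by 3.
    `eps` (assumed value) avoids division by zero for items with no history.
    """
    return contacts_last_7d / (contacts_8d_to_30d / 3 + eps)

def is_heating_up(contacts_last_7d, contacts_8d_to_30d, threshold=1.5):
    """Apply the 1.5 threshold quoted for F-026."""
    return burst_score(contacts_last_7d, contacts_8d_to_30d) > threshold

doubling = burst_score(6, 9)   # 6 contacts this week vs. about 3 per week before
steady = burst_score(3, 9)     # same pace as before
```

One consequence of the formula is that an item with no contacts in days 8 to 30 and a single recent contact gets an extremely large score, so a minimum-volume filter may be needed alongside the threshold.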
F-027: seller_type_fairness_bonus
- Insight: INS-019 (agent ratio 27% vs GT 52%)
- Formula: In `MultiObjectiveReranker`: boost items if `agent_ratio_current < 0.52`; apply a penalty if `agent_ratio_current > 0.52` (exposure is already sufficient)
- Implemented in: `src/evaluation/health_metrics.py` → `compute_fairness()`
- Priority: CRITICAL ✅ IMPLEMENTED (effectiveness still needs to be verified)
F-028: item_novelty_score (Long-tail)
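- Insight: INS-022 (Coverage 3.71%; the top 1% of items accounts for 81.9%)
- Formula: `1 - (contact_rank / total_items)` → higher for rarely-recommended items. This holds only if `contact_rank` ascends with contact count (rank 1 = least contacted); if rank 1 is the most contacted item, the score would instead favor popular items.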
- Priority: HIGH — Verify via H-011 (R14)
- Status: 💡 PROPOSED, not yet implemented
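A small sketch of the novelty score follows. It assumes `contact_rank` is assigned in ascending order of contact count (rank 1 = least contacted), which is the direction needed for long-tail items to score higher; ties are broken by item id purely to keep the output deterministic. The function name is hypothetical.

```python
def item_novelty_scores(contact_counts):
    """Compute 1 - contact_rank / total_items for every item.

    Assumption: rank 1 is the least contacted item, so long-tail items
    receive the highest novelty and the most contacted item receives 0.
    """
    ordered = sorted(contact_counts, key=lambda item: (contact_counts[item], item))
    total = len(ordered)
    return {item: 1 - rank / total for rank, item in enumerate(ordered, start=1)}

novelty = item_novelty_scores({"a": 100, "b": 5, "c": 0, "d": 20})
```

With four items the scores are evenly spaced between 0.75 (least contacted) and 0.0 (most contacted).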
F-029: item_age_momentum_score
- Insight: INS-021 (GT contacts old items median 97d — sustained demand)
- Formula: `contacts_last_7d / listing_age_days` (items maintaining contact velocity over time)
- Priority: MEDIUM — Verify via H-010 (R13)
- Status: 💡 PROPOSED, not yet implemented
F-030: cold_user_coverage_score
- Insight: INS-024 (63% cold users get same global trending → coverage bottleneck)
- Formula: For cold users: segment trending uses categories inversely proportional to over-representation
e.g., if 1050 over-served +5.9pp → show 1050 less in trending, boost 1010 more
- Priority: HIGH
- Status: 💡 PROPOSED — implement in BurstTrendingRecommender
F-031: item_contact_conversion_rate
- Insight: INS-042 (Adview correlation 0.75)
- Formula: `contacts_24h / (views_24h + 1)`
- Priority: CRITICAL — Strong non-linear signal for Reranker.
- Status: 💡 PROPOSED
F-032: is_same_category_as_last_view
- Insight: INS-043 (75% category loyalty)
- Formula: `1 if candidate_category == user_last_interaction_category else 0`
- Priority: CRITICAL — Prevent cross-category recommendations.
- Status: 💡 PROPOSED
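The two reranker features F-031 and F-032 can be computed per candidate as in the sketch below. The candidate dict layout and the function name are assumptions; only the two formulas come from the feature definitions above.

```python
def reranker_features(candidate, user_last_category):
    """Build F-031 and F-032 for one candidate item.

    Assumption: `candidate` is a dict with 'contacts_24h', 'views_24h', 'category'.
    The +1 in the denominator keeps items with zero views from dividing by zero.
    """
    return {
        "item_contact_conversion_rate":
            candidate["contacts_24h"] / (candidate["views_24h"] + 1),
        "is_same_category_as_last_view":
            1 if candidate["category"] == user_last_category else 0,
    }

same_cat = reranker_features(
    {"contacts_24h": 3, "views_24h": 11, "category": 1010}, user_last_category=1010
)
other_cat = reranker_features(
    {"contacts_24h": 0, "views_24h": 0, "category": 1050}, user_last_category=1010
)
```

For a cold user with no last interaction, passing `None` as `user_last_category` yields 0 for every candidate with a real category, so the flag is uninformative for that segment.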
🆕 PCI-BASED FEATURES (from Round 19, fact_post_contact_interactions)
F-033: pci_user_city_category_prefs
- Insight: INS-059 (10,654 blind users have PCI data)
- Formula: Extract mode(city), mode(category) from PCI for blind test users
- Priority: 🔴 CRITICAL — Convert 10,654 blind → cold-with-prefs
- Status: 💡 PROPOSED
F-034: pci_lead_count_weight
- Insight: INS-060 (644K new lead pairs)
- Formula: use `lead_count` as the interaction weight in the ALS matrix (stronger signal = higher weight)
- Priority: 🔴 CRITICAL — Enrich ALS training signal
- Status: 💡 PROPOSED
F-035: pci_purchased_boost
- Insight: INS-059 (2,436 purchased rows)
- Formula: Items with `purchased=True` → weight 3x in ALS training
- Priority: HIGH — Confirmed conversion = strongest positive signal
- Status: 💡 PROPOSED
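One way F-034 and F-035 could feed the ALS matrix is sketched below: each PCI row contributes `lead_count` as its weight, tripled when `purchased` is true. Combining the two multiplicatively is an assumption, as are the `user_id` and `item_id` column names and the function name; the source only states the two rules separately.

```python
def als_interaction_weights(pci_rows, purchased_multiplier=3.0):
    """Build {(user_id, item_id): weight} entries for a sparse ALS matrix.

    Assumptions: weight = lead_count, multiplied by 3 for purchased rows,
    and repeated (user, item) pairs are summed.
    """
    weights = {}
    for row in pci_rows:
        w = float(row["lead_count"])
        if row.get("purchased"):
            w *= purchased_multiplier
        key = (row["user_id"], row["item_id"])
        weights[key] = weights.get(key, 0.0) + w
    return weights

pci_rows = [
    {"user_id": "u1", "item_id": "i1", "lead_count": 2, "purchased": False},
    {"user_id": "u1", "item_id": "i2", "lead_count": 1, "purchased": True},
    {"user_id": "u1", "item_id": "i1", "lead_count": 1, "purchased": False},
]
weights = als_interaction_weights(pci_rows)
```

The resulting dictionary maps directly onto coordinate (row, column, value) triplets, which is the usual input form for building a sparse user-item matrix.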