🧪 HYPOTHESES TRACKER
Philosophy: Every insight starts as a hypothesis. A hypothesis must be PROVEN or REJECTED with data.
Rule: NEVER accept a hypothesis without statistical evidence.
📊 DASHBOARD
| Status | Count |
|---|---|
| ⬜ PENDING (not yet tested) | 6 |
| 🔄 TESTING (being verified) | 0 |
| ✅ VERIFIED (proven) | 16 |
| ❌ REJECTED (refuted) | 8 |
| 🔀 MODIFIED (revised) | 2 |
| TOTAL | 32 |
⬜ PENDING HYPOTHESES (awaiting verification)
[H-009] — Raising γ (fairness weight) to 0.25 reduces the agent gap without hurting Recall
- Statement: MultiObjectiveReranker with γ=0.25 (instead of 0.15) will pull the agent ratio from 27% up toward the 52% observed in GT while Recall@10 drops by <2%.
- Motivation: INS-019 (the 24.7pp agent gap is too large).
- How to verify: Offline ablation study with α=0.60, β=0.15, γ=0.25, δ=0.
- Status: ⬜ PENDING — Round 13
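A minimal sketch of how the four ablation weights might combine into one rerank score. Only γ is named in the source (fairness weight); reading α as relevance and β, δ as two further objectives is an assumption, and `RerankWeights` and `combined_score` are hypothetical names, not the MultiObjectiveReranker API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankWeights:
    alpha: float  # assumed: relevance weight
    beta: float   # assumed: a second objective (not named in the source)
    gamma: float  # fairness weight, as named in H-009
    delta: float  # assumed: a fourth objective, switched off in this ablation

    def total(self) -> float:
        """Sum of all weights, useful as a sanity check."""
        return self.alpha + self.beta + self.gamma + self.delta


def combined_score(w: RerankWeights, relevance: float, second: float,
                   fairness: float, fourth: float) -> float:
    """Linear scalarization: weighted sum of per-item objective scores."""
    return (w.alpha * relevance + w.beta * second
            + w.gamma * fairness + w.delta * fourth)


# The H-009 ablation setting; these weights sum to 1.0.
H009 = RerankWeights(alpha=0.60, beta=0.15, gamma=0.25, delta=0.0)
```

With these values the three active weights sum to 1.0, so the score stays on the same scale as the individual objectives.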
[H-010] — Increasing the ALS half-life from 7d to 30d improves Recall@10
- Statement: INS-021 shows that GT contacts land on old items (97d). half-life=7d is too aggressive and biases toward fresh items. Raising it to 30d should align better with GT and improve Recall.
- How to verify: Train ALS with half-life={3, 7, 14, 30, 60}d and compare offline Recall@10 on a time-split validation set.
- Status: ⬜ PENDING — Round 13 (ablation)
[H-011] — Long-tail novelty injection raises Coverage from 3.71% → 8% while Recall drops by <1%
- Statement: Adding 20% long-tail items to the BurstTrendingRecommender cold-start pool will improve Coverage without hurting Recall much.
- How to verify: Implement coverage_bonus in trending and compare offline metrics.
- Status: ⬜ PENDING — Round 14
[H-012] — Removing require_login=True in ColdStartProfiler raises cold-user coverage to 40%+
- Statement: ColdStartProfiler currently reads login events only, so it covers just 18k/120k cold users (15%). Dropping the is_login filter lets it match non-login sessions, which could cover an additional 25%+ of cold users.
- How to verify: Run ColdStartProfiler with require_login=False and compare coverage %.
- Status: ⬜ PENDING — Test directly
[H-019] — Filtering pageview noise (dwell > 30s) before ALS training improves als_view quality
- Statement: als_view is currently trained on ALL pageviews (including bounce views < 5s). If only pageviews with dwell_time > 30s are used, ALS should learn a higher-quality signal and could become a useful candidate source.
- Motivation: INS-047 (the current als_view dilutes the candidate pool because it carries too much noise).
- How to verify: Retrain ALS on filtered pageviews (dwell > 30s) and compare standalone Recall@200 against the current als_view.
- Status: ⬜ PENDING
[H-027] — Time-weighted ALS (exponential recency) will improve warm recall on clean eval
- Statement: INS-068 shows ALS is 5.6x worse without the most recent 3d of contacts. Time-weighting contacts by exp(-days_ago / half_life) should partially compensate for this loss.
- Motivation: INS-068 — recency is disproportionately important for ALS quality.
- How to verify: Retrain ALS with time-weighted contacts on clean split, compare recall.
- Status: ⬜ PENDING
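The H-027 weighting formula can be sketched as below. The function names and the `(user_id, item_id, days_ago)` tuple layout are hypothetical stand-ins for the real contact log; only the expression exp(-days_ago / half_life) comes from the hypothesis.

```python
import math
from collections import defaultdict


def recency_weight(days_ago: float, half_life: float) -> float:
    """Weight from the H-027 formula: exp(-days_ago / half_life)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    # Clamp negative ages (future timestamps) to zero so the weight never exceeds 1.
    return math.exp(-max(days_ago, 0.0) / half_life)


def weighted_contacts(contacts, half_life):
    """Sum decayed weights per (user, item) pair.

    `contacts` is an iterable of (user_id, item_id, days_ago) tuples;
    this layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    for user_id, item_id, days_ago in contacts:
        totals[(user_id, item_id)] += recency_weight(days_ago, half_life)
    return dict(totals)
```

Note that with this expression the weight at `days_ago == half_life` is 1/e (about 0.37), not 0.5. A strict half-life would use `0.5 ** (days_ago / half_life)`, so the parameter here behaves as a decay time constant.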
[H-029] — Non-login pageview preferences (city+cat only) improve blind user recall WITHOUT touching ALS
- Statement: Adding pref_city and pref_cat from non-login pageviews for 4,215 truly-blind test users into cold_user_prefs.parquet will improve their SegPop segment matching, increasing recall from snapshot-global fallback level to segment-popular level (~1.6% ceiling per INS-063).
- Motivation: INS-071 — 4,215 blind users have non-login pageviews but no preferences in current pipeline.
- Key difference from H-024 (REJECTED): H-024 removed is_login from the ENTIRE pipeline including ALS training → density dilution → -59%. H-029 modifies ONLY _process_cold_user_prefs → zero ALS impact.
- Risk assessment: Zero risk to warm/cold-with-signal users. Only affects 4,215/161,568 = 2.6% of test users.
- Unknown: Whether Kaggle GT evaluates non-login user_ids. If not → recall contribution = 0.
- How to verify:
- Modify _process_cold_user_prefs to remove is_login == 'login' filter for pageview preference extraction only
- Rebuild cold_user_prefs.parquet
- Re-run aligned eval with --retrain_clean to measure impact
- Compare blind recall before/after
- Status: ⬜ PENDING
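The preference extraction proposed in H-029 could look roughly like this. It is a sketch, not the real _process_cold_user_prefs: the function name, the dict keys "user_id", "city" and "category", and the choice of "most frequent value" as the preference are all assumptions.

```python
from collections import Counter, defaultdict


def build_pageview_prefs(pageviews, target_users):
    """Return {user_id: {"pref_city": ..., "pref_cat": ...}} for target users.

    `pageviews` is an iterable of dicts with hypothetical keys
    "user_id", "city" and "category". No is_login filter is applied,
    mirroring the change proposed in H-029.
    """
    cities = defaultdict(Counter)
    cats = defaultdict(Counter)
    for pv in pageviews:
        uid = pv["user_id"]
        if uid not in target_users:
            continue  # only the blind users are touched; everyone else is unchanged
        cities[uid][pv["city"]] += 1
        cats[uid][pv["category"]] += 1
    return {
        uid: {
            "pref_city": cities[uid].most_common(1)[0][0],
            "pref_cat": cats[uid].most_common(1)[0][0],
        }
        for uid in cities
    }
```

Restricting the loop to `target_users` is what keeps the change isolated from ALS training, which is the key difference from the rejected pipeline-wide variant.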
✅ VERIFIED HYPOTHESES
[H-020] — Adding PCI lead pairs to ALS training improves Recall@10 ✅ VERIFIED
- Statement: Merging PCI lead pairs (for existing ALS users only) into the ALS training matrix increases density and improves recommendation quality.
- Evidence:
- Warm recall on clean split improved from 0.0179 (Cascade-Direct v24 baseline) to 0.0285 (with PCI + weighted contacts).
- Standing hybrid Recall@10 improved to 0.0668.
- Verified in Round: 24
- → Insight ID: INS-069
[H-021] — PCI preferences improve cold-start SegPop matching ✅ VERIFIED
- Statement: Building user preferences from PCI data before split date provides a strong personalized fallback for cold users.
- Evidence:
- Cold-with-signal users with PCI preferences achieved Recall@10 = 0.0569 vs 0.0000 for users without preferences (near 30x relative improvement).
- Verified in Round: 24
- → Insight ID: INS-067
[H-022] — PCI purchased=True items weighted 3x in ALS improves warm recall ✅ VERIFIED
- Statement: Assigning 3x weight to actual purchased contact pairs in ALS training provides a stronger signal and improves embedding accuracy.
- Evidence:
- The combination of weighted contact counts (real=3x, other=1x) and PCI purchased weight=3x achieved a massive warm recall boost to 0.0668 (hybrid) and 0.0285 (cascade-direct).
- Verified in Round: 24
- → Insight ID: INS-069
[H-028] — A single LightGBM ranker destroys cold user recall, requiring segmented inference ✅ VERIFIED
- Statement: A single unified ranking model overfits to high-density warm behavior features, penalizing cold-start candidates that lack behavioral histories.
- Evidence:
- Hybrid mode boosted Warm recall from 0.0285 → 0.0668 (+134.4%), but destroyed Cold-with-signal recall from 0.0528 → 0.0127 (-75.9%).
- Verified in Round: 24
- → Insight ID: INS-069
[H-025] — Retraining ALS+SegPop on split-clean data will drop blind recall to ~0.01-0.02 ✅ VERIFIED
- Statement: Current eval shows blind recall=0.1654, which is 10x higher than the INS-063 ceiling (0.0158). This is due to model leak (ALS/SegPop trained on full data including the val period). Split-clean retrain should reveal TRUE blind recall ~0.01-0.02.
- Evidence:
- Blind recall dropped from 0.1654 → 0.0004 (413x, even lower than predicted)
- Warm recall dropped from 0.0712 → 0.0179 (4x)
- Simulated LB from 0.1336 → 0.0111 (12x)
- Verified in Round: 24
- → Insight ID: INS-066, INS-068
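The leak described in H-025 comes from fitting models on events that fall inside the validation period. A minimal sketch of a split-clean partition follows; the `(user_id, item_id, event_date)` tuple layout and the function name are assumptions, not the project's actual split code.

```python
from datetime import date


def split_clean(events, split_date):
    """Partition events into (train, val) by a cutoff date.

    `events` is an iterable of (user_id, item_id, event_date) tuples,
    a hypothetical layout. Everything on or after `split_date` goes to
    validation, so a model fitted on `train` never sees the val period.
    """
    train, val = [], []
    for event in events:
        (train if event[2] < split_date else val).append(event)
    return train, val
```

Every model in the pipeline (ALS and SegPop in the entry above) would need to be refitted on `train` only for the offline numbers to be comparable with the leaderboard.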
[H-026] — PCI prefs will show relative uplift even with clean retrain ✅ VERIFIED
- Statement: Cold users with PCI prefs (0.1942) outperform cold without (0.1651) by +17.6%. This relative difference should persist after clean retrain.
- Evidence:
- Cold+prefs: Recall@10 = 0.0612 vs Cold-no-prefs: 0.0020 = 30.6x uplift (even stronger than predicted!)
- n=715 with prefs, n=55 without prefs
- Verified in Round: 24
- → Insight ID: INS-067
[H-030] — ALS1024 + cascade-direct beats hybrid/segmented production baseline ✅ VERIFIED
- Statement: Increasing ContactALS capacity to 1024 factors and serving with direct cascade mode will outperform the previous segmented/hybrid baseline.
- Evidence:
- Previous best v14: 0.0344 public LB
- v17: 0.2116 public LB / Top5
- Validated artifact: outputs/submission_1024.zip
- config.inference_mode = "cascade" skips LightGBM and avoids INS-069 cold/warm overfit.
- ALS artifact verified: user_factors=(810411,1024), item_factors=(696252,1024).
- Verified in Round: 25
- → Insight ID: INS-072
[H-013] — IntentRecommender sharply increases Recall for cold-start/warm-start users ✅ VERIFIED
- Statement: Directly matching (District, Category, Price) from Pageview history against listings in dim_listing yields high standalone Recall.
- Evidence: The Round 18 benchmark shows IntentRecommender reaches a standalone Recall@200 of 0.1140 (second highest after ALS).
- Verified in Round: 18
- → Impact: Remains a core candidate source for the Reranker.
- → Insight ID: INS-044
[H-016] — Hard cascade slot-competition restricts Recall@200 ceiling ✅ VERIFIED
- Statement: Rigid priority queue cascade limits the theoretical Recall@200 ceiling because high-volume sources greedily consume the 200-slot budget.
- Evidence:
- Hard cascade with ALS first: Recall@200 = 0.1840.
- Hard cascade with PV first: Recall@200 = 0.2396.
- Standalone sum of candidates is 0.5045, showing massive overlap and slot competition.
- Verified in Round: 18
- → Impact: Move from hard cascade to diverse union pool generator for Reranker candidate generation.
- → Insight ID: INS-044
[H-017] — Round-robin interleave is inferior to sequential priority for candidate generation ✅ VERIFIED
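- Statement (original claim, refuted): Interleaving candidates from all sources equally (round-robin) will improve diversity and Recall@200 compared to sequential priority filling. The evidence shows the opposite, which is the finding marked VERIFIED in the title.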
- Evidence:
- Sequential priority: Recall@200 (Active GT) = 0.3152
- Round-robin: Recall@200 (Active GT) = 0.2753 (-12.7%)
- Round-robin let SegPop consume 75k slots (vs 35k in sequential), crowding out personalized candidates.
- Verified in Round: 19
- → Impact: Round-robin REJECTED. Sequential priority with budget caps is the correct architecture.
- → Insight ID: INS-046
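The two candidate-merging strategies compared in H-017 can be sketched as follows. Source names, budgets and function names are illustrative assumptions; the real generators are not shown in this tracker.

```python
def sequential_fill(sources, budgets, k):
    """Fill a k-slot pool source by source, in priority order.

    `sources` is a list of (name, ranked_candidates) in priority order;
    `budgets` caps how many slots each named source may take.
    """
    pool, seen = [], set()
    for name, candidates in sources:
        cap = budgets.get(name, k)
        taken = 0
        for item in candidates:
            if len(pool) >= k or taken >= cap:
                break
            if item in seen:
                continue  # duplicates do not consume budget
            seen.add(item)
            pool.append(item)
            taken += 1
    return pool


def round_robin_fill(sources, k):
    """Interleave sources: take rank 1 from each, then rank 2, and so on."""
    pool, seen = [], set()
    queues = [list(candidates) for _, candidates in sources]
    depth = 0
    while len(pool) < k and any(depth < len(q) for q in queues):
        for q in queues:
            if depth < len(q) and q[depth] not in seen and len(pool) < k:
                seen.add(q[depth])
                pool.append(q[depth])
        depth += 1
    return pool
```

In the round-robin variant a generic source such as SegPop gets the same share of slots as a personalized one, which matches the crowding-out effect reported in the evidence.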
[H-018] — Disabling als_view improves Recall@200 ✅ VERIFIED
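- Statement (original claim, refuted): Pageview-based ALS (als_view) adds valuable coverage signal to the candidate pool. The evidence shows that removing it helps, which is the finding marked VERIFIED in the title.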
- Evidence:
- WITH als_view (budget=80): Recall@200 = 0.3014
- WITHOUT als_view (budget=0): Recall@200 = 0.3177 (+5.4%)
- als_view consumed 80-95k slots but contributed ZERO net recall improvement.
- Verified in Round: 19
- → Impact: DISABLE als_view. Pageview data is only useful for PageviewReplay and IntentRecommender, NOT for CF.
- → Insight ID: INS-047
[H-014] — adview_count correlates with contact probability up to a point ✅ VERIFIED
- Statement: There is a strong relationship between views and contacts.
- Evidence:
- Pearson correlation is 0.7571.
- Conversion rate is highest at 0 views (0.103) and 150+ views (0.101), dropping to 0.087 at 30 views.
- Verified in Round: 16
- → Impact: Must include views_24h and a non-linear combination contacts_24h / (views_24h + 1) in LightGBM Reranker.
- → Insight ID: INS-042
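The ratio feature named in the H-014 impact line is small enough to sketch directly. The feature key `contact_rate_24h` and both function names are hypothetical; `views_24h` and `contacts_24h` come from the entry.

```python
def contact_rate_24h(contacts_24h: int, views_24h: int) -> float:
    """contacts_24h / (views_24h + 1); the +1 avoids division by zero."""
    return contacts_24h / (views_24h + 1)


def reranker_features(contacts_24h: int, views_24h: int) -> dict:
    """Feature dict for one listing, keeping the raw count next to the ratio."""
    return {
        "views_24h": views_24h,
        "contact_rate_24h": contact_rate_24h(contacts_24h, views_24h),
    }
```

Keeping both the raw count and the ratio lets a tree model pick up the non-monotonic conversion pattern described in the evidence (high at 0 views and at 150+ views, lower around 30).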
[H-015] — Users have high category stickiness ✅ VERIFIED
- Statement: Users rarely cross-shop between different real estate categories.
- Evidence:
- Average probability of staying in the exact same category across consecutive contacts is 75.11%.
- 1050 (Dự án) has the highest stickiness at 87.2%.
- Verified in Round: 17
- → Impact: Sequential recommendations must strictly penalize category switches unless there's an explicit signal. Add is_same_category_as_last_view feature to Reranker.
- → Insight ID: INS-043
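One way the stickiness figure and the proposed feature might be computed is sketched here. The pooled-over-all-pairs definition is an assumption (the source may average per category or per user), and the input layouts are hypothetical.

```python
def category_stickiness(contact_sequences):
    """Fraction of consecutive contact pairs that stay in the same category.

    `contact_sequences` maps user_id -> list of category ids in time order
    (a hypothetical layout). Pairs are pooled across all users.
    """
    same = total = 0
    for categories in contact_sequences.values():
        for prev, cur in zip(categories, categories[1:]):
            total += 1
            same += prev == cur
    return same / total if total else 0.0


def is_same_category_as_last_view(candidate_category, viewed_categories):
    """True when the candidate matches the user's most recent viewed category."""
    return bool(viewed_categories) and candidate_category == viewed_categories[-1]
```

Users with no view history get `False`, so the feature stays defined for cold-start users.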
[H-002] — 64% test users are Cold-Start ✅ VERIFIED
- Statement: A large portion of test users have NO training history.
- Evidence:
- Total test users: 161,568
- With event history (login): 58,153 (36%)
- Cold-start (NO history): 103,415 (64%)
- With contact interaction history: 60,212 (37.3%)
- Verified in Round: 02
- → Impact: CRITICAL. Cold-start fallback strategy is ESSENTIAL. Popularity/trending by city+category MUST be implemented. 64% of our score depends on cold-start handling!
- → Feature created: user_is_cold_start (boolean)
- → Insight ID: INS-004
[H-003] — dwell_time_sec is in milliseconds ✅ VERIFIED
- Statement: Column is labeled "sec" but values are in milliseconds.
- Evidence:
- Raw median (pageview): 17,915 → if seconds = 5 hours (IMPOSSIBLE)
- Divided by 1000: median = 17.9 seconds (REALISTIC for page viewing)
- Mean: 52.3 seconds after conversion (reasonable with some long sessions)
- Verified in Round: 02
- → Impact: ALL code referencing dwell_time_sec must divide by 1000.
- config/settings.py → min_valid_dwell_sec: 3.0 means threshold of 3000ms raw.
- data_forensics.py bot detection thresholds must be recalibrated.
- → Insight ID: INS-005
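The unit fix from H-003 amounts to one conversion applied everywhere the raw column is read. A minimal sketch, where the helper names are hypothetical and treating the 3.0 s threshold as inclusive is an assumption:

```python
MS_PER_SEC = 1000.0
MIN_VALID_DWELL_SEC = 3.0  # mirrors min_valid_dwell_sec in config/settings.py


def dwell_seconds(raw_dwell_time_sec: float) -> float:
    """Convert the raw dwell_time_sec value (actually milliseconds) to seconds."""
    return raw_dwell_time_sec / MS_PER_SEC


def is_valid_dwell(raw_dwell_time_sec: float) -> bool:
    """Apply the threshold in seconds, i.e. 3000 in raw units."""
    return dwell_seconds(raw_dwell_time_sec) >= MIN_VALID_DWELL_SEC
```

For the raw pageview median of 17,915 this gives 17.915 seconds, matching the 17.9 s figure in the evidence.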
❌ REJECTED HYPOTHESES
[H-001] — project_id nullity correlates with non-apartment categories ❌ REJECTED
- Statement: project_id is null mostly for categories 1030 (nhà ở) and 1040 (đất nền).
- Counter-evidence:
- 1010 (Phòng trọ): 58.6% null
- 1020 (Căn hộ/chung cư): 96.73% null ← SURPRISING!
- 1030 (Nhà ở): 93.54% null
- 1040 (Đất nền): 91.44% null
- 1050 (Dự án mở bán mới): 100% null ← MOST SURPRISING!
- Learning: project_id is null across ALL categories (>58%). Even category 1050 (Dự án mở bán mới) — which IS apartment projects — has 100% null project_id. This column is unreliable as a category indicator.
- → New insight: project_id may represent a specific named project within larger platforms, and most listings (even apartment ones) are not tied to a named project. Cannot use project_id.is_not_null() as is_apartment.
[H-020] — LightGBM reranker on cascade k=200 improves top-10 ❌ REJECTED
- Statement: Generating 200 candidates with cascade, then reranking with LightGBM LambdaRank will improve Recall@10 over direct cascade k=10.
- Counter-evidence:
- v10 (cascade k=10 direct): 0.034 on leaderboard
- v11 (cascade k=200 + LightGBM): 0.0048 on leaderboard
- Root cause 1: Reranker trained on EnsembleGen, deployed on CascadeGen (INS-052)
- Root cause 2: segpop.pkl was overwritten by training pipeline (INS-053)
- Rejected in Round: 21
- → Lesson: Reranker CAN work but MUST be retrained on the same candidate distribution used at inference time.
[H-021] — Intra-segment offset diversity improves cold user score ❌ REJECTED
- Statement: Hash-offsetting blind users into different positions within segment popularity pools will improve score by recommending different items to different users.
- Counter-evidence:
- v10 (top items, no offset): 0.034
- v12 (offset diversity): 0.005
- Offset pushed users to position 50-200 in pool = less popular = less relevant
- Rejected in Round: 21
- → Lesson: Diversity ≠ quality. Popular items are popular because they ARE relevant. (INS-054)
[H-022] — PV-first cascade improves warm user Recall@10 ❌ REJECTED
- Statement: Giving PV replay priority over ALS will improve warm user precision by combining explicit interest + CF discovery.
- Counter-evidence:
- ALS-first: Recall@10 = 0.1009
- PV-first (3 PV + 7 ALS): Recall@10 = 0.0999
- ALS and PV overlap only 0.5/10 but neither order dominates
- Rejected in Round: 21
- → Lesson: Source ordering barely matters at k=10 when ALS fills all slots (INS-056)
[H-024] — Category-proportional blind allocation beats global demand fallback ❌ REJECTED
- Statement: Allocating no-preference blind slots proportionally by blind contact category distribution will outperform a global high-demand item set.
- Counter-evidence:
- global_score7 from snapshot demand: Recall@10 = 0.001190, hits = 63
- snap_hcm_prop_4_3_2_1: Recall@10 = 0.000660, hits = 43
- snap_weighted_segments: Recall@10 = 0.000575, hits = 35
- Production-style fixed top item set + rank rotation improved blind recall from 0.0001 → 0.0005, but snapshot global demand improved it further to 0.0011 in full aligned eval.
- Rejected in Round: 24
- → Lesson: For truly blind users, segment diversification is weaker than recent item-side demand. Use diversity only for exposure constraints, not item-set selection.
- → Insight ID: INS-070
[H-031] — Snapshot demand fallback improves public leaderboard ❌ REJECTED
- Statement: Since snapshot 7d demand improved truly-blind offline recall, using it as blind fallback in production should improve LB.
- Counter-evidence:
- outputs/submission_snapshot_blind.zip public LB = 0.0003
- Protected v17 baseline outputs/submission_1024.zip public LB = 0.2116
- Delta = -0.2113
- Rejected in Round: 26
- → Lesson: Snapshot demand offline gains do not transfer to public LB. Do not use snapshot fallback as final.
- → Insight ID: INS-073
[H-032] — ALS1536 + recency/time-decay branch beats ALS1024 v17 ❌ REJECTED
- Statement: Increasing ALS factors from 1024 to 1536 and adding time-decay/test-only cold prefs will improve the v17 baseline.
- Counter-evidence:
- v17 outputs/submission_1024.zip: 0.2116
- v18 outputs/submission_1536.zip: 0.2108
- Delta = -0.0008
- Rejected in Round: 26
- → Lesson: ALS1024 remains the current production sweet spot. Larger factors and recency weighting must be isolated before trusting.
- → Insight ID: INS-074
[H-033] — Conservative v17/v18 slot blend can safely improve tail ranks ❌ REJECTED
- Statement: Keeping v17 ranks 1-9 and replacing rank10 with a unique v18 item should preserve most v17 strength while adding incremental diversity.
- Counter-evidence:
- v17 outputs/submission_1024.zip: 0.2116
- v19 outputs/submission_blend_v17_9_v18_1.zip: 0.1974
- Delta = -0.0142
- Rejected in Round: 26
- → Lesson: v17 top-10 ordering is valuable even at rank10. Slot-level blending is not safe.
- → Insight ID: INS-075
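For reference, the slot blend tested in H-033 can be sketched as below. This illustrates the rejected experiment only; the function name and the fallback to the primary list's own tail are assumptions.

```python
def blend_tail(primary, secondary, keep=9):
    """Keep primary ranks 1..keep, then fill the rest with unseen secondary items.

    Returns a list as long as `primary`. If `secondary` has no new items,
    the remaining slots fall back to the primary tail (an assumption).
    """
    blended = list(primary[:keep])
    seen = set(blended)
    target = len(primary)
    for item in secondary:
        if len(blended) >= target:
            break
        if item not in seen:
            blended.append(item)
            seen.add(item)
    for item in primary[keep:]:
        if len(blended) >= target:
            break
        if item not in seen:
            blended.append(item)
            seen.add(item)
    return blended
```

Given the -0.0142 delta recorded above, this is a record of what was tried rather than something to deploy.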
[H-023] — Warm users contribute ~0.10 recall ✅ VERIFIED
- Statement: The v10 leaderboard score (0.034) is entirely from warm users (33.7%), implying warm Recall@10 ≈ 0.10.
- Evidence:
- 0.034 / 0.337 = 0.101 (implied warm recall)
- Offline eval on warm users: Recall@10 = 0.1009 (✅ MATCHES!)
- Cold users get SegPop items → ~0 recall contribution
- Verified in Round: 21
- → Insight ID: INS-055
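The arithmetic behind H-023 treats the leaderboard score as a share-weighted mean of per-segment recall. A small sketch under that assumption (per-user averaging of Recall@10), with hypothetical function names:

```python
def implied_segment_recall(overall_recall: float, segment_share: float) -> float:
    """Recall a segment must have if it alone produces the overall score."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return overall_recall / segment_share


def blended_recall(segments) -> float:
    """Share-weighted mean over (share, recall) pairs."""
    return sum(share * recall for share, recall in segments)


# Warm users are 33.7% of test users and the v10 score is 0.034.
warm_recall = implied_segment_recall(0.034, 0.337)
```

Plugging the offline warm recall back in, `blended_recall([(0.337, 0.1009), (0.663, 0.0)])` comes out close to 0.034, which is the consistency check the entry relies on.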
🔀 MODIFIED HYPOTHESES
[H-004] → [H-004-M] — other_interaction IS a positive signal (is_contact=1) 🔀 MODIFIED
- Original: "other_interaction: positive signal or noise?"
- Modified to: other_interaction IS consistently flagged as is_contact=1, confirming it IS a positive interaction.
- Evidence:
- ALL other_interaction events have is_contact=1 (561,188 events, 100%)
- ALL pageview events have is_contact=0 (404,986 events, 100%)
- This is consistent with the definition in the problem statement (đề thi), lines 99-102.
- Line 105 of the problem statement, which says "other_interaction is browsing noise", appears to be an ERROR or intentional misdirection.
- Reason: The is_contact flag is the ground truth. Data says other_interaction = positive.
- → Impact: MUST include other_interaction in positive_events config! Current settings.py EXCLUDES it. This is a CRITICAL config bug that would dramatically hurt Recall@10.
- → Action: Update config/settings.py positive_events list to include other_interaction.
- → Insight ID: INS-006
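The check behind H-004-M, that each event type maps to exactly one is_contact value, can be sketched as follows. The `(event_type, is_contact)` pair layout and the function names are assumptions, and the result is only a candidate for the positive_events list in config/settings.py.

```python
from collections import defaultdict


def contact_flags_by_event(events):
    """Map each event type to the set of is_contact values seen for it.

    `events` is an iterable of (event_type, is_contact) pairs,
    a hypothetical layout for the raw event log.
    """
    flags = defaultdict(set)
    for event_type, is_contact in events:
        flags[event_type].add(is_contact)
    return dict(flags)


def positive_event_types(events):
    """Event types whose rows are all flagged is_contact=1, sorted by name."""
    flags = contact_flags_by_event(events)
    return sorted(t for t, seen in flags.items() if seen == {1})
```

An event type with mixed flags would be excluded here and would need a manual look before being added to the config.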
[H-023F] → [H-023F-M] — Pure freshness is not enough; snapshot demand freshness is the useful variant 🔀 MODIFIED
- Original: Freshness-first SegPop, prioritizing items posted ≤7 days, will improve blind user cold-start.
- Modified to: Truly-blind fallback should rank by recent demand from snapshots (contacts_7d*20 + views_7d), not by posted_date freshness alone.
- Evidence:
- global_fresh_only: Recall@10 = 0.000000
- global_score7_fresh: Recall@10 = 0.000538
- global_score7: Recall@10 = 0.001190
- Full aligned eval blind recall improved from 0.0001 → 0.0011 after deploying snapshot demand fallback.
- Reason: Recent demand is a stronger proxy for current market relevance than newness alone.
- → Insight ID: INS-070
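The snapshot demand score from H-023F-M (contacts_7d*20 + views_7d) can be sketched as a ranking function. The snapshot layout and function names are hypothetical; only the formula and the factor 20 come from the entry.

```python
def demand_score(contacts_7d: int, views_7d: int, contact_weight: int = 20) -> int:
    """Snapshot demand: contacts_7d * 20 + views_7d."""
    return contacts_7d * contact_weight + views_7d


def top_demand_items(snapshot, k=10):
    """Return the k item ids with the highest demand score.

    `snapshot` maps item_id -> (contacts_7d, views_7d), an assumed layout.
    """
    ranked = sorted(snapshot, key=lambda item: demand_score(*snapshot[item]),
                    reverse=True)
    return ranked[:k]
```

Per H-031 above, this fallback improved offline blind recall but did not transfer to the public leaderboard, so the sketch describes the offline variant only.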