🧪 HYPOTHESES TRACKER

Philosophy: Every insight starts from a hypothesis. A hypothesis must be PROVEN or REJECTED with data. Rule: NEVER accept a hypothesis without statistical evidence.

📊 DASHBOARD

Status	Count
⬜ PENDING (not yet tested)	7
🔄 TESTING (being verified)	0
✅ VERIFIED (proven)	16
❌ REJECTED (disproven)	8
🔀 MODIFIED (revised)	2
TOTAL	33

⬜ PENDING HYPOTHESES (awaiting verification)

[H-009] — Raising γ (fairness weight) to 0.25 reduces the agent gap without hurting Recall

Statement: MultiObjectiveReranker with γ=0.25 (instead of 0.15) will move the agent ratio from 27% toward the 52% seen in GT while Recall@10 drops by <2%.
Motivation: INS-019 — the 24.7pp agent gap is too large.
How to verify: Offline ablation study with α=0.60, β=0.15, γ=0.25, δ=0.
Status: ⬜ PENDING — Round 13

[H-010] — Raising the ALS half-life from 7d to 30d improves Recall@10

Statement: INS-021 shows that GT contacts fall on old items (97d). half-life=7d is too aggressive and biases toward fresh items. Raising it to 30d should align better with GT and improve Recall.
How to verify: Train ALS with half-life={3, 7, 14, 30, 60}d and compare offline Recall@10 on the time-split validation set.
Status: ⬜ PENDING — Round 13 (ablation)

[H-011] — Long-tail novelty injection raises Coverage from 3.71% → 8% while Recall drops by <1%

Statement: Adding 20% long-tail items to the BurstTrendingRecommender cold-start pool will improve Coverage without much impact on Recall.
How to verify: Implement coverage_bonus in trending and compare offline metrics.
Status: ⬜ PENDING — Round 14

[H-012] — Removing require_login=True in ColdStartProfiler raises cold-user coverage to 40%+

Statement: ColdStartProfiler currently uses only login events, so it covers just 18k/120k cold users (15%). Dropping the is_login filter could match non-login sessions and cover an additional 25%+ of cold users.
How to verify: Run ColdStartProfiler with require_login=False and compare coverage %.
Status: ⬜ PENDING — Direct test

[H-019] — Filtering pageview noise (dwell > 30s) before ALS training improves als_view quality

Statement: als_view is currently trained on ALL pageviews (including bounce views < 5s). If only pageviews with dwell_time > 30s are used, ALS should learn a higher-quality signal and could become a useful candidate source.
Motivation: INS-047 — the current als_view dilutes the candidate pool because it is too noisy.
How to verify: Retrain ALS on filtered pageviews (dwell > 30s) and compare standalone Recall@200 against the current als_view.
Status: ⬜ PENDING

[H-027] — Time-weighted ALS (exponential recency) will improve warm recall on clean eval

Statement: INS-068 shows ALS is 5.6x worse without the 3 most recent days of contacts. Time-weighting contacts by exp(-days_ago / half_life) should partially compensate for this loss.
Motivation: INS-068 — recency is disproportionately important for ALS quality.
How to verify: Retrain ALS with time-weighted contacts on the clean split and compare recall.
Status: ⬜ PENDING
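
The weighting in H-027 can be sketched as below. This is a minimal illustration, not the pipeline's code: the function names and the (user_id, item_id, days_ago) tuple layout are assumptions made for the example. Only the formula exp(-days_ago / half_life) comes from the statement above.

```python
import math


def recency_weight(days_ago: float, half_life: float) -> float:
    """Return exp(-days_ago / half_life), the weight written in H-027."""
    if days_ago < 0:
        raise ValueError("days_ago must be non-negative")
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return math.exp(-days_ago / half_life)


def weighted_contacts(contacts, half_life):
    """Sum recency weights per (user_id, item_id) pair.

    `contacts` is an iterable of (user_id, item_id, days_ago) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = {}
    for user_id, item_id, days_ago in contacts:
        key = (user_id, item_id)
        totals[key] = totals.get(key, 0.0) + recency_weight(days_ago, half_life)
    return totals
```

Note that with this formula a contact aged exactly half_life days gets weight e^-1 ≈ 0.368, not 0.5. If a literal half-life is intended, 0.5 ** (days_ago / half_life) would be the matching form.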

[H-029] — Non-login pageview preferences (city+cat only) improve blind user recall WITHOUT touching ALS

Statement: Adding pref_city and pref_cat from non-login pageviews for 4,215 truly-blind test users into cold_user_prefs.parquet will improve their SegPop segment matching, raising recall from the snapshot-global fallback level to the segment-popular level (~1.6% ceiling per INS-063).
Motivation: INS-071 — 4,215 blind users have non-login pageviews but no preferences in current pipeline.
Key difference from H-024 (REJECTED): H-024 removed is_login from the ENTIRE pipeline including ALS training → density dilution → -59%. H-029 modifies ONLY _process_cold_user_prefs → zero ALS impact.
Risk assessment: Zero risk to warm/cold-with-signal users. Only affects 4,215/161,568 = 2.6% of test users.
Unknown: Whether Kaggle GT evaluates non-login user_ids. If not → recall contribution = 0.
How to verify:
1. Modify _process_cold_user_prefs to remove is_login == 'login' filter for pageview preference extraction only
2. Rebuild cold_user_prefs.parquet
3. Re-run aligned eval with --retrain_clean to measure impact
4. Compare blind recall before/after
Status: ⬜ PENDING

✅ VERIFIED HYPOTHESES

[H-020] — Adding PCI lead pairs to ALS training improves Recall@10 ✅ VERIFIED

Statement: Merging PCI lead pairs (for existing ALS users only) into the ALS training matrix increases density and improves recommendation quality.
Evidence:
- Warm recall on clean split improved from 0.0179 (Cascade-Direct v24 baseline) to 0.0285 (with PCI + weighted contacts).
- Standing hybrid Recall@10 improved to 0.0668.
Verified in Round: 24
→ Insight ID: INS-069

[H-021] — PCI preferences improve cold-start SegPop matching ✅ VERIFIED

Statement: Building user preferences from PCI data before the split date provides a strong personalized fallback for cold users.
Evidence:
- Cold-with-signal users with PCI preferences achieved Recall@10 = 0.0569 vs 0.0000 for users without preferences (near 30x relative improvement).
Verified in Round: 24
→ Insight ID: INS-067

[H-022] — PCI `purchased=True` items weighted 3x in ALS improves warm recall ✅ VERIFIED

Statement: Assigning 3x weight to actual purchased contact pairs in ALS training provides a stronger signal and improves embedding accuracy.
Evidence:
- The combination of weighted contact counts (real=3x, other=1x) and PCI purchased weight=3x achieved a massive warm recall boost to 0.0668 (hybrid) and 0.0285 (cascade-direct).
Verified in Round: 24
→ Insight ID: INS-069

[H-028] — A single LightGBM ranker destroys cold user recall, requiring segmented inference ✅ VERIFIED

Statement: A single unified ranking model overfits to high-density warm behavior features, penalizing cold-start candidates that lack behavioral histories.
Evidence:
- Hybrid mode boosted Warm recall from 0.0285 → 0.0668 (+134.4%), but destroyed Cold-with-signal recall from 0.0528 → 0.0127 (-75.9%).
Verified in Round: 24
→ Insight ID: INS-069

[H-025] — Retraining ALS+SegPop on split-clean data will drop blind recall to ~0.01-0.02 ✅ VERIFIED

Statement: Current eval shows blind recall=0.1654, about 10x higher than the INS-063 ceiling (0.0158). This is due to a model leak (ALS/SegPop trained on full data including the val period). A split-clean retrain should reveal the TRUE blind recall of ~0.01-0.02.
Evidence:
- Blind recall dropped from 0.1654 → 0.0004 (413x, even lower than predicted)
- Warm recall dropped from 0.0712 → 0.0179 (4x)
- Simulated LB from 0.1336 → 0.0111 (12x)
Verified in Round: 24
→ Insight ID: INS-066, INS-068

[H-026] — PCI prefs will show relative uplift even with clean retrain ✅ VERIFIED

Statement: Cold users with PCI prefs (0.1942) outperform cold users without them (0.1651) by +17.6%. This relative difference should persist after a clean retrain.
Evidence:
- Cold+prefs: Recall@10 = 0.0612 vs Cold-no-prefs: 0.0020 = 30.6x uplift (even stronger than predicted!)
- n=715 with prefs, n=55 without prefs
Verified in Round: 24
→ Insight ID: INS-067

[H-030] — ALS1024 + cascade-direct beats hybrid/segmented production baseline ✅ VERIFIED

Statement: Increasing ContactALS capacity to 1024 factors and serving with direct cascade mode will outperform the previous segmented/hybrid baseline.
Evidence:
- Previous best v14: 0.0344 public LB
- v17: 0.2116 public LB / Top5
- Validated artifact: outputs/submission_1024.zip
- config.inference_mode = "cascade" skips LightGBM and avoids INS-069 cold/warm overfit.
- ALS artifact verified: user_factors=(810411,1024), item_factors=(696252,1024).
Verified in Round: 25
→ Insight ID: INS-072

[H-013] — IntentRecommender strongly improves Recall for Cold-start/Warm-start users ✅ VERIFIED

Statement: Directly matching (District, Category, Price) from Pageview history against listings in dim_listing yields high standalone Recall.
Evidence: The Round 18 benchmark shows IntentRecommender reaching a standalone Recall@200 of 0.1140 (second highest after ALS).
Verified in Round: 18
→ Impact: Retained as a core candidate source for the Reranker.
→ Insight ID: INS-044

[H-016] — Hard cascade slot-competition restricts Recall@200 ceiling ✅ VERIFIED

Statement: A rigid priority-queue cascade limits the theoretical Recall@200 ceiling because high-volume sources greedily consume the 200-slot budget.
Evidence:
- Hard cascade with ALS first: Recall@200 = 0.1840.
- Hard cascade with PV first: Recall@200 = 0.2396.
- Standalone sum of candidates is 0.5045, showing massive overlap and slot competition.
Verified in Round: 18
→ Impact: Move from hard cascade to diverse union pool generator for Reranker candidate generation.
→ Insight ID: INS-044

[H-017] — Round-robin interleave is inferior to sequential priority for candidate generation ✅ VERIFIED

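Statement: The original expectation was that interleaving candidates from all sources equally (round-robin) would improve diversity and Recall@200 compared to sequential priority filling. The evidence shows the opposite, which is the claim verified here.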
Evidence:
- Sequential priority: Recall@200 (Active GT) = 0.3152
- Round-robin: Recall@200 (Active GT) = 0.2753 (-12.7%)
- Round-robin let SegPop consume 75k slots (vs 35k in sequential), crowding out personalized candidates.
Verified in Round: 19
→ Impact: Round-robin REJECTED. Sequential priority with budget caps is the correct architecture.
→ Insight ID: INS-046
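
"Sequential priority with budget caps" can be read as the fill procedure sketched below. This is an illustrative reconstruction under assumptions: the function name, the (name, ranked_items) input layout, and the example budgets are hypothetical and not taken from the production generator.

```python
def sequential_fill(sources, budgets, k=200):
    """Fill up to k candidate slots from sources in priority order.

    `sources` is a list of (name, ranked_items) pairs, highest priority first.
    `budgets` maps a source name to the maximum number of slots it may take;
    a source without an entry is capped only by k.
    Items already selected by an earlier source are skipped and do not
    count against the later source's budget.
    """
    selected = []
    seen = set()
    for name, items in sources:
        cap = budgets.get(name, k)
        taken = 0
        for item in items:
            if len(selected) >= k or taken >= cap:
                break
            if item in seen:
                continue
            seen.add(item)
            selected.append(item)
            taken += 1
    return selected


# Toy example: ALS is capped at 2 slots so SegPop still gets room.
pool = sequential_fill(
    [("als", [1, 2, 3, 4]), ("segpop", [3, 5, 6])],
    budgets={"als": 2, "segpop": 2},
    k=3,
)
```

The caps are what keep a high-volume source from consuming the whole budget, which is the failure mode described under H-016 and the SegPop crowding observed with round-robin.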

[H-018] — Disabling als_view improves Recall@200 ✅ VERIFIED

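Statement: The original expectation was that pageview-based ALS (als_view) adds a valuable coverage signal to the candidate pool. The evidence shows the opposite: removing it improves Recall@200.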
Evidence:
- WITH als_view (budget=80): Recall@200 = 0.3014
- WITHOUT als_view (budget=0): Recall@200 = 0.3177 (+5.4%)
- als_view consumed 80-95k slots but contributed ZERO net recall improvement.
Verified in Round: 19
→ Impact: DISABLE als_view. Pageview data is only useful for PageviewReplay and IntentRecommender, NOT for CF.
→ Insight ID: INS-047

[H-014] — adview_count correlates with contact probability up to a point ✅ VERIFIED

Statement: Views and contacts are strongly related.
Evidence:
- Pearson correlation is 0.7571.
- Conversion rate is highest at 0 views (0.103) and 150+ views (0.101), dropping to 0.087 at 30 views.
Verified in Round: 16
→ Impact: Must include views_24h and a non-linear combination contacts_24h / (views_24h + 1) in LightGBM Reranker.
→ Insight ID: INS-042
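
The ratio feature named in the Impact line can be sketched as follows. The function name is an assumption for illustration. The expression contacts_24h / (views_24h + 1) is taken from the text, and the +1 keeps the ratio defined for items with zero views.

```python
def contact_rate_24h(contacts_24h: float, views_24h: float) -> float:
    """Smoothed contacts-per-view ratio: contacts_24h / (views_24h + 1)."""
    if contacts_24h < 0 or views_24h < 0:
        raise ValueError("counts must be non-negative")
    return contacts_24h / (views_24h + 1)
```

Because conversion is reported as non-monotonic in views (high at 0 and at 150+, lower around 30), passing both views_24h and this ratio to the Reranker lets a tree model pick up the shape rather than assuming a linear trend.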

[H-015] — Users have high category stickiness ✅ VERIFIED

Statement: Users rarely cross-shop between different real estate categories.
Evidence:
- Average probability of staying in the exact same category across consecutive contacts is 75.11%.
- 1050 (Projects) has the highest stickiness at 87.2%.
Verified in Round: 17
→ Impact: Sequential recommendations must strictly penalize category switches unless there's an explicit signal. Add is_same_category_as_last_view feature to Reranker.
→ Insight ID: INS-043
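
One way to measure stickiness is the share of consecutive contact pairs, per user, that stay in the same category. The sketch below pools all pairs across users. Whether the 75.11% figure was pooled this way or averaged per category is not stated, so treat the aggregation, the function name, and the (user_id, timestamp, category) layout as assumptions.

```python
from collections import defaultdict


def category_stickiness(contacts):
    """Share of consecutive same-user contact pairs with an unchanged category.

    `contacts` is an iterable of (user_id, timestamp, category) tuples
    (hypothetical layout). Returns NaN when no user has two or more contacts.
    """
    by_user = defaultdict(list)
    for user_id, ts, category in contacts:
        by_user[user_id].append((ts, category))

    same = 0
    total = 0
    for events in by_user.values():
        events.sort()  # chronological order within each user
        for (_, prev_cat), (_, cur_cat) in zip(events, events[1:]):
            total += 1
            if prev_cat == cur_cat:
                same += 1
    return same / total if total else float("nan")
```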

[H-002] — 64% test users are Cold-Start ✅ VERIFIED

Statement: A large portion of test users have NO training history.
Evidence:
- Total test users: 161,568
- With event history (login): 58,153 (36%)
- Cold-start (NO history): 103,415 (64%)
- With contact interaction history: 60,212 (37.3%)
Verified in Round: 02
→ Impact: CRITICAL. Cold-start fallback strategy is ESSENTIAL. Popularity/trending by city+category MUST be implemented. 64% of our score depends on cold-start handling!
→ Feature created: user_is_cold_start (boolean)
→ Insight ID: INS-004
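
A minimal sketch of how the user_is_cold_start flag and the 64% share could be derived. The function names and inputs are hypothetical. The definition used here (a test user with no logged-in event history) follows the counts listed above.

```python
def cold_start_flags(test_users, users_with_history):
    """Map each test user to True when it has no training history."""
    history = set(users_with_history)
    return {user: user not in history for user in test_users}


def cold_share(flags):
    """Fraction of users flagged as cold-start."""
    return sum(flags.values()) / len(flags) if flags else 0.0


# Re-deriving the reported split from the counts in the evidence list.
total_test_users = 161_568
with_history = 58_153
cold = total_test_users - with_history  # 103,415
cold_pct = round(100 * cold / total_test_users)  # 64
```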

[H-003] — dwell_time_sec is in milliseconds ✅ VERIFIED

Statement: The column is labeled "sec" but its values are in milliseconds.
Evidence:
- Raw median (pageview): 17,915 → if read as seconds ≈ 5 hours (IMPOSSIBLE)
- Divided by 1000: median = 17.9 seconds (REALISTIC for page viewing)
- Mean: 52.3 seconds after conversion (reasonable with some long sessions)
Verified in Round: 02
→ Impact: ALL code referencing dwell_time_sec must divide by 1000.
- config/settings.py → min_valid_dwell_sec: 3.0 means threshold of 3000ms raw.
- data_forensics.py bot detection thresholds must be recalibrated.
→ Insight ID: INS-005
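
The conversion implied by the Impact note, as a small sketch. The helper names are assumptions. min_valid_dwell_sec=3.0 mirrors the setting mentioned above, and treating the threshold as inclusive (>=) is a guess, since the source does not say.

```python
MS_PER_SECOND = 1000.0


def dwell_seconds(raw_dwell_time_sec: float) -> float:
    """Convert the raw dwell_time_sec value (actually milliseconds) to seconds."""
    return raw_dwell_time_sec / MS_PER_SECOND


def is_valid_dwell(raw_dwell_time_sec: float, min_valid_dwell_sec: float = 3.0) -> bool:
    """True when the converted dwell meets the threshold (inclusive, by assumption)."""
    return dwell_seconds(raw_dwell_time_sec) >= min_valid_dwell_sec
```

With these helpers the raw median 17,915 becomes 17.915 seconds, and the 3.0-second threshold corresponds to a raw value of 3000.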

❌ REJECTED HYPOTHESES

[H-001] — project_id nullity correlates with non-apartment categories ❌ REJECTED

Statement: project_id is null mostly for categories 1030 (houses) and 1040 (land plots).
Counter-evidence:
- 1010 (Rented rooms): 58.6% null
- 1020 (Apartments/condos): 96.73% null ← SURPRISING!
- 1030 (Houses): 93.54% null
- 1040 (Land plots): 91.44% null
- 1050 (Newly launched projects): 100% null ← MOST SURPRISING!
Learning: project_id is null across ALL categories (>58%). Even category 1050 (Newly launched projects), which IS apartment projects, has 100% null project_id. This column is unreliable as a category indicator.
→ New insight: project_id may represent a specific named project within larger platforms, and most listings (even apartment ones) are not tied to a named project. Cannot use project_id.is_not_null() as is_apartment.

[H-020] — LightGBM reranker on cascade k=200 improves top-10 ❌ REJECTED

Statement: Generating 200 candidates with cascade, then reranking with LightGBM LambdaRank, will improve Recall@10 over direct cascade k=10.
Counter-evidence:
- v10 (cascade k=10 direct): 0.034 on leaderboard
- v11 (cascade k=200 + LightGBM): 0.0048 on leaderboard
- Root cause 1: Reranker trained on EnsembleGen, deployed on CascadeGen (INS-052)
- Root cause 2: segpop.pkl was overwritten by training pipeline (INS-053)
Rejected in Round: 21
→ Lesson: Reranker CAN work but MUST be retrained on the same candidate distribution used at inference time.

[H-021] — Intra-segment offset diversity improves cold user score ❌ REJECTED

Statement: Hash-offsetting blind users into different positions within segment popularity pools will improve the score by recommending different items to different users.
Counter-evidence:
- v10 (top items, no offset): 0.034
- v12 (offset diversity): 0.005
- Offset pushed users to position 50-200 in pool = less popular = less relevant
Rejected in Round: 21
→ Lesson: Diversity ≠ quality. Popular items are popular because they ARE relevant. (INS-054)

[H-022] — PV-first cascade improves warm user Recall@10 ❌ REJECTED

Statement: Giving PV replay priority over ALS will improve warm user precision by combining explicit interest + CF discovery.
Counter-evidence:
- ALS-first: Recall@10 = 0.1009
- PV-first (3 PV + 7 ALS): Recall@10 = 0.0999
- ALS and PV overlap only 0.5/10 but neither order dominates
Rejected in Round: 21
→ Lesson: Source ordering barely matters at k=10 when ALS fills all slots (INS-056)

[H-024] — Category-proportional blind allocation beats global demand fallback ❌ REJECTED

Statement: Allocating no-preference blind slots proportionally by blind contact category distribution will outperform a global high-demand item set.
Counter-evidence:
- global_score7 from snapshot demand: Recall@10 = 0.001190, hits = 63
- snap_hcm_prop_4_3_2_1: Recall@10 = 0.000660, hits = 43
- snap_weighted_segments: Recall@10 = 0.000575, hits = 35
- Production-style fixed top item set + rank rotation improved blind recall from 0.0001 → 0.0005, but snapshot global demand improved it further to 0.0011 in full aligned eval.
Rejected in Round: 24
→ Lesson: For truly blind users, segment diversification is weaker than recent item-side demand. Use diversity only for exposure constraints, not item-set selection.
→ Insight ID: INS-070

[H-031] — Snapshot demand fallback improves public leaderboard ❌ REJECTED

Statement: Since snapshot 7d demand improved truly-blind offline recall, using it as the blind fallback in production should improve LB.
Counter-evidence:
- outputs/submission_snapshot_blind.zip public LB = 0.0003
- Protected v17 baseline outputs/submission_1024.zip public LB = 0.2116
- Delta = -0.2113
Rejected in Round: 26
→ Lesson: Snapshot demand offline gains do not transfer to public LB. Do not use snapshot fallback as final.
→ Insight ID: INS-073

[H-032] — ALS1536 + recency/time-decay branch beats ALS1024 v17 ❌ REJECTED

Statement: Increasing ALS factors from 1024 to 1536 and adding time-decay/test-only cold prefs will improve the v17 baseline.
Counter-evidence:
- v17 outputs/submission_1024.zip: 0.2116
- v18 outputs/submission_1536.zip: 0.2108
- Delta = -0.0008
Rejected in Round: 26
→ Lesson: ALS1024 remains the current production sweet spot. Larger factors and recency weighting must be isolated before trusting.
→ Insight ID: INS-074

[H-033] — Conservative v17/v18 slot blend can safely improve tail ranks ❌ REJECTED

Statement: Keeping v17 ranks 1-9 and replacing rank 10 with a unique v18 item should preserve most of v17's strength while adding incremental diversity.
Counter-evidence:
- v17 outputs/submission_1024.zip: 0.2116
- v19 outputs/submission_blend_v17_9_v18_1.zip: 0.1974
- Delta = -0.0142
Rejected in Round: 26
→ Lesson: v17 top-10 ordering is valuable even at rank10. Slot-level blending is not safe.
→ Insight ID: INS-075
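
For reference, the slot blend tested in H-033 can be sketched like this. It is a reconstruction from the statement only: the function name and the fallback to v17's own rank-10 item when v18 offers nothing new are assumptions.

```python
def blend_top10(v17_items, v18_items, keep=9):
    """Keep the first `keep` v17 items, then append the first unseen v18 item.

    Falls back to v17's next item when every v18 item is already
    in the kept head (assumed behavior, not confirmed by the source).
    """
    head = list(v17_items[:keep])
    used = set(head)
    for item in v18_items:
        if item not in used:
            return head + [item]
    return list(v17_items[:keep + 1])
```

The -0.0142 delta reported above is the cost of this single-slot substitution, which is why the lesson treats even rank 10 of v17 as worth protecting.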

[H-023] — Warm users contribute ~0.10 recall ✅ VERIFIED

Statement: The v10 leaderboard score (0.034) comes entirely from warm users (33.7%), implying warm Recall@10 ≈ 0.10.
Evidence:
- 0.034 / 0.337 = 0.101 (implied warm recall)
- Offline eval on warm users: Recall@10 = 0.1009 (✅ MATCHES!)
- Cold users get SegPop items → ~0 recall contribution
Verified in Round: 21
→ Insight ID: INS-055
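
The implied-recall arithmetic above, written out. The helper name is hypothetical. The calculation assumes the other segments contribute zero recall, as the evidence states for cold users.

```python
def implied_segment_recall(overall_score: float, segment_share: float) -> float:
    """Recall a segment must have if it alone produces the overall score."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return overall_score / segment_share


# v10 leaderboard score 0.034 with a 33.7% warm share.
warm_recall = implied_segment_recall(0.034, 0.337)
```

Here warm_recall rounds to 0.101, in line with the offline warm Recall@10 of 0.1009.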

🔀 MODIFIED HYPOTHESES

[H-004] → [H-004-M] — other_interaction IS a positive signal (is_contact=1) 🔀 MODIFIED

Original: "other_interaction: positive signal or noise?"
Modified to: other_interaction IS consistently flagged as is_contact=1, confirming it IS a positive interaction.
Evidence:
- ALL other_interaction events have is_contact=1 (561,188 events, 100%)
- ALL pageview events have is_contact=0 (404,986 events, 100%)
- This is consistent with the definition in the problem statement, lines 99-102.
- Line 105 of the problem statement, which says "other_interaction is browsing noise", appears to be an ERROR or intentional misdirection.
Reason: The is_contact flag is the ground truth. Data says other_interaction = positive.
→ Impact: MUST include other_interaction in positive_events config! Current settings.py EXCLUDES it. This is a CRITICAL config bug that would dramatically hurt Recall@10.
→ Action: Update config/settings.py positive_events list to include other_interaction.
→ Insight ID: INS-006
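
A check of this kind can be reproduced with a per-event-type contact rate. The sketch below uses a plain list of (event_type, is_contact) pairs. That layout and the function name are assumptions, not the project's actual schema.

```python
from collections import Counter


def contact_rate_by_event(events):
    """Return {event_type: share of rows with is_contact == 1}."""
    totals = Counter()
    positives = Counter()
    for event_type, is_contact in events:
        totals[event_type] += 1
        positives[event_type] += int(is_contact)
    return {etype: positives[etype] / totals[etype] for etype in totals}


sample = [
    ("other_interaction", 1),
    ("other_interaction", 1),
    ("pageview", 0),
]
rates = contact_rate_by_event(sample)
```

A rate of exactly 1.0 for other_interaction and 0.0 for pageview on the full data is what the evidence above reports.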

[H-023F] → [H-023F-M] — Pure freshness is not enough; snapshot demand freshness is the useful variant 🔀 MODIFIED

Original: Freshness-first SegPop, prioritizing items posted ≤7 days, will improve blind user cold-start.
Modified to: Truly-blind fallback should rank by recent demand from snapshots (contacts_7d*20 + views_7d), not by posted_date freshness alone.
Evidence:
- global_fresh_only: Recall@10 = 0.000000
- global_score7_fresh: Recall@10 = 0.000538
- global_score7: Recall@10 = 0.001190
- Full aligned eval blind recall improved from 0.0001 → 0.0011 after deploying snapshot demand fallback.
Reason: Recent demand is a stronger proxy for current market relevance than newness alone.
→ Insight ID: INS-070
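
The snapshot demand score from the modified hypothesis, sketched as a ranking helper. Only the expression contacts_7d*20 + views_7d comes from the text. The function names and the (item_id, contacts_7d, views_7d) tuple layout are assumptions.

```python
def demand_score(contacts_7d: int, views_7d: int, contact_weight: int = 20) -> int:
    """Recent demand: contacts weighted 20x relative to views."""
    return contacts_7d * contact_weight + views_7d


def top_by_demand(items, k=10):
    """Return the ids of the k items with the highest demand score.

    `items` is an iterable of (item_id, contacts_7d, views_7d) tuples.
    Ties keep their input order because sorted() is stable.
    """
    ranked = sorted(items, key=lambda row: demand_score(row[1], row[2]), reverse=True)
    return [item_id for item_id, _, _ in ranked[:k]]
```

For example, an item with 5 contacts and 10 views (score 110) outranks one with 0 contacts and 100 views (score 100).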
