📊 INSIGHTS REGISTRY: REGISTER OF ALL INSIGHTS

Purpose: Record EVERY insight discovered, so this file becomes a cumulative knowledge base.
Rule: Each insight must have a unique ID, evidence, and a feature suggestion.
When to read this file: Before starting a new round, to avoid duplicated effort.


📈 DASHBOARD

Category                        | Insights | Breakthrough?
Data Quality                    |    1     | -
Data Scale                      |    1     | -
Marketplace Structure           |    1     | -
Algorithm Architecture          |    4     | ✅ GAME-CHANGING
Leaderboard Diagnosis           |    4     | ✅ ROOT CAUSE
Experiment Failures (Round 21)  |    5     | ⚠️ LESSONS
Experiment Failures (Round 22)  |    2     | 🔴 CRITICAL LESSONS
PCI Data Discovery (Round 19)   |    2     | 🔴🔴🔴 BREAKTHROUGH
Cold-Start Ceiling (Round 23)   |    2     | 🔴🔴🔴 GAME-CHANGING
Eval Infrastructure (Round 24)  |    5     | 🔴🔴🔴 CRITICAL
Leaderboard Breakthrough (Round 25) | 1    | ✅ GAME-CHANGING
Leaderboard Postmortem (Round 26)   | 3    | 🔴 GUARDRAILS
TOTAL                           |   31     | -

🏷️ INSIGHT INDEX (QUICK REFERENCE)

ID | Round | Category | Headline | Impact | Feature Idea?
INS-001 | 01 | Data Quality | Systematic nullity in dim_listing by property type | 🟡 MED | is_apartment flag
INS-002 | 01 | Data Scale | fact_user_events = 161.7M rows, 500 files | 🔴 HIGH | Must pre-aggregate
INS-003 | 01 | Marketplace | Agent sellers dominate 83.5% of listings | 🟡 MED | Fairness metric input
INS-045 | 19 | Algorithm | Budget-based sequential union beats hard cascade: Recall@200 0.27→0.31 | 🔴🔴🔴 | Budget caps per source
INS-046 | 19 | Algorithm | Round-robin interleave HURTS recall vs sequential priority | 🔴🔴 | Keep sequential
INS-047 | 19 | Algorithm | als_view (pageview CF) dilutes candidate pool; disabling it improves Recall@200 | 🔴🔴 | Set als_view budget=0
INS-048 | 20 | Leaderboard | SegPop city name bug: "Hồ Chí Minh" ≠ "Tp Hồ Chí Minh" → 91k users same items | 🔴🔴🔴 | Fix key names
INS-049 | 20 | Leaderboard | 56.4% test users have ZERO training events (completely blind) | 🔴🔴🔴 | Hash-based segment assignment
INS-050 | 20 | Leaderboard | Offline eval doesn't predict leaderboard: best=0.006 vs top1=0.32 (53x gap) | 🔴🔴🔴 | Need test-aligned eval
INS-051 | 20 | Leaderboard | 50.3% contacts on items posted ≤7 days → recency > popularity | 🔴🔴 | Recency-weighted SegPop
INS-052 | 21 | Experiment | LightGBM reranker trained on EnsembleGen ≠ CascadeGen distribution | 🔴🔴🔴 | Must retrain
INS-053 | 21 | Engineering | Training pipeline overwrites segpop.pkl with alltime version | 🔴🔴🔴 | Backup/restore
INS-054 | 21 | Experiment | Offset diversity for cold users HURTS: top items are most relevant | 🔴🔴 | Don't offset
INS-055 | 21 | Analysis | Warm users already at ~0.10 recall; cold users (66%) ≈ 0 | 🔴🔴🔴 | Cold=primary lever
INS-056 | 21 | Experiment | PV-first cascade ≈ ALS-first (0.0999 vs 0.1009) | 🟡 | Keep ALS-first
INS-057 | 22 | Experiment | Removing is_login filter HURTS: 0.034→0.014 (-59%). Non-login = noise | 🔴🔴🔴 | KEEP is_login filter
INS-058 | 22 | Analysis | ALS matrix density is key: 16.1→7.5 contacts/user killed embeddings | 🔴🔴🔴 | Density > size
INS-059 | 19 | Data Source | 10,654 blind test users have PCI data (avg 16.3 items); convert blind→warm | 🔴🔴🔴 | PCI prefs for blind
INS-060 | 19 | Data Source | 644,732 NEW lead pairs from PCI not in ALS training data | 🔴🔴🔴 | Merge PCI into ALS
INS-061 | 19 | Architecture | 4-stage pipeline (Cascade→Feature→LightGBM→Reranker) code EXISTS but unused since v11 bug | 🔴🔴🔴 | Retrain LightGBM on cascade
INS-063 | 23 | Cold-Start | SegPop ceiling ~1.6% Recall@10 even with PERFECT city+cat knowledge | 🔴🔴🔴 | Popularity alone cannot solve cold-start
INS-064 | 23 | Cold-Start | 44% blind contacts on items ≤7d old; 1050 (Dự án) = #1 category for blind users | 🔴🔴🔴 | Freshness-first SegPop, category reweighting
INS-065 | 24 | Eval | Val: 76.8% warm / 4.7% cold / 18.5% blind vs Test: 36% / 7.7% / 56.4%. Distribution mismatch | 🔴🔴🔴 | Must simulate test ratio
INS-066 | 24 | Eval | ALS/SegPop trained on full data leaks val contacts → blind recall inflated 10x (0.165 vs 0.016 ceiling) | 🔴🔴🔴 | Must retrain models on split-clean data
INS-067 | 24 | Eval | Cold+PCI prefs = 0.0612 recall vs 0.0020 without (30x uplift). PCI prefs are critical for cold users | 🔴🔴🔴 | Expand PCI coverage to more cold/blind test users
INS-068 | 24 | Eval | ALS recall drops 5.6x when 3d val contacts removed (0.10→0.018). Most recent contacts are disproportionately important | 🔴🔴🔴 | Time-weight ALS toward recent contacts
INS-069 | 24 | Model Architecture | LightGBM ranker overfits to warm features, severely destroying cold-start recall | 🔴🔴🔴 | Implement Segmented Inference Policy
INS-071 | 25 | Cold-Start Signal | 4,215 truly-blind test users have non-login pageviews with extractable city+cat prefs. But INS-057 warns non-login = device-level IDs | 🟡🟡 | H-029: verify if non-login pref injection helps or is irrelevant
INS-072 | 25 | Leaderboard | v17 reached 0.2116 LB / top5: ALS 1024 + full-data cascade-direct + uppercase ID submission | 🔴🔴🔴 | Keep cascade mode as production baseline
INS-073 | 26 | Leaderboard Failure | Snapshot blind fallback scored 0.0003 on LB despite offline promise | 🔴🔴🔴 | Never use snapshot fallback for final unless LB-ablation proves it
INS-074 | 26 | Leaderboard Failure | ALS1536 + time-decay + test-only prefs scored 0.2108, slightly below v17 0.2116 | 🔴🔴 | ALS1024 remains production sweet spot
INS-075 | 26 | Leaderboard Failure | v17 top9 + v18 slot10 blend scored 0.1974; even rank10 replacement hurt badly | 🔴🔴🔴 | Do not slot-blend v17 unless full-list eval proves gain

📖 DETAILED INSIGHTS

[INS-001] — Systematic Nullity in dim_listing

📊 Data Evidence

project_id: 88.71% null (2,756,219 / 3,107,114)
direction: 82.15% null
floors: 70.52% null
furnishing: 54.81% null
house_type: 51.47% null
bathrooms: 44.85% null
bedrooms: 31.78% null

🏠 Domain Explanation

Land plots (Đất nền, 1040) and houses (Nhà ở, 1030) naturally have no project_id, floors, or furnishing. The nullity is not a data error; it reflects the property type.

💡 Feature Engineering Suggestion

Feature name: is_apartment
Formula: project_id.is_not_null()
Expected impact: Strong signal for category classification. LightGBM handles NaN natively.
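
A minimal sketch of the is_apartment flag and of a per-column null rate, in plain Python so it runs without the real tables. The toy rows and the helper null_rate are hypothetical; in Polars the same flag would be the expression pl.col("project_id").is_not_null().

```python
from typing import Optional


def is_apartment(project_id: Optional[int]) -> int:
    """Return 1 when project_id is present, else 0."""
    return int(project_id is not None)


def null_rate(rows: list, column: str) -> float:
    """Share of rows where `column` is missing (None)."""
    if not rows:
        return 0.0
    return sum(r.get(column) is None for r in rows) / len(rows)


# Hypothetical toy rows standing in for dim_listing.
rows = [
    {"listing_id": 1, "project_id": 501, "floors": None},
    {"listing_id": 2, "project_id": None, "floors": None},
    {"listing_id": 3, "project_id": None, "floors": 2},
    {"listing_id": 4, "project_id": None, "floors": None},
]

flags = [is_apartment(r["project_id"]) for r in rows]
project_null_rate = null_rate(rows, "project_id")
```

The flag turns the missingness pattern itself into a feature instead of imputing a value for project_id.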

🎯 Follow-up


[INS-002] — Massive Scale of Clickstream Data

📊 Data Evidence

fact_user_events: 161,731,336 rows, 500 files
fact_listing_snapshot: 19,762,167 rows, 62 files
fact_post_contact_interactions: 25,486,445 rows, 147 files
dim_listing: 3,107,114 rows, 40 files

💡 Feature Engineering Suggestion

Strategy: Pre-aggregate fact_user_events to user-level and item-level before joins.
Never operate at raw event level in feature engineering.
Use Polars LazyFrame + column pushdown + date filters.
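
The pre-aggregation idea can be sketched as follows, assuming hypothetical event fields user_id, item_id, and is_contact. This is plain Python for illustration only; at 161.7M rows the same group-by would be done lazily in Polars with column pushdown and date filters.

```python
from collections import defaultdict


def aggregate_user_level(events):
    """Collapse raw events into one row of counts per user."""
    acc = defaultdict(lambda: {"n_events": 0, "n_contacts": 0, "items": set()})
    for e in events:
        row = acc[e["user_id"]]
        row["n_events"] += 1
        row["n_contacts"] += int(e["is_contact"])
        row["items"].add(e["item_id"])
    # Replace the item set with its size so the output is join-friendly.
    return {
        uid: {
            "n_events": r["n_events"],
            "n_contacts": r["n_contacts"],
            "n_items": len(r["items"]),
        }
        for uid, r in acc.items()
    }


# Hypothetical toy events.
events = [
    {"user_id": "u1", "item_id": "i1", "is_contact": False},
    {"user_id": "u1", "item_id": "i1", "is_contact": True},
    {"user_id": "u1", "item_id": "i2", "is_contact": False},
    {"user_id": "u2", "item_id": "i3", "is_contact": False},
]
user_features = aggregate_user_level(events)
```

Joins then run against one row per user (or per item) rather than against raw events.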

[INS-003] — Agent Seller Dominance

📊 Data Evidence

agent: 2,593,063 (83.5%)
private: 514,051 (16.5%)

🏠 Domain Explanation

A characteristic of Vietnamese real estate: agents (môi giới) account for the majority of listings because private individuals rarely know how to post professional listings. The fairness metric must adjust exposure for private sellers.

💡 Feature Engineering Suggestion

Feature name: seller_type_encoded (binary)
Use in Fairness metric: Target ratio should reflect natural distribution, not 50/50.


[INS-019] — Fairness Gap: Agent/Private Ratio Severely Miscalibrated

📊 Data Evidence

Submission:    agent=27.3%,  private=72.7%
GT contacts:   agent=52.0%,  private=48.0%
Gap:           −24.7 pp (agents heavily under-served)

🏠 Domain Explanation

Agents account for 83.5% of dim_listing but collectively receive only 52% of contacts, because private sellers have a lead-per-listing rate 3x higher. The system is pushing too many private sellers into the top-10 → agents react negatively, which affects Chợ Tốt's B2B revenue.

💡 Feature Engineering Suggestion

Feature: seller_type_fairness_correction
Formula: if agent_ratio_current < 0.52: boost agent-seller items in reranker
Impact: Calibrate HealthMetrics.gt_dist with agent_ratio=0.520 (from real data)
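
One possible reading of the correction rule, as a sketch: measure the agent share in the current top-k and multiply agent scores only when it falls below the 0.52 target. The function name, the boost factor 1.2, and the candidate fields are assumptions, not the reranker's actual interface.

```python
def rerank_with_agent_boost(candidates, target_agent_ratio=0.52, boost=1.2, k=10):
    """Boost agent items only when the current top-k under-serves agents."""
    by_score = lambda c: c["score"]
    ranked = sorted(candidates, key=by_score, reverse=True)
    top = ranked[:k]
    if not top:
        return ranked
    agent_ratio = sum(c["seller_type"] == "agent" for c in top) / len(top)
    if agent_ratio >= target_agent_ratio:
        return ranked  # already calibrated, leave scores untouched
    # Copy each candidate so the caller's list is not mutated.
    boosted = [
        {**c, "score": c["score"] * boost} if c["seller_type"] == "agent" else dict(c)
        for c in candidates
    ]
    return sorted(boosted, key=by_score, reverse=True)


# Hypothetical candidates; k=2 keeps the example small.
candidates = [
    {"item_id": "p1", "seller_type": "private", "score": 0.90},
    {"item_id": "p2", "seller_type": "private", "score": 0.80},
    {"item_id": "a1", "seller_type": "agent", "score": 0.78},
    {"item_id": "a2", "seller_type": "agent", "score": 0.50},
]
reranked = rerank_with_agent_boost(candidates, k=2)
```

A multiplicative boost keeps the relative order among agent items; the right factor would have to be tuned against Recall@10.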

🎯 Business Impact

Agents trả phí premium placement. Under-serving họ = churn risk + doanh thu B2B giảm.


[INS-020] — Category Imbalance: 1050 Over-Served, 1010 Under-Served

📊 Data Evidence

Category | Submission | GT contacts | Gap
1010     |    11.3%   |    15.6%    | -4.3pp (under)
1020     |    41.2%   |    44.6%    | -3.4pp (under)
1030     |     8.7%   |     6.5%    | +2.3pp (over)
1050     |    29.0%   |    23.1%    | +5.9pp (OVER-SERVE)

💡 Feature Engineering Suggestion

Feature: category_exposure_correction
Formula: KL divergence from GT category distribution → boost under-served categories
Used in: MultiObjectiveReranker fairness term γ
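
A hedged sketch of the KL-based correction. Two assumptions: the four categories in the table are renormalized to sum to 1 (1040 is not listed there), and the divergence is taken as KL(GT || submission). The per-category multiplier gt/submission is an illustrative choice, not the reranker's confirmed fairness term.

```python
import math


def normalize(dist):
    """Rescale shares so they sum to 1."""
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero shares in q."""
    return sum(
        pv * math.log(pv / (q.get(k, 0.0) + eps)) for k, pv in p.items() if pv > 0
    )


# Shares from the table above, renormalized over the four listed categories.
gt = normalize({"1010": 0.156, "1020": 0.446, "1030": 0.065, "1050": 0.231})
submission = normalize({"1010": 0.113, "1020": 0.412, "1030": 0.087, "1050": 0.290})

divergence = kl_divergence(gt, submission)
# Multiplier > 1 means the category is under-served and should be boosted.
boosts = {k: gt[k] / submission[k] for k in gt}
```

The divergence gives one scalar to track per submission, while the multipliers say which categories to push.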

[INS-021] — Freshness "Paradox" Debunked (Survivorship Bias)

📊 Data Evidence

Submission  — median listing age: 10 days,  mean: 36 days
GT contacts — median listing age: 97 days,  mean: 106 days

BUT PDF 2 reveals: 69.7% of all contacts happen in the first 7 days.

🏠 Domain Explanation

The 97-day median age for GT contacts is an illusion caused by Survivorship Bias. Bad listings are removed early. Only high-quality listings survive to 90+ days. The true "Golden Moment" is the first 7 days.

💡 Feature Engineering Suggestion

Recommendation: Keep ALS half_life at 7d to capture the 69.7% Golden Moment.
DO NOT raise half-life to 30d as originally hypothesized in R09.
Reranker delta: Maintain freshness weight to boost new items.

[INS-022] — Coverage Extremely Low: 3.71%, Popularity Bias Severe

📊 Data Evidence

Items recommended: 115,340 / 3,107,114 = 3.71%
Top-1% items: 81.9% of all recommendation slots
96.3% of catalogue: NEVER recommended

🏠 Domain Explanation

A classic feedback loop: popular items → recommended → more views → more contacts → more popular. New sellers never get traction. Marketplace health degrades over time.

💡 Feature Engineering Suggestion

Feature: item_novelty_score = 1 - (popularity_rank / total_items)
Strategy: Add novelty bonus in BurstTrendingRecommender for long-tail items
Target: Raise coverage from 3.71% → 8%+ without sacrificing Recall@10
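
A small sketch of the novelty score and the coverage figure. One assumption matters here: the formula rewards long-tail items only if popularity_rank counts up from the least popular item (rank 1 = least popular, rank N = most popular). With the opposite convention the expression would need to be inverted.

```python
def item_novelty_score(popularity_rank: int, total_items: int) -> float:
    """1 - rank/total, assuming rank 1 is the LEAST popular item."""
    return 1.0 - popularity_rank / total_items


def coverage(recommended_items, total_items: int) -> float:
    """Share of the catalogue that appears in at least one recommendation."""
    return len(set(recommended_items)) / total_items


# Coverage figure from the evidence above.
coverage_pct = 115_340 / 3_107_114 * 100

# Hypothetical 4-item catalogue: rank 4 is the most popular item.
most_popular = item_novelty_score(4, 4)
least_popular = item_novelty_score(1, 4)
```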

[INS-023] — Ground-Truth Distribution Calibrated from Data

📊 Data Evidence

{
  "agent_ratio": 0.520,
  "category_dist": { "1010": 0.156, "1020": 0.446, "1030": 0.065, "1040": 0.102, "1050": 0.231 }
}

Saved to: .cache/gt_dist.json — loaded by HealthMetrics automatically.

💡 Feature Engineering Suggestion

Replace hardcoded values in HealthMetrics (agent_ratio=0.7, category_dist=generic)
with data-driven values. This is now done automatically via gt_dist_path param.

[INS-024] — Reranker Impact Minimal for Cold Users (70% of Base)

📊 Data Evidence

Before reranking: Diversity entropy = 0.6947, Fairness = 0.273
After  reranking: Diversity entropy = 0.6986 (+0.004), Fairness = 0.273 (UNCHANGED)
Root cause: 101,441 cold users (63%) get homogeneous global trending → dominates aggregate

💡 Action Required

To meaningfully improve health metrics across ALL users:
1. Make BurstTrendingRecommender diversity-aware (inject agent items, balance categories)
2. Or: expand cold-start coverage via better ColdStartProfiler (remove require_login constraint)
3. Or: add novelty injection to global trending (force 20% long-tail items)

[INS-025] — 85.5% of GT Items Are COMPLETELY NEW to the User

📊 Data Evidence

GT pairs (last 3 days): 62,893
Repeat contacts (user contacted before):  7,088 / 62,893 = 11.3%
Previously viewed (pageview before):      9,111 / 62,893 = 14.5%
ANY prior interaction:                    9,130 / 62,893 = 14.5%
COMPLETELY NEW to user:                  53,763 / 62,893 = 85.5%

🏠 Domain Explanation

Real estate differs from e-commerce: users do not "re-buy" items. They continuously browse NEW listings in their area of interest. ALS/CF helps with only 14.5% of GT; the rest must come from segment popularity or content-based matching.

💡 Strategy Implication

CRITICAL: ALS collaborative filtering is a SECONDARY signal, not the PRIMARY one.
PRIMARY signal = popularity within user's preferred (city, category) segment.
This explains why v1-v4 scored 0.006 — they over-relied on CF for 85.5% of GT.

🎯 Action


[INS-026] — 91.9% of GT Items Match User's Preferred City

📊 Data Evidence

GT pairs with known user prefs: 53,074
Same city as user preference:     48,775 / 53,074 = 91.9%
Same category as user preference: 38,297 / 53,074 = 72.2%
BOTH city + category match:       36,342 / 53,074 = 68.5%

🏠 Domain Explanation

Property seekers almost ALWAYS search within a single city (92%). Category consistency is also high (72%): someone looking for an apartment rarely switches to land plots. This is characteristic of the real estate domain: the buy/rent decision is location-first.

💡 Strategy Implication

Feature: user_preferred_city (mode of contacted cities) → MUST-HAVE filter
Feature: user_preferred_category (mode of contacted categories) → strong filter
Recommendation cascade: (city+cat+district) → (city+cat) → (city) → (cat) → global
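
The fallback cascade above can be sketched as an ordered list of segment keys, from most to least specific. The index layout (tuples with None as a wildcard) and the function name are assumptions for illustration.

```python
def segment_candidates(pop_index, city, cat, district, k=10):
    """Fill k slots by walking segment keys from most to least specific."""
    keys = [
        (city, cat, district),  # city + cat + district
        (city, cat, None),      # city + cat
        (city, None, None),     # city
        (None, cat, None),      # cat
        (None, None, None),     # global
    ]
    out, seen = [], set()
    for key in keys:
        for item in pop_index.get(key, []):
            if item in seen:
                continue
            seen.add(item)
            out.append(item)
            if len(out) == k:
                return out
    return out


# Hypothetical popularity index: segment key -> items sorted by popularity.
pop_index = {
    ("city_a", "1020", "district_7"): ["a", "b"],
    ("city_a", "1020", None): ["b", "c"],
    ("city_a", None, None): ["d"],
    (None, None, None): ["z1", "z2"],
}
known_user = segment_candidates(pop_index, "city_a", "1020", "district_7", k=4)
unknown_user = segment_candidates(pop_index, "city_x", "9999", "district_y", k=2)
```

A user with no matching segment still receives the global list, which is exactly the path that needs monitoring (see the key-mismatch bug in INS-048).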

[INS-027] — Submission Item Coverage vs GT Coverage Gap

📊 Data Evidence

Submission unique items:    9,290
GT unique items (last 3d): 28,706
Overlap:                    6,211 / 28,706 = 21.6% (only 1 in 5 GT items in submission!)
93.1% of GT users have post_contact history (NOT cold-start problem!)

🏠 Domain Explanation

Extremely heavy popularity bias: we recommend only 9K items to 161K users, while GT needs 28K items. The submission covers only 21.6% of GT items → Recall is capped at ~0.22 from the start, regardless of ranking quality.

💡 Strategy Implication

MUST diversify item pool: recommend 50K+ unique items across all users
Reduce popularity concentration: top-1% items should be <30% of slots (was 81.9%)
Use finer-grain segments (city+cat+district) to naturally diversify

⭐ TOP BREAKTHROUGH INSIGHTS

ID | Breakthrough | Impact
INS-025 | 85.5% GT items are NEW → CF is secondary, segment popularity is primary | 🔴🔴🔴
INS-026 | 91.9% city match → location is the dominant filter | 🔴🔴🔴
INS-027 | Submission covers only 21.6% of GT items → popularity bias kills score | 🔴🔴
INS-022 | Coverage crisis: 3.71% → need long-tail strategy | 🔴🔴
INS-019 | Agent fairness gap: 24.7pp → critical for B2B revenue | 🔴🔴
INS-021 | Freshness paradox debunked (survivorship bias) → keep half-life=7d | 🔴
INS-024 | Reranker ineffective for cold users → need cold trending diversity | 🔴

[INS-028] — Funnel Drop-off: 83.9% Soft Intent vs 20.5% Real Lead

📊 Data Evidence

Positive Rate: 83.9%
Real Lead Rate: 20.5%
Median time to soft interact: 20s. Median to Real Lead: 40-67s.

🏠 Domain Explanation

Users save/share passively but hesitate to contact. A real contact takes roughly 2-3x as long to decide (40-67s vs 20s median).

💡 Feature Engineering Suggestion

Feature: time_to_contact (proxy for intent). Optimize UI to show price/area/location above the fold.

[INS-029] — Category Intent: Đất nền Highest CR, Nhà ở Lowest

📊 Data Evidence

Đất nền (1040) Positive Rate: 87.6%
Nhà ở (1030) Positive Rate: 70.2%
Dự án (1050) Volume High, CR Low (78.4%)

🏠 Domain Explanation

Đất nền buyers have urgency. Dự án browsers are curious but avoid agents. Nhà ở lacks supply/demand.

💡 Feature Engineering Suggestion

Feature: category_urgency_weight. Boost 1040 for fast conversions.

[INS-030] — Listing DNA: Images, Furnishing, and Legal Status

📊 Data Evidence

Images: Top 5% listings have >= 8 images.
Furnishing: "Nội thất cao cấp" gives 1.63x lift. "Nhà trống" gives 0.50x.
Legal: "Sổ hồng riêng" gives 1.80x lift. "Giấy tờ viết tay" gives 0.21x.

🏠 Domain Explanation

High-quality images, premium furnishing, and clear legal status reduce buyer risk and increase confidence to contact.

💡 Feature Engineering Suggestion

Features: has_so_hong_rieng, has_noi_that_cao_cap, images_count >= 8. Strong predictors for LightGBM.

[INS-031] — Geography & Category CR Dynamics

📊 Data Evidence

Cities: Bình Định/Khánh Hoà (180-220% CR) vs HN/HCM (~160%).
Category: Phòng trọ (1.87x lift) vs Dự án (0.38x lift).

🏠 Domain Explanation

Secondary markets have less supply, making each listing perform better. Dự án (Projects) have long nurture periods, while Phòng trọ converts immediately.

💡 Feature Engineering Suggestion

Feature: category_conversion_weight. Penalize 1050 in short-term predictions.

[INS-032] — The Cold-Start Bloodbath (90.8% Drop-off)

📊 Data Evidence

New Users = 59.7% of total users.
Retention 30D for New Users = 9.2% (90.8% drop off).
Power Users = 4.1% of total users, but Retention 30D = 89.7%.

🏠 Domain Explanation

New users leave if the first session recommendations do not match their intent. If they do not find relevance immediately, they assume the platform has no supply for them.

💡 Feature Engineering Suggestion

Cold-start fallback strategy MUST focus on the most popular, high-quality segments (Căn hộ, Phòng trọ in HCM/HN) to prevent immediate churn.

[INS-033] — Aha! Moment: 3 Sessions > 1 Contact

📊 Data Evidence

Baseline conversion to Power User: 2.56%
Conversion if user has 1 Contact in first 7 days: 7.85% (3.1x lift)
Conversion if user reaches >= 3 sessions in first 7 days: 19.65% (7.7x lift)

🏠 Domain Explanation

A single contact often means "Good Churn" (user found a room and uninstalled). Reaching 3 sessions means Habit Formation (user is researching, comparing, and treating the platform as a tool).

💡 Feature Engineering Suggestion

[INS-034] — Intent Matching Can Recover 31.9% of Valid Items

📊 Data Evidence

Total GT contacts for users with intent: 110,659
GT items present in dim_listing: 2,914 (2.6%)
GT items matching Top 1 Intent (District, Category, Price): 668 (22.9% of active items)
GT items matching Top 3 Intents (District, Category, Price): 931 (31.9% of active items)
GT items matching Top 1 (City, Category): 2,139 (73.4% of active items)

🏠 Domain Explanation

Real estate has extremely high liquidity: 97.4% of the listings users contacted were no longer on the platform at test time. So instead of trying to recommend OLD listings from history (CF), we can extract a demand profile (Intent) from pageview history and match it directly against the NEWEST listings in the same segment (district / property type / price band). This recovers 31.9% of the GT items that are still active.

💡 Strategy Implication

CRITICAL: Intent-Based Recommendation is MANDATORY for cold-start items.
Implement `IntentRecommender` directly targeting `dim_listing`.
Place it high in the cascade hierarchy (Priority 1.5).

[INS-035] — Recent Segment Contacts > Global Popularity

📊 Data Evidence

🏠 Domain Explanation

"Trending now" in a local area is much more relevant than "All-time popular". BĐS is highly temporal; properties popular 3 months ago are irrelevant.

[INS-036] — Pageview Replay is the Strongest Single Predictor

📊 Data Evidence

💡 Strategy Implication

PV Replay MUST be Priority 1. It represents the user's immediate, explicit intent.

[INS-037] — The 7-Day "Golden Window" for Pageviews

📊 Data Evidence

🏠 Domain Explanation

Old pageviews crowd out high-quality fresh recommendations from fallbacks. A user viewing a property 25 days ago has likely moved on.

[INS-038] — CoView is Noisy; Optimal Cascade Order

📊 Data Evidence

💡 Strategy Implication

Optimal ordering by precision: Pageview -> CoContact -> ALS -> RecentCC -> SegPop. Drop CoView.

[INS-039] — Ward-Level Intent Matching is Too Strict

📊 Data Evidence

🏠 Domain Explanation

Real estate inventory is too sparse at the Phường/Xã level. Users are willing to cross Ward boundaries within the same District or City.

💡 Strategy Implication

Elevate Intent matching to District level minimum.

[INS-040] — 97.5% of Active Inventory Ignored Due to Glob Bug

📊 Data Evidence

🏠 Domain Explanation

When 97.5% of active inventory is artificially removed from the candidate pool, the IntentRecommender and CascadeCandidateGenerator are forced to recommend stale or irrelevant properties. Real estate relies heavily on the full breadth of active supply to match nuanced user queries.

💡 Strategy Implication

ALWAYS load partitioned parquet files via pl.scan_parquet(dim_files).collect() rather than assuming a single file. Fixed in V6, immediately reviving candidate quality.

[INS-041] — Pageview Replay Trumps Generalized Intent (Priority 1)

📊 Data Evidence

🏠 Domain Explanation

While IntentRecommender (District + Cat + Price) is brilliant for filling gaps and cold-start discovery, it CANNOT beat the explicit, exact-match signal of a user clicking on a specific property yesterday (PageviewReplay).

💡 Strategy Implication

PageviewReplay MUST remain Priority 1. IntentRecommender serves as the ultimate high-quality Fallback (Priority 1.5) to capture the 27% Recall@200 ceiling.

[INS-042] — Non-linear Correlation between Views and Contacts

📊 Data Evidence

🏠 Domain Explanation

Listings with very low views but high conversion are often "Hidden Gems" or mispriced properties that get snapped up instantly. Listings with average views (30-50) are typical properties that users browse but hesitate to contact. "Mega-hot" listings (150+ views) are likely highly desirable projects where FOMO drives contact rates back up.

💡 Strategy Implication

The correlation of 0.7571 indicates that views_24h is one of the strongest predictive features for the Reranker. Because the relationship is non-linear, LightGBM should receive both views_24h and a ratio feature such as conversion_rate (contacts_24h / (views_24h + 1)).
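
The ratio feature can be sketched directly from the formula in the text. The +1 in the denominator keeps the value defined for listings with zero views; the sample listings are hypothetical.

```python
def conversion_rate(contacts_24h: int, views_24h: int) -> float:
    """contacts_24h / (views_24h + 1), defined even when views_24h is 0."""
    return contacts_24h / (views_24h + 1)


# Hypothetical listings: (contacts_24h, views_24h).
listings = {"low_views": (2, 3), "mid_views": (1, 39), "no_views": (0, 0)}
features = {
    name: {"views_24h": v, "conversion_rate": conversion_rate(c, v)}
    for name, (c, v) in listings.items()
}
```

Keeping views_24h alongside the ratio lets the tree model separate low-view, high-conversion listings from high-view ones.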

[INS-043] — The "Sticky" Category Phenomenon (75% Loyalty)

📊 Data Evidence

🏠 Domain Explanation

Unlike e-commerce where users might buy a phone then buy a case, real estate users are highly fixed in their intent. A user looking for a house (1030) rarely switches to renting a room (1010). The 87.2% loyalty in 1050 (Dự án) shows that project investors are a very distinct segment from typical residential buyers.

💡 Strategy Implication

[INS-044] — Candidate Cascade Slot Competition & Recall@200 Ceiling

📊 Data Evidence

🏠 Domain Explanation

A rigid cascade priority queue is perfect for generating a final Top-10 list, but flawed for generating a Candidate Pool for a Reranker. High-volume generators like ALS or Intent fill up the 200-slot quota instantly, starving high-precision local matches (like Pageview Replay or CoContact) of slots. If ALS is placed first, the final top-10 precision is destroyed because ALS has poor precision in the top ranks.

💡 Strategy Implication

We must shift from a "hard priority cascade" to a "diverse union generator" for candidate generation. Instead of slot-filling until 200 is reached, we should extract a fixed budget of candidates from each generator (e.g., top 50 from PV, top 50 from ALS, top 50 from Intent, top 50 from KNN) and union them to form a robust, high-recall candidate pool (aiming for Recall@200 > 0.40). We then let the LightGBM Reranker sort the final top-10 list.
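
A minimal sketch of the budget-capped union described above, assuming each generator returns a ranked list and sources are visited in priority order. Function and source names are illustrative; the budgets in the usage example are toy values, not the tuned ones.

```python
def budget_union(sources, budgets, pool_size=200):
    """Take at most budgets[name] new items from each source, in order.

    sources: dict of name -> ranked item list (dict order = priority order).
    Returns a list of (item, source_name) without duplicates.
    """
    pool, seen = [], set()
    for name, ranked in sources.items():
        taken = 0
        for item in ranked:
            if taken >= budgets.get(name, 0) or len(pool) >= pool_size:
                break
            if item in seen:
                continue  # duplicates do not consume the source's budget
            seen.add(item)
            pool.append((item, name))
            taken += 1
    return pool


# Hypothetical ranked outputs from three generators.
sources = {"pv": [1, 2, 3], "als": [2, 4, 5, 6], "intent": [7, 8]}
budgets = {"pv": 2, "als": 2, "intent": 1}
pool = budget_union(sources, budgets)
```

Because every source has a cap, a high-volume generator cannot fill the pool before later sources get a turn.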

[INS-045] — Budget-based Sequential Union Dramatically Improves Recall@200

📊 Data Evidence

🏠 Domain Explanation

Each recommender has its own strength: ALS works well for warm users with contact history, Intent for fresh listings, RecentCC for cold-start. With a hard cascade, the first model "eats" all 200 slots and the later models are starved completely. Budget caps let EVERY model contribute candidates, producing a more diverse pool.

💡 Strategy Implication

[INS-046] — Round-Robin Interleave is INFERIOR to Sequential Priority

📊 Data Evidence

🏠 Domain Explanation

Round-robin takes 1 item per source per turn. For warm users with rich history, SegPop/RecentCC (low-precision fallbacks) take too many slots in the early turns, pushing out high-precision personalized candidates from ALS/Intent. Example: the ALS item at rank #5 (very accurate) is replaced by the SegPop item at rank #5 (popularity noise). Sequential priority ensures high-precision sources fill first, and low-precision sources only fill the remaining slots.
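
The difference between the two merge strategies can be seen on a toy example. This is an illustrative sketch with hypothetical item IDs: with 6 slots, round-robin gives half of them to the fallback source, while sequential priority keeps all six ALS items.

```python
from itertools import zip_longest


def round_robin(sources, k):
    """One item per source per turn, skipping duplicates."""
    out, seen = [], set()
    for turn in zip_longest(*sources):
        for item in turn:
            if item is None or item in seen:
                continue
            seen.add(item)
            out.append(item)
            if len(out) == k:
                return out
    return out


def sequential(sources, k):
    """Exhaust each source in priority order before moving to the next."""
    out, seen = [], set()
    for ranked in sources:
        for item in ranked:
            if item in seen:
                continue
            seen.add(item)
            out.append(item)
            if len(out) == k:
                return out
    return out


als = ["a1", "a2", "a3", "a4", "a5", "a6"]     # high-precision source
segpop = ["s1", "s2", "s3", "s4", "s5", "s6"]  # low-precision fallback
rr = round_robin([als, segpop], k=6)
seq = sequential([als, segpop], k=6)
```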

💡 Strategy Implication

[INS-047] — Pageview-based ALS (als_view) Dilutes Candidate Quality

📊 Data Evidence

🏠 Domain Explanation

Pageviews are a very noisy signal in real estate. A user views 100 listings but contacts only 1-2. ALS trained on pageviews recommends items "similar to what the user viewed", but most viewed items were then SKIPPED (no contact). Contact-based ALS recommends items "similar to what the user DECIDED to contact", which is a much stronger signal. When als_view takes slots, it pushes out candidates from UserKNN, Seller, and RecentCC (which have higher precision).

💡 Strategy Implication

[INS-048] — SegPop City Name Mismatch Bug

📊 Data Evidence

SegPop city keys: "Tp Hồ Chí Minh", "Hà Nội", "Đà Nẵng", ...
Cold-start fallback code used: "Hồ Chí Minh", "Hà Nội" → key mismatch!
Result: 96,075/161,568 test users (59.5%) received IDENTICAL 10 items
Top rank-1 item assigned to 96,075 users (should be <10% = 16k max)

🏠 Domain Explanation

SegPop uses city_name from dim_listing as its key. In the data, HCM is stored as "Tp Hồ Chí Minh" (with the "Tp" prefix). The cold-start fallback hardcoded "Hồ Chí Minh" (no prefix) → the key lookup returned nothing → all blind users fell through to the global fallback → the same 10 items.
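
One way to guard against this class of bug, as a sketch: normalize city keys on both sides of the lookup and report when the global fallback is used. Only the "Tp " prefix is confirmed by the data above; the other prefix variants, the "__global__" key, and the function names are assumptions.

```python
import unicodedata

# Only "Tp " is confirmed by the data; the other variants are assumptions.
CITY_PREFIXES = ("Tp. ", "Tp ", "TP. ", "TP ")


def normalize_city(name: str) -> str:
    """Collapse whitespace, apply NFC, and strip a leading "Tp" prefix."""
    key = unicodedata.normalize("NFC", " ".join(name.split()))
    for prefix in CITY_PREFIXES:
        if key.startswith(prefix):
            return key[len(prefix):]
    return key


def lookup_segment(segpop: dict, city: str):
    """Return (items, used_fallback) so a silent fall-through is visible."""
    normalized = {normalize_city(k): v for k, v in segpop.items()}
    key = normalize_city(city)
    if key in normalized:
        return normalized[key], False
    return segpop.get("__global__", []), True


# Hypothetical SegPop table keyed by city_name as stored in dim_listing.
segpop = {"Tp Hồ Chí Minh": ["h1"], "Hà Nội": ["n1"], "__global__": ["g1"]}
hcm_items, hcm_fallback = lookup_segment(segpop, "Hồ Chí Minh")
other_items, other_fallback = lookup_segment(segpop, "Huế")
```

Counting how often used_fallback is True per submission would have exposed the 59.5% identical-list problem before upload.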

💡 Strategy Implication


[INS-049] — 56.4% Test Users Are Completely Blind (Zero Events)

📊 Data Evidence

Total test users:                 161,568
With contact history (training):   54,502 (33.7%)
With pageview history (training):  70,520 (43.6%)
With ANY training event:           70,520 (43.6%)
Completely blind (ZERO events):    91,048 (56.4%)

🏠 Domain Explanation

More than half of the test users are entirely new and never appear in the training data: no contacts, no pageviews, no signal of any kind. For these users, every personalized model (ALS, UserKNN, CoContact, PV Replay, Intent) does NOT work. Only SegPop/RecentCC can serve them.

💡 Strategy Implication


[INS-050] — Offline Eval Does NOT Predict Leaderboard Score

📊 Data Evidence

Offline eval (scripts/evaluate.py):
  - Val users: 57,907 (time-split, last 3 days of training)
  - 100% of val users HAVE contact history → warm users only
  - Recall@200 (Active GT): 0.3177
  - Recall@10 (Active GT): 0.0899

Leaderboard scores (actual submissions):
  - v4 ALS half_life=30d factors=256:    0.0060 (BEST)
  - v5 ALS half_life=7d filter=True:     0.0036
  - Hybrid ALS+SegPop+LightGBM:         0.0033
  - Cascade V3 (glob bug fixed):         0.0004
  - Cascade V5 (PV-first + SegPop bug):  0.0003
  - Top 1 on leaderboard:               ~0.32

Gap: best offline Recall@10=0.09 vs best leaderboard=0.006 (15x gap)
      vs top1=0.32 (53x gap from our best)

🏠 Domain Explanation

Offline eval only tests users who HAVE contacts in the validation period → 100% warm users. The test set has 56.4% completely blind users → the pipeline must handle cold-start, which offline eval cannot measure. In addition, the 3-day validation split may NOT reflect the test period (nearly 1 month).

💡 Strategy Implication


[INS-051] — 50.3% Contacts on Items Posted ≤7 Days (Recency Signal)

📊 Data Evidence

Age of contacted items (days since posted, last 7 days of training):
  <=   1 day:  133,720 / 589,760 = 22.7%
  <=   3 days: 208,375 / 589,760 = 35.3%
  <=   7 days: 296,763 / 589,760 = 50.3%
  <=  14 days: 377,735 / 589,760 = 64.0%
  <=  30 days: 465,175 / 589,760 = 78.9%
  <=  90 days: 539,235 / 589,760 = 91.4%

🏠 Domain Explanation

Vietnamese real estate has extremely fast liquidity: 50% of contacts fall on items posted within the last 7 days. Listings older than 30 days account for only 21% of contacts. Users actively look for NEW listings and do not return to old ones. This supplements INS-035 (Recent Segment > Global) and INS-021 (Freshness Paradox) with hard numbers.
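
A possible form of recency-weighted SegPop, as a hedged sketch: each item's contact count is multiplied by an exponential decay on listing age. The 7-day half-life, the scoring formula, and the toy items are assumptions chosen to illustrate the idea, not the production scoring.

```python
def recency_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def recency_weighted_ranking(items, half_life_days=7.0):
    """items: list of (item_id, contact_count, listing_age_days)."""
    scored = [
        (item_id, count * recency_weight(age, half_life_days))
        for item_id, count, age in items
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical segment: an older popular item vs fresher, less contacted ones.
items = [("old_hit", 10, 14), ("week_old", 10, 7), ("fresh", 4, 0)]
ranking = recency_weighted_ranking(items)
```

With these toy numbers the two-week-old item drops below a fresh item that has fewer than half as many contacts.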

💡 Strategy Implication


[INS-052] — LightGBM Reranker Train/Test Distribution Mismatch

📊 Data Evidence

Training: EnsembleCandidateGenerator (ALS + SegPop only, ~3 sources)
Inference: CascadeCandidateGenerator (9 sources: ALS, PV, Intent, CoContact, UserKNN, Seller, RecentCC, SegPop)
Features: 28 features including score_als, score_view_als, score_segpop, is_from_*
Result: v11 hybrid (cascade k=200 + LightGBM rerank) = 0.0048 vs v10 (cascade k=10 direct) = 0.0340

🏠 Domain Explanation

LightGBM LambdaRank learned to score candidates based on EnsembleCandidateGenerator distributions — where score_als is the primary discriminator. In CascadeGen, many items come from Intent/PV/CoContact with score_als=0 → ranker incorrectly scores them low → top-10 becomes ALS-only, worse than diverse cascade.

💡 Strategy Implication


[INS-053] — Training Pipeline Silently Overwrites Recency SegPop

📊 Data Evidence

segpop.pkl (recency, 4.5MB) → created 04:02, used for v10 (0.034)
Training pipeline ran at 04:14 → overwrote segpop.pkl with alltime version (6.1MB)
v11 (04:03) and v12 (04:25) used ALLTIME segpop → 0.0048 and 0.005
After restore: v13 (recency segpop) = identical stats to v10

💡 Strategy Implication


[INS-054] — Offset Diversity for Cold Users HURTS Performance

📊 Data Evidence

v10 (top items from segment pool, no offset): 0.0340
v12 (hash-offset into segment pool for diversity): 0.0050

Cold rank-1 unique items: v10=2,192 → v12=8,037 (+266% diversity)
BUT: max users per rank-1 item: v10=8,144 → v12=642 (12x less concentrated)

🏠 Domain Explanation

SegPop items sorted by popularity/recency score. Position 0-9 in each segment = MOST contacted items. Offset pushes users to position 50-200 = LESS contacted items. More diverse ≠ more relevant. In BĐS, popular items ARE the best cold-start recommendations because popularity = demand signal.

💡 Lesson Learned


[INS-055] — Warm Users Already at ~0.10 Recall; Cold Users = Primary Score Lever

📊 Data Evidence

v10 total leaderboard score: 0.034
Warm users: 54,502 (33.7%)
Cold users: 107,066 (66.3%)
Implied warm Recall@10: 0.034 / 0.337 = 0.101 (matches offline eval 0.1009!)
Implied cold Recall@10: ≈ 0 (all SegPop, same items per segment)

Offline eval (warm only, 5k users):
  Active GT Recall@10 = 0.1009
  Active GT Recall@200 = 0.3393

💡 Strategy Implication

To reach 0.10 total:
  Option A: warm=0.30, cold=0 → total = 0.30 × 0.337 = 0.101 (need 3x warm improvement)
  Option B: warm=0.10, cold=0.05 → total = 0.10 × 0.337 + 0.05 × 0.663 = 0.067
  Option C: warm=0.15, cold=0.03 → total = 0.15 × 0.337 + 0.03 × 0.663 = 0.070

Warm users: Recall@200=0.34 → can potentially reach 0.15-0.20 Recall@10 with proper reranking
Cold users: Need ANY personalization signal — test user metadata? registration info?
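
The option arithmetic above is a weighted average of warm and cold Recall@10 by user share. A short sketch that reproduces it from the user counts (54,502 warm of 161,568 test users); the function name is illustrative.

```python
WARM_USERS = 54_502
TEST_USERS = 161_568
WARM_SHARE = WARM_USERS / TEST_USERS  # about 0.337


def blended_recall(warm: float, cold: float, warm_share: float = WARM_SHARE) -> float:
    """Total score = warm recall x warm share + cold recall x cold share."""
    return warm * warm_share + cold * (1.0 - warm_share)


option_a = blended_recall(0.30, 0.00)
option_b = blended_recall(0.10, 0.05)
option_c = blended_recall(0.15, 0.03)

# Inverse direction: warm recall implied by the v10 score if cold recall is 0.
implied_warm = 0.034 / WARM_SHARE
```

Because cold users are about two thirds of the base, each point of cold recall moves the total roughly twice as much as a point of warm recall.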

[INS-056] — PV-First Cascade ≈ ALS-First for Warm Users

📊 Data Evidence

ALS-first (budget=10): Recall@10 (Active GT) = 0.1009
PV-first (budget=3 PV + 7 ALS): Recall@10 (Active GT) = 0.0999
ALS vs PV top-10 overlap: mean=0.5/10 (nearly disjoint)

🏠 Domain Explanation

ALS and PV produce complementary but equally good top-10 lists. PV replays viewed items (14.5% of GT), ALS discovers new similar items (also ~10% hit rate). Neither dominates. The cascade order doesn't matter much because ALS fills all 10 slots for warm users anyway.

💡 Lesson Learned


[INS-057] — Removing is_login Filter DESTROYS Score (0.034→0.014)

📊 Data Evidence

WITH is_login filter (v10/v14):
  Contact pairs: 13,020,004 (810,411 users, density=16.1 contacts/user)
  ALS matrix: 810K × 691K, nnz=13M
  Score: 0.0340 / 0.0344

WITHOUT is_login filter (v13):
  Contact pairs: 21,192,783 (2,813,537 users, density=7.5 contacts/user)
  ALS matrix: 2.8M × 731K, nnz=21M  
  Score: 0.0140 (-59%!)

Difference: +62.8% more data, BUT score dropped 59%

🏠 Domain Explanation

Non-login events come from anonymous/device-level sessions. These user_ids are NOT the same users evaluated in GT (ground truth only counts login contacts). Adding 2M+ anonymous users to the ALS matrix:

  1. Diluted embeddings: Same 256 factors spread across 3.5x more users → less expressive per user
  2. Added noise: Anonymous browsing patterns ≠ purchase intent patterns of logged-in users
  3. Reduced density: 16.1→7.5 contacts/user → sparser matrix → worse factorization

💡 Lesson Learned

🎯 Actionable Rule

NEVER remove is_login filter from production pipeline.
Non-login events may be useful ONLY as side features (e.g., item popularity boost),
NOT as primary collaborative filtering signal.

[INS-058] — ALS Matrix Density > Size: The Embedding Quality Principle

📊 Data Evidence

Density comparison:
  Login-only: 13M pairs / 810K users = 16.1 contacts/user → score 0.034
  All users:  21M pairs / 2.8M users = 7.5 contacts/user → score 0.014
  
Density dropped 53%, score dropped 59%. Near-linear relationship.

256 ALS factors:
  810K users → ~0.032% density in factor space
  2.8M users → ~0.009% density → 3.5x sparser embeddings
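
The density figures can be re-derived from the pair and user counts reported in INS-057. A short arithmetic sketch; the function name is illustrative.

```python
def contacts_per_user(n_pairs: int, n_users: int) -> float:
    """Average number of observed contact pairs per user."""
    return n_pairs / n_users


login_density = contacts_per_user(13_020_004, 810_411)
all_density = contacts_per_user(21_192_783, 2_813_537)

# Relative drops, as fractions of the login-only baseline.
density_drop = 1 - all_density / login_density
score_drop = 1 - 0.0140 / 0.0340

# Growth in user count when non-login users are added.
user_growth = 2_813_537 / 810_411
```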

🏠 Domain Explanation

In implicit feedback collaborative filtering, embedding quality depends on:

  1. Number of observed interactions per user (more = better personalization)
  2. Signal-to-noise ratio (login contacts = high intent, non-login = browsing noise)
  3. Factor dimensionality relative to user count (256 factors for 2.8M users = under-specified)

💡 Implications for Next Steps


[INS-059] — 10,654 Blind Test Users Have PCI Data (Untapped)

📊 Data Evidence

fact_post_contact_interactions (PCI):
  Total: 25,486,445 rows, 1,872,512 users, 574,245 items
  Date range: 2025-11-09 to 2026-04-09

Test user coverage:
  60,212 test users in PCI (37.3%)
  10,654 "blind" test users have PCI data but ZERO in fact_user_events
  
Blind users PCI signal:
  173,651 rows (avg 16.3 items/user)
  26,268 rows with lead_count > 0
  3,613 rows with chat messages
  2,436 rows with purchased = True
  
Category distribution (blind PCI users):
  1020 (Căn hộ/CC): 48.5%
  1050 (Dự án): 18.9%
  1010 (Phòng trọ): 15.8%

Recency (blind PCI users):
  7,670 users have recent data (after 2026-03-01)

🏠 Domain Explanation

PCI is a pre-aggregated daily contact/lead table independent from fact_user_events. Users who submitted lead forms, chatted with agents, or purchased through the platform appear in PCI even if their raw events weren't captured with is_login contacts. These 10,654 users represent HIGH-INTENT buyers/renters with proven commercial behavior.

💡 Feature Ideas

🎯 Actionable Next Steps

  1. Extract user preferences (city, category) from PCI for blind users
  2. Merge into cold_user_prefs.parquet → IntentRecommender will pick up
  3. Potentially feed PCI pairs into ALS matrix (see INS-060)
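
Step 1 could look roughly like the sketch below. It assumes each PCI row already carries a `city` and `category` value (in practice these may need a join against the listing dimension), and picks the most frequent value per user. The function name `extract_prefs` is hypothetical; `pref_city` and `pref_cat` follow the naming used elsewhere in this registry.

```python
from collections import Counter, defaultdict


def extract_prefs(pci_rows, blind_users):
    """Most frequent city and category per blind user, from their PCI rows."""
    cities = defaultdict(Counter)
    cats = defaultdict(Counter)
    for row in pci_rows:
        uid = row["user_id"]
        if uid in blind_users:
            cities[uid][row["city"]] += 1
            cats[uid][row["category"]] += 1
    return {
        uid: {
            "pref_city": cities[uid].most_common(1)[0][0],
            "pref_cat": cats[uid].most_common(1)[0][0],
        }
        for uid in cities
    }


pci_rows = [
    {"user_id": "b1", "city": "Tp Hồ Chí Minh", "category": 1020},
    {"user_id": "b1", "city": "Tp Hồ Chí Minh", "category": 1020},
    {"user_id": "b1", "city": "Hà Nội", "category": 1050},
    {"user_id": "w7", "city": "Đà Nẵng", "category": 1010},  # not blind, ignored
]
prefs = extract_prefs(pci_rows, blind_users={"b1"})
```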

[INS-060] — 644,732 NEW Lead Pairs from PCI Not in ALS Training

📊 Data Evidence

PCI lead pairs (lead_count > 0): 2,444,156 total
Already in ALS training:         1,799,424 (overlap)
NEW pairs from PCI:              644,732 (26.4% of PCI lead pairs)
New unique users:                237,086

Current ALS matrix: 13,020,004 pairs (810,411 users)
After PCI merge:   ~13,664,736 pairs (+5%)
Potential users:   ~1,047,497 (+29%)

🏠 Domain Explanation

PCI aggregates contact metrics from a different pipeline than fact_user_events. The 644K new pairs represent contacts/leads that were captured through PCI's aggregation but not through the is_contact flag in fact_user_events. These are HIGH-QUALITY signals (lead_count > 0 = confirmed commercial intent).

💡 Strategy: Selective Merge (preserve density per INS-058)

CRITICAL: Do NOT blindly add all 237K new users (INS-058 lesson: density > size)
INSTEAD:
  Option A: Add PCI pairs ONLY for existing ALS users (increase density per user)
  Option B: Add PCI pairs for ALL users but increase ALS factors (512)
  Option C: Add PCI pairs for test users only (targeted improvement)
  
Recommended: Option A first (safe, increases density), then test Option C

🎯 Actionable Next Steps

  1. Filter PCI lead pairs to only existing ALS users → merge into als_contact_pairs
  2. Retrain ALS on enriched matrix
  3. Offline eval → compare Recall@10 vs baseline
  4. If improved, try Option C (add test user PCI pairs)
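
A sketch of Option A (step 1), assuming both sources are available as sets of `(user_id, item_id)` tuples. The function name `merge_pci_option_a` is hypothetical. Only pairs whose user already exists in the ALS matrix are added, so the user count stays fixed while per-user density can only go up.

```python
def merge_pci_option_a(als_pairs, pci_lead_pairs):
    """Add PCI lead pairs only for users already present in the ALS matrix."""
    existing_users = {user for user, _ in als_pairs}
    new_pairs = {
        (user, item)
        for user, item in pci_lead_pairs
        if user in existing_users and (user, item) not in als_pairs
    }
    return als_pairs | new_pairs, new_pairs


als_pairs = {("u1", "i1"), ("u2", "i2")}
pci_lead_pairs = {("u1", "i1"), ("u1", "i9"), ("n5", "i3")}  # n5 is a new user
merged, added = merge_pci_option_a(als_pairs, pci_lead_pairs)
```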

[INS-062] — other_interaction IS Signal, NOT Noise (A/B Tested)

📊 Data Evidence

A/B Test: 3 ALS variants, offline eval on 5K val users, 256 factors, 30 iters, GPU

Variant A (ALL 5 types, equal weight):
  Pairs: 13M, Users: 810K, Density: 16.1
  Coverage: 100%, Recall@10: 0.0564, NDCG@10: 0.0814

Variant B (REAL 4 types only, no other_interaction):
  Pairs: 2.4M, Users: 335K, Density: 7.1
  Coverage: 75.7%, Recall@10: 0.0186 (-67%!!), NDCG@10: 0.0310

Variant C (Weighted: real=3x, other_interaction=1x):
  Pairs: 13M, Users: 810K, Density: 16.1
  Coverage: 100%, Recall@10: 0.0573 (+1.6%), NDCG@10: 0.0815

other_interaction breakdown:
  90.6M events (94.2% of all contacts)
  796K unique login users
  475K users ONLY have other_interaction (never real contact)
  14,671 test users would LOSE ALS coverage if removed

🏠 Domain Explanation

other_interaction is any interaction other than a pageview: saving a listing, sharing, clicking "interested", etc. Although weaker than view_phone/chat, it IS STILL a positive signal under the competition's definition (is_contact=1). Removing it reduces ALS density from 16.1→7.1 (INS-058) and drops 475K users from the embedding space.
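
Variant C's weighting (real=3x, other_interaction=1x) can be sketched as below. The event-type strings and the rule for pairs seen with several event types (take the highest weight) are assumptions; the source does not say how the experiment combined them.

```python
# The four "real" contact types named in this registry; exact strings are assumed.
REAL_TYPES = {"view_phone", "chat", "zalo", "sms"}


def pair_weights(events, real_weight=3.0, other_weight=1.0):
    """Map each (user, item) pair to an ALS confidence weight.

    Assumption: a pair seen with several event types keeps its highest weight.
    """
    weights = {}
    for user, item, event_type in events:
        w = real_weight if event_type in REAL_TYPES else other_weight
        key = (user, item)
        weights[key] = max(weights.get(key, 0.0), w)
    return weights


events = [
    ("u1", "i1", "other_interaction"),
    ("u1", "i1", "chat"),               # upgrades the pair to the real weight
    ("u2", "i2", "other_interaction"),
]
weights = pair_weights(events)
```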

💡 Strategy Implication

🎯 Action


[INS-063] — SegPop Ceiling ~1.6% Recall@10 Even with Perfect Segment Knowledge

📊 Data Evidence

Theoretical max Recall@10 (perfect city+cat): 0.0158
SegPop hit rates (blind val users, knowing true city+cat):
  Top-10:  1.22%
  Top-20:  2.18%
  Top-50:  4.10%
  Top-100: 6.24%
  Top-200: 9.02%
  Top-500: 14.22%

Blind val users: 13,460 (contacted 28,732 unique items in 3 days)

🏠 Domain Explanation

Real estate (BĐS) has extremely high item diversity: 28,732 items for 13,460 users in 3 days. Each (city, cat) segment holds thousands of items, but the top-10 covers only a very small fraction. Unlike e-commerce, where the top-10 popular products account for 30%+ of purchases, real-estate users search deep into the long tail (every home is unique).
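
The "perfect segment knowledge" ceiling can be estimated with a sketch like the one below: give every user the top-k items of their true (city, cat) segment and average the per-user recall. Recall is taken here as hits divided by the number of ground-truth items, which is an assumption about the metric; function names are illustrative.

```python
from collections import Counter, defaultdict


def segpop_top_k(train_contacts, k):
    """Top-k most contacted items per (city, cat) segment."""
    counts = defaultdict(Counter)
    for city, cat, item in train_contacts:
        counts[(city, cat)][item] += 1
    return {seg: [item for item, _ in c.most_common(k)] for seg, c in counts.items()}


def mean_recall_at_k(users, top_k):
    """users: list of (true_segment, set_of_ground_truth_items)."""
    recalls = []
    for segment, gt_items in users:
        recs = set(top_k.get(segment, []))
        recalls.append(len(recs & gt_items) / len(gt_items))
    return sum(recalls) / len(recalls)


seg = ("HCM", 1020)
train = [(*seg, "a")] * 3 + [(*seg, "b")] * 2 + [(*seg, "c")]
top2 = segpop_top_k(train, k=2)
ceiling = mean_recall_at_k([(seg, {"a", "c"}), (seg, {"c"})], top2)
```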

💡 Strategy Implication

CRITICAL: Popularity-based cold-start CANNOT solve the problem alone.
Even with perfect segment knowledge, ceiling is ~1.6% Recall@10.
Top teams reaching 0.32 MUST use a fundamentally different approach:
  - Content-based matching (listing features → user intent)
  - OR they have access to more user signals we're missing
  - OR the metric is computed differently than we assume
Focus should shift to WARM USER RERANKING as primary lever.

[INS-064] — Blind Users Contact Fresh Items (44% ≤7d) and Prefer 1050 (Dự án)

📊 Data Evidence

Blind user contact item age distribution:
  ≤ 1 day:  11.2%
  ≤ 3 days: 27.5%
  ≤ 7 days: 43.9%
  ≤14 days: 59.1%
  ≤30 days: 75.1%

Blind user category distribution:
  1050 (Dự án):   39.6% ← #1 (vs warm users where 1020 dominates)
  1020 (Căn hộ):   30.5%
  1010 (Phòng trọ): 15.9%
  1040 (Đất nền):    7.8%
  1030 (Nhà ở):      6.2%

Blind user city distribution:
  Tp Hồ Chí Minh: 73.8%
  Đà Nẵng:          6.5%
  Hà Nội:           6.4%

🏠 Domain Explanation

Blind users (no training history) are likely NEW users exploring the platform. They disproportionately contact 1050 (Dự án/new projects) listings because these are heavily marketed: billboard ads, Google Ads, and social media campaigns drive new users to specific projects. Fresh items dominate because new users arrive via marketing of newly launched developments.

💡 Strategy Implication

1. SegPop for blind users should overweight 1050 (Dự án) category
   Current hash allocation doesn't reflect this 40% preference
2. Fresh items (≤7d) should be prioritized over historically popular items
3. Consider building a "new user" SegPop variant that:
   - Weights items by (recency × segment_contact_volume)
   - Allocates 4/10 slots to 1050, 3/10 to 1020, 2/10 to 1010, 1/10 to 1040
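
A minimal sketch of the "new user" SegPop variant in point 3. The slot allocation follows the 4/3/2/1 split above; the recency term (a 7-day half-life decay) and the field names are assumptions, since the source only specifies recency × segment_contact_volume.

```python
# Slots per category out of 10, as proposed above.
SLOTS = {1050: 4, 1020: 3, 1010: 2, 1040: 1}


def new_user_top10(items, half_life_days=7.0):
    """items: dicts with item_id, category, age_days, segment_contacts."""
    def score(it):
        # Assumed recency form: weight halves every `half_life_days`.
        recency = 0.5 ** (it["age_days"] / half_life_days)
        return recency * it["segment_contacts"]

    picked = []
    for cat, n_slots in SLOTS.items():
        in_cat = [it for it in items if it["category"] == cat]
        ranked = sorted(in_cat, key=score, reverse=True)
        picked.extend(it["item_id"] for it in ranked[:n_slots])
    return picked


items = [
    {"item_id": f"{cat}-{age}", "category": cat, "age_days": age, "segment_contacts": 10}
    for cat in SLOTS
    for age in range(5)
]
top10 = new_user_top10(items)
```

The output is grouped by category rather than interleaved; how the ten slots should be ordered by rank is left open here.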

[INS-065] — Val Distribution ≠ Test Distribution (76.8% warm vs 36%)

📊 Data Evidence

Val GT users: 57,907 (classified by pre-split data)
  Warm (contact history):     44,447 (76.8%)
  Cold+signal (login/PCI, no contacts): 2,735 (4.7%)
  Truly blind:                10,725 (18.5%)

Test users: 161,568 (from INS-049)
  Login events:               58,153 (36.0%)
  Non-login only:             12,367 (7.7%)
  Truly blind:                91,048 (56.4%)

🏠 Domain Explanation

Val users selected by having val-period contacts are biased toward active users. Test set includes ALL registered users, many of whom never engaged. Any offline eval using val GT overweights warm users relative to test.
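
One way to correct for this bias is to reweight per-segment validation recall by the test-set segment shares. The sketch below is an assumption about how such a "simulated LB" could be computed, not a confirmed description of the eval script; in particular, mapping "Login events" to warm and "Non-login only" to cold-with-signal is a simplification. The recall inputs are the per-segment values reported under INS-070.

```python
# Test-user counts per segment, from the table above.
TEST_COUNTS = {"warm": 58_153, "cold_signal": 12_367, "blind": 91_048}


def simulated_lb(segment_recall, segment_counts=TEST_COUNTS):
    """Average per-segment recall, weighted by test-set segment share."""
    total = sum(segment_counts.values())
    return sum(segment_recall[s] * n / total for s, n in segment_counts.items())


# Per-segment Recall@10 from the INS-070 "after snapshot fallback" eval.
recall = {"warm": 0.0633, "cold_signal": 0.0517, "blind": 0.0011}
lb = simulated_lb(recall)
print(f"simulated LB ≈ {lb:.4f}")
```

With these inputs the weighted average comes out near 0.027, far below the warm-only recall, which illustrates how much the blind majority pulls the test score down.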

💡 Strategy Implication


[INS-066] — Model Leak: ALS/SegPop Trained on Full Data Inflate Eval

📊 Data Evidence

Truly blind Recall@10 = 0.1654 (model leak present)
INS-063 ceiling:       0.0158 (clean SegPop, same users)
Inflation factor: ~10x

Warm Recall@10 = 0.0712 (model leak present)
Expected clean:  ~0.06 (estimated, matches v10 warm decomposition)

🏠 Domain Explanation

SegPop was fitted on contacts INCLUDING the 3-day val period. Items popular during val period are perfectly ranked for val users. ALS embeddings similarly encode val-period user-item interactions. This creates circular evaluation: model "predicts" data it was trained on.

💡 Strategy Implication

CRITICAL: Must retrain ALS + SegPop on contacts.filter(last_date <= split_date)
before any trustworthy offline eval. Current absolute numbers are MEANINGLESS.
Relative comparisons (A vs B with same leak) may still be directionally valid.

🎯 Actionable Next Steps

  1. Add --retrain_clean to evaluate_aligned.py → retrain SegPop + ALS on train-only
  2. Re-run eval on clean models to establish TRUE baseline
  3. Then ablate: cascade vs hybrid, PCI prefs vs no prefs
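
The train-only filter behind step 1 can be sketched as follows, assuming each contact record has a `last_date` field as in the filter expression above. The function name `split_clean` and the example dates are hypothetical.

```python
from datetime import date


def split_clean(contacts, split_date):
    """Split contacts into train (<= split_date) and val (> split_date)."""
    train = [c for c in contacts if c["last_date"] <= split_date]
    val = [c for c in contacts if c["last_date"] > split_date]
    return train, val


contacts = [
    {"user_id": "u1", "item_id": "i1", "last_date": date(2026, 4, 5)},
    {"user_id": "u1", "item_id": "i2", "last_date": date(2026, 4, 6)},
    {"user_id": "u2", "item_id": "i3", "last_date": date(2026, 4, 8)},  # val period
]
train, val = split_clean(contacts, split_date=date(2026, 4, 6))

# Guard against leakage: nothing fitted on `train` may see val-period rows.
assert all(c["last_date"] <= date(2026, 4, 6) for c in train)
```

SegPop and ALS would then both be fitted on `train` only, with `val` reserved for ground truth.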

[INS-067] — PCI Prefs Provide 30x Recall Uplift for Cold Users (CLEAN EVAL)

📊 Data Evidence

Split-clean eval (ALS+SegPop retrained on data <= split_date):
  Cold + prefs (PCI/PV): Recall@10 = 0.0612 (n=715)  ← HIGHEST in eval
  Cold (no prefs):       Recall@10 = 0.0020 (n=55)
  Uplift: 30.6x

Prefs breakdown:
  Contact-based prefs: 3,600 (warm users only)
  PCI prefs (split-clean): 43
  Pageview prefs: 672
  Total with prefs: 4,315/10,000

🏠 Domain Explanation

Cold users with SIGNAL (pageviews or PCI leads but no contacts) can be effectively served by the cascade recommender when we extract their city+category preference. IntentRecommender matches them to fresh listings in their preferred segment. Without prefs, they fall back to global SegPop which has ~0 recall.

💡 Strategy Implication

CRITICAL: Expanding PCI coverage is the highest-ROI action:
- INS-059 shows 10,654 blind TEST users have PCI data
- Currently only 43 val cold users matched PCI prefs (small sample)
- Each blind user converted to cold-with-prefs could gain ~0.06 Recall@10
- 10,654 × 0.06 / 161,568 = +0.004 total LB score from PCI alone

[INS-068] — ALS Extremely Sensitive to Most Recent Contacts (5.6x Drop)

📊 Data Evidence

Production ALS (ALL contacts including val 3d):
  Warm Recall@10 ≈ 0.10 (implied from v14 LB=0.0344)
  Pairs: 13,020,004

Clean ALS (contacts <= split_date only):
  Warm Recall@10 = 0.0179
  Pairs: 12,737,124 (only 2.2% fewer)

Recall drop: 5.6x from removing just 2.2% of most recent data

🏠 Domain Explanation

The real-estate (BĐS) market moves fast, and the most recent contacts capture current user intent. A user's contacts from 3 months ago may represent a completely different life situation (already bought, changed city, etc.). The 3-day val period contacts are so predictive because they're the MOST recent signal. Removing them forces ALS to extrapolate from older, less relevant interactions.

💡 Strategy Implication

1. Time-weighted ALS: weight recent contacts exponentially higher
   Current: equal weight. Proposed: weight = exp(-days_ago / half_life)
2. For production inference: ALWAYS train on ALL available data up to current date
   The time-split eval artificially handicaps ALS by removing the most valuable signal
3. For offline eval: accept that clean eval underestimates production recall
   True production warm recall ≈ 0.10, not 0.0179
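
A sketch of the proposed time weighting. Note that with the formula as written, `exp(-days_ago / half_life)`, the weight at `days_ago == half_life` is e⁻¹ ≈ 0.37 rather than 0.5, so the parameter behaves as a time constant. If a true half-life is wanted, a base-0.5 form is one alternative. The 30-day value below is purely illustrative.

```python
import math


def decay_weight_exp(days_ago, tau):
    """Formula as proposed above; `tau` acts as an e-folding time constant."""
    return math.exp(-days_ago / tau)


def decay_weight_half_life(days_ago, half_life):
    """Alternative form: weight is exactly 0.5 when days_ago == half_life."""
    return 0.5 ** (days_ago / half_life)


for d in (0, 30, 90):
    print(d, round(decay_weight_exp(d, 30), 3), round(decay_weight_half_life(d, 30), 3))
```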

[INS-069] — LightGBM Overfits to Warm Features

📊 Data Evidence

In leak-free, aligned offline evaluation:

Cascade-Direct (k=10):
  Warm Recall@10 = 0.0285
  Cold-with-signal Recall@10 = 0.0528

Hybrid Mode (Cascade k=200 -> LightGBM reranker):
  Warm Recall@10 = 0.0668 (+134.4% relative gain)
  Cold-with-signal Recall@10 = 0.0127 (-75.9% relative loss)

🏠 Domain Explanation

A single LightGBM ranking model trained on all data learns to heavily rely on rich user behavior features (such as historical contact rates, active collaborative filtering scores, and total views). For warm users, these features are highly predictive. However, cold-start users (login/PCI signal but no contacts) have sparse/missing values for these behavioral features. The model, trained almost exclusively on warm patterns, interprets the absence of behavioral signals as a negative indicator, penalizing cold candidates. This forces relevant cold listing recommendations to the bottom of the list.

💡 Strategy Implication

1. Deploy a Segmented Inference Policy:
   - For WARM users: Route through Cascade (k=200) -> LightGBM Reranker.
   - For COLD/BLIND users: Route directly from Cascade (k=10) (no LightGBM reranker), or route through a specialized cold-start reranker.
2. A single unified ranking pipeline underperforms here because user state distributions (sparse vs dense features) are highly skewed: the reranker gains 134% on warm users but loses 76% on cold-with-signal users.
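
The segmented inference policy can be sketched as a simple router. The `cascade` and `reranker` callables and their signatures are hypothetical stand-ins for the real components.

```python
def recommend(user_id, segment, cascade, reranker):
    """Warm users go through the reranker; everyone else gets cascade top-10."""
    if segment == "warm":
        candidates = cascade(user_id, k=200)
        return reranker(user_id, candidates)[:10]
    # Cold and blind users skip the reranker entirely.
    return cascade(user_id, k=10)


# Stand-in components for demonstration only.
def fake_cascade(user_id, k):
    return [f"item{i}" for i in range(k)]


def fake_reranker(user_id, candidates):
    return list(reversed(candidates))


warm_recs = recommend("u1", "warm", fake_cascade, fake_reranker)
cold_recs = recommend("u2", "cold", fake_cascade, fake_reranker)
```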

[INS-070] — Snapshot 7-Day Demand Is the Best Current Truly-Blind Fallback

📊 Data Evidence

Targeted EDA on all 10,725 truly-blind validation users compared no-preference fallback strategies:

global_score7 (contacts_7d*20 + views_7d): Recall@10 = 0.001190, hits = 63
snap_hcm_prop_4_3_2_1:                    Recall@10 = 0.000660, hits = 43
contact_weighted_segments:                 Recall@10 = 0.000593, hits = 43
snap_weighted_segments:                    Recall@10 = 0.000575, hits = 35
global_score7_fresh:                       Recall@10 = 0.000538, hits = 27
global_score1_fresh:                       Recall@10 = 0.000510, hits = 21
global_fresh_only:                         Recall@10 = 0.000000

Full split-clean aligned eval after deploying snapshot fallback:

Before snapshot fallback:
  Simulated LB = 0.0271
  Truly blind = 0.0001

After snapshot fallback:
  Simulated LB = 0.0274
  Warm = 0.0633
  Cold-with-signal = 0.0517
  Truly blind = 0.0011

🏠 Domain Explanation

For users with no contact, no login signal, and no PCI preference, there is no reliable user-side personalization. The best available signal is item-side market demand from recent snapshots. Pure posted_date freshness is not enough: users contact listings that are both recent and demand-proven, not merely new.

💡 Strategy Implication

1. Use snapshot last-7-day demand as the default no-preference blind fallback.
2. Do not use pure freshness as a blind strategy.
3. Hash/segment diversity can be used for rank-1 exposure control, but should not replace the top demand item set.
4. The remaining blind ceiling is low unless a new user-side signal source is found.
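
The `global_score7` ranking from the evidence table can be sketched as below, assuming a snapshot of items with `contacts_7d` and `views_7d` counts. Function and field names other than those two are illustrative. INS-073 records that this fallback was later rejected on the public leaderboard, so the sketch documents the experiment rather than a recommended default.

```python
def global_score7(item):
    """Snapshot demand score: contacts weighted 20x relative to views."""
    return item["contacts_7d"] * 20 + item["views_7d"]


def blind_fallback_top10(snapshot):
    """Top-10 item_ids by 7-day demand, for users with no preference signal."""
    ranked = sorted(snapshot, key=global_score7, reverse=True)
    return [it["item_id"] for it in ranked[:10]]


snapshot = [
    {"item_id": "a", "contacts_7d": 2, "views_7d": 10},  # score 50
    {"item_id": "b", "contacts_7d": 0, "views_7d": 45},  # score 45
    {"item_id": "c", "contacts_7d": 3, "views_7d": 0},   # score 60
]
fallback = blind_fallback_top10(snapshot)
```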

[INS-071] — 4,215 Truly-Blind Test Users Have Non-Login Pageviews with Extractable Preferences

📊 Data Evidence

Total test users:           161,568
Currently truly blind:       94,875 (58.7%)
  - With LOGIN events:           0 (all login users already covered)
  - With NON-LOGIN events:   4,276 (4.5% of blind)
  - With NO events at all:  90,599 (95.5% of blind)

Non-login pageview users (subset of 4,276):
  - Users with pageviews:        4,215
  - All 4,215 have both pref_city AND pref_cat extractable
  - Avg pageviews/user:         12.1 (median: 4)
  - 1,187 users have REAL contacts (view_phone/chat/zalo/sms)

City distribution:
  HCM:       2,983 (70.8%)
  Hà Nội:      300 (7.1%)
  Đà Nẵng:     292 (6.9%)
  Bình Dương:   191 (4.5%)

Category distribution:
  1020 (Căn hộ):   1,647 (39.1%)
  1050 (Dự án):    1,105 (26.2%)
  1010 (Phòng trọ):  775 (18.4%)
  1040 (Đất nền):    401 (9.5%)
  1030 (Nhà ở):      287 (6.8%)

🏠 Domain Explanation

These 4,215 users browsed listings on Chợ Tốt without logging in (device-level sessions). Their user_id is a device/cookie identifier, NOT a logged-in account ID. Per INS-057 and lesson #9, there are two conflicting considerations:

  1. For using these prefs: The user_id IS in test_users.parquet, so Kaggle expects recommendations for them. If Kaggle's GT includes non-login contacts, these users CAN have non-zero recall.
  2. Against using these prefs: If Kaggle's GT only counts login contacts (like our offline eval does), then these device-level user_ids will never have GT contacts → recall contribution = 0 regardless of recommendations.

Key fact: the overlap between login and non-login user_ids is 0. No user_id appears in both login and non-login events, confirming these are fundamentally different identifier types.

⚠️ Reconciliation with INS-057

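INS-057 established that removing is_login from the ENTIRE pipeline (including the ALS contact matrix) dropped the LB score by 59%. However, the proposed action here is DIFFERENT: it only extracts cold-user preferences from non-login pageviews and leaves ALS training and contact_pairs untouched.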

The key question is NOT whether non-login data hurts ALS (it does), but whether Kaggle evaluates non-login user_ids at all. This requires H-029 verification.

💡 Strategy Implication

1. DO NOT change ALS training or contact_pairs (INS-057/058 lesson stands).
2. ONLY modify _process_cold_user_prefs to also extract preferences from non-login pageviews.
3. This is ZERO-RISK to warm/cold-with-signal users (their flow is untouched).
4. Potential upside: 4,215 users × SegPop city+cat recall ≈ 1.6% (INS-063 ceiling) / 161,568 test users ≈ +0.0004 total
5. But if Kaggle ignores non-login GT, upside = 0.
6. Verify via H-029 before spending a submission attempt.

[INS-072] — v17 Leaderboard Breakthrough: ALS 1024 + Cascade Direct Reached 0.2116 / Top5

📊 Data Evidence

Leaderboard:
  v14 previous best: 0.0344
  v17 current best:  0.2116 (Top5)
  Relative gain:     6.15x

Validated artifact:
  File: outputs/submission_1024.zip
  CSV inside zip: submission.csv
  Rows: 1,615,680
  Users: 161,568
  Columns: ID,user_id,rank,item_id
  Unique items: 62,947
  Rank-1 top item: 9,948 users (<10% rule)
  Zip size: 41.37 MB
  Validator: ALL SUBMISSION RULES PASS

Model artifact:
  ALS factors: 1024
  ALS user_factors: (810,411, 1024)
  ALS item_factors: (696,252, 1024)
  ALS model size: 5.8 GB

🏠 Domain Explanation

The winning shift was not another reranker layer. The public leaderboard rewarded a strong high-capacity collaborative retrieval model trained on all available positive contact data, then served through a direct top-10 cascade. Skipping LightGBM removes the warm/cold distribution overfit from INS-069. Increasing ALS capacity from 256 to 1024 factors greatly improves warm-user ranking, while Recency SegPop and intent fallbacks keep cold/blind users valid.

The uppercase ID column is also essential: .agent/submission_rules.md requires ID,user_id,rank,item_id. Earlier lowercase id validation was wrong for this competition.

💡 Strategy Implication

1. Treat v17 as the new production baseline.
2. Keep config.inference_mode = "cascade" unless a new ablation beats 0.2116 on leaderboard.
3. Do not re-enable unified LightGBM for production without a clean segment-specific proof.
4. Future gains should be ablated against v17, not the old 0.034 baseline.
5. Submission validation must enforce uppercase ID and zip/gz packaging.
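
A sketch of the checks in point 5, based on the validated artifact described above: exact uppercase header, 10 ranked rows per user, and a rank-1 item shown to fewer than 10% of users. The function names, the strict `<` threshold, the duplicate-item check, and the `ID` format in the demo are assumptions; `.agent/submission_rules.md` remains the authority.

```python
import csv
import io
from collections import Counter, defaultdict

EXPECTED_COLUMNS = ["ID", "user_id", "rank", "item_id"]


def validate_submission(csv_text, k=10, max_rank1_share=0.10):
    """Return a list of rule violations (empty list means all checks pass)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"bad header: {reader.fieldnames}"]
    rows_by_user = defaultdict(list)
    for row in reader:
        rows_by_user[row["user_id"]].append((int(row["rank"]), row["item_id"]))
    errors = []
    rank1 = Counter()
    for user, rows in rows_by_user.items():
        if sorted(r for r, _ in rows) != list(range(1, k + 1)):
            errors.append(f"{user}: ranks are not exactly 1..{k}")
        items = [item for _, item in rows]
        if len(set(items)) != len(items):
            errors.append(f"{user}: duplicate item_id")
        rank1.update(item for r, item in rows if r == 1)
    if rank1:
        top_item, count = rank1.most_common(1)[0]
        if count / len(rows_by_user) >= max_rank1_share:
            errors.append(f"rank-1 item {top_item} shown to {count} users")
    return errors


def make_csv(header, n_users=20):
    """Build a small synthetic submission for the demo."""
    lines = [header]
    for u in range(n_users):
        lines.extend(f"{u}_{r},{u},{r},item_{u}_{r}" for r in range(1, 11))
    return "\n".join(lines)


ok_errors = validate_submission(make_csv("ID,user_id,rank,item_id"))
bad_errors = validate_submission(make_csv("id,user_id,rank,item_id"))
```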

[INS-073] — Snapshot Blind Fallback Was Leaderboard-Rejected

📊 Data Evidence

Submission: outputs/submission_snapshot_blind.zip
Public LB: 0.0003

Offline context before submission:
  Snapshot demand fallback looked useful in aligned blind eval:
  blind recall improved from ~0.0001 to ~0.0011.

Leaderboard reality:
  0.0003 is near the original broken baseline range and far below v17 0.2116.

🏠 Domain Explanation

The snapshot demand signal was item-side market activity, not user-side intent. For truly-blind users, offline validation can over-reward items that were recently active in the validation window, while public LB appears to reward the high-capacity ALS/cascade list structure much more strongly. This means blind-fallback experiments are especially vulnerable to offline/LB mismatch.

💡 Strategy Implication

1. Do not use snapshot fallback in final production submissions.
2. Keep snapshot code gated away from cascade production unless a tiny LB ablation proves benefit.
3. Treat blind fallback as low-ceiling; protect warm ALS quality first.

[INS-074] — ALS1536 + Time Decay Did Not Beat ALS1024 Baseline

📊 Data Evidence

v17 baseline:
  File: outputs/submission_1024.zip
  Public LB: 0.2116
  ALS factors: 1024
  Mode: cascade direct

v18 experiment:
  File: outputs/submission_1536.zip
  Public LB: 0.2108
  ALS factors: 1536
  Added: time-decay ALS, pci_merge_mode=test_only, non-login cold prefs, recent_cc=5

Delta:
  0.2108 - 0.2116 = -0.0008

🏠 Domain Explanation

More ALS capacity and recency weighting did not automatically improve the leaderboard score. The v18 branch was close but still worse than v17, so the current evidence says ALS1024 is the safest production capacity. The degradation is small and confounded by multiple simultaneous changes, but it is enough to reject v18 as final.

💡 Strategy Implication

1. Keep outputs/submission_1024.zip as the protected best artifact.
2. Do not assume larger ALS factors are better beyond 1024.
3. Any future factor/time-decay work must be isolated one variable at a time.

[INS-075] — Slot-Level Blending Damaged v17

📊 Data Evidence

v17 baseline:
  outputs/submission_1024.zip = 0.2116

v19 conservative blend:
  outputs/submission_blend_v17_9_v18_1.zip = 0.1974
  Policy: keep v17 ranks 1-9, replace rank10 with first unique v18 item.

Delta:
  0.1974 - 0.2116 = -0.0142

🏠 Domain Explanation

Even the tenth slot of v17 carries meaningful signal. Replacing only one item per user with v18 introduced enough noise to lose 6.7% relative score. This also suggests the public metric is sensitive to the exact v17 cascade ordering, not just rank-1 items.
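
For reference, the v19 policy can be written out as below, assuming each submission is a per-user list of 10 item_ids. The function name and the behavior when v18 offers no unused item (keep v17 unchanged) are assumptions.

```python
def blend_rank10(v17_top10, v18_top10):
    """Keep v17 ranks 1-9; fill rank 10 with the first v18 item not already used."""
    head = list(v17_top10[:9])
    seen = set(head)
    for item in v18_top10:
        if item not in seen:
            return head + [item]
    # Assumed fallback: no unused v18 item, so leave the v17 list as it is.
    return list(v17_top10)


v17 = list("abcdefghij")
v18 = ["a", "b", "x", "y", "c", "d", "e", "f", "g", "h"]
blended = blend_rank10(v17, v18)
```

Only rank 10 changes under this policy, which is why the 6.7% relative loss points to real signal in the v17 tail slot.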

💡 Strategy Implication

1. Do not blend by mechanically replacing tail slots.
2. Treat v17 full top-10 as an atomic strong baseline.
3. Ensemble only if a learned or segment-specific policy proves it beats v17 before submission.