📊 INSIGHTS REGISTRY: REGISTER OF ALL INSIGHTS

Purpose: Record EVERY insight discovered so that this file becomes a cumulative knowledge base. Rule: Every insight must have a unique ID, supporting evidence, and a feature suggestion. When to read this file: Before starting a new round, to avoid duplicating effort.

📈 DASHBOARD

Category	Insights	Breakthrough?
Data Quality	1	-
Data Scale	1	-
Marketplace Structure	1	-
Algorithm Architecture	4	✅ GAME-CHANGING
Leaderboard Diagnosis	4	✅ ROOT CAUSE
Experiment Failures (Round 21)	5	⚠️ LESSONS
Experiment Failures (Round 22)	2	🔴 CRITICAL LESSONS
PCI Data Discovery (Round 19)	2	🔴🔴🔴 BREAKTHROUGH
Cold-Start Ceiling (Round 23)	2	🔴🔴🔴 GAME-CHANGING
Eval Infrastructure (Round 24)	5	🔴🔴🔴 CRITICAL
Leaderboard Breakthrough (Round 25)	1	✅ GAME-CHANGING
Leaderboard Postmortem (Round 26)	3	🔴 GUARDRAILS
TOTAL	31	-

🏷️ INSIGHT INDEX (QUICK REFERENCE)

ID	Round	Category	Headline	Impact	Feature Idea?
INS-001	01	Data Quality	Systematic nullity in dim_listing by property type	🟡 MED	is_apartment flag
INS-002	01	Data Scale	fact_user_events = 161.7M rows, 500 files	🔴 HIGH	Must pre-aggregate
INS-003	01	Marketplace	Agent sellers dominate 83.5% of listings	🟡 MED	Fairness metric input
INS-045	19	Algorithm	Budget-based sequential union beats hard cascade: Recall@200 0.27→0.31	🔴🔴🔴	Budget caps per source
INS-046	19	Algorithm	Round-robin interleave HURTS recall vs sequential priority	🔴🔴	Keep sequential
INS-047	19	Algorithm	als_view (pageview CF) dilutes candidate pool — disable improves Recall@200	🔴🔴	Set als_view budget=0
INS-048	20	Leaderboard	SegPop city name bug: "Hồ Chí Minh" ≠ "Tp Hồ Chí Minh" → ~96k users got identical items	🔴🔴🔴	Fix key names
INS-049	20	Leaderboard	56.4% test users have ZERO training events — completely blind	🔴🔴🔴	Hash-based segment assignment
INS-050	20	Leaderboard	Offline eval doesn't predict leaderboard: best=0.006 vs top1=0.32 (53x gap)	🔴🔴🔴	Need test-aligned eval
INS-051	20	Leaderboard	50.3% contacts on items posted ≤7 days → recency > popularity	🔴🔴	Recency-weighted SegPop
INS-052	21	Experiment	LightGBM reranker trained on EnsembleGen ≠ CascadeGen distribution	🔴🔴🔴	Must retrain
INS-053	21	Engineering	Training pipeline overwrites segpop.pkl with alltime version	🔴🔴🔴	Backup/restore
INS-054	21	Experiment	Offset diversity for cold users HURTS: top items are most relevant	🔴🔴	Don't offset
INS-055	21	Analysis	Warm users already at ~0.10 recall; cold users (66%) ≈ 0	🔴🔴🔴	Cold=primary lever
INS-056	21	Experiment	PV-first cascade ≈ ALS-first (0.0999 vs 0.1009)	🟡	Keep ALS-first
INS-057	22	Experiment	Removing is_login filter HURTS: 0.034→0.014 (-59%). Non-login = noise	🔴🔴🔴	KEEP is_login filter
INS-058	22	Analysis	ALS matrix density is key: 16.1→7.5 contacts/user killed embeddings	🔴🔴🔴	Density > size
INS-059	19	Data Source	10,654 blind test users have PCI data (avg 16.3 items) — convert blind→warm	🔴🔴🔴	PCI prefs for blind
INS-060	19	Data Source	644,732 NEW lead pairs from PCI not in ALS training data	🔴🔴🔴	Merge PCI into ALS
INS-061	19	Architecture	4-stage pipeline (Cascade→Feature→LightGBM→Reranker) code EXISTS but unused since v11 bug	🔴🔴🔴	Retrain LightGBM on cascade
INS-063	23	Cold-Start	SegPop ceiling ~1.6% Recall@10 even with PERFECT city+cat knowledge	🔴🔴🔴	Popularity alone cannot solve cold-start
INS-064	23	Cold-Start	44% blind contacts on items ≤7d old; 1050 (Dự án) = #1 category for blind users	🔴🔴🔴	Freshness-first SegPop, category reweighting
INS-065	24	Eval	Val: 76.8% warm / 4.7% cold / 18.5% blind — Test: 36% / 7.7% / 56.4%. Distribution mismatch	🔴🔴🔴	Must simulate test ratio
INS-066	24	Eval	ALS/SegPop trained on full data leaks val contacts → blind recall inflated 10x (0.165 vs 0.016 ceiling)	🔴🔴🔴	Must retrain models on split-clean data
INS-067	24	Eval	Cold+PCI prefs = 0.0612 recall vs 0.0020 without (30x uplift). PCI prefs are critical for cold users	🔴🔴🔴	Expand PCI coverage to more cold/blind test users
INS-068	24	Eval	ALS recall drops 5.6x when 3d val contacts removed (0.10→0.018). Most recent contacts are disproportionately important	🔴🔴🔴	Time-weight ALS toward recent contacts
INS-069	24	Model Architecture	LightGBM ranker overfits to warm features, severely destroying cold-start recall	🔴🔴🔴	Implement Segmented Inference Policy
INS-071	25	Cold-Start Signal	4,215 truly-blind test users have non-login pageviews with extractable city+cat prefs. But INS-057 warns non-login = device-level IDs	🟡🟡	H-029: verify if non-login pref injection helps or is irrelevant
INS-072	25	Leaderboard	v17 reached 0.2116 LB / top5: ALS 1024 + full-data cascade-direct + uppercase ID submission	🔴🔴🔴	Keep cascade mode as production baseline
INS-073	26	Leaderboard Failure	Snapshot blind fallback scored 0.0003 on LB despite offline promise	🔴🔴🔴	Never use snapshot fallback for final unless LB-ablation proves it
INS-074	26	Leaderboard Failure	ALS1536 + time-decay + test-only prefs scored 0.2108, slightly below v17 0.2116	🔴🔴	ALS1024 remains production sweet spot
INS-075	26	Leaderboard Failure	v17 top9 + v18 slot10 blend scored 0.1974; even rank10 replacement hurt badly	🔴🔴🔴	Do not slot-blend v17 unless full-list eval proves gain

📖 DETAILED INSIGHTS

[INS-001] — Systematic Nullity in dim_listing

Discovered in: Round 01
Category: Data Quality
Impact Level: 🟡 MEDIUM

📊 Data Evidence

project_id: 88.71% null (2,756,219 / 3,107,114)
direction: 82.15% null
floors: 70.52% null
furnishing: 54.81% null
house_type: 51.47% null
bathrooms: 44.85% null
bedrooms: 31.78% null

🏠 Domain Explanation

Land plots (Đất nền, 1040) and houses (Nhà ở, 1030) naturally have no project_id, floors, or furnishing. The nullity is not a data error; it reflects the property type.

💡 Feature Engineering Suggestion

Feature name: is_apartment
Formula: project_id.is_not_null()
Expected impact: Strong signal for category classification. LightGBM handles NaN natively.

🎯 Follow-up

H-001: Verify null rate by category (Round 02)
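A minimal sketch of the is_apartment flag and the H-001 null-rate check, using plain dicts as stand-ins for dim_listing rows. The row layout and sample values are illustrative assumptions; in the real pipeline the same test would be the Polars expression given in the formula above.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Optional


def is_apartment(project_id: Optional[Any]) -> int:
    """1 when the listing belongs to a project (project_id present), else 0."""
    return int(project_id is not None)


def null_rate_by_category(rows: Iterable[dict], column: str) -> Dict[int, float]:
    """Share of rows per category where `column` is missing."""
    total: Dict[int, int] = defaultdict(int)
    nulls: Dict[int, int] = defaultdict(int)
    for row in rows:
        total[row["category"]] += 1
        if row.get(column) is None:
            nulls[row["category"]] += 1
    return {cat: nulls[cat] / total[cat] for cat in total}


# Hypothetical rows standing in for dim_listing (illustrative, not real data).
sample_rows = [
    {"item_id": "a1", "category": 1050, "project_id": 4521},
    {"item_id": "b2", "category": 1040, "project_id": None},
    {"item_id": "c3", "category": 1040, "project_id": None},
    {"item_id": "d4", "category": 1030, "project_id": None},
]

flags = {r["item_id"]: is_apartment(r["project_id"]) for r in sample_rows}
rates = null_rate_by_category(sample_rows, "project_id")
```

If the domain explanation holds, the real null rate for 1040 and 1030 should be close to 1.0, as in the toy rows.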

[INS-002] — Massive Scale of Clickstream Data

Discovered in: Round 01
Category: Data Scale
Impact Level: 🔴 HIGH

📊 Data Evidence

fact_user_events: 161,731,336 rows, 500 files
fact_listing_snapshot: 19,762,167 rows, 62 files
fact_post_contact_interactions: 25,486,445 rows, 147 files
dim_listing: 3,107,114 rows, 40 files

💡 Feature Engineering Suggestion

Strategy: Pre-aggregate fact_user_events to user-level and item-level before joins.
Never operate at raw event level in feature engineering.
Use Polars LazyFrame + column pushdown + date filters.
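The pre-aggregation idea can be sketched in plain Python as a single streaming pass that reduces raw events to user-level and item-level counts. This is a stdlib stand-in for the Polars LazyFrame pipeline, not the project's code; the (user_id, item_id, event_type) tuple layout is an assumption.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, str, str]  # (user_id, item_id, event_type), assumed layout


def pre_aggregate(events: Iterable[Event]) -> Tuple[Counter, Counter]:
    """One pass over raw events; returns user-level and item-level counts."""
    user_level: Counter = Counter()
    item_level: Counter = Counter()
    for user_id, item_id, event_type in events:
        user_level[(user_id, event_type)] += 1
        item_level[(item_id, event_type)] += 1
    return user_level, item_level


# Toy events; the real table has 161,731,336 rows.
sample_events = [
    ("u1", "i1", "pageview"),
    ("u1", "i1", "pageview"),
    ("u1", "i2", "contact"),
    ("u2", "i1", "pageview"),
]
user_level, item_level = pre_aggregate(sample_events)
```

Downstream joins then operate on the two small aggregates instead of the raw event table.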

[INS-003] — Agent Seller Dominance

Discovered in: Round 01
Category: Marketplace Structure
Impact Level: 🟡 MEDIUM

📊 Data Evidence

agent: 2,593,063 (83.5%)
private: 514,051 (16.5%)

🏠 Domain Explanation

A characteristic of Vietnamese real estate: agents (Môi giới) account for most listings because private individuals rarely know how to post professional listings. The fairness metric must adjust exposure for private sellers.

💡 Feature Engineering Suggestion

Feature name: seller_type_encoded (binary)
Use in Fairness metric: Target ratio should reflect natural distribution, not 50/50.

[INS-019] — Fairness Gap: Agent/Private Ratio Severely Miscalibrated

Discovered in: Round 09
Category: Marketplace Health / Fairness
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Submission:    agent=27.3%,  private=72.7%
GT contacts:   agent=52.0%,  private=48.0%
Gap:           −24.7 pp (agents heavily under-served)

🏠 Domain Explanation

Agents account for 83.5% of dim_listing but collectively receive only 52% of contacts, because private sellers earn far more leads per listing (48.0% of contacts from 16.5% of listings, about 3x their listing share). The system is pushing too many private sellers into the top 10, so agents react negatively, which hurts Chợ Tốt's B2B revenue.

💡 Feature Engineering Suggestion

Feature: seller_type_fairness_correction
Formula: if agent_ratio_current < 0.52: boost agent-seller items in reranker
Impact: Calibrate HealthMetrics.gt_dist with agent_ratio=0.520 (from real data)
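A small sketch of the gap computation behind the formula above. The function names are hypothetical; the 0.520 target and the 27.3% submission share come from the data evidence in this entry.

```python
from typing import Sequence

GT_AGENT_RATIO = 0.520  # agent share of GT contacts reported in this entry


def agent_ratio(seller_types: Sequence[str]) -> float:
    """Share of recommendation slots occupied by agent listings."""
    return sum(1 for s in seller_types if s == "agent") / len(seller_types)


def agent_gap_pp(current_ratio: float, target_ratio: float = GT_AGENT_RATIO) -> float:
    """Gap in percentage points; negative means agents are under-served."""
    return (current_ratio - target_ratio) * 100


def needs_agent_boost(current_ratio: float, target_ratio: float = GT_AGENT_RATIO) -> bool:
    """Trigger condition from the formula: agent_ratio_current < 0.52."""
    return current_ratio < target_ratio


# Toy slot list reproducing the reported submission mix (27.3% agent).
slots = ["agent"] * 273 + ["private"] * 727
ratio = agent_ratio(slots)
gap = agent_gap_pp(ratio)
```

How strongly the reranker should boost agent items once the condition fires is not specified in the source and is left open here.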

🎯 Business Impact

Agents pay for premium placement. Under-serving them means churn risk and lower B2B revenue.

[INS-020] — Category Imbalance: 1050 Over-Served, 1010 Under-Served

Discovered in: Round 09
Category: Marketplace Health / Diversity
Impact Level: 🟡 MEDIUM

📊 Data Evidence

Category | Submission | GT contacts | Gap
1010     |    11.3%   |    15.6%    | -4.3pp (under)
1020     |    41.2%   |    44.6%    | -3.4pp (under)
1030     |     8.7%   |     6.5%    | +2.3pp (over)
1050     |    29.0%   |    23.1%    | +5.9pp (OVER-SERVE)

💡 Feature Engineering Suggestion

Feature: category_exposure_correction
Formula: KL divergence from GT category distribution → boost under-served categories
Used in: MultiObjectiveReranker fairness term γ
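One possible reading of the formula above, as a sketch: measure KL(submission || GT) and derive a per-category multiplier gt_share / submission_share. The direction of the divergence and the multiplier form are assumptions. The table omits 1040, so its share is taken as the remainder to 100% in both distributions, which is also an assumption.

```python
import math
from typing import Dict

# Shares from the table above; the 1040 values are remainders to 1.0 (assumption).
GT_DIST = {"1010": 0.156, "1020": 0.446, "1030": 0.065, "1040": 0.102, "1050": 0.231}
SUBMISSION_DIST = {"1010": 0.113, "1020": 0.412, "1030": 0.087, "1040": 0.098, "1050": 0.290}


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against zero shares."""
    return sum(
        pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
        for k, pv in p.items()
        if pv > 0
    )


def exposure_correction(submission: Dict[str, float], gt: Dict[str, float]) -> Dict[str, float]:
    """Multiplier > 1 means the category is under-served relative to GT."""
    return {k: gt[k] / submission[k] for k in gt if submission.get(k, 0.0) > 0}


kl = kl_divergence(SUBMISSION_DIST, GT_DIST)
boost = exposure_correction(SUBMISSION_DIST, GT_DIST)
```

With these inputs the multiplier is above 1 for 1010 and 1020 (under-served) and below 1 for 1050 (over-served), matching the sign of the gaps in the table.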

[INS-021] — Freshness "Paradox" Debunked (Survivorship Bias)

Discovered in: Round 09 (Debunked in PDF 2 Lifecycle Analysis)
Category: Freshness / Model Calibration
Impact Level: 🔴 HIGH

📊 Data Evidence

Submission  — median listing age: 10 days,  mean: 36 days
GT contacts — median listing age: 97 days,  mean: 106 days

BUT PDF 2 reveals: 69.7% of all contacts happen in the first 7 days.

🏠 Domain Explanation

The 97-day median age for GT contacts is an illusion caused by Survivorship Bias. Bad listings are removed early. Only high-quality listings survive to 90+ days. The true "Golden Moment" is the first 7 days.

💡 Feature Engineering Suggestion

Recommendation: Keep ALS half_life at 7d to capture the 69.7% Golden Moment.
DO NOT raise half-life to 30d as originally hypothesized in R09.
Reranker delta: Maintain freshness weight to boost new items.
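To make the half-life choice concrete, here is a sketch of exponential time decay. The source only names the half_life parameter; that it is applied as 0.5 ** (age / half_life) is an assumption about the ALS weighting.

```python
def decay_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Weight of an interaction that is `age_days` old; halves every half-life."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (age_days / half_life_days)


# Weight of a 30-day-old contact under the two half-lives discussed above.
w_7d = decay_weight(30, half_life_days=7.0)
w_30d = decay_weight(30, half_life_days=30.0)
```

Under a 7-day half-life a 30-day-old contact keeps about 5% of its weight, versus 50% under a 30-day half-life, so the 7d setting concentrates the model on the first-week window.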

[INS-022] — Coverage Extremely Low: 3.71%, Popularity Bias Severe

Discovered in: Round 09
Category: Coverage / Long-tail
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Items recommended: 115,340 / 3,107,114 = 3.71%
Top-1% items: 81.9% of all recommendation slots
96.3% of catalogue: NEVER recommended

🏠 Domain Explanation

A classic feedback loop: popular items → recommended → more views → more contacts → more popular. New sellers never get traction. Marketplace health degrades over time.

💡 Feature Engineering Suggestion

Feature: item_novelty_score = 1 - (popularity_rank / total_items)
Strategy: Add novelty bonus in BurstTrendingRecommender for long-tail items
Target: Raise coverage from 3.71% → 8%+ without sacrificing Recall@10
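A sketch of the coverage figure and the item_novelty_score formula above. The formula does not say how rank is defined; a 1-based rank where 1 is the most recommended item is assumed here.

```python
from typing import Dict


def coverage(n_recommended_items: int, catalogue_size: int) -> float:
    """Share of the catalogue that appears in at least one recommendation."""
    return n_recommended_items / catalogue_size


def novelty_scores(rec_counts: Dict[str, int], total_items: int) -> Dict[str, float]:
    """item_novelty_score = 1 - popularity_rank / total_items (rank 1 = most popular)."""
    ranked = sorted(rec_counts, key=lambda item: (-rec_counts[item], item))
    return {item: 1 - rank / total_items for rank, item in enumerate(ranked, start=1)}


current_coverage = coverage(115_340, 3_107_114)
toy_scores = novelty_scores({"a": 100, "b": 10, "c": 1}, total_items=4)
```

Long-tail items get scores near 1 and could receive the novelty bonus; the size of that bonus is not given in the source.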

[INS-023] — Ground-Truth Distribution Calibrated from Data

Discovered in: Round 09
Category: Model Calibration
Impact Level: 🟡 MEDIUM

📊 Data Evidence

{
  "agent_ratio": 0.520,
  "category_dist": { "1010": 0.156, "1020": 0.446, "1030": 0.065, "1040": 0.102, "1050": 0.231 }
}

Saved to: .cache/gt_dist.json — loaded by HealthMetrics automatically.

💡 Feature Engineering Suggestion

Replace hardcoded values in HealthMetrics (agent_ratio=0.7, category_dist=generic)
with data-driven values. This is now done automatically via gt_dist_path param.

[INS-024] — Reranker Impact Minimal for Cold Users (63% of Base)

Discovered in: Round 09 (post-pipeline analysis)
Category: System Architecture
Impact Level: 🔴 HIGH

📊 Data Evidence

Before reranking: Diversity entropy = 0.6947, Fairness = 0.273
After  reranking: Diversity entropy = 0.6986 (+0.004), Fairness = 0.273 (UNCHANGED)
Root cause: 101,441 cold users (63%) get homogeneous global trending → dominates aggregate

💡 Action Required

To meaningfully improve health metrics across ALL users:
1. Make BurstTrendingRecommender diversity-aware (inject agent items, balance categories)
2. Or: expand cold-start coverage via better ColdStartProfiler (remove require_login constraint)
3. Or: add novelty injection to global trending (force 20% long-tail items)

[INS-025] — 85.5% of GT Items Are COMPLETELY NEW to the User

Discovered in: Diagnostic Analysis (post-v4 submission)
Category: Model Architecture / Ground Truth Pattern
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

GT pairs (last 3 days): 62,893
Repeat contacts (user contacted before):  7,088 / 62,893 = 11.3%
Previously viewed (pageview before):      9,111 / 62,893 = 14.5%
ANY prior interaction:                    9,130 / 62,893 = 14.5%
COMPLETELY NEW to user:                  53,763 / 62,893 = 85.5%

🏠 Domain Explanation

Real estate differs from e-commerce: users do not "re-buy" items. They keep browsing NEW listings in their area of interest. ALS/CF helps with only 14.5% of GT pairs; the rest must come from segment popularity or content-based matching.

💡 Strategy Implication

CRITICAL: ALS collaborative filtering is a SECONDARY signal, not the PRIMARY one.
PRIMARY signal = popularity within user's preferred (city, category) segment.
This explains why v1-v4 scored 0.006 — they over-relied on CF for 85.5% of GT.

🎯 Action

Shift pipeline from CF-first → segment-popularity-first
ALS only supplements for the 14.5% repeat/viewed items
Implemented in run_submission_v5.py
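The breakdown in the data evidence can be reproduced with set operations over (user_id, item_id) pairs. This is a sketch under the assumption that GT, prior contacts, and prior pageviews are all available as pair sets; the toy pairs are illustrative.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (user_id, item_id)


def gt_novelty_breakdown(
    gt_pairs: Set[Pair], prior_contacts: Set[Pair], prior_views: Set[Pair]
) -> Dict[str, float]:
    """Share of GT pairs the user had already contacted, viewed, or never seen."""
    n = len(gt_pairs)
    repeat = gt_pairs & prior_contacts
    viewed = gt_pairs & prior_views
    any_prior = repeat | viewed
    return {
        "repeat_contact": len(repeat) / n,
        "previously_viewed": len(viewed) / n,
        "any_prior": len(any_prior) / n,
        "completely_new": (n - len(any_prior)) / n,
    }


toy_gt = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i4")}
toy_contacts = {("u1", "i1")}
toy_views = {("u1", "i1"), ("u2", "i3"), ("u9", "i9")}
breakdown = gt_novelty_breakdown(toy_gt, toy_contacts, toy_views)
```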

[INS-026] — 91.9% of GT Items Match User's Preferred City

Discovered in: Diagnostic Analysis (post-v4 submission)
Category: User Behavior / Geographic Signal
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

GT pairs with known user prefs: 53,074
Same city as user preference:     48,775 / 53,074 = 91.9%
Same category as user preference: 38,297 / 53,074 = 72.2%
BOTH city + category match:       36,342 / 53,074 = 68.5%

🏠 Domain Explanation

People searching for real estate almost ALWAYS search within a single city (92%). Category consistency is also high (72%): someone looking for an apartment rarely switches to land plots. This is characteristic of the real-estate domain: buy/rent decisions are location-first.

💡 Strategy Implication

Feature: user_preferred_city (mode of contacted cities) → MUST-HAVE filter
Feature: user_preferred_category (mode of contacted categories) → strong filter
Recommendation cascade: (city+cat+district) → (city+cat) → (city) → (cat) → global
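A sketch of the two features and the fallback cascade above. The segment-popularity store is modelled as a dict keyed by (city, category, district) with None as a wildcard; that layout, the helper names, and the toy district and item IDs are all assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

SegmentKey = Tuple[Optional[str], Optional[int], Optional[str]]  # (city, category, district)


def mode(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Most frequent value, e.g. the user's preferred city; None if no history."""
    return Counter(values).most_common(1)[0][0] if values else None


def cascade_recommend(
    segpop: Dict[SegmentKey, List[str]],
    city: Optional[str],
    category: Optional[int],
    district: Optional[str],
    k: int = 10,
) -> List[str]:
    """Fill k slots from the most specific segment down to the global list."""
    levels: List[SegmentKey] = [
        (city, category, district),
        (city, category, None),
        (city, None, None),
        (None, category, None),
        (None, None, None),
    ]
    picked: List[str] = []
    seen = set()
    for key in levels:
        for item in segpop.get(key, []):
            if item not in seen:
                seen.add(item)
                picked.append(item)
                if len(picked) == k:
                    return picked
    return picked


preferred_city = mode(["Hà Nội", "Hà Nội", "Đà Nẵng"])
toy_segpop = {
    ("Hà Nội", 1020, "Cầu Giấy"): ["x1"],
    ("Hà Nội", 1020, None): ["x1", "x2"],
    (None, None, None): ["g1", "g2"],
}
recs = cascade_recommend(toy_segpop, preferred_city, 1020, "Cầu Giấy", k=3)
```

Each level only tops up the slots that finer levels left empty, so global items appear only when the user's own segments run out.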

[INS-027] — Submission Item Coverage vs GT Coverage Gap

Discovered in: Diagnostic Analysis (post-v4 submission)
Category: Model Evaluation / Debugging
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Submission unique items:    9,290
GT unique items (last 3d): 28,706
Overlap:                    6,211 / 28,706 = 21.6% (only 1 in 5 GT items in submission!)
93.1% of GT users have post_contact history (NOT cold-start problem!)

🏠 Domain Explanation

Extremely heavy popularity bias: we recommend only 9K items to 161K users, while the GT needs 28K items. The submission covers only 21.6% of GT items, so recall is capped at ~0.22 from the start, regardless of ranking quality.

💡 Strategy Implication

MUST diversify item pool: recommend 50K+ unique items across all users
Reduce popularity concentration: top-1% items should be <30% of slots (was 81.9%)
Use finer-grain segments (city+cat+district) to naturally diversify

⭐ TOP BREAKTHROUGH INSIGHTS

ID	Breakthrough	Impact
INS-025	85.5% GT items are NEW → CF is secondary, segment popularity is primary	🔴🔴🔴
INS-026	91.9% city match → location is the dominant filter	🔴🔴🔴
INS-027	Submission covers only 21.6% of GT items → popularity bias kills score	🔴🔴
INS-022	Coverage crisis: 3.71% → need long-tail strategy	🔴🔴
INS-019	Agent fairness gap: 24.7pp → critical for B2B revenue	🔴🔴
INS-021	Freshness paradox debunked (survivorship bias): keep half-life=7d	🔴
INS-024	Reranker ineffective for cold users → need cold trending diversity	🔴

[INS-028] — Funnel Drop-off: 83.9% Soft Intent vs 20.5% Real Lead

Discovered in: PDF 1 (Funnel Analysis)
Category: User Behavior
Impact Level: 🔴 HIGH

📊 Data Evidence

Positive Rate: 83.9%
Real Lead Rate: 20.5%
Median time to soft interact: 20s. Median to Real Lead: 40-67s.

🏠 Domain Explanation

Users save/share passively but hesitate to contact. A real contact takes about 2x to 3x longer to decide (40-67s vs 20s).

💡 Feature Engineering Suggestion

Feature: time_to_contact (proxy for intent). Optimize UI to show price/area/location above the fold.

[INS-029] — Category Intent: Đất nền Highest CR, Nhà ở Lowest

Discovered in: PDF 1 (Funnel Analysis)
Category: Category Performance
Impact Level: 🟡 MEDIUM

📊 Data Evidence

Đất nền (1040) Positive Rate: 87.6%
Nhà ở (1030) Positive Rate: 70.2%
Dự án (1050) Volume High, CR Low (78.4%)

🏠 Domain Explanation

Đất nền buyers have urgency. Dự án browsers are curious but avoid agents. Nhà ở lacks supply/demand.

💡 Feature Engineering Suggestion

Feature: category_urgency_weight. Boost 1040 for fast conversions.

[INS-030] — Listing DNA: Images, Furnishing, and Legal Status

Discovered in: PDF 3 (DNA Analysis)
Category: Listing Quality
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Images: Top 5% listings have >= 8 images.
Furnishing: "Nội thất cao cấp" (premium furnishing) gives 1.63x lift. "Nhà trống" (unfurnished) gives 0.50x.
Legal: "Sổ hồng riêng" (individual title certificate) gives 1.80x lift. "Giấy tờ viết tay" (handwritten papers) gives 0.21x.

🏠 Domain Explanation

High-quality images, premium furnishing, and clear legal status reduce buyer risk and increase confidence to contact.

💡 Feature Engineering Suggestion

Features: has_so_hong_rieng, has_noi_that_cao_cap, images_count >= 8. Strong predictors for LightGBM.

[INS-031] — Geography & Category CR Dynamics

Discovered in: PDF 3 (DNA Analysis)
Category: Marketplace Structure
Impact Level: 🟡 MEDIUM

📊 Data Evidence

Cities: Bình Định/Khánh Hoà (180-220% CR) vs HN/HCM (~160%).
Category: Phòng trọ (1.87x lift) vs Dự án (0.38x lift).

🏠 Domain Explanation

Secondary markets have less supply, making each listing perform better. Dự án (Projects) have long nurture periods, while Phòng trọ converts immediately.

💡 Feature Engineering Suggestion

Feature: category_conversion_weight. Penalize 1050 in short-term predictions.

[INS-032] — The Cold-Start Bloodbath (90.8% Drop-off)

Discovered in: PDF 4 (User Cohorts Analysis)
Category: User Retention
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

New Users = 59.7% of total users.
Retention 30D for New Users = 9.2% (90.8% drop off).
Power Users = 4.1% of total users, but Retention 30D = 89.7%.

🏠 Domain Explanation

New users leave if the first session recommendations do not match their intent. If they do not find relevance immediately, they assume the platform has no supply for them.

💡 Feature Engineering Suggestion

Cold-start fallback strategy MUST focus on the most popular, high-quality segments (Căn hộ, Phòng trọ in HCM/HN) to prevent immediate churn.

[INS-033] — Aha! Moment: 3 Sessions > 1 Contact

Discovered in: PDF 4 (User Cohorts Analysis)
Category: Product Strategy
Impact Level: 🔴 HIGH

📊 Data Evidence

Baseline conversion to Power User: 2.56%
Conversion if user has 1 Contact in first 7 days: 7.85% (3.1x lift)
Conversion if user reaches >= 3 sessions in first 7 days: 19.65% (7.7x lift)

🏠 Domain Explanation

A single contact often means "Good Churn" (user found a room and uninstalled). Reaching 3 sessions means Habit Formation (user is researching, comparing, and treating the platform as a tool).

💡 Feature Engineering Suggestion

[INS-034] — Intent Matching Can Recover 31.9% of Valid Items

Discovered in: Round 15 (Diagnostic 6)
Category: Model Strategy / Content-Based
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Total GT contacts for users with intent: 110,659
GT items present in dim_listing: 2,914 (2.6%)
GT items matching Top 1 Intent (District, Category, Price): 668 (22.9% of active items)
GT items matching Top 3 Intents (District, Category, Price): 931 (31.9% of active items)
GT items matching Top 1 (City, Category): 2,139 (73.4% of active items)

🏠 Domain Explanation

Real-estate inventory turns over extremely fast. 97.4% of the listings users contacted were no longer on the platform at test time. So instead of trying to recommend OLD listings from history (CF), we can extract a demand profile (Intent) from pageview history and match it directly against the NEWEST listings in the same segment (district / category / price band), which captures 31.9% of the still-active GT items.

💡 Strategy Implication

CRITICAL: Intent-Based Recommendation is MANDATORY for cold-start items.
Implement `IntentRecommender` directly targeting `dim_listing`.
Place it high in the cascade hierarchy (Priority 1.5).
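A minimal sketch of the intent-matching idea, not the project's IntentRecommender. The field names (district, category, price, posted_day), the fixed-width price bucket, and the newest-first ordering are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Intent = Tuple[str, int, int]  # (district, category, price_bucket)

PRICE_BUCKET_WIDTH = 500_000_000  # hypothetical bucket width in VND


def to_intent(listing: Dict) -> Intent:
    return (listing["district"], listing["category"], listing["price"] // PRICE_BUCKET_WIDTH)


def top_intents(viewed_listings: List[Dict], n: int = 3) -> List[Intent]:
    """The n most frequent (district, category, price bucket) combos in the view history."""
    counts = Counter(to_intent(listing) for listing in viewed_listings)
    return [intent for intent, _ in counts.most_common(n)]


def match_intents(intents: List[Intent], active_listings: List[Dict], k: int = 10) -> List[str]:
    """Active listings that fall in any wanted intent, newest first."""
    wanted = set(intents)
    hits = [listing for listing in active_listings if to_intent(listing) in wanted]
    hits.sort(key=lambda listing: listing["posted_day"], reverse=True)
    return [listing["item_id"] for listing in hits[:k]]


# Toy data with hypothetical district names and item IDs.
viewed = [
    {"district": "Q7", "category": 1020, "price": 2_100_000_000},
    {"district": "Q7", "category": 1020, "price": 2_300_000_000},
    {"district": "Q2", "category": 1020, "price": 5_000_000_000},
]
active = [
    {"item_id": "n1", "district": "Q7", "category": 1020, "price": 2_200_000_000, "posted_day": 10},
    {"item_id": "n2", "district": "Q7", "category": 1020, "price": 2_400_000_000, "posted_day": 12},
    {"item_id": "n3", "district": "Q7", "category": 1020, "price": 9_000_000_000, "posted_day": 13},
]
intents = top_intents(viewed, n=1)
matches = match_intents(intents, active)
```

Matching at district level rather than ward level is deliberate: the evidence above shows 22.9% to 31.9% of active GT items match at (District, Category, Price).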

[INS-035] — Recent Segment Contacts > Global Popularity

Discovered in: Round 10 (Recall Strategies)
Category: Fallback Strategy
Impact Level: 🟡 MEDIUM

📊 Data Evidence

Global SegPop top-10 Recall: 0.0226
Recent CC (last 7d) within user's city+cat Recall: 0.0522 (2.3x higher)

🏠 Domain Explanation

"Trending now" in a local area is much more relevant than "All-time popular". BĐS is highly temporal; properties popular 3 months ago are irrelevant.

[INS-036] — Pageview Replay is the Strongest Single Predictor

Discovered in: Round 11 & Round 13
Category: Candidate Generation
Impact Level: 🔴 HIGH

📊 Data Evidence

79.9% of City+Category extracted from PVs match the actual contacted City+Category.
Standalone PV(30d) Recall: 0.0618 (Highest among all standalone sources, beating ALS at 0.0412).

💡 Strategy Implication

PV Replay MUST be Priority 1. It represents the user's immediate, explicit intent.

[INS-037] — The 7-Day "Golden Window" for Pageviews

Discovered in: Round 12 (Pageview Optimization)
Category: Feature Tuning
Impact Level: 🔴 HIGH

📊 Data Evidence

30-day PV window gives higher standalone recall (0.1977) than 7-day (0.1813).
BUT full cascade (PV -> CC -> RecentCC) peaks at 0.2480 with the 7-day window, and drops to 0.2350 with the 30-day window.

🏠 Domain Explanation

Old pageviews crowd out high-quality fresh recommendations from fallbacks. A user viewing a property 25 days ago has likely moved on.

[INS-038] — CoView is Noisy; Optimal Cascade Order

Discovered in: Round 13 (Cascade Config)
Category: Architecture
Impact Level: 🔴 HIGH

📊 Data Evidence

CoView standalone Recall: 0.0107 (Weakest).
Cascade with CoView: 0.0807. Cascade without CoView: 0.0844.

💡 Strategy Implication

Optimal ordering by precision: Pageview -> CoContact -> ALS -> RecentCC -> SegPop. Drop CoView.

[INS-039] — Ward-Level Intent Matching is Too Strict

Discovered in: Round 14 (Intent Basic)
Category: Content-Based
Impact Level: 🟡 MEDIUM

📊 Data Evidence

Only 0.1% (57/48842) of GT items perfectly matched the user's (Ward, Price, Cat) intent.

🏠 Domain Explanation

Real estate inventory is too sparse at the ward/commune (Phường/Xã) level. Users are willing to cross ward boundaries within the same district or city.

💡 Strategy Implication

Elevate Intent matching to District level minimum.

[INS-040] — 97.5% of Active Inventory Ignored Due to Glob Bug

Discovered in: Round 16 (Submission Debugging)
Category: Data Pipeline / Engineering
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Original code loaded valid_items via pl.read_parquet(dim_files[0]).
This loaded exactly 77,836 items out of 3,107,114 available active items.
Recall@10 plummeted to 0.0003 on the public leaderboard.

🏠 Domain Explanation

When 97.5% of active inventory is artificially removed from the candidate pool, the IntentRecommender and CascadeCandidateGenerator are forced to recommend stale or irrelevant properties. Real estate relies heavily on the full breadth of active supply to match nuanced user queries.

💡 Strategy Implication

ALWAYS load partitioned parquet files via pl.scan_parquet(dim_files).collect() rather than assuming a single file. Fixed in V6, immediately reviving candidate quality.

[INS-041] — Pageview Replay Trumps Generalized Intent (Priority 1)

Discovered in: Round 16 (Cascade Priority Tuning)
Category: Algorithm Architecture
Impact Level: 🔴 HIGH

📊 Data Evidence

Cascade (Intent -> Pageview): Recall@10 = 0.0531
Cascade (Pageview -> Intent): Recall@10 = 0.1018 (1.9x Lift)

🏠 Domain Explanation

While IntentRecommender (District + Cat + Price) is brilliant for filling gaps and cold-start discovery, it CANNOT beat the explicit, exact-match signal of a user clicking on a specific property yesterday (PageviewReplay).

💡 Strategy Implication

PageviewReplay MUST remain Priority 1. IntentRecommender serves as the ultimate high-quality Fallback (Priority 1.5) to capture the 27% Recall@200 ceiling.

[INS-042] — Non-linear Correlation between Views and Contacts

Discovered in: Round 16 (Adview Correlation Analysis)
Category: Listing Quality / Performance
Impact Level: 🔴 HIGH

📊 Data Evidence

adview_count = 0: Conversion Rate = 0.103
adview_count = 30: Conversion Rate drops to 0.087
adview_count = 150+: Conversion Rate rises back to 0.101
Pearson correlation between adview_count and total_contacts: 0.7571

🏠 Domain Explanation

Listings with very low views but high conversion are often "Hidden Gems" or mispriced properties that get snapped up instantly. Listings with average views (30-50) are typical properties that users browse but hesitate to contact. "Mega-hot" listings (150+ views) are likely highly desirable projects where FOMO drives contact rates back up.

💡 Strategy Implication

The correlation of 0.7571 suggests that views_24h is one of the strongest predictive features for the Reranker. Include views_24h and a non-linear feature such as conversion_rate (contacts_24h / (views_24h + 1)) in LightGBM.
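The conversion_rate feature named above, written out as a sketch. The +1 in the denominator keeps the ratio defined for listings with zero views.

```python
def conversion_rate(contacts_24h: int, views_24h: int) -> float:
    """contacts_24h / (views_24h + 1); the +1 avoids division by zero."""
    return contacts_24h / (views_24h + 1)


cr_unseen = conversion_rate(0, 0)    # listing with no views yet
cr_typical = conversion_rate(3, 29)  # 3 contacts from 29 views
```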

[INS-043] — The "Sticky" Category Phenomenon (75% Loyalty)

Discovered in: Round 17 (Sequential Category Transitions)
Category: User Behavior
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

1010 -> 1010 transition: 71.3%
1020 -> 1020 transition: 76.5%
1050 -> 1050 transition: 87.2%
Average probability of staying in the exact same category across consecutive contacts: 75.11%

🏠 Domain Explanation

Unlike e-commerce where users might buy a phone then buy a case, real estate users are highly fixed in their intent. A user looking for a house (1030) rarely switches to renting a room (1010). The 87.2% loyalty in 1050 (Dự án) shows that project investors are a very distinct segment from typical residential buyers.

💡 Strategy Implication

Sequential recommendation models MUST heavily penalize candidates that do not match the user's most recent interaction category. Reranker should have a feature is_same_category_as_last_view.
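A sketch of how the transition probabilities and the suggested reranker feature could be computed. It assumes each user's contacts are available as a time-ordered list of category IDs; the toy sequences are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def transition_probs(sequences: Sequence[List[int]]) -> Dict[int, Dict[int, float]]:
    """P(next category | previous category) over consecutive contacts."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev_cat, next_cat in zip(seq, seq[1:]):
            counts[prev_cat][next_cat] += 1
    probs: Dict[int, Dict[int, float]] = {}
    for prev_cat, nxt in counts.items():
        total = sum(nxt.values())
        probs[prev_cat] = {cat: c / total for cat, c in nxt.items()}
    return probs


def is_same_category_as_last_view(candidate_category: int, last_view_category: int) -> int:
    """Binary reranker feature suggested above."""
    return int(candidate_category == last_view_category)


toy_probs = transition_probs([[1020, 1020, 1020, 1050], [1050, 1050]])
```

The diagonal of this matrix is the "loyalty" figure reported in the evidence (for example 1050 -> 1050).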

[INS-044] — Candidate Cascade Slot Competition & Recall@200 Ceiling

Discovered in: Round 18 (Candidate Generator Evaluation)
Category: Algorithm Architecture
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Standalone Recall@200:
- ALS: 0.1749
- Intent: 0.1140
- Pageview: 0.0913
- UserKNN: 0.0862
- Seller: 0.0302
- SegPop: 0.0079
Combined Recall@200 (hard cascade with ALS first): 0.1840 (but Recall@10 dropped to 0.0395).
Combined Recall@200 (hard cascade with PV first): 0.2396 (but Recall@10 is 0.0795).
Under the current hard cascade, ALS/Intent greedily fill up the 200 slots, preventing KNN, Seller, and CoContact from adding value.

🏠 Domain Explanation

A rigid cascade priority queue is perfect for generating a final Top-10 list, but flawed for generating a Candidate Pool for a Reranker. High-volume generators like ALS or Intent fill up the 200-slot quota instantly, starving high-precision local matches (like Pageview Replay or CoContact) of slots. If ALS is placed first, the final top-10 precision is destroyed because ALS has poor precision in the top ranks.

💡 Strategy Implication

We must shift from a "hard priority cascade" to a "diverse union generator" for candidate generation. Instead of slot-filling until 200 is reached, we should extract a fixed budget of candidates from each generator (e.g., top 50 from PV, top 50 from ALS, top 50 from Intent, top 50 from KNN) and union them to form a robust, high-recall candidate pool (aiming for Recall@200 > 0.40). We then let the LightGBM Reranker sort the final top-10 list.

[INS-045] — Budget-based Sequential Union Dramatically Improves Recall@200

Discovered in: Round 19 (Cascade Budget Optimization)
Category: Algorithm Architecture
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Hard cascade (ALS first, greedy fill): Recall@200 = 0.1840
Hard cascade (PV first, greedy fill): Recall@200 = 0.2396
Budget-based sequential union: Recall@200 = 0.3177 (+33% improvement over PV-first cascade)
Budget config (best): PV=50, Intent=60, CoContact=40, ALS=100, ALS_View=0, UserKNN=50, Seller=40, RecentCC=80
Source distribution: ALS=177k, Intent=69k, RecentCC=62k, CoContact=46k, PV=28k, UserKNN=25k, Seller=13k, SegPop=16k

🏠 Domain Explanation

Each recommender has its own strengths: ALS is good for warm users with contact history, Intent for fresh listings, and RecentCC for cold start. Under a hard cascade, the first model "eats" all 200 slots and the models behind it are starved completely. Budget caps let EVERY model contribute candidates, which yields a more diverse pool.

💡 Strategy Implication

Keep sequential priority (higher-quality models run first), but cap the budget of each source.
Total budget > 200 so that 200 candidates are always available (sources overlap).
RecentCC and SegPop are essential fallbacks for cold-start users.
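A sketch of a budget-based sequential union, not the project's CascadeCandidateGenerator. The budgets are the "best" config from the evidence above; that duplicates do not consume a source's budget is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

# Budgets from the "Budget config (best)" line above; they sum to 420 > 200.
BEST_BUDGETS: Dict[str, int] = {
    "PV": 50, "Intent": 60, "CoContact": 40, "ALS": 100,
    "ALS_View": 0, "UserKNN": 50, "Seller": 40, "RecentCC": 80,
}


def budget_union(
    sources: Sequence[Tuple[str, List[str]]],
    budgets: Dict[str, int],
    pool_size: int = 200,
) -> List[str]:
    """Visit sources in priority order; each adds at most budgets[name] new items."""
    pool: List[str] = []
    seen = set()
    for name, candidates in sources:
        taken = 0
        budget = budgets.get(name, 0)
        for item in candidates:
            if taken >= budget or len(pool) >= pool_size:
                break
            if item in seen:
                continue  # duplicates do not consume budget (assumption)
            seen.add(item)
            pool.append(item)
            taken += 1
    return pool


toy_sources = [("PV", ["a", "b", "c"]), ("ALS", ["b", "d", "e", "f"])]
capped = budget_union(toy_sources, {"PV": 2, "ALS": 2}, pool_size=10)
greedy = budget_union(toy_sources, {"PV": 99, "ALS": 99}, pool_size=3)
```

In the toy run the capped union keeps items from both sources, while the greedy fill (effectively a hard cascade) lets the first source take every slot.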

[INS-046] — Round-Robin Interleave is INFERIOR to Sequential Priority

Discovered in: Round 19 (Cascade Budget Optimization)
Category: Algorithm Architecture
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Sequential priority union: Recall@200 (Active GT) = 0.3152
Round-robin interleave: Recall@200 (Active GT) = 0.2753 (-12.7%)
Round-robin source distribution: ALS=61k, ALS_View=51k, Intent=31k, PV=17k, UserKNN=34k, CoContact=27k, Seller=36k, RecentCC=67k, SegPop=75k

🏠 Domain Explanation

Round-robin gives each source 1 item per turn. For warm users with rich history, SegPop/RecentCC (low-precision fallbacks) take too many slots in the early turns and push out high-precision personalized candidates from ALS/Intent. Example: the ALS item at rank #5 (very accurate) is replaced by the SegPop item at rank #5 (popularity noise). Sequential priority ensures high-precision sources fill first, and low-precision sources fill only the remaining slots.

💡 Strategy Implication

Do NOT use round-robin for candidate generation. Sequential priority is the right design.
Round-robin works well only when all sources have comparable precision, but in practice ALS/Intent precision >>> SegPop.
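A toy comparison of the two fill orders, to make the displacement effect concrete. Both functions are simplified sketches with hypothetical item IDs; they ignore per-source budgets.

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple

Sources = Sequence[Tuple[str, List[str]]]


def sequential_fill(sources: Sources, pool_size: int) -> List[str]:
    """Exhaust each source in priority order before moving to the next."""
    pool: List[str] = []
    seen = set()
    for _, candidates in sources:
        for item in candidates:
            if item in seen:
                continue
            seen.add(item)
            pool.append(item)
            if len(pool) >= pool_size:
                return pool
    return pool


def round_robin_fill(sources: Sources, pool_size: int) -> List[str]:
    """Take one item per source per turn."""
    pool: List[str] = []
    seen = set()
    for turn in zip_longest(*(candidates for _, candidates in sources)):
        for item in turn:
            if item is None or item in seen:
                continue
            seen.add(item)
            pool.append(item)
            if len(pool) >= pool_size:
                return pool
    return pool


toy = [("ALS", ["a1", "a2", "a3", "a4", "a5"]), ("SegPop", ["s1", "s2", "s3", "s4", "s5"])]
seq_pool = sequential_fill(toy, pool_size=4)
rr_pool = round_robin_fill(toy, pool_size=4)
```

With 4 slots, round-robin drops the ALS items at ranks 3 and 4 in favour of two SegPop items, which is the displacement described in the domain explanation.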

[INS-047] — Pageview-based ALS (als_view) Dilutes Candidate Quality

Discovered in: Round 19 (Cascade Budget Optimization)
Category: Algorithm Architecture
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

WITH als_view (budget=80): Recall@200 (Active GT) = 0.3014
WITHOUT als_view (budget=0): Recall@200 (Active GT) = 0.3177 (+5.4% improvement)
als_view consumed ~80-95k candidate slots but did NOT increase recall.

🏠 Domain Explanation

Pageviews are a very noisy signal in real estate. A user views 100 listings but contacts only 1-2. ALS trained on pageviews recommends items "similar to what the user has viewed", yet the user SKIPPED most of those items (no contact). Contact-based ALS, by contrast, recommends items "similar to what the user DECIDED to contact", a much stronger signal. When als_view takes slots, it pushes out candidates from UserKNN, Seller, and RecentCC (which have higher precision).

💡 Strategy Implication

Disable als_view (budget=0) in the cascade generator.
Pageview data is still useful for: (1) PageviewReplay (recent views), (2) IntentRecommender (extract intent from views). But it should NOT be used as a collaborative filtering signal.
If pageview CF is used in the future, filter the noise: use only pageviews with dwell_time > 30s or pageviews that led to a contact.

[INS-048] — SegPop City Name Mismatch Bug

Discovered in: Round 20 (Leaderboard Diagnosis)
Category: Leaderboard Diagnosis / Engineering Bug
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

SegPop city keys: "Tp Hồ Chí Minh", "Hà Nội", "Đà Nẵng", ...
Cold-start fallback code used: "Hồ Chí Minh", "Hà Nội" → key mismatch!
Result: 96,075/161,568 test users (59.5%) received IDENTICAL 10 items
Top rank-1 item assigned to 96,075 users (should be <10% = 16k max)

🏠 Domain Explanation

SegPop uses city_name from dim_listing as its key. In the data, HCM is stored as "Tp Hồ Chí Minh" (with the "Tp" prefix). The cold-start fallback hardcoded "Hồ Chí Minh" (missing the prefix), so the key lookup returned nothing, every blind user fell through to the global fallback, and they all received the same 10 items.

💡 Strategy Implication

ALWAYS verify city name keys before hardcoding them: print(sorted(segpop._city.keys()))
Fixed with hash-based segment assignment: hash(user_id) % 12 segments → diverse output
Recorded in .agent/submission_rules.md (INS-048 rule)
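A sketch of the two guards above. One caveat worth stating: Python's built-in hash() on strings is salted per process unless PYTHONHASHSEED is fixed, so this sketch swaps in a hashlib digest to keep segment assignment reproducible across runs. That swap and the helper names are assumptions, not the project's implementation.

```python
import hashlib
from typing import Iterable, List


def stable_segment(user_id: str, n_segments: int = 12) -> int:
    """Deterministic segment index in [0, n_segments) for a user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_segments


def check_city_keys(wanted: Iterable[str], segpop_keys: Iterable[str]) -> None:
    """Fail loudly if a hardcoded city name is not an actual SegPop key."""
    missing: List[str] = sorted(set(wanted) - set(segpop_keys))
    if missing:
        raise KeyError(f"city keys not found in SegPop: {missing}")


known_keys = ["Tp Hồ Chí Minh", "Hà Nội", "Đà Nẵng"]
check_city_keys(["Tp Hồ Chí Minh", "Hà Nội"], known_keys)  # passes silently
segment = stable_segment("USER_1")
```

Calling check_city_keys(["Hồ Chí Minh"], known_keys) raises KeyError, which would have surfaced the missing "Tp" prefix before submission.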

[INS-049] — 56.4% Test Users Are Completely Blind (Zero Events)

Discovered in: Round 20 (Leaderboard Diagnosis)
Category: Leaderboard Diagnosis / Cold-Start
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Total test users:                 161,568
With contact history (training):   54,502 (33.7%)
With pageview history (training):  70,520 (43.6%)
With ANY training event:           70,520 (43.6%)
Completely blind (ZERO events):    91,048 (56.4%)

🏠 Domain Explanation

More than half of the test users are entirely new: they never appear in the training data. No contacts, no pageviews, no signal of any kind. For these users, every personalized model (ALS, UserKNN, CoContact, PV Replay, Intent) does NOT work. Only SegPop/RecentCC can serve them.

💡 Strategy Implication

91k users need a very strong cold-start strategy, NOT global popular
Hash-based segment assignment distributes blind users across the 12 top segments by contact volume
Combined with INS-051 (recency): SegPop for blind users should prioritize NEWLY posted items
Top teams reach 0.32 → they must have a far better cold-start strategy

[INS-050] — Offline Eval Does NOT Predict Leaderboard Score

Discovered in: Round 20 (Leaderboard Diagnosis)
Category: Leaderboard Diagnosis / Evaluation
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Offline eval (scripts/evaluate.py):
  - Val users: 57,907 (time-split, last 3 days of training)
  - 100% of val users HAVE contact history → warm users only
  - Recall@200 (Active GT): 0.3177
  - Recall@10 (Active GT): 0.0899

Leaderboard scores (actual submissions):
  - v4 ALS half_life=30d factors=256:    0.0060 (BEST)
  - v5 ALS half_life=7d filter=True:     0.0036
  - Hybrid ALS+SegPop+LightGBM:         0.0033
  - Cascade V3 (glob bug fixed):         0.0004
  - Cascade V5 (PV-first + SegPop bug):  0.0003
  - Top 1 on leaderboard:               ~0.32

Gap: best offline Recall@10=0.09 vs best leaderboard=0.006 (15x gap)
      vs top1=0.32 (53x gap from our best)

🏠 Domain Explanation

Offline eval only tests users who HAVE a contact in the validation period → 100% warm users. The test set has 56.4% completely blind users → the pipeline must handle cold-start, which offline eval cannot measure. In addition, the 3-day validation split may NOT reflect the test period (nearly 1 month).

💡 Strategy Implication

⚠️ DO NOT TRUST offline eval metrics. They overestimate because of selection bias (warm users only)
Need to build an eval pipeline that simulates the test distribution: 56% blind users, 34% warm
Priority: optimize cold-start strategy > optimize warm user recall
Recorded in .agent/submission_rules.md (Section 2.5)

[INS-051] — 50.3% Contacts on Items Posted ≤7 Days (Recency Signal)

Discovered in: Round 20 (Leaderboard Diagnosis)
Category: Leaderboard Diagnosis / Item Recency
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

Age of contacted items (days since posted, last 7 days of training):
  <=   1 day:  133,720 / 589,760 = 22.7%
  <=   3 days: 208,375 / 589,760 = 35.3%
  <=   7 days: 296,763 / 589,760 = 50.3%
  <=  14 days: 377,735 / 589,760 = 64.0%
  <=  30 days: 465,175 / 589,760 = 78.9%
  <=  90 days: 539,235 / 589,760 = 91.4%

🏠 Domain Explanation

The Vietnamese real estate market turns over extremely fast: 50% of contacts land on items posted within the last 7 days. Listings older than 30 days account for only 21% of contacts. Users actively look for NEW listings and do not return to old ones. This backs INS-035 (Recent Segment > Global) and INS-021 (Freshness Paradox) with hard numbers.

💡 Strategy Implication

SegPop MUST prioritize recently posted items: recency_score = contact_count / (age_days/7 + 1)
ALS does not capture recency → a recency boost feature is needed in the reranker
Built .cache/recency_segpop.parquet with 139,233 items scored by recency
Features for the LightGBM reranker: item_age_days, is_posted_7d, recency_score
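
The recency formula and the three reranker features can be sketched as follows. The function names are illustrative, and treating is_posted_7d as age_days <= 7 is an assumption based on the "≤7 days" bucket in the evidence above.

```python
from typing import Dict, Union


def recency_score(contact_count: float, age_days: float) -> float:
    """recency_score = contact_count / (age_days / 7 + 1), as stated in the note."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return contact_count / (age_days / 7 + 1)


def recency_features(contact_count: float, age_days: float) -> Dict[str, Union[float, bool]]:
    """Build the three reranker features named in the note for one item."""
    return {
        "item_age_days": age_days,
        "is_posted_7d": age_days <= 7,  # assumed threshold, inclusive
        "recency_score": recency_score(contact_count, age_days),
    }
```

With this form, an item keeps its full contact count on the day it is posted, half of it after 7 days, and a quarter after 21 days.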

[INS-052] — LightGBM Reranker Train/Test Distribution Mismatch

Discovered in: Round 21 (v11 submission failure)
Category: Experiment Failure / Model Architecture
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Training: EnsembleCandidateGenerator (ALS + SegPop only, ~3 sources)
Inference: CascadeCandidateGenerator (9 sources: ALS, PV, Intent, CoContact, UserKNN, Seller, RecentCC, SegPop)
Features: 28 features including score_als, score_view_als, score_segpop, is_from_*
Result: v11 hybrid (cascade k=200 + LightGBM rerank) = 0.0048 vs v10 (cascade k=10 direct) = 0.0340

🏠 Domain Explanation

LightGBM LambdaRank learned to score candidates based on EnsembleCandidateGenerator distributions — where score_als is the primary discriminator. In CascadeGen, many items come from Intent/PV/CoContact with score_als=0 → ranker incorrectly scores them low → top-10 becomes ALS-only, worse than diverse cascade.

💡 Strategy Implication

MUST retrain LightGBM on CASCADE candidates before using hybrid mode
Or: train separate models for warm users (ALS features) vs cold users (popularity features)
NEVER deploy a model trained on distribution A to score distribution B

[INS-053] — Training Pipeline Silently Overwrites Recency SegPop

Discovered in: Round 21 (root cause analysis of v11/v12 failures)
Category: Engineering Bug / Pipeline Integrity
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

segpop.pkl (recency, 4.5MB) → created 04:02, used for v10 (0.034)
Training pipeline ran at 04:14 → overwrote segpop.pkl with alltime version (6.1MB)
v11 (04:03) and v12 (04:25) used ALLTIME segpop → 0.0048 and 0.005
After restore: v13 (recency segpop) = identical stats to v10

💡 Strategy Implication

ALWAYS backup segpop_recency.pkl before running training pipeline
Add guard in training pipeline: skip SegPop fit if cached recency version exists
Or: training pipeline should save as segpop_trained.pkl, keep segpop.pkl as inference artifact

[INS-054] — Offset Diversity for Cold Users HURTS Performance

Discovered in: Round 21 (v12 experiment)
Category: Experiment Failure / Cold-Start
Impact Level: 🔴🔴 CRITICAL

📊 Data Evidence

v10 (top items from segment pool, no offset): 0.0340
v12 (hash-offset into segment pool for diversity): 0.0050

Cold rank-1 unique items: v10=2,192 → v12=8,037 (+266% diversity)
BUT: max users per rank-1 item: v10=8,144 → v12=642 (12x less concentrated)

🏠 Domain Explanation

SegPop items sorted by popularity/recency score. Position 0-9 in each segment = MOST contacted items. Offset pushes users to position 50-200 = LESS contacted items. More diverse ≠ more relevant. In BĐS, popular items ARE the best cold-start recommendations because popularity = demand signal.

💡 Lesson Learned

Diversity for diversity's sake HURTS when the ranking signal is strong
Only diversify when there's evidence that concentrated recommendations miss user preferences
Top SegPop items per segment are genuinely the best cold-start choices

[INS-055] — Warm Users Already at ~0.10 Recall; Cold Users = Primary Score Lever

Discovered in: Round 21 (segment analysis)
Category: Analysis / Strategy
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

v10 total leaderboard score: 0.034
Warm users: 54,502 (33.7%)
Cold users: 107,066 (66.3%)
Implied warm Recall@10: 0.034 / 0.337 = 0.101 (matches offline eval 0.1009!)
Implied cold Recall@10: ≈ 0 (all SegPop, same items per segment)

Offline eval (warm only, 5k users):
  Active GT Recall@10 = 0.1009
  Active GT Recall@200 = 0.3393

💡 Strategy Implication

To reach 0.10 total:
  Option A: warm=0.30, cold=0 → total = 0.30 × 0.337 = 0.101 (need 3x warm improvement)
  Option B: warm=0.10, cold=0.05 → total = 0.10 × 0.337 + 0.05 × 0.663 = 0.067
  Option C: warm=0.15, cold=0.03 → total = 0.15 × 0.337 + 0.03 × 0.663 = 0.070

Warm users: Recall@200=0.34 → can potentially reach 0.15-0.20 Recall@10 with proper reranking
Cold users: Need ANY personalization signal — test user metadata? registration info?
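
The option arithmetic above is a share-weighted sum of segment recalls. A small sketch that reproduces it, assuming the leaderboard score is the user-weighted average of warm and cold Recall@10 (the decomposition this insight relies on); blended_score is an illustrative name.

```python
def blended_score(warm_recall: float, cold_recall: float, warm_share: float = 0.337) -> float:
    """Total score as a user-share-weighted average of warm and cold recall."""
    return warm_recall * warm_share + cold_recall * (1.0 - warm_share)


def implied_warm_recall(total_score: float, warm_share: float = 0.337) -> float:
    """Warm recall implied by a total score when cold recall is assumed to be 0."""
    return total_score / warm_share


# Options A, B, C from the note above.
options = {
    "A": blended_score(0.30, 0.00),
    "B": blended_score(0.10, 0.05),
    "C": blended_score(0.15, 0.03),
}
```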

[INS-056] — PV-First Cascade ≈ ALS-First for Warm Users

Discovered in: Round 21 (offline eval)
Category: Experiment / Architecture
Impact Level: 🟡 MEDIUM

📊 Data Evidence

ALS-first (budget=10): Recall@10 (Active GT) = 0.1009
PV-first (budget=3 PV + 7 ALS): Recall@10 (Active GT) = 0.0999
ALS vs PV top-10 overlap: mean=0.5/10 (nearly disjoint)

🏠 Domain Explanation

ALS and PV produce complementary but equally good top-10 lists. PV replays viewed items (14.5% of GT), ALS discovers new similar items (also ~10% hit rate). Neither dominates. The cascade order doesn't matter much because ALS fills all 10 slots for warm users anyway.

💡 Lesson Learned

ALS-first is marginally better → keep it
The gain from mixing sources is negligible at k=10 because ALS is already well-personalized
Real improvement must come from better candidates, not source ordering

[INS-057] — Removing is_login Filter DESTROYS Score (0.034→0.014)

Discovered in: Round 22 (v13 submission)
Category: Experiment / Data Filtering
Impact Level: 🔴🔴🔴 CRITICAL LESSON

📊 Data Evidence

WITH is_login filter (v10/v14):
  Contact pairs: 13,020,004 (810,411 users, density=16.1 contacts/user)
  ALS matrix: 810K × 691K, nnz=13M
  Score: 0.0340 / 0.0344

WITHOUT is_login filter (v13):
  Contact pairs: 21,192,783 (2,813,537 users, density=7.5 contacts/user)
  ALS matrix: 2.8M × 731K, nnz=21M  
  Score: 0.0140 (-59%!)

Difference: +62.8% more data, BUT score dropped 59%

🏠 Domain Explanation

Non-login events come from anonymous/device-level sessions. These user_ids are NOT the same users evaluated in GT (ground truth only counts login contacts). Adding 2M+ anonymous users to the ALS matrix:

Diluted embeddings: Same 256 factors spread across 3.5x more users → less expressive per user
Added noise: Anonymous browsing patterns ≠ purchase intent patterns of logged-in users
Reduced density: 16.1→7.5 contacts/user → sparser matrix → worse factorization

💡 Lesson Learned

is_login filter is CORRECT and MUST be kept — it's not a bug, it's a feature
More data ≠ better model. QUALITY > QUANTITY for collaborative filtering
ALS matrix density (contacts/user ratio) is more important than raw matrix size
ALWAYS validate hypotheses with offline eval before submitting
The 10K "newly warm" test users were warm with ANONYMOUS contacts that GT doesn't evaluate

🎯 Actionable Rule

NEVER remove is_login filter from production pipeline.
Non-login events may be useful ONLY as side features (e.g., item popularity boost),
NOT as primary collaborative filtering signal.

[INS-058] — ALS Matrix Density > Size: The Embedding Quality Principle

Discovered in: Round 22 (analysis of v13 failure)
Category: Analysis / Algorithm Design
Impact Level: 🔴🔴🔴 CRITICAL INSIGHT

📊 Data Evidence

Density comparison:
  Login-only: 13M pairs / 810K users = 16.1 contacts/user → score 0.034
  All users:  21M pairs / 2.8M users = 7.5 contacts/user → score 0.014
  
Density dropped 53%, score dropped 59%. The drops are of similar size, though two data points cannot establish a linear relationship.

256 ALS factors:
  810K users → ~0.032% density in factor space
  2.8M users → ~0.009% density → 3.5x sparser embeddings

🏠 Domain Explanation

In implicit feedback collaborative filtering, embedding quality depends on:

Number of observed interactions per user (more = better personalization)
Signal-to-noise ratio (login contacts = high intent, non-login = browsing noise)
Factor dimensionality relative to user count (256 factors for 2.8M users = under-specified)

💡 Implications for Next Steps

To improve ALS: increase interactions/user (e.g., add PCI data for LOGIN users only)
To improve ALS: tune regularization/factors for current density level
Consider: ALS with higher factors (512?) or more iterations
Consider: time-weighted ALS where recent contacts count more
DO NOT add more users — add more signal per existing user

[INS-059] — 10,654 Blind Test Users Have PCI Data (Untapped)

Discovered in: Round 19 (src/eda/round_19_pci_untapped.py)
Category: Data Source / Cold-Start
Impact Level: 🔴🔴🔴 BREAKTHROUGH

📊 Data Evidence

fact_post_contact_interactions (PCI):
  Total: 25,486,445 rows, 1,872,512 users, 574,245 items
  Date range: 2025-11-09 to 2026-04-09

Test user coverage:
  60,212 test users in PCI (37.3%)
  10,654 "blind" test users have PCI data but ZERO in fact_user_events
  
Blind users PCI signal:
  173,651 rows (avg 16.3 items/user)
  26,268 rows with lead_count > 0
  3,613 rows with chat messages
  2,436 rows with purchased = True
  
Category distribution (blind PCI users):
  1020 (Căn hộ/CC): 48.5%
  1050 (Dự án): 18.9%
  1010 (Phòng trọ): 15.8%
  7,670 users have recent data (after 2026-03-01)

🏠 Domain Explanation

PCI is a pre-aggregated daily contact/lead table independent from fact_user_events. Users who submitted lead forms, chatted with agents, or purchased through the platform appear in PCI even if their raw events weren't captured with is_login contacts. These 10,654 users represent HIGH-INTENT buyers/renters with proven commercial behavior.

💡 Feature Ideas

F-033: Build city+category preferences from PCI for 10,654 blind users
F-034: Use PCI lead_count as weight in SegPop matching
F-035: PCI purchased items as "golden" positive signal (weight 3x)

🎯 Actionable Next Steps

Extract user preferences (city, category) from PCI for blind users
Merge into cold_user_prefs.parquet → IntentRecommender will pick up
Potentially feed PCI pairs into ALS matrix (see INS-060)

[INS-060] — 644,732 NEW Lead Pairs from PCI Not in ALS Training

Discovered in: Round 19 (src/eda/round_19_pci_untapped.py)
Category: Data Source / Model Training
Impact Level: 🔴🔴🔴 BREAKTHROUGH

📊 Data Evidence

PCI lead pairs (lead_count > 0): 2,444,156 total
Already in ALS training:         1,799,424 (overlap)
NEW pairs from PCI:              644,732 (26.4% net new)
New unique users:                237,086

Current ALS matrix: 13,020,004 pairs (810,411 users)
After PCI merge:   ~13,664,736 pairs (+5%)
Potential users:   ~1,047,497 (+29%)

🏠 Domain Explanation

PCI aggregates contact metrics from a different pipeline than fact_user_events. The 644K new pairs represent contacts/leads that were captured through PCI's aggregation but not through fact_user_events is_contact flag. These are HIGH-QUALITY signals (lead_count > 0 = confirmed commercial intent).

💡 Strategy: Selective Merge (preserve density per INS-058)

CRITICAL: Do NOT blindly add all 237K new users (INS-058 lesson: density > size)
INSTEAD:
  Option A: Add PCI pairs ONLY for existing ALS users (increase density per user)
  Option B: Add PCI pairs for ALL users but increase ALS factors (512)
  Option C: Add PCI pairs for test users only (targeted improvement)
  
Recommended: Option A first (safe, increases density), then test Option C

🎯 Actionable Next Steps

Filter PCI lead pairs to only existing ALS users → merge into als_contact_pairs
Retrain ALS on enriched matrix
Offline eval → compare Recall@10 vs baseline
If improved, try Option C (add test user PCI pairs)
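
A minimal sketch of Option A (merge PCI lead pairs only for users already in the ALS matrix), assuming pairs are plain (user_id, item_id) tuples. The helper names are hypothetical, and the real als_contact_pairs structure may differ.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def merge_pci_for_existing_users(
    als_pairs: Iterable[Pair], pci_lead_pairs: Iterable[Pair]
) -> Tuple[Set[Pair], int]:
    """Add PCI pairs only for users already present in the ALS matrix.

    Returns the merged pair set and the number of net-new pairs added.
    """
    base = set(als_pairs)
    existing_users = {user for user, _ in base}
    new_pairs = {p for p in pci_lead_pairs if p[0] in existing_users} - base
    return base | new_pairs, len(new_pairs)


def density(pairs: Set[Pair]) -> float:
    """Contacts per user, the density measure used in INS-058."""
    users = {user for user, _ in pairs}
    return len(pairs) / len(users) if users else 0.0
```

Because no new users enter the matrix, density can only stay equal or rise under this merge, which is the property INS-058 asks to preserve.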

[INS-062] — other_interaction IS Signal, NOT Noise (A/B Tested)

Discovered in: Round 23 (A/B Test on ALS variants)
Category: Data Quality / Model Training
Impact Level: 🔴🔴🔴 CRITICAL CONFIRMATION

📊 Data Evidence

A/B Test: 3 ALS variants, offline eval on 5K val users, 256 factors, 30 iters, GPU

Variant A (ALL 5 types, equal weight):
  Pairs: 13M, Users: 810K, Density: 16.1
  Coverage: 100%, Recall@10: 0.0564, NDCG@10: 0.0814

Variant B (REAL 4 types only, no other_interaction):
  Pairs: 2.4M, Users: 335K, Density: 7.1
  Coverage: 75.7%, Recall@10: 0.0186 (-67%!!), NDCG@10: 0.0310

Variant C (Weighted: real=3x, other_interaction=1x):
  Pairs: 13M, Users: 810K, Density: 16.1
  Coverage: 100%, Recall@10: 0.0573 (+1.6%), NDCG@10: 0.0815

other_interaction breakdown:
  90.6M events (94.2% of all contacts)
  796K unique login users
  475K users ONLY have other_interaction (never real contact)
  14,671 test users would LOSE ALS coverage if removed

🏠 Domain Explanation

other_interaction covers any interaction other than a pageview: saving a listing, sharing, clicking "interested", etc. Although weaker than view_phone/chat, it is STILL a positive signal by the competition's definition (is_contact=1). Removing it cuts ALS density from 16.1→7.1 (INS-058) and drops 475K users from the embedding space.

💡 Strategy Implication

KEEP other_interaction in ALS training: confirmed by experiment
Weighted contacts (real=3x) give a small +1.6% improvement and are WORTH APPLYING
H-004-M (modified) in hypotheses.md is CORRECT
This registry entry is now backed by an actual A/B test, not just logical reasoning

🎯 Action

Update the preprocessor to use weighted contacts: real=3x, other_interaction=1x
Retrain ALS on the weighted data → expected +1.6% Recall@10
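
One way the real=3x / other_interaction=1x weighting could be applied before building the ALS matrix. This is a sketch under assumptions: the event type strings are taken from the contact types named elsewhere in this registry (view_phone/chat/zalo/sms), and collapsing repeated events per pair with max is a choice made here, not necessarily what the preprocessor does.

```python
from typing import Dict, Iterable, Tuple

# Assumed labels for the 4 "real" contact types; everything else counts as other_interaction.
REAL_TYPES = frozenset({"view_phone", "chat", "zalo", "sms"})


def pair_weights(
    events: Iterable[Tuple[str, str, str]],
    real_weight: float = 3.0,
    other_weight: float = 1.0,
) -> Dict[Tuple[str, str], float]:
    """Collapse (user_id, item_id, event_type) events to one weight per pair.

    A pair gets real_weight if it has any real contact, otherwise other_weight.
    """
    weights: Dict[Tuple[str, str], float] = {}
    for user_id, item_id, event_type in events:
        w = real_weight if event_type in REAL_TYPES else other_weight
        key = (user_id, item_id)
        weights[key] = max(weights.get(key, 0.0), w)
    return weights
```

The pair count and user count stay the same as Variant A, so density is unchanged; only the confidence values fed to ALS differ.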

[INS-063] — SegPop Ceiling ~1.6% Recall@10 Even with Perfect Segment Knowledge

Discovered in: Round 23 (src/eda/round_23_cold_start_ceiling.py)
Category: Cold-Start / Algorithm Ceiling
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Theoretical max Recall@10 (perfect city+cat): 0.0158
SegPop hit rates (blind val users, knowing true city+cat):
  Top-10:  1.22%
  Top-20:  2.18%
  Top-50:  4.10%
  Top-100: 6.24%
  Top-200: 9.02%
  Top-500: 14.22%

Blind val users: 13,460 (contacted 28,732 unique items in 3 days)

🏠 Domain Explanation

Real estate has extremely high item diversity: 28,732 items for 13,460 users in 3 days. Each (city, cat) segment holds thousands of items, but the top-10 covers only a very small fraction. Unlike e-commerce, where the top-10 popular products account for 30%+ of purchases, real estate users search deep in the long tail (every home is unique).

💡 Strategy Implication

CRITICAL: Popularity-based cold-start CANNOT solve the problem alone.
Even with perfect segment knowledge, ceiling is ~1.6% Recall@10.
Top teams reaching 0.32 MUST use a fundamentally different approach:
  - Content-based matching (listing features → user intent)
  - OR they have access to more user signals we're missing
  - OR the metric is computed differently than we assume
Focus should shift to WARM USER RERANKING as primary lever.

[INS-064] — Blind Users Contact Fresh Items (44% ≤7d) and Prefer 1050 (Dự án)

Discovered in: Round 23 (src/eda/round_23_cold_start_ceiling.py)
Category: Cold-Start / User Behavior
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Blind user contact item age distribution:
  ≤ 1 day:  11.2%
  ≤ 3 days: 27.5%
  ≤ 7 days: 43.9%
  ≤14 days: 59.1%
  ≤30 days: 75.1%

Blind user category distribution:
  1050 (Dự án):   39.6% ← #1 (vs warm users where 1020 dominates)
  1020 (Căn hộ):   30.5%
  1010 (Phòng trọ): 15.9%
  1040 (Đất nền):    7.8%
  1030 (Nhà ở):      6.2%

Blind user city distribution:
  Tp Hồ Chí Minh: 73.8%
  Đà Nẵng:          6.5%
  Hà Nội:           6.4%

🏠 Domain Explanation

Blind users (no training history) are likely NEW users exploring the platform. They disproportionately contact 1050 (Dự án/new projects) listings because these are heavily marketed: billboard ads, Google Ads, and social media campaigns drive new users to specific projects. Fresh items dominate because new users arrive via marketing of newly-launched developments.

💡 Strategy Implication

1. SegPop for blind users should overweight 1050 (Dự án) category
   Current hash allocation doesn't reflect this 40% preference
2. Fresh items (≤7d) should be prioritized over historically popular items
3. Consider building a "new user" SegPop variant that:
   - Weights items by (recency × segment_contact_volume)
   - Allocates 4/10 slots to 1050, 3/10 to 1020, 2/10 to 1010, 1/10 to 1040
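
A sketch of the proposed 4/3/2/1 slot allocation, assuming each category already has a ranked item list (for example by recency × segment_contact_volume). allocate_slots and NEW_USER_QUOTAS are illustrative names. Items are emitted grouped by category in quota order; interleaving would be a separate design choice.

```python
from typing import Dict, List

# Quotas from the note above: 4/10 to 1050, 3/10 to 1020, 2/10 to 1010, 1/10 to 1040.
NEW_USER_QUOTAS: Dict[str, int] = {"1050": 4, "1020": 3, "1010": 2, "1040": 1}


def allocate_slots(ranked: Dict[str, List[str]], quotas: Dict[str, int], k: int = 10) -> List[str]:
    """Fill k slots by per-category quota, then backfill if a category runs short."""
    picks: List[str] = []
    seen = set()

    def take(items: List[str], limit: int) -> None:
        taken = 0
        for item in items:
            if taken >= limit or len(picks) >= k:
                break
            if item not in seen:
                seen.add(item)
                picks.append(item)
                taken += 1

    # First pass: honor each category's quota in the given order.
    for cat, quota in quotas.items():
        take(ranked.get(cat, []), quota)
    # Second pass: fill any remaining slots from the same categories.
    for cat in quotas:
        take(ranked.get(cat, []), k - len(picks))
    return picks
```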

[INS-065] — Val Distribution ≠ Test Distribution (76.8% warm vs 36%)

Discovered in: Round 24 (scripts/evaluate_aligned.py)
Category: Evaluation / Distribution Mismatch
Impact Level: 🔴🔴🔴 CRITICAL

📊 Data Evidence

Val GT users: 57,907 (classified by pre-split data)
  Warm (contact history):     44,447 (76.8%)
  Cold+signal (login/PCI, no contacts): 2,735 (4.7%)
  Truly blind:                10,725 (18.5%)

Test users: 161,568 (from INS-049)
  Login events:               58,153 (36.0%)
  Non-login only:             12,367 (7.7%)
  Truly blind:                91,048 (56.4%)

🏠 Domain Explanation

Val users selected by having val-period contacts are biased toward active users. Test set includes ALL registered users, many of whom never engaged. Any offline eval using val GT overweights warm users relative to test.

💡 Strategy Implication

Offline eval MUST weight segments to simulate test distribution
Sampled 3,600 warm + 770 cold + 5,630 blind to approximate test ratio
But only 10,725 blind-with-GT users exist in val → limited blind evaluation power
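
The segment weighting can be sketched as below. TEST_SHARES uses the rounded shares that reproduce the 3,600 / 770 / 5,630 sample; the segment labels and function names are illustrative. The simulated leaderboard value is assumed to be a share-weighted sum of per-segment Recall@10, which is consistent with the Simulated LB figure reported under INS-070 (0.0633 warm, 0.0517 cold-with-signal, 0.0011 blind → 0.0274).

```python
from typing import Dict

# Rounded test-set shares: login (warm), non-login only (cold with signal), truly blind.
TEST_SHARES: Dict[str, float] = {"warm": 0.360, "cold_signal": 0.077, "blind": 0.563}


def sample_quotas(shares: Dict[str, float], n_total: int) -> Dict[str, int]:
    """Number of val users to sample per segment so the sample mirrors the test mix."""
    return {seg: round(share * n_total) for seg, share in shares.items()}


def simulated_lb(recall_by_segment: Dict[str, float], shares: Dict[str, float]) -> float:
    """Share-weighted Recall@10 across segments (assumed form of the simulated LB)."""
    return sum(shares[seg] * recall_by_segment[seg] for seg in shares)
```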

[INS-066] — Model Leak: ALS/SegPop Trained on Full Data Inflate Eval

Discovered in: Round 24 (scripts/evaluate_aligned.py)
Category: Evaluation / Data Leakage
Impact Level: 🔴🔴🔴 CRITICAL

📊 Data Evidence

Truly blind Recall@10 = 0.1654 (model leak present)
INS-063 ceiling:       0.0158 (clean SegPop, same users)
Inflation factor: ~10x

Warm Recall@10 = 0.0712 (model leak present)
Expected clean:  ~0.06 (estimated, matches v10 warm decomposition)

🏠 Domain Explanation

SegPop was fitted on contacts INCLUDING the 3-day val period. Items popular during val period are perfectly ranked for val users. ALS embeddings similarly encode val-period user-item interactions. This creates circular evaluation: model "predicts" data it was trained on.

💡 Strategy Implication

CRITICAL: Must retrain ALS + SegPop on contacts.filter(last_date <= split_date)
before any trustworthy offline eval. Current absolute numbers are MEANINGLESS.
Relative comparisons (A vs B with same leak) may still be directionally valid.

🎯 Actionable Next Steps

Add --retrain_clean to evaluate_aligned.py → retrain SegPop + ALS on train-only
Re-run eval on clean models to establish TRUE baseline
Then ablate: cascade vs hybrid, PCI prefs vs no prefs
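
A minimal sketch of the train-only split that --retrain_clean implies, assuming contacts are (user_id, item_id, contact_date) tuples. The function names are hypothetical, and the actual evaluate_aligned.py logic is not shown in this registry.

```python
from datetime import date
from typing import List, Sequence, Tuple

Contact = Tuple[str, str, date]


def split_contacts(contacts: Sequence[Contact], split_date: date) -> Tuple[List[Contact], List[Contact]]:
    """Train on contacts dated <= split_date; everything later is validation GT."""
    train = [c for c in contacts if c[2] <= split_date]
    val = [c for c in contacts if c[2] > split_date]
    return train, val


def assert_train_is_clean(train: Sequence[Contact], split_date: date) -> None:
    """Raise if any training contact falls inside the validation window."""
    latest = max((c[2] for c in train), default=None)
    if latest is not None and latest > split_date:
        raise ValueError(f"Leak: training contact dated {latest} is after {split_date}")
```

Running the check right before fitting SegPop and ALS makes the leak fail fast instead of surfacing later as inflated recall.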

[INS-067] — PCI Prefs Provide 30x Recall Uplift for Cold Users (CLEAN EVAL)

Discovered in: Round 24 (scripts/evaluate_aligned.py --retrain_clean)
Category: Evaluation / Cold-Start / CONFIRMED
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Split-clean eval (ALS+SegPop retrained on data <= split_date):
  Cold + prefs (PCI/PV): Recall@10 = 0.0612 (n=715)  ← HIGHEST in eval
  Cold (no prefs):       Recall@10 = 0.0020 (n=55)
  Uplift: 30.6x

Prefs breakdown:
  Contact-based prefs: 3,600 (warm users only)
  PCI prefs (split-clean): 43
  Pageview prefs: 672
  Total with prefs: 4,315/10,000

🏠 Domain Explanation

Cold users with SIGNAL (pageviews or PCI leads but no contacts) can be effectively served by the cascade recommender when we extract their city+category preference. IntentRecommender matches them to fresh listings in their preferred segment. Without prefs, they fall back to global SegPop which has ~0 recall.

💡 Strategy Implication

CRITICAL: Expanding PCI coverage is the highest-ROI action:
- INS-059 shows 10,654 blind TEST users have PCI data
- Currently only 43 val cold users matched PCI prefs (small sample)
- Each converted blind→cold user could gain 0.06 recall per user
- 10,654 × 0.06 / 161,568 = +0.004 total LB score from PCI alone

[INS-068] — ALS Extremely Sensitive to Most Recent Contacts (5.6x Drop)

Discovered in: Round 24 (scripts/evaluate_aligned.py --retrain_clean)
Category: Model Architecture / Recency
Impact Level: 🔴🔴🔴 CRITICAL

📊 Data Evidence

Production ALS (ALL contacts including val 3d):
  Warm Recall@10 ≈ 0.10 (implied from v14 LB=0.0344)
  Pairs: 13,020,004

Clean ALS (contacts <= split_date only):
  Warm Recall@10 = 0.0179
  Pairs: 12,737,124 (only 2.2% fewer)

Recall drop: 5.6x from removing just 2.2% of most recent data

🏠 Domain Explanation

BĐS market moves fast — the most recent contacts capture current user intent. A user's contacts from 3 months ago may represent a completely different life situation (already bought, changed city, etc.). The 3-day val period contacts are so predictive because they're the MOST recent signal. Removing them forces ALS to extrapolate from older, less relevant interactions.

💡 Strategy Implication

1. Time-weighted ALS: weight recent contacts exponentially higher
   Current: equal weight. Proposed: weight = exp(-days_ago / half_life)
2. For production inference: ALWAYS train on ALL available data up to current date
   The time-split eval artificially handicaps ALS by removing the most valuable signal
3. For offline eval: accept that clean eval underestimates production recall
   True production warm recall ≈ 0.10, not 0.0179
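
The proposed weight can be sketched as follows. Note that with weight = exp(-days_ago / half_life), the weight at days_ago = half_life is 1/e ≈ 0.368, not 0.5, so the parameter acts as a decay time constant. A variant that halves exactly at half_life is included for comparison; which one the pipeline should use is left open here.

```python
import math


def decay_weight(days_ago: float, half_life: float) -> float:
    """Weight as written in the note: exp(-days_ago / half_life)."""
    return math.exp(-days_ago / half_life)


def true_half_life_weight(days_ago: float, half_life: float) -> float:
    """Alternative that equals exactly 0.5 when days_ago == half_life."""
    return 0.5 ** (days_ago / half_life)
```

For example, with half_life=30 a contact from today has weight 1.0 under both forms, while a 30-day-old contact has weight about 0.368 under the first and 0.5 under the second.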

INS-069: LightGBM Overfits to Warm Features

Round: 24
Date: 2026-05-21
Domain: Model Architecture / Reranking
Category: Model Architecture
Impact Level: 🔴🔴🔴 CRITICAL

📊 Data Evidence

In leak-free, aligned offline evaluation:

Cascade-Direct (k=10):
  Warm Recall@10 = 0.0285
  Cold-with-signal Recall@10 = 0.0528

Hybrid Mode (Cascade k=200 -> LightGBM reranker):
  Warm Recall@10 = 0.0668 (+134.4% relative gain)
  Cold-with-signal Recall@10 = 0.0127 (-75.9% relative loss)

🏠 Domain Explanation

A single LightGBM ranking model trained on all data learns to heavily rely on rich user behavior features (such as historical contact rates, active collaborative filtering scores, and total views). For warm users, these features are highly predictive. However, cold-start users (login/PCI signal but no contacts) have sparse/missing values for these behavioral features. The model, trained almost exclusively on warm patterns, interprets the absence of behavioral signals as a negative indicator, penalizing cold candidates. This forces relevant cold listing recommendations to the bottom of the list.

💡 Strategy Implication

1. Deploy a Segmented Inference Policy:
   - For WARM users: Route through Cascade (k=200) -> LightGBM Reranker.
   - For COLD/BLIND users: Route directly from Cascade (k=10) (no LightGBM reranker), or route through a specialized cold-start reranker.
2. A single unified ranking pipeline tends to be suboptimal when user state distributions (sparse vs dense) are highly skewed, as the cold-with-signal loss above shows.
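
A sketch of the segmented inference policy, assuming the cascade and the reranker are available as callables. The segment label "warm" and the recommend signature are illustrative and not the pipeline's actual interface.

```python
from typing import Callable, List, Sequence


def recommend(
    segment: str,
    cascade: Callable[[int], Sequence[str]],
    rerank: Callable[[Sequence[str]], Sequence[str]],
    k: int = 10,
    pool_size: int = 200,
) -> List[str]:
    """Route warm users through cascade(k=200) + reranker; everyone else gets cascade(k=10)."""
    if segment == "warm":
        candidates = cascade(pool_size)
        return list(rerank(candidates))[:k]
    # Cold-with-signal and blind users bypass the reranker entirely.
    return list(cascade(k))[:k]
```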

INS-070: Snapshot 7-Day Demand Is the Best Current Truly-Blind Fallback

Round: 24
Date: 2026-05-21
Domain: Cold-Start / Blind Users / Snapshot Demand
Category: Candidate Generation
Impact Level: 🔴🔴 HIGH

📊 Data Evidence

Targeted EDA on all 10,725 truly-blind validation users compared no-preference fallback strategies:

global_score7 (contacts_7d*20 + views_7d): Recall@10 = 0.001190, hits = 63
snap_hcm_prop_4_3_2_1:                    Recall@10 = 0.000660, hits = 43
contact_weighted_segments:                 Recall@10 = 0.000593, hits = 43
snap_weighted_segments:                    Recall@10 = 0.000575, hits = 35
global_score7_fresh:                       Recall@10 = 0.000538, hits = 27
global_score1_fresh:                       Recall@10 = 0.000510, hits = 21
global_fresh_only:                         Recall@10 = 0.000000

Full split-clean aligned eval after deploying snapshot fallback:

Before snapshot fallback:
  Simulated LB = 0.0271
  Truly blind = 0.0001

After snapshot fallback:
  Simulated LB = 0.0274
  Warm = 0.0633
  Cold-with-signal = 0.0517
  Truly blind = 0.0011

🏠 Domain Explanation

For users with no contact, no login signal, and no PCI preference, there is no reliable user-side personalization. The best available signal is item-side market demand from recent snapshots. Pure posted_date freshness is not enough: users contact listings that are both recent and demand-proven, not merely new.

💡 Strategy Implication

1. Use snapshot last-7-day demand as the default no-preference blind fallback.
2. Do not use pure freshness as a blind strategy.
3. Hash/segment diversity can be used for rank-1 exposure control, but should not replace the top demand item set.
4. The remaining blind ceiling is low unless a new user-side signal source is found.
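
The winning fallback score from the evidence above, global_score7 = contacts_7d*20 + views_7d, can be sketched as below. The input layout (a dict of item_id to 7-day counts) and the tie-break on item_id are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def global_score7(contacts_7d: int, views_7d: int, contact_weight: int = 20) -> int:
    """Snapshot demand score: contacts in the last 7 days weighted 20x over views."""
    return contacts_7d * contact_weight + views_7d


def top_demand(items: Dict[str, Tuple[int, int]], k: int = 10) -> List[str]:
    """Rank item_ids by global_score7 descending, breaking ties by item_id."""
    return sorted(items, key=lambda i: (-global_score7(*items[i]), i))[:k]
```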

[INS-071] — 4,215 Truly-Blind Test Users Have Non-Login Pageviews with Extractable Preferences

Discovered in: Round 25 (TASK-030 Cold Signal Discovery)
Category: Cold-Start Signal
Impact Level: 🟡🟡 MEDIUM (requires hypothesis verification — see H-029)

📊 Data Evidence

Total test users:           161,568
Currently truly blind:       94,875 (58.7%)
  - With LOGIN events:           0 (all login users already covered)
  - With NON-LOGIN events:   4,276 (4.5% of blind)
  - With NO events at all:  90,599 (95.5% of blind)

Non-login pageview users (subset of 4,276):
  - Users with pageviews:        4,215
  - All 4,215 have both pref_city AND pref_cat extractable
  - Avg pageviews/user:         12.1 (median: 4)
  - 1,187 users have REAL contacts (view_phone/chat/zalo/sms)

City distribution:
  HCM:       2,983 (70.8%)
  Hà Nội:      300 (7.1%)
  Đà Nẵng:     292 (6.9%)
  Bình Dương:   191 (4.5%)

Category distribution:
  1020 (Căn hộ):   1,647 (39.1%)
  1050 (Dự án):    1,105 (26.2%)
  1010 (Phòng trọ):  775 (18.4%)
  1040 (Đất nền):    401 (9.5%)
  1030 (Nhà ở):      287 (6.8%)

🏠 Domain Explanation

These 4,215 users browsed listings on Chợ Tốt without logging in (device-level sessions). Their user_id is a device/cookie identifier, NOT a logged-in account ID. Per INS-057 and lesson #9, there are two conflicting considerations:

For using these prefs: The user_id IS in test_users.parquet, so Kaggle expects recommendations for them. If Kaggle's GT includes non-login contacts, these users CAN have non-zero recall.
Against using these prefs: If Kaggle's GT only counts login contacts (like our offline eval does), then these device-level user_ids will never have GT contacts → recall contribution = 0 regardless of recommendations.

Key fact: Overlap with both login and non-login: 0 — no user_id appears in both login and non-login events, confirming these are fundamentally different identifier types.

⚠️ Reconciliation with INS-057

INS-057 established that removing is_login from the ENTIRE pipeline (including ALS contact matrix) dropped LB score -59%. However, the proposed action here is DIFFERENT:

INS-057 experiment: Added non-login contacts to ALS TRAINING → diluted embeddings
INS-071 proposal: Add non-login PAGEVIEW PREFERENCES only to cold_user_prefs.parquet → zero impact on ALS

The key question is NOT whether non-login data hurts ALS (it does), but whether Kaggle evaluates non-login user_ids at all. This requires H-029 verification.

💡 Strategy Implication

1. DO NOT change ALS training or contact_pairs (INS-057/058 lesson stands).
2. ONLY modify _process_cold_user_prefs to also extract preferences from non-login pageviews.
3. This is ZERO-RISK to warm/cold-with-signal users (their flow is untouched).
4. Potential upside: 4,215 users × ~1.6% SegPop city+cat recall (INS-063 ceiling) / 161,568 test users ≈ +0.0004 total
5. But if Kaggle ignores non-login GT, upside = 0.
6. Verify via H-029 before spending a submission attempt.

[INS-072] — v17 Leaderboard Breakthrough: ALS 1024 + Cascade Direct Reached 0.2116 / Top5

Discovered in: Round 25 (user leaderboard submit)
Category: Leaderboard / Production Baseline
Impact Level: 🔴🔴🔴 GAME-CHANGING

📊 Data Evidence

Leaderboard:
  v14 previous best: 0.0344
  v17 current best:  0.2116 (Top5)
  Relative gain:     6.15x

Validated artifact:
  File: outputs/submission_1024.zip
  CSV inside zip: submission.csv
  Rows: 1,615,680
  Users: 161,568
  Columns: ID,user_id,rank,item_id
  Unique items: 62,947
  Rank-1 top item: 9,948 users (<10% rule)
  Zip size: 41.37 MB
  Validator: ALL SUBMISSION RULES PASS

Model artifact:
  ALS factors: 1024
  ALS user_factors: (810,411, 1024)
  ALS item_factors: (696,252, 1024)
  ALS model size: 5.8 GB

🏠 Domain Explanation

The winning shift was not another reranker layer. The public leaderboard rewarded a strong high-capacity collaborative retrieval model trained on all available positive contact data, then served through a direct top-10 cascade. Skipping LightGBM removes the warm/cold distribution overfit from INS-069. Increasing ALS capacity from 256 to 1024 factors greatly improves warm-user ranking, while Recency SegPop and intent fallbacks keep cold/blind users valid.

The uppercase ID column is also essential: .agent/submission_rules.md requires ID,user_id,rank,item_id. Earlier lowercase id validation was wrong for this competition.

💡 Strategy Implication

1. Treat v17 as the new production baseline.
2. Keep config.inference_mode = "cascade" unless a new ablation beats 0.2116 on leaderboard.
3. Do not re-enable unified LightGBM for production without a clean segment-specific proof.
4. Future gains should be ablated against v17, not the old 0.034 baseline.
5. Submission validation must enforce uppercase ID and zip/gz packaging.
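
A hedged sketch of the submission checks implied above: exact uppercase header, row count, rows per user, and the rank-1 concentration rule. It operates on CSV text and is not the project's actual validator. The 1-based rank values and the strict "<10%" threshold are assumptions drawn from the numbers in this registry.

```python
import csv
import io
from collections import Counter
from typing import List

EXPECTED_COLUMNS = ["ID", "user_id", "rank", "item_id"]


def validate_submission(csv_text: str, n_users: int, k: int = 10, max_rank1_share: float = 0.10) -> List[str]:
    """Return a list of rule violations; an empty list means all checks passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"header must be {EXPECTED_COLUMNS}, got {reader.fieldnames}"]
    rows = list(reader)
    errors: List[str] = []
    if len(rows) != n_users * k:
        errors.append(f"expected {n_users * k} rows, got {len(rows)}")
    per_user = Counter(r["user_id"] for r in rows)
    if len(per_user) != n_users:
        errors.append(f"expected {n_users} users, got {len(per_user)}")
    if any(count != k for count in per_user.values()):
        errors.append(f"every user must have exactly {k} rows")
    # Rank-1 concentration: the most frequent rank-1 item must cover < 10% of users.
    rank1 = Counter(r["item_id"] for r in rows if r["rank"] == "1")
    if rank1:
        item, count = rank1.most_common(1)[0]
        if count >= max_rank1_share * n_users:
            errors.append(f"rank-1 item {item} assigned to {count} users")
    return errors
```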

[INS-073] — Snapshot Blind Fallback Was Leaderboard-Rejected

Discovered in: Round 26
Category: Leaderboard Failure / Cold-Start
Impact Level: 🔴🔴🔴 CRITICAL GUARDRAIL

📊 Data Evidence

Submission: outputs/submission_snapshot_blind.zip
Public LB: 0.0003

Offline context before submission:
  Snapshot demand fallback looked useful in aligned blind eval:
  blind recall improved from ~0.0001 to ~0.0011.

Leaderboard reality:
  0.0003 is near the original broken baseline range and roughly 700x below v17 (0.2116).

🏠 Domain Explanation

The snapshot demand signal measures item-side market activity, not user-side intent. For truly blind users, offline validation can over-reward items that were recently active in the validation window, whereas the public LB appears to reward the high-capacity ALS/cascade list structure much more strongly. Blind-fallback experiments are therefore especially vulnerable to offline/LB mismatch.

💡 Strategy Implication

1. Do not use snapshot fallback in final production submissions.
2. Keep snapshot code gated away from cascade production unless a tiny LB ablation proves benefit.
3. Treat blind fallback as low-ceiling; protect warm ALS quality first.
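
The mismatch behind these guardrails can be made explicit with a small comparison helper. This is an illustrative sketch: the function and key names are hypothetical, the offline figures are the approximate values quoted above, and the offline recall (blind segment only) and the public LB score are on different scales, so only the direction of each change is meaningful.

```python
V17_PUBLIC_LB = 0.2116  # protected baseline from INS-072


def offline_vs_lb(offline_before, offline_after, lb_candidate,
                  lb_baseline=V17_PUBLIC_LB):
    """Contrast an offline improvement with the leaderboard outcome."""
    return {
        "offline_gain_x": offline_after / offline_before,
        "lb_ratio": lb_candidate / lb_baseline,
        # Promotion rule: only a leaderboard win counts, never the offline gain.
        "promote": lb_candidate > lb_baseline,
    }


# Snapshot blind fallback: ~0.0001 -> ~0.0011 offline, 0.0003 on the public LB.
report = offline_vs_lb(0.0001, 0.0011, 0.0003)
print(report["promote"])  # prints "False"
```

Here the offline gain is about 11x while the leaderboard score is about 0.14% of v17, which is why the promotion decision is tied to the leaderboard result alone.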

[INS-074] — ALS1536 + Time Decay Did Not Beat ALS1024 Baseline

Discovered in: Round 26
Category: Leaderboard Failure / ALS Capacity
Impact Level: 🔴🔴 HIGH

📊 Data Evidence

v17 baseline:
  File: outputs/submission_1024.zip
  Public LB: 0.2116
  ALS factors: 1024
  Mode: cascade direct

v18 experiment:
  File: outputs/submission_1536.zip
  Public LB: 0.2108
  ALS factors: 1536
  Added: time-decay ALS, pci_merge_mode=test_only, non-login cold prefs, recent_cc=5

Delta:
  0.2108 - 0.2116 = -0.0008

🏠 Domain Explanation

More ALS capacity and recency weighting did not automatically improve the leaderboard score. The v18 branch was close but still worse than v17, so the current evidence says ALS1024 is the safest production capacity. The degradation is small and confounded by four simultaneous additions on top of the factor increase, so it cannot be attributed to any single change, but it is enough to reject v18 as final.

💡 Strategy Implication

1. Keep outputs/submission_1024.zip as the protected best artifact.
2. Do not assume larger ALS factors are better beyond 1024.
3. Any future factor/time-decay work must be isolated one variable at a time.
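
Item 3 can be sketched as a generator that turns the bundled v18 changes into single-variable runs. This is an assumption-laden illustration: pci_merge_mode and recent_cc appear in the v18 notes, but the other key names and all baseline values except the 1024 factors are placeholders, not the project's real config.

```python
def single_variable_ablations(baseline, changes):
    """Yield (name, config) pairs that each differ from baseline in exactly one key."""
    for key, value in changes.items():
        if baseline.get(key) == value:
            continue  # nothing to test: the baseline already uses this value
        config = dict(baseline)  # shallow copy so the baseline is never mutated
        config[key] = value
        yield f"{key}={value}", config


# Placeholder baseline: only als_factors=1024 is taken from the v17 record.
V17_CONFIG = {
    "als_factors": 1024,
    "als_time_decay": False,         # hypothetical key
    "pci_merge_mode": "baseline",    # placeholder value
    "nonlogin_cold_prefs": False,    # hypothetical key
    "recent_cc": None,               # placeholder value
}
# The five changes that v18 applied at once.
V18_CHANGES = {
    "als_factors": 1536,
    "als_time_decay": True,
    "pci_merge_mode": "test_only",
    "nonlogin_cold_prefs": True,
    "recent_cc": 5,
}

runs = list(single_variable_ablations(V17_CONFIG, V18_CHANGES))
for name, _config in runs:
    print(name)
```

Each of the five resulting runs changes one setting against v17, so a score difference can be attributed to that setting rather than to the bundle.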

[INS-075] — Slot-Level Blending Damaged v17

Discovered in: Round 26
Category: Leaderboard Failure / Ensembling
Impact Level: 🔴🔴🔴 CRITICAL GUARDRAIL

📊 Data Evidence

v17 baseline:
  outputs/submission_1024.zip = 0.2116

v19 conservative blend:
  outputs/submission_blend_v17_9_v18_1.zip = 0.1974
  Policy: keep v17 ranks 1-9, replace rank 10 with first unique v18 item.

Delta:
  0.1974 - 0.2116 = -0.0142

🏠 Domain Explanation

Even the tenth slot of v17 carries meaningful signal. Replacing only one item per user with a v18 item introduced enough noise to lose 6.7% of the score in relative terms (0.0142 / 0.2116). This also suggests the public metric is sensitive to the exact v17 cascade ordering, not just to rank-1 items.
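
For reference, the v19 policy described in the evidence above can be sketched as follows. The function name is hypothetical, and "first unique v18 item" is read here as the first v18 item not already among the kept v17 ranks 1-9; the original implementation may have used a different rule.

```python
def blend_tail_slot(primary, secondary, keep=9):
    """Keep the first `keep` primary items, then add the first new secondary item."""
    kept = list(primary[:keep])
    seen = set(kept)
    for item in secondary:
        if item not in seen:
            return kept + [item]
    # No new item available: fall back to the primary list unchanged.
    return list(primary[:keep + 1])


# Toy lists for one user (item ids are made up for illustration).
v17 = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9", "a10"]
v18 = ["a1", "a3", "b7", "a10", "b2", "a2", "a4", "a5", "a6", "b9"]

blended = blend_tail_slot(v17, v18)
print(blended[-1])  # prints "b7"

# Relative loss of the v19 blend against v17 on the public leaderboard.
relative_loss = (0.2116 - 0.1974) / 0.2116
print(f"{relative_loss:.1%}")  # prints "6.7%"
```

Only the last slot changes per user, which is what makes the measured 6.7% relative drop notable.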

💡 Strategy Implication

1. Do not blend by mechanically replacing tail slots.
2. Treat v17 full top-10 as an atomic strong baseline.
3. Ensemble only if a learned or segment-specific policy proves it beats v17 before submission.