Purpose: Store EVERY insight discovered so that this file becomes a cumulative knowledge base. Rule: Every insight must have a unique ID, evidence, and a feature suggestion. When to read this file: Before starting a new round, to avoid duplicating effort.
| Category | Insight count | Breakthrough? |
|---|---|---|
| Data Quality | 1 | - |
| Data Scale | 1 | - |
| Marketplace Structure | 1 | - |
| Algorithm Architecture | 4 | ✅ GAME-CHANGING |
| Leaderboard Diagnosis | 4 | ✅ ROOT CAUSE |
| Experiment Failures (Round 21) | 5 | ⚠️ LESSONS |
| Experiment Failures (Round 22) | 2 | 🔴 CRITICAL LESSONS |
| PCI Data Discovery (Round 19) | 2 | 🔴🔴🔴 BREAKTHROUGH |
| Cold-Start Ceiling (Round 23) | 2 | 🔴🔴🔴 GAME-CHANGING |
| Eval Infrastructure (Round 24) | 5 | 🔴🔴🔴 CRITICAL |
| Leaderboard Breakthrough (Round 25) | 1 | ✅ GAME-CHANGING |
| Leaderboard Postmortem (Round 26) | 3 | 🔴 GUARDRAILS |
| TOTAL | 31 | - |
| ID | Round | Category | Headline | Impact | Feature Idea? |
|---|---|---|---|---|---|
| INS-001 | 01 | Data Quality | Systematic nullity in dim_listing by property type | 🟡 MED | is_apartment flag |
| INS-002 | 01 | Data Scale | fact_user_events = 161.7M rows, 500 files | 🔴 HIGH | Must pre-aggregate |
| INS-003 | 01 | Marketplace | Agent sellers dominate 83.5% of listings | 🟡 MED | Fairness metric input |
| INS-045 | 19 | Algorithm | Budget-based sequential union beats hard cascade: Recall@200 0.27→0.31 | 🔴🔴🔴 | Budget caps per source |
| INS-046 | 19 | Algorithm | Round-robin interleave HURTS recall vs sequential priority | 🔴🔴 | Keep sequential |
| INS-047 | 19 | Algorithm | als_view (pageview CF) dilutes candidate pool — disable improves Recall@200 | 🔴🔴 | Set als_view budget=0 |
| INS-048 | 20 | Leaderboard | SegPop city name bug: "Hồ Chí Minh" ≠ "Tp Hồ Chí Minh" → 91k users same items | 🔴🔴🔴 | Fix key names |
| INS-049 | 20 | Leaderboard | 56.4% test users have ZERO training events — completely blind | 🔴🔴🔴 | Hash-based segment assignment |
| INS-050 | 20 | Leaderboard | Offline eval doesn't predict leaderboard: best=0.006 vs top1=0.32 (53x gap) | 🔴🔴🔴 | Need test-aligned eval |
| INS-051 | 20 | Leaderboard | 50.3% contacts on items posted ≤7 days → recency > popularity | 🔴🔴 | Recency-weighted SegPop |
| INS-052 | 21 | Experiment | LightGBM reranker trained on EnsembleGen ≠ CascadeGen distribution | 🔴🔴🔴 | Must retrain |
| INS-053 | 21 | Engineering | Training pipeline overwrites segpop.pkl with alltime version | 🔴🔴🔴 | Backup/restore |
| INS-054 | 21 | Experiment | Offset diversity for cold users HURTS: top items are most relevant | 🔴🔴 | Don't offset |
| INS-055 | 21 | Analysis | Warm users already at ~0.10 recall; cold users (66%) ≈ 0 | 🔴🔴🔴 | Cold=primary lever |
| INS-056 | 21 | Experiment | PV-first cascade ≈ ALS-first (0.0999 vs 0.1009) | 🟡 | Keep ALS-first |
| INS-057 | 22 | Experiment | Removing is_login filter HURTS: 0.034→0.014 (-59%). Non-login = noise | 🔴🔴🔴 | KEEP is_login filter |
| INS-058 | 22 | Analysis | ALS matrix density is key: 16.1→7.5 contacts/user killed embeddings | 🔴🔴🔴 | Density > size |
| INS-059 | 19 | Data Source | 10,654 blind test users have PCI data (avg 16.3 items) — convert blind→warm | 🔴🔴🔴 | PCI prefs for blind |
| INS-060 | 19 | Data Source | 644,732 NEW lead pairs from PCI not in ALS training data | 🔴🔴🔴 | Merge PCI into ALS |
| INS-061 | 19 | Architecture | 4-stage pipeline (Cascade→Feature→LightGBM→Reranker) code EXISTS but unused since v11 bug | 🔴🔴🔴 | Retrain LightGBM on cascade |
| INS-063 | 23 | Cold-Start | SegPop ceiling ~1.6% Recall@10 even with PERFECT city+cat knowledge | 🔴🔴🔴 | Popularity alone cannot solve cold-start |
| INS-064 | 23 | Cold-Start | 44% blind contacts on items ≤7d old; 1050 (Dự án) = #1 category for blind users | 🔴🔴🔴 | Freshness-first SegPop, category reweighting |
| INS-065 | 24 | Eval | Val: 76.8% warm / 4.7% cold / 18.5% blind — Test: 36% / 7.7% / 56.4%. Distribution mismatch | 🔴🔴🔴 | Must simulate test ratio |
| INS-066 | 24 | Eval | ALS/SegPop trained on full data leaks val contacts → blind recall inflated 10x (0.165 vs 0.016 ceiling) | 🔴🔴🔴 | Must retrain models on split-clean data |
| INS-067 | 24 | Eval | Cold+PCI prefs = 0.0612 recall vs 0.0020 without (30x uplift). PCI prefs are critical for cold users | 🔴🔴🔴 | Expand PCI coverage to more cold/blind test users |
| INS-068 | 24 | Eval | ALS recall drops 5.6x when 3d val contacts removed (0.10→0.018). Most recent contacts are disproportionately important | 🔴🔴🔴 | Time-weight ALS toward recent contacts |
| INS-069 | 24 | Model Architecture | LightGBM ranker overfits to warm features, severely destroying cold-start recall | 🔴🔴🔴 | Implement Segmented Inference Policy |
| INS-071 | 25 | Cold-Start Signal | 4,215 truly-blind test users have non-login pageviews with extractable city+cat prefs. But INS-057 warns non-login = device-level IDs | 🟡🟡 | H-029: verify if non-login pref injection helps or is irrelevant |
| INS-072 | 25 | Leaderboard | v17 reached 0.2116 LB / top5: ALS 1024 + full-data cascade-direct + uppercase ID submission | 🔴🔴🔴 | Keep cascade mode as production baseline |
| INS-073 | 26 | Leaderboard Failure | Snapshot blind fallback scored 0.0003 on LB despite offline promise | 🔴🔴🔴 | Never use snapshot fallback for final unless LB-ablation proves it |
| INS-074 | 26 | Leaderboard Failure | ALS1536 + time-decay + test-only prefs scored 0.2108, slightly below v17 0.2116 | 🔴🔴 | ALS1024 remains production sweet spot |
| INS-075 | 26 | Leaderboard Failure | v17 top9 + v18 slot10 blend scored 0.1974; even rank10 replacement hurt badly | 🔴🔴🔴 | Do not slot-blend v17 unless full-list eval proves gain |
project_id: 88.71% null (2,756,219 / 3,107,114)
direction: 82.15% null
floors: 70.52% null
furnishing: 54.81% null
house_type: 51.47% null
bathrooms: 44.85% null
bedrooms: 31.78% null
Land plots (Đất nền, 1040) and houses (Nhà ở, 1030) naturally have no project_id, floors, or furnishing. The nullity is not a data error; it reflects the property type.
Feature name: is_apartment
Formula: project_id.is_not_null()
Expected impact: Strong signal cho category classification. LightGBM handles NaN natively.
fact_user_events: 161,731,336 rows, 500 files
fact_listing_snapshot: 19,762,167 rows, 62 files
fact_post_contact_interactions: 25,486,445 rows, 147 files
dim_listing: 3,107,114 rows, 40 files
Strategy: Pre-aggregate fact_user_events to user-level and item-level before joins.
Never operate at raw event level in feature engineering.
Use Polars LazyFrame + column pushdown + date filters.
agent: 2,593,063 (83.5%)
private: 514,051 (16.5%)
A characteristic of Vietnamese real estate: agents (môi giới) account for the majority of listings because private individuals are less familiar with posting professional listings. The Fairness metric must adjust exposure for private sellers.
Feature name: seller_type_encoded (binary)
Use in Fairness metric: Target ratio should reflect natural distribution, not 50/50.
Submission: agent=27.3%, private=72.7%
GT contacts: agent=52.0%, private=48.0%
Gap: −24.7 pp (agents heavily under-served)
Agents account for 83.5% of dim_listing but collectively receive only 52% of contacts, because private sellers have a 3x higher lead-per-listing rate. The system is pushing too many private sellers into the top-10, so agents react negatively, which hurts Chợ Tốt's B2B revenue.
Feature: seller_type_fairness_correction
Formula: if agent_ratio_current < 0.52: boost agent-seller items in reranker
Impact: Calibrate HealthMetrics.gt_dist with agent_ratio=0.520 (from real data)
Agents pay for premium placement. Under-serving them means churn risk and lower B2B revenue.
| Category | Submission | GT contacts | Gap |
|---|---|---|---|
| 1010 | 11.3% | 15.6% | -4.3pp (under) |
| 1020 | 41.2% | 44.6% | -3.4pp (under) |
| 1030 | 8.7% | 6.5% | +2.3pp (over) |
| 1050 | 29.0% | 23.1% | +5.9pp (OVER-SERVE) |
Feature: category_exposure_correction
Formula: KL divergence from GT category distribution → boost under-served categories
Used in: MultiObjectiveReranker fairness term γ
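A minimal sketch of the KL-divergence term behind category_exposure_correction, assuming the GT category shares recorded in gt_dist.json. The submission share for 1040 is not reported in this insight, so the value used here is an assumption (the remainder to 100%), and the function names `kl_divergence` and `exposure_gaps` are hypothetical, not taken from the codebase.

```python
import math

# GT contact share per category (values from gt_dist.json in this document).
GT_DIST = {"1010": 0.156, "1020": 0.446, "1030": 0.065, "1040": 0.102, "1050": 0.231}

# Submission share per category. 1040 is NOT reported in the source table;
# 0.098 is an assumed remainder so that the shares sum to 1.
SUBMISSION_DIST = {"1010": 0.113, "1020": 0.412, "1030": 0.087, "1040": 0.098, "1050": 0.290}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero and log(0)."""
    return sum(
        p_k * math.log((p_k + eps) / (q.get(k, 0.0) + eps))
        for k, p_k in p.items()
        if p_k > 0.0
    )


def exposure_gaps(gt, submission):
    """Positive gap means the category is under-served relative to GT contacts."""
    return {k: gt[k] - submission.get(k, 0.0) for k in gt}


gaps = exposure_gaps(GT_DIST, SUBMISSION_DIST)
under_served = sorted(k for k, gap in gaps.items() if gap > 0)
kl = kl_divergence(GT_DIST, SUBMISSION_DIST)
```

A reranker could scale its fairness term by `kl` and restrict the boost to the categories in `under_served`; how that plugs into MultiObjectiveReranker is not specified here.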
Submission — median listing age: 10 days, mean: 36 days
GT contacts — median listing age: 97 days, mean: 106 days
BUT PDF 2 reveals: 69.7% of all contacts happen in the first 7 days.
The 97-day median age for GT contacts is an illusion caused by Survivorship Bias. Bad listings are removed early. Only high-quality listings survive to 90+ days. The true "Golden Moment" is the first 7 days.
Recommendation: Keep ALS half_life at 7d to capture the 69.7% Golden Moment.
DO NOT raise half-life to 30d as originally hypothesized in R09.
Reranker delta: Maintain freshness weight to boost new items.
Items recommended: 115,340 / 3,107,114 = 3.71%
Top-1% items: 81.9% of all recommendation slots
96.3% of catalogue: NEVER recommended
A classic feedback loop: popular items → recommended → more views → more contacts → more popular. New sellers never get traction. Marketplace health degrades over time.
Feature: item_novelty_score = 1 - (popularity_rank / total_items)
Strategy: Add novelty bonus in BurstTrendingRecommender for long-tail items
Target: Raise coverage from 3.71% → 8%+ without sacrificing Recall@10
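The novelty formula above leaves the ranking direction open. The sketch below assumes that popularity_rank grows with popularity (the least-contacted item gets rank 1), so long-tail items score near 1 and the most popular item scores 0. That reading, and the helper names, are assumptions rather than the project's implementation.

```python
def item_novelty_scores(contact_counts):
    """Return item_id -> 1 - popularity_rank / total_items.

    Assumption: rank 1 is the LEAST contacted item, so long-tail items
    receive the highest novelty. Ties are broken by item_id for determinism.
    """
    total_items = len(contact_counts)
    ascending = sorted(contact_counts, key=lambda item: (contact_counts[item], item))
    return {
        item: 1.0 - rank / total_items
        for rank, item in enumerate(ascending, start=1)
    }


def catalogue_coverage(recommended_items, total_items):
    """Share of the catalogue that appears in at least one recommendation slot."""
    return len(set(recommended_items)) / total_items


scores = item_novelty_scores({"a": 100, "b": 10, "c": 1, "d": 0})
```

With the counts reported in this insight, `catalogue_coverage` over 115,340 distinct recommended items out of 3,107,114 reproduces the 3.71% figure.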
{
"agent_ratio": 0.520,
"category_dist": { "1010": 0.156, "1020": 0.446, "1030": 0.065, "1040": 0.102, "1050": 0.231 }
}
Saved to: .cache/gt_dist.json — loaded by HealthMetrics automatically.
Replace hardcoded values in HealthMetrics (agent_ratio=0.7, category_dist=generic)
with data-driven values. This is now done automatically via gt_dist_path param.
Before reranking: Diversity entropy = 0.6947, Fairness = 0.273
After reranking: Diversity entropy = 0.6986 (+0.004), Fairness = 0.273 (UNCHANGED)
Root cause: 101,441 cold users (63%) get homogeneous global trending → dominates aggregate
To meaningfully improve health metrics across ALL users:
1. Make BurstTrendingRecommender diversity-aware (inject agent items, balance categories)
2. Or: expand cold-start coverage via better ColdStartProfiler (remove require_login constraint)
3. Or: add novelty injection to global trending (force 20% long-tail items)
GT pairs (last 3 days): 62,893
Repeat contacts (user contacted before): 7,088 / 62,893 = 11.3%
Previously viewed (pageview before): 9,111 / 62,893 = 14.5%
ANY prior interaction: 9,130 / 62,893 = 14.5%
COMPLETELY NEW to user: 53,763 / 62,893 = 85.5%
Real estate differs from e-commerce: users do not "re-buy" items. They continually browse NEW listings in their area of interest. ALS/CF helps with only 14.5% of GT pairs; the rest must come from segment popularity or content-based matching.
CRITICAL: ALS collaborative filtering là SECONDARY signal, không phải PRIMARY.
PRIMARY signal = popularity within user's preferred (city, category) segment.
This explains why v1-v4 scored 0.006 — they over-relied on CF for 85.5% of GT.
GT pairs with known user prefs: 53,074
Same city as user preference: 48,775 / 53,074 = 91.9%
Same category as user preference: 38,297 / 53,074 = 72.2%
BOTH city + category match: 36,342 / 53,074 = 68.5%
Real-estate seekers almost ALWAYS search within a single city (92%). Category consistency is also high (72%): someone looking for an apartment rarely switches to land plots. This is a defining trait of the real-estate domain: buy/rent decisions are location-first.
Feature: user_preferred_city (mode of contacted cities) → MUST-HAVE filter
Feature: user_preferred_category (mode of contacted categories) → strong filter
Recommendation cascade: (city+cat+district) → (city+cat) → (city) → (cat) → global
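The fallback chain above can be sketched as a key lookup that walks from the most specific segment to the global list. The table layout (tuples as keys) and the function names are hypothetical; only the order of the levels comes from this document.

```python
def segment_keys(city=None, category=None, district=None):
    """Yield lookup keys from most to least specific, skipping unknown fields."""
    if city and category and district:
        yield ("city+cat+district", city, category, district)
    if city and category:
        yield ("city+cat", city, category)
    if city:
        yield ("city", city)
    if category:
        yield ("cat", category)
    yield ("global",)


def recommend_from_segments(segment_top_items, k=10, **prefs):
    """Fill k slots by walking the fallback chain; earlier levels take priority."""
    picked, seen = [], set()
    for key in segment_keys(**prefs):
        for item in segment_top_items.get(key, []):
            if item in seen:
                continue
            seen.add(item)
            picked.append(item)
            if len(picked) == k:
                return picked
    return picked


table = {
    ("city+cat", "Hà Nội", "1020"): ["a", "b"],
    ("city", "Hà Nội"): ["b", "c"],
    ("global",): ["z", "a", "y"],
}
recs = recommend_from_segments(table, k=4, city="Hà Nội", category="1020")
```

A user with no known preferences falls straight through to the global list, which is the situation the later cold-start insights describe.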
Submission unique items: 9,290
GT unique items (last 3d): 28,706
Overlap: 6,211 / 28,706 = 21.6% (only 1 in 5 GT items in submission!)
93.1% of GT users have post_contact history (NOT cold-start problem!)
Extremely heavy popularity bias: we recommend only 9K items to 161K users, while the GT needs 28K items. The submission covers only 21.6% of GT items, so Recall is capped at roughly 0.22 from the outset, regardless of ranking quality.
MUST diversify item pool: recommend 50K+ unique items across all users
Reduce popularity concentration: top-1% items should be <30% of slots (was 81.9%)
Use finer-grain segments (city+cat+district) to naturally diversify
| ID | Breakthrough | Impact |
|---|---|---|
| INS-025 | 85.5% GT items are NEW → CF is secondary, segment popularity is primary | 🔴🔴🔴 |
| INS-026 | 91.9% city match → location is the dominant filter | 🔴🔴🔴 |
| INS-027 | Submission covers only 21.6% of GT items → popularity bias kills score | 🔴🔴 |
| INS-022 | Coverage crisis: 3.71% → need long-tail strategy | 🔴🔴 |
| INS-019 | Agent fairness gap: 24.7pp → critical for B2B revenue | 🔴🔴 |
| INS-021 | Freshness paradox: half-life=7d too aggressive | 🔴 |
| INS-024 | Reranker ineffective for cold users → need cold trending diversity | 🔴 |
Positive Rate: 83.9%
Real Lead Rate: 20.5%
Median time to soft interact: 20s. Median to Real Lead: 40-67s.
Users save/share passively but hesitate to contact. Real contact takes 3x the time to decide.
Feature: time_to_contact (proxy for intent). Optimize UI to show price/area/location above the fold.
Đất nền (1040) Positive Rate: 87.6%
Nhà ở (1030) Positive Rate: 70.2%
Dự án (1050) Volume High, CR Low (78.4%)
Đất nền buyers have urgency. Dự án browsers are curious but avoid agents. Nhà ở lacks supply/demand.
Feature: category_urgency_weight. Boost 1040 for fast conversions.
Images: Top 5% listings have >= 8 images.
Furnishing: "Nội thất cao cấp" gives 1.63x lift. "Nhà trống" gives 0.50x.
Legal: "Sổ hồng riêng" gives 1.80x lift. "Giấy tờ viết tay" gives 0.21x.
High-quality images, premium furnishing, and clear legal status reduce buyer risk and increase confidence to contact.
Features: has_so_hong_rieng, has_noi_that_cao_cap, images_count >= 8. Strong predictors for LightGBM.
Cities: Bình Định/Khánh Hoà (180-220% CR) vs HN/HCM (~160%).
Category: Phòng trọ (1.87x lift) vs Dự án (0.38x lift).
Secondary markets have less supply, making each listing perform better. Dự án (Projects) have long nurture periods, while Phòng trọ converts immediately.
Feature: category_conversion_weight. Penalize 1050 in short-term predictions.
New Users = 59.7% of total users.
Retention 30D for New Users = 9.2% (90.8% drop off).
Power Users = 4.1% of total users, but Retention 30D = 89.7%.
New users leave if the first session recommendations do not match their intent. If they do not find relevance immediately, they assume the platform has no supply for them.
Cold-start fallback strategy MUST focus on the most popular, high-quality segments (Căn hộ, Phòng trọ in HCM/HN) to prevent immediate churn.
Baseline conversion to Power User: 2.56%
Conversion if user has 1 Contact in first 7 days: 7.85% (3.1x lift)
Conversion if user reaches >= 3 sessions in first 7 days: 19.65% (7.7x lift)
A single contact often means "Good Churn" (user found a room and uninstalled). Reaching 3 sessions means Habit Formation (user is researching, comparing, and treating the platform as a tool).
Total GT contacts for users with intent: 110,659
GT items present in dim_listing: 2,914 (2.6%)
GT items matching Top 1 Intent (District, Category, Price): 668 (22.9% of active items)
GT items matching Top 3 Intents (District, Category, Price): 931 (31.9% of active items)
GT items matching Top 1 (City, Category): 2,139 (73.4% of active items)
Real estate has extremely fast turnover. 97.4% of the listings users contacted were no longer on the platform at test time. So instead of trying to recommend OLD listings from history (CF), if we extract an intent profile from Pageview history and match it directly against the NEWEST listings in the same segment (district / category / price band), we can capture 31.9% of actual demand among active items.
CRITICAL: Intent-Based Recommendation is MANDATORY for cold-start items.
Implement `IntentRecommender` directly targeting `dim_listing`.
Place it high in the cascade hierarchy (Priority 1.5).
"Trending now" in a local area is much more relevant than "All-time popular". BĐS is highly temporal; properties popular 3 months ago are irrelevant.
PV Replay MUST be Priority 1. It represents the user's immediate, explicit intent.
Old pageviews crowd out high-quality fresh recommendations from fallbacks. A user viewing a property 25 days ago has likely moved on.
Optimal ordering by precision: Pageview -> CoContact -> ALS -> RecentCC -> SegPop. Drop CoView.
Real estate inventory is too sparse at the Phường/Xã level. Users are willing to cross Ward boundaries within the same District or City.
Elevate Intent matching to District level minimum.
valid_items was loaded via pl.read_parquet(dim_files[0]), i.e. from a single file of the partitioned dataset.
Result: 0.0003 on the public leaderboard.
When 97.5% of active inventory is artificially removed from the candidate pool, the IntentRecommender and CascadeCandidateGenerator are forced to recommend stale or irrelevant properties. Real estate relies heavily on the full breadth of active supply to match nuanced user queries.
ALWAYS load partitioned parquet files via pl.scan_parquet(dim_files).collect() rather than assuming a single file. Fixed in V6, immediately reviving candidate quality.
Cascade (Intent -> Pageview): Recall@10 = 0.0531
Cascade (Pageview -> Intent): Recall@10 = 0.1018 (1.9x Lift)
While IntentRecommender (District + Cat + Price) is brilliant for filling gaps and cold-start discovery, it CANNOT beat the explicit, exact-match signal of a user clicking on a specific property yesterday (PageviewReplay).
PageviewReplay MUST remain Priority 1. IntentRecommender serves as the ultimate high-quality Fallback (Priority 1.5) to capture the 27% Recall@200 ceiling.
adview_count = 0: Conversion Rate = 0.103
adview_count = 30: Conversion Rate drops to 0.087
adview_count = 150+: Conversion Rate rises back to 0.101
Correlation between adview_count and total_contacts: 0.7571
Listings with very low views but high conversion are often "Hidden Gems" or mispriced properties that get snapped up instantly. Listings with average views (30-50) are typical properties that users browse but hesitate to contact. "Mega-hot" listings (150+ views) are likely highly desirable projects where FOMO drives contact rates back up.
The correlation of 0.7571 proves that views_24h is one of the strongest predictive features for the Reranker. Must include views_24h and a non-linear feature like conversion_rate (contacts_24h / (views_24h + 1)) in LightGBM.
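As a small illustration of the smoothed ratio named above, the sketch below computes conversion_rate with the +1 in the denominator so that listings with zero views do not divide by zero. The `reranker_features` helper and its dictionary layout are hypothetical.

```python
def conversion_rate(contacts_24h, views_24h):
    """contacts_24h / (views_24h + 1); the +1 keeps zero-view listings finite."""
    return contacts_24h / (views_24h + 1)


def reranker_features(views_24h, contacts_24h):
    """Hypothetical feature row: raw views plus the non-linear ratio."""
    return {
        "views_24h": views_24h,
        "conversion_rate": conversion_rate(contacts_24h, views_24h),
    }


row = reranker_features(views_24h=29, contacts_24h=3)
```

Passing both the raw count and the ratio lets a tree model pick up the U-shaped pattern described in this insight instead of assuming a monotone effect of views.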
Unlike e-commerce where users might buy a phone then buy a case, real estate users are highly fixed in their intent. A user looking for a house (1030) rarely switches to renting a room (1010). The 87.2% loyalty in 1050 (Dự án) shows that project investors are a very distinct segment from typical residential buyers.
is_same_category_as_last_view.
A rigid cascade priority queue is perfect for generating a final Top-10 list, but flawed for generating a Candidate Pool for a Reranker. High-volume generators like ALS or Intent fill up the 200-slot quota instantly, starving high-precision local matches (like Pageview Replay or CoContact) of slots. If ALS is placed first, the final top-10 precision is destroyed because ALS has poor precision in the top ranks.
We must shift from a "hard priority cascade" to a "diverse union generator" for candidate generation. Instead of slot-filling until 200 is reached, we should extract a fixed budget of candidates from each generator (e.g., top 50 from PV, top 50 from ALS, top 50 from Intent, top 50 from KNN) and union them to form a robust, high-recall candidate pool (aiming for Recall@200 > 0.40). We then let the LightGBM Reranker sort the final top-10 list.
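A minimal sketch of the budgeted union described above, assuming each generator returns an already-ranked candidate list. Sources are visited in priority order (sequential, not round-robin), each contributes at most its budget, and a budget of 0 disables a source. The function name, source names, and budget values are illustrative assumptions.

```python
def budgeted_union(sources, budgets, pool_size=200):
    """Build a candidate pool from (name, ranked_candidates) pairs.

    Sources are consumed in the given order. Duplicates already in the pool
    are skipped and do not count against a source's budget.
    """
    pool, seen = [], set()
    for name, candidates in sources:
        budget = budgets.get(name, 0)
        taken = 0
        for item in candidates:
            if taken >= budget or len(pool) >= pool_size:
                break
            if item in seen:
                continue
            seen.add(item)
            pool.append((item, name))
            taken += 1
    return pool


sources = [
    ("pageview", ["a", "b", "c"]),
    ("als", ["b", "d", "e", "f"]),
    ("als_view", ["x"]),
    ("segpop", ["g", "h"]),
]
budgets = {"pageview": 2, "als": 2, "als_view": 0, "segpop": 5}
pool = budgeted_union(sources, budgets, pool_size=5)
pool_items = [item for item, _ in pool]
```

Keeping the source name next to each item makes it cheap to derive is_from_* style features for the reranker later.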
Each recommender model has its own strength: ALS is good for warm users with contact history, Intent is good for fresh listings, and RecentCC is good for cold-start. With a hard cascade, the first model "eats up" all 200 slots and the later models are starved completely. Budget caps let EVERY model contribute candidates, producing a more diverse pool.
Round-robin gives each source 1 item per turn. For warm users with rich history, SegPop/RecentCC (low-precision fallbacks) take too many slots in the early turns and push out high-precision personalized candidates from ALS/Intent. For example, the ALS item at rank #5 (very accurate) is replaced by the SegPop item at rank #5 (popularity noise). Sequential priority ensures high-precision sources fill first, and low-precision sources fill only the remaining slots.
Pageviews are a very noisy signal in real estate. Users view 100 listings but contact only 1-2. ALS trained on pageviews recommends items "similar to what the user has viewed", but most viewed items were then SKIPPED (no contact). Contact-based ALS, by contrast, recommends items "similar to what the user DECIDED to contact", a much stronger signal. When als_view takes slots, it pushes out candidates from UserKNN, Seller, and RecentCC (which have higher precision).
SegPop city keys: "Tp Hồ Chí Minh", "Hà Nội", "Đà Nẵng", ...
Cold-start fallback code used: "Hồ Chí Minh", "Hà Nội" → key mismatch!
Result: 96,075/161,568 test users (59.5%) received IDENTICAL 10 items
Top rank-1 item assigned to 96,075 users (should be <10% = 16k max)
SegPop uses city_name from dim_listing as its key. In the data, HCM is stored as "Tp Hồ Chí Minh" (with the "Tp" prefix). The cold-start fallback hardcoded "Hồ Chí Minh" (missing the prefix), so the key lookup returned empty, all blind users fell through to the global fallback, and they all received the same 10 items.
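One way to catch this class of bug before a submission is a fail-fast check that every city name used by the fallback exists among the fitted SegPop keys. The sketch below is an assumption about how such a guard could look; `missing_segment_keys` and `check_fallback_keys` are hypothetical helpers, and the key lists are the examples quoted in this insight.

```python
def missing_segment_keys(segpop_city_keys, fallback_city_names):
    """Return fallback names that have no matching SegPop key (exact match)."""
    known = set(segpop_city_keys)
    return sorted(name for name in fallback_city_names if name not in known)


def check_fallback_keys(segpop_city_keys, fallback_city_names):
    """Raise before inference if any fallback name would silently miss."""
    missing = missing_segment_keys(segpop_city_keys, fallback_city_names)
    if missing:
        raise KeyError(f"fallback city names not found in SegPop: {missing}")


segpop_keys = ["Tp Hồ Chí Minh", "Hà Nội", "Đà Nẵng"]
buggy_fallback = ["Hồ Chí Minh", "Hà Nội"]
missing = missing_segment_keys(segpop_keys, buggy_fallback)
```

A silent empty lookup is what sent every blind user to the same global list, so raising here is preferable to returning a default.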
print(sorted(segpop._city.keys()))
.agent/submission_rules.md (INS-048 rule)
Total test users: 161,568
With contact history (training): 54,502 (33.7%)
With pageview history (training): 70,520 (43.6%)
With ANY training event: 70,520 (43.6%)
Completely blind (ZERO events): 91,048 (56.4%)
More than half of test users are completely new: they never appear in the training data. No contacts, no pageviews, no signal of any kind. For these users, every personalized model (ALS, UserKNN, CoContact, PV Replay, Intent) does NOT work. Only SegPop/RecentCC can serve them.
Offline eval (scripts/evaluate.py):
- Val users: 57,907 (time-split, last 3 days of training)
- 100% of val users HAVE contact history → warm users only
- Recall@200 (Active GT): 0.3177
- Recall@10 (Active GT): 0.0899
Leaderboard scores (actual submissions):
- v4 ALS half_life=30d factors=256: 0.0060 (BEST)
- v5 ALS half_life=7d filter=True: 0.0036
- Hybrid ALS+SegPop+LightGBM: 0.0033
- Cascade V3 (glob bug fixed): 0.0004
- Cascade V5 (PV-first + SegPop bug): 0.0003
- Top 1 on leaderboard: ~0.32
Gap: best offline Recall@10=0.09 vs best leaderboard=0.006 (15x gap)
vs top1=0.32 (53x gap from our best)
Offline eval tests only users WITH a contact in the validation period, i.e. 100% warm users. The test set has 56.4% completely blind users, so the pipeline must handle cold-start, which offline eval cannot measure. In addition, the 3-day validation split may NOT reflect the test period (nearly 1 month).
.agent/submission_rules.md (Section 2.5)
Age of contacted items (days since posted, last 7 days of training):
<= 1 day: 133,720 / 589,760 = 22.7%
<= 3 days: 208,375 / 589,760 = 35.3%
<= 7 days: 296,763 / 589,760 = 50.3%
<= 14 days: 377,735 / 589,760 = 64.0%
<= 30 days: 465,175 / 589,760 = 78.9%
<= 90 days: 539,235 / 589,760 = 91.4%
Vietnamese real estate turns over extremely fast: 50% of contacts land on items posted within the last 7 days. Listings older than 30 days account for only 21% of contacts. Users actively look for NEW listings and do not return to old ones. This backs up INS-035 (Recent Segment > Global) and INS-021 (Freshness Paradox) with hard numbers.
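The recency weighting attached to this insight, recency_score = contact_count / (age_days/7 + 1), halves an item's weight after one week and quarters it after three. A minimal sketch follows; the `rank_by_recency` helper and the sample items are hypothetical.

```python
def recency_score(contact_count, age_days):
    """contact_count / (age_days/7 + 1): each extra week adds 1 to the divisor."""
    return contact_count / (age_days / 7 + 1)


def is_posted_7d(age_days):
    """Flag for the <= 7 day window that holds about half of all contacts."""
    return age_days <= 7


def rank_by_recency(items):
    """items: iterable of (item_id, contact_count, age_days); best first."""
    return sorted(items, key=lambda row: recency_score(row[1], row[2]), reverse=True)


ranked = rank_by_recency([("old_hit", 100, 28), ("fresh", 40, 0), ("mid", 60, 7)])
ranked_ids = [item_id for item_id, _, _ in ranked]
```

In this toy example a listing posted today with 40 contacts outranks a 28-day-old listing with 100, which is the behaviour the age distribution above argues for.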
recency_score = contact_count / (age_days/7 + 1)
.cache/recency_segpop.parquet with 139,233 items scored by recency
item_age_days, is_posted_7d, recency_score
Training: EnsembleCandidateGenerator (ALS + SegPop only, ~3 sources)
Inference: CascadeCandidateGenerator (9 sources: ALS, PV, Intent, CoContact, UserKNN, Seller, RecentCC, SegPop)
Features: 28 features including score_als, score_view_als, score_segpop, is_from_*
Result: v11 hybrid (cascade k=200 + LightGBM rerank) = 0.0048 vs v10 (cascade k=10 direct) = 0.0340
LightGBM LambdaRank learned to score candidates based on EnsembleCandidateGenerator distributions — where score_als is the primary discriminator. In CascadeGen, many items come from Intent/PV/CoContact with score_als=0 → ranker incorrectly scores them low → top-10 becomes ALS-only, worse than diverse cascade.
segpop.pkl (recency, 4.5MB) → created 04:02, used for v10 (0.034)
Training pipeline ran at 04:14 → overwrote segpop.pkl with alltime version (6.1MB)
v11 (04:03) and v12 (04:25) used ALLTIME segpop → 0.0048 and 0.005
After restore: v13 (recency segpop) = identical stats to v10
segpop_trained.pkl, keep segpop.pkl as inference artifact
v10 (top items from segment pool, no offset): 0.0340
v12 (hash-offset into segment pool for diversity): 0.0050
Cold rank-1 unique items: v10=2,192 → v12=8,037 (+266% diversity)
BUT: max users per rank-1 item: v10=8,144 → v12=642 (12x less concentrated)
SegPop items sorted by popularity/recency score. Position 0-9 in each segment = MOST contacted items. Offset pushes users to position 50-200 = LESS contacted items. More diverse ≠ more relevant. In BĐS, popular items ARE the best cold-start recommendations because popularity = demand signal.
v10 total leaderboard score: 0.034
Warm users: 54,502 (33.7%)
Cold users: 107,066 (66.3%)
Implied warm Recall@10: 0.034 / 0.337 = 0.101 (matches offline eval 0.1009!)
Implied cold Recall@10: ≈ 0 (all SegPop, same items per segment)
Offline eval (warm only, 5k users):
Active GT Recall@10 = 0.1009
Active GT Recall@200 = 0.3393
To reach 0.10 total:
Option A: warm=0.30, cold=0 → total = 0.30 × 0.337 = 0.101 (need 3x warm improvement)
Option B: warm=0.10, cold=0.05 → total = 0.10 × 0.337 + 0.05 × 0.663 = 0.067
Option C: warm=0.15, cold=0.03 → total = 0.15 × 0.337 + 0.03 × 0.663 = 0.070
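The options above all apply the same decomposition: total score = warm recall × warm share + cold recall × cold share, with the 33.7% / 66.3% split from this insight. A small sketch for checking such scenarios (the function names are hypothetical):

```python
WARM_SHARE = 0.337  # warm test users, from this insight
COLD_SHARE = 0.663  # cold test users, from this insight


def blended_recall(warm_recall, cold_recall, warm_share=WARM_SHARE):
    """Leaderboard-style total as a user-share-weighted average of segments."""
    return warm_recall * warm_share + cold_recall * (1.0 - warm_share)


def implied_warm_recall(total_score, warm_share=WARM_SHARE):
    """Warm recall implied by a total score when cold recall is assumed to be 0."""
    return total_score / warm_share


option_a = blended_recall(0.30, 0.00)
option_b = blended_recall(0.10, 0.05)
option_c = blended_recall(0.15, 0.03)
```

Only Option A reaches the 0.10 target on its own; B and C land near 0.07, which shows how much weight the cold segment carries.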
Warm users: Recall@200=0.34 → can potentially reach 0.15-0.20 Recall@10 with proper reranking
Cold users: Need ANY personalization signal — test user metadata? registration info?
ALS-first (budget=10): Recall@10 (Active GT) = 0.1009
PV-first (budget=3 PV + 7 ALS): Recall@10 (Active GT) = 0.0999
ALS vs PV top-10 overlap: mean=0.5/10 (nearly disjoint)
ALS and PV produce complementary but equally good top-10 lists. PV replays viewed items (14.5% of GT), ALS discovers new similar items (also ~10% hit rate). Neither dominates. The cascade order doesn't matter much because ALS fills all 10 slots for warm users anyway.
WITH is_login filter (v10/v14):
Contact pairs: 13,020,004 (810,411 users, density=16.1 contacts/user)
ALS matrix: 810K × 691K, nnz=13M
Score: 0.0340 / 0.0344
WITHOUT is_login filter (v13):
Contact pairs: 21,192,783 (2,813,537 users, density=7.5 contacts/user)
ALS matrix: 2.8M × 731K, nnz=21M
Score: 0.0140 (-59%!)
Difference: +62.8% more data, BUT score dropped 59%
Non-login events come from anonymous/device-level sessions. These user_ids are NOT the same users evaluated in GT (ground truth only counts login contacts). Adding 2M+ anonymous users to the ALS matrix:
NEVER remove is_login filter from production pipeline.
Non-login events may be useful ONLY as side features (e.g., item popularity boost),
NOT as primary collaborative filtering signal.
Density comparison:
Login-only: 13M pairs / 810K users = 16.1 contacts/user → score 0.034
All users: 21M pairs / 2.8M users = 7.5 contacts/user → score 0.014
Density dropped 53%, score dropped 59%. Near-linear relationship.
256 ALS factors:
810K users → ~0.032% density in factor space
2.8M users → ~0.009% density → 3.5x sparser embeddings
In implicit feedback collaborative filtering, embedding quality depends on:
(src/eda/round_19_pci_untapped.py)
fact_post_contact_interactions (PCI):
Total: 25,486,445 rows, 1,872,512 users, 574,245 items
Date range: 2025-11-09 to 2026-04-09
Test user coverage:
60,212 test users in PCI (37.3%)
10,654 "blind" test users have PCI data but ZERO in fact_user_events
Blind users PCI signal:
173,651 rows (avg 16.3 items/user)
26,268 rows with lead_count > 0
3,613 rows with chat messages
2,436 rows with purchased = True
Category distribution (blind PCI users):
1020 (Căn hộ/CC): 48.5%
1050 (Dự án): 18.9%
1010 (Phòng trọ): 15.8%
7,670 users have recent data (after 2026-03-01)
PCI is a pre-aggregated daily contact/lead table independent from fact_user_events. Users who submitted lead forms, chatted with agents, or purchased through the platform appear in PCI even if their raw events weren't captured with is_login contacts. These 10,654 users represent HIGH-INTENT buyers/renters with proven commercial behavior.
(src/eda/round_19_pci_untapped.py)
PCI lead pairs (lead_count > 0): 2,444,156 total
Already in ALS training: 1,799,424 (overlap)
NEW pairs from PCI: 644,732 (26.4% of PCI lead pairs are net new)
New unique users: 237,086
Current ALS matrix: 13,020,004 pairs (810,411 users)
After PCI merge: ~13,664,736 pairs (+5%)
Potential users: ~1,047,497 (+29%)
PCI aggregates contact metrics from a different pipeline than fact_user_events. The 644K new pairs represent contacts/leads that were captured through PCI's aggregation but not through fact_user_events is_contact flag. These are HIGH-QUALITY signals (lead_count > 0 = confirmed commercial intent).
CRITICAL: Do NOT blindly add all 237K new users (INS-058 lesson: density > size)
INSTEAD:
Option A: Add PCI pairs ONLY for existing ALS users (increase density per user)
Option B: Add PCI pairs for ALL users but increase ALS factors (512)
Option C: Add PCI pairs for test users only (targeted improvement)
Recommended: Option A first (safe, increases density), then test Option C
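Option A can be sketched as a set merge that admits PCI lead pairs only for users already present in the ALS matrix, so contacts per user can only go up and no new sparse rows are added. Representing pairs as (user_id, item_id) tuples and the helper names below are assumptions for illustration.

```python
def merge_pci_for_existing_users(als_pairs, pci_lead_pairs):
    """Add PCI (user, item) pairs only when the user already has ALS rows."""
    existing_users = {user for user, _ in als_pairs}
    merged = set(als_pairs)
    merged.update(
        (user, item) for user, item in pci_lead_pairs if user in existing_users
    )
    return merged


def contacts_per_user(pairs):
    """Density measure used in this file: pairs divided by distinct users."""
    users = {user for user, _ in pairs}
    return len(pairs) / len(users) if users else 0.0


als_pairs = {("u1", "a"), ("u1", "b"), ("u2", "a")}
pci_lead_pairs = [("u1", "c"), ("u3", "z"), ("u2", "a")]
merged = merge_pci_for_existing_users(als_pairs, pci_lead_pairs)
```

Here `u3` is new to ALS and is left out, while `u1` gains one pair, so density rises from 1.5 to 2.0 contacts per user.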
A/B Test: 3 ALS variants, offline eval on 5K val users, 256 factors, 30 iters, GPU
Variant A (ALL 5 types, equal weight):
Pairs: 13M, Users: 810K, Density: 16.1
Coverage: 100%, Recall@10: 0.0564, NDCG@10: 0.0814
Variant B (REAL 4 types only, no other_interaction):
Pairs: 2.4M, Users: 335K, Density: 7.1
Coverage: 75.7%, Recall@10: 0.0186 (-67%!!), NDCG@10: 0.0310
Variant C (Weighted: real=3x, other_interaction=1x):
Pairs: 13M, Users: 810K, Density: 16.1
Coverage: 100%, Recall@10: 0.0573 (+1.6%), NDCG@10: 0.0815
other_interaction breakdown:
90.6M events (94.2% of all contacts)
796K unique login users
475K users ONLY have other_interaction (never real contact)
14,671 test users would LOSE ALS coverage if removed
other_interaction is any interaction other than a pageview: saving a listing, sharing, clicking "interested", etc. Although weaker than view_phone/chat, it is STILL a positive signal under the competition's definition (is_contact=1). Removing it cuts ALS density from 16.1→7.1 (INS-058) and drops 475K users from the embedding space.
(src/eda/round_23_cold_start_ceiling.py)
Theoretical max Recall@10 (perfect city+cat): 0.0158
SegPop hit rates (blind val users, knowing true city+cat):
Top-10: 1.22%
Top-20: 2.18%
Top-50: 4.10%
Top-100: 6.24%
Top-200: 9.02%
Top-500: 14.22%
Blind val users: 13,460 (contacted 28,732 unique items in 3 days)
Real estate has extremely high item diversity: 28,732 items for 13,460 users in 3 days. Each (city, cat) segment has thousands of items, but the top-10 covers only a very small fraction. Unlike e-commerce, where the top-10 popular products account for 30%+ of purchases, real-estate users search deep in the long tail (every home is unique).
CRITICAL: Popularity-based cold-start CANNOT solve the problem alone.
Even with perfect segment knowledge, ceiling is ~1.6% Recall@10.
Top teams reaching 0.32 MUST use a fundamentally different approach:
- Content-based matching (listing features → user intent)
- OR they have access to more user signals we're missing
- OR the metric is computed differently than we assume
Focus should shift to WARM USER RERANKING as primary lever.
(src/eda/round_23_cold_start_ceiling.py)
Blind user contact item age distribution:
≤ 1 day: 11.2%
≤ 3 days: 27.5%
≤ 7 days: 43.9%
≤14 days: 59.1%
≤30 days: 75.1%
Blind user category distribution:
1050 (Dự án): 39.6% ← #1 (vs warm users where 1020 dominates)
1020 (Căn hộ): 30.5%
1010 (Phòng trọ): 15.9%
1040 (Đất nền): 7.8%
1030 (Nhà ở): 6.2%
Blind user city distribution:
Tp Hồ Chí Minh: 73.8%
Đà Nẵng: 6.5%
Hà Nội: 6.4%
Blind users (no training history) are likely NEW users exploring the platform. They disproportionately view 1050 (Dự án/new projects) because these are heavily marketed — billboard ads, Google Ads, social media campaigns drive new users to specific projects. Fresh items dominate because new users arrive via marketing of newly-launched developments.
1. SegPop for blind users should overweight 1050 (Dự án) category
Current hash allocation doesn't reflect this 40% preference
2. Fresh items (≤7d) should be prioritized over historically popular items
3. Consider building a "new user" SegPop variant that:
- Weights items by (recency × segment_contact_volume)
- Allocates 4/10 slots to 1050, 3/10 to 1020, 2/10 to 1010, 1/10 to 1040
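The "new user" SegPop variant proposed above could look roughly like this. It is a sketch under assumptions: the recency term is modeled as a 7-day half-life decay (the notes only say "recency"), `contacts` stands in for segment contact volume, and all names are hypothetical.

```python
from collections import defaultdict

# Slot allocation from the proposal: 4/10 to 1050, 3/10 to 1020, 2/10 to 1010, 1/10 to 1040.
SLOTS = {1050: 4, 1020: 3, 1010: 2, 1040: 1}


def new_user_segpop(items, half_life_days=7.0):
    """items: list of dicts with keys item_id, cat, age_days, contacts.

    Score = recency * contact volume. The decay shape is an assumption.
    """
    by_cat = defaultdict(list)
    for it in items:
        recency = 0.5 ** (it["age_days"] / half_life_days)
        by_cat[it["cat"]].append((recency * it["contacts"], it["item_id"]))
    recs = []
    for cat, n_slots in SLOTS.items():
        ranked = sorted(by_cat.get(cat, []), key=lambda pair: pair[0], reverse=True)
        recs.extend(item_id for _, item_id in ranked[:n_slots])
    return recs


# Toy catalogue, for illustration only.
items = [
    {"item_id": "p1", "cat": 1050, "age_days": 0, "contacts": 10},
    {"item_id": "p2", "cat": 1050, "age_days": 7, "contacts": 30},
    {"item_id": "p3", "cat": 1050, "age_days": 14, "contacts": 20},
    {"item_id": "p4", "cat": 1050, "age_days": 1, "contacts": 1},
    {"item_id": "p5", "cat": 1050, "age_days": 30, "contacts": 1},
    {"item_id": "a1", "cat": 1020, "age_days": 2, "contacts": 5},
    {"item_id": "h1", "cat": 1030, "age_days": 0, "contacts": 99},
]
recs = new_user_segpop(items)
```

Category 1030 receives no slot under the 4/3/2/1 split, so `h1` is dropped even though it has the most contacts.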
(scripts/evaluate_aligned.py)
Val GT users: 57,907 (classified by pre-split data)
Warm (contact history): 44,447 (76.8%)
Cold+signal (login/PCI, no contacts): 2,735 (4.7%)
Truly blind: 10,725 (18.5%)
Test users: 161,568 (from INS-049)
Login events: 58,153 (36.0%)
Non-login only: 12,367 (7.7%)
Truly blind: 91,048 (56.4%)
Val users selected by having val-period contacts are biased toward active users. Test set includes ALL registered users, many of whom never engaged. Any offline eval using val GT overweights warm users relative to test.
(scripts/evaluate_aligned.py)
Truly blind Recall@10 = 0.1654 (model leak present)
INS-063 ceiling: 0.0158 (clean SegPop, same users)
Inflation factor: ~10x
Warm Recall@10 = 0.0712 (model leak present)
Expected clean: ~0.06 (estimated, matches v10 warm decomposition)
SegPop was fitted on contacts INCLUDING the 3-day val period. Items popular during val period are perfectly ranked for val users. ALS embeddings similarly encode val-period user-item interactions. This creates circular evaluation: model "predicts" data it was trained on.
CRITICAL: Must retrain ALS + SegPop on contacts.filter(last_date <= split_date)
before any trustworthy offline eval. Current absolute numbers are MEANINGLESS.
Relative comparisons (A vs B with same leak) may still be directionally valid.
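A minimal sketch of the split-clean rule, mirroring `contacts.filter(last_date <= split_date)`. The tuple layout and helper names are assumptions for illustration; the real pipeline presumably works on dataframes.

```python
from datetime import date


def split_clean(contacts, split_date):
    """contacts: list of (user_id, item_id, contact_date).

    Training data is everything on or before split_date; validation is strictly after.
    """
    train = [c for c in contacts if c[2] <= split_date]
    val = [c for c in contacts if c[2] > split_date]
    return train, val


def assert_no_leak(train, val):
    """Fail loudly if any training contact falls inside the validation window."""
    latest_train = max(c[2] for c in train)
    earliest_val = min(c[2] for c in val)
    assert latest_train < earliest_val, "train overlaps the validation window"


# Hypothetical dates, for illustration only.
contacts = [
    ("u1", "i1", date(2024, 1, 1)),
    ("u1", "i2", date(2024, 1, 5)),
    ("u2", "i3", date(2024, 1, 6)),
]
train, val = split_clean(contacts, date(2024, 1, 5))
assert_no_leak(train, val)
```

Both ALS and SegPop would need to be fitted on `train` only for the resulting absolute recall numbers to be trustworthy.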
(scripts/evaluate_aligned.py --retrain_clean)
Split-clean eval (ALS+SegPop retrained on data <= split_date):
Cold + prefs (PCI/PV): Recall@10 = 0.0612 (n=715) ← HIGHEST in eval
Cold (no prefs): Recall@10 = 0.0020 (n=55)
Uplift: 30.6x
Prefs breakdown:
Contact-based prefs: 3,600 (warm users only)
PCI prefs (split-clean): 43
Pageview prefs: 672
Total with prefs: 4,315/10,000
Cold users with SIGNAL (pageviews or PCI leads but no contacts) can be effectively served by the cascade recommender when we extract their city+category preference. IntentRecommender matches them to fresh listings in their preferred segment. Without prefs, they fall back to global SegPop which has ~0 recall.
CRITICAL: Expanding PCI coverage is the highest-ROI action:
- INS-059 shows 10,654 blind TEST users have PCI data
- Currently only 43 val cold users matched PCI prefs (small sample)
- Each converted blind→cold user could gain 0.06 recall per user
- 10,654 × 0.06 / 161,568 = +0.004 total LB score from PCI alone
(scripts/evaluate_aligned.py --retrain_clean)
Production ALS (ALL contacts including val 3d):
Warm Recall@10 ≈ 0.10 (implied from v14 LB=0.0344)
Pairs: 13,020,004
Clean ALS (contacts <= split_date only):
Warm Recall@10 = 0.0179
Pairs: 12,737,124 (only 2.2% fewer)
Recall drop: 5.6x from removing just 2.2% of most recent data
The real-estate (BĐS) market moves fast: the most recent contacts capture current user intent. A user's contacts from 3 months ago may represent a completely different life situation (already bought, changed city, etc.). The 3-day val period contacts are so predictive because they are the MOST recent signal. Removing them forces ALS to extrapolate from older, less relevant interactions.
1. Time-weighted ALS: weight recent contacts exponentially higher
Current: equal weight. Proposed: weight = exp(-days_ago / half_life)
2. For production inference: ALWAYS train on ALL available data up to current date
The time-split eval artificially handicaps ALS by removing the most valuable signal
3. For offline eval: accept that clean eval underestimates production recall
True production warm recall ≈ 0.10, not 0.0179
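One detail of the proposed weight is worth pinning down. With `exp(-days_ago / half_life)` as written, the weight at `days_ago == half_life` is about 0.368, not 0.5, so the parameter behaves as an e-folding time constant. The sketch below shows both forms; the function names are hypothetical.

```python
import math


def e_fold_weight(days_ago, tau):
    """Weight as written in the proposal: drops to 1/e (about 0.368) after tau days."""
    return math.exp(-days_ago / tau)


def half_life_weight(days_ago, half_life):
    """True half-life form: drops to exactly 0.5 after half_life days."""
    return 0.5 ** (days_ago / half_life)


w_e = e_fold_weight(30, 30)
w_half = half_life_weight(30, 30)
```

The two forms describe the same exponential family (`tau = half_life / ln 2`), so either works as long as the parameter is tuned and named consistently.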
In leak-free, aligned offline evaluation:
Cascade-Direct (k=10):
Warm Recall@10 = 0.0285
Cold-with-signal Recall@10 = 0.0528
Hybrid Mode (Cascade k=200 -> LightGBM reranker):
Warm Recall@10 = 0.0668 (+134.4% relative gain)
Cold-with-signal Recall@10 = 0.0127 (-75.9% relative loss)
A single LightGBM ranking model trained on all data learns to heavily rely on rich user behavior features (such as historical contact rates, active collaborative filtering scores, and total views). For warm users, these features are highly predictive. However, cold-start users (login/PCI signal but no contacts) have sparse/missing values for these behavioral features. The model, trained almost exclusively on warm patterns, interprets the absence of behavioral signals as a negative indicator, penalizing cold candidates. This forces relevant cold listing recommendations to the bottom of the list.
1. Deploy a Segmented Inference Policy:
- For WARM users: Route through Cascade (k=200) -> LightGBM Reranker.
- For COLD/BLIND users: Route directly from Cascade (k=10) (no LightGBM reranker), or route through a specialized cold-start reranker.
2. A single unified ranking pipeline is suboptimal when user-state feature distributions (sparse vs dense) are highly skewed: one model fits the dense warm majority at the expense of the sparse cold minority.
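The segmented inference policy can be sketched as a small router. The segment definitions follow the warm / cold-with-signal / blind split used in the aligned eval; the function names and the returned config shape are assumptions, not the project's actual API.

```python
def classify_user(has_contacts, has_signal):
    """Warm = contact history; cold_signal = login/PCI signal but no contacts; else blind."""
    if has_contacts:
        return "warm"
    return "cold_signal" if has_signal else "blind"


def route(segment):
    """Warm users get the wide cascade plus reranker; everyone else gets cascade-direct."""
    if segment == "warm":
        return {"cascade_k": 200, "reranker": "lightgbm"}
    return {"cascade_k": 10, "reranker": None}


plan_warm = route(classify_user(has_contacts=True, has_signal=True))
plan_cold = route(classify_user(has_contacts=False, has_signal=True))
```

A specialized cold-start reranker would slot in as a third branch if one is ever trained.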
Targeted EDA on all 10,725 truly-blind validation users compared no-preference fallback strategies:
global_score7 (contacts_7d*20 + views_7d): Recall@10 = 0.001190, hits = 63
snap_hcm_prop_4_3_2_1: Recall@10 = 0.000660, hits = 43
contact_weighted_segments: Recall@10 = 0.000593, hits = 43
snap_weighted_segments: Recall@10 = 0.000575, hits = 35
global_score7_fresh: Recall@10 = 0.000538, hits = 27
global_score1_fresh: Recall@10 = 0.000510, hits = 21
global_fresh_only: Recall@10 = 0.000000
Full split-clean aligned eval after deploying snapshot fallback:
Before snapshot fallback:
Simulated LB = 0.0271
Truly blind = 0.0001
After snapshot fallback:
Simulated LB = 0.0274
Warm = 0.0633
Cold-with-signal = 0.0517
Truly blind = 0.0011
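The simulated LB value is consistent with a test-mix weighted average of the per-segment recalls. The sketch below assumes the test counts map as login → warm, non-login only → cold-with-signal, truly blind → blind; that mapping is an assumption, and the exact formula inside evaluate_aligned.py is not shown in these notes.

```python
# Test-set user counts (161,568 total), used here as segment weights.
TEST_MIX = {"warm": 58_153, "cold_signal": 12_367, "blind": 91_048}


def simulated_lb(segment_recall, mix=TEST_MIX):
    """Weighted average of per-segment Recall@10 using the test-set user mix."""
    total = sum(mix.values())
    return sum(mix[seg] / total * segment_recall[seg] for seg in mix)


after_fallback = {"warm": 0.0633, "cold_signal": 0.0517, "blind": 0.0011}
print(round(simulated_lb(after_fallback), 4))  # 0.0274
```

Because blind users carry more than half the weight but contribute almost nothing, most of the simulated score comes from the warm term.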
For users with no contact, no login signal, and no PCI preference, there is no reliable user-side personalization. The best available signal is item-side market demand from recent snapshots. Pure posted_date freshness is not enough: users contact listings that are both recent and demand-proven, not merely new.
1. Use snapshot last-7-day demand as the default no-preference blind fallback.
2. Do not use pure freshness as a blind strategy.
3. Hash/segment diversity can be used for rank-1 exposure control, but should not replace the top demand item set.
4. The remaining blind ceiling is low unless a new user-side signal source is found.
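The best-performing fallback in the comparison, `global_score7`, is a simple demand score. A minimal sketch, assuming a snapshot is a list of per-item records with 7-day contact and view counts (the record layout and function names are hypothetical):

```python
def global_score7(contacts_7d, views_7d):
    """Demand score from the EDA: contacts weighted 20x relative to views."""
    return contacts_7d * 20 + views_7d


def blind_fallback(snapshot, k=10):
    """Return the top-k item_ids by 7-day demand score."""
    ranked = sorted(
        snapshot,
        key=lambda row: global_score7(row["contacts_7d"], row["views_7d"]),
        reverse=True,
    )
    return [row["item_id"] for row in ranked[:k]]


# Toy snapshot, for illustration only.
snapshot = [
    {"item_id": "x", "contacts_7d": 1, "views_7d": 500},   # score 520
    {"item_id": "y", "contacts_7d": 30, "views_7d": 10},   # score 610
    {"item_id": "z", "contacts_7d": 0, "views_7d": 50},    # score 50
]
top = blind_fallback(snapshot, k=2)
```

The 20x weight means a listing with a few real contacts outranks one with many passive views, which matches the "demand-proven, not merely new" observation.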
Total test users: 161,568
Currently truly blind: 94,875 (58.7%)
- With LOGIN events: 0 (all login users already covered)
- With NON-LOGIN events: 4,276 (4.5% of blind)
- With NO events at all: 90,599 (95.5% of blind)
Non-login pageview users (subset of 4,276):
- Users with pageviews: 4,215
- All 4,215 have both pref_city AND pref_cat extractable
- Avg pageviews/user: 12.1 (median: 4)
- 1,187 users have REAL contacts (view_phone/chat/zalo/sms)
City distribution:
HCM: 2,983 (70.8%)
Hà Nội: 300 (7.1%)
Đà Nẵng: 292 (6.9%)
Bình Dương: 191 (4.5%)
Category distribution:
1020 (Căn hộ): 1,647 (39.1%)
1050 (Dự án): 1,105 (26.2%)
1010 (Phòng trọ): 775 (18.4%)
1040 (Đất nền): 401 (9.5%)
1030 (Nhà ở): 287 (6.8%)
These 4,215 users browsed listings on Chợ Tốt without logging in (device-level sessions). Their user_id is a device/cookie identifier, NOT a logged-in account ID. Per INS-057 and lesson #9, there are two conflicting considerations:
test_users.parquet, so Kaggle expects recommendations for them. If Kaggle's GT includes non-login contacts, these users CAN have non-zero recall.
Key fact: Overlap with both login and non-login: 0. No user_id appears in both login and non-login events, confirming these are fundamentally different identifier types.
INS-057 established that removing is_login from the ENTIRE pipeline (including ALS contact matrix) dropped LB score -59%. However, the proposed action here is DIFFERENT:
cold_user_prefs.parquet → zero impact on ALS
The key question is NOT whether non-login data hurts ALS (it does), but whether Kaggle evaluates non-login user_ids at all. This requires H-029 verification.
1. DO NOT change ALS training or contact_pairs (INS-057/058 lesson stands).
2. ONLY modify _process_cold_user_prefs to also extract preferences from non-login pageviews.
3. This is ZERO-RISK to warm/cold-with-signal users (their flow is untouched).
4. Potential upside: 4,215 users × SegPop city+cat recall ≈ 1.6% (INS-063 ceiling) / 161,568 test users ≈ +0.0004 total
5. But if Kaggle ignores non-login GT, upside = 0.
6. Verify via H-029 before spending a submission attempt.
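Extracting a city+category preference from non-login pageviews could be as simple as taking the per-user mode. This is a hypothetical sketch, not the actual body of `_process_cold_user_prefs`; the tuple layout is an assumption.

```python
from collections import Counter


def extract_prefs(pageviews):
    """pageviews: iterable of (user_id, city, cat).

    Returns {user_id: (pref_city, pref_cat)} using each user's most viewed value.
    """
    cities, cats = {}, {}
    for user_id, city, cat in pageviews:
        cities.setdefault(user_id, Counter())[city] += 1
        cats.setdefault(user_id, Counter())[cat] += 1
    return {
        user_id: (
            cities[user_id].most_common(1)[0][0],
            cats[user_id].most_common(1)[0][0],
        )
        for user_id in cities
    }


# Toy pageviews, for illustration only.
views = [
    ("d1", "Tp Hồ Chí Minh", 1020),
    ("d1", "Tp Hồ Chí Minh", 1050),
    ("d1", "Tp Hồ Chí Minh", 1020),
    ("d2", "Đà Nẵng", 1010),
]
prefs = extract_prefs(views)
```

With a median of 4 pageviews per user, ties will be common; `most_common` breaks them by first occurrence, which may or may not be the desired policy.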
Leaderboard:
v14 previous best: 0.0344
v17 current best: 0.2116 (Top5)
Relative gain: 6.15x
Validated artifact:
File: outputs/submission_1024.zip
CSV inside zip: submission.csv
Rows: 1,615,680
Users: 161,568
Columns: ID,user_id,rank,item_id
Unique items: 62,947
Rank-1 top item: 9,948 users (<10% rule)
Zip size: 41.37 MB
Validator: ALL SUBMISSION RULES PASS
Model artifact:
ALS factors: 1024
ALS user_factors: (810,411, 1024)
ALS item_factors: (696,252, 1024)
ALS model size: 5.8 GB
The winning shift was not another reranker layer. The public leaderboard rewarded a strong high-capacity collaborative retrieval model trained on all available positive contact data, then served through a direct top-10 cascade. Skipping LightGBM removes the warm/cold distribution overfit from INS-069. Increasing ALS capacity from 256 to 1024 factors greatly improves warm-user ranking, while Recency SegPop and intent fallbacks keep cold/blind users valid.
The uppercase ID column is also essential: .agent/submission_rules.md requires ID,user_id,rank,item_id. Earlier lowercase id validation was wrong for this competition.
1. Treat v17 as the new production baseline.
2. Keep config.inference_mode = "cascade" unless a new ablation beats 0.2116 on leaderboard.
3. Do not re-enable unified LightGBM for production without a clean segment-specific proof.
4. Future gains should be ablated against v17, not the old 0.034 baseline.
5. Submission validation must enforce uppercase ID and zip/gz packaging.
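A submission check along these lines would have caught the lowercase `id` mistake. The sketch covers only the rules visible in these notes (exact header, k rows per user, rank-1 share below 10%); the no-duplicate-items check is an assumption, and the full rule set lives in .agent/submission_rules.md.

```python
import csv
import io
from collections import Counter, defaultdict

EXPECTED_HEADER = ["ID", "user_id", "rank", "item_id"]


def validate_submission(csv_text, k=10, max_rank1_share=0.10):
    """Return a list of rule violations; an empty list means all checks pass."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        return [f"bad header: {header}"]
    per_user = defaultdict(list)
    for _id, user_id, rank, item_id in reader:
        per_user[user_id].append((int(rank), item_id))
    errors = []
    rank1 = Counter()
    for user_id, rows in per_user.items():
        if sorted(r for r, _ in rows) != list(range(1, k + 1)):
            errors.append(f"{user_id}: ranks are not exactly 1..{k}")
        if len({item for _, item in rows}) != len(rows):
            errors.append(f"{user_id}: duplicate items")  # assumed rule
        rank1.update(item for r, item in rows if r == 1)
    if rank1:
        top_item, n_users = rank1.most_common(1)[0]
        if n_users / len(per_user) >= max_rank1_share:
            errors.append(f"rank-1 item {top_item} is shown to {n_users} users")
    return errors
```

For the v17 artifact, the rank-1 check corresponds to 9,948 of 161,568 users, which is about 6.2% and passes.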
Submission: outputs/submission_snapshot_blind.zip
Public LB: 0.0003
Offline context before submission:
Snapshot demand fallback looked useful in aligned blind eval:
blind recall improved from ~0.0001 to ~0.0011.
Leaderboard reality:
0.0003 is near the original broken baseline range and far below v17 0.2116.
The snapshot demand signal was item-side market activity, not user-side intent. For truly-blind users, offline validation can over-reward items that were recently active in the validation window, while public LB appears to reward the high-capacity ALS/cascade list structure much more strongly. This means blind-fallback experiments are especially vulnerable to offline/LB mismatch.
1. Do not use snapshot fallback in final production submissions.
2. Keep snapshot code gated away from cascade production unless a tiny LB ablation proves benefit.
3. Treat blind fallback as low-ceiling; protect warm ALS quality first.
v17 baseline:
File: outputs/submission_1024.zip
Public LB: 0.2116
ALS factors: 1024
Mode: cascade direct
v18 experiment:
File: outputs/submission_1536.zip
Public LB: 0.2108
ALS factors: 1536
Added: time-decay ALS, pci_merge_mode=test_only, non-login cold prefs, recent_cc=5
Delta:
0.2108 - 0.2116 = -0.0008
More ALS capacity and recency weighting did not automatically improve the leaderboard score. The v18 branch was close but still worse than v17, so the current evidence says ALS1024 is the safest production capacity. The degradation is small and confounded by multiple simultaneous changes, but it is enough to reject v18 as final.
1. Keep outputs/submission_1024.zip as the protected best artifact.
2. Do not assume larger ALS factors are better beyond 1024.
3. Any future factor/time-decay work must be isolated one variable at a time.
v17 baseline:
outputs/submission_1024.zip = 0.2116
v19 conservative blend:
outputs/submission_blend_v17_9_v18_1.zip = 0.1974
Policy: keep v17 ranks 1-9, replace rank10 with first unique v18 item.
Delta:
0.1974 - 0.2116 = -0.0142
Even the tenth slot of v17 carries meaningful signal. Replacing only one item per user with v18 introduced enough noise to lose 6.7% relative score. This also suggests the public metric is sensitive to the exact v17 cascade ordering, not just rank-1 items.
1. Do not blend by mechanically replacing tail slots.
2. Treat v17 full top-10 as an atomic strong baseline.
3. Ensemble only if a learned or segment-specific policy proves it beats v17 before submission.
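For the record, the rejected v19 policy (keep v17 ranks 1-9, fill rank 10 with the first unseen v18 item) can be sketched as below. The fallback to v17's own tail when v18 offers nothing new is an assumption; the function name is hypothetical.

```python
def blend_tail(v17, v18, keep=9, k=10):
    """Keep the first `keep` items of v17, then fill up to k with unseen v18 items."""
    out = list(v17[:keep])
    seen = set(out)
    for item in v18:
        if len(out) >= k:
            break
        if item not in seen:
            out.append(item)
            seen.add(item)
    # Assumed fallback: if v18 had no unseen item, restore v17's original tail.
    for item in v17[keep:]:
        if len(out) >= k:
            break
        if item not in seen:
            out.append(item)
            seen.add(item)
    return out


v17_list = list("abcdefghij")
v18_list = ["a", "b", "x", "y"]
blended = blend_tail(v17_list, v18_list)
```

Here `j` is replaced by `x`; that single swap per user was enough to cost 0.0142 on the public leaderboard.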