Final Algorithm Story Report

Project: Real Estate Recommendation System
Date: 2026-05-21
Best protected submission: outputs/submission_1024.zip
Best public leaderboard: 0.2116
Final production philosophy: strong retrieval first, segment-aware fallback second, no unproven reranking.


0. Executive Summary

The final useful solution is not a single clever model. It is a carefully constrained recommendation system built from the data's own behavior:

Raw marketplace logs
  -> positive signal cleaning
  -> compact user/item caches
  -> high-density ContactALS retrieval
  -> intent and co-contact fallback sources
  -> segment popularity last resort
  -> direct top-10 cascade
  -> strict submission validation

The winning public artifact is:

outputs/submission_1024.zip
publicScore = 0.2116
rank = Top5 at time of submission

The decisive turn was moving from a fragile "train a big reranker over everything" mindset to a simpler marketplace philosophy:

In this task, the most valuable object is not a feature table.
It is a reliable ordered candidate list, produced by the strongest available signal for each user.

The best leaderboard result came from ALS1024 + cascade-direct, not from LightGBM hybrid, snapshot blind fallback, or ensembling:

Attempt                       | Public LB | Lesson
v14 clean cascade baseline    | 0.0344    | Warm users work, cold/blind mostly dead
v17 ALS1024 cascade-direct    | 0.2116    | Strong CF retrieval + direct cascade is the best known path
snapshot blind fallback       | 0.0003    | Offline blind gains did not transfer
v18 ALS1536/time-decay branch | 0.2108    | Bigger ALS was not better than v17
v19 v17-top9/v18-slot10 blend | 0.1974    | Even v17 rank10 carries useful signal

The protected final answer is therefore:

Use v17 artifact: outputs/submission_1024.zip
Do not replace it with v18, snapshot, hybrid, or slot-blend variants.

1. The Story: What The Data Taught Us

At the beginning, the problem looked like a standard recommender competition:

  1. Build user-item positives.
  2. Train collaborative filtering.
  3. Generate candidates.
  4. Train a ranker.
  5. Submit top-10.

That plan was directionally reasonable but incomplete. Real estate recommendation is not like recommending movies or music. A user is not trying to consume many similar items forever. A user is searching inside a constrained marketplace:

city -> district -> property type -> budget -> listing freshness -> seller/contact behavior

The EDA changed the design in four major ways.

1.1 The User Usually Wants A New Item, Not A Repeated Item

One of the strongest findings was:

85.5% of ground-truth items are completely new to the user.

That means a pure replay strategy cannot win. The model should not only repeat viewed/contacted listings. It must discover plausible new listings that are close to the user's intent.

Design consequence:

Use replay as a precise fallback, but make the main engine a retrieval model
that can discover new relevant items.

This is why ContactALS became central. ALS can recommend items that the user has never touched, based on similar users' contact behavior.

1.2 Geography Is The Main Spine Of Intent

EDA found:

91.9% of GT items match the user's preferred city.
72.2% of GT items match the user's preferred category.
City and category combined explain a large part of relevance.

In real estate, location is not a weak metadata feature. It is the frame in which almost all user intent lives.

Design consequence:

Every cold-start fallback must preserve city/category intent.
Every ranker feature or segment fallback must understand pref_city and pref_cat.

This is why the pipeline builds cold_user_prefs.parquet, why SegmentPopularity is keyed by city/category/district, and why a city-name bug was catastrophic.

1.3 Cold-Start Is The Main Enemy, But Popularity Alone Cannot Solve It

The public test distribution exposed a severe user segmentation problem:

Test users: 161,568
Warm users: 54,502  (33.7%)
Cold+prefs: 12,191  (7.5%)
Blind users: 94,875 (58.7%)

A majority of test users had no usable training contact history. But the offline experiments also showed:

Segment popularity ceiling for truly blind users: around 1.6% Recall@10
Clean truly-blind recall: near zero in realistic split-clean eval

Design consequence:

Do not over-invest in blind popularity hacks.
Protect warm-user ALS quality first.
Use segment fallback only as a necessary last resort.

This is why snapshot blind fallback was attractive offline but dangerous in production. The public LB later confirmed that it had to be rejected:

outputs/submission_snapshot_blind.zip -> 0.0003

1.4 The Leaderboard Punished Cleverness Without Alignment

Several experiments sounded good but failed:

Idea                      | Result       | Why it failed
Unified LightGBM reranker | v11 = 0.0048 | Trained/deployed distribution mismatch and cold-start overfit
Offset diversity          | v12 = 0.0050 | Pushed users away from the most relevant popular items
Remove is_login globally  | v13 = 0.0140 | Added noisy device-level/non-login interactions into ALS
Snapshot blind fallback   | 0.0003       | Item-side demand did not transfer to public LB
ALS1536 branch            | 0.2108       | Bigger capacity plus bundled changes did not beat ALS1024
Tail-slot blend           | 0.1974       | v17 top-10 ordering was already strong

The final design became more conservative:

If a change does not preserve the exact retrieval distribution that produced LB gain,
it must be treated as risky until proven.

2. Design Philosophy

The final solution is built around seven principles.

Principle 1: Ground Truth Beats Problem Text

The statement around other_interaction was misleading. The data showed:

other_interaction has is_contact = 1
pageview has is_contact = 0

So other_interaction is a positive signal. It must be included in positive events.

Implementation:

positive_events = [
    "view_phone",
    "contact_chat",
    "contact_zalo",
    "contact_sms",
    "other_interaction",
]

Philosophy:

When the competition description and the ground-truth flag disagree,
trust the label-generating data.
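
A minimal sketch of how this disagreement can be checked, assuming event rows expose event_type and is_contact as described above. The helper name contact_rate_by_event and the toy rows are illustrative and not taken from the project code.

```python
from collections import defaultdict


def contact_rate_by_event(events):
    """Return {event_type: share of rows where is_contact == 1}."""
    totals = defaultdict(int)
    contacts = defaultdict(int)
    for row in events:
        totals[row["event_type"]] += 1
        contacts[row["event_type"]] += int(row["is_contact"] == 1)
    return {etype: contacts[etype] / totals[etype] for etype in totals}


# Toy rows (hypothetical) that mirror the pattern observed in the data.
events = [
    {"event_type": "pageview", "is_contact": 0},
    {"event_type": "pageview", "is_contact": 0},
    {"event_type": "other_interaction", "is_contact": 1},
    {"event_type": "view_phone", "is_contact": 1},
]
rates = contact_rate_by_event(events)
# Event types whose rows are always flagged as contacts.
positive_like = sorted(e for e, rate in rates.items() if rate == 1.0)
```

Any event type that lands in positive_like is a candidate for positive_events, whatever the problem statement says about it.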

Principle 2: Pre-Aggregate Before Modeling

The raw tables are too large for casual joins:

fact_user_events: 161,731,336 rows
fact_listing_snapshot: 19,762,167 rows
fact_post_contact_interactions: 25,486,445 rows
dim_listing: 3,107,114 rows

The system therefore turns raw events into compact caches:

Cache                               | Purpose
.cache/contact_pairs.parquet        | login positive contacts with city/category/count/last_date
.cache/als_contact_pairs.parquet    | user-item contact matrix for ALS
.cache/als_weighted_contact.parquet | weighted ALS matrix, real contacts > soft contacts
.cache/als_pageview_pairs.parquet   | pageview matrix, available but not used in final top-10
.cache/session_items.parquet        | session co-occurrence substrate
.cache/cold_user_prefs.parquet      | city/category preferences for non-warm users
.cache/snapshot_stats.parquet       | item-side recent views/contacts/trends

Philosophy:

The model should not fight the raw data scale every time it trains.
Make the expensive facts small, typed, and reusable.

Principle 3: Retrieval Quality Is More Important Than Feature Decoration

The strongest leaderboard jump did not come from adding more feature columns. It came from upgrading the retrieval backbone:

v14: 0.0344
v17: 0.2116
relative gain: 6.15x

The main change was a high-capacity ContactALS model and a direct cascade path.

Philosophy:

At top-10, a beautiful ranker cannot rescue a bad candidate pool.
First retrieve the right 10-200 items; only then worry about reranking.

Principle 4: Density Beats Naive Size

Removing the is_login filter made the matrix bigger, but worse:

Clean login baseline: 0.034
No is_login filter:   0.014
relative drop:        -59%

The experiment showed that non-login identifiers behave differently from logged-in accounts. Adding them globally diluted collaborative structure.

Philosophy:

More rows are useful only if they preserve the identity semantics of the user_id.
Noisy scale is not signal.

Principle 5: Cascade Should Be Sequential, Not Democratic

Round-robin and diversity-heavy policies were worse. Budget-based sequential union won because it respects source strength:

Budget-based sequential union Recall@200: 0.3177
Round-robin interleave: worse
Offset diversity: LB collapse

Philosophy:

Sources are not equal. Let the strongest available signal speak first.
Fallbacks should fill holes, not compete equally with the best source.

Principle 6: Segment Cold-Start, But Do Not Overfit To It

Cold-start matters because many test users are cold or blind. But the ceiling for truly blind popularity is very low.

Philosophy:

Cold-start support is mandatory for valid submissions,
but the score is won by preserving the high-confidence warm path.

This explains why final production chooses cascade-direct for all users instead of a unified LightGBM reranker.

Principle 7: The Best Top-10 Is An Ordered Object

The v19 blend kept v17 ranks 1-9 and changed only rank10. It still dropped:

v17: 0.2116
v19: 0.1974
delta: -0.0142

Philosophy:

Do not treat the tail item as disposable.
The whole v17 top-10 ordering contains learned signal.

3. Data Sources And Their Roles

3.1 fact_user_events

This is the main behavioral table. It contains pageviews and contact-like positive events.

Used for:

  1. Contact pairs for ALS.
  2. User contact histories.
  3. Pageview replay.
  4. Intent extraction.
  5. Co-contact and user-neighbor structures.
  6. Cold user city/category preferences.

Important filters:

For ALS/contact training:
  is_login == "login"
  is_contact == 1

For positive event definitions:
  event_type in [
    view_phone,
    contact_chat,
    contact_zalo,
    contact_sms,
    other_interaction
  ]

For pageview-derived preferences:
  event_type == pageview

Important correction:

dwell_time_sec is actually milliseconds.
3 seconds means raw threshold 3000ms.
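
A small sketch of the unit correction, assuming a dwell filter that keeps views of at least 3 seconds. The helper name is_engaged_view and the inclusive comparison are assumptions made for illustration.

```python
MS_PER_SECOND = 1000  # dwell_time_sec actually stores milliseconds


def is_engaged_view(raw_dwell_time_sec, min_seconds=3):
    """True when the raw column value represents at least min_seconds."""
    return raw_dwell_time_sec >= min_seconds * MS_PER_SECOND


short_view = is_engaged_view(2500)  # 2.5 seconds of dwell
long_view = is_engaged_view(3000)   # exactly 3 seconds of dwell
```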

3.2 dim_listing

This is the item metadata table.

Used for:

  1. Valid item universe.
  2. City/category/district matching.
  3. Seller expansion.
  4. Feature engineering for ranker experiments.
  5. SegmentPopularity fallback.

Important fields:

item_id
city_name
district_name
category
price
seller_id / seller fields
listing quality fields

Important lesson:

Null metadata is not always bad data.
Many real estate fields are naturally sparse by property type.

3.3 fact_post_contact_interactions / PCI

PCI was discovered as a major hidden signal source.

Key findings:

10,654 blind test users had PCI data.
644,732 new lead pairs were not in ALS training data.
Cold+PCI prefs achieved about 30x uplift in clean eval:
  cold+prefs:     0.0612
  cold-no-prefs:  0.0020

Used for:

  1. Enriching ALS pairs with lead interactions.
  2. Building city/category preferences for blind users.
  3. Weighting purchased leads more strongly.

Caution:

PCI helps when it increases useful density.
Blind/cold PCI prefs are valuable, but broad uncontrolled merging can change matrix semantics.

3.4 fact_listing_snapshot

Snapshot contains item-side recent demand:

views_24h
contacts_24h
recent trend signals
active item signals

It was useful as a feature source in experiments, but the public LB rejected using snapshot demand as the final blind fallback:

snapshot blind fallback public LB = 0.0003

Final interpretation:

Snapshot is a useful diagnostic and possible feature source,
but it should not replace the protected v17 cascade path.

3.5 test_users.parquet

The required prediction universe:

161,568 users
10 rows per user
1,615,680 total submission rows

Test users drive:

  1. Which users need recommendations.
  2. Which cold prefs matter.
  3. Which PCI/test-only enrichment choices are possible.
  4. Final submission validation.

4. Preprocessing Flow

The preprocessing module is the bridge between raw logs and modeling.

Entry point:

DataPreprocessor.process_and_cache(lf, snapshot_path)

It runs the following transformations.

4.1 Contact Pairs

Output:

.cache/contact_pairs.parquet

Logic:

filter is_login == login
filter event_type in positive_events
group by user_id, item_id, city_name, category
aggregate count and last_date

Purpose:

  1. Build user histories.
  2. Train SegmentPopularity.
  3. Build co-contact structures.
  4. Identify warm users.
  5. Build preference modes.

Why it matters:

This cache preserves city/category with the interaction,
so downstream fallback can stay location-aware.
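
The aggregation above can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's DataPreprocessor code: the event_date field name, the "guest" value for non-login rows, and the toy city/category values are hypothetical.

```python
POSITIVE_EVENTS = {
    "view_phone", "contact_chat", "contact_zalo", "contact_sms",
    "other_interaction",
}


def build_contact_pairs(events):
    """Return {(user_id, item_id, city_name, category): (count, last_date)}."""
    pairs = {}
    for e in events:
        if e["is_login"] != "login" or e["event_type"] not in POSITIVE_EVENTS:
            continue
        key = (e["user_id"], e["item_id"], e["city_name"], e["category"])
        count, last_date = pairs.get(key, (0, ""))
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        pairs[key] = (count + 1, max(last_date, e["event_date"]))
    return pairs


def _event(user, item, etype, day, login="login"):
    """Build one toy event row (hypothetical schema)."""
    return {"user_id": user, "item_id": item, "event_type": etype,
            "event_date": day, "is_login": login,
            "city_name": "city_a", "category": "apartment"}


events = [
    _event("u1", "i1", "view_phone", "2026-04-01"),
    _event("u1", "i1", "other_interaction", "2026-04-03"),
    _event("u1", "i2", "pageview", "2026-04-02"),               # not positive
    _event("u2", "i1", "contact_chat", "2026-04-02", "guest"),  # not login
]
contact_pairs = build_contact_pairs(events)
```

Only the two login positive events for (u1, i1) survive, collapsed into a single row with count 2 and the later date.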

4.2 ALS Contact Pairs

Output:

.cache/als_contact_pairs.parquet

Logic:

filter is_login == login
filter is_contact == 1
group by user_id, item_id
score = count

Purpose:

Sparse implicit-feedback matrix for ContactALS.

4.3 Weighted ALS Pairs

Output:

.cache/als_weighted_contact.parquet

Logic:

real contacts:
  view_phone, contact_chat, contact_zalo, contact_sms -> weight 3

soft positive:
  other_interaction -> weight 1

group by user_id, item_id
score = sum(weight)

Purpose:

Give stronger weight to high-intent contact actions,
while keeping other_interaction as a valid but softer positive.
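
A minimal sketch of the weighting rule, assuming the rows passed in were already restricted upstream to the events that belong in the ALS matrix. The function name weighted_scores and the tuple layout are illustrative.

```python
from collections import defaultdict

# Weights from the rule above: real contacts count 3, soft positives count 1.
EVENT_WEIGHTS = {
    "view_phone": 3,
    "contact_chat": 3,
    "contact_zalo": 3,
    "contact_sms": 3,
    "other_interaction": 1,
}


def weighted_scores(events):
    """events: iterable of (user_id, item_id, event_type) tuples."""
    scores = defaultdict(int)
    for user_id, item_id, event_type in events:
        weight = EVENT_WEIGHTS.get(event_type)
        if weight is not None:  # non-positive events such as pageview are skipped
            scores[(user_id, item_id)] += weight
    return dict(scores)


events = [
    ("u1", "i1", "view_phone"),
    ("u1", "i1", "other_interaction"),
    ("u1", "i2", "other_interaction"),
    ("u1", "i2", "pageview"),
]
scores = weighted_scores(events)
```

Here (u1, i1) scores 3 + 1 = 4 while (u1, i2) scores 1, so a direct contact outweighs a soft positive on the same listing.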

4.4 Pageview Pairs

Output:

.cache/als_pageview_pairs.parquet

Logic:

filter is_login == login
filter event_type == pageview
group by user_id, item_id
aggregate view_count and avg_dwell

Purpose:

  1. Available for ViewALS.
  2. Used by feature engineering.
  3. Helps diagnose browsing intent.

Final decision:

ViewALS is disabled in final cascade budgets because it diluted the candidate pool.

4.5 Session Items

Output:

.cache/session_items.parquet

Logic:

filter login events with session_id
group item_ids by session_id
keep sessions with 2 <= n_items <= 30

Purpose:

Support session-level co-occurrence ideas and diagnostics.
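
One way to read the session logic, sketched under the assumption that n_items counts distinct items per session (the report does not say whether repeats count). The function name and tuple layout are illustrative.

```python
from collections import defaultdict


def build_session_items(events, min_items=2, max_items=30):
    """events: iterable of (session_id, item_id); rows with no session are skipped."""
    sessions = defaultdict(list)
    for session_id, item_id in events:
        if session_id is None:
            continue
        if item_id not in sessions[session_id]:  # keep first-seen order, no repeats
            sessions[session_id].append(item_id)
    return {
        sid: items
        for sid, items in sessions.items()
        if min_items <= len(items) <= max_items
    }


events = [("s1", "i1"), ("s1", "i2"), ("s1", "i1"), ("s2", "i3"), (None, "i4")]
events += [("s3", f"x{j}") for j in range(31)]  # oversized session, dropped
session_items = build_session_items(events)
```

Single-item sessions carry no co-occurrence information, and very long sessions tend to mix unrelated intents, which is the usual motivation for bounding the size on both sides.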

4.6 Cold User Preferences

Output:

.cache/cold_user_prefs.parquet

Logic:

identify warm users from positive contacts
for non-warm users, aggregate pageview city/category modes
pref_city = mode(city_name)
pref_cat  = mode(category)

Purpose:

Convert users without contact history into users with at least city/category intent.

Important warning:

The current working code includes the H-029 extension that allows non-login
pageviews for preference extraction. That is different from removing is_login
globally from ALS. Global non-login ALS was already rejected.
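
The mode-based extraction in the logic above can be sketched as follows. The tuple layout and function name are illustrative, and tie-breaking between equally frequent values is left to Counter, which is an assumption rather than the project's rule.

```python
from collections import Counter, defaultdict


def cold_user_prefs(pageviews, warm_users):
    """pageviews: iterable of (user_id, city_name, category) for viewed listings."""
    cities = defaultdict(Counter)
    cats = defaultdict(Counter)
    for user_id, city_name, category in pageviews:
        if user_id in warm_users:  # warm users get prefs from contact history
            continue
        cities[user_id][city_name] += 1
        cats[user_id][category] += 1
    return {
        user_id: {
            "pref_city": cities[user_id].most_common(1)[0][0],
            "pref_cat": cats[user_id].most_common(1)[0][0],
        }
        for user_id in cities
    }


pageviews = [
    ("u1", "city_a", "apartment"),
    ("u1", "city_a", "house"),
    ("u1", "city_b", "apartment"),
    ("w1", "city_b", "land"),
]
prefs = cold_user_prefs(pageviews, warm_users={"w1"})
```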

4.7 Snapshot Stats

Output:

.cache/snapshot_stats.parquet

Features:

item_avg_views_7d
item_avg_contacts_7d
item_conversion_rate
item_trend_score
item_is_active

Final decision:

Useful for analysis and hybrid feature parity,
but snapshot blind fallback is not part of the protected v17 final.

5. From Insight To Feature

This section maps each major insight to its algorithmic consequence.

5.1 other_interaction Is Positive

Insight:

other_interaction has is_contact=1.

Feature/model consequence:

Add other_interaction to positive_events.
Use it in contact pairs and ALS.
Weight it lower than direct contacts in weighted ALS.

Why:

It contains positive behavioral information, but it is weaker than a phone/chat/Zalo/SMS contact.

5.2 City And Category Dominate Relevance

Insight:

91.9% city match.
72.2% category match.

Feature/model consequence:

Build pref_city and pref_cat for every user possible.
Use pref_city/pref_cat in SegmentPopularity.
Use city_match and cat_match in ranker experiments.
Build RecentCC by (city, category).

Why:

Real estate intent is geographically anchored.
No fallback should ignore geography if any user signal exists.

5.3 Most GT Items Are New To The User

Insight:

85.5% of GT items are new to the user.

Feature/model consequence:

Do not rely on pure replay.
Use ContactALS, IntentRecommender, CoContact, UserKNN, SellerExpansion.

Why:

The model must generalize from past behavior to unseen listings.

5.4 Test Distribution Is Cold-Heavy

Insight:

Warm: 33.7%
Cold+prefs: 7.5%
Blind: 58.7%

Feature/model consequence:

Every user must have 10 valid recommendations.
Build a fallback chain ending in SegmentPopularity.
Use PCI/cold prefs to rescue any cold user possible.

Why:

A pure ALS solution leaves too many users uncovered.

5.5 Popularity Has A Low Ceiling

Insight:

Segment popularity ceiling for blind users is around 1.6% Recall@10.

Feature/model consequence:

Use SegPop as last resort, not as the main scorer.
Protect ALS-first warm recommendations.

Why:

Blind users have no user-side intent. Popularity can keep submissions valid,
but cannot carry a top solution alone.

5.6 Login Identity Is Cleaner Than Non-Login Scale

Insight:

Removing is_login globally dropped LB from 0.034 to 0.014.

Feature/model consequence:

Keep login-only ALS/contact matrix.
Do not train collaborative embeddings on mixed identity semantics.
Only consider non-login pageviews for isolated city/category preference extraction.

Why:

Collaborative filtering needs stable user identity.
Anonymous/device-like IDs can pollute the matrix.

5.7 ALS Capacity Was The Breakthrough

Insight:

ALS1024 cascade-direct reached 0.2116.
ALS1536 branch reached 0.2108.

Feature/model consequence:

Use ALS1024 as protected production baseline.
Do not assume larger factors improve public LB.

Why:

Capacity helps until it does not. The only proven high-score artifact is 1024.

5.8 Reranker Risk Is Real

Insight:

LightGBM hybrid destroyed cold recall and failed on the public LB in earlier submissions.

Feature/model consequence:

Final inference_mode = cascade.
Do not route final production through unified LightGBM.

Why:

The ranker can overfit warm dense features and mis-handle cold sparse candidates.

6. Candidate Sources

The final system is an ensemble at the retrieval level, not at the score-blending level.

6.1 ContactALS

Role:

Main warm-user retrieval engine.

Input:

user_id, item_id, score

Where score comes from:

weighted positive contacts:
  real contacts weight 3
  other_interaction weight 1
optional PCI lead weights in experimental branches

Final protected model:

ALS factors: 1024
iterations: 30
regularization: 0.01
artifact: outputs/models/als/
model size: about 5.8GB
user_factors: (810,411, 1024)
item_factors: (696,252, 1024)
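
As a sanity check on the reported size, the factor matrices alone can be costed out. This sketch assumes the factors are stored as 4-byte floats, which the report does not state.

```python
N_USERS = 810_411
N_ITEMS = 696_252
FACTORS = 1024
BYTES_PER_VALUE = 4  # assumption: float32 storage

# One row of FACTORS values per user and per item.
factor_bytes = (N_USERS + N_ITEMS) * FACTORS * BYTES_PER_VALUE
factor_gib = factor_bytes / 2**30
```

Under that assumption the two matrices take 6,171,291,648 bytes, roughly 5.75 GiB, which is consistent with the "about 5.8GB" figure above.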

Why it works:

It converts sparse contact histories into dense user/item embeddings,
allowing discovery of unseen listings.

Why it is first in the final cascade:

For warm users, ALS recommendations are the highest-confidence top-10 source.

6.2 IntentRecommender

Role:

Intent matching from pageviews to similar current listings.

Core idea:

If a user browses listings in a district/category/price region,
recommend other valid listings in that intent bucket.

Feature basis:

pageview item_id -> listing metadata -> district/category/price intent

Why it exists:

Pageviews are weaker than contacts, but they expose search intent,
especially for users without contacts.

6.3 PageviewReplay

Role:

Replay recently viewed items as a precise but narrow signal.

Window:

14 days

Why it is not the main source:

85.5% of GT items are new to the user, so pure replay cannot dominate.

6.4 CoContact

Role:

Item-to-item expansion from recent contact history.

Core idea:

Users who contacted item A also contacted item B.
If current user contacted A, recommend B.

Window:

30 days

Why it exists:

Real estate shoppers often compare listings in the same micro-market.
Co-contact captures that local comparison behavior.
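
A minimal co-occurrence sketch of the idea, assuming contact histories are available as one set of item_ids per user. The 30-day window is not modeled here; it would be applied when building user_items. Function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fit_cocontact(user_items):
    """user_items: {user_id: set of contacted item_ids} -> item -> Counter of co-items."""
    co = defaultdict(Counter)
    for items in user_items.values():
        for a, b in permutations(items, 2):  # every ordered pair within one user
            co[a][b] += 1
    return co


def recommend_cocontact(co, history, k):
    """Score items co-contacted with the user's history, excluding the history itself."""
    scores = Counter()
    for a in history:
        for b, n in co.get(a, {}).items():
            if b not in history:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]


co = fit_cocontact({"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"A", "C"}})
recs = recommend_cocontact(co, history={"A"}, k=2)
```

Two users paired A with B and one paired A with C, so a user who contacted A is offered B first, then C.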

6.5 UserKNN

Role:

Neighbor-based collaborative fallback.

Core idea:

Find users who overlap on contacted items, then recommend their other items.

Why it exists:

It is a simpler local CF signal that can complement ALS.
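
The neighbor idea can be sketched as below, assuming raw overlap counts as the similarity measure. The project may use a different similarity or neighbor count; n_neighbors and the function name are hypothetical.

```python
from collections import Counter


def userknn_recommend(user_items, target, k, n_neighbors=2):
    """user_items: {user_id: set of contacted item_ids}."""
    history = user_items[target]
    overlaps = Counter()
    for other, items in user_items.items():
        shared = len(history & items)
        if other != target and shared:
            overlaps[other] = shared
    scores = Counter()
    for neighbor, shared in overlaps.most_common(n_neighbors):
        for item in user_items[neighbor] - history:
            scores[item] += shared  # closer neighbors vote with more weight
    return [item for item, _ in scores.most_common(k)]


user_items = {
    "u": {"a", "b"},
    "v": {"a", "b", "c"},  # shares two items with u
    "w": {"a", "d"},       # shares one item with u
    "x": {"z"},            # shares nothing, never a neighbor
}
recs = userknn_recommend(user_items, "u", k=2)
```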

6.6 SellerExpansion

Role:

Recommend other listings from sellers the user has interacted with.

Why it exists:

Real estate sellers often list similar properties in the same area or category.

Risk:

Seller affinity is useful as fallback, but weaker than user/item CF.

6.7 RecentCC

Role:

Recent popular contacts by (city, category).

Window:

7 days

Why it exists:

Real estate inventory is time-sensitive.
Recent demand in the same city/category is a better fallback than old global popularity.
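
A sketch of a windowed (city, category) popularity table, assuming each contact row carries a date plus the listing's city_name and category. The window boundary handling (exclusive start, inclusive end) and the function names are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def fit_recent_cc(contacts, as_of, window_days=7):
    """Count contacts per item inside the last window_days, keyed by (city, category)."""
    cutoff = as_of - timedelta(days=window_days)
    counts = defaultdict(Counter)
    for c in contacts:
        if cutoff < c["date"] <= as_of:
            counts[(c["city_name"], c["category"])][c["item_id"]] += 1
    return counts


def recommend_recent_cc(counts, pref_city, pref_cat, k):
    ranked = counts.get((pref_city, pref_cat), Counter())
    return [item for item, _ in ranked.most_common(k)]


def _contact(item, day):
    return {"item_id": item, "date": day, "city_name": "city_a", "category": "apartment"}


contacts = (
    [_contact("i1", date(2026, 4, 29))] * 2
    + [_contact("i2", date(2026, 4, 28))]
    + [_contact("i3", date(2026, 4, 10))] * 3  # popular, but outside the window
)
counts = fit_recent_cc(contacts, as_of=date(2026, 4, 30))
recs = recommend_recent_cc(counts, "city_a", "apartment", k=3)
```

The older listing i3 has the most contacts overall but is excluded, which is the point of preferring recent demand over stale popularity.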

6.8 SegmentPopularity

Role:

Last resort fallback.

Cascade levels:

(city, category, district)
  -> (city, category)
  -> city
  -> category
  -> global

Pool sizes:

global_k = 500
segment_k = 500
cc_k = 500
ccd_k = 100

Why it exists:

Every test user must receive 10 valid unique items.
When all personalized sources fail, SegPop guarantees coverage.

Why it is last:

Popularity alone has a low ceiling for truly blind users.
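
The level-by-level fallback can be sketched as a walk from the most specific segment to the global list. The tables layout below is hypothetical, and the pool sizes (global_k, segment_k, cc_k, ccd_k) are not modeled; they would cap each pre-built list.

```python
def segpop_recommend(tables, city, category, district, k):
    """Fill up to k items, trying the most specific segment first."""
    levels = [
        ("ccd", (city, category, district)),
        ("cc", (city, category)),
        ("city", (city,)),
        ("category", (category,)),
        ("global", ()),
    ]
    recs = []
    for level, key in levels:
        if any(part is None for part in key):
            continue  # this level needs a preference the user does not have
        for item in tables.get(level, {}).get(key, []):
            if item not in recs:
                recs.append(item)
                if len(recs) == k:
                    return recs
    return recs


# Toy popularity lists (hypothetical), each already sorted by popularity.
tables = {
    "ccd": {("city_a", "apartment", "d1"): ["a", "b"]},
    "cc": {("city_a", "apartment"): ["b", "c"]},
    "city": {("city_a",): ["d"]},
    "category": {("apartment",): ["e"]},
    "global": {(): ["f", "g", "a"]},
}
full_prefs = segpop_recommend(tables, "city_a", "apartment", "d1", k=4)
city_only = segpop_recommend(tables, "city_a", None, None, k=3)
blind = segpop_recommend(tables, None, None, None, k=2)
```

A truly blind user skips every keyed level and receives only the global list, which is why this source has such a low ceiling for that segment.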

7. Cascade Algorithm

The cascade is the core serving algorithm.

It is not a boosting model, not a LightGBM cascade, and not a ranker by itself. It is a deterministic priority-based slot filler.

7.1 Direct Top-10 Mode

Final mode:

inference_mode = cascade
k = 10

Current top-10 source order:

1. ALS
2. Intent
3. CoContact
4. PageviewReplay
5. UserKNN
6. SellerExpansion
7. RecentCC
8. SegmentPopularity

Important nuance:

Budgets are caps, not guaranteed allocations.
The cascade stops as soon as it has 10 unique valid items.

So for a warm user with good ALS coverage:

ALS may fill all 10 slots.
No lower source is needed.

For a cold user:

ALS returns nothing or little.
Intent/pageview/recent_cc/segpop fill the remaining slots.

Pseudo-code:

def cascade_generate(user, k=10):
    seen = set()
    recs = []

    for source in source_order_top10:
        budget = budget_top10[source]
        if budget <= 0:
            continue

        candidates = source.recommend(user, budget)

        for item in candidates:
            if item not in seen and item in valid_items:
                recs.append(item)
                seen.add(item)
                if len(recs) == k:
                    return recs

    return recs[:k]

7.2 Why Sequential Beats Round-Robin

Round-robin assumes sources are equally trustworthy. EDA rejected that assumption.

Sequential priority works because:

ALS is stronger for warm users.
Intent/PV are useful only when user behavior supports them.
SegPop is a fallback, not a peer to ALS.

The design gives each source a role:

Source type              | Role
ALS                      | Primary high-confidence retrieval
Intent/PV                | Behavioral intent recovery
CoContact/UserKNN/Seller | Collaborative/local expansion
RecentCC/SegPop          | Coverage and cold fallback
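
The difference between the two policies is easiest to see on a toy example. This sketch omits budgets and item validation and uses made-up candidate lists; the function names are illustrative.

```python
from itertools import zip_longest


def sequential_union(source_lists, k):
    """Exhaust each source in priority order before touching the next one."""
    recs = []
    for candidates in source_lists:
        for item in candidates:
            if item not in recs:
                recs.append(item)
                if len(recs) == k:
                    return recs
    return recs


def round_robin(source_lists, k):
    """Take one item from every source in turn, treating sources as equals."""
    recs = []
    for tier in zip_longest(*source_lists):
        for item in tier:
            if item is not None and item not in recs:
                recs.append(item)
                if len(recs) == k:
                    return recs
    return recs


als = ["a1", "a2", "a3", "a4"]     # strong personalized candidates
segpop = ["p1", "p2", "p3", "p4"]  # generic popularity fallback

sequential = sequential_union([als, segpop], k=4)
interleaved = round_robin([als, segpop], k=4)
```

With k=4, the sequential policy returns only ALS items, while round-robin gives half of the slots to popularity items even though ALS had enough candidates to fill the list.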

7.3 Why Direct Cascade Beat Hybrid In Production

Hybrid mode exists:

Cascade k=200 -> feature engineering -> LightGBM LambdaRank -> top10

But final production does not use it because:

Earlier LightGBM submissions collapsed on LB.
Unified ranker overfit warm feature density.
Cold-start candidates lack many behavioral features.
Segmented hybrid was still risky.

Direct cascade preserves the retrieval distribution that the leaderboard rewarded.


8. Training Flow

The training pipeline has two possible personalities:

  1. Cascade production: train retrieval artifacts and stop.
  2. Hybrid experimental: train retrieval artifacts, generate candidate pools, build features, train LightGBM.

The protected best solution uses the first personality.

8.1 Split And Ground Truth

For offline evaluation, contacts are split by time:

train_contacts = contacts with last_date <= split_date
val_contacts   = contacts with last_date > split_date

Why this matters:

If ALS/SegPop are trained on full data including validation period,
offline blind recall is inflated.

Confirmed leak:

Blind recall dropped from 0.1654 to 0.0004 after split-clean retraining.
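
A minimal sketch of the time split, assuming each contact row carries a comparable last_date. The row layout and dates are illustrative.

```python
from datetime import date


def split_contacts(contacts, split_date):
    """Rows on or before split_date train the models; later rows are ground truth."""
    train = [c for c in contacts if c["last_date"] <= split_date]
    val = [c for c in contacts if c["last_date"] > split_date]
    return train, val


contacts = [
    {"user_id": "u1", "item_id": "i1", "last_date": date(2026, 4, 1)},
    {"user_id": "u1", "item_id": "i2", "last_date": date(2026, 4, 20)},
    {"user_id": "u2", "item_id": "i3", "last_date": date(2026, 4, 10)},
]
train_contacts, val_contacts = split_contacts(contacts, date(2026, 4, 10))
```

For a split-clean evaluation, ALS and SegPop would be fit on train_contacts only, so that nothing in val_contacts can reach the models being scored.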

8.2 SegmentPopularity Training

Input:

train_contacts
valid_items from dim_listing
listing metadata

Output:

outputs/models/segpop.pkl

Function:

Build popular item lists by city/category/district hierarchy.

Critical warning:

Training pipeline can overwrite segpop.pkl.
The protected production state must use the recency-aware SegPop artifact.

8.3 ContactALS Training

Input:

ALS user-item pairs

Best protected configuration:

factors = 1024
iterations = 30
regularization = 0.01
GPU = enabled

Output:

outputs/models/als/

Why 1024:

v17 ALS1024 achieved 0.2116.
v18 ALS1536 achieved 0.2108.
Therefore 1024 is the best proven capacity.

8.4 ViewALS Decision

ViewALS trains collaborative filtering on pageview pairs.

It was disabled because:

als_view diluted the candidate pool.
Disabling it improved Recall@200 by about 5.4%.
It also creates memory pressure.

Final config behavior:

als_view budget = 0
do not load stale als_view artifact

8.5 LightGBM Ranker Is Experimental, Not Final

Ranker features include:

source flags
ALS scores
user behavior stats
item contact/view stats
item quality metadata
city/category/price match features
snapshot stats
seller affinity
recent history

But final mode skips this path:

inference_mode = cascade

Reason:

The public leaderboard repeatedly punished ranker/hybrid variants.

9. Inference Flow

The final inference pipeline does the following.

9.1 Load Required Data

Inputs:

test_users.parquet
dim_listing
.cache/contact_pairs.parquet
.cache/cold_user_prefs.parquet
outputs/models/als/
outputs/models/segpop.pkl

It also fits or loads runtime candidate sources:

PageviewReplay
CoContact
RecentCC
IntentRecommender
UserKNN
SellerExpansion

9.2 Build User Preferences

Preference priority:

1. Contact history preferences for warm users
2. cold_user_prefs for users without contacts
3. no prefs for truly blind users

Fields:

pref_city
pref_cat

Usage:

RecentCC and SegPop use these preferences for location/category-aware fallback.

9.3 Generate Recommendations

For every test user:

call CascadeCandidateGenerator.generate_batch(...)
request k = 10
validate item_id in valid_items
deduplicate per user

The result is a dictionary:

user_id -> [item_1, item_2, ..., item_10]

9.4 Build Submission

Required format:

ID,user_id,rank,item_id

Required shape:

161,568 users * 10 ranks = 1,615,680 rows
rank in 1..10
no duplicate item per user
item_id must exist in dim_listing
rank-1 top item must not exceed 10% of users
zip/gz size <= 100MB

v17 validation:

Rows: 1,615,680
Users: 161,568
Unique items: 62,947
Rank-1 top item: 9,948 users
Zip size: 41.37MB
Validator: PASS
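
The shape rules above can be sketched as a small checker. This is not the project's validator: the function name, row layout, and messages are hypothetical, and the file-size and header checks are left out. For reference, v17's 9,948 rank-1 users are about 6.2% of 161,568, under the 10% cap.

```python
from collections import Counter, defaultdict


def validate_submission(rows, test_users, valid_items, k=10, max_rank1_share=0.10):
    """rows: iterable of (user_id, rank, item_id). Returns a list of problems."""
    problems = []
    per_user = defaultdict(dict)  # user_id -> {rank: item_id}
    for user_id, rank, item_id in rows:
        if rank in per_user[user_id]:
            problems.append(f"user {user_id}: rank {rank} appears twice")
        per_user[user_id][rank] = item_id
        if item_id not in valid_items:
            problems.append(f"user {user_id}: unknown item {item_id}")
    if set(per_user) != set(test_users):
        problems.append("user set differs from test users")
    expected_ranks = list(range(1, k + 1))
    for user_id, ranked in per_user.items():
        if sorted(ranked) != expected_ranks:
            problems.append(f"user {user_id}: ranks are not 1..{k}")
        if len(set(ranked.values())) != len(ranked):
            problems.append(f"user {user_id}: duplicate item")
    rank1_counts = Counter(r[1] for r in per_user.values() if 1 in r)
    if rank1_counts and max(rank1_counts.values()) > max_rank1_share * len(test_users):
        problems.append("rank-1 top item exceeds the share cap")
    return problems


# Toy check with k=2: 20 users, each with two distinct valid items.
test_users = [f"u{i}" for i in range(20)]
valid_items = {f"i{j}" for j in range(40)}
good_rows = [(u, r, f"i{2 * n + r - 1}") for n, u in enumerate(test_users) for r in (1, 2)]
bad_rows = [(u, r, f"i{r - 1}") for u in test_users for r in (1, 2)]  # same rank-1 item for all
good_report = validate_submission(good_rows, test_users, valid_items, k=2)
bad_report = validate_submission(bad_rows, test_users, valid_items, k=2)
```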

10. Final Model As A System Diagram

                         RAW TABLES
                             |
        ------------------------------------------------
        |                    |                         |
 fact_user_events     dim_listing       fact_post_contact_interactions
        |                    |                         |
        |                    |                         |
        v                    v                         v
  DataPreprocessor      valid item universe       PCILoader
        |                    |                         |
        |                    |                         |
        +-------- compact caches ----------------------+
                             |
                             v
             ---------------------------------
             |                               |
        ContactALS 1024              SegmentPopularity
             |                               |
             |                               |
             +--------------+----------------+
                            |
                            v
                  CascadeCandidateGenerator
                            |
      -------------------------------------------------
      |       |       |       |       |       |       |
     ALS   Intent  CoContact   PV   UserKNN Seller RecentCC
      |       |       |       |       |       |       |
      +------------------- sequential union ----------+
                            |
                            v
                     Top-10 direct list
                            |
                            v
                  submission_1024.zip
                            |
                            v
                     Public LB = 0.2116

11. Why The Final Solution Works

11.1 It Solves Warm Users Strongly

Warm users are the segment with the richest signal. ContactALS uses their positive interactions to retrieve new items with similar collaborative structure.

Why this matters:

Warm users were the segment that already explained most of the early LB score.
Improving warm retrieval created the largest jump from 0.0344 to 0.2116.

11.2 It Does Not Break Cold Users To Help Warm Users

Unified LightGBM tried to use dense warm features for everyone. That harmed cold users.

The final cascade avoids that:

If a source has no signal for a user, it simply contributes nothing.
The next source fills the slots.

This is safer than forcing every candidate through one global scoring function.

11.3 It Keeps The Fallback Chain Valid

Even if a user has no ALS recommendations:

Intent -> CoContact -> PV -> UserKNN -> Seller -> RecentCC -> SegPop

will eventually produce valid items.

This matters because submission failure is not just low recall. Invalid shape, duplicate items, or missing users would kill the run.

11.4 It Respects The Marketplace

The model is not just mathematical. It encodes marketplace facts:

location matters
category matters
recent demand matters
seller context can matter
positive contact behavior is stronger than browsing
anonymous browsing is noisy for CF

That is the difference between a generic recommender and a real estate recommender.


12. What Was Rejected And Why

12.1 Unified LightGBM Reranker

Evidence for rejection:

v11 = 0.0048

Problems:

trained on one candidate distribution, inferred on another
overfit warm behavioral features
cold-start candidates were feature-sparse

Lesson:

Do not use a global ranker unless its training candidates exactly match inference candidates
and segment-level behavior is validated.

12.2 Offset Diversity

Evidence for rejection:

v12 = 0.0050

Problem:

Diversity pushed users away from the most relevant popular items.

Lesson:

In cold-start fallback, the top popular items are often top for a reason.
Do not diversify blindly.

12.3 Removing is_login

Evidence for rejection:

v13 = 0.0140
previous clean baseline about 0.034
relative drop about -59%

Problem:

Non-login IDs do not behave like stable account IDs.
They polluted ALS density and lowered embedding quality.

Lesson:

Keep login-only collaborative training.

12.4 Snapshot Blind Fallback

Evidence for rejection:

public LB = 0.0003

Problem:

Offline blind uplift did not transfer.
Item-side demand was not enough to solve user-side missing intent.

Lesson:

Do not let blind fallback experiments override the strong warm retrieval path.

12.5 ALS1536 / Time-Decay Branch

Evidence for rejection:

v18 = 0.2108
v17 = 0.2116

Problem:

The branch bundled multiple changes and did not beat the protected baseline.

Lesson:

ALS1024 remains the best proven capacity.
Future factor/time-decay work must isolate one variable at a time.

12.6 Slot-Level Blend

Evidence for rejection:

v19 = 0.1974

Policy:

keep v17 ranks 1-9
replace rank10 with first unique v18 item

Problem:

Even this tiny replacement damaged the list.

Lesson:

Treat v17 top-10 as an ordered object, not as nine good items plus one disposable slot.

13. Evaluation Philosophy

The project learned that offline evaluation can be misleading unless it matches test conditions.

13.1 The Original Offline Eval Was Too Warm

Problem:

val_users = users with validation contacts

This selects users who are active in the validation period, which over-represents warm users.

Test reality:

Test is much colder and more blind.

13.2 Full-Data Models Leaked Validation

If ALS/SegPop are trained on full data, validation-period contacts leak into item popularity and embeddings.

Evidence:

Blind recall before clean retrain: 0.1654
Blind recall after clean retrain:  0.0004

Lesson:

Any offline claim must specify whether models were split-clean retrained.

13.3 Leaderboard Is The Final Arbiter For Distribution Mismatch

Even good clean eval cannot fully simulate public LB. The snapshot fallback is the clearest example:

offline looked directionally useful
public LB = 0.0003

Final rule:

Use offline eval to reject bad ideas cheaply.
Use LB only for isolated, high-confidence changes.
Protect the best proven artifact.

14. Current Artifact And Reproducibility Warnings

14.1 Protected Best Artifact

The best known file is:

outputs/submission_1024.zip

It is already validated and scored:

publicScore = 0.2116

14.2 Current Working Tree May Contain Experimental Knobs

At the time of this report, the code/config had later experiment settings such as:

als_factors = 1536
als_time_decay_half_life = 30.0
pci_merge_mode = test_only
non-login pageview preference extraction

These settings are not the protected v17 proof by themselves.

Important distinction:

Best artifact: outputs/submission_1024.zip
Current code: may include post-v17 experiments

If reproducing v17 exactly, restore/freeze the v17-compatible configuration before retraining or packaging.

14.3 Submission Contract Must Be Preserved

The official format requires uppercase ID:

ID,user_id,rank,item_id

This matters. A validator that expects a lowercase id column would be wrong for this competition.


15. The Final Algorithm In Plain Language

For each test user:

  1. Check whether the user has historical positive contacts.
  2. Build the user's preferred city/category if possible.
  3. Ask the ALS model for the best unseen listings.
  4. If ALS cannot fill 10 slots, use intent/pageview/co-contact/user-neighbor/seller sources.
  5. If still short, use recent city-category popular listings.
  6. If still short, use segment/global popularity.
  7. Deduplicate items.
  8. Keep the first 10 items in cascade order.
  9. Write ranks 1 to 10.

The algorithm is simple by design:

The first source that knows something reliable about the user gets priority.
The fallback source only acts when the stronger source is silent.

16. Future Work, If More Submissions Existed

Future work should be conservative and isolated.

16.1 Restore And Freeze v17

Before any final submission:

submit outputs/submission_1024.zip directly
or restore exact v17 config and regenerate only if necessary

16.2 Isolated Factor Ablation

Do not bundle:

factors + time decay + PCI mode + cold prefs

Test one at a time:

ALS768 vs ALS1024
ALS1024 no decay vs ALS1024 decay
ALS1024 existing_only PCI vs ALS1024 no PCI

16.3 Segment-Specific Ranker Only

If LightGBM returns:

train separate rankers by segment
warm ranker only for warm users
cold ranker only for cold-with-pref users
never route truly blind users through warm feature assumptions

16.4 Better Cold Signal Discovery

The only promising cold path is not better popularity. It is finding actual user-side signal:

PCI prefs
login pageviews
safe non-login pref extraction
query/session-derived intent
device/account mapping if valid

16.5 Stronger Evaluation Protocol

Required before trusting any future idea:

test-aligned user mix
split-clean model retraining
segment-level recall
artifact-level submission validation
comparison against v17, not old v14

17. Final Design Statement

The final system is built on a restrained idea:

Use the strongest personalized retrieval signal when it exists.
Use intent-aware fallback when it does not.
Use popularity only to guarantee coverage.
Do not let a complex reranker or clever ensemble disturb a proven top-10 list.

The project's central lesson is that recommendation quality came less from adding layers and more from respecting the structure of the marketplace:

real estate is local,
contact is stronger than browsing,
identity quality matters more than row count,
cold-start has a hard ceiling without user intent,
and the final top-10 order is precious.

That is why the protected final answer remains:

outputs/submission_1024.zip
public LB: 0.2116
algorithm: ALS1024 + direct priority cascade