Steam Recommender System

Recommending Games Based on User Playtime

Data Science

Machine Learning

Recommender Systems

Video Games

End-to-end recommender system pipeline: ingesting 5M+ Steam playtime records, applying implicit-feedback ALS, and achieving 79.79% NDCG@10 on held-out data.

Author

Matthew Burns

Published

January 19, 2026

Modified

May 26, 2026

TL;DR

The objective was to build a collaborative-filtering recommender system for Steam games using an implicit-feedback Alternating Least Squares (ALS) model.
The full pipeline covers data ingestion from a raw gzipped JSON-lines file, cleaning and reshaping 5M+ user-item interactions, hyperparameter tuning via grid search, and final evaluation.
Key result: the model achieved NDCG@10 = 79.79% on a held-out test set (trained and evaluated on a 1% user sample).

GitHub repo: https://github.com/msburns24/Steam-Recommender-System

What This Demonstrates

End-to-end DS workflow: data ingestion → cleaning/reshaping → modeling → evaluation
Working with real-world messy nested data (Steam user-item histories)
Recommender-system fundamentals:
- Implicit feedback (playtime) vs explicit ratings
- Sparse user-item matrices
- ALS factorization and practical tuning
Pragmatic model validation using ranking metrics (NDCG@K)

Part 1 — Data Collection

Much of the code in this project is organized into functions, even where a one-liner would work. This reflects a preference for functional-style data transformations: it makes each step easier to reason about and keeps the pipeline explicit when all transformations are composed together later.

Data Source

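The dataset comes from Julian McAuley and Wang-Cheng Kang at UC San Diego, hosted on McAuley’s website (https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data). The V1 User-Items dataset records how long Australian Steam users have played each of their games. The playtime fields appear to be in minutes rather than hours: the sample record shown later lists 1,729 under playtime_2weeks, which exceeds the 336 hours in two weeks. The file arrives as a .json.gz archive: 70.6 MB compressed, 527.5 MB extracted.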

Code

from pathlib import Path

root_dir = Path.cwd().resolve().parent
raw_data_dir = root_dir / 'data' / 'raw'
raw_data_dir.mkdir(parents=True, exist_ok=True)

GZIP_PATH = raw_data_dir / 'ucsd_playtime.json.gz'
URL = 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'

Download

Code

import requests


def download(verbose: bool) -> None:
    resp = requests.get(URL)
    resp.raise_for_status()
    with GZIP_PATH.open('wb') as file:
        size = file.write(resp.content)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{GZIP_PATH.relative_to(root_dir)}'")


download(verbose=True)
# Wrote 70.6 MB to 'data/raw/ucsd_playtime.json.gz'

Extract from GZIP

Code

import gzip

JSON_PATH = raw_data_dir / 'ucsd_playtime.json'


def extract_from_gzip(verbose: bool) -> None:
    chunk_size = 1048576   # 1 MB
    size = 0
    with gzip.open(GZIP_PATH) as gzip_file, JSON_PATH.open('wb') as json_file:
        while True:
            chunk = gzip_file.read(chunk_size)
            if not chunk:
                break
            size += json_file.write(chunk)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{JSON_PATH.relative_to(root_dir)}'")


extract_from_gzip(verbose=True)
# Wrote 527.5 MB to 'data/raw/ucsd_playtime.json'

Handling the JSON-by-Line Format

The file has 88,310 records, one per line — but each line is a Python dict literal with single quotes, not valid JSON. Calling pd.read_json() or json.load() on it fails immediately.

This format is likely an artifact of how the data was collected: a script that appended each API response to the file one record at a time, writing the Python repr of each record rather than serializing it as JSON. That’s sensible for fault-tolerant scraping but leaves us with two problems:

1. Records must be split by newline, not treated as a single JSON document.
2. Each line is plain text and must be converted to a Python dict. ast.literal_eval() handles both the single-quote quoting and the nested structure cleanly.
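
To make the failure mode concrete, here is a minimal sketch using a hand-written line in the same style. The record is made up for illustration and is not taken from the dataset. It shows json.loads rejecting the single-quoted text while ast.literal_eval parses it into an ordinary dict.

```python
import ast
import json

# Hypothetical line in the same style as the raw file (not real data).
line = "{'user_id': 'example_user', 'items_count': 1, 'items': [{'item_id': '10', 'playtime_forever': 6}]}"

try:
    json.loads(line)
    json_ok = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so the single quotes are rejected.
    json_ok = False

# literal_eval only evaluates Python literals (dicts, lists, strings, numbers),
# so it parses the line without executing arbitrary code.
record = ast.literal_eval(line)

print(json_ok)                         # False
print(record['items'][0]['item_id'])   # 10
```

Because literal_eval refuses anything other than literals, it is a safer choice than eval() for text downloaded from the internet.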

A single parsed record looks like this:

{
  'user_id': 'Leaf_Light_Moscow',
  'items_count': 5,
  'steam_id': '76561198305694024',
  'user_url': 'http://steamcommunity.com/id/Leaf_Light_Moscow',
  'items': [
    {'item_id': '4000',  'item_name': "Garry's Mod", 'playtime_forever': 4548, 'playtime_2weeks': 1729},
    {'item_id': '221100','item_name': 'DayZ',         'playtime_forever': 48,   'playtime_2weeks': 0},
    ...
  ]
}

The nested items list will be expanded in Part 2. For now, we parse all 88,310 records into a DataFrame and pickle it (pickle preserves the nested list column; feather does not).

Code

import ast
import pandas as pd
from tqdm import tqdm


def parse_json_by_line(verbose: bool) -> pd.DataFrame:
    with JSON_PATH.open('r', encoding='utf-8') as file:
        if verbose:
            total = sum(1 for _ in file)
            file.seek(0)
            file = tqdm(file, 'Parsing JSON-by-Line', total, colour='green')
        records = []
        for line in file:
            line = line.strip()
            if not line:
                continue
            records.append(ast.literal_eval(line))
    return pd.DataFrame.from_records(records)


df = parse_json_by_line(verbose=True)
# 88,310 rows  ·  5 columns: user_id, items_count, steam_id, user_url, items

PICKLE_PATH = raw_data_dir / 'ucsd_playtime.pkl'
df.to_pickle(PICKLE_PATH)
# Wrote 488.93 MB to 'data/raw/ucsd_playtime.pkl'

Part 2 — Data Preparation

The raw data has five issues to resolve before it can feed a recommender model:

1. Naming confusion. The user_id column is actually a username; steam_id is the true unique integer identifier. We drop the username and rename steam_id to user_id.
2. Nested structure. Each row is one user with a list of items. We need one row per user–item pair.
3. The cold-start problem. Users with very few interactions produce unstable recommendations. We drop users with fewer than 10 games.
4. Heavy-tailed playtime. A handful of users have logged tens of thousands of hours. We apply log(1 + playtime) to compress the tail.
5. Non-contiguous IDs. The ALS implementation expects 0-indexed integer IDs. We build and save mapping files so original IDs remain recoverable.

Helper: DataFrame Summary

Code

from typing import Optional
import numpy as np
from IPython.display import display, display_markdown


def summarize_df(df, name=None, nulls=True, head=5):
    summary = pd.DataFrame({
        'DType': df.dtypes,
        'Null': df.isna().sum().map('{:,.0f}'.format),
        'Total': len(df),
        '% Null': df.isna().mean().map('{:.2%}'.format),
    })
    if name:
        display_markdown(f'### {name}', raw=True)
    if nulls:
        display(summary)
    if head:
        display(df.head(head))

Cleaning: Expand Nested Data

Three functions applied in sequence:

Code

def correct_naming_of_user_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop username column; rename steam_id → user_id."""
    return df.drop(columns='user_id').rename(columns={'steam_id': 'user_id'})


def expand_nested_items(df: pd.DataFrame) -> pd.DataFrame:
    df = df.explode('items').reset_index(drop=True)
    items_df = pd.json_normalize(df['items'].tolist())
    return pd.concat([df.drop(columns='items'), items_df], axis=1)


def clean_expanded_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=['items_count', 'user_url', 'playtime_2weeks'])
    df = df.rename(columns={'playtime_forever': 'playtime'})
    df = df.dropna(subset=['user_id', 'item_id']).drop_duplicates(subset=['user_id', 'item_id'])
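    # Steam IDs are 17-digit numbers, so request 64-bit integers explicitly
    # (plain `int` can map to 32 bits on some platforms and overflow).
    df = df.astype({'user_id': 'int64', 'item_id': 'int64', 'playtime': float})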
    return df.sort_values(by=['user_id', 'item_id']).reset_index(drop=True)

After expanding, we have 5,094,082 user–item rows across 88,310 users and 10,976 unique games.
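
As a small illustration of the expansion step, the sketch below runs the same explode and json_normalize pattern on a two-user toy frame. The values are invented for the example.

```python
import pandas as pd

# Toy frame in the post-rename shape: one row per user, items as a list of dicts.
users = pd.DataFrame({
    'user_id': ['1', '2'],
    'items': [
        [{'item_id': '10', 'playtime_forever': 5}, {'item_id': '20', 'playtime_forever': 0}],
        [{'item_id': '10', 'playtime_forever': 7}],
    ],
})

# explode() repeats the user row once per list element ...
exploded = users.explode('items').reset_index(drop=True)
# ... and json_normalize() turns each dict into its own set of columns.
items = pd.json_normalize(exploded['items'].tolist())
flat = pd.concat([exploded.drop(columns='items'), items], axis=1)

print(flat)
```

The result has three rows (two for user 1, one for user 2) with columns user_id, item_id, and playtime_forever. Resetting the index before the concat is what keeps the two frames aligned row by row.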

Cold-Start Mitigation

Code

def filter_users_by_num_items(df: pd.DataFrame, min_items: int) -> pd.DataFrame:
    """Keep only users with at least min_items interactions."""
    n_items_by_user = df.groupby('user_id')['item_id'].count()
    eligible = n_items_by_user[n_items_by_user >= min_items].index
    return df[df['user_id'].isin(eligible)].reset_index(drop=True)

Filtering to users with ≥ 10 games retains 57,333 users and 5,038,365 interactions.
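
For readers who want to see the filter's effect on a few rows, here is a toy sketch. It uses a threshold of 2 instead of 10 and a groupby transform instead of the index lookup above; both choices are only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3, 3],
    'item_id': [10, 20, 30, 10, 10, 20],
})

# transform('size') broadcasts each user's interaction count back onto their rows.
counts = toy.groupby('user_id')['item_id'].transform('size')
kept = toy[counts >= 2].reset_index(drop=True)

print(kept['user_id'].unique().tolist())   # [1, 3]
```

User 2 has a single interaction and is dropped; every row of users 1 and 3 is retained.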

Log Playtime Transform

Code

def convert_to_log_playtime(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['playtime'] = np.log1p(df['playtime'])   # log(1 + playtime)
    return df
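
To give a feel for how strongly the transform compresses the tail, the sketch below applies it to three illustrative playtimes. It assumes the values are in minutes, which is an assumption about the units rather than something the pipeline code states.

```python
import numpy as np

# Illustrative playtimes, assumed to be minutes: 1 hour, 100 hours, 10,000 hours.
minutes = np.array([60.0, 6_000.0, 600_000.0])
log_playtime = np.log1p(minutes)   # same as np.log(1 + minutes)

raw_ratio = minutes[-1] / minutes[0]             # 10,000x between extremes
log_ratio = log_playtime[-1] / log_playtime[0]   # roughly 3.2x after the transform

print(np.round(log_playtime, 1), raw_ratio, round(log_ratio, 1))
```

A 10,000-fold gap in raw playtime shrinks to roughly a 3-fold gap, and a playtime of zero stays at zero because log(1 + 0) = 0.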

Contiguous ID Mapping

Code

def map_to_continuous_id(data: pd.Series) -> pd.Series:
    current_ids = sorted(data.unique())
    return data.map(dict(zip(current_ids, range(len(current_ids)))))

Separate functions save the user and item ID maps (with item names) as feather files so original Steam IDs remain recoverable after modeling.
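
The saving functions are not shown here, so the following is only a sketch of how a recoverable mapping could be built. It uses pd.factorize rather than the dict-based helper above, and the IDs and the item_idx column name are hypothetical.

```python
import pandas as pd

# Hypothetical raw Steam app IDs (illustrative values only).
item_ids = pd.Series([4000, 221100, 10, 4000])

# factorize(sort=True) assigns codes 0..n-1 in sorted order of the unique IDs.
codes, uniques = pd.factorize(item_ids, sort=True)

# Lookup table from contiguous index back to the original ID.
item_map = pd.DataFrame({'item_idx': range(len(uniques)), 'item_id': uniques})

# Round trip: contiguous codes -> original IDs.
recovered = item_map.set_index('item_idx')['item_id'].loc[codes].to_numpy()

print(codes.tolist())       # [1, 2, 0, 1]
print(recovered.tolist())   # [4000, 221100, 10, 4000]
```

A table like item_map is what would be written to disk, so model output indexed by item_idx can be joined back to Steam IDs and names.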

Train / Validation / Test Splits

Standard random splits don’t work here — if a user appears only in the test set, the model has never seen them. Instead we split each user’s items independently, so every user is present in all three splits.

Code

def train_valid_test_split_by_item(
        df: pd.DataFrame,
        test_size: float = 0.2,
        valid_size: float = 0.2,
        random_state: Optional[int] = None,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    rng = np.random.default_rng(random_state)
    train_dfs, valid_dfs, test_dfs = [], [], []
    for _, df_user in df.groupby('user_id'):
        n = len(df_user)
        n_test = max(1, round(n * test_size))
        n_valid = max(1, round(n * valid_size))
        idx = np.arange(n)
        rng.shuffle(idx)
        train_dfs.append(df_user.iloc[idx[n_test + n_valid:]])
        valid_dfs.append(df_user.iloc[idx[n_test:n_test + n_valid]])
        test_dfs.append(df_user.iloc[idx[:n_test]])
    return (
        pd.concat(train_dfs, ignore_index=True),
        pd.concat(valid_dfs, ignore_index=True),
        pd.concat(test_dfs, ignore_index=True),
    )
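
A quick way to sanity-check the split is to compute the per-user counts it produces. The helper below is hypothetical (it is not part of the pipeline) and simply repeats the rounding rule used in the function above.

```python
def split_sizes(n: int, test_size: float = 0.2, valid_size: float = 0.2) -> tuple[int, int, int]:
    """Return (train, valid, test) counts using the same rounding rule as the split."""
    n_test = max(1, round(n * test_size))
    n_valid = max(1, round(n * valid_size))
    return n - n_test - n_valid, n_valid, n_test


for n in (10, 13, 100):
    print(n, split_sizes(n))
# 10 (6, 2, 2)
# 13 (7, 3, 3)
# 100 (60, 20, 20)
```

With the 10-game minimum from the cold-start filter, every user keeps at least six games for training and contributes at least two to each of the validation and test sets.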

Sample Sizes

With 57,333 users × 10,976 items, the full user–item matrix has about 629M elements. Training on the full dataset during exploration would be slow, so we also generate smaller user samples for iteration:

Sample	Users	Items	Matrix Elements
0.1%	57	1,590	90,630
1%	573	4,110	2,355,030
10%	5,733	7,911	45,353,763
100%	57,333	10,976	629,287,008

All modeling in Part 3 uses the 1% sample.
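
The sampling code itself is not shown, so the sketch below is one plausible way to draw a user-level sample. The helper name sample_users and the choice to round the user count down are assumptions, although rounding down does reproduce the user counts in the table (for example, 1% of 57,333 gives 573).

```python
import numpy as np
import pandas as pd


def sample_users(df: pd.DataFrame, frac: float, random_state: int = 0) -> pd.DataFrame:
    """Keep every interaction of a random fraction of users (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    user_ids = df['user_id'].unique()
    n_keep = max(1, int(len(user_ids) * frac))
    keep = rng.choice(user_ids, size=n_keep, replace=False)
    return df[df['user_id'].isin(keep)].reset_index(drop=True)


# Toy data: 200 users with 3 interactions each.
toy = pd.DataFrame({
    'user_id': np.repeat(np.arange(200), 3),
    'item_id': np.tile(np.arange(3), 200),
})
sample = sample_users(toy, frac=0.01)

print(sample['user_id'].nunique(), len(sample))   # 2 6
```

Sampling whole users, rather than individual rows, keeps each sampled user's history intact. The smaller item counts in the table suggest that IDs are re-mapped to a contiguous range after sampling; that step is omitted here.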

Part 3 — Implicit ALS Model

Collaborative filtering recommends items based on the behavior of similar users. Alternating Least Squares (Hu et al., 2008) factorizes the user–item matrix into latent user and item factor matrices. Because we have playtime (not star ratings), this is an implicit feedback problem: the confidence weight $c_{ui}$ controls how strongly the model trusts each observation.

\[ \mathcal{L}(X,Y) = \sum_{u,i} c_{ui}\bigl(p_{ui} - x_u^\top y_i\bigr)^2 + \lambda\!\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right) \]

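where $p_{ui} = 1$ for observed pairs and $p_{ui} = 0$ otherwise, $x_u$ and $y_i$ are the latent factor vectors for user $u$ and item $i$, $\lambda$ is the regularization strength, and $c_{ui}$ is the confidence weight, which grows with playtime on observed entries (Hu et al. use $c_{ui} = 1 + \alpha r_{ui}$, with $r_{ui}$ the observed playtime value).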
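
The "alternating" part comes from the fact that, with the item factors held fixed, each user's factor vector has a closed-form solution (and vice versa). The NumPy sketch below performs one such update for a single toy user. It assumes the confidence form $c_{ui} = 1 + \alpha r_{ui}$ from Hu et al.; the implicit library's internal weighting may differ in detail, and all values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_factors, lam, alpha = 6, 2, 0.1, 1.0

Y = rng.normal(size=(n_items, n_factors))        # item factors, held fixed
r_u = np.array([3.0, 0.0, 1.5, 0.0, 0.0, 4.2])   # toy log-playtimes for one user
p_u = (r_u > 0).astype(float)                    # binary preference p_ui
c_u = 1.0 + alpha * r_u                          # confidence c_ui (assumed form)


def user_loss(x: np.ndarray) -> float:
    """This user's share of the objective, with Y held fixed."""
    return float(c_u @ (p_u - Y @ x) ** 2 + lam * x @ x)


# Closed-form minimiser: (Y^T C_u Y + lam * I) x_u = Y^T C_u p_u
A = Y.T @ (c_u[:, None] * Y) + lam * np.eye(n_factors)
b = Y.T @ (c_u * p_u)
x_u = np.linalg.solve(A, b)

print(user_loss(x_u) <= user_loss(np.zeros(n_factors)))   # True
```

A full ALS iteration repeats this solve for every user, then swaps roles and solves for every item with the user factors fixed.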

We use the implicit library (https://github.com/benfred/implicit), which exposes this model via SciPy CSR matrices.

Sparse Matrix Construction

Code

from scipy.sparse import csr_matrix
import json

USER_COL, ITEM_COL, RATING_COL = 'user_id', 'item_id', 'playtime'


def get_user_item_csr_matrix(df, n_users, n_items) -> csr_matrix:
    ratings = df.pivot(index=USER_COL, columns=ITEM_COL, values=RATING_COL)
    observed_users = ratings.index.values.reshape(-1, 1)
    observed_items = ratings.columns.values
    R = np.full((n_users, n_items), 0.0)
    R[observed_users, observed_items] = ratings.values
    R[np.isnan(R)] = 0
    return csr_matrix(R)
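
The function above builds a dense array first, which is fine for the 1% sample but would need several gigabytes for the full 629M-element matrix. As a possible alternative, the sketch below builds the CSR matrix directly from (row, column, value) triplets. The function name is hypothetical, and it assumes the IDs are already contiguous and each user–item pair appears once.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def to_csr_from_triplets(df: pd.DataFrame, n_users: int, n_items: int) -> csr_matrix:
    """Hypothetical alternative: build the CSR matrix without a dense intermediate."""
    rows = df['user_id'].to_numpy()
    cols = df['item_id'].to_numpy()
    vals = df['playtime'].to_numpy(dtype=float)
    R = csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
    R.eliminate_zeros()   # drop stored zeros, as csr_matrix(dense_array) would
    return R


toy = pd.DataFrame({
    'user_id': [0, 0, 2],
    'item_id': [1, 3, 0],
    'playtime': [2.5, 0.0, 1.0],
})
R = to_csr_from_triplets(toy, n_users=3, n_items=4)

print(R.shape, R.nnz)   # (3, 4) 2
```

One behavioural difference to keep in mind: duplicate user–item pairs are summed by this constructor, whereas pivot() raises an error on them.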

Initial Training

Code

import matplotlib.pyplot as plt
from implicit.cpu.als import AlternatingLeastSquares as ALSModel


class LossCallback:
    def __init__(self):
        self.iterations, self.losses = [], []

    def __call__(self, iteration, time, loss):
        self.iterations.append(iteration)
        self.losses.append(loss)


loss_callback = LossCallback()
model = ALSModel(factors=32, iterations=15, calculate_training_loss=True, random_state=0)
model.fit(R_train, show_progress=False, callback=loss_callback)

Training loss decreases smoothly over 15 iterations with no signs of instability — a good baseline before tuning.

Hyperparameter Tuning

Grid search over 125 combinations (5 × 5 × 5):

Hyperparameter	Values
Latent factors	16, 24, 32, 48, 64
Regularization λ	0.001, 0.01, 0.1, 10, 100
Confidence weight α	0.1, 0.5, 1.0, 5.0, 10.0

Code

from itertools import product
from sklearn.metrics import ndcg_score
from numpy.typing import NDArray
from tqdm.auto import tqdm


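def predict_als(model: ALSModel, R_true: csr_matrix) -> NDArray:
    """Score every user-item pair, then zero out pairs absent from R_true.

    Only items present in R_true keep their predicted score, so an NDCG
    computed from this output mainly ranks each user's held-out games
    against one another.
    """
    R_pred = model.user_factors @ model.item_factors.T
    R_pred[R_true.toarray() == 0] = 0
    return R_pred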


class ALSGridSearch:
    def __init__(self, R_train, R_valid, iterations=15, random_state=None):
        self.R_train = R_train
        self.R_valid = R_valid
        self.iterations = iterations
        self.random_state = random_state

    def run(self, factors, regularization, alpha, verbose=True):
        parameters = list(product(factors, regularization, alpha))
        if verbose:
            parameters = tqdm(parameters, 'Running grid search')
        results = [self._run_once(f, l, a) for f, l, a in parameters]
        df = pd.DataFrame(list(product(factors, regularization, alpha)),
                          columns=['factors', 'regularization', 'alpha'])
        df[['loss', 'metric']] = results
        return df

    def _run_once(self, factors, regularization, alpha):
        cb = LossCallback()
        m = ALSModel(factors=factors, regularization=regularization, alpha=alpha,
                     iterations=self.iterations, calculate_training_loss=True,
                     random_state=self.random_state)
        m.fit(self.R_train, show_progress=False, callback=cb)
        R_pred = predict_als(m, self.R_valid)
        return cb.losses[-1], ndcg_score(self.R_valid.toarray(), R_pred, k=10)

Top 5 results by NDCG@10 on the validation set:

factors	regularization	alpha	NDCG@10
64	10.0	0.5	89.47%
48	10.0	0.5	89.04%
32	10.0	0.5	88.65%
64	10.0	1.0	88.62%
24	10.0	0.5	88.47%

Heatmaps of pairwise parameter interactions confirm that regularization is the dominant factor — λ = 10 outperforms all other values by a wide margin, while the model is relatively robust to moderate changes in factor count and alpha.
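
The heatmaps are not reproduced here, but the grid behind one of them can be sketched from the results frame returned by the grid search. The frame below is a toy stand-in with made-up scores; only the column names match the real output.

```python
import pandas as pd

# Toy stand-in for the grid-search results (values are made up, not the real scores).
results = pd.DataFrame({
    'factors':        [16, 16, 32, 32, 16, 16, 32, 32],
    'regularization': [0.1, 10, 0.1, 10, 0.1, 10, 0.1, 10],
    'alpha':          [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0],
    'metric':         [0.60, 0.80, 0.62, 0.85, 0.58, 0.79, 0.61, 0.84],
})

# Best score for each (regularization, factors) pair, taking the max over alpha.
grid = results.pivot_table(index='regularization', columns='factors',
                           values='metric', aggfunc='max')

# Single best configuration overall.
best = results.sort_values('metric', ascending=False).iloc[0]

print(grid)
print(int(best['factors']), best['regularization'], best['alpha'])
```

A grid like this can be passed to a plotting call such as matplotlib's imshow to draw the heatmap for one pair of hyperparameters.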

Final Model

Train on the combined train + validation sets using the best configuration, then evaluate on the held-out test set:

Code

R_train_full = R_train + R_valid   # no overlap, so simple addition is safe

model = ALSModel(factors=64, regularization=10.0, alpha=0.5, iterations=15, random_state=0)
model.fit(R_train_full, show_progress=False)

R_pred = predict_als(model, R_test)
ndcg_10 = ndcg_score(R_test.toarray(), R_pred, k=10)
print(f'Final NDCG@10:  {ndcg_10:.2%}')
# Final NDCG@10:  79.79%

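Final NDCG@10: 79.79% on held-out test data from the 1% sample. That is about 9.7 points below the best validation score (89.47%); some drop is expected because the configuration was selected on the validation set.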

Key Implementation Notes

Cold-start mitigation. Filtering to users with ≥ 10 interactions before training reduces noise from extremely sparse users and makes evaluation more meaningful — a model can’t be meaningfully evaluated on users it has almost no signal for.

Implicit feedback formulation. Playtime is treated as a confidence-weighted preference signal, not an explicit rating. The ALS confidence weight $c_{ui}$ lets the model distinguish “user played this for 1000 hours” from “user played this for 1 hour” without treating either as a negative signal. The log transform log(1 + playtime) additionally compresses the heavy tail so power users don’t dominate the factorization.

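Evaluation. NDCG@10 measures ranking quality: whether the items a user engaged with most appear near the top of the predicted ranking. Because predict_als zeroes the scores of items outside the evaluation split, the reported figure largely reflects how well the model orders each user's held-out games by log-playtime, rather than how well it retrieves them from the full catalogue. It is more informative than accuracy for recommender systems where the top-K list is what the user sees.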
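
For intuition about the metric, the sketch below computes NDCG@K by hand on four toy items using the linear-gain form (relevance divided by a log2 rank discount). It is a simplified illustration in plain Python: ties are not handled, and the helper names are hypothetical.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: sum of rel / log2(rank + 1) over the top k."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(true_rel: list[float], scores: list[float], k: int) -> float:
    """Order items by predicted score, then normalise by the ideal ordering's DCG."""
    by_score = [rel for _, rel in
                sorted(zip(scores, true_rel), key=lambda t: t[0], reverse=True)]
    ideal_dcg = dcg_at_k(sorted(true_rel, reverse=True), k)
    return dcg_at_k(by_score, k) / ideal_dcg if ideal_dcg > 0 else 0.0


true_rel = [3.0, 2.0, 0.0, 1.0]   # toy log-playtimes for four held-out games
perfect = [0.9, 0.7, 0.1, 0.3]    # scores that reproduce the ideal order
swapped = [0.7, 0.9, 0.1, 0.3]    # top two games swapped

print(ndcg_at_k(true_rel, perfect, k=3))             # 1.0
print(round(ndcg_at_k(true_rel, swapped, k=3), 2))   # 0.92
```

Swapping only the top two games costs about eight points, which shows how heavily the metric weights the first few positions.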
