---
title: Steam Recommender System
subtitle: Recommending Games Based on User Playtime
description: "End-to-end recommender system pipeline: ingesting 5M+ Steam playtime records, applying implicit-feedback ALS, and achieving 79.79% NDCG@10 on held-out data."
author: Matthew Burns
date: 2026-01-19
date-modified: 2026-05-26
image: sparse_matrix_thumbnail.svg
include-before-body:
  text: '<img src="sparse_matrix_header.svg" style="width:100%; margin-bottom:1rem;">'
categories:
  - Data Science
  - Machine Learning
  - Recommender Systems
  - Video Games
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: true
    code-tools: true
    page-layout: full
execute:
  eval: false
freeze: auto
---

::: {.callout-note title="TL;DR"}
- The objective was to build a **collaborative-filtering recommender system** for Steam games using an **implicit-feedback Alternating Least Squares (ALS)** model.

- The full pipeline covers data ingestion from a raw gzipped JSON-lines file, cleaning and reshaping 5M+ user-item interactions, hyperparameter tuning via grid search, and final evaluation.

- **Key result:** Model achieved **NDCG@10 = 79.79%** on a held-out test set.
:::

[GitHub repo](https://github.com/msburns24/Steam-Recommender-System)

## What This Demonstrates

- End-to-end DS workflow: **data ingestion → cleaning/reshaping → modeling → evaluation**
- Working with real-world messy nested data (Steam user-item histories)
- Recommender-system fundamentals:
  - **Implicit feedback** (playtime) vs explicit ratings
  - Sparse user-item matrices
  - ALS factorization and practical tuning
- Pragmatic model validation using ranking metrics (**NDCG@K**)

## Part 1 — Data Collection

Much of the code in this project is organized into functions, even where a
one-liner would work. This reflects a preference for functional-style data
transformations: it makes each step easier to reason about and keeps the
pipeline explicit when all transformations are composed together later.

### Data Source

The dataset comes from Julian McAuley and Wang-Cheng Kang at UC San Diego,
hosted on [McAuley's website](https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data).
The **V1 User-Items** dataset records how long Australian Steam users have
played each of their games. The playtime fields appear to be in minutes rather
than hours: the sample record shown later lists 1,729 for a two-week window,
which contains only 336 hours. The data arrives as a `.json.gz` file that is
70.6 MB compressed and 527.5 MB extracted.

```{python}
from pathlib import Path

root_dir = Path.cwd().resolve().parent
raw_data_dir = root_dir / 'data' / 'raw'
raw_data_dir.mkdir(parents=True, exist_ok=True)

GZIP_PATH = raw_data_dir / 'ucsd_playtime.json.gz'
URL = 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'
```

### Download

```{python}
import requests


def download(verbose: bool) -> None:
    resp = requests.get(URL)
    resp.raise_for_status()
    with GZIP_PATH.open('wb') as file:
        size = file.write(resp.content)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{GZIP_PATH.relative_to(root_dir)}'")


download(verbose=True)
# Wrote 70.6 MB to 'data/raw/ucsd_playtime.json.gz'
```

### Extract from GZIP

```{python}
import gzip

JSON_PATH = raw_data_dir / 'ucsd_playtime.json'


def extract_from_gzip(verbose: bool) -> None:
    chunk_size = 1048576   # 1 MB
    size = 0
    with gzip.open(GZIP_PATH) as gzip_file, JSON_PATH.open('wb') as json_file:
        while True:
            chunk = gzip_file.read(chunk_size)
            if not chunk:
                break
            size += json_file.write(chunk)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{JSON_PATH.relative_to(root_dir)}'")


extract_from_gzip(verbose=True)
# Wrote 527.5 MB to 'data/raw/ucsd_playtime.json'
```

### Handling the JSON-by-Line Format

The file has 88,310 records, one per line — but each line is a Python dict
literal with single quotes, not valid JSON. Calling `pd.read_json()` or
`json.load()` on it fails immediately.

This format is almost certainly an artifact of how McAuley & Kang collected it:
a script that appended each API response to the file one record at a time.
That's sensible for fault-tolerant scraping but leaves us with two problems:

1. Records must be split by newline, not treated as a single JSON document.
2. Each line is plain text and must be converted to a Python dict; `ast.literal_eval()` handles both the single-quoted strings and the nested structure cleanly.
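
To make the two points above concrete, here is a minimal sketch. The `line` string is a shortened, made-up record in the same style as the raw file, not an actual row from the dataset.

```python
import ast
import json

# A shortened, made-up line in the same style as the raw file (single-quoted keys).
line = "{'user_id': 'example_user', 'items_count': 1, 'items': [{'item_id': '4000', 'item_name': \"Garry's Mod\", 'playtime_forever': 4548}]}"

# json.loads() requires double-quoted keys and strings, so it rejects the line.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval() safely evaluates a Python literal without running arbitrary code.
record = ast.literal_eval(line)

print(parsed_as_json)                    # False
print(record['items'][0]['item_name'])   # Garry's Mod
```

Unlike `eval()`, `ast.literal_eval()` only accepts literals (dicts, lists, strings, numbers, and so on), which makes it a reasonable choice for untrusted text like this.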

A single parsed record looks like this:

```python
{
  'user_id': 'Leaf_Light_Moscow',
  'items_count': 5,
  'steam_id': '76561198305694024',
  'user_url': 'http://steamcommunity.com/id/Leaf_Light_Moscow',
  'items': [
    {'item_id': '4000',  'item_name': "Garry's Mod", 'playtime_forever': 4548, 'playtime_2weeks': 1729},
    {'item_id': '221100','item_name': 'DayZ',         'playtime_forever': 48,   'playtime_2weeks': 0},
    ...
  ]
}
```

The nested `items` list will be expanded in Part 2. For now, we parse all
88,310 records into a DataFrame and pickle it (pickle preserves the nested
list column; feather does not).

```{python}
import ast
import pandas as pd
from tqdm import tqdm


def parse_json_by_line(verbose: bool) -> pd.DataFrame:
    with JSON_PATH.open('r', encoding='utf-8') as file:
        if verbose:
            total = sum(1 for _ in file)
            file.seek(0)
            file = tqdm(file, 'Parsing JSON-by-Line', total, colour='green')
        records = []
        for line in file:
            line = line.strip()
            if not line:
                continue
            records.append(ast.literal_eval(line))
    return pd.DataFrame.from_records(records)


df = parse_json_by_line(verbose=True)
# 88,310 rows  ·  5 columns: user_id, items_count, steam_id, user_url, items

PICKLE_PATH = raw_data_dir / 'ucsd_playtime.pkl'
df.to_pickle(PICKLE_PATH)
# Wrote 488.93 MB to 'data/raw/ucsd_playtime.pkl'
```

## Part 2 — Data Preparation

The raw data has five issues to resolve before it can feed a recommender model:

1. **Naming confusion.** The `user_id` column is actually a username; `steam_id`
   is the true unique numeric identifier. We drop the username and rename
   `steam_id` to `user_id`.
2. **Nested structure.** Each row is one user with a list of items. We need one
   row per user–item pair.
3. **The cold-start problem.** Users with very few interactions produce unstable
   recommendations. We drop users with fewer than 10 games.
4. **Heavy-tailed playtime.** A handful of users have logged tens of thousands
   of hours. We apply `log(1 + playtime)` to compress the tail.
5. **Non-contiguous IDs.** The ALS implementation expects 0-indexed integer IDs.
   We build and save mapping files so original IDs remain recoverable.

### Helper: DataFrame Summary

```{python}
from typing import Optional
import numpy as np
from IPython.display import display, display_markdown


def summarize_df(df, name=None, nulls=True, head=5):
    summary = pd.DataFrame({
        'DType': df.dtypes,
        'Null': df.isna().sum().map('{:,.0f}'.format),
        'Total': len(df),
        '% Null': df.isna().mean().map('{:.2%}'.format),
    })
    if name:
        display_markdown(f'### {name}', raw=True)
    if nulls:
        display(summary)
    if head:
        display(df.head(head))
```

### Cleaning: Expand Nested Data

Three functions applied in sequence:

```{python}
def correct_naming_of_user_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop username column; rename steam_id → user_id."""
    return df.drop(columns='user_id').rename(columns={'steam_id': 'user_id'})


def expand_nested_items(df: pd.DataFrame) -> pd.DataFrame:
    df = df.explode('items').reset_index(drop=True)
    items_df = pd.json_normalize(df['items'].tolist())
    return pd.concat([df.drop(columns='items'), items_df], axis=1)


def clean_expanded_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=['items_count', 'user_url', 'playtime_2weeks'])
    df = df.rename(columns={'playtime_forever': 'playtime'})
    df = df.dropna(subset=['user_id', 'item_id']).drop_duplicates(subset=['user_id', 'item_id'])
    df = df.astype({'user_id': int, 'item_id': int, 'playtime': float})
    return df.sort_values(by=['user_id', 'item_id']).reset_index(drop=True)
```

After expanding, we have **5,094,082** user–item rows across 88,310 users and
10,976 unique games.
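
As a small illustration of the explode-then-normalize step, the sketch below applies the same two pandas calls to a made-up two-user frame (the values are invented for the example).

```python
import pandas as pd

# Toy frame: one row per user, with a nested list of item dicts (made-up data).
toy = pd.DataFrame({
    'user_id': [1, 2],
    'items': [
        [{'item_id': '10', 'playtime_forever': 5},
         {'item_id': '20', 'playtime_forever': 0}],
        [{'item_id': '10', 'playtime_forever': 7}],
    ],
})

# explode() gives one row per list element; json_normalize() turns each dict into columns.
exploded = toy.explode('items').reset_index(drop=True)
items_df = pd.json_normalize(exploded['items'].tolist())
flat = pd.concat([exploded.drop(columns='items'), items_df], axis=1)

print(flat)
```

The reset of the index before concatenating matters: `explode()` repeats the original index for each list element, so without it the column-wise `concat` would not line up row by row.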

### Cold-Start Mitigation

```{python}
def filter_users_by_num_items(df: pd.DataFrame, min_items: int) -> pd.DataFrame:
    """Keep only users with at least min_items interactions."""
    n_items_by_user = df.groupby('user_id')['item_id'].count()
    eligible = n_items_by_user[n_items_by_user >= min_items].index
    return df[df['user_id'].isin(eligible)].reset_index(drop=True)
```

Filtering to users with ≥ 10 games retains **57,333 users** and
**5,038,365 interactions**.
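
The same filter can also be written with a group-wise `transform`, which avoids building the list of eligible users explicitly. This is only an equivalent sketch on made-up data, with a threshold of 2 instead of 10 to keep the example small.

```python
import pandas as pd

# Made-up interactions: user 1 has 3 items, user 2 has 1, user 3 has 2.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3, 3],
    'item_id': [10, 20, 30, 10, 20, 30],
})

min_items = 2  # illustrative threshold; the post uses 10

# transform('count') broadcasts each user's item count back onto their rows.
counts = toy.groupby('user_id')['item_id'].transform('count')
kept = toy[counts >= min_items].reset_index(drop=True)

print(kept['user_id'].unique().tolist())   # [1, 3]
print(len(kept))                           # 5
```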

### Log Playtime Transform

```{python}
def convert_to_log_playtime(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['playtime'] = np.log(1 + df['playtime'])
    return df
```
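
To see how strongly the transform compresses the tail, the sketch below applies it to a few made-up playtime values spanning several orders of magnitude. `np.log1p` is used here as a numerically equivalent shorthand for `np.log(1 + x)`.

```python
import numpy as np

# Made-up playtime values, from "never played" to an extreme power user.
playtime = np.array([0.0, 1.0, 60.0, 6000.0, 600000.0])

log_playtime = np.log1p(playtime)   # same result as np.log(1 + playtime)

print(np.round(log_playtime, 2))
```

A raw range of 0 to 600,000 shrinks to roughly 0 to 13.3, and a playtime of zero stays exactly zero, so unplayed games remain empty cells in the sparse matrix built later.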

### Contiguous ID Mapping

```{python}
def map_to_continuous_id(data: pd.Series) -> pd.Series:
    current_ids = sorted(data.unique())
    return data.map(dict(zip(current_ids, range(len(current_ids)))))
```

Separate functions save the user and item ID maps (with item names) as feather
files so original Steam IDs remain recoverable after modeling.
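
A minimal sketch of why the saved maps matter: the forward map gives the 0-indexed IDs the model needs, and its inverse recovers the original IDs afterwards. The IDs below are made up for illustration, and the names `forward` and `reverse` are not from the project code.

```python
import pandas as pd

item_ids = pd.Series([4000, 221100, 4000, 730])   # made-up item IDs

# Same construction as map_to_continuous_id: sorted unique IDs -> 0..n-1.
current_ids = sorted(item_ids.unique())
forward = dict(zip(current_ids, range(len(current_ids))))
reverse = {new: old for old, new in forward.items()}

mapped = item_ids.map(forward)     # contiguous IDs for the model
restored = mapped.map(reverse)     # back to the original IDs

print(mapped.tolist())     # [1, 2, 1, 0]
print(restored.tolist())   # [4000, 221100, 4000, 730]
```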

### Train / Validation / Test Splits

Standard random splits don't work here — if a user appears only in the test
set, the model has never seen them. Instead we split *each user's items*
independently, so every user is present in all three splits.

```{python}
def train_valid_test_split_by_item(
        df: pd.DataFrame,
        test_size: float = 0.2,
        valid_size: float = 0.2,
        random_state: Optional[int] = None,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    rng = np.random.default_rng(random_state)
    train_dfs, valid_dfs, test_dfs = [], [], []
    for _, df_user in df.groupby('user_id'):
        n = len(df_user)
        n_test = max(1, round(n * test_size))
        n_valid = max(1, round(n * valid_size))
        idx = np.arange(n)
        rng.shuffle(idx)
        train_dfs.append(df_user.iloc[idx[n_test + n_valid:]])
        valid_dfs.append(df_user.iloc[idx[n_test:n_test + n_valid]])
        test_dfs.append(df_user.iloc[idx[:n_test]])
    return (
        pd.concat(train_dfs, ignore_index=True),
        pd.concat(valid_dfs, ignore_index=True),
        pd.concat(test_dfs, ignore_index=True),
    )
```
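
One way to sanity-check a per-user split is to confirm that every user appears in all three parts and that no user-item pair is shared between them. The helper `check_split` below is hypothetical (it is not part of the project code), and the toy frames are hand-made.

```python
import pandas as pd


def check_split(train, valid, test):
    """Return True if all three parts cover the same users and share no user-item pair."""
    parts = (train, valid, test)
    users = [set(part['user_id']) for part in parts]
    same_users = users[0] == users[1] == users[2]
    pairs = [set(zip(part['user_id'], part['item_id'])) for part in parts]
    overlap = (pairs[0] & pairs[1]) | (pairs[0] & pairs[2]) | (pairs[1] & pairs[2])
    return same_users and not overlap


# Hand-made toy split for two users.
train = pd.DataFrame({'user_id': [1, 1, 2, 2], 'item_id': [10, 20, 10, 30]})
valid = pd.DataFrame({'user_id': [1, 2], 'item_id': [30, 20]})
test = pd.DataFrame({'user_id': [1, 2], 'item_id': [40, 40]})

ok = check_split(train, valid, test)
# Dropping user 2 from the test part breaks the "every user in every split" property.
broken = check_split(train, valid, test[test['user_id'] != 2])

print(ok, broken)   # True False
```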

### Sample Sizes

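With 57,333 users × 10,976 items the full user–item matrix has **629M elements**,
of which only about 0.8% correspond to one of the 5,038,365 retained interactions.
Training on the full dataset during exploration would be slow, so we also
generate 0.1%, 1%, and 10% user samples for faster iteration: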

| Sample | Users | Items | Matrix Elements |
|--------|------:|------:|----------------:|
| 0.1%   | 57    | 1,590 | 90,630          |
| 1%     | 573   | 4,110 | 2,355,030       |
| 10%    | 5,733 | 7,911 | 45,353,763      |
| 100%   | 57,333| 10,976| 629,287,008     |

All modeling in Part 3 uses the **1% sample**.
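
The sampling code itself is not shown in this post. The sketch below is one plausible way to draw a user-level sample, under the assumption that a fixed fraction of users is chosen at random and all of their rows are kept; the helper name `sample_users` is hypothetical.

```python
import numpy as np
import pandas as pd


def sample_users(df, frac, random_state=None):
    """Hypothetical helper: keep every row belonging to a random fraction of users."""
    rng = np.random.default_rng(random_state)
    users = df['user_id'].unique()
    n_keep = max(1, round(len(users) * frac))
    keep = rng.choice(users, size=n_keep, replace=False)
    return df[df['user_id'].isin(keep)].reset_index(drop=True)


# Made-up data: 100 users with 3 items each.
toy = pd.DataFrame({
    'user_id': np.repeat(np.arange(100), 3),
    'item_id': np.tile([10, 20, 30], 100),
})

sample = sample_users(toy, frac=0.1, random_state=0)
print(sample['user_id'].nunique(), len(sample))   # 10 30
```

Sampling whole users rather than individual rows keeps each sampled user's full history intact, so the cold-start filter and the per-user split still behave as intended on the sample.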

## Part 3 — Implicit ALS Model

*Collaborative filtering* recommends items based on the behavior of similar
users. Implicit-feedback *Alternating Least Squares* (Hu et al., 2008)
factorizes the user–item matrix into latent user and item factor matrices.
Because we have playtime (not star ratings), this is an **implicit feedback**
problem: the confidence weight $c_{ui}$ controls how strongly the model trusts
each observation.

$$
\mathcal{L}(X,Y) = \sum_{u,i} c_{ui}\bigl(p_{ui} - x_u^\top y_i\bigr)^2
+ \lambda\!\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right)
$$

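where the sum runs over all user–item pairs, $p_{ui} = 1$ for observed pairs
and $0$ otherwise, and $c_{ui}$ is a confidence weight that grows with the
observed playtime. Hu et al. use $c_{ui} = 1 + \alpha r_{ui}$, where $r_{ui}$
is the interaction strength (here, log playtime) and $\alpha$ is the confidence
weight tuned later.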
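
As a worked example of the objective, the sketch below evaluates the loss for a tiny made-up problem with two users and two items. It assumes the Hu et al. confidence form $c_{ui} = 1 + \alpha r_{ui}$; the `implicit` library's internal scaling may differ in detail, so treat this as an illustration of the formula rather than of the library.

```python
import numpy as np

alpha, lam = 0.5, 10.0                       # illustrative hyperparameters

r = np.array([[2.0, 0.0],
              [0.0, 3.0]])                   # made-up log-playtime matrix (users x items)
p = (r > 0).astype(float)                    # preference: 1 if observed, else 0
c = 1.0 + alpha * r                          # confidence weights (Hu et al. form)

X = np.array([[1.0, 0.0],
              [0.0, 1.0]])                   # user factors x_u, one row per user
Y = np.array([[0.5, 0.0],
              [0.0, 0.5]])                   # item factors y_i, one row per item

pred = X @ Y.T                               # x_u^T y_i for every pair
fit_term = np.sum(c * (p - pred) ** 2)       # 2.0*0.25 + 2.5*0.25 = 1.125
reg_term = lam * (np.sum(X ** 2) + np.sum(Y ** 2))   # 10 * (2 + 0.5) = 25
loss = fit_term + reg_term

print(loss)   # 26.125
```

Note that the unobserved cells still contribute to the sum (with confidence 1 and preference 0); in this example their error happens to be zero.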

We use the [`implicit`](https://github.com/benfred/implicit) library, which
exposes this model via SciPy CSR matrices.

### Sparse Matrix Construction

```{python}
from scipy.sparse import csr_matrix
import json

USER_COL, ITEM_COL, RATING_COL = 'user_id', 'item_id', 'playtime'


def get_user_item_csr_matrix(df, n_users, n_items) -> csr_matrix:
    ratings = df.pivot(index=USER_COL, columns=ITEM_COL, values=RATING_COL)
    observed_users = ratings.index.values.reshape(-1, 1)
    observed_items = ratings.columns.values
    R = np.full((n_users, n_items), 0.0)
    R[observed_users, observed_items] = ratings.values
    R[np.isnan(R)] = 0
    return csr_matrix(R)
```
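
The function above goes through a dense array, which is fine for the 1% sample but would need several gigabytes for the full 57,333 × 10,976 matrix. A possible alternative, sketched below with a hypothetical helper `to_csr`, builds the CSR matrix directly from (row, column, value) triplets. It assumes the IDs are already contiguous and that duplicate user-item pairs have been dropped, since SciPy sums duplicate entries.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def to_csr(df, n_users, n_items):
    """Hypothetical variant: build the user-item CSR matrix without a dense intermediate."""
    rows = df['user_id'].to_numpy()
    cols = df['item_id'].to_numpy()
    vals = df['playtime'].to_numpy(dtype=np.float64)
    R = csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
    R.eliminate_zeros()   # drop stored zeros, e.g. games with zero playtime
    return R


# Made-up interactions with contiguous IDs.
toy = pd.DataFrame({
    'user_id': [0, 0, 2],
    'item_id': [1, 3, 0],
    'playtime': [1.5, 0.0, 2.0],
})

R = to_csr(toy, n_users=3, n_items=4)
print(R.shape, R.nnz)   # (3, 4) 2
```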

### Initial Training

```{python}
import matplotlib.pyplot as plt
from implicit.cpu.als import AlternatingLeastSquares as ALSModel


class LossCallback:
    def __init__(self):
        self.iterations, self.losses = [], []

    def __call__(self, iteration, time, loss):
        self.iterations.append(iteration)
        self.losses.append(loss)


loss_callback = LossCallback()
model = ALSModel(factors=32, iterations=15, calculate_training_loss=True, random_state=0)
model.fit(R_train, show_progress=False, callback=loss_callback)
```

Training loss decreases smoothly over 15 iterations with no signs of
instability — a good baseline before tuning.

### Hyperparameter Tuning

Grid search over 125 combinations (5 × 5 × 5):

| Hyperparameter | Values |
|---|---|
| Latent factors | 16, 24, 32, 48, 64 |
| Regularization λ | 0.001, 0.01, 0.1, 10, 100 |
| Confidence weight α | 0.1, 0.5, 1.0, 5.0, 10.0 |

```{python}
from itertools import product
from sklearn.metrics import ndcg_score
from numpy.typing import NDArray
from tqdm.auto import tqdm


def predict_als(model: ALSModel, R_true: csr_matrix) -> NDArray:
    R_pred = model.user_factors @ model.item_factors.T
    R_pred[R_true.toarray() == 0] = 0
    return R_pred


class ALSGridSearch:
    def __init__(self, R_train, R_valid, iterations=15, random_state=None):
        self.R_train = R_train
        self.R_valid = R_valid
        self.iterations = iterations
        self.random_state = random_state

    def run(self, factors, regularization, alpha, verbose=True):
        parameters = list(product(factors, regularization, alpha))
        if verbose:
            parameters = tqdm(parameters, 'Running grid search')
        results = [self._run_once(f, l, a) for f, l, a in parameters]
        df = pd.DataFrame(list(product(factors, regularization, alpha)),
                          columns=['factors', 'regularization', 'alpha'])
        df[['loss', 'metric']] = results
        return df

    def _run_once(self, factors, regularization, alpha):
        cb = LossCallback()
        m = ALSModel(factors=factors, regularization=regularization, alpha=alpha,
                     iterations=self.iterations, calculate_training_loss=True,
                     random_state=self.random_state)
        m.fit(self.R_train, show_progress=False, callback=cb)
        R_pred = predict_als(m, self.R_valid)
        return cb.losses[-1], ndcg_score(self.R_valid.toarray(), R_pred, k=10)
```

Top 5 results by NDCG@10 on the validation set:

| factors | regularization | alpha | NDCG@10 |
|--------:|---------------:|------:|--------:|
| 64      | 10.0           | 0.5   | 89.47%  |
| 48      | 10.0           | 0.5   | 89.04%  |
| 32      | 10.0           | 0.5   | 88.65%  |
| 64      | 10.0           | 1.0   | 88.62%  |
| 24      | 10.0           | 0.5   | 88.47%  |

Heatmaps of pairwise parameter interactions confirm that regularization is the
dominant factor — λ = 10 outperforms all other values by a wide margin, while
the model is relatively robust to moderate changes in factor count and alpha.

### Final Model

Train on the combined train + validation sets using the best configuration,
then evaluate on the held-out test set:

```{python}
R_train_full = R_train + R_valid   # no overlap, so simple addition is safe

model = ALSModel(factors=64, regularization=10.0, alpha=0.5, iterations=15, random_state=0)
model.fit(R_train_full, show_progress=False)

R_pred = predict_als(model, R_test)
ndcg_10 = ndcg_score(R_test.toarray(), R_pred, k=10)
print(f'Final NDCG@10:  {ndcg_10:.2%}')
# Final NDCG@10:  79.79%
```
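
The comment about simple addition relies on the train and validation matrices having no cell in common. A small sketch of how that assumption could be checked, using two made-up sparse matrices:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up disjoint "train" and "validation" matrices.
A = csr_matrix(np.array([[1.0, 0.0],
                         [0.0, 0.0]]))
B = csr_matrix(np.array([[0.0, 2.0],
                         [0.0, 3.0]]))

# Element-wise product is non-zero only where both matrices have an entry.
n_overlap = A.multiply(B).nnz
combined = A + B

print(n_overlap)            # 0
print(combined.toarray())
```

If the overlap count were non-zero, the sum would add the two values in the shared cells, which would inflate those playtime entries.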

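**Final NDCG@10: 79.79%**, measured on interactions the model never saw during
training or tuning. That is roughly ten points below the best validation score
(89.47%), and it comes from the 1% user sample rather than the full dataset.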

## Key Implementation Notes

**Cold-start mitigation.** Filtering to users with ≥ 10 interactions before
training reduces noise from extremely sparse users and makes evaluation more
meaningful: with the per-user split above, a user with 10 games still
contributes 6 training items and 2 each for validation and testing.

**Implicit feedback formulation.** Playtime is treated as a confidence-weighted
preference signal, not an explicit rating. The ALS confidence weight $c_{ui}$
lets the model distinguish "user played this for 1000 hours" from "user played
this for 1 hour" without treating either as a negative signal. The log transform
`log(1 + playtime)` additionally compresses the heavy tail so power users don't
dominate the factorization.

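**Evaluation.** NDCG@10 measures ranking quality: whether the items with the
highest held-out playtime appear near the top of each user's predicted ranking.
Because `predict_als` zeroes the scores of items absent from the evaluation
matrix, the metric here mostly reflects how well the model orders each user's
held-out games rather than how well it separates played from unplayed games.
Ranking metrics are more informative than accuracy for recommender systems,
where the top-K list is what the user sees.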
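
For intuition about the metric, the sketch below computes NDCG@k by hand for a single made-up user, using linear gains and log2 discounts. The function `ndcg_at_k` is a simplified illustration; `sklearn.metrics.ndcg_score` additionally averages over users and treats tied scores differently.

```python
import numpy as np


def ndcg_at_k(y_true, y_score, k):
    """NDCG@k for one user, with linear gains and log2 position discounts."""
    order = np.argsort(y_score)[::-1][:k]              # top-k items by predicted score
    discounts = 1.0 / np.log2(np.arange(2, k + 2))     # 1/log2(rank + 1) for ranks 1..k
    dcg = np.sum(y_true[order] * discounts[:len(order)])
    ideal = np.sort(y_true)[::-1][:k]                  # best achievable ordering
    idcg = np.sum(ideal * discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0


y_true = np.array([3.0, 2.0, 0.0, 1.0])     # made-up held-out relevance (e.g. log playtime)
y_score = np.array([0.9, 0.2, 0.5, 0.4])    # made-up model scores

score = ndcg_at_k(y_true, y_score, k=3)
perfect = ndcg_at_k(y_true, y_true, k=3)

print(round(score, 3), perfect)
```

Here the model ranks an irrelevant item second, so the discounted gain is 3 + 0 + 0.5 = 3.5 against an ideal of about 4.76, giving a score of about 0.735; ranking by the true relevance gives 1.0.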