End-to-end recommender system pipeline: ingesting 5M+ Steam playtime records, applying implicit-feedback ALS, and achieving 79.79% NDCG@10 on held-out data.
Author
Matthew Burns
Published
January 19, 2026
Modified
May 26, 2026
TL;DR
- The objective was to build a collaborative-filtering recommender system for Steam games using an implicit-feedback Alternating Least Squares (ALS) model.
- The full pipeline covers data ingestion from a raw gzipped JSON-lines file, cleaning and reshaping 5M+ user-item interactions, hyperparameter tuning via grid search, and final evaluation.
- Key result: the model achieved NDCG@10 = 79.79% on a held-out test set.
What This Demonstrates
- End-to-end DS workflow: data ingestion → cleaning/reshaping → modeling → evaluation
- Working with real-world messy nested data (Steam user-item histories)
- Recommender-system fundamentals:
  - Implicit feedback (playtime) vs explicit ratings
  - Sparse user-item matrices
  - ALS factorization and practical tuning
- Pragmatic model validation using ranking metrics (NDCG@K)
Part 1 — Data Collection
Much of the code in this project is organized into functions, even where a one-liner would work. This reflects a preference for functional-style data transformations: it makes each step easier to reason about and keeps the pipeline explicit when all transformations are composed together later.
Data Source
The dataset comes from Julian McAuley and Wang-Cheng Kang at UC San Diego, hosted on McAuley’s website. The V1 User-Items dataset describes the hours that Australian Steam users played each of their games. It arrives as a .json.gz file — 70.6 MB compressed, 527.5 MB extracted.
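```python
from pathlib import Path

import requests

# Project paths and dataset URL used by every step in Part 1.
root_dir = Path.cwd().resolve().parent
raw_data_dir = root_dir / 'data' / 'raw'
raw_data_dir.mkdir(parents=True, exist_ok=True)
GZIP_PATH = raw_data_dir / 'ucsd_playtime.json.gz'
URL = 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'


def download(verbose: bool) -> None:
    """Download the gzipped dataset and write it to GZIP_PATH."""
    resp = requests.get(URL)
    resp.raise_for_status()
    with GZIP_PATH.open('wb') as file:
        size = file.write(resp.content)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{GZIP_PATH.relative_to(root_dir)}'")


download(verbose=True)
# Wrote 70.6 MB to 'data/raw/ucsd_playtime.json.gz'
```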
Extract from GZIP
```python
import gzip

JSON_PATH = raw_data_dir / 'ucsd_playtime.json'


def extract_from_gzip(verbose: bool) -> None:
    """Decompress GZIP_PATH to JSON_PATH in 1 MB chunks to keep memory use flat."""
    chunk_size = 1048576  # 1 MB
    size = 0
    with gzip.open(GZIP_PATH) as gzip_file, JSON_PATH.open('wb') as json_file:
        while True:
            chunk = gzip_file.read(chunk_size)
            if not chunk:
                break
            size += json_file.write(chunk)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{JSON_PATH.relative_to(root_dir)}'")


extract_from_gzip(verbose=True)
# Wrote 527.5 MB to 'data/raw/ucsd_playtime.json'
```
Handling the JSON-by-Line Format
The file has 88,310 records, one per line — but each line is a Python dict literal with single quotes, not valid JSON. Calling pd.read_json() or json.load() on it fails immediately.
This format is almost certainly an artifact of how McAuley & Kang collected it: a script that appended each API response to the file one record at a time. That’s sensible for fault-tolerant scraping but leaves us with two problems:
1. Records must be split by newline, not treated as a single JSON document.
2. Each line is plain text and must be converted to a Python dict; ast.literal_eval() handles both the single-quote quoting and the nested structure cleanly.
The nested items list will be expanded in Part 2. For now, we parse all 88,310 records into a DataFrame and pickle it (pickle preserves the nested list column; feather does not).
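To make the parsing problem concrete, here is a minimal sketch that runs both parsers on a single line. The record is a shortened, made-up stand-in written in the same style as the raw file, not an actual row from the dataset.

```python
import ast
import json

# Hypothetical one-line record: a Python dict literal with single-quoted
# keys and strings, which is how each line of the raw file is written.
line = "{'user_id': 'example_user', 'items_count': 1, 'items': [{'item_id': '4000', 'playtime_forever': 4548}]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings, so this branch is taken.
    parsed_as_json = False

# literal_eval only evaluates Python literals (dicts, lists, strings, numbers),
# so it does not execute arbitrary code the way eval() would.
record = ast.literal_eval(line)

print(parsed_as_json)                 # False
print(record['items'][0]['item_id'])  # 4000
```

Using `ast.literal_eval()` rather than rewriting quotes by hand also avoids breaking strings that contain apostrophes, such as game titles.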
```python
import ast

import pandas as pd
from tqdm import tqdm


def parse_json_by_line(verbose: bool) -> pd.DataFrame:
    """Parse one Python dict literal per line into a DataFrame."""
    with JSON_PATH.open('r', encoding='utf-8') as file:
        if verbose:
            # Count lines first so the progress bar knows its total.
            total = sum(1 for _ in file)
            file.seek(0)
            file = tqdm(file, 'Parsing JSON-by-Line', total, colour='green')
        records = []
        for line in file:
            line = line.strip()
            if not line:
                continue
            records.append(ast.literal_eval(line))
    return pd.DataFrame.from_records(records)


df = parse_json_by_line(verbose=True)
# 88,310 rows · 5 columns: user_id, items_count, steam_id, user_url, items

PICKLE_PATH = raw_data_dir / 'ucsd_playtime.pkl'
df.to_pickle(PICKLE_PATH)
# Wrote 488.93 MB to 'data/raw/ucsd_playtime.pkl'
```
Part 2 — Data Preparation
The raw data has five issues to resolve before it can feed a recommender model:
1. Naming confusion. The user_id column is actually a username; steam_id is the true unique integer identifier. We drop the username column and rename steam_id to user_id.
2. Nested structure. Each row is one user with a list of items. We need one row per user–item pair.
3. The cold-start problem. Users with very few interactions produce unstable recommendations. We drop users with fewer than 10 games.
4. Heavy-tailed playtime. A handful of users have logged tens of thousands of hours. We apply log(1 + playtime) to compress the tail.
5. Non-contiguous IDs. The ALS implementation expects 0-indexed integer IDs. We build and save mapping files so original IDs remain recoverable.
Separate functions save the user and item ID maps (with item names) as feather files so original Steam IDs remain recoverable after modeling.
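The preparation steps above can be sketched on a toy frame. This is an illustration under stated assumptions, not the project's own functions: the two users, their playtimes, and the lowered threshold of 2 items are invented so the effect of each step is visible.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw frame: one row per user, items nested in a list.
raw = pd.DataFrame({
    'user_id': [7001, 7002],
    'items': [
        [{'item_id': 4000, 'playtime_forever': 4548},
         {'item_id': 221100, 'playtime_forever': 48}],
        [{'item_id': 4000, 'playtime_forever': 0}],
    ],
})

# Nested structure -> one row per user-item pair.
long = raw.explode('items').reset_index(drop=True)
items = pd.json_normalize(long['items'].tolist())
long = pd.concat([long.drop(columns='items'), items], axis=1)

# Cold-start filter (threshold lowered to 2 for the toy data; the article uses 10).
min_items = 2
counts = long.groupby('user_id')['item_id'].transform('count')
long = long[counts >= min_items].reset_index(drop=True)

# Heavy tail -> log(1 + playtime).
long['playtime'] = np.log1p(long['playtime_forever'])

# Non-contiguous IDs -> 0-indexed codes, keeping the lookup table for later.
item_ids = sorted(long['item_id'].unique())
item_map = {original: code for code, original in enumerate(item_ids)}
long['item_idx'] = long['item_id'].map(item_map)

print(long[['user_id', 'item_id', 'item_idx', 'playtime']])
```

User 7002 owns a single game and is removed by the filter, which is the same effect the 10-game threshold has on the real data.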
Train / Validation / Test Splits
Standard random splits don’t work here — if a user appears only in the test set, the model has never seen them. Instead we split each user’s items independently, so every user is present in all three splits.
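A minimal sketch of the per-user split idea, assuming a 60/20/20 split and two hypothetical users; the helper name `split_user_items` is invented for this illustration and operates on plain lists rather than DataFrames.

```python
import numpy as np


def split_user_items(item_ids, test_size=0.2, valid_size=0.2, rng=None):
    """Shuffle one user's items and cut them into train/valid/test lists."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(item_ids)
    n_test = max(1, round(n * test_size))    # at least one test item per user
    n_valid = max(1, round(n * valid_size))  # at least one validation item per user
    shuffled = rng.permutation(item_ids)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train.tolist(), valid.tolist(), test.tolist()


# Hypothetical users, each with at least 10 items (the cold-start threshold).
user_items = {0: list(range(10)), 1: list(range(100, 115))}
rng = np.random.default_rng(0)
splits = {u: split_user_items(items, rng=rng) for u, items in user_items.items()}

for u, (train, valid, test) in splits.items():
    print(u, len(train), len(valid), len(test))
# 0 6 2 2
# 1 9 3 3
```

Because every user has at least 10 items after the cold-start filter, each user keeps at least 6 items for training under these split sizes.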
With 57,333 users × 10,976 items the full user–item matrix has 629M elements. Training on the full dataset during exploration would be slow, so we also generate smaller user samples (0.1%, 1%, and 10%) for faster iteration:
| Sample | Users | Items | Matrix Elements |
|--------|------:|------:|----------------:|
| 0.1% | 57 | 1,590 | 90,630 |
| 1% | 573 | 4,110 | 2,355,030 |
| 10% | 5,733 | 7,911 | 45,353,763 |
| 100% | 57,333 | 10,976 | 629,287,008 |
All modeling in Part 3 uses the 1% sample.
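As a quick check on the scale involved, the arithmetic below recomputes the full matrix size and its density. The interaction count is the post-filter figure reported in the project source (5,038,365); the float64 storage size is an assumption made only to illustrate why a sparse format matters.

```python
n_users, n_items = 57_333, 10_976
n_interactions = 5_038_365   # user-item pairs remaining after the cold-start filter

n_cells = n_users * n_items
density = n_interactions / n_cells
dense_gib = n_cells * 8 / 1024**3   # assuming float64, 8 bytes per cell

print(f'{n_cells:,} cells')                      # 629,287,008 cells
print(f'{density:.2%} of cells are observed')    # 0.80% of cells are observed
print(f'{dense_gib:.1f} GiB if stored densely')  # 4.7 GiB if stored densely
```

With under 1% of cells observed, a CSR matrix stores only the nonzero entries, which is why the model is fed sparse matrices.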
Part 3 — Implicit ALS Model
Collaborative filtering recommends items based on the behavior of similar users. Alternating Least Squares (Hu et al., 2008) factorizes the user–item matrix into latent user and item factor matrices. Because we have playtime (not star ratings), this is an implicit feedback problem, in which the confidence weight \(c_{ui}\) controls how strongly the model trusts each observation.
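The role of the confidence weight can be illustrated with a small NumPy sketch of the weighted objective. It assumes the form \(c_{ui} = 1 + \alpha r_{ui}\) from Hu et al., with \(r_{ui}\) the log playtime; the exact weighting inside the `implicit` library may differ, and the function name `implicit_als_loss` is made up for this example.

```python
import numpy as np


def implicit_als_loss(R, X, Y, alpha, lam):
    """Confidence-weighted squared error plus an L2 penalty on both factor matrices."""
    P = (R > 0).astype(float)   # preference: 1 where any playtime was observed, else 0
    C = 1.0 + alpha * R         # confidence grows with log playtime (assumed form)
    err = P - X @ Y.T
    return float((C * err**2).sum() + lam * ((X**2).sum() + (Y**2).sum()))


# Two users, two items, one observed interaction each.
R = np.array([[2.0, 0.0],
              [0.0, 1.0]])

# Factors that reproduce the preferences exactly: only the penalty remains.
X = np.eye(2)
Y = np.eye(2)
print(implicit_als_loss(R, X, Y, alpha=0.5, lam=10.0))   # 40.0

# All-zero factors: every observed pair is missed, weighted by its confidence.
Z = np.zeros((2, 2))
print(implicit_als_loss(R, Z, Z, alpha=0.5, lam=10.0))   # 3.5
```

In the second case the missed pair with the larger playtime contributes 2.0 to the loss and the other contributes 1.5, which shows how confidence makes errors on heavily played games cost more.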
Heatmaps of pairwise parameter interactions confirm that regularization is the dominant factor — λ = 10 outperforms all other values by a wide margin, while the model is relatively robust to moderate changes in factor count and alpha.
Final Model
Train on the combined train + validation sets using the best configuration, then evaluate on the held-out test set:
```python
R_train_full = R_train + R_valid  # no overlap, so simple addition is safe
model = ALSModel(factors=64, regularization=10.0, alpha=0.5, iterations=15, random_state=0)
model.fit(R_train_full, show_progress=False)

R_pred = predict_als(model, R_test)
ndcg_10 = ndcg_score(R_test.toarray(), R_pred, k=10)
print(f'Final NDCG@10: {ndcg_10:.2%}')
# Final NDCG@10: 79.79%
```
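Final NDCG@10: 79.79% on the held-out test split. The model ranks each user's unseen games well despite the sparsity of the 1% sample, although the score sits about ten points below the best validation result (89.47%), which was the figure used to select the hyperparameters.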
Key Implementation Notes
Cold-start mitigation. Filtering to users with ≥ 10 interactions before training reduces noise from extremely sparse users and makes evaluation more meaningful — a model can’t be meaningfully evaluated on users it has almost no signal for.
Implicit feedback formulation. Playtime is treated as a confidence-weighted preference signal, not an explicit rating. The ALS confidence weight \(c_{ui}\) lets the model distinguish “user played this for 1000 hours” from “user played this for 1 hour” without treating either as a negative signal. The log transform log(1 + playtime) additionally compresses the heavy tail so power users don’t dominate the factorization.
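Evaluation. NDCG@10 measures ranking quality: whether the items a user actually played appear near the top of the predicted ranking. It is more informative than accuracy for recommender systems, where the top-K list is what the user sees. Because predict_als zeroes the scores of items outside the evaluated split, the reported figure largely reflects how well the model orders each user's held-out games rather than how well it retrieves them from the full catalog.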
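For readers who want to see what the metric computes, here is a small hand-rolled NDCG@k for a single user. It is a sketch using linear gain and a log2 discount, which to my understanding matches the defaults of scikit-learn's `ndcg_score`; the relevance values and scores are invented.

```python
import numpy as np


def ndcg_at_k(relevance, scores, k=10):
    """NDCG@k for one user with graded relevance and linear gain."""
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(scores)[::-1][:k]             # items ranked by predicted score
    ideal = np.sort(relevance)[::-1][:k]             # best possible ordering
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1 / log2(rank + 1)
    dcg = (relevance[order] * discounts[:len(order)]).sum()
    idcg = (ideal * discounts[:len(ideal)]).sum()
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical log-playtime relevance for four games.
relevance = [3.0, 1.0, 0.0, 2.0]

# Scores that rank the games in the same order as their relevance.
perfect = ndcg_at_k(relevance, scores=[0.9, 0.2, 0.1, 0.5], k=3)

# Scores that put the least-played of the three relevant games first.
swapped = ndcg_at_k(relevance, scores=[0.2, 0.9, 0.1, 0.5], k=3)

print(perfect)   # 1.0
print(swapped < perfect)   # True
```

The score drops below 1 as soon as a heavily played game is ranked beneath a lightly played one, which is the behavior the top-10 list should be penalized for.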
Source Code
---
title: Steam Recommender System
subtitle: Recommending Games Based on User Playtime
description: "End-to-end recommender system pipeline: ingesting 5M+ Steam playtime records, applying implicit-feedback ALS, and achieving 79.79% NDCG@10 on held-out data."
author: Matthew Burns
date: 2026-01-19
date-modified: 2026-05-26
image: sparse_matrix_thumbnail.svg
include-before-body:
  text: '<img src="sparse_matrix_header.svg" style="width:100%; margin-bottom:1rem;">'
categories:
  - Data Science
  - Machine Learning
  - Recommender Systems
  - Video Games
format:
  html:
    toc: true
    toc-depth: 3
    code-fold: true
    code-tools: true
    page-layout: full
execute:
  eval: false
freeze: auto
---

::: {.callout-note title="TL;DR"}
- The objective was to build a **collaborative-filtering recommender system** for Steam games using an **implicit-feedback Alternating Least Squares (ALS)** model.
- The full pipeline covers data ingestion from a raw gzipped JSON-lines file, cleaning and reshaping 5M+ user-item interactions, hyperparameter tuning via grid search, and final evaluation.
- **Key result:** the model achieved **NDCG@10 = 79.79%** on a held-out test set.
:::

[GitHub repo](https://github.com/msburns24/Steam-Recommender-System)

## What This Demonstrates

- End-to-end DS workflow: **data ingestion → cleaning/reshaping → modeling → evaluation**
- Working with real-world messy nested data (Steam user-item histories)
- Recommender-system fundamentals:
  - **Implicit feedback** (playtime) vs explicit ratings
  - Sparse user-item matrices
  - ALS factorization and practical tuning
- Pragmatic model validation using ranking metrics (**NDCG@K**)

## Part 1 — Data Collection

Much of the code in this project is organized into functions, even where a
one-liner would work. This reflects a preference for functional-style data
transformations: it makes each step easier to reason about and keeps the
pipeline explicit when all transformations are composed together later.

### Data Source

The dataset comes from Julian McAuley and Wang-Cheng Kang at UC San Diego,
hosted on [McAuley's website](https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data).
The **V1 User-Items** dataset describes the hours that Australian Steam users
played each of their games. It arrives as a `.json.gz` file — 70.6 MB
compressed, 527.5 MB extracted.

```{python}
from pathlib import Path

root_dir = Path.cwd().resolve().parent
raw_data_dir = root_dir / 'data' / 'raw'
raw_data_dir.mkdir(parents=True, exist_ok=True)
GZIP_PATH = raw_data_dir / 'ucsd_playtime.json.gz'
URL = 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'
```

### Download

```{python}
import requests


def download(verbose: bool) -> None:
    resp = requests.get(URL)
    resp.raise_for_status()
    with GZIP_PATH.open('wb') as file:
        size = file.write(resp.content)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{GZIP_PATH.relative_to(root_dir)}'")


download(verbose=True)
# Wrote 70.6 MB to 'data/raw/ucsd_playtime.json.gz'
```

### Extract from GZIP

```{python}
import gzip

JSON_PATH = raw_data_dir / 'ucsd_playtime.json'


def extract_from_gzip(verbose: bool) -> None:
    chunk_size = 1048576  # 1 MB
    size = 0
    with gzip.open(GZIP_PATH) as gzip_file, JSON_PATH.open('wb') as json_file:
        while True:
            chunk = gzip_file.read(chunk_size)
            if not chunk:
                break
            size += json_file.write(chunk)
    if verbose:
        print(f"Wrote {size / 1024**2:,.1f} MB to '{JSON_PATH.relative_to(root_dir)}'")


extract_from_gzip(verbose=True)
# Wrote 527.5 MB to 'data/raw/ucsd_playtime.json'
```

### Handling the JSON-by-Line Format

The file has 88,310 records, one per line — but each line is a Python dict
literal with single quotes, not valid JSON. Calling `pd.read_json()` or
`json.load()` on it fails immediately.

This format is almost certainly an artifact of how McAuley & Kang collected it:
a script that appended each API response to the file one record at a time.
That's sensible for fault-tolerant scraping but leaves us with two problems:

1. Records must be split by newline, not treated as a single JSON document.
2. Each line is plain text and must be converted to a Python dict; `ast.literal_eval()` handles both the single-quote quoting and the nested structure cleanly.

A single parsed record looks like this:

```python
{
    'user_id': 'Leaf_Light_Moscow',
    'items_count': 5,
    'steam_id': '76561198305694024',
    'user_url': 'http://steamcommunity.com/id/Leaf_Light_Moscow',
    'items': [
        {'item_id': '4000', 'item_name': "Garry's Mod", 'playtime_forever': 4548, 'playtime_2weeks': 1729},
        {'item_id': '221100', 'item_name': 'DayZ', 'playtime_forever': 48, 'playtime_2weeks': 0},
        ...
    ]
}
```

The nested `items` list will be expanded in Part 2. For now, we parse all
88,310 records into a DataFrame and pickle it (pickle preserves the nested
list column; feather does not).

```{python}
import ast

import pandas as pd
from tqdm import tqdm


def parse_json_by_line(verbose: bool) -> pd.DataFrame:
    with JSON_PATH.open('r', encoding='utf-8') as file:
        if verbose:
            total = sum(1 for _ in file)
            file.seek(0)
            file = tqdm(file, 'Parsing JSON-by-Line', total, colour='green')
        records = []
        for line in file:
            line = line.strip()
            if not line:
                continue
            records.append(ast.literal_eval(line))
    return pd.DataFrame.from_records(records)


df = parse_json_by_line(verbose=True)
# 88,310 rows · 5 columns: user_id, items_count, steam_id, user_url, items

PICKLE_PATH = raw_data_dir / 'ucsd_playtime.pkl'
df.to_pickle(PICKLE_PATH)
# Wrote 488.93 MB to 'data/raw/ucsd_playtime.pkl'
```

## Part 2 — Data Preparation

The raw data has five issues to resolve before it can feed a recommender model:

1. **Naming confusion.** The `user_id` column is actually a username; `steam_id` is the true unique integer identifier. We drop the username column and rename `steam_id` to `user_id`.
2. **Nested structure.** Each row is one user with a list of items. We need one row per user–item pair.
3. **The cold-start problem.** Users with very few interactions produce unstable recommendations. We drop users with fewer than 10 games.
4. **Heavy-tailed playtime.** A handful of users have logged tens of thousands of hours. We apply `log(1 + playtime)` to compress the tail.
5. **Non-contiguous IDs.** The ALS implementation expects 0-indexed integer IDs. We build and save mapping files so original IDs remain recoverable.

### Helper: DataFrame Summary

```{python}
from typing import Optional

import numpy as np
from IPython.display import display, display_markdown


def summarize_df(df, name=None, nulls=True, head=5):
    summary = pd.DataFrame({
        'DType': df.dtypes,
        'Null': df.isna().sum().map('{:,.0f}'.format),
        'Total': len(df),
        '% Null': df.isna().mean().map('{:.2%}'.format),
    })
    if name:
        display_markdown(f'### {name}', raw=True)
    if nulls:
        display(summary)
    if head:
        display(df.head(head))
```

### Cleaning: Expand Nested Data

Three functions applied in sequence:

```{python}
def correct_naming_of_user_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop username column; rename steam_id → user_id."""
    return df.drop(columns='user_id').rename(columns={'steam_id': 'user_id'})


def expand_nested_items(df: pd.DataFrame) -> pd.DataFrame:
    df = df.explode('items').reset_index(drop=True)
    items_df = pd.json_normalize(df['items'].tolist())
    return pd.concat([df.drop(columns='items'), items_df], axis=1)


def clean_expanded_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=['items_count', 'user_url', 'playtime_2weeks'])
    df = df.rename(columns={'playtime_forever': 'playtime'})
    df = df.dropna(subset=['user_id', 'item_id']).drop_duplicates(subset=['user_id', 'item_id'])
    df = df.astype({'user_id': int, 'item_id': int, 'playtime': float})
    return df.sort_values(by=['user_id', 'item_id']).reset_index(drop=True)
```

After expanding, we have **5,094,082** user–item rows across 88,310 users and
10,976 unique games.

### Cold-Start Mitigation

```{python}
def filter_users_by_num_items(df: pd.DataFrame, min_items: int) -> pd.DataFrame:
    """Keep only users with at least min_items interactions."""
    n_items_by_user = df.groupby('user_id')['item_id'].count()
    eligible = n_items_by_user[n_items_by_user >= min_items].index
    return df[df['user_id'].isin(eligible)].reset_index(drop=True)
```

Filtering to users with ≥ 10 games retains **57,333 users** and
**5,038,365 interactions**.

### Log Playtime Transform

```{python}
def convert_to_log_playtime(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['playtime'] = np.log(1 + df['playtime'])
    return df
```

### Contiguous ID Mapping

```{python}
def map_to_continuous_id(data: pd.Series) -> pd.Series:
    current_ids = sorted(data.unique())
    return data.map(dict(zip(current_ids, range(len(current_ids)))))
```

Separate functions save the user and item ID maps (with item names) as feather
files so original Steam IDs remain recoverable after modeling.

### Train / Validation / Test Splits

Standard random splits don't work here — if a user appears only in the test
set, the model has never seen them. Instead we split *each user's items*
independently, so every user is present in all three splits.

```{python}
def train_valid_test_split_by_item(
    df: pd.DataFrame,
    test_size: float = 0.2,
    valid_size: float = 0.2,
    random_state: Optional[int] = None,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    rng = np.random.default_rng(random_state)
    train_dfs, valid_dfs, test_dfs = [], [], []
    for _, df_user in df.groupby('user_id'):
        n = len(df_user)
        n_test = max(1, round(n * test_size))
        n_valid = max(1, round(n * valid_size))
        idx = np.arange(n)
        rng.shuffle(idx)
        train_dfs.append(df_user.iloc[idx[n_test + n_valid:]])
        valid_dfs.append(df_user.iloc[idx[n_test:n_test + n_valid]])
        test_dfs.append(df_user.iloc[idx[:n_test]])
    return (
        pd.concat(train_dfs, ignore_index=True),
        pd.concat(valid_dfs, ignore_index=True),
        pd.concat(test_dfs, ignore_index=True),
    )
```

### Sample Sizes

With 57,333 users × 10,976 items the full user–item matrix has **629M elements**.
Training on the full dataset during exploration would be slow, so we also
generate smaller user samples (0.1%, 1%, and 10%) for faster iteration:

| Sample | Users | Items | Matrix Elements |
|--------|------:|------:|----------------:|
| 0.1% | 57 | 1,590 | 90,630 |
| 1% | 573 | 4,110 | 2,355,030 |
| 10% | 5,733 | 7,911 | 45,353,763 |
| 100% | 57,333 | 10,976 | 629,287,008 |

All modeling in Part 3 uses the **1% sample**.

## Part 3 — Implicit ALS Model

*Collaborative filtering* recommends items based on the behavior of similar
users. *Alternating Least Squares* (Hu et al., 2008) factorizes the user–item
matrix into latent user and item factor matrices. Because we have playtime
(not star ratings), this is an **implicit feedback** problem, in which the
confidence weight $c_{ui}$ controls how strongly the model trusts each observation.

$$
\mathcal{L}(X,Y) = \sum_{u,i} c_{ui}\bigl(p_{ui} - x_u^\top y_i\bigr)^2
+ \lambda\!\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right)
$$

where $p_{ui} = 1$ for observed pairs (and $0$ otherwise) and $c_{ui} > 1$
scales confidence on observed entries.

We use the [`implicit`](https://github.com/benfred/implicit) library, which
exposes this model via SciPy CSR matrices.

### Sparse Matrix Construction

```{python}
from scipy.sparse import csr_matrix
import json

USER_COL, ITEM_COL, RATING_COL = 'user_id', 'item_id', 'playtime'


def get_user_item_csr_matrix(df, n_users, n_items) -> csr_matrix:
    ratings = df.pivot(index=USER_COL, columns=ITEM_COL, values=RATING_COL)
    observed_users = ratings.index.values.reshape(-1, 1)
    observed_items = ratings.columns.values
    R = np.full((n_users, n_items), 0.0)
    R[observed_users, observed_items] = ratings.values
    R[np.isnan(R)] = 0
    return csr_matrix(R)
```

### Initial Training

```{python}
import matplotlib.pyplot as plt
from implicit.cpu.als import AlternatingLeastSquares as ALSModel


class LossCallback:
    def __init__(self):
        self.iterations, self.losses = [], []

    def __call__(self, iteration, time, loss):
        self.iterations.append(iteration)
        self.losses.append(loss)


loss_callback = LossCallback()
model = ALSModel(factors=32, iterations=15, calculate_training_loss=True, random_state=0)
model.fit(R_train, show_progress=False, callback=loss_callback)
```

Training loss decreases smoothly over 15 iterations with no signs of
instability — a good baseline before tuning.

### Hyperparameter Tuning

Grid search over 125 combinations (5 × 5 × 5):

| Hyperparameter | Values |
|---|---|
| Latent factors | 16, 24, 32, 48, 64 |
| Regularization λ | 0.001, 0.01, 0.1, 10, 100 |
| Confidence weight α | 0.1, 0.5, 1.0, 5.0, 10.0 |

```{python}
from itertools import product

from sklearn.metrics import ndcg_score
from numpy.typing import NDArray
from tqdm.auto import tqdm


def predict_als(model: ALSModel, R_true: csr_matrix) -> NDArray:
    R_pred = model.user_factors @ model.item_factors.T
    R_pred[R_true.toarray() == 0] = 0
    return R_pred


class ALSGridSearch:
    def __init__(self, R_train, R_valid, iterations=15, random_state=None):
        self.R_train = R_train
        self.R_valid = R_valid
        self.iterations = iterations
        self.random_state = random_state

    def run(self, factors, regularization, alpha, verbose=True):
        parameters = list(product(factors, regularization, alpha))
        if verbose:
            parameters = tqdm(parameters, 'Running grid search')
        results = [self._run_once(f, l, a) for f, l, a in parameters]
        df = pd.DataFrame(list(product(factors, regularization, alpha)),
                          columns=['factors', 'regularization', 'alpha'])
        df[['loss', 'metric']] = results
        return df

    def _run_once(self, factors, regularization, alpha):
        cb = LossCallback()
        m = ALSModel(factors=factors, regularization=regularization, alpha=alpha,
                     iterations=self.iterations, calculate_training_loss=True,
                     random_state=self.random_state)
        m.fit(self.R_train, show_progress=False, callback=cb)
        R_pred = predict_als(m, self.R_valid)
        return cb.losses[-1], ndcg_score(self.R_valid.toarray(), R_pred, k=10)
```

Top 5 results by NDCG@10 on the validation set:

| factors | regularization | alpha | NDCG@10 |
|--------:|---------------:|------:|--------:|
| 64 | 10.0 | 0.5 | 89.47% |
| 48 | 10.0 | 0.5 | 89.04% |
| 32 | 10.0 | 0.5 | 88.65% |
| 64 | 10.0 | 1.0 | 88.62% |
| 24 | 10.0 | 0.5 | 88.47% |

Heatmaps of pairwise parameter interactions confirm that regularization is the
dominant factor — λ = 10 outperforms all other values by a wide margin, while
the model is relatively robust to moderate changes in factor count and alpha.

### Final Model

Train on the combined train + validation sets using the best configuration,
then evaluate on the held-out test set:

```{python}
R_train_full = R_train + R_valid  # no overlap, so simple addition is safe
model = ALSModel(factors=64, regularization=10.0, alpha=0.5, iterations=15, random_state=0)
model.fit(R_train_full, show_progress=False)

R_pred = predict_als(model, R_test)
ndcg_10 = ndcg_score(R_test.toarray(), R_pred, k=10)
print(f'Final NDCG@10: {ndcg_10:.2%}')
# Final NDCG@10: 79.79%
```

**Final NDCG@10: 79.79%** on the held-out test split. The model ranks each
user's unseen games well despite the sparsity of the 1% sample, although the
score sits about ten points below the best validation result (89.47%), which
was the figure used to select the hyperparameters.

## Key Implementation Notes

**Cold-start mitigation.** Filtering to users with ≥ 10 interactions before
training reduces noise from extremely sparse users and makes evaluation more
meaningful — a model can't be meaningfully evaluated on users it has almost no
signal for.

**Implicit feedback formulation.** Playtime is treated as a confidence-weighted
preference signal, not an explicit rating. The ALS confidence weight $c_{ui}$
lets the model distinguish "user played this for 1000 hours" from "user played
this for 1 hour" without treating either as a negative signal. The log transform
`log(1 + playtime)` additionally compresses the heavy tail so power users don't
dominate the factorization.

**Evaluation.** NDCG@10 measures ranking quality: whether the items a user
actually played appear near the top of the predicted ranking. It is more
informative than accuracy for recommender systems, where the top-K list is what
the user sees. Because `predict_als` zeroes the scores of items outside the
evaluated split, the reported figure largely reflects how well the model orders
each user's held-out games rather than how well it retrieves them from the full
catalog.