🛒 Simple recommender with matrix factorization, graph, and NLP. Beating the regular collaborative filtering baseline.

eugeneyan, updated 🕥 2023-02-10 23:08:42


Undocumented code for a personal project on simple recsys via matrix factorization (Part 1), and NLP and graph techniques (Part 2). Shared as a follow-along for the meet-up below.

Associated articles:
- Part 1: Building a Strong Baseline Recommender in PyTorch
- Part 2: Beating the Baseline Recommender with Graph & NLP in Pytorch

Talk and Slides:
- DataScience SG Meetup - RecSys, Beyond the Baseline
- Slideshare


Electronics and books data from the Amazon dataset (May 1996 – July 2014) was used. Here's what an example JSON entry looks like:

```
{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related": {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", ..., "B007R2RM8W"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", ..., "B00BFXLZ8M"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
```
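The product graph in Part 2 is built from these `related` fields. Here's a minimal sketch of extracting product-pair edges from one such entry (the helper and relation choices are illustrative, not the repo's prep script):

```python
# Hypothetical entry with the same shape as the example above (lists truncated)
entry = {
    "asin": "0000031852",
    "related": {
        "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
        "also_viewed": ["B008F0SU0Y"],
        "bought_together": ["B002BZX8Z6"],
    },
}

def extract_edges(entry, relations=("also_bought", "bought_together")):
    """Emit (product, related_product) edges for the chosen relation types."""
    edges = []
    for relation in relations:
        for other in entry.get("related", {}).get(relation, []):
            edges.append((entry["asin"], other))
    return edges

edges = extract_edges(entry)
# [('0000031852', 'B00JHONN1S'), ('0000031852', 'B002BZX8Z6'), ('0000031852', 'B002BZX8Z6')]
```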

Comparing Matrix Factorization to Skip-gram (Node2Vec)

Overall results for Electronics dataset

| | All Products | Seen Products Only | Runtime (min) |
|---|---|---|---|
| PyTorch Matrix Factorization | 0.7951 | - | 45 |
| Node2Vec | NA | NA | NA |
| Gensim Word2Vec | 0.9082 | 0.9735 | 2.58 |
| PyTorch Word2Vec | 0.9554 | 0.9855 | 23.63 |
| PyTorch Word2Vec with Side Info | NA | NA | NA |
| PyTorch Matrix Factorization With Sequences | 0.9320 | - | 70.39 |
| Alibaba Paper* | 0.9327 | - | - |

Overall results for Books dataset

| | All Products | Seen Products Only | Runtime (min) |
|---|---|---|---|
| PyTorch Matrix Factorization | 0.4996 | - | 1353.12 |
| Gensim Word2Vec | 0.9701 | 0.9892 | 16.24 |
| PyTorch Word2Vec | 0.9775 | - | 122.66 |
| PyTorch Word2Vec with Side Info | NA | NA | NA |
| PyTorch Matrix Factorization With Sequences | 0.7196 | - | 1393.08 |

*Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba

1. Matrix Factorization (iteratively pair by pair)

At a high level, for each pair:

  • Get the embedding for each product
  • Multiply embeddings and sum the resulting vector (this is the prediction)
  • Reduce the difference between predicted score and actual score (via gradient descent and a loss function like mean squared error or BCE)

Here's some pseudo-code on how it would work.

```
for product_pair, label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)

    # Predict product-pair score (interaction term and sum)
    prediction = sig(sum(product1_emb * product2_emb, dim=1))
    l2_reg = reg_lambda * sum(embedding.weight ** 2)

    # Minimize loss via backprop
    loss = BinaryCrossEntropyLoss(prediction, label)
    loss += l2_reg
    loss.backward()
    optimizer.step()
```



For the training schedule, we run it over 5 epochs with cosine annealing. In each epoch, the learning rate starts high (0.01) and drops rapidly to a minimum value near zero, before being reset for the next epoch.
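The per-epoch schedule can be sketched as a small function (a hand-rolled sketch with illustrative defaults; the actual training code may use a built-in scheduler instead):

```python
import math

def cosine_annealing_lr(step, steps_per_epoch, lr_max=0.01, lr_min=1e-5):
    """Learning rate within the current epoch: starts at lr_max, decays along a
    cosine curve towards lr_min, then resets at the next epoch boundary."""
    t = step % steps_per_epoch  # position within the current epoch
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / steps_per_epoch))

# Start of an epoch -> lr_max; end of an epoch -> near lr_min; then it resets.
```

In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR` provides this same per-epoch shape.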

One epoch seems sufficient to achieve close to optimal ROC-AUC.

However, if we look at the precision-recall curves below, we see that at around 0.5 we hit the “cliff of death”. If we estimate the threshold slightly too low, precision drops from close to 1.0 to 0.5; slightly too high and recall is poor.

2. Matrix Factorization with Bias

Adding bias reduces the steepness of the curves where they intersect, making it more production-friendly. (Though AUC-ROC decreases slightly, this implementation is preferable.)
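How a bias term changes the prediction can be sketched in a few lines of NumPy (toy sizes and values; this is not the repo's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_products, emb_dim = 4, 3
rng = np.random.default_rng(42)
emb = rng.normal(scale=0.1, size=(n_products, emb_dim))
bias = np.zeros(n_products)  # one learnable bias per product
global_bias = 0.0

def predict_with_bias(p1, p2):
    """Pair score = sigmoid(dot product + both product biases + global bias).
    The biases soak up overall popularity, softening the score distribution."""
    return sigmoid(emb[p1] @ emb[p2] + bias[p1] + bias[p2] + global_bias)
```

With all biases at zero this reduces to the plain dot-product model; during training the biases would be learned alongside the embeddings.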

3. Node2Vec

I tried using the implementation of Node2Vec here, but it was too memory-intensive and slow. It didn't run to completion, even on a 64GB instance.

Digging deeper, I found that it generated sequences by traversing the graph. If you allowed networkx to use multiple threads, it would spawn multiple processes to create sequences and cache them temporarily in memory. In short, it was very memory-hungry. Overall, this didn't work for the datasets I had.

4. gensim.word2vec

Gensim has an implementation of w2v that takes in a list of sequences and can be multi-threaded. It was very easy to use and the fastest to complete five epochs.

But the precision-recall curve shows a sharp cliff around threshold == 0.73. This is due to out-of-vocabulary products in our validation datasets (which don't have embeddings).
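The seen-products-only evaluation amounts to filtering out validation pairs that contain out-of-vocabulary products before scoring. A minimal sketch (the vocabulary and product IDs below are hypothetical):

```python
# Hypothetical vocabulary of products that received embeddings during training
vocab = {"B001", "B002", "B003"}

# Hypothetical validation pairs: (product_a, product_b, label)
val_pairs = [("B001", "B002", 1), ("B002", "B999", 0), ("B003", "B001", 1)]

# Keep only pairs where both products have learned embeddings
seen_only = [(a, b, y) for a, b, y in val_pairs if a in vocab and b in vocab]
# ("B002", "B999", 0) is dropped: B999 is out-of-vocabulary
```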

If we only evaluate in-vocabulary items, performance improves significantly.

5. PyTorch word2vec

We implement Skip-gram in PyTorch. Here's some simplified code showing how it works.

```
class SkipGram(nn.Module):
    def __init__(self, emb_size, emb_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)
        self.context_embeddings = nn.Embedding(emb_size, emb_dim, sparse=True)

    def forward(self, center, context, neg_context):
        # Look up embeddings for center, context, and negative-context products
        emb_center = self.center_embeddings(center)
        emb_context = self.context_embeddings(context)
        emb_neg_context = self.context_embeddings(neg_context)

        # Get score for positive pairs
        score = torch.sum(emb_center * emb_context, dim=1)
        score = -F.logsigmoid(score)

        # Get score for negative pairs
        neg_score = torch.bmm(emb_neg_context, emb_center.unsqueeze(2)).squeeze()
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)

        # Return combined score
        return torch.mean(score + neg_score)
```


It performed better than gensim when considering all products.

If considering only seen products, it's still an improvement, but less dramatic.

When examining the learning curves, it seems that a single epoch is sufficient. In contrast to the learning curves from matrix factorization (implementation 1), the AUC-ROC doesn't drop drastically with each learning rate reset.

6. PyTorch word2vec with side info

Why did we build the skip-gram model from scratch? Because we wanted to extend it with side information (e.g., brand, category, price).

| | B001T9NUFS | B003AVEU6G | B007ZN5Y56 | ... | B007ZN5Y56 |
|---|---|---|---|---|---|
| Product | Television | Sound bar | Lamp | ... | Standing Fan |
| Brand | Sony | Sony | Phillips | ... | Dyson |
| Price | 500 – 600 | 200 – 300 | 50 – 75 | ... | 300 – 400 |

Perhaps by learning on these we can create better embeddings?
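One common way to use side information (in the spirit of the Alibaba paper referenced above) is to pool the product embedding with its side-info embeddings, so products sharing a brand or category start with similar vectors. A toy NumPy sketch (table sizes, IDs, and mean-pooling are illustrative assumptions):

```python
import numpy as np

emb_dim = 4
rng = np.random.default_rng(7)

# Separate embedding tables for product IDs and each side-info field (toy sizes)
product_emb = rng.normal(size=(10, emb_dim))
brand_emb = rng.normal(size=(5, emb_dim))
category_emb = rng.normal(size=(8, emb_dim))

def combined_embedding(product_id, brand_id, category_id):
    """Mean-pool the product embedding with its side-info embeddings."""
    return (product_emb[product_id] + brand_emb[brand_id] + category_emb[category_id]) / 3.0

vec = combined_embedding(0, 1, 2)
```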

Unfortunately, it didn't work out. Here's how the learning curve looks.

One possible reason for this non-result is the sparsity of the metadata. Of the 418,749 electronic products, we only had metadata for 162,023 (39%). Of these, brand was 51% empty.

7. Sequences + Matrix Factorization

Why did the w2v approach do so much better than matrix factorization? Was it due to the skipgram model, or due to the training data format (i.e., sequences)?

To understand this better, I tried the previous matrix factorization with bias implementation (AUC-ROC = 0.7951) with the new sequences and dataloader. It worked very well.
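How sequences become training pairs can be sketched simply: each product is paired with its neighbours within a context window (the window size and sequence below are illustrative):

```python
def sequence_to_pairs(sequence, window=2):
    """Emit (center, context) positive pairs from a product sequence,
    pairing each product with neighbours up to `window` steps away."""
    pairs = []
    for i, center in enumerate(sequence):
        for j in range(max(0, i - window), min(len(sequence), i + window + 1)):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

pairs = sequence_to_pairs(["A", "B", "C", "D"], window=1)
# [('A', 'B'), ('B', 'A'), ('B', 'C'), ('C', 'B'), ('C', 'D'), ('D', 'C')]
```

Negative pairs would then be drawn by sampling random products, as in the skip-gram setup.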

Oddly though, the matrix factorization approach still exhibits the effect of "forgetting" as the learning rate resets with each epoch (Fig. 9), though not as pronounced as Figure 3 in the previous post.

I wonder if this is due to using the same embeddings for both center and context.


Questions around end-to-end execution

opened on 2023-03-16 18:13:23 by Chris-hughes10

Hi @eugeneyan,

This is fantastic work! I am looking to run the code alongside reading your blog posts; specifically, reproducing the results on the electronics dataset. I was wondering if you could help clarify a couple of points.

In the shell script, there are many references to electronics_edges_val_samp.csv , but following the prep steps, this doesn't seem to have been created. Where should this take place?

Additionally, when I try to run train_torch_embedding I get the error: No such file or directory: 'model/word2id', should this be created in one of the prep steps?

Thanks for your help with this!

Bump ipython from 7.16.3 to 8.10.0

opened on 2023-02-10 23:08:41 by dependabot[bot]

Bumps ipython from 7.16.3 to 8.10.0.

Eugene Yan

I build machine learning systems at a bookstore and write at eugeneyan.com and applyingml.com.


recommender-system pytorch nlp graph matrix-factorization