Zeno's Notes


Posts tagged machine learning

0 notes

Large Scale Machine Learning and Other Animals: Steffen Rendle on Factorization Machines

Steffen and his method/software finally get some well-deserved attention.

Factorization Machines (FMs) are basically factorized polynomial prediction (regression/classification/ranking).

They work really really well for applications like recommendation, where the input data is sparse, and many feature combinations at prediction time (e.g. user-item pairs) are never observed during training.

And the cool thing is, you can mimic many advanced factorization models just by feature engineering for FMs. That means you can reuse the existing training algorithms — no need to derive and implement a new algorithm for a new prediction problem…

Filed under factorization machines FM machine learning data mining recsys recommender system matrix factorization regression ranking classification personalization kaggle KDD Cup 2012

0 notes

Using MyMediaLite for the Million Song Dataset Challenge

The Million Song Dataset Challenge is a contest hosted on Kaggle. Its goal is to predict the songs that 100,000 users will listen to, given their listening history and additional listening histories and data about the songs.

Predicting held-out past user choices is a proxy for another task that cannot be directly evaluated without using a live system: personalized recommendation.

MyMediaLite is a tool/library containing state-of-the-art recommendation algorithms. In this post, I explain how MyMediaLite can be used to make predictions for the Million Song Dataset Challenge.


First, you need to install MyMediaLite. Don’t worry, it is quite easy, and should work fine on Linux, Mac OS X, and Windows.

You will also need several gigabytes of disk space, the challenge datasets, and a working Unix-like environment. On Linux and Mac this should not be a problem. For Windows you could use Cygwin to get such an environment.

In the following, I assume that you have installed MyMediaLite 3.01 (it must be the latest version, because only that one contains some features we will make use of) in ~/src/MyMediaLite. If it is somewhere else, just adapt the paths below accordingly.

Data Preparation

In the MyMediaLite directory, create a directory data/millionsong, and put the unzipped competition dataset there.

cat kaggle_users.txt | perl -ne 'chomp; print "$_\t" . ++$l . "\n"' > user_id_mappings.txt
cut -f 2 user_id_mappings.txt > test_users.txt
cut -f 2 -d ' ' kaggle_songs.txt > candidate_items.txt

# create dataset
~/src/MyMediaLite/scripts/import_dataset.pl --load-user-mapping=user_id_mappings.txt --load-item-mapping=kaggle_songs.txt kaggle_visible_evaluation_triplets.txt > msd.train.txt

# create CV splits
mkdir cv
~/src/MyMediaLite/scripts/per_user_crossvalidation.pl --k=5 --filename=cv/msd < msd.train.txt

# use one split for validation
cp cv/msd-0.train.txt msd_validation.train.txt
cp cv/msd-0.test.txt msd_validation.test.txt
mkdir validation_predictions
mkdir validation_submissions

# prepare directories for prediction/submission files and logs
mkdir logs
mkdir submissions
mkdir predictions
We will try out two different methods: a non-personalized baseline method, and WRMF, a state-of-the-art collaborative filtering method. Both are already implemented in MyMediaLite.
MostPopular is really simple: it just counts in the training data how many users have listened to each song, and ranks them accordingly.
WRMF is a matrix factorization method for implicit/positive-only feedback. I will not describe the method here, if you are interested in the details, please have a look at the paper describing WRMF. There is also another paper from the same year describing a very very similar method.

Trying out Different Recommenders

Run in the MyMediaLite directory:

bin/item_recommendation --training-file=msd_validation.train.txt --test-file=msd_validation.test.txt --data-dir=data/millionsong --recommender=MostPopular --random-seed=1 --predict-items-number=500 --num-test-users=1000 --no-id-mapping --candidate-items=candidate_items.txt

You will get an output like this:

Set random seed to 1.
loading_time 1.67
memory 21
training data: 110000 users, 149052 items, 1160746 events, sparsity 99.99292
test data:     110000 users, 77330 items, 290187 events, sparsity 99.99659
training_time 00:00:00.0718350 .AUC 0.56605 prec@5 0.0078 prec@10 0.007 MAP 0.02051 recall@5 0.01875 recall@10 0.03011 NDCG 0.05008 MRR 0.02324 num_users 1000 num_items 386213 num_lists 1000 testing_time 00:00:35.3801840
memory 120

The MAP 0.02051 is the interesting piece of information: This is an estimate of how well we will perform on the leaderboard with this recommender.

The command for the WRMF recommender is similar, only that we also see results at different iterations:

k=28; cpos=28; reg=0.002; bin/item_recommendation --training-file=msd_validation.train.txt --test-file=msd_validation.test.txt  --recommender=WRMF --random-seed=1 --predict-items-number=500 --num-test-users=1000 --test-users=test_users.txt --find-iter=1 --max-iter=30 --recommender-options="num_iter=0 num_factors=$k c_pos=$cpos reg=$reg" --data-dir=data/millionsong --no-id-mapping --candidate-items=candidate_items.txt

The output will be like this (I removed some parts for better readability):

WRMF num_factors=28 regularization=0.002 c_pos=28 num_iter=0
MAP 0.00003 iteration 0
MAP 0.01106 iteration 1
MAP 0.01659 iteration 2
MAP 0.02593 iteration 3
MAP 0.03558 iteration 4
MAP 0.05341 iteration 30

Nice. This is already some improvement over the MostPopular baseline.

Creating a Submission

bin/item_recommendation --training-file=data/millionsong/msd.train.txt --recommender=MostPopular --predict-items-number=500 --prediction-file=data/millionsong/predictions/mp.pred --test-users=data/millionsong/kaggle_users.txt
k=28; cpos=28; reg=0.002; it=30; bin/item_recommendation --training-file=msd.train.txt --recommender=WRMF --random-seed=1 --predict-items-number=500 --recommender-options="num_iter=$it num_factors=$k c_pos=$cpos reg=$reg" --prediction-file=predictions/wrmf-k-$k-cpos-$cpos-reg-$reg-it-$it.pred --test-users=kaggle_users.txt --candidate-items=candidate_items.txt --data-dir=data/millionsong

MyMediaLite’s output format is a bit different from the submission file format, so I wrote a little script to convert the prediction file:

~/src/MyMediaLite/scripts/msdchallenge/create_submission.sh < predictions/wrmf-k-28-cpos-28-reg-0.002-it-30.pred > submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub

~/src/MyMediaLite/scripts/msdchallenge/create_submission.sh < predictions/mp.pred > submissions/mp.sub

It will not hurt to make sure the submission file is in the correct format (using the script provided by the organizers) before trying to upload it:

./validate_submission.py submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub
./validate_submission.py submissions/mp.sub

Compress before upload:

gzip submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub
gzip submissions/mp.sub


Now you can upload the submission files to Kaggle. I got the following results:

  • MostPopular: 0.02255
  • WRMF: 0.05654
This would put you currently (at the time of the writing of this blog post) at position 19 out of 52. Not bad, but of course there is still room for improvement.

Next Steps

I am currently preparing three further blog posts, which I will publish during the next days (links will be provided when the post are ready):

  1. using song attributes (artists)
  2. blending results from different recommenders
  3. using additional interaction information from the Million Song Dataset

The approach demonstrated here is just a simple one, relying on functions that are already available in MyMediaLite. One can think of many extensions, either using existing functionality, or implementing them using the framework provided by MyMediaLite:

  1. Improving the BPR-MF results published by the contest organizers in their AdMIRe paper.
  2. Take counts into account: in the popuarity model, in WRMF, in BPR-MF
The following pages contain more ideas/points to work on:
  1. Million Song Dataset: Breaking the Collaborative Filtering Ceiling
  2. Quora: How does Last.fm compute lists of similar artists?
  3. Cold Hard Facts: The Million Song Dataset Challenge: Part I

Want to learn more about MyMediaLite?

Have a look at the website, browse the API documentation and browse the source code on GitHub, search the Google Group archives.

Questions? Problems?

In case there are questions, do not hesitate to ask them in MyMediaLite’s Google Group or in the competition forum.

Filed under kaggle recommendations recsys mir music challenge mymedialite datascience machine learning data science data mining

8 notes

25 years of Machine Learning Journal - Free access in October

(KD Nuggets link via Nando de Freitas - @NandoDF)

Congratulations to the Machine Learning Journal for 25 years of interesting (though not open access) content.

It is nice that there is free access in October, but it is sad that it is only so in October.

As long as the Machine Learning Journal is not open access, I encourage machine learning researchers to submit their work to the Journal of Machine Learning Research instead

Filed under academics open access publication journal machine learning data analysis

1 note

Simply Statistics: Defining data science


Rebranding of statistics as a field seems to be a popular topic these days and “data science” is one of the potential rebranding options. This article over at Revolutions is a nice summary of where the term comes from and what it means. This quote seems pretty accurate:

My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.

Filed under data science statistics applied statistics data mining machine learning computer science

19 notes

Random Forests: Weka vs. R

Random forests are a really nice (because flexible, robust, scalable) machine learning method. Jeremy Howard recently reported at the O’Reilly Strata New York conference (alas, I was not there, only heard it on Twitter) that 50% of the winners of Kaggle competitions made use of random forests.

So how well perform the random forest implementations of two major (free/open source software) data analytics packages, Weka and R? Both are not exactly known for being the fastest stuff in the universe, but on the other side they offer many different methods, and a nice infrastructure to work in (Java and nice GUI in Weka’s case, a complete statistical/numerical programming language for R).

So I ran RF on a medium sized dense dataset (11 features, 150,000 training instances, about 100,000 test instances). The results are so surprisingly clear that I did not even try to make a more detailed/fair comparison:

Weka 3.7.4 (the latest version, running on Java 6) took 609.26 seconds to grow 50 trees (without I/O), whereas R 2.10.1 needed just 97.37 seconds for the same amount of trees, including making predictions for about 100,000 instances, and I/O. 

Some more details: R 2.10.1 is about two years old, I suspect that there have been performance improvements in during that time). The random forest package for R can be found here. The sample size was set to 80,000 in the R case. I also tried sample sizes of 10,000/20,000/40,000/120,000, which resultet in runtimes of 37.42 / 48.72 / 92.05 / 141.35 seconds  I did not find a way to set the sample size for Weka (Reader, maybe you know a way to do that?).
There are more free implementations of random forests that I have not had the chance to try yet:

I will blog about those as soon as I find the time to play around with them.

You can read about random forests in the Wikipedia article, or in the original paper. By the way, what is the nicest description of random forests on the web? The Wikipedia article is not that good. Any suggestions? Any volunteers to improve the Wikipedia article?

Filed under data analysis machine learning free software open source R Weka random forests decision tree ensembles Java

9 notes

Data Mining Competitions: They Are Very, Very Useful

Note: Read on if you are interested in data analysis, machine learning, or recommender systems.

At this year’s KDD conference, there was, as every year, a workshop on the KDD Cup (at which I was a participant). Additionally, and even more interesting, there was a panel about data mining competitions.

Neal Lathia wrote a really nice and thought-provoking post about this panel discussion, and shared some of his opinions about the topic. I had a different view on some of the things he said, and wanted to write a comment on his blog. After I saw that the comment would be quite long, I decided to turn it into a proper blog post.

Read more …

Filed under kdd kdd2011 kddcup competition challenge prize data mining machine learning recommender systems science engineering academics data analysis predictive analysis