Posts tagged mymedialite
The actual recommendations are then computed on the client side, so that no user data is transferred to the server.
Now that Lucene.Net graduated from the Apache Incubator, why not use it for MyMediaLite?
The Million Song Dataset Challenge is a contest hosted on Kaggle. Its goal is to predict the songs that 100,000 users will listen to, given their listening history, additional listening histories from other users, and data about the songs.
Predicting held-out past user choices is a proxy for another task that cannot be directly evaluated without using a live system: personalized recommendation.
MyMediaLite is a tool/library containing state-of-the-art recommendation algorithms. In this post, I explain how MyMediaLite can be used to make predictions for the Million Song Dataset Challenge.
First, you need to install MyMediaLite. Don’t worry, it is quite easy, and should work fine on Linux, Mac OS X, and Windows.
You will also need several gigabytes of disk space, the challenge datasets, and a working Unix-like environment. On Linux and Mac this should not be a problem. For Windows you could use Cygwin to get such an environment.
In the following, I assume that you have installed MyMediaLite 3.01 (it must be at least this version, because older ones lack some features we will make use of) in ~/src/MyMediaLite. If it is somewhere else, just adapt the paths below accordingly.
In the MyMediaLite directory, create a directory data/millionsong, and put the unzipped competition dataset there.
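In shell terms (the archive name is only a placeholder for whatever you downloaded from Kaggle):

```shell
# create the dataset directory inside the MyMediaLite checkout
mkdir -p ~/src/MyMediaLite/data/millionsong
cd ~/src/MyMediaLite/data/millionsong
# unpack the competition files here, e.g.:
# unzip ~/Downloads/<challenge-archive>.zip
```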
cat kaggle_users.txt | perl -ne 'chomp; print "$_\t" . ++$l . "\n"' > user_id_mappings.txt
cut -f 2 user_id_mappings.txt > test_users.txt
cut -f 2 -d ' ' kaggle_songs.txt > candidate_items.txt

# create dataset
~/src/MyMediaLite/scripts/import_dataset.pl --load-user-mapping=user_id_mappings.txt --load-item-mapping=kaggle_songs.txt kaggle_visible_evaluation_triplets.txt > msd.train.txt

# create CV splits
mkdir cv
~/src/MyMediaLite/scripts/per_user_crossvalidation.pl --k=5 --filename=cv/msd < msd.train.txt

# use one split for validation
cp cv/msd-0.train.txt msd_validation.train.txt
cp cv/msd-0.test.txt msd_validation.test.txt
mkdir validation_predictions
mkdir validation_submissions

# prepare directories for prediction/submission files and logs
mkdir logs
mkdir submissions
mkdir predictions
Trying out Different Recommenders
Run in the MyMediaLite directory:
bin/item_recommendation --training-file=msd_validation.train.txt --test-file=msd_validation.test.txt --data-dir=data/millionsong --recommender=MostPopular --random-seed=1 --predict-items-number=500 --num-test-users=1000 --no-id-mapping --candidate-items=candidate_items.txt
You will get an output like this:
Set random seed to 1.
loading_time 1.67
memory 21
training data: 110000 users, 149052 items, 1160746 events, sparsity 99.99292
test data: 110000 users, 77330 items, 290187 events, sparsity 99.99659
MostPopular
training_time 00:00:00.0718350
AUC 0.56605 prec@5 0.0078 prec@10 0.007 MAP 0.02051 recall@5 0.01875 recall@10 0.03011 NDCG 0.05008 MRR 0.02324 num_users 1000 num_items 386213 num_lists 1000
testing_time 00:00:35.3801840
memory 120
The MAP value of 0.02051 is the interesting piece of information: it is an estimate of how well this recommender will perform on the leaderboard.
The command for the WRMF recommender is similar, except that we also want to see results at different iterations:
k=28; cpos=28; reg=0.002; bin/item_recommendation --training-file=msd_validation.train.txt --test-file=msd_validation.test.txt --recommender=WRMF --random-seed=1 --predict-items-number=500 --num-test-users=1000 --test-users=test_users.txt --find-iter=1 --max-iter=30 --recommender-options="num_iter=0 num_factors=$k c_pos=$cpos reg=$reg" --data-dir=data/millionsong --no-id-mapping --candidate-items=candidate_items.txt
The output will be like this (I removed some parts for better readability):
WRMF num_factors=28 regularization=0.002 c_pos=28 num_iter=0
MAP 0.00003 iteration 0
MAP 0.01106 iteration 1
MAP 0.01659 iteration 2
MAP 0.02593 iteration 3
MAP 0.03558 iteration 4
...
MAP 0.05341 iteration 30
Nice. This is already some improvement over the MostPopular baseline.
Creating a Submission
bin/item_recommendation --training-file=data/millionsong/msd.train.txt --recommender=MostPopular --predict-items-number=500 --prediction-file=data/millionsong/predictions/mp.pred --test-users=data/millionsong/kaggle_users.txt
k=28; cpos=28; reg=0.002; it=30; bin/item_recommendation --training-file=msd.train.txt --recommender=WRMF --random-seed=1 --predict-items-number=500 --recommender-options="num_iter=$it num_factors=$k c_pos=$cpos reg=$reg" --prediction-file=predictions/wrmf-k-$k-cpos-$cpos-reg-$reg-it-$it.pred --test-users=kaggle_users.txt --candidate-items=candidate_items.txt --data-dir=data/millionsong
MyMediaLite’s output format is a bit different from the submission file format, so I wrote a little script to convert the prediction file:
~/src/MyMediaLite/scripts/msdchallenge/create_submission.sh < predictions/wrmf-k-28-cpos-28-reg-0.002-it-30.pred > submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub
~/src/MyMediaLite/scripts/msdchallenge/create_submission.sh < predictions/mp.pred > submissions/mp.sub
It will not hurt to make sure the submission file is in the correct format (using the script provided by the organizers) before trying to upload it:
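The organizers' verification script itself is not reproduced here; as a quick additional sanity check (a sketch, assuming the file names used above), you can at least verify that the submission has exactly one line per user in kaggle_users.txt:

```shell
# the two counts should be identical
wc -l < submissions/mp.sub
wc -l < kaggle_users.txt
```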
Compress before upload:
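For example with gzip (using -c so the uncompressed file is kept around):

```shell
gzip -c submissions/mp.sub > submissions/mp.sub.gz
gzip -c submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub > submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub.gz
```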
Now you can upload the submission files to Kaggle. I got the following results:
I am currently preparing three further blog posts, which I will publish over the next few days (links will be provided when the posts are ready):
The approach demonstrated here is just a simple one, relying on functions that are already available in MyMediaLite. One can think of many extensions, either using existing functionality, or implementing them using the framework provided by MyMediaLite:
Want to learn more about MyMediaLite?
The script expects two files: one prediction file and one file containing the ground truth (actual clicks). The prediction file is similar to the submission file format; the only difference is that it should contain exactly one recommendation list per user. The ground truth file is a “standard” rating file, just like
How to call the script:
./evaluate.pl --prediction-file=pred --groundtruth-file=rec_log_train.last0.1.txt
I created the file rec_log_train.last0.1.txt using the command
tail -n 7320927 rec_log_train.txt > rec_log_train.last0.1.txt
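The hard-coded line number corresponds to roughly the last 10% of the training log. If you prefer to compute the cut-off instead of hard-coding it, a sketch:

```shell
# take the last tenth of the file, whatever its length
total=$(wc -l < rec_log_train.txt)
tail -n $(( total / 10 )) rec_log_train.txt > rec_log_train.last0.1.txt
```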
The script is written in Perl, which should be installed by default on a typical Linux or Mac OS X machine. You can download it from GitHub: https://github.com/zenogantner/MyMediaLite/blob/master/scripts/kddcup2012/evaluate.pl
Of course, I cannot guarantee that it works correctly. If you have questions or suggestions for improvement, do not hesitate to contact me.
Bob Carpenter on speed differences between languages. Particularly interesting for me because he is also a machine learning (in this case: for natural language processing) guy. Lots of data, numerical code, etc. Pretty much the same as we have in our recommender system library MyMediaLite, just a different application.
… some new things about git: How to push things to remote branches, and how to push things to several remote repositories at once (thanks to Carsten for the hint).
Currently, I am re-working some underlying data structures of MyMediaLite, the recommender system library that I (well, mostly me) develop. This breaks things BIG TIME, and for quite a while. In other words, a really good case for branching.
I knew how local branching and merging works, but I had no idea how to push the contents of branches to a remote repository without pushing it to the remote master branch.
It turns out it is quite easy:
git push email@example.com:mymedialite/mymedialite.git new_ratings:new_ratings
Well, everything looks easy with git once you find out how to do it. It was not obvious to me, and reading the man pages also did not really help.
Pushing to Several Remote Repositories
[remote "MML"]
	url = firstname.lastname@example.org:mymedialite/mymedialite.git
	url = email@example.com:zenogantner/MyMediaLite.git
I can then use the MML alias to push to both repositories:
git push MML
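Instead of editing .git/config by hand, the same setup can also be created on the command line (recent git versions support --add for set-url; remote name and URLs as above):

```shell
# create the remote with the first URL, then attach the second one
git remote add MML firstname.lastname@example.org:mymedialite/mymedialite.git
git remote set-url --add MML email@example.com:zenogantner/MyMediaLite.git
# both URLs should now show up
git remote -v
```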
Combining Both Features
And guess what, of course the two features can be combined without a problem:
git push MML new_ratings:new_ratings
Visit us in hall 9 at booth 24. We still have some remaining free tickets, if you are interested just send an e-mail to firstname.lastname@example.org.
Some links with further information:
Here is my late FOSDEM report. The nice thing is that all the guys have already written their reports, so I can link to them ;-).
First of all, it was a really great and crowded event. There were around 5,000 participants, and there was a full and interesting schedule. The organization was superb.
On the train to Brussels I gave the slides a final polishing, and prepared release 0.10 of the MyMediaLite recommender system library that I was about to present at FOSDEM.
It all started on Friday with the beer event at Café Delirium, which is a nice beer pub in the center of Brussels. I wanted to meet the MoviePilot guys there, but the place was too crowded to find them. Instead, I spent most time talking with Alan and Patrick, two researchers from University of Cambridge, who attended FOSDEM to present Dasher, which is a one-handed text input method that uses probabilistic language models for input suggestions. Alan had a nice demo program on his mobile phone. The two are members of David MacKay’s group. MacKay is a physicist who is well-known for his work on sustainable energy, if you do not know him yet, read his blog or even his book about the topic. And, for me even more interesting, the group is also working on machine learning, so we had plenty of topics to talk about. So after meeting the guys the book Information Theory, Inference, and Learning Algorithms is high on my reading list (I even got it from the library already …).
The actual event started with a keynote by Eben Moglen (see also this) about how political liberty depends on technology, followed by a talk on the LLVM compiler framework by Chris Lattner.
Afterwards, I was mostly in Data Analytics Devroom (photo, more photos by Nicolas Maillot), which was really stuffed. So I gave my talk on MyMediaLite in front of a rather large audience, something which I am not really used to at academic conferences. I finally met Benjamin and Jannis from MoviePilot, although there was not really a lot of time to chat (there were almost no breaks between the talks).
There were many interesting talks in the Data Analytics room, among others one about Clustering with Mahout by Frank Scholten. It was a nice introduction to clustering, but there is one thing I do not understand about the Mahout guys: Why always those small examples!?! If I have a collection of 20K documents, there is no reason for having the overhead of a distributed computation. It would also be nicer to impress the audience by being bold and really showing off big big data examples that you cannot tackle on a several years old laptop …
Here is a list of available slide sets (I will try to complete this list):
Someone took videos, which will hopefully be available at some point in the future.
Sadly, I missed some very interesting talks because I went to the Mono Devroom to see the talk about the new Mono garbage collector, and then I did not get back into the overcrowded Data Analytics Devroom.
At the end of the first day, I visited the Perl booth and went to have dinner with about a dozen Perl people, which was a really nice experience. Claudio took some nice pictures of that.
On Sunday, I spent most of the time at the Perl Devroom. Highlights were a talk about a new ncurses module (with really impressive demos) by Damien Krotkine, the presentations about the Dancer web microframework (cool logo, by the way; slides) and Moose object system (slides) by Sawyer X, as well as Gabor’s talks about Perl 6 (during which I ported a recommender algorithm to Perl 6; interesting experience, will blog about this soon) and Padre. I missed Franck Cuny’s talk about SPORE, but here are his slides.
The atmosphere was positive, and particularly the Dancer crew made a really energetic impression on me. Sawyer X wrote two posts about Perl and Dancer at FOSDEM. I guess I will prototype the webservice interface for MyMediaLite, and possibly a web-based demo. But let’s see.
I was happy to get to know personally some of the Padre developers that until then I had only known via IRC and e-mail: getty, el_che, burak, szagab, szbalint, and Dirk (hope I did not miss anyone).
Overall, a really good event, it is definitely worth going again next year.