Posts tagged data mining
Posts tagged data mining
Steffen and his method/software finally get some well-deserved attention.
Factorization Machines (FMs) are basically factorized polynomial prediction (regression/classification/ranking).
They work really really well for applications like recommendation, where the input data is sparse, and many feature combinations at prediction time (e.g. user-item pairs) are never observed during training.
And the cool thing is, you can mimic many advanced factorization models just by feature engineering for FMs. That means you can reuse the existing training algorithms — no need to derive and implement a new algorithm for a new prediction problem…
The Million Song Dataset Challenge is a contest hosted on Kaggle. Its goal is to predict the songs that 100,000 users will listen to, given their listening history and additional listening histories and data about the songs.
Predicting held-out past user choices is a proxy for another task that cannot be directly evaluated without using a live system: personalized recommendation.
MyMediaLite is a tool/library containing state-of-the-art recommendation algorithms. In this post, I explain how MyMediaLite can be used to make predictions for the Million Song Dataset Challenge.
Preliminaries
First, you need to install MyMediaLite. Don’t worry, it is quite easy, and should work fine on Linux, Mac OS X, and Windows.
You will also need several gigabytes of disk space, the challenge datasets, and a working Unix-like environment. On Linux and Mac this should not be a problem. For Windows you could use Cygwin to get such an environment.
In the following, I assume that you have installed MyMediaLite 3.01 (it must be the latest version, because only that one contains some features we will make use of) in ~/src/MyMediaLite. If it is somewhere else, just adapt the paths below accordingly.
Data Preparation
In the MyMediaLite directory, create a directory data/millionsong, and put the unzipped competition dataset there.
cat kaggle_users.txt | perl -ne 'chomp; print "$_\t" . ++$l . "\n"' > user_id_mappings.txt cut -f 2 user_id_mappings.txt > test_users.txt cut -f 2 -d ' ' kaggle_songs.txt > candidate_items.txt # create dataset ~/src/MyMediaLite/scripts/import_dataset.pl --load-user-mapping=user_id_mappings.txt --load-item-mapping=kaggle_songs.txt kaggle_visible_evaluation_triplets.txt > msd.train.txt # create CV splits mkdir cv ~/src/MyMediaLite/scripts/per_user_crossvalidation.pl --k=5 --filename=cv/msd < msd.train.txt # use one split for validation cp cv/msd-0.train.txt msd_validation.train.txt cp cv/msd-0.test.txt msd_validation.test.txt mkdir validation_predictions mkdir validation_submissions # prepare directories for prediction/submission files and logs mkdir logs mkdir submissions mkdir predictions
Trying out Different Recommenders
Run in the MyMediaLite directory:
bin/item_recommendation --training-file=msd_validation.train.txt --test-file=msd_validation.test.txt --data-dir=data/millionsong --recommender=MostPopular --random-seed=1 --predict-items-number=500 --num-test-users=1000 --no-id-mapping --candidate-items=candidate_items.txt
You will get an output like this:
Set random seed to 1. loading_time 1.67 memory 21 training data: 110000 users, 149052 items, 1160746 events, sparsity 99.99292 test data: 110000 users, 77330 items, 290187 events, sparsity 99.99659 MostPopular training_time 00:00:00.0718350 .AUC 0.56605 prec@5 0.0078 prec@10 0.007 MAP 0.02051 recall@5 0.01875 recall@10 0.03011 NDCG 0.05008 MRR 0.02324 num_users 1000 num_items 386213 num_lists 1000 testing_time 00:00:35.3801840 memory 120
The MAP 0.02051 is the interesting piece of information: This is an estimate of how well we will perform on the leaderboard with this recommender.
The command for the WRMF recommender is similar, only that we also see results at different iterations:
k=28; cpos=28; reg=0.002; bin/item_recommendation --training-file=msd_validation.train.txt --test-file=msd_validation.test.txt --recommender=WRMF --random-seed=1 --predict-items-number=500 --num-test-users=1000 --test-users=test_users.txt --find-iter=1 --max-iter=30 --recommender-options="num_iter=0 num_factors=$k c_pos=$cpos reg=$reg" --data-dir=data/millionsong --no-id-mapping --candidate-items=candidate_items.txt
The output will be like this (I removed some parts for better readability):
WRMF num_factors=28 regularization=0.002 c_pos=28 num_iter=0 MAP 0.00003 iteration 0 MAP 0.01106 iteration 1 MAP 0.01659 iteration 2 MAP 0.02593 iteration 3 MAP 0.03558 iteration 4 ... MAP 0.05341 iteration 30
Nice. This is already some improvement over the MostPopular baseline.
Creating a Submission
bin/item_recommendation --training-file=data/millionsong/msd.train.txt --recommender=MostPopular --predict-items-number=500 --prediction-file=data/millionsong/predictions/mp.pred --test-users=data/millionsong/kaggle_users.txt
k=28; cpos=28; reg=0.002; it=30; bin/item_recommendation --training-file=msd.train.txt --recommender=WRMF --random-seed=1 --predict-items-number=500 --recommender-options="num_iter=$it num_factors=$k c_pos=$cpos reg=$reg" --prediction-file=predictions/wrmf-k-$k-cpos-$cpos-reg-$reg-it-$it.pred --test-users=kaggle_users.txt --candidate-items=candidate_items.txt --data-dir=data/millionsong
MyMediaLite’s output format is a bit different from the submission file format, so I wrote a little script to convert the prediction file:
~/src/MyMediaLite/scripts/msdchallenge/create_submission.sh < predictions/wrmf-k-28-cpos-28-reg-0.002-it-30.pred > submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub ~/src/MyMediaLite/scripts/msdchallenge/create_submission.sh < predictions/mp.pred > submissions/mp.sub
It will not hurt to make sure the submission file is in the correct format (using the script provided by the organizers) before trying to upload it:
./validate_submission.py submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub
./validate_submission.py submissions/mp.sub
Compress before upload:
gzip submissions/wrmf-k-28-cpos-28-reg-0.002-it-30.sub
gzip submissions/mp.sub
Submission
Now you can upload the submission files to Kaggle. I got the following results:
Next Steps
I am currently preparing three further blog posts, which I will publish during the next days (links will be provided when the post are ready):
The approach demonstrated here is just a simple one, relying on functions that are already available in MyMediaLite. One can think of many extensions, either using existing functionality, or implementing them using the framework provided by MyMediaLite:
Want to learn more about MyMediaLite?
Have a look at the website, browse the API documentation and browse the source code on GitHub, search the Google Group archives.
Questions? Problems?
In case there are questions, do not hesitate to ask them in MyMediaLite’s Google Group or in the competition forum.
… including averaged gradient descent.
(via Olivier Grisel)
Rebranding of statistics as a field seems to be a popular topic these days and “data science” is one of the potential rebranding options. This article over at Revolutions is a nice summary of where the term comes from and what it means. This quote seems pretty accurate:
My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.
The nicest finding on the web today. Now back to proposal and thesis writing …
Note: Read on if you are interested in data analysis, machine learning, or recommender systems.
At this year’s KDD conference, there was, as every year, a workshop on the KDD Cup (at which I was a participant). Additionally, and even more interesting, there was a panel about data mining competitions.
Neal Lathia wrote a really nice and thought-provoking post about this panel discussion, and shared some of his opinions about the topic. I had a different view on some of the things he said, and wanted to write a comment on his blog. After I saw that the comment would be quite long, I decided to turn it into a proper blog post.
It now includes bindings for Java, C#, Ruby, and Lua. Nice.
We’re organizing a workshop at NIPS 2011. Submission are solicited for a two day workshop December 16-17 in Sierra Nevada, Spain.
This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The…
Ross Quinlan received the SIGKDD Innovation Award at KDD 2011 in San Diego.
Quinlan is well-known for his work on decision tree learning, in particular for developing the C4.5 algorithm and its successor, C5.0.
He has also a company, RuleQuest Research, that sells tools and services related to his inventions.
At the award session I found out that the single-threaded Linux versions of C5.0 (for classification) and Cubist (for regression) are available under the terms of the GNU General Public License, that is, they are free software. Nice! You can download them here.
Except that I had to install csh to be able to build the programs, installation was without problems. It seems they are not yet packaged for Debian, though. Any volunteers?
PS: The photo above was taken by Markus Weimer. Click on it to get to his flickr photostream.
Bob Carpenter on speed differences between languages. Particularly interesting for me because he is also a machine learning (in this case: for natural language processing) guy. Lots of data, numerical code, etc. Pretty much the same as we have in our recommender system library MyMediaLite, just a different application.