Posts tagged data analysis
D. Tunkelang’s #recsys2011 tutorial featured on SlideShare #recsys
Congratulations to the Machine Learning Journal for 25 years of interesting (though not open access) content.
It is nice that there is free access in October, but it is sad that it is only so in October.
Random forests are a really nice (because flexible, robust, scalable) machine learning method. Jeremy Howard recently reported at the O’Reilly Strata New York conference (alas, I was not there, only heard it on Twitter) that 50% of the winners of Kaggle competitions made use of random forests.
So how well do the random forest implementations of two major (free/open source software) data analytics packages, Weka and R, perform? Neither is exactly known for being the fastest stuff in the universe, but on the other hand they offer many different methods and a nice infrastructure to work in (Java and a nice GUI in Weka's case, a complete statistical/numerical programming language in R's case).
So I ran RF on a medium-sized dense dataset (11 features, 150,000 training instances, about 100,000 test instances). The results are so surprisingly clear that I did not even try to make a more detailed/fairer comparison:
Weka 3.7.4 (the latest version, running on Java 6) took 609.26 seconds to grow 50 trees (without I/O), whereas R 2.10.1 needed just 97.37 seconds for the same number of trees, including making predictions for about 100,000 instances, and I/O.
Some more details: R 2.10.1 is about two years old; I suspect that there have been performance improvements during that time. The random forest package for R can be found here. The sample size was set to 80,000 in the R case. I also tried sample sizes of 10,000/20,000/40,000/120,000, which resulted in runtimes of 37.42 / 48.72 / 92.05 / 141.35 seconds. I did not find a way to set the sample size for Weka (Reader, maybe you know a way to do that?).
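For readers who want to reproduce the R side of the comparison, here is a minimal sketch (not my exact script) using the randomForest package. The data frames train and test and the column name label are placeholders for your own data; the ntree and sampsize values match the settings described above.

```r
# Sketch of the R benchmark setup, assuming a data frame `train` with
# 11 feature columns plus a `label` column, and a `test` data frame
# with the same features. Names are illustrative, not from my script.
library(randomForest)

timing <- system.time({
  rf <- randomForest(label ~ ., data = train,
                     ntree = 50,        # 50 trees, as in the comparison above
                     sampsize = 80000)  # per-tree sample size used in the 97.37 s run
  preds <- predict(rf, newdata = test)  # predictions for the ~100,000 test instances
})
print(timing)  # elapsed time includes training and prediction
```

Varying sampsize (e.g. 10000, 20000, 40000, 120000) is how I produced the runtime series above; smaller samples per tree trade some accuracy for speed.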
There are more free implementations of random forests that I have not had the chance to try yet:
I will blog about those as soon as I find the time to play around with them.
You can read about random forests in the Wikipedia article, or in the original paper. By the way, what is the nicest description of random forests on the web? The Wikipedia article is not that good. Any suggestions? Any volunteers to improve the Wikipedia article?
Note: Read on if you are interested in data analysis, machine learning, or recommender systems.
At this year’s KDD conference, there was, as every year, a workshop on the KDD Cup (in which I was a participant). Additionally, and even more interestingly, there was a panel about data mining competitions.
Neal Lathia wrote a really nice and thought-provoking post about this panel discussion, and shared some of his opinions about the topic. I had a different view on some of the things he said, and wanted to write a comment on his blog. After I saw that the comment would be quite long, I decided to turn it into a proper blog post.