Posts tagged ensembles
Random forests are a really nice (because flexible, robust, and scalable) machine learning method. Jeremy Howard recently reported at the O'Reilly Strata New York conference (alas, I was not there and only heard about it on Twitter) that 50% of the winners of Kaggle competitions made use of random forests.
So how well do the random forest implementations of two major (free/open source software) data analytics packages, Weka and R, perform? Neither is exactly known for being the fastest stuff in the universe, but on the other hand they offer many different methods and a nice infrastructure to work in (Java and a nice GUI in Weka’s case, a complete statistical/numerical programming language in R’s).
So I ran RF on a medium-sized dense dataset (11 features, 150,000 training instances, about 100,000 test instances). The results are so surprisingly clear that I did not even try to make a more detailed/fair comparison:
Weka 3.7.4 (the latest version, running on Java 6) took 609.26 seconds to grow 50 trees (without I/O), whereas R 2.10.1 needed just 97.37 seconds for the same number of trees, including I/O and making predictions for about 100,000 instances.
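To put those two numbers side by side, here is a small back-of-the-envelope calculation in Python. It uses only the timings quoted above; the variable names are mine. Keep in mind that the R figure also includes I/O and prediction, so the ratio, if anything, understates the difference in training time.

```python
# Timings quoted in the post (seconds, 50 trees each).
weka_seconds = 609.26  # Weka 3.7.4: growing the trees only, no I/O
r_seconds = 97.37      # R 2.10.1: training + prediction + I/O
n_trees = 50

# Overall ratio and average wall-clock time per tree.
speedup = weka_seconds / r_seconds
weka_per_tree = weka_seconds / n_trees
r_per_tree = r_seconds / n_trees

print(f"R was about {speedup:.1f}x faster overall")          # about 6.3x
print(f"Weka: {weka_per_tree:.2f} s/tree, R: {r_per_tree:.2f} s/tree")  # 12.19 vs 1.95
```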
Some more details: R 2.10.1 is about two years old, and I suspect that there have been performance improvements during that time. The random forest package for R can be found here. The sample size was set to 80,000 in the R case. I also tried sample sizes of 10,000 / 20,000 / 40,000 / 120,000, which resulted in runtimes of 37.42 / 48.72 / 92.05 / 141.35 seconds. I did not find a way to set the sample size for Weka (Reader, maybe you know a way to do that?).
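In case "sample size" is unclear: each tree in the forest is grown on its own random sample of the training instances, and the sample size is how many instances go into that sample. The following is only a rough Python sketch of that idea, not the code of either package. The function name `draw_tree_samples` and its parameters are hypothetical, and drawing with replacement is an assumption on my part.

```python
import random

def draw_tree_samples(n_instances, n_trees, sample_size, replace=True, seed=0):
    """Return one list of training-row indices per tree (hypothetical helper)."""
    rng = random.Random(seed)
    indices = range(n_instances)
    samples = []
    for _ in range(n_trees):
        if replace:
            # Bootstrap-style: the same instance may be drawn more than once.
            sample = rng.choices(indices, k=sample_size)
        else:
            # Subsampling: every instance appears at most once per tree.
            sample = rng.sample(indices, k=sample_size)
        samples.append(sample)
    return samples

# Dataset size and sample size from the post; only 3 trees to keep the demo light.
samples = draw_tree_samples(n_instances=150_000, n_trees=3, sample_size=80_000)
distinct_in_first = len(set(samples[0]))
print(len(samples), len(samples[0]), distinct_in_first)
```

A smaller sample size means each tree sees less data and is cheaper to grow, which fits the runtimes above growing with the sample size.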
There are more free implementations of random forests that I have not had the chance to try yet:
I will blog about those as soon as I find the time to play around with them.
You can read about random forests in the Wikipedia article, or in the original paper. By the way, what is the nicest description of random forests on the web? The Wikipedia article is not that good. Any suggestions? Any volunteers to improve the Wikipedia article?