Zeno's Notes


Random Forests: Weka vs. R

Random forests are a really nice (because flexible, robust, scalable) machine learning method. Jeremy Howard recently reported at the O’Reilly Strata New York conference (alas, I was not there and only heard about it on Twitter) that 50% of the winners of Kaggle competitions made use of random forests.

So how well do the random forest implementations of two major (free/open source software) data analytics packages, Weka and R, perform? Neither is exactly known for being the fastest stuff in the universe, but on the other hand they offer many different methods and a nice infrastructure to work in (Java and a nice GUI in Weka’s case, a complete statistical/numerical programming language in R’s).

So I ran RF on a medium-sized dense dataset (11 features, 150,000 training instances, about 100,000 test instances). The results are so surprisingly clear that I did not even try to make a more detailed/fair comparison:

Weka 3.7.4 (the latest version, running on Java 6) took 609.26 seconds to grow 50 trees (without I/O), whereas R 2.10.1 needed just 97.37 seconds for the same number of trees, including making predictions for about 100,000 instances, and I/O.

Some more details: R 2.10.1 is about two years old, so I suspect that there have been performance improvements since then. The random forest package for R can be found here. The sample size was set to 80,000 in the R case. I also tried sample sizes of 10,000 / 20,000 / 40,000 / 120,000, which resulted in runtimes of 37.42 / 48.72 / 92.05 / 141.35 seconds. I did not find a way to set the sample size for Weka (Reader, maybe you know a way to do that?).
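
To make the "sample size" setting a bit more concrete: each tree in the forest is grown on its own random sample of training rows, and the sample size says how many rows go into that sample. Here is a minimal Python sketch of just that sampling step. It is not how R or Weka implement it. The function name `draw_tree_samples` and its parameters are made up for illustration, and I am assuming rows are drawn with replacement (bootstrap-style).

```python
import random


def draw_tree_samples(n_rows, n_trees, sample_size, seed=0):
    """Return, for each tree, the row indices it would be grown on.

    Assumption: rows are drawn with replacement, so a row can show up
    more than once in one tree's sample, and some rows are left out.
    """
    rng = random.Random(seed)
    return [
        rng.choices(range(n_rows), k=sample_size)
        for _ in range(n_trees)
    ]


# Sizes taken from the run described above: 150,000 training rows,
# 50 trees, 80,000 rows sampled per tree.
samples = draw_tree_samples(n_rows=150_000, n_trees=50, sample_size=80_000)

# Total number of (not necessarily distinct) rows the forest has to process.
rows_processed = sum(len(s) for s in samples)
print(rows_processed)  # prints 4000000
```

The number of rows the forest has to chew through is the number of trees times the sample size, which is one reason the sample size matters for runtime. The timings above are clearly not proportional to it, though, so there must be other costs involved as well.
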
There are more free implementations of random forests that I have not had the chance to try yet:

I will blog about those as soon as I find the time to play around with them.

You can read about random forests in the Wikipedia article, or in the original paper. By the way, what is the nicest description of random forests on the web? The Wikipedia article is not that good. Any suggestions? Any volunteers to improve the Wikipedia article?

Filed under data analysis machine learning free software open source R Weka random forests decision tree ensembles Java

C5.0 and Cubist: GPL-licensed decision tree implementations

Ross Quinlan received the SIGKDD Innovation Award at KDD 2011 in San Diego.

Quinlan is well-known for his work on decision tree learning, in particular for developing the C4.5 algorithm and its successor, C5.0.

He also has a company, RuleQuest Research, that sells tools and services related to his inventions.

KDD 2011 Opening Session

At the award session I found out that the single-threaded Linux versions of C5.0 (for classification) and Cubist (for regression) are available under the terms of the GNU General Public License, that is, they are free software. Nice! You can download them here.

Apart from having to install csh to build the programs, installation went without problems. It seems they are not yet packaged for Debian, though. Any volunteers?

PS: The photo above was taken by Markus Weimer. Click on it to get to his flickr photostream.

Filed under KDD kdd2011 data mining machine learning free software GPL GNU open source debian decision tree c4.5 c5.0 regression classification