Zeno's Notes

...

Posts tagged data analysis

8 notes

25 years of Machine Learning Journal - Free access in October

(KD Nuggets link via Nando de Freitas - @NandoDF)

Congratulations to the Machine Learning Journal for 25 years of interesting (though not open access) content.

It is nice that there is free access in October, but it is sad that the access is limited to that single month.

As long as the Machine Learning Journal is not open access, I encourage machine learning researchers to submit their work to the Journal of Machine Learning Research instead.

Filed under academics open access publication journal machine learning data analysis

19 notes

Random Forests: Weka vs. R

Random forests are a really nice machine learning method (because they are flexible, robust, and scalable). Jeremy Howard recently reported at the O’Reilly Strata New York conference (alas, I was not there and only heard about it on Twitter) that 50% of the winners of Kaggle competitions made use of random forests.

So how well do the random forest implementations of two major (free/open source software) data analytics packages, Weka and R, perform? Neither is exactly known for being the fastest software in the universe, but on the other hand both offer many different methods and a nice infrastructure to work in (Java and a nice GUI in Weka’s case, a complete statistical/numerical programming language in R’s case).

So I ran random forests on a medium-sized dense dataset (11 features, 150,000 training instances, about 100,000 test instances). The results are so surprisingly clear that I did not even try to make a more detailed/fair comparison:

Weka 3.7.4 (the latest version, running on Java 6) took 609.26 seconds to grow 50 trees (without I/O), whereas R 2.10.1 needed just 97.37 seconds for the same number of trees, including making predictions for about 100,000 instances, and I/O.

Some more details: R 2.10.1 is about two years old (I suspect that there have been performance improvements during that time). The random forest package for R can be found here. The sample size was set to 80,000 in the R case. I also tried sample sizes of 10,000/20,000/40,000/120,000, which resulted in runtimes of 37.42 / 48.72 / 92.05 / 141.35 seconds. I did not find a way to set the sample size for Weka (Reader, maybe you know a way to do that?).
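To make the setup concrete, here is a minimal R sketch of the kind of timing run described above (not the exact script I used); the data frames train and test and the target column y are placeholder names for the dataset described earlier:

library(randomForest)

timing <- system.time({
  rf <- randomForest(y ~ ., data = train,
                     ntree = 50,        # 50 trees, as in the comparison above
                     sampsize = 80000)  # sample size used for the R run
  pred <- predict(rf, newdata = test)   # predictions for the ~100,000 test instances
})
print(timing)                           # elapsed time for training + prediction

(The randomForest documentation advises against the formula interface for larger datasets because of its overhead, so passing the feature matrix and target vector directly may be a bit faster.)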
There are more free implementations of random forests that I have not had the chance to try yet:

I will blog about those as soon as I find the time to play around with them.

You can read about random forests in the Wikipedia article, or in the original paper. By the way, what is the nicest description of random forests on the web? The Wikipedia article is not that good. Any suggestions? Any volunteers to improve the Wikipedia article?

Filed under data analysis machine learning free software open source R Weka random forests decision tree ensembles Java

9 notes

Data Mining Competitions: They Are Very, Very Useful

Note: Read on if you are interested in data analysis, machine learning, or recommender systems.

At this year’s KDD conference there was, as in every year, a workshop on the KDD Cup (in which I participated). Additionally, and even more interestingly, there was a panel about data mining competitions.

Neal Lathia wrote a really nice and thought-provoking post about this panel discussion, and shared some of his opinions about the topic. I had a different view on some of the things he said, and wanted to write a comment on his blog. After I saw that the comment would be quite long, I decided to turn it into a proper blog post.

Read more …

Filed under kdd kdd2011 kddcup competition challenge prize data mining machine learning recommender systems science engineering academics data analysis predictive analysis