Zeno's Notes

Posts tagged free software

Random Forests: Weka vs. R

Random forests are a really nice machine learning method: flexible, robust, and scalable. Jeremy Howard recently reported at the O’Reilly Strata New York conference (alas, I was not there and only heard about it on Twitter) that 50% of the winners of Kaggle competitions made use of random forests.

So how well do the random forest implementations of two major free/open source data analytics packages, Weka and R, perform? Neither is exactly known for being the fastest thing in the universe, but on the other hand both offer many different methods and a nice infrastructure to work in (Java and a nice GUI in Weka’s case, a complete statistical/numerical programming language in R’s).

So I ran RF on a medium-sized dense dataset (11 features, 150,000 training instances, about 100,000 test instances). The results are so surprisingly clear that I did not even try to make a more detailed/fair comparison:

Weka 3.7.4 (the latest version, running on Java 6) took 609.26 seconds to grow 50 trees (without I/O), whereas R 2.10.1 needed just 97.37 seconds for the same number of trees, including I/O and making predictions for about 100,000 instances.

Some more details: R 2.10.1 is about two years old (I suspect that there have been performance improvements during that time). The random forest package for R can be found here. The sample size was set to 80,000 in the R case. I also tried sample sizes of 10,000/20,000/40,000/120,000, which resulted in runtimes of 37.42 / 48.72 / 92.05 / 141.35 seconds. I did not find a way to set the sample size for Weka (reader, maybe you know a way to do that?).
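To put the numbers above in relation to each other, here is a minimal Python sketch that computes the time per tree and the ratio between the Weka and R runtimes. It only uses the figures reported in this post; the function and variable names are my own, and the ratios are rough because the two measurements are not like-for-like (the Weka time excludes I/O, the R times include I/O and prediction, and I could not set the sample size in Weka).

```python
# Runtimes reported in the post, in seconds, for 50 trees each.
WEKA_SECONDS = 609.26  # Weka 3.7.4, growing the trees only, without I/O
N_TREES = 50

# R 2.10.1 runtimes by sample size, including prediction and I/O.
R_SECONDS_BY_SAMPLE_SIZE = {
    10_000: 37.42,
    20_000: 48.72,
    40_000: 92.05,
    80_000: 97.37,
    120_000: 141.35,
}


def speedup(slow_seconds, fast_seconds):
    """Return how many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


def seconds_per_tree(total_seconds, n_trees=N_TREES):
    """Return the average wall-clock time per tree."""
    return total_seconds / n_trees


if __name__ == "__main__":
    print(f"Weka: {seconds_per_tree(WEKA_SECONDS):.2f} s/tree")
    for sample_size, seconds in sorted(R_SECONDS_BY_SAMPLE_SIZE.items()):
        print(
            f"R, sample size {sample_size:>7,}: "
            f"{seconds_per_tree(seconds):.2f} s/tree, "
            f"Weka/R ratio {speedup(WEKA_SECONDS, seconds):.2f}"
        )
```

With these figures, the R run with a sample size of 80,000 comes out roughly 6.3 times faster than Weka, and even the run with a sample size of 120,000 is still about 4.3 times faster.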
There are more free implementations of random forests that I have not had the chance to try yet:

I will blog about those as soon as I find the time to play around with them.

You can read about random forests in the Wikipedia article, or in the original paper. By the way, what is the nicest description of random forests on the web? The Wikipedia article is not that good. Any suggestions? Any volunteers to improve the Wikipedia article?

Filed under data analysis machine learning free software open source R Weka random forests decision tree ensembles Java

Does open source exclude high context cultures?

Interesting piece about the threshold for contributors from different cultures.

Also some good comments, e.g. from Donnie Berkholz:

… the same things important to increasing contribution from high-context cultures are *also* important to recruiting and retaining contributors in low-context cultures. Relationships and integration into the community trumps everything.

I fully agree. The Padre project is doing things the right way here. There is a Code of Conduct and a Diversity Statement, the community is open and welcoming, development is public and transparent, developers are spread around the globe (Germany, Israel, Jordan, Australia, Turkey, the Netherlands, Brazil, France …) and people from the project (in particular Gabor) go to events and talk to people in person regularly. And that makes a difference. A lot. I think I use Padre exactly because I met Gabor and Sebastian (sewi) last year at CeBIT and they got me interested again in that nice editor/IDE thing …

So guys, keep up the good work, both on the code and the community.

Filed under culture relationships open source free software padre perl IDE editor community

Seahorse - Encryption Made Easy?

Seahorse is a nice GUI tool for managing (among other things) your GNU Privacy Guard keys.

But it fails to be a nice GPG frontend, because it does not offer (at least not in a comfortable way) GPG’s two main features: encrypting and decrypting files.

The main problem of mainstream cryptography remains usability. It is not the main problem that people do not understand how cryptography works, or that we do not have enough algorithms, or that some implementations may be insecure; what is missing are practical tools, and integration into the tools that we already use, that let you perform encryption, decryption, and key management with almost no overhead.
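For reference, the two operations mentioned above are each a single gpg call on the command line. The following is a minimal Python sketch that builds those calls, assuming gpg is installed and on the PATH. The helper names, the file names, and the recipient address are hypothetical, and the example only prints the commands instead of running them.

```python
import subprocess


def encrypt_command(path, recipient):
    """Build the gpg call that encrypts `path` for `recipient` into `path`.gpg."""
    return ["gpg", "--output", path + ".gpg",
            "--encrypt", "--recipient", recipient, path]


def decrypt_command(encrypted_path, output_path):
    """Build the gpg call that decrypts `encrypted_path` into `output_path`."""
    return ["gpg", "--output", output_path, "--decrypt", encrypted_path]


def run(command):
    """Run a gpg command and raise an error if gpg reports a failure."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file name and recipient; only printed, not executed.
    print(" ".join(encrypt_command("notes.txt", "alice@example.org")))
    print(" ".join(decrypt_command("notes.txt.gpg", "notes.txt")))
```

A file manager or editor plugin would need little more than this plus a key picker, which is the kind of integration I am missing.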

Filed under cryptography communication usability freedom GNOME free software open source PGP GNU Privacy Guard GPG

New Release: LaTeX Plugin for Padre

I have just released version 0.12 of the LaTeX plugin for Padre, the Perl IDE, to CPAN.

Some additional LaTeX commands and environment types are supported, and the plugin is compatible with the latest Padre version.

Get it while it’s hot!

I would be happy to receive feedback, bug reports, and of course patches. If you have access to the Padre SVN repository, you can modify the plugin sources there directly.

Filed under padre perl perl5 modern perl IDE editor programming publishing latex bibtex free software open source