Look, Ruby, it’s not you, it’s me. I have just grown out of you. I know we haven’t been together that long, but I will always treasure the time we spent together. Remember when we scraped Twitter? Those were good times. But to tell you the truth, I am going back to my ex. I know, I know, I said I was done with Python, but Python just has, well, it’s about the analytics.
So in this installment, I want to talk about a topic near and dear to us all: analytical programming. During my little journey deeper into the heart of data science, I have been struggling with where to start: specifically, which tool provides the most functionality for the kind of work I will be doing and has enough support that I won’t be all by myself using it. Given that I have been working with SAS for the last five years, it would seem like the obvious choice. Sadly, as those of you who use SAS know, it’s actually quite a limited platform. Most of the source code has not been updated since the ’80s; it has a hard time interacting with other programs; and since it is not free open-source software, a single computer license will run you something like $10K annually…
Next up, SQL. Again, I am well versed in SQL, but it’s kind of a one-trick pony. If you want to access data from a relational database, SQL is the most tested and trusted way to do it. The drawback is that pulling data is only a small part of data analysis, let alone data science. What am I going to do - pull the data, then toss it into Excel for some thorough analysis? Simply put, SQL is a useful tool but by no means a one-stop shop.
For those of you still lost about the Excel comment, let me make this perfectly clear. Excel is just fine for doing an analysis, once. If it’s any kind of process that you will be required to repeat (which, let’s face it, they all are), using Excel is just admitting you are programmatically, and possibly analytically, illiterate. Don’t give me the “Well, I just wanted to put together some high-level…”, save it. There is no excuse for leaning on the analytical crutch that is Excel - it’s just a time-suck.
Once I started branching out beyond enterprise software, I started searching in the world of open source. My first stop was Python, a full-featured programming language. And it’s called Python. How cool is that? I didn’t get very far with Python and eventually abandoned it in favor of Ruby; I mean, all the cool kids were doing it.
Ruby, I felt, was similar enough to Python that I could make it do what I wanted, so I got to work. After a brief stint doing some web development, where Ruby does have some great Gems [libraries], I refocused my efforts on analytics. Immediately, I was pretty impressed with the power I felt having full control of how functions were written, something so often missing in enterprise software. But as time wore on, it became increasingly frustrating to have to write every library myself and worry about things like query optimization. The reason is that the Ruby community is just not rich in data people. Consequently, I could not find a good engine for dealing with vector mathematics, the basis of high-level mathematical computation (a.k.a. how fast your computer can finish your analysis), and without one the community doesn’t have a lot of pre-made, optimized, high-level mathematical libraries.
Dismayed, I had a brief stint playing around with R. But given my experience with SAS, I wanted more functionality - R is open-source, but still limited. To take up R would mean I would still need another tool in my toolbox. Also, complex computations in R are quite slow.
At this point I decided to take another look at Python, just to see if maybe all that functionality everyone is always talking about was something I could discover. This time went a lot better. Although the barriers to entry are quite steep, once you understand the power of the trifecta - NumPy, SciPy and matplotlib - the world is yours. Four days after my enlightenment, I am hooked. To celebrate finding my analytical programming home, I produced a nice little Machine Learning [ML] program to predict the category of multi-dimensional points based on the category of the points near them in n-dimensional space. k-nearest neighbor classification (look it up, you might learn something) is a basic ML algorithm, but building it is no trivial task. Even so, I found it not too complicated and even a lot of fun in Python.
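For anyone who would rather see it than look it up, here is a minimal sketch of the k-nearest neighbor idea. To be clear, this is not the program described above: it is an illustration written in plain standard-library Python so it runs without any installs, and the function name `knn_predict`, the default `k=3`, and the toy points are all made up for the example. A NumPy version would presumably compute all the distances in one vectorized step instead of a loop.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point (works in any dimension)
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest (distance, label) pairs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up data: two well-separated clusters in 3-dimensional space
points = [(0.0, 0.1, 0.0), (0.2, 0.0, 0.1), (0.1, 0.2, 0.2),
          (5.0, 5.1, 4.9), (5.2, 4.8, 5.0), (4.9, 5.0, 5.3)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.3, 0.1, 0.0)))  # prints: a
print(knn_predict(points, labels, (4.7, 5.2, 5.1)))  # prints: b
```

Sorting every distance is the simplest thing that works; it is fine for a toy data set, but it is exactly the kind of step where an optimized numerical library starts to matter as the number of points grows.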
So, triumphant in my endeavor, I bid you adieu until next time. And for the analytically minded determined to stick with Ruby, take heart: they are working on it…