In my last two posts I went through the basic tools and getting at the data; after having had a bit of a play, it was time to get serious!
As with a lot of things, they seem simple on the surface, but as soon as you drill into them they become more complex! Playing with a dataset of thousands of records, I noticed that the runtime was stretching out, and I am not a patient person! So it was time to fix this. The first issue was that my Python code was single-threaded, so while my Intel i7 has 4 cores and 8 threads… very little of that was being used. So I downloaded and installed IPython (ipython.org), a great interactive environment which also has a simple parallelisation mechanism, allowing me to spin up a number of workers and make use of the extra CPU power. Great…
For a while my wait-time was down to tolerable levels, but as my ‘maths’ got more complex and my datasets grew… things ground to a halt! (You can imagine, when the machine started thrashing into virtual memory, it was way over my tolerance level!) So it was time to pull out the big guns!! Firstly, reading the data into a single data structure was becoming an issue, so the obvious answer was divide and conquer: enter Hadoop.
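Before you even reach for Hadoop, the same divide-and-conquer idea applies in plain Python: instead of loading everything into one giant structure, stream the file and aggregate chunk by chunk. A minimal sketch — the two-column CSV layout and the "sum one numeric column" aggregation are hypothetical, just to show the pattern:

```python
import csv
import io

def aggregate_in_chunks(lines, chunk_size=1000):
    """Sum a numeric column without holding the whole dataset in memory."""
    total, count = 0.0, 0
    chunk = []
    for row in csv.reader(lines):
        chunk.append(float(row[1]))   # hypothetical numeric second column
        if len(chunk) >= chunk_size:  # fold each full chunk into the totals
            total += sum(chunk)
            count += len(chunk)
            chunk = []
    total += sum(chunk)               # don't forget the leftover partial chunk
    count += len(chunk)
    return total, count

# tiny in-memory stand-in for a big CSV file on disk
data = io.StringIO("\n".join("id%d,%d" % (i, i) for i in range(5000)))
total, count = aggregate_in_chunks(data)
print(total, count)
```

Memory stays bounded by the chunk size rather than the dataset size — which is exactly the trick Hadoop applies, just spread across many machines.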
Well, I started with this great article, (here), by Brett Winterford (leading IT journo and a great muso… YouTube him if you are interested), who challenged the open-source guru Des Blanchfield to create a tiny Hadoop. The article steps you through building a Hadoop environment in about an hour, using about 500MB! (I had a little problem where the daemons did not start; after a few hours I worked out that some of the config files were incorrect, but otherwise it was pretty much as per the instructions.) A great introduction to Hadoop, but not really enterprise-ready… so I moved on.
You guessed it from my lousy pun in the title… all roads lead to goPivotal.com, where you can find downloadable versions of both Pivotal Greenplum and Pivotal HD with HAWQ – a massively parallel engine for standard SQL, which is just SQL on steroids! Now let me warn you, these are GB downloads, but well worth the bandwidth. These are single-node versions which are great for getting a taste of these amazing tools. However (and I hope this is not a secret), a couple of weeks later I was sent a mail with a link to a cluster version!!
So that’s where I’m up to. My biggest learning, which surprised me a bit, was not about the technologies and methods at all, but about moving from ‘hacking/playing’ to ‘production’! At the beginning of this exercise I thought the issues all revolved around the ultimate algorithms and absolute performance. However, by the end I realised that it’s about the application of these tools, and, as always, when operationalising these systems the complete lifecycle is more important than how many ‘rows’ I can process in a second!!
Lastly, get going: the only thing to fear is fear itself!