My Big Data Journey: Getting Serious, or My Pivotal Moment!


In my last two posts I went through the basic tools and getting to the data; after a bit of a play, it was time to get serious!

As with a lot of things, it seems simple on the surface, but as soon as you drill in it gets more complex! Playing with a dataset of thousands of records, I noticed that the runtime was stretching out, and I am not a patient person! So it was time to fix this. The first issue was that my Python code was single threaded, so while my Intel i7 has 4 cores and 8 threads… very little of that was being used. So I downloaded and installed IPython (ipython.org), a great interactive environment which also has a simple parallelisation mechanism, letting me spin up a number of workers and make use of the extra CPU power. Great…

For a while my wait-times were down to tolerable levels, but as my maths got more complex and my datasets grew… things ground to a halt! (You can imagine that when the machine started thrashing into virtual memory, it was way over my tolerance level!) So it was time to pull out the big guns!! Firstly, reading all the data into a single data structure was becoming an issue, so the obvious answer was to divide and conquer: enter Hadoop.

Well, I started with this great article, (here), by Brett Winterford (leading IT journo and a great muso… YouTube him if you are interested), who challenged the guru Des Blanchfield to create a tiny Hadoop. The article steps you through building a Hadoop environment in about an hour that uses only about 500MB! (I had a little problem where the daemons did not start; after a few hours I worked out that some of the config files were incorrect, but otherwise it was pretty much as per the instructions.) A great introduction to Hadoop, but not really enterprise-ready… so I moved on.

You guessed it from my lousy pun in the title… all roads lead to goPivotal.com, where you can find downloadable versions of both Pivotal Greenplum and Pivotal HD with HAWQ – a parallel version of standard SQL that is just SQL on steroids! Now let me warn you, these are GB-sized downloads, but well worth the bandwidth. These are single-node versions, which are great for getting a taste of these amazing tools. However (and I hope this is not a secret), a couple of weeks later I was sent a mail with a link to a cluster version!!

So that's where I'm up to. My biggest learning surprised me a bit: it was not about the technologies and methods at all, but about moving from 'hacking/playing' to 'production'! At the beginning of this exercise I thought the issues all revolved around the ultimate algorithms and absolute performance. However, by the end I realised that it's about the application of these tools, and, as always, when operationalising these systems the complete lifecycle matters more than how many 'rows' I can process in a second!!

Lastly, get going: the only thing to fear is fear itself!


My Big Data Journey – The Basics!


It is All Out There!
Oh wow! There is so much out there to help you get going in Big Data… it's almost a Big Data problem in itself.

My first step was to brush up on my programming and language skills, since it's been years since I seriously wrote any code. I'd been playing around with Python for a little while, so I naively Googled "Python for Big Data" and got a mere 29.4 million hits… so I took a course from http://www.coursera.org to sharpen my Python skills, then followed a Python for Big Data tutorial on YouTube, and I was almost ready to go!

Lastly, to get some practical experience, I dropped by kaggle.com, signed up, and went through a couple of their tutorials, complete with datasets and advice. Now, I'm not a data scientist, but I do have a taste of what they are up against in a practical sense.

You Don’t Need to Know The Math!
This will probably get me into trouble! However, I suggest that you don't need to know how the algorithms work; you just need to know what they do! If you are not a Python person, then you should know that there is a myriad of libraries available (mostly free) which provide a rich set of functions for data analysis. The logic behind each algorithm has been developed, and is being improved, by the community; all you need to do is understand how to call the code and what it does!!

For example, let's say you have two sets of numbers and you want to see if they are related, i.e. if there is a correlation between them. Have a look at the wiki, (here), and you discover there are several mathematical ways to measure the correlation, depending on the type of relationship between the numbers (linear, exponential, etc.). Now, if you want to perform a Pearson's coefficient calculation, there is a Python library that gives you that; or if you decide to use one of the rank coefficients instead… likewise, it's just a different call!

The tools are all there and everyone can use them, fairly simply… like woodwork! However, a skilled carpenter will produce a far superior product by selecting the correct tools and applying them with past experience, superior knowledge and skill, as well as the insights gained from looking at the raw materials! Similarly, that is what distinguishes me (the hacker) from a data scientist!

Next: stepping up from the basics…