WTF Is Data Wrangling?
“Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.” Source: Wikipedia
Thanks, Wikipedia. Nothing like a dry, officious-sounding definition to kick off the post. Especially one describing something that will take up around 80% of any data analyst’s working career.
Data wrangling, in simple terms then, is the part of the analytics process where we get our hands on the raw data. We learn a bit about it. Clean it up. Join it together, then find a more useful way of looking at it than we could have before we started.
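To make those steps a bit more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made up for illustration: the two tiny CSV extracts, the column names (customer_id, amount, region) and the helper functions are hypothetical, not taken from any real system. It shows one way of reading raw data, cleaning it, joining it and summarising it, not the only way.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw extracts; in practice these would be files or query results.
RAW_ORDERS = """customer_id,amount
 c1 ,10.50
C2,n/a
c1,4.50
c3,20
"""

RAW_CUSTOMERS = """customer_id,region
C1,North
c2,South
"""


def read_rows(text):
    """Get our hands on the raw data: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_orders(rows):
    """Clean it up: normalise the key and set aside rows with a non-numeric amount."""
    cleaned, rejected = [], []
    for row in rows:
        key = row["customer_id"].strip().lower()
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # keep rejects so they can be investigated later
            continue
        cleaned.append({"customer_id": key, "amount": amount})
    return cleaned, rejected


def total_by_region(orders, customers):
    """Join it together, then summarise: total order amount per region."""
    region_of = {
        c["customer_id"].strip().lower(): c["region"].strip() for c in customers
    }
    totals = defaultdict(float)
    for order in orders:
        # Orders with no matching customer are labelled, not silently dropped.
        region = region_of.get(order["customer_id"], "UNKNOWN")
        totals[region] += order["amount"]
    return dict(totals)


orders, rejected = clean_orders(read_rows(RAW_ORDERS))
totals = total_by_region(orders, read_rows(RAW_CUSTOMERS))
print(totals)
print(len(rejected), "row(s) rejected")
```

The design choice worth noticing is that bad rows and unmatched keys are kept visible (a rejected list, an UNKNOWN bucket) instead of being thrown away, because those are usually the rows that tell you the most about a data source.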
The average data analyst will spend around 80% of their time on this stage of the analytics process. If they don’t, they are likely to be wasting whatever time they spend actually analysing the data. Skimping here will have rendered the analysis virtually meaningless, thanks to poor data quality and inaccurate or incomplete elements.
This doesn’t sound like the sexy, high-octane world of data science that the media led us to believe we were embarking upon. Have we been sold a lie?
Well, yes and no.
If you don’t enjoy the prospect of taking data from:
- the neat and tidy data warehouse
- text file data dumps from legacy systems that don’t speak to the DWH
- Excel spreadsheets from the finance team
- public datasets from government bodies
- raw data files gleaned from social media accounts
- and potentially thousands of other data sources
…then sifting manually through them to see where they need to be cleaned and joined, maybe this isn’t for you after all.
I’ve written before about how important an inquisitive nature is to making a real go of this profession. No sugar coating: it’s in the data wrangling arena that you’ll likely spend the most time putting it to work. Picture yourself manually scanning through thousands of rows, looking at data structure and content, and creating training datasets. That’s long before you get the thrill of writing scripts to automate the same job over the other 10 million rows. But it’s where your business knowledge comes into play, along with that old detective-esque need to ask questions of the data. And ultimately find the answers within it.
I don’t believe there is any shortcut to learning the dark arts of data wrangling. No route other than getting your hands dirty and delving ever deeper into each of your data sources. Digging and digging until you know their intricacies and foibles like the back of your hand. Master this skill and you will learn to savour the joy of a new untamed data source and the endless opportunities it can bring your analysis.
Likely to soil your undergarments at the first sign of an unstructured, unclean dataset that you are unprepared and unwilling to wrangle? You’ll always founder on similar rocks in future and your overall desirability as a data scientist or analyst will suffer.
My favourite question when interviewing analysts is to ask the candidate for an example of a time they faced a challenging situation and how they worked through it.
When a real data analyst is given that question, the answer inevitably comes down to working with difficult datasets. I like to see how they worked their way through the potential minefield to build a workable solution out of the parts they’d been handed.
Show me how you approached the data wrangling, what your thought process and investigation methods were and how you made good out of it. I guarantee you’ll have gone a long way to winning me over when it comes to grading your overall performance.
I know how much of an impact getting your data wrangling right can have on the overall success of a project. If I can tell that the analyst involved feels the same way then I’ll be a hell of a lot happier trusting them to get it right for me.
Wrangle like your life depends on it.
Skimp on this stage and I’ll conclude that you don’t give a shit whether your results are as close to accurate as they possibly can be. If you only want to get some numbers back on a spreadsheet then you might as well write a random number generator and save us all the wasted time. Give the data wrangle the respect it deserves and we can all reap the benefits.