One of my favourite parts of doing the Analysts Assemble series is getting to find out more about data worlds that I would never ordinarily know existed. And this interview with Jesse Pisel is an excellent example of that. Jesse analyzes data in the geology and geospatial tech world and caught my eye with a series of geology-based data visualizations on Twitter. Let’s get into Jesse’s story in a little more detail.
Tell us a bit about yourself, how did you get into the data space and what does your data journey look like so far?
My background is in quantitative stratigraphy and geology, so I learned a lot of the basics of data and analysis while doing my PhD at the Colorado School of Mines. I first got into data using R and Mathematica for my dissertation, and started learning Python as well. As I continued to learn more about data analysis, I really got into Python as my language of choice. This was in 2015, as more and more data-science-oriented packages were coming out for Python.
When TensorFlow came out I knew that I had picked a good language to focus on, and have continued down the Python path. I have worked on a variety of different datasets over the past few years that include everything from spatial distributions of ancient river channels to time-series forecasting of oil and gas prices. The variety is great because it gives me a chance to learn new tools and apply different things I have learned to new datasets.
What’s a typical day look like for you in your current data role? Which tools and languages do you use? Big team/small team/lone wolf? Remote/office based/co-working space?
A typical day for me depends on where I am in a project. If I am starting a new project, or if a coworker comes to me with questions about a dataset, I could spend the majority of my day doing exploratory data analysis. If we have a funding proposal due soon, or if I am wrapping up a project, I tend to do a lot of technical writing and documenting. It’s kind of funny looking at my GitHub page: I can tell when I am wrapping up a project because there are fewer commits and a lot less activity as I spend most of my time writing up documents.
Most days though, I spend a lot of time visualizing and understanding the data I am working with. The Seaborn package in Python is great for this because you can quickly visualize distributions and relationships in the data. One key thing that I have been working on lately is organizing workflows and making sure I comment code so that I can come back to the project and understand what I did.
Jupyter notebooks are great for adding markdown comments and creating a linear workflow that is easy to follow from cell to cell. I highly recommend Jupyter for collaborating and sharing information; you can run Julia, Python, or R in notebooks, which can lead to some pretty great integrations between the languages.
As far as work environment goes, I work as a lone wolf in our department. We have other analysts, but they focus mostly on geospatial data in GIS. Despite this, they are great to chat with about different ideas, and they help make sure my analysis logic is solid as well. We are all co-located in one building, which makes it easy to walk over and chat with people, but we still use a lot of remote tools (chat, docs, version control, etc.).
I like that I can work on shapefiles in geopandas in Python and do fully custom spatial analysis before sending them to colleagues working in GIS for online maps, presentations, and reports. One last tool that I have been using quite a bit lately is a package called Verde, which is used for gridding data in Python. It’s great for making maps and interpreting subsurface and geophysical data (gravity, magnetics, etc.).
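To sketch the gridding idea mentioned above, the example below interpolates scattered point measurements onto a regular grid. It uses SciPy's `griddata` as a simple stand-in rather than Verde itself, and the "survey stations" and values are synthetic; Verde provides a richer, scikit-learn-style interface for this kind of task:

```python
# Gridding scattered spatial data onto a regular grid -- a rough stand-in
# for the kind of operation Verde performs, using only NumPy and SciPy.
# All sample points and values here are synthetic.
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Scattered "observations" (e.g. gravity readings at random survey stations)
x = rng.uniform(0, 10, 50)
y = rng.uniform(0, 10, 50)
values = np.sin(x) + np.cos(y)

# Regular 25 x 25 grid to interpolate onto
gx, gy = np.meshgrid(np.linspace(0, 10, 25), np.linspace(0, 10, 25))

# Linear interpolation of the scattered points onto the grid
# (cells outside the convex hull of the data come back as NaN)
gridded = griddata((x, y), values, (gx, gy), method="linear")

print(gridded.shape)  # (25, 25)
```

Once the data is on a regular grid like this, it can be handed off as a raster for maps or further geophysical interpretation.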
You’re building up a good following through tweets about your work on Twitter. How important do you think it is for data professionals, at all stages of their career, to share publicly what they are doing and learning?
At first I wasn’t sure how to really use Twitter besides keeping up with what was going on in the world of machine learning and geology. After seeing some of the people I follow tweeting about their blog posts, I figured that would be a great way to create content that I was interested in, teach myself new skills, and start a conversation with anyone interested in similar topics.
I think it’s key for data professionals to share different things they are learning even if they think it’s been done already. So many times I run across an analysis or a way of thinking from a different field that I can directly apply to geologic problems.
One of my favorites came from a discussion with an economic data analyst, where we came up with the idea to use cost functions to model connectivity between oil and gas reservoirs. So it’s great for everyone to share what they are learning because you never know who might be working on a similar problem in a different field.
Where do you see your own data career going next? Building on your technical skills or moving into a more management-based role?
I really enjoy the technical side of things, so I want to stay on that side of the business. I think the next area that I really want to move into is deploying solutions at scale. So I guess it’s moving from more of a science role into a bit of an engineering role.
Building a solid foundation for deploying at scale completes the data life cycle from collection to processing to large scale inference.
Understanding all the different parts of the data life cycle is key to designing systems that work efficiently. But who knows, I could be happy in a management-type role, as there are definitely places where machine learning could make managing a team significantly easier.
If you had a list of “best-kept-secrets” (websites, books, coaches) that have helped you, which would you recommend?
For geologists, geophysicists, and anyone interested in subsurface data, I have to recommend the Software Underground. It’s a great community of scientists and engineers discussing everything from Python and machine learning to rock types and hackathons. There is a Slack channel, a GitHub, and a bunch of other resources at https://softwareunderground.org/
Another great place to start is contributing to open-source projects. It’s pretty intimidating at first to contribute to production-level code, but everyone is really nice and offers a lot of help on writing efficient code and documenting it.
It’s amazing how much work maintainers put into open-source packages, and they need as much help as they can get. So it’s a win-win: you learn something, and you get to contribute to something you use on a daily basis.
What is the number one piece of advice you give to aspiring data scientists?
Understand your data. Knowing how it was collected, who collected it, when it was collected, how it was processed, and what it’s supposed to “look like” goes a long way to interpreting results, finding bias, and solving problems.
Where can readers find you online?