You’ve got no experience, but you want to become a data scientist. Where do you start? Should you enter a bunch of Kaggle competitions and throw code up on GitHub until you get noticed? How do you develop your skills and prove to a potential employer that you have them?
The reality is that the only way to truly develop data science experience is to do data science. But while you’re trying to land your first job, there are a few things you can do to make yourself into a stronger candidate and hone your skills along the way.
The problem with Kaggle and GitHub
It is a common hope that, by spending a few weekends learning a few fancy programming tricks and scoring a high rank on Kaggle, you will emerge on the other side as a fully formed data science butterfly, ready to take your new career by storm. Sadly, it doesn’t usually work out quite like that.
Options like Kaggle can be a helpful way to pick up a few skills and practice testing hypotheses. But they are no substitute for real experience, and you should understand their limitations so you can use them effectively, rather than obsessing so deeply about hitting the top of the leaderboard that you forget to learn applicable skills.
One of the primary problems with neatly packaged options like Kaggle is that the issues you will have to address as a working data scientist will almost never come in neatly packaged boxes. You will rarely have the luxury of pre-cleaned data tied up with a bow and handed to you along with a ready hypothesis and a clearly outlined problem. Instead, much of your job will involve cleaning up raw data, which means you need to develop that skill. You will also have to develop strategies for turning that messy data into practical information, and you’ll have to be able to explain the real-life implications of your results, not just the results themselves.
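To make "cleaning up raw data" a little more concrete, here is a minimal sketch in Python using only the standard library. The sample records, the column names, and the `clean_rows` helper are all made up for illustration; real data will be messier and will need rules of its own.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent casing,
# placeholder text instead of numbers, and a duplicated person.
RAW = """name,city,age
 Alice ,new york,34
BOB,Boston,n/a
alice,New York,34
Carol,  chicago ,forty
"""


def clean_rows(text):
    """Return de-duplicated rows with normalized text and integer ages (or None)."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_raw = row["age"].strip()
        # Anything that is not a plain whole number is recorded as missing
        # rather than guessed at.
        age = int(age_raw) if age_raw.isdigit() else None
        key = (name, city, age)
        if key in seen:
            continue  # skip rows that are duplicates once normalized
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned


rows = clean_rows(RAW)
for r in rows:
    print(r)
```

Notice that every line of this is a judgment call: whether "forty" should become 40 or a missing value, whether "alice" and "Alice" are really the same person. Being able to explain choices like these is exactly the skill a tidy competition dataset never asks you to practice.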
A great score on Kaggle is certainly a nice cherry on the cake if you’re job-hunting, but it will only get you so far if you don’t have the fundamentals to back it. Similarly, throwing code up on GitHub and hoping to get noticed is a bit like shouting into a hurricane and hoping somebody hears you. The number of people competing for the rankings on those sites is so high that the chance of you standing out from the crowd is very low, and it has little to do with your practical skills. Employers know that a high competition score does not necessarily correlate with your being able to do effective work.
So what skills do I actually need?
If you’re thinking of pursuing a data science career, chances are you’ve already got some programming skills under your belt. But if you’re still learning or trying to develop yourself further, make sure you’re putting your efforts in the right direction. By far the most important programming languages you’ll need to know are Python and R. Start with Python, which is more versatile across different applications, but make sure you have a decent grasp of both, as R is often more powerful for dedicated statistics.
But fancy coding is only a small fraction of what your job will actually consist of. Data science includes many things, but it is nothing if not mathematically driven. In order to truly understand your results, you need to understand the principles behind them, which means you had better know your numbers. Aside from solid programming skills, you should also have a pretty thorough grasp of statistics, calculus, and linear algebra. In addition to that, you should have a good understanding of the scientific method. Remember: “data” is only one half of the job description. You have to know “science” too.
How do I market myself?
Even if you don’t have job experience yet, you still need to demonstrate to your potential employer that you are a great choice for their company. We’ve already talked about why Kaggle and GitHub might not be the most efficient way to land your new career (although if you DO have a sexy Kaggle score to show off, flaunt it by all means). So what’s the alternative?
Putting together a data science portfolio is a great way for you to show off your skills in practical application. Your portfolio should consist of several projects that show your ability to do good work in multiple capacities, including creating and testing hypotheses, cleaning and analyzing data, and being able to thoroughly explain your results. You should also try to include a range of project types, not just the cool-looking heavy-duty machine learning stuff (which is unlikely to be your first job anyway). Remember that being able to effectively clean up and explain your data is at least as important as your ability to build complex models.
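As one example of what "creating and testing hypotheses" can look like in a small portfolio project, here is a rough sketch of a permutation test in plain Python. The numbers are invented (imagine page load times, in seconds, for two versions of a site), and `permutation_p_value` is a hypothetical helper written for this post, not a library function.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Estimate how often a random relabeling of the pooled data produces a
    difference in group means at least as large as the one observed."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)


# Made-up measurements for illustration only.
old_version = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
new_version = [10.2, 10.5, 10.1, 10.4, 10.0, 10.3]

p = permutation_p_value(old_version, new_version)
print(f"Estimated p-value: {p:.4f}")
```

The code is the easy part. What a portfolio reader wants to see is the sentence that follows it: what the hypothesis was, what a small p-value does and does not tell you, and what you would recommend doing about it.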
While building your portfolio, keep in mind that in the real world nobody cares if your project is executed elegantly, methodically, and to 100% completion if it doesn’t also result in tangible information that your employer can put to practical use. If your work isn’t improving the company’s results, it is worthless to them, even if the effort that went into it is immaculate. This means that you’ll need to get used to working quickly and being willing to change tactics if the situation calls for it. Fifteen half-completed projects with real influence are better than one perfect project that doesn’t drive results.
The reality is that if you want to do data science, there is no substitute for actually doing data science. You can learn all the fancy tricks in the world, but the only way to really get a good grasp of the full scope is to stick your elbows in and do it. So find some problems, develop some hypotheses, sort some data, analyze some results. That’s data science!