Over the last six months, I’ve been delving deeply into R, linear regression and machine learning. Part of the rationale has been to refresh some of the concepts I learned in grad school studying signal processing.

But a more important driver has been the need to better understand how to qualify, evaluate and hire data scientists. Data science can be a massive competitive advantage, and many of the companies I work with are hiring data scientists.

Figure: Support Vector Machine Visualized

Finding the right person to model your data and generate insights can provide massive leverage for your company. But understanding what to look for in a candidate is a challenge.

In my view, when hiring a data scientist, one should look for three main skills with a bonus fourth: data processing, data analysis, data modeling, and system architecture.

Perhaps the most undervalued skill, data processing is the ability to transform data from its current form into a form that can be analyzed. Data scientists often have to extract data from many different internal databases, so a knowledge of SQL is important. Scraping data from the web or from PDFs is also quite common, so mastery of a scripting language is essential. Last, a basic knowledge of Unix is important for accessing databases from the command line and over secure channels.
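To make the SQL part concrete, here is a minimal sketch in Python using the standard library's sqlite3 module and an in-memory database. The tables, columns and rows are all hypothetical, invented for illustration; a real job would pull from a company's own databases.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 20.0), (1, 35.0), (2, 50.0), (3, NULL);
""")

# Join the two tables, drop rows with a missing amount, and aggregate
# to one row per region: the shape an analysis would start from.
rows = conn.execute("""
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount IS NOT NULL
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

for region, n_orders, revenue in rows:
    print(region, n_orders, revenue)
conn.close()
```

Most of the effort sits in decisions like the WHERE clause: what to do with a missing amount is a judgment call that shapes every number downstream.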

After data is processed, it must be analyzed. Data analysis is the application of statistics to data to understand what the data imply. Almost anyone can use Excel to build a summary analysis of averages and medians. But without a good grasp of statistical significance, the distributions underlying the data and data bias (like survivorship bias), statistics and data can lead companies to the wrong decisions. So a strong data scientist needs to understand the theories of statistics including the central limit theorem, the different types of distributions, statistical significance testing and so on.
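As a small illustration of why an average alone can mislead, here is a sketch of a permutation test in Python. The order values are made up, and the permutation test is just one of several ways to check significance; I picked it because it needs nothing beyond the standard library.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_shuffles):
        # Reassign the values to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_shuffles + 1)

# Made-up numbers: order values under two page designs.
# Variant A contains one unusually large order.
variant_a = [12.1, 9.8, 11.4, 10.9, 35.0, 10.2]
variant_b = [10.4, 10.1, 11.0, 9.7, 10.8, 10.5]

p_value = permutation_p_value(variant_a, variant_b)
print("mean A:", round(mean(variant_a), 2), "mean B:", round(mean(variant_b), 2))
print("p-value:", round(p_value, 3))
```

Variant A has the higher average, but much of that gap comes from a single order. The p-value estimates how often a gap at least this large would appear if the design made no difference at all, which is the question a summary table in Excel never asks.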

Last, the data is modeled. Data science becomes a weapon when the vast amounts of data a company collects can be used to predict or inform business decisions, for example prioritizing a sales funnel or determining what to pay for an ad impression in real time. This is data modeling, or machine learning.

There are many machine learning methods: neural networks, support vector machines, linear regression, regression trees, k-nearest neighbors. And the application of these methods isn’t challenging. But implementing them is only the first 20% of the work.

The rest, the 80%, is tuning the models to make them perform. Fire up R and type lm() or rpart() and you have a linear model or a regression tree. But the odds are that the data fed into the model won’t be the right data the first time, and that it will need to be cleaned or transformed. Many different models will need to be run, compared and contrasted until one with strong enough predictive power is found.
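Here is a rough sketch of that compare-and-contrast loop, written in Python rather than R so that it runs with the standard library alone. The data is synthetic, generated under my own assumption that the target grows with the square of the input, and the hand-written least-squares fit stands in for what lm() would do in R.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def rmse(predict, xs, ys):
    """Root mean squared error of predict(x) against ys."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Synthetic data (an assumption for illustration): y grows with x squared.
rng = random.Random(42)
xs = [rng.uniform(1, 10) for _ in range(200)]
ys = [2.0 * x * x + rng.gauss(0, 5) for x in xs]

# Hold out the last quarter so models are scored on data they never saw.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Candidate 1: always predict the training mean.
mean_y = sum(train_y) / len(train_y)
# Candidate 2: a line on the raw feature.
a1, b1 = fit_line(train_x, train_y)
# Candidate 3: a line on a transformed feature, x squared.
a2, b2 = fit_line([x * x for x in train_x], train_y)

scores = {
    "mean baseline": rmse(lambda x: mean_y, test_x, test_y),
    "line on x": rmse(lambda x: a1 + b1 * x, test_x, test_y),
    "line on x^2": rmse(lambda x: a2 + b2 * x * x, test_x, test_y),
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: held-out RMSE {score:.2f}")
```

The fitting code is identical for the second and third candidates. Only the input changes, and that transformation is the kind of tuning work the 80% refers to.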

Experience tuning models is essential.

The bonus skill is system architecture. Ultimately, machine learning systems may be deployed into production environments, like Netflix’s recommendation algorithm.

Data scientists who understand how to integrate their infrastructure into the existing web stack without impacting performance or creating excessive dependencies are tremendous assets to the company.

Now the challenge for me is building a model to tell me where to find these great data scientists. If you find someone who can do that, send them my way.

Published 2012-10-29