The Key Ingredient to Disrupting with Machine Learning

Which are the ripest areas for startups to disrupt using machine learning? At the core, machine learning/artificial intelligence relies on two key ingredients: advanced algorithms and data sets to train those algorithms. Novel algorithms are increasingly making their way into the public domain in the form of open-source libraries. So, the key differentiator for startups, and ultimately the source of long-term competitive advantage, is access to proprietary data sets.

In the consumer world, there are natural and intrinsic monopolies at play. Google’s search has a natural feedback mechanism: users click on the best search results, which helps Google continue to have the best search. The network effects of Facebook’s social network, in combination with its strategic acquisitions of other fast-growing social networks like Instagram and WhatsApp, reinforce its natural monopoly.

Both of these companies’ proprietary access to user data has helped them establish market dominance in the online/mobile advertising world. Google and Facebook run first-party ads on their core websites, and use that information to power a marketplace and also run an ad network. With a God’s eye view of the entire ad ecosystem, Google and Facebook can train the best algorithms to optimize revenues. Everyone else is at a colossal and insurmountable disadvantage.

Consumer speech recognition is a similar story. Ok Google, Alexa, Siri, and Cortana benefit from the massive distribution advantages of their parent companies. As millions of users continue to provide data to the systems, the language models improve, and accuracy improves with them. As a startup, it’s hard to compete without access to that kind of data set.

But in the enterprise world, natural monopolies are far fewer, for several reasons. First, software buyers actively seek to prevent single-vendor lock-in; hence the rise of open source. Second, there are far fewer viral/social feedback loops. Third, it’s rare to find software that improves the more customers use it, in the way that Google search improves as more people query.

Consequently, the proprietary data sources that are essential to train next-generation machine learning models are easier to amass in the enterprise world than in the consumer world. The most persistent and longest-running example of this is the security space, where many startups have built lists of exploits and attacks. Because these attacks evolve over time, no single company holds a monopoly on the exploit database.

In addition, the enterprise world is a bit behind the consumer world in that there are many data sets whose value is not yet recognized. Few companies run long-term regressions on human resources information systems (HRIS) data the way that Facebook models user interests spreading through its social network. We are just beginning to see the advent of predictive lead scoring tools for sales development representatives.

Because the data sets are far more fragmented, because the buyer population prefers to see multiple vendors, because the application of machine learning in the enterprise world is nascent, and because the incumbents aren’t investing heavily in these technologies yet, it will be easier for B2B startups than for B2C startups to amass proprietary training data as a competitive advantage. It’s not that we won’t see innovation in the consumer world. But getting there requires building an enormous customer base with a proprietary data set that none of the incumbents can replicate or proxy.