Marko Balabanovic, Chief Technology Officer at Digital Catapult, writes about the barriers facing machine learning companies, particularly when they are looking to scale.
Machine learning techniques, within the field of Artificial Intelligence, are becoming increasingly effective and important for data innovators. The major challenges facing fast-growing organisations have been well documented in the Scale-Up Report, and include recruiting skilled employees, building leadership capability, accessing customers and finance, and navigating infrastructure.
However, for companies whose products and services use machine learning, we see two more specific barriers: access to skilled machine learning specialists, and access to large pools of data with which to train their algorithms. Both are exacerbated by the dominant position of the “GAFA” major internet companies (Google, Apple, Facebook and Amazon), who are rapidly building large machine learning teams and have many advantages in acquiring training data through the data and channels they already control.
The well-documented shortage of data scientists gives us strong indications about the availability of machine learning specialists, as the two fields are closely related. A McKinsey study projects that global demand for data scientists will exceed supply by more than 50% by 2018. Whereas you’d expect a data scientist to use techniques from fields such as statistics, data mining and predictive analytics to extract knowledge and insight from structured and unstructured data, a machine learning engineer or scientist may be applying techniques in entirely new areas or innovating on the algorithms and tools themselves.
The typical background would include postgraduate computer science work. In practice many roles require both data science and machine learning skills, and many individuals cross the boundaries. Unfortunately we don’t have hard numbers on the supply of and demand for machine learning specialists specifically: their job titles are often simply 'software engineer' or 'research scientist', and the rise in demand is more recent. However, anecdotal evidence from many sources, as well as our own experience, tells us that these skills are in very high demand and paid above market rates by the big internet firms.
The trend is clear: the next wave of services based on data and machine learning is driving fierce competition for skilled staff globally, with resulting advantages to the organisations with the deepest pockets, who can poach staff from universities or the marketplace, or acquire companies for their talented founding teams. The unparalleled access these same organisations have to the largest pools of training data serves as a strong incentive for the ambitious in this field.
To create new products or services that employ machine learning techniques, your organisation will need lots of examples (“training data”) for the algorithms to learn from; that is the essence of how machine learning works. Indeed it is the availability of large enough training datasets that has been the bottleneck. As Alexander Wissner-Gross explains, many of the recent AI breakthroughs happened less than three years after key datasets were available, but 18 years after the key algorithms were developed.
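The dependence on training examples can be sketched in a few lines of Python. The task, numbers and code below are entirely hypothetical (a toy curve-fitting problem, not any system discussed in this article); the point is that the algorithm stays fixed while only the amount of training data changes:

```python
import random

random.seed(0)

def make_data(n):
    """Generate n noisy training examples of a hypothetical target: y = 2x + 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit for a one-variable linear model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# The learning algorithm never changes; only the training set grows.
# With 5 examples the estimate can be well off; with 5,000 it is
# reliably close to the true slope of 2 and intercept of 1.
for n in (5, 5000):
    slope, intercept = fit_line(*make_data(n))
    print(f"n={n}: slope={slope:.2f}, intercept={intercept:.2f}")
```

The same pattern holds, at vastly larger scale, for the translation and face recognition systems described below: better results come less from cleverer algorithms than from more examples to learn from.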
“...the prioritised cultivation of high-quality training datasets might allow an order-of-magnitude speedup in AI breakthroughs over purely algorithmic advances.” - Alexander Wissner-Gross
This effect has been termed the “unreasonable effectiveness of data”. It is not surprising that big strides are being made by the big internet companies who have unparalleled access to training data. For example, Google Translate achieved breakthrough performance at Arabic- and Chinese-to-English translation using a dataset of more than 1.8tr tokens from Google web and news pages (and it now has a huge pipe of incoming data, currently translating over 100bn words a day), while Facebook’s deep learning face recognition system was trained on the "largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities".
Equally, it is vitally important for growing startups to establish access to growing pools of training data. It is necessary to make their product succeed, and it also becomes a key defensible asset. Startups will even evaluate merger or acquisition opportunities based on the availability of data (as Shivon Zilis of Bloomberg Beta says, "I’ve heard from founders that they are only interested in an acquisition if the acquirer has the right dataset to make their product work”).
Those versed in machine learning will note that for the purposes of simplicity, I am conflating supervised and unsupervised methods, and indeed reinforcement learning where one could argue that the algorithm needs an environment to explore rather than fixed sets of training data. However, when we say “data” we mean both fixed datasets and access to ongoing feeds. In the latter case the incumbents still have a huge advantage through their billions of customers and their control of the most popular internet applications or interfaces, where vast amounts of customer feedback to train and test machine learning systems are readily available.
Unlike training datasets, which are usually proprietary and confidential, algorithm development is increasingly moving into the open source world, so state-of-the-art software and hardware designs are open to all. From the bigger players, examples include TensorFlow from Google and Torch, contributed to by Facebook, as well as Facebook’s Big Sur hardware designs, while Amazon contributed to the $1bn invested to found Elon Musk and Sam Altman’s new non-profit research company OpenAI. As machine learning algorithms become commoditised we will see more ways to use 'AI as a service'; examples already include Microsoft Azure, IBM Watson, and Google Cloud Machine Learning and Vision. For all these organisations, more of the value is in the data than the algorithms.
Faced with the challenge of acquiring sufficient training data, startups can attempt a variety of strategies or business models. Moritz Mueller-Freitag has categorised these into 10 varieties, ranging from creating datasets by hand to releasing 'side' applications that are valuable to consumers but have a side effect of generating large sets of training data. However, in many cases there will be what has been called a “data network effect”. A normal network effect makes a service more valuable as it acquires more users: the more people use a social network, the more valuable it is to each user, which tends to lead to a 'winner takes all' market. A data network effect happens when a service becomes smarter (through acquiring more data) the more people use it:
"the more users use your product, the more data they contribute; the more data they contribute, the smarter your product becomes ...; the smarter your product is, the better it serves your users and the more likely they are to come back often and contribute more data – and so on and so forth. Over time, your business becomes deeply and increasingly entrenched, as nobody can serve users as well.” - Matt Turck, Venture Capitalist at FirstMark.
This effect means that in many markets the big internet players already have an inbuilt set of advantages: they have existing consumer and business relationships; they have the big messaging and social networks; they have a lot of the content of the internet; they control the big mobile operating systems; they see usage data across many platforms, products and services. This barrier is significant for growing machine learning companies.
The bigger internet players are using their dominant data position to attack new markets such as the internet of things, autonomous vehicles or healthcare (for example, Google has Nest, Sidewalk Labs, the Self-Driving Car Project, DeepMind Health; Apple has HomeKit, HealthKit, CareKit, CarPlay; 5% of US Amazon customers now have an Amazon Echo device listening for voice commands in their homes). Larger players lacking in data will make acquisitions to catch up, such as IBM’s recent acquisition of the Weather Channel and Truven Health Analytics. We have written about the data network effect for smart cities, where we believe that city data and resulting services will also be dominated by GAFA companies, rather than local governments or Internet of Things vendors.
Looking at the 10 specific strategies to acquire data for machine learning as suggested by Mueller-Freitag, there are really only three where a startup is not at a huge disadvantage:
We would add a further factor to that list not in Mueller-Freitag’s 10 strategies. Despite the global nature of internet markets, it is not currently the case that all data can flow freely across borders. Personal health data, as in the example above, can be tightly regulated and restricted to data centres in specific countries; defence or security data even more so. Although this is a potential advantage for local startups or scaleups, in practice the NHS/DeepMind arrangement and Google’s local presence in the UK provide a counterexample.
We can see that in the majority of cases the ready access to funding, distribution channels and existing data sources gives the incumbents huge advantages, and leaves a tricky chicken-and-egg problem for startups looking to scale. Furthermore, the data network effect is even stronger than we’ve described, as it also serves to pull in more of the machine learning specialists who are in such short supply; they are attracted to the organisations that can offer them the biggest datasets and largest user populations. As quoted in Fast Company, Rick Szeliski, previously lead of Microsoft Research’s Interactive Visual Media group, said of his move to Facebook in October 2015: "We came to Facebook because this is where the photos are, and where the data is."
We have shown how scaling companies using machine learning face two significant barriers, in addition to the problems generally faced by scaleups. One is access to machine learning skills; the other is access to, or a way to build, large sets of training data. Both are exacerbated by the dominant position of the big four internet giants Google, Apple, Facebook and Amazon, who are rapidly hiring machine learning specialists globally, and who have access to the largest pools of training data. The 'data network effect' means their position is hard for a new entrant to challenge.
The opportunity now exists for UK machine learning companies to pool resources, both people and data, to achieve scale and momentum with competing data network effects, particularly in domains that are currently less well served. By sharing access to a group of machine learning specialists, companies can gain cheaper access to scarce talent, who will in turn be easier to recruit with the prospect of many interesting challenges and larger datasets across domains.
By further sharing access to data and feeds of data, companies can scale faster and create more valuable machine learning systems. There will always be data that companies need to keep confidential as their own assets, but there will be many cases where data can be gathered to benefit several non-competitive partners. For example, in a health domain, personal health records or sensor readings can be used by multiple organisations addressing different markets. At Digital Catapult we’re looking to create these kinds of collaborations, and we’d love to talk to you if you’re interested or have feedback.