The UK-based startup DeepZen has created the world’s first AI-generated audiobook, the psychological thriller “She Chose Me” by Tracey Emerson, using a text-to-speech solution that sounds human and has a particular focus on capturing emotions in the voice. This was made possible thanks to support from Digital Catapult and Cray, who provided the powerful technology, computation and know-how needed to achieve this landmark advance. DeepZen’s work on human-sounding AI will make text-to-speech technology a more viable option for audio recordings of all kinds, and for longer-form publications and video gaming in particular.
DeepZen set out to make AI sound human
Recording an audiobook or voice-over with a human voice artist can be a lengthy and expensive process. For example, a 300- to 350-page book converts into around ten hours of audio, which can take weeks of recording and editing.
While AI has been able to provide sophisticated speech interfaces for some time, it had not yet been able to provide a human-sounding listening experience, complete with the emotion and inflection that make spoken content enjoyable to listen to. DeepZen’s purpose is to turn this ambition into reality, making text-to-speech more affordable and accessible for all kinds of organisations.
Digital Catapult and Cray help to shape and power the solution
Digital Catapult’s Machine Intelligence Garage programme gives startups access to the exceptional computational power needed to answer the “What if?” questions that innovative ideas begin with.
The machine learning solution that DeepZen has developed for creating audiobooks requires large amounts of compute power to train, an expensive resource for any startup. Together, Digital Catapult and its technology partner Cray provided DeepZen with the required resources and the technical expertise to use them. DeepZen was given access to Digital Catapult’s NVIDIA DGX-1 to initially formulate its models, and to a Cray CS-Storm cluster with 32 state-of-the-art NVIDIA Tesla GPUs to handle the demands of training and validating them. With access to these resources, DeepZen could accelerate its research and development and iterate over a large number of different model architectures and parameters in less time, enabling it to make critical improvements to its model and develop it to production readiness.
Besides access to compute resources and technical expertise, Machine Intelligence Garage also supported DeepZen with a workshop on user-centric design and tailored coaching for investor pitching. Through an investor showcase, Machine Intelligence Garage gave DeepZen the opportunity to pitch to high-calibre investors and build up its network in the investor community.
DeepZen’s ‘humanlike speech’ is now a reality
Thanks to Digital Catapult and Cray, DeepZen has been able to improve its existing prototype to such a degree that the company is ready to take it to market, as demonstrated by multiple successful engagements with publishers. Using the machine learning model trained with the help of Digital Catapult and Cray, DeepZen has published the world’s first audiobook created entirely from text using AI, with several more to follow. DeepZen’s human-sounding text-to-speech technology can now be used during the creative process for audiobooks, video games and all kinds of spoken narrative. This has multiple benefits: empowering organisations that would not otherwise have been able to afford live recording, accelerating production times and enhancing the listening experience for the end user.
Listen to examples of DeepZen at work.
“Digital Catapult and Cray engagement has helped us significantly to achieve our development goals by enabling us to train multiple models simultaneously on the super-computers. We would not be able to experiment as much as we could without the server usage made available to us through the Cray and Digital Catapult partnership.” – Taylan Kamis, Co-founder and CEO, DeepZen