2022 will be remembered as the year of Stable Diffusion, of DALL·E 2, of incredible text-generation models like PaLM and code generators like AlphaCode. And yet, chatting last month with Andrés Torrubia, he told me that the most interesting thing he had seen this year was an artificial intelligence that came out of the OpenAI lab, an AI called Whisper. What do you think is the most impressive thing that has come out this year? Well, curiously, look, so far, Whisper, I think. Do you know why? Curious, eh. What impresses me about Whisper is that Whisper works; for me, if dictation were self-driving cars, Whisper would be the first car that actually drives itself. In other words, it's the first one that performs like a person.

But first, so you understand what Whisper is, I am going to ask you to do the following exercise. I'm going to play you an audio in English and your task is to transcribe each of the words you hear. Are you ready? Three, two, one. Did you understand anything? Yeah, me neither. Well, this artificial intelligence did: this is the transcription it produced, and it is essentially perfect. How's your Korean? That's no problem for Whisper either, and it can even transcribe this audio into perfect English. And well, it understands me too. What you are seeing on screen right now is the speech-to-text output that Whisper produces when it is given the audio track you are listening to. Look closely: not only does it get an almost perfect transcription, understanding even specific terms like Whisper or speech-to-text, it is also able to generate full stops, commas and other punctuation marks, something many commercial speech recognition models usually struggle with.

And this is very interesting. Well, not this, but Whisper. Whisper in general has many interesting things about it, and the first is the context in which this tool appears. After a year of incredible achievements by the OpenAI artificial intelligence lab, suddenly, out of nowhere, a collaborative initiative like Stability.ai emerges, which in September made it its banner to open-source many of the technologies that OpenAI, for its part, had decided to keep to itself and share only through paid services. For me this is not a problem either, since at the end of the day OpenAI as a company has to pay its bills, and at least it is giving us a way to access these powerful artificial intelligences. Take note, Google. But of course, a new kid comes to town and starts giving away candy to the other children, and suddenly the popular kid starts to be displaced. And at that precise moment OpenAI comes out of nowhere and gives us Whisper, for the benefit of all. Because yes, friends, this is open source; I know you love to hear those words. At the end of the video I will show you a mini tutorial so you can see how easy this tool is to use, and I will also share a notebook to make it super simple for you.

And that is part of what makes Whisper such an interesting tool, but it is not the only thing. One of the things that has caught my attention the most is that Whisper is not some complex system designed to process audio like never before, nor a convoluted pipeline with a bunch of processing modules. No, Whisper is this right here: the transformer neural network from 2017, with no changes, nothing new. It is an architecture we are all already familiar with. So, if that is the case, why didn't a technology like Whisper already exist?
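To make that "it's just a standard transformer" point concrete, here is a minimal, purely illustrative sketch in PyTorch, and definitely not OpenAI's actual code: the layer sizes, vocabulary size and frame counts below are made-up placeholders. It shows the same high-level shape, a vanilla 2017-style encoder-decoder transformer that takes log-Mel audio frames in and predicts text tokens out.

    import torch
    import torch.nn as nn

    # Illustrative only: a vanilla encoder-decoder transformer mapping log-Mel audio
    # frames to text-token logits. All sizes are placeholders, not Whisper's real ones.
    N_MELS, D_MODEL, VOCAB = 80, 512, 50000

    class TinySpeechToText(nn.Module):
        def __init__(self):
            super().__init__()
            self.audio_proj = nn.Linear(N_MELS, D_MODEL)   # project each audio frame to model width
            self.token_emb = nn.Embedding(VOCAB, D_MODEL)  # embed the text tokens decoded so far
            self.transformer = nn.Transformer(
                d_model=D_MODEL, nhead=8,
                num_encoder_layers=6, num_decoder_layers=6,
                batch_first=True,
            )
            self.lm_head = nn.Linear(D_MODEL, VOCAB)       # predict the next token

        def forward(self, mel_frames, tokens):
            # mel_frames: (batch, frames, N_MELS), tokens: (batch, seq_len)
            audio = self.audio_proj(mel_frames)
            text = self.token_emb(tokens)
            hidden = self.transformer(audio, text)         # encoder over audio, decoder over text
            return self.lm_head(hidden)                    # logits over the vocabulary

    model = TinySpeechToText()
    logits = model(torch.randn(1, 300, N_MELS), torch.zeros(1, 10, dtype=torch.long))
    print(logits.shape)  # torch.Size([1, 10, 50000])

The real model adds the usual details on top of this skeleton, things like positional encodings, a small convolutional front end over the spectrogram and causal masking in the decoder, but the skeleton itself is the standard architecture. So the architecture alone does not explain it.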
Well, the key that makes Whisper so powerful is in the data and in how its training has been structured. To train it, OpenAI used no less than 680,000 hours of audio with their corresponding text, an absolute monster of a dataset. If you started playing those 680,000 hours right now, you would finish listening in about 77 years (680,000 hours / 24 / 365 ≈ 77 years); you could be sure that at some point you would see Halley's comet fly across the sky. What's more, and this is very interesting, these audios come in multiple languages, which allows training a model that is multilingual, one that can understand us whether we speak to it in Spanish, English, Korean, whatever.

But that's not all. Whisper, as well as being a multilingual system, is also a multitask system. This is a trend that, as we saw in the video on Gato, is becoming more and more common in the world of deep learning: not training an artificial intelligence for a single task, but training it for several different ones, making its learning much more solid and robust. As we have seen, Whisper can take audio in English and transcribe it into English, or audio in Korean and transcribe it into Korean. But the same model can also identify which language is being spoken, or act as a voice activity detector and flag when a piece of audio contains no speech at all. Or, the most interesting task of all: you can speak to Whisper in any language and it will automatically translate it into English text for you. And in this case I can't quite tell you why, but I find this functionality fascinating. It doesn't seem to offer anything new either, does it? After all, you could take the text generated by any transcriber in your language and pass it through a translator. But I find it fascinating that something as simple as a single deep learning model lets you speak to it in any language and it generates the text in English, without having to chain any tools together. It's super simple.

And the data we mentioned before is also super interesting. My first intuition here was that OpenAI, in its search for a massive dataset of 680,000 hours of audio with text transcriptions to do this supervised learning, had probably turned to one of the biggest sources we can find on the Internet, which is YouTube. After all, you know that all YouTube videos have automatically generated subtitles. Well, no. This is precisely something OpenAI emphasises in its paper: they ran a filtering process to remove from the dataset any text generated by automatic speech recognition systems. Why? Precisely to prevent Whisper from also learning the defects, the bad habits, that those other automatic systems might have.

Having said that, now that we are talking about Whisper and YouTube, there is a theory I would like to tell you about that I find very interesting. It is not confirmed, but it could explain why this tool exists, and it could have a certain relationship with a future GPT-4. This is an idea I heard on Dr Alan Thompson's channel: at some point in the near future, when GPT-4 starts training, Whisper could offer it a huge source of data that previous systems have not had. Think about it: a system like GPT-3 has been trained on a lot of Wikipedia articles, books, forums and Internet conversations, but it has never been able to access all that spoken content sitting in repositories like YouTube.
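By the way, to give you an idea of what that multitask behaviour looks like in practice with the open-source whisper Python package, here is a rough sketch; the file name and model size are just placeholders, not anything shown in the video.

    import whisper

    # Load one multitask model; "base" is just an example size.
    model = whisper.load_model("base")

    # Load ~30 seconds of audio and build the log-Mel spectrogram the model expects.
    audio = whisper.pad_or_trim(whisper.load_audio("example.mp3"))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Task 1: identify which language is being spoken.
    _, probs = model.detect_language(mel)
    print("detected language:", max(probs, key=probs.get))

    # Task 2: transcribe the speech in its original language.
    print(model.transcribe("example.mp3")["text"])

    # Task 3: with the same model and weights, translate the speech into English.
    print(model.transcribe("example.mp3", task="translate")["text"])

Same network, different task; you just tell it what you want. Now, back to the YouTube idea.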
A tool like Whisper could be used to sweep through YouTube completely, transcribe a huge number of its audios and unlock a new source of data that previously could not have been used to train a future language model. That is the enormous value of a tool like Whisper, and what I think makes it so interesting. It doesn't solve a spectacular task like generating images or video, but it solves a very useful task and it almost solves it to perfection. I say almost because it is not perfect: sometimes some words come out obviously wrong, it does not cover every language that exists on planet Earth, and, to look for a limitation compared with other commercial tools, it does not work in real time yet. Processing the audio can take a few seconds, sometimes a few minutes, depending on its length. But it is a solid tool, it is mature, it is useful and it is open source, giving anyone access to a professional transcription and translation tool better than any free alternative. What? Ah, you also want access to this tool? Well, come on, I'll prepare an easy tutorial so you can all use it.

We're going to do it on Google Colab, but first, and since we're talking about programming, development and innovation, let me remind you that there are very few days left until the Samsung Dev Day, the technology event held every year by the Samsung Dev Spain community, which is Samsung's official community for Spanish developers. It is a free event that you can't miss. If you are in Madrid you can attend in person on November 16th in the cloister of Los Jerónimos at the Prado Museum, and if not, you can connect online through its stream. But yes, you have to register. I was lucky enough to take part last year with a presentation on code generation with artificial intelligence, and the experience was great. As you can see, it will be an event full of great talks about technology, innovation and applications, and it will also be hosted by Midudev, whom many of you probably know, so you can't miss it. I'm going to leave a link to the Samsung Dev Spain website below in the description box, where you will find all the information about the agenda, where to register and a lot of resources. See you on November 16th.

So let's see how we can use Whisper ourselves in our own code. For this we will use Google Colab; you know Google gives us a free virtual machine here. We will check that the runtime type has GPU hardware acceleration enabled: we select GPU here, hit save, and now the first step is to install Whisper. To do this we're going to use these two commands here, which you can find in the Whisper GitHub repository; I'll also leave them below in the description box. Hit run and let it install.
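The two commands themselves don't appear on screen in this transcript, but installing Whisper in a Colab cell from its GitHub repository typically looks like this; the repository URL is OpenAI's official one, and ffmpeg is needed so Whisper can decode audio files.

    # Run these in a Colab cell: install Whisper from the official GitHub repo,
    # plus ffmpeg, which Whisper uses to read audio files.
    !pip install git+https://github.com/openai/whisper.git
    !sudo apt update && sudo apt install -y ffmpeg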
Once it's installed, we upload some audio that we want to transcribe; in this case I'm going to try Rosalía's song Chicken Teriyaki. We drag it in here, and the next step is to write the command needed to run it. We point it at song.mp3, which is the name of the file we uploaded. The task is going to be transcribe. For the model size there are different options, depending on whether you want more speed at inference time or more accuracy in the results; I generally work with the medium model, which is the one that gives me good results, but there are bigger and smaller models, so try them out. Finally we simply say where to put the output files. We hit run and that's it, we don't have to do anything else: we are already using Whisper. The first time it will take a little while because it has to download the model, but from that moment on you can use this system to transcribe any audio you want. Cool. We can see that in this case it has detected that the language is Spanish; it inferred it automatically because we didn't tell it we were transcribing from Spanish, although you can specify it if you want. Once the cell has finished running we can come over here, see that the audio transcription folder has been generated, and inside we have the different output options. We can open song.txt, and when we open the file we see the whole song perfectly transcribed, which, this being Rosalía, has extra merit. And if instead of the transcription you would like the translation, that is, to convert your voice, your audio, into English, all you have to do is change the task here to translate, and in that case Whisper will translate what it transcribes.

If you noticed, the command we used is the console one, but maybe you want to use Whisper inside your own code; you also have the option of working directly with the Whisper library. It is simply this line of code here: we import it, we load the model we want (here I would load the medium model, which, as I said, is the one that works best in my case), and with the model loaded we call model.transcribe, we pass it song.mp3, we hit run, and in a matter of seconds we have our transcription again. And here we have it, the Rosalía: "rose without card, I send it to your cat, I have it for you with roulette, a serenade wasn't necessary", okay.

Also, to make your life easier, I have prepared a notebook that you can use; it's below in the description box, with all the code ready to start working. You just have to open it, check that the GPU is activated, click this button here to install everything you need, choose the task you want to do, whether it is transcribing any language or translating into English, and click run. In this case the cell is set up so that the moment you run it, it starts recording from your microphone; that is, right now we would be generating an audio file that we will then transcribe with Whisper. This is useful if you want a transcription of any class or anything else you need, on the spot. We hit stop, we click this button, and in a moment we have the result of what we just said.
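For reference, here is a sketch of the two ways of calling Whisper described above, the console command and the library. The file name, model size and output folder are just the examples from the video; swap in your own.

    # Console command inside Colab: transcribe the uploaded file with the medium model.
    !whisper song.mp3 --model medium --task transcribe --output_dir audio_transcription

    # Same command, but translating the speech into English instead.
    !whisper song.mp3 --model medium --task translate --output_dir audio_transcription

    # Or use the Python library directly inside your own code.
    import whisper

    model = whisper.load_model("medium")   # bigger models: slower but more accurate
    result = model.transcribe("song.mp3")  # the language is auto-detected if you don't set it
    print(result["language"])              # e.g. "es"
    print(result["text"])                  # the full transcription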
Likewise, below that I've added the two commands you need to transcribe or translate the audio you upload. Finally, you should also know that if you want something even simpler, there are web pages where you can try this system by uploading your own audio or recording from the microphone.

And that's it. 2022 is shaping up to be a spectacular year in terms of the number of neural toys coming into our hands to build a lot of tools and to play with. Now it's your turn: what can you do with all this? Well, you can build a lot of super interesting things. You can connect Whisper with Stable Diffusion, for example, so you can ask for a picture with your voice. Or you can take all your university classes or all your work meetings, transcribe them, create a huge bank of transcripts, and then with the GPT-3 API build a chatbot that lets you ask questions and get answers over that whole source of information. For example, something I want to do is take all the videos from my YouTube channel and transcribe them, generate good-quality subtitles in both Spanish and English, and be able to run statistics and queries, like how many times I have said the words Machine Learning. There are a lot of applications you can start to build, that you can start to create by combining all these technologies. I had a dog barking in the background that was bothering me a lot. Well, as I was saying, you can create a lot of things and there is a lot to do.

From this channel we are going to keep experimenting with this technology, and I will keep bringing you new tools, so if you haven't subscribed yet, click the little bell so you always get notified of new videos, and if you want to support all this content, you know you can do so through Patreon, below in the description box. You have a couple of videos here that are super interesting; I don't know which ones they are, but they are super interesting, so take a look at them, and see you with more artificial intelligence, guys and girls, in the next video.
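As a closing aside, that transcript-statistics idea mentioned above could look roughly like this in code; it is only a sketch, and the folder name and the phrase being counted are placeholders, not anything from the video.

    import glob
    import whisper

    # Rough sketch: transcribe every mp3 in a folder and count how often a phrase appears.
    # "my_videos/" and the phrase below are placeholders.
    model = whisper.load_model("medium")
    phrase, count = "machine learning", 0

    for path in sorted(glob.glob("my_videos/*.mp3")):
        text = model.transcribe(path)["text"]
        count += text.lower().count(phrase)
        print(path, "->", len(text), "characters transcribed")

    print(f'Said "{phrase}" {count} times across all audios.')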