2022 will be remembered as the year of Stable Diffusion, of DALL-E 2, of incredible text generators such as PaLM and code generators such as AlphaCode. And yet, chatting last month with Andrés Torrubia, he told me that the most interesting thing he'd seen this year was an AI coming out of the OpenAI lab, an AI called Whisper. "What do you think is the most impressive thing that has come out this year?" "Well, oddly enough, so far, Whisper, I think. Do you know why? It's curious, eh. What impresses me about Whisper is that it works. For me, if Whisper were the autonomous car, it would be the first self-driving car of dictation; that is, it's the first one that comes close to a person."

Well, so that you first understand what this Whisper thing is, I'm going to ask you to do the following exercise. I'm going to play you an audio clip in English and your task is to transcribe every word you hear. Are you ready? Three, two, one. Did you understand anything? Yeah, me neither. Well, to the ears of this artificial intelligence, this is the perfect transcription it produced. How is your Korean? Not a problem for Whisper either, and it can also transcribe that audio into perfect English. And, well, it understands me too. What you're seeing on screen right now is the speech-to-text that Whisper produces when I feed it the audio track you're listening to. Look closely: not only does it get a near-perfect transcription, understanding even words such as Whisper or speech-to-text, but it is also capable of generating full stops, commas and other punctuation marks that many commercial speech recognition models tend to choke on.

And this is very interesting. Well, not this, Whisper. Whisper in general has many interesting things about it, and the first is the context in which this tool appears. After a year of amazing achievements by the OpenAI artificial intelligence lab, suddenly, out of the blue, a collaborative initiative like Stability.ai shows up and in September takes up the banner of open-sourcing many of the technologies that OpenAI, for its part, had decided to keep to itself and share only as paid services. That's not a problem for me, because in the end OpenAI as a company has to pay its bills, and at least it's giving us a way to access these powerful artificial intelligences. Take note, Google. But of course, a new kid arrives in town and starts handing out sweets to the children, and all of a sudden the popular kid starts to be displaced. And at that precise moment OpenAI comes out of nowhere and gives us Whisper, for the benefit of us all. Because yes, my friends, this is open source, and I know you love to hear those words. At the end of the video I'm going to show you a mini tutorial so you can see how easy it is to use this tool, and I'm also going to share a notebook to make it super easy for you.

And this is part of what makes Whisper a super interesting tool, but it is not the only thing. One of the things that caught my attention is that Whisper is not some complex system designed to process audio in a way that has never been done before, nor a super-complex pipeline with lots of processing modules. No, Whisper is this right here: the Transformer neural network from 2017, with no changes, nothing new. It is an architecture we are all familiar with.
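For those reading along, here is a minimal, purely illustrative PyTorch sketch of that idea (toy sizes, invented class name, not OpenAI's actual code): an off-the-shelf encoder-decoder Transformer that takes log-Mel spectrogram frames in and predicts text tokens, which is conceptually all that Whisper's architecture is.

    import torch
    import torch.nn as nn

    class TinySpeechToText(nn.Module):
        """Toy encoder-decoder Transformer: audio frames in, text-token logits out."""
        def __init__(self, n_mels=80, d_model=256, vocab_size=8000):  # toy vocab; the real model uses ~51k tokens
            super().__init__()
            self.encode_audio = nn.Linear(n_mels, d_model)      # stand-in for Whisper's small convolutional stem
            self.embed_tokens = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
            self.to_logits = nn.Linear(d_model, vocab_size)

        def forward(self, mel, tokens):
            # mel: (batch, frames, n_mels); tokens: (batch, seq) of previously emitted text tokens
            audio_states = self.encode_audio(mel)
            text_states = self.embed_tokens(tokens)
            hidden = self.transformer(audio_states, text_states)
            return self.to_logits(hidden)                        # next-token logits at every position

    model = TinySpeechToText()
    logits = model(torch.randn(1, 300, 80), torch.randint(0, 8000, (1, 10)))
    print(logits.shape)  # torch.Size([1, 10, 8000])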
So, if that's the case, why didn't a technology like Whisper already exist? Well, the key to what makes Whisper so powerful lies in the data and in how its training has been structured. To train it, OpenAI used no less than 680,000 hours of audio with their corresponding text. An insane amount. If you started playing those 680,000 hours right now, you wouldn't finish listening until some 77 years from now; you could be sure that at some point you'd see Halley's Comet streak across the sky. What's more, a very interesting thing is that these recordings come in multiple languages, which makes it possible to train a model that is multilingual, one that can understand us whether we speak to it in Spanish, English or Korean; it doesn't matter.

But it doesn't stop there. Whisper is not only a multilingual system, but also a multitask system. This is a trend that, as we saw in the video on Gato, is becoming more and more frequent in the world of deep learning: don't train an artificial intelligence for a single task, train it for several different tasks, making what it learns much more solid and robust. As we have seen, Whisper can take audio in English and transcribe it into English, or audio in Korean and transcribe it into Korean. But the same model can also identify which language is being spoken, or act as a voice activity detector to classify when a piece of audio is not a person speaking. Or, the task I find most interesting of all: you can talk to Whisper in any language and have it automatically transcribe it into English for you. I can't tell you exactly why, but this seems to me one of its most fascinating functions. It doesn't seem to offer anything new either, does it? After all, you could take the text generated by any transcriber in your language and run it through a translator. But I find it fascinating that something as simple as a single deep learning model lets you speak to it in any language and generates the text in English for you, without having to combine any tools. It's super simple.
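For those following along with code, here is a tiny sketch of that multitask idea using Whisper's open-source Python package (the file name my_audio.mp3 is just a placeholder): the very same loaded model identifies the language, transcribes in the original language, and translates into English.

    import whisper

    # Load one of the released checkpoints ("tiny" up to "large"); "base" keeps the demo light.
    model = whisper.load_model("base")

    # Task 1: language identification on the first 30 seconds of audio.
    audio = whisper.pad_or_trim(whisper.load_audio("my_audio.mp3"))   # placeholder file name
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    print("Detected language:", max(probs, key=probs.get))

    # Task 2: transcription in the original language.
    print(model.transcribe("my_audio.mp3", task="transcribe")["text"])

    # Task 3: "any language -> English" translation with the very same model.
    print(model.transcribe("my_audio.mp3", task="translate")["text"])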
And the data we discussed earlier is also very interesting. My first intuition here was that OpenAI, in its search for a massive dataset of 680,000 hours of audio paired with text transcriptions for this supervised learning, had probably gone to one of the largest sources we can find on the Internet: YouTube. After all, you know that every YouTube video gets automatically generated subtitles. Well, no. This is precisely something OpenAI puts a lot of emphasis on in its paper: they explain that they applied a filtering process to remove from the dataset any text generated by automatic speech recognition systems. Why? Precisely to prevent Whisper from learning the defects, the bad habits, that those other automatic systems may have.

That said, now that we're talking about Whisper and YouTube, there is a theory I want to tell you about that I find very interesting. It's not confirmed, but it could explain the reason this tool exists and how it might relate to a future GPT-4. It's an idea I heard on Dr. Alan Thompson's channel: in the near future, when GPT-4 begins training, Whisper could offer the system a huge source of data that previous systems had not been able to count on. Think about it: a system like GPT-3 was trained on lots of Wikipedia articles, books, forums and Internet conversations, but it has never been able to access all the spoken material sitting in sources such as YouTube. A tool like Whisper could be used to sweep through YouTube, transcribe a large share of its audio and unlock a new source of data that it previously wouldn't have been possible to use to train a future language model.

This is the enormous value of a tool like Whisper, and what I think makes this technology so interesting. No, it doesn't solve a spectacular task like generating images or video, but it does solve a very useful task, and it solves it almost to perfection. I say almost: it's not perfect, sometimes it gets some words obviously wrong, it doesn't cover every language that exists on planet Earth, and, if we look for a limitation compared to other commercial tools, it doesn't work in real time yet; depending on the length of the audio it can take a few seconds or more to process. But it's a solid, mature, useful tool, and it's open source, which now gives anyone access to a professional transcription and translation tool better than any free alternative.

What? Oh, you'd also like access to this tool. Well, come on, I've prepared an easy tutorial so all of you can use it. We're going to do it in Google Colab, but first, taking advantage of the fact that we're talking about programming, development and innovation, let me remind you that there are only a few days left until Samsung Dev Day, the technology event held every year by the Samsung Dev Spain community, Samsung's official community for Spanish developers. It's a free event not to be missed. If you're in Madrid you can attend in person on 16 November at the Jerónimos cloister of the Museo del Prado, and if not, you can connect online via its stream. But you do have to register. I was lucky enough to take part last year with a presentation of my own on code generation with artificial intelligence, and the experience was great. So, as you can see, it's going to be an event full of great talks about technology, innovation and applications, and it will also be hosted by Midudev, whom many of you will surely know, so you can't miss it. I'll leave a link to the Samsung Dev Spain website below in the description box, where you'll find all the information about the agenda, where to register and a lot of other resources. See you on 16 November.

Now, let's see how we can use Whisper in our own code. For this we're going to use Google Colab; you know Google gives us a free virtual machine here that we can use, and we'll check that the GPU hardware-accelerated runtime type is enabled: OK, we select GPU here, hit save, and now the first step is to install Whisper. To do this we're going to use these two commands here; you can find them in Whisper's own GitHub repository and I'll also leave them below in the description box. Hit run and let it install.
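For reference, the two install commands the video is pointing at are essentially the ones from Whisper's GitHub README at the time; in a Colab cell they would look roughly like this (double-check the repository in case they have changed):

    # Install Whisper straight from its GitHub repository, plus the ffmpeg dependency (Colab cell).
    !pip install git+https://github.com/openai/whisper.git
    !sudo apt update && sudo apt install -y ffmpeg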
Once it's installed, we're going to upload the audio we want to transcribe. In this case I'm going to try Rosalía's song Chicken Teriyaki: we put it here, drag it in, and the next step is to write the command needed to run it. We put song.mp3 here, which is the name of the file we've uploaded; OK, song.mp3. The task is going to be transcribe; then the size of the model. There are different sizes depending on whether you want more speed at inference time or more precision in the results; I usually work with the medium model, which is the one that gives me good results, but there are bigger models and smaller ones, so try them. And finally, simply where we want to place the output files. We hit run and that's it, nothing more to do. OK, we're already using Whisper. The first time it will take a little while because it has to download the model, but from this moment on you can use this system to transcribe any audio you want. Cool.

OK, we can see that in this case it has detected that the language is Spanish; it inferred it automatically because we haven't told it we're transcribing from Spanish, although you can specify it if you want. Once this cell has run, we can come over here, see that the audio transcription folder has been generated, and here we have the different output formats. We can open the .txt file, and when we open it we can see the whole song perfectly transcribed, which in this case, it being Rosalía, is all the more impressive. And if instead of a transcription you'd like a translation, that is, to convert your voice, your audio, into English, all you have to do is change the task here to translate, and in that case Whisper will translate what it transcribes.

If you notice, the command we've used so far is the console command, but you may want to use Whisper inside your own code. In that case you also have the option of working with Whisper's own library: it's simply this line of code here. We import it, load the model we want (here I would load the medium model, which, as I said, is the one that works best in my case), and with the model loaded we call model.transcribe, pass it song.mp3, hit run, and in a matter of seconds we have our transcript back. And there it is, Rosalía's lyrics coming back transcribed, OK.

But to make your life even easier, I've prepared a notebook you can use; it's below in the description box, and it has all the code ready for you to start working. Just open it, check that the GPU is activated, click this button here to install everything necessary, choose here the task you want to do, whether that's transcribing in any language or translating into English, and click run. In this case the cell is set up so that the moment you run it, your microphone starts recording; that is, right now we'd be generating an audio file that we'll then transcribe with Whisper. This is in case you want to transcribe a class or anything else you need live. We hit stop, we hit this button, and in a moment we have the result of what we've said. Below you will find the two commands needed to transcribe or translate the audio you upload.
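To sum up what we've just done, this is roughly what the console command and the library version look like (song.mp3 and the transcription folder are just the names used in the video; the flags are from Whisper's command-line interface):

    # Console version: transcribe song.mp3 with the medium model and write the output
    # files (a .txt plus subtitle formats) into a folder called "transcription".
    !whisper song.mp3 --model medium --task transcribe --output_dir transcription

    # Same thing, but translating into English instead of transcribing:
    !whisper song.mp3 --model medium --task translate --output_dir transcription

    # Library version, from Python code:
    import whisper

    model = whisper.load_model("medium")       # the size I usually use
    result = model.transcribe("song.mp3")      # add task="translate" for English output
    print(result["language"])                  # the auto-detected language
    print(result["text"])                      # the full transcript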
Finally, you should also know that if you want something simpler, there are websites where you can try out this system by uploading your own audio or recording from the microphone. And that's it. 2022 is shaping up to be a spectacular year in terms of the number of neural toys that are landing in our hands for us to play with and build on.

Now it's your turn: what can you do with this? Well, you can build a lot of super interesting things. You can, for example, connect Whisper with Stable Diffusion so you can ask it out loud to generate a picture, or you can take all your university classes or all your work meetings, transcribe them, build a huge bank of transcripts and then, with the GPT-3 API, make a chatbot that lets you query them, asking questions and getting answers over all those sources of information. For example, something I want to do is take all the videos from my YouTube channel, transcribe them, generate good-quality subtitles in both Spanish and English, and be able to run statistics and queries on how many times I've said, for example, the words Machine Learning (I'll leave a small sketch of that idea at the very end). There are a lot of applications you can start to build, to create, by combining all these technologies. I had a dog barking in the background that was bothering me a lot. Well, as I was saying, you can create a lot of things and there is a lot to do.

From here, from this channel, we'll keep experimenting with this technology and I'll keep bringing you new tools, so if you haven't done so yet, please subscribe, click the little bell to always receive notifications of new videos, and if you want to support all this content, you know you can do so through Patreon, below in the description box. You've got a couple of videos around here that are super interesting, I don't know which ones they are but they're super interesting, keep an eye on them, and we'll see you with more artificial intelligence, guys and girls, in the next video.
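And for anyone who wants to try that last idea, here is a minimal sketch (the transcripts folder and the one-.txt-per-video layout are my own assumptions, not anything shown in the video): it counts how many times a phrase appears across a folder of Whisper-generated .txt transcripts.

    from pathlib import Path

    def count_phrase(transcripts_dir: str, phrase: str) -> int:
        """Count occurrences of a phrase across all .txt transcripts in a folder."""
        total = 0
        for txt_file in Path(transcripts_dir).glob("*.txt"):
            total += txt_file.read_text(encoding="utf-8").lower().count(phrase.lower())
        return total

    # Hypothetical usage: transcripts/ holds one .txt per video, produced by Whisper.
    print(count_phrase("transcripts", "machine learning"))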