2022 will be remembered as the year of Stable Diffusion, of DALL·E 2, of incredible text generation models like PaLM or code generators like AlphaCode. And yet, talking last month with Andrés Torrubia, he told me that the most interesting thing he had seen this year was an artificial intelligence that came out of the OpenAI laboratory, an AI called Whisper. "What is the most impressive thing that has come out this year for you?" "Well, curiously, so far it's Whisper, I think. Do you know why? Curious, huh? What impresses me about Whisper is that Whisper just works. For me, if Whisper were a self-driving car, it would be the first one that truly drives itself; it's the first one that performs like a person." Well, so that you first understand what this Whisper thing is, I'm going to ask you to do the following exercise. I'm going to play an audio in English and your task is to transcribe each of the words you hear. Are you ready? 3, 2, 1. Did you understand anything? Yeah, me neither. Well, when this artificial intelligence listens to it, this is the near-perfect transcription it gets. And how about your Korean? Well, for Whisper that's not a problem either, and it can also transcribe this audio into perfect English. And it understands me too: what you are seeing on screen right now is the speech-to-text that Whisper produces when you feed it the audio track you are listening to. Look closely: not only does it achieve an almost perfect transcription, even getting specific words like "Whisper" or "speech-to-text" right, it is also able to generate periods, commas and other punctuation marks that many commercial speech recognition models usually omit. And this is very interesting. Well, not this; Whisper. Whisper in general has many interesting things going for it. And the first interesting thing is the context in which this tool appears. After a year of incredible achievements by the OpenAI artificial intelligence laboratory, suddenly a collaborative initiative like Stability.ai emerges, which in September takes the lead by open-sourcing many of the technologies that OpenAI, for its part, has decided to keep to itself and share only through paid services. For me this is not a problem either, since in the end OpenAI as a company has to pay its bills, and at least it is giving us a way to access these powerful artificial intelligences. Take note, Google. But of course, a new kid arrives in town and starts handing out candy to the other children, and suddenly the popular kid starts to look displaced. And at that precise moment, OpenAI comes out of nowhere and gives us Whisper. For everyone's benefit. Because yes, friends, this is open source. I know you love to hear those words. At the end of the video I will show you a mini tutorial so you can see how easy it is to use this tool, and I will also share a notebook so that it is super simple for you. And this is part of what makes Whisper such an interesting tool. But it's not the only thing. And this is where one of the things that has caught my attention the most comes in: Whisper is not some complex system designed to process audio as it had never been processed before, nor a super complex pipeline with lots of processing modules. No. Whisper is this right here: a transformer-type neural network, the architecture from 2017. No changes, no novelties. It is an architecture we all know. So if that's the case, why didn't a technology like Whisper already exist? Well, the key to making Whisper so powerful lies in the data and in how they have structured its training.
To train it, OpenAI has used no more and no less than 680,000 hours of audio with their corresponding text. A staggering amount. And if you do the math, if you started playing those 680,000 hours now, you would finish listening in 77 years; you could be sure that somewhere along the way you would get to see Halley's Comet. Another very interesting thing is that these audios come in multiple languages, which makes it possible to train a model that is multilingual, one that can understand us whether we speak to it in Spanish, English, Korean... It doesn't matter. But it doesn't stop there. Whisper, in addition to being a multilingual system, is also a multitask system. This is a trend that, as we saw in the video about Gato, is increasingly common in the world of deep learning: not training an AI for a single task, but for several different ones, thus making its learning much more solid and robust. As we have seen, Whisper can take English audio and transcribe it into English, or Korean audio and transcribe it into Korean. But the same model can also identify which language is being spoken, or act as a voice activity detector, classifying whether a stretch of audio contains a person speaking or not. Or, the most interesting task of all, I think: you can speak to Whisper in any language and it automatically translates it into English. I couldn't tell you exactly why, but this strikes me as a fascinating piece of functionality. It seems like it doesn't offer us anything new, right? After all, you could take the text that any transcriber generates in your language and pass it through a translator. But it seems fascinating to me that something as simple as a single deep learning model lets you speak in any language and generates the text in English without having to chain together any other tools. It's super simple. And the data we mentioned before is also super interesting. Because my first intuition here was that OpenAI, in its search for a massive dataset of 680,000 hours of audio paired with text transcriptions for this supervised learning, had probably gone to one of the largest sources we can find on the internet, which is YouTube. After all, you already know that all YouTube videos have automatically generated subtitles. Well, no. OpenAI emphasizes repeatedly in its paper that they ran a filtering process to remove from the dataset any text generated by automatic speech recognition systems. Why? Precisely to prevent Whisper from learning the defects, the vices, that other automatic systems might have. That said, now that we are talking about Whisper and YouTube, there is a theory I want to tell you about that I find very interesting. It is nothing confirmed, but it could explain the reason this tool exists, and it could have a certain relationship with a future GPT-4. This is an idea I heard on Dr. Alan Thompson's channel, and it says that in the near future, when GPT-4 starts training, Whisper could offer the system a huge source of data that previous systems did not count on. Think about it: a system like GPT-3 has been trained on a lot of Wikipedia articles, books, forums and internet conversations, but it has never been able to access all that spoken material sitting on platforms such as YouTube.
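As an aside, the multitasking described a moment ago maps directly onto the open-source whisper package: one loaded model detects the language, transcribes, or translates into English. A minimal sketch, assuming a hypothetical audio.mp3 file and the medium model size:

```python
import whisper

# One multitask model; "medium" is an arbitrary size choice for this sketch
model = whisper.load_model("medium")

# Load the audio and fit it to the 30-second window the model expects
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # hypothetical file

# The model consumes a log-Mel spectrogram of the audio
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Task 1: identify which language is being spoken
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Task 2: decode the same spectrogram, but asking for an English translation
result = whisper.decode(model, mel, whisper.DecodingOptions(task="translate"))
print(result.text)
```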
A tool like Whisper could be used to comb through YouTube, transcribe a huge number of its audios and unlock a new source of data that it would not have been possible to use before to train a future language model. This is the enormous value a tool like Whisper has, and what I think makes this technology so interesting. No, it does not solve a spectacular task like generating images or videos, but it solves a very useful task and it solves it almost to perfection. Careful, I say almost: it is not perfect, sometimes it gets some words obviously wrong, it does not cover all the languages that exist on planet Earth and, if we go looking for limitations compared with other commercial tools, it does not work in real time either. Processing an audio file, depending on its length, can take a few seconds, sometimes a minute. But it is a solid tool, it is mature, it is useful. And it is also open source, which means anyone can now access a professional transcription and translation tool better than any free alternative. What? Oh, you also want access to this tool? Well, come on, I'll prepare an easy tutorial so you can all use it; we will do it in Google Colab. But first, and taking advantage of the fact that we are talking about programming, development and innovation, let me remind you that there are very few days left until Samsung Dev Day, the technology event held every year by the Samsung Dev Spain community, the official Samsung community for Spanish developers. This will be a free event that you cannot miss. If you are in Madrid, you can attend in person on November 16 at the Claustro de los Jerónimos of the Museo del Prado, and if not, you can connect online through its streaming. But yes, you have to register. Last year I was lucky enough to take part with a talk on code generation with artificial intelligence, and the experience was great. So, as you can see, it will be an event full of great talks about technology, innovation and applications, and it will also be hosted by Dudef, whom many of you surely know, so you can't miss it. I'll leave a link in the description box to the Samsung Dev Spain website, where you will find all the information about the agenda, where to register and plenty of other resources. See you on November 16. Well, let's see how we can use Whisper in our own code. For this, we are going to use Google Colab. You already know that Google gives us a free virtual machine we can use, and we are going to make sure we have activated the runtime type with GPU hardware acceleration. Okay, let's click here, select GPU, and save. Now the first step is to install Whisper. For this, we are going to use these two commands here; you can find them in the Whisper GitHub repository itself, which I'll leave linked in the description box. We run these commands and let everything install. Once it is installed, we upload some audio that we want to transcribe. In this case, I'm going to try it with Rosalía's song "Chicken Teriyaki". We drag it in here. And now the next step: we write the command we need to run it. We point it at song.mp3, which is the name of the file we've uploaded. Okay, song.mp3. The task is going to be transcribe, and then the model size.
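For reference, those Colab cells end up looking roughly like the sketch below. The install lines follow the openai/whisper GitHub README; song.mp3, the medium model and the output folder name are just the choices from this example, so treat it as an illustrative sketch rather than a literal copy of the video's notebook.

```python
# Cell 1: install Whisper and ffmpeg (per the openai/whisper README)
!pip install -q git+https://github.com/openai/whisper.git
!sudo apt update -qq && sudo apt install -y -qq ffmpeg

# Cell 2: transcribe the uploaded file from the command line;
# --model picks the size and --output_dir is where the text files land
!whisper song.mp3 --task transcribe --model medium --output_dir transcription
```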
There are different sizes, depending on whether you want more speed at inference time or more precision in the results. I usually work with the medium model, which is the one that gives me good results; there are larger models and there are smaller models, so try them. And in this case, we just set the output folder. We run it and that's it. That's it, there's nothing else to do: we're using Whisper. The first time it will take a while because it has to download the model, but from this moment on you can use this system to transcribe any audio you want. Cool. Okay, we see that in this case it has detected that the language is Spanish. It has inferred that automatically, because we have not told it that we are going to transcribe Spanish; you can specify it if you want. And once this cell has finished executing, we can come over here. We see that the transcription folder has been generated, and here we have the different output formats. We can open song.txt, and here we open the file. We see that we have the whole song perfectly transcribed, which in this case, being Rosalía, has extra merit. If instead of the transcription you would like to do the translation, that is, convert your voice, your audio, into English, then all you have to do is change the task here to translate, and in that case Whisper will work to translate what it has transcribed. If you notice, the command we have used so far is the console one, but maybe you want to use Whisper inside your own code. Then you also have the option of working with the Whisper library itself. It's just these few lines of code here: we import it, we load the model we want. Here I would load the medium model, which, as I say, is the one that works best in my case. And with the model loaded, we call model.transcribe, we pass it song.mp3, we hit run and, in a matter of seconds, we will have our transcription again. And here we have it: "The pink rose without a card, I send it to your cat, I have it with a roulette, no need to serenade." Well, shocking. Also, to make your life easier, I have prepared a notebook that you can use. It's down in the description box, and it already has all the code ready for you to start working. You just have to open it, check that the GPU is activated, and hit this button here to install everything necessary. Here we choose the task we want to perform, whether it is transcribing in any language or translating into English, and we hit run. In this case, the cell is set up so that the moment you start running it, it begins recording from your microphone. That is, right now we would be generating an audio file that we will later transcribe with Whisper. This is in case you want to transcribe on the spot a class or anything else you need. We stop it, we hit this button, and in a moment we have the result of what we have said. Likewise, below that I add the two commands needed to transcribe or translate any audio that you upload. Finally, you should also know that if you want something simpler, there is a website where you can try this system by uploading your own audios or recording from the microphone. And this would be it. 2022 is really shaping up to be a spectacular year in terms of the number of neural toys landing in our hands to build all kinds of tools and to play with. Now it's your turn. What can you do with this? Well, you can build a lot of super interesting things.
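And the library route just described fits in a few lines. A minimal sketch, assuming the same song.mp3 file and the medium model; the result dictionary with its "text" and "language" keys is what the open-source whisper package returns.

```python
import whisper

# Load the model size that gives good results in the video's tests
model = whisper.load_model("medium")

# Transcribe in the original language (auto-detected if not specified)
result = model.transcribe("song.mp3")
print(result["language"], result["text"])

# Or translate the same audio into English by switching the task
translated = model.transcribe("song.mp3", task="translate")
print(translated["text"])
```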
You can connect Whisper with Stable Diffusion, for example, so that you can ask it by voice to generate a picture. Or you could take all your university classes or all your work meetings, transcribe them, build a huge bank of transcriptions and then, with the GPT-3 API, make a chatbot that lets you query it, asking questions and getting answers about all that information. For example, something I want to do is take all the videos from my YouTube channel and transcribe them, generate good-quality subtitles in both Spanish and English, and be able to run statistics and queries on how many times I have said, for example, the words "machine learning". I'll leave a small sketch of that idea at the end. There are a lot of applications you can start building, that you can start creating by combining all these technologies. There is also a dog barking in the background that has been bothering me quite a lot. Well, as I was saying, you can create a lot of things and there is a lot to do. From here, from this channel, we will keep experimenting with this technology and I will keep bringing new tools, so if you haven't done it yet, subscribe, hit the bell so you always get notified when there is a new video and, if you want to support all this content, you already know you can do it through Patreon, down in the description box. You have a couple of videos around here that are super interesting, I don't know which ones they are, but they are super interesting, take a look at them, and see you with more artificial intelligence, guys and girls, in the next video.
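And here is that small sketch of the channel-statistics idea: a hypothetical script, not the channel's actual pipeline, that transcribes a folder of previously downloaded audio tracks with the open-source whisper package and counts how often a phrase appears. The videos/ folder, the .mp3 files and the phrase itself are assumptions for illustration.

```python
import glob
import whisper

# Load the model once; "medium" mirrors the size used in the tutorial above
model = whisper.load_model("medium")

phrase = "machine learning"  # hypothetical phrase to count
total = 0

# Assumes the channel's audio tracks were already downloaded into ./videos/
for path in sorted(glob.glob("videos/*.mp3")):
    text = model.transcribe(path)["text"].lower()
    hits = text.count(phrase)
    total += hits
    print(f"{path}: {hits} occurrence(s)")

print("Total times the phrase was said:", total)
```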