What is the RLHF Method?
There are three main steps in this method: first, the supervised fine-tuning model; second, the reward model; and third, the reinforcement learning model. Let’s understand these three steps one by one.
Three Main Steps of RLHF
I had told you about GPT-3 and how roughly 570 GB of data was fed in to train it, including Wikipedia pages and web pages. Now, OpenAI knew that this dataset was already very large.
If we want to improve the model, we don’t need to feed it more data. So how can we improve its responses using the existing data? We need to fine-tune the model. With 570 GB of text, there is already so much variety in the data that there is hardly any topic that has not been mentioned somewhere.
01. Supervised Fine-Tuning Model
So OpenAI fine-tuned the GPT-3 model, and it was this fine-tuning that produced GPT-3.5. To do the fine-tuning, they hired about 40 contractors whose job was to create a supervised training dataset, that is, to make the inputs and outputs higher quality. They would look at a prompt and the answer the model gave, and ask: how can this answer be made better?
Their job was to look at each answer individually and then write a better one by hand: the language should not be like this, it should be like this. So for a large number of inputs, the desired outputs were written manually and paired together. This whole process was very time-consuming and slow.
And it was very expensive too. Approximately 13,000 input-output pairs were created and then fed back into the model, as if to say: look, this is how you should respond when asked such questions. The model recognizes the pattern in this new data and keeps to the same style of language.
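To make this concrete, here is a minimal sketch of what the supervised fine-tuning step could look like in code. It is only an illustration under stated assumptions: the public gpt2 checkpoint stands in for the real base model (which is not available), and the two hand-written prompt-answer pairs stand in for the roughly 13,000 created by the contractors.

```python
# Minimal supervised fine-tuning (SFT) sketch; model name and data are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the real base model, which is not public
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hand-written (prompt, ideal answer) pairs, like the ~13,000 made by contractors.
sft_pairs = [
    ("How do I boil an egg?", "Place the egg in boiling water for 8-10 minutes."),
    ("Explain gravity simply.", "Gravity is the pull that draws objects toward each other."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, answer in sft_pairs:
    # Train the model to reproduce the human-written answer token by token.
    text = prompt + "\n" + answer + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point is that the model is simply trained to reproduce the human-written answers word by word, so it picks up their style and format.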
With the help of this, friends, you get a human-like chatting interface, because so many people worked so hard on it. Okay, that was just step 1. Now let’s discuss step 2.
02. Reward Model
After this comes the reward model. To make the answers given by ChatGPT even better, its programmers built a reward system. The same question was asked to the model again and again.
Each time it gave a different answer, so the programmers ended up with multiple answers to the same question, usually between 4 and 9 of them. Then the people they had hired were put to work again and told: these answers are all different; rank them.
It was each person’s job to sit and rank the outputs: ChatGPT has given, say, five different answers to this question, and now I have to say which one is the best, which is second best, which is third best, and which is the worst. So this ranking, too, was done manually by people.
Based on this ranking, every answer was given a score, and this score was called a reward. In this way, a new ranked dataset was created: every question comes with multiple answers, and it is known which of those answers is the best. On this dataset, a new model was trained: the reward model.
Now the computer was told: look at the score on each answer and prefer the answer whose score, that is, whose reward, is the highest. By doing this, the quality of the answers given by ChatGPT improved even more.
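Here is a small, hedged sketch of how such a reward model can be trained from rankings. A common formulation turns each ranked list into pairs of a better and a worse answer and trains the model to give the better one a higher score; the TinyRewardModel class and the random embeddings below are purely illustrative.

```python
# Sketch of reward-model training from human rankings (pairwise formulation).
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a (pre-embedded) answer to a single scalar reward."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, answer_embedding):
        return self.score(answer_embedding).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Each ranked set of answers becomes pairs: (better answer, worse answer).
# Random vectors stand in for real answer embeddings here.
better = torch.randn(8, 128)
worse = torch.randn(8, 128)

# Pairwise loss: push the reward of the better answer above the worse one.
r_better = reward_model(better)
r_worse = reward_model(worse)
loss = -torch.nn.functional.logsigmoid(r_better - r_worse).mean()
loss.backward()
optimizer.step()
```

In practice the reward model is itself a large language model that reads the full question and answer, but the pairwise comparison idea is the same.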
03. Process of Reinforcement Learning
Then came the third step: the reinforcement learning model. In this step, the computer was taught to reward itself.
The idea is: score the answers you are generating with a reward, based on the pattern we have shown you, and then generate your next answer so that it earns the highest reward. A specific algorithm is used here, Proximal Policy Optimization, also known as PPO. The equations involved are quite complicated, but the principle, explained simply, is this (a rough code sketch follows at the end of this step).
Basically, this whole process is very similar to the interaction between a real-life student and a teacher. In the first step, the teacher explains the lesson to the student. In the second step, the teacher gives the student a test and grades it: this answer is right, this answer is wrong.
In the third step, the student becomes capable enough to grade his own test and improve on his own. So by using human guidance, we slowly taught this computer what is right, what is wrong, and how to reply like a human while chatting. This, friends, is where the name comes from:
Reinforcement Learning from Human Feedback. And this is also the reason, friends, that it is not so easy for other companies to beat ChatGPT: they would have to do all this hard work too. People would have to sit and look at the outputs one by one, rank them, and grade them. It is a slow, gradual, and time-consuming process.
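As promised above, here is a rough code sketch of the PPO idea from step 3. It is heavily simplified: real RLHF training also uses a value network, per-token advantages, and a penalty that keeps the model close to the supervised fine-tuned version; the numbers below are toy values.

```python
# Greatly simplified PPO-style update: the policy is nudged toward answers
# that the reward model scored highly, but only by a bounded (clipped) step.
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective for one batch of generated answers."""
    ratio = torch.exp(new_logprobs - old_logprobs)            # how much the policy changed
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize reward => minimize negative

# Toy example: reward-model scores, centered, stand in for real advantages.
old_logprobs = torch.tensor([-2.0, -1.5, -3.0])
new_logprobs = torch.tensor([-1.8, -1.6, -2.5], requires_grad=True)
rewards = torch.tensor([0.9, 0.1, 0.5])
advantages = rewards - rewards.mean()

loss = ppo_clipped_loss(new_logprobs, old_logprobs, advantages)
loss.backward()  # gradients push the policy toward higher-reward answers
```

The clipping is what the "Proximal" in PPO refers to: each update is only allowed to move the model a small, bounded step toward higher-reward answers.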
GPT-3.5 vs. GPT-4
In March 2023, OpenAI released GPT-4. How many parameters does it have? They have not revealed this publicly, but it is said that trillions of parameters were used. We don’t know how much training data was fed into it either. But a very, very big improvement can be seen between GPT-3.5 and GPT-4; I have noticed it personally by running many tests between the two models. Even though it is a revolutionary technology, it is very important to keep its limitations in mind. I will discuss them in an entire chapter at the end.
Final Words
So, friends, you have now seen how this technology was made and what its significance is. That does not mean it is 100 percent accurate.
ChatGPT does not give a perfect answer to everything, all the time. It has some shortcomings, and those shortcomings come out of this very process. Because so many people were employed to train it, those people also bring their own biases when deciding that one output is better than another.
Which answer gets the top rank depends on the person: in one person’s opinion this output is the best, in another person’s opinion a different one is. The method of building a reward model cannot be inherently perfect, because people built it, and whatever biases and opinions those people hold are reflected in ChatGPT today as well.
A very big assumption is made in the RLHF method: that on certain things, all the people in the world will always hold the same opinion. In reality, that doesn’t happen. In the end, the model was evaluated on three criteria: helpfulness, truthfulness, and harmlessness.
And in my opinion, 99 percent of the time, the GPT-4 version of ChatGPT gives very objective answers. It is not that it is full of bias or incapable of giving an opinion. But there remains that 1 percent of cases where these small biases are reflected in the answers.