Paper Key: IRJ************249
Authors: Abdul Rehman, Prof. Dr. Ali Okatan
Date Published: 03 Jan 2025
Abstract
Reinforcement Learning from Human Feedback (RLHF) is an emerging approach for improving large language models (LLMs) by aligning them with human preferences in medical chat systems and other specialized domains. This research examines two RLHF configurations: the first pairs a RoBERTa reward model with a GPT-2 generation model, and the second pairs a Llama 3.1-8B reward model with a Llama 3.2-1B generation model. RoBERTa, an encoder designed for text classification, and Llama, a generative model known for strong contextual understanding, were evaluated for their effectiveness within the RLHF process. A medical conversation dataset was preprocessed into a pairwise comparison dataset of messages, accepted responses, and rejected responses. The data was then cleaned for clarity, and all messages were truncated to a 512-token limit. Both reward models distinguished accepted from rejected responses with high accuracy and were integrated into an RLHF framework using Proximal Policy Optimization (PPO). To align the generative models with human preferences, they were likewise constrained to the same 512-token context as the messages in the dataset. Performance was assessed on key indicators such as reward scores and policy updates. Low-Rank Adaptation (LoRA) and quantization were applied to the Llama-based setup to improve efficiency. RoBERTa's strong classification produced clear reward signals, but its limited contextual understanding made those signals less effective for steering GPT-2. The Llama-based pipeline, by contrast, generated outputs that were more coherent and contextually relevant. This study demonstrates the trade-offs between classification-based and context-aware reward models in RLHF systems, and it underscores the importance of high-quality datasets and resource-efficient training methods for getting the best performance from LLMs.
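To make the pairwise reward-modeling step concrete, the sketch below shows one standard way to train a RoBERTa-based reward model on (message, accepted, rejected) triples with a Bradley-Terry style pairwise loss and the 512-token limit described in the abstract. It is a minimal illustration, not the paper's actual code: the checkpoint name, hyperparameters, and example triple are assumptions.

```python
# Minimal sketch (not the paper's implementation): pairwise reward-model
# training with RoBERTa on (message, accepted, rejected) triples.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"  # assumption: the exact checkpoint is not stated
MAX_LEN = 512                # token limit stated in the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# A single scalar output head turns the classifier into a reward model.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def score(messages, responses):
    """Return one scalar reward per (message, response) pair."""
    batch = tokenizer(
        messages, responses,
        truncation=True, max_length=MAX_LEN,
        padding=True, return_tensors="pt",
    )
    return reward_model(**batch).logits.squeeze(-1)

def train_step(messages, accepted, rejected):
    """Bradley-Terry pairwise loss: push r(accepted) above r(rejected)."""
    r_acc = score(messages, accepted)
    r_rej = score(messages, rejected)
    loss = -F.logsigmoid(r_acc - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with a toy triple (not drawn from the actual dataset):
loss = train_step(
    ["Patient: I have had a persistent cough for a month. What should I do?"],
    ["A cough lasting over three weeks warrants a clinical evaluation."],
    ["Just ignore it; coughs always go away on their own."],
)
print(f"pairwise loss: {loss:.4f}")
```

The trained reward model would then supply the scalar rewards consumed by the PPO step; for the Llama-based configuration, the LoRA adapters and quantization mentioned in the abstract would typically be applied to the policy model through libraries such as PEFT and bitsandbytes, though the paper's exact setup is not specified here.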
DOI Requested