
Implementation of Reinforcement Learning from Human Feedback (RLHF)

Project description

InstructGoose

Paper: InstructGPT - Training language models to follow instructions with human feedback


Install

Install from PyPI

pip install instruct-goose

Install directly from the source code

git clone https://github.com/xrsrke/instructGOOSE.git
cd instructGOOSE
pip install -e .

Train the RL-based language model

from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

import torch
from torch.utils.data import DataLoader, random_split
from torch import optim

from instruct_goose import Agent, RewardModel, RLHFTrainer, RLHFConfig, create_reference_model

Step 1: Load dataset

dataset = load_dataset("imdb", split="train")
dataset, _ = random_split(dataset, lengths=[10, len(dataset) - 10]) # for demonstration purposes
train_dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
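Each batch from the dataloader is a dictionary whose "text" entry is a list of raw review strings; that is what the training loop below tokenizes. A minimal sketch (not part of the library) to peek at one batch:

batch = next(iter(train_dataloader))  # default collate: "text" is a list of strings, one review per example
print(type(batch["text"]), len(batch["text"]))  # e.g. <class 'list'> 4
print(batch["text"][0][:80])  # first 80 characters of one review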

Step 2: Load the pre-trained model and tokenizer

model_base = AutoModelForCausalLM.from_pretrained("gpt2") # for demonstration purposes
reward_model = RewardModel("gpt2")

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
eos_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
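The reward model scores whole token sequences. A minimal sketch of scoring a single piece of text, assuming (as in the training loop below) that the model takes input IDs and returns one scalar reward per sequence:

# score one piece of text with the reward model (sketch only)
sample = tokenizer("This movie was a delight from start to finish.", return_tensors="pt")
with torch.no_grad():
    sample_reward = reward_model(sample["input_ids"])  # one reward value per sequence
print(sample_reward)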

Step 3: Create the RL-based language model agent and the reference model

model = Agent(model_base)
ref_model = create_reference_model(model)
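The reference model is a frozen copy of the agent's starting weights. During PPO training, the policy is penalized for drifting too far from it, which keeps generations close to the original GPT-2. An illustrative sketch of that drift term, with random logits standing in for the two models' outputs (this is not the trainer's internal code):

import torch.nn.functional as F

# hypothetical logits for a batch of 1 sequence with 5 sampled tokens (GPT-2 vocab size 50257)
policy_logits = torch.randn(1, 5, 50257)
reference_logits = torch.randn(1, 5, 50257)
token_ids = torch.randint(0, 50257, (1, 5))

# per-token log-probabilities of the sampled tokens under each model
policy_logprobs = F.log_softmax(policy_logits, dim=-1).gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
reference_logprobs = F.log_softmax(reference_logits, dim=-1).gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

# the gap measures how far the policy has drifted from the frozen reference
approx_kl = (policy_logprobs - reference_logprobs).mean()
print(approx_kl)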

Step 4: Train it

max_new_tokens = 20
generation_kwargs = {
    "min_length":-1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": max_new_tokens
}

config = RLHFConfig()
N_EPOCH = 1 # for demonstration purposes
trainer = RLHFTrainer(model, ref_model, config)
optimizer = optim.SGD(model.parameters(), lr=1e-3)
for epoch in range(N_EPOCH):
    for batch in train_dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt")
        response_ids = model.generate(
            inputs["input_ids"], attention_mask=inputs["attention_mask"],
            **generation_kwargs
        )
        
        # extract the generated text
        response_ids = response_ids[:, -max_new_tokens:]
        response_attention_mask = torch.ones_like(response_ids)
        
        # evaluate from the reward model
        with torch.no_grad():
            text_input_ids = torch.stack([torch.concat([q, r]) for q, r in zip(inputs["input_ids"], response_ids)], dim=0)
            rewards = reward_model(text_input_ids)
        
        # calculate PPO loss
        loss = trainer.compute_loss(
            query_ids=inputs["input_ids"],
            query_attention_mask=inputs["attention_mask"],
            response_ids=response_ids,
            response_attention_mask=response_attention_mask,
            rewards=rewards
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"loss={loss}")
loss=-824.6560668945312
loss=0.030958056449890137
loss=4.284017562866211
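
After training, the fine-tuned policy can be sampled the same way it was during rollout. A minimal sketch, reusing model.generate and the generation settings from above:

# sample a completion from the fine-tuned policy (sketch only)
prompt = tokenizer(["The acting in this film was"], return_tensors="pt")
output_ids = model.generate(
    prompt["input_ids"], attention_mask=prompt["attention_mask"],
    **generation_kwargs
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))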

TODO

  • Add support for custom reward functions
  • Add support for custom value functions
  • Add support for non-transformer models
  • Write a config class

Resources

I implemented this using these resources



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

instruct_goose-0.0.7.tar.gz (11.5 kB)

Uploaded Source

Built Distribution

instruct_goose-0.0.7-py3-none-any.whl (12.7 kB)

Uploaded Python 3
