This repository contains several projects from the Coursera course Generative AI with Large Language Models, including summarization, text generation, and more.
I made some changes so the labs run on a local PC, since the original version targets AWS users.
If you have any questions, feel free to contact me.
Below are some notes from the Coursera / DeepLearning.AI course. I hope they are helpful.
Known as encoder-only LLMs (autoencoding models).
Trained with Masked Language Modeling (MLM).
Objective: reconstruct masked tokens ('denoising') using bidirectional context (see the sketch after this list).
Sentiment analysis
Named entity recognition
Word classification
BERT
RoBERTa
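As an illustration, here is a minimal sketch of masked-token prediction with an encoder-only model, using the Hugging Face transformers fill-mask pipeline; the model name and sentence are just examples.

```python
# Sketch: masked language modeling with an encoder-only model (BERT).
# The model sees context on BOTH sides of [MASK] (bidirectional).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The movie was absolutely [MASK]."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```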
Known as decoder-only LLMs (autoregressive models).
Trained with Causal Language Modeling (CLM).
Objective: predict the next token using unidirectional (left-to-right) context (see the sketch after this list).
Text generation
Other emergent behavior (Depends on model size)
GPT
BLOOM
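A minimal sketch of causal (next-token) generation with a decoder-only model; GPT-2 and the prompt are example choices.

```python
# Sketch: autoregressive text generation with a decoder-only model (GPT-2).
# The model only attends to tokens to the left (unidirectional context).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Once upon a time", max_new_tokens=30, do_sample=True)
print(out[0]["generated_text"])
```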
Known as encoder-decoder LLMs (sequence-to-sequence models).
Trained with span corruption.
Objective: reconstruct the corrupted spans (see the sketch after this list).
Translation
Text summarization
Question answering
T5
BART
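A minimal sketch of a sequence-to-sequence model (T5) handling translation and summarization via task prefixes; the model size and inputs are examples only.

```python
# Sketch: encoder-decoder model (T5) for translation and summarization.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: The quick brown fox jumped over the lazy dog because it was in a hurry.",
]:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```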
Distributed Data Parallel (DDP) speeds up training by replicating the full model on each GPU, processing different batches of data in parallel, and synchronizing gradients so every copy receives the same update; the model must still fit on a single GPU.
When the model is too big for one GPU, Fully Sharded Data Parallel (FSDP) reduces memory by sharding the model parameters, gradients, and optimizer states across GPUs instead of replicating them.
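Below is a minimal DDP training sketch, assuming the script is launched with torchrun (one process per GPU); the model, data, and hyperparameters are placeholders.

```python
# Sketch: Distributed Data Parallel training, launched via
# `torchrun --nproc_per_node=<num_gpus> train_ddp.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")              # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])            # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)    # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])        # replicate + sync gradients
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for _ in range(10):                                    # toy training loop
        x = torch.randn(8, 512, device=local_rank)
        loss = ddp_model(x).pow(2).mean()
        loss.backward()                                    # gradients are all-reduced here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```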
Different domains have their own specialized vocabulary, so pre-training (or continued pre-training) on domain-specific data can be an efficient and convenient way to adapt a model.
BloombergGPT, a financial LLM, is a good example of pre-training a model for increased domain specificity.
Fine-tuning on a single task may cause catastrophic forgetting: as the model specializes, it can lose capabilities it learned before.
Multi-task fine-tuning helps avoid catastrophic forgetting.
FLAN-T5 is a good example.
ROUGE is used to evaluate text summarization.
ROUGE-L is based on the longest common subsequence (LCS) between the generated text and the reference (see the sketch below).
BLEU is used to evaluate text translation.
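A rough, simplified sketch of ROUGE-L scoring; in practice you would use a library such as rouge_score or evaluate, and this toy version skips stemming and other details.

```python
# Sketch: ROUGE-L, an F1 score built from the longest common
# subsequence (LCS) of candidate and reference tokens.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.83
```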
Full fine-tuning is expensive in compute and memory. Parameter-efficient fine-tuning (PEFT) saves space and is more flexible.
Selective: Select a subset of the initial LLM parameters to fine-tune.
Reparameterization: Reparameterize model weight using a low-rank representation. (LoRA)
Additive: add trainable layers or parameters to the model (e.g. adapters, soft prompts).
- Freeze most of the original LLM weights.
- Inject two rank-decomposition matrices, A and B.
- Train the weights of the smaller matrices.
Steps to update the model for inference:
- Matrix-multiply the low-rank matrices: ${B}\times{A}$
- Add the result to the original weights: $W_{\text{frozen}} + {B}\times{A}$
This drastically reduces the number of parameters we have to train, so LoRA fine-tuning can often run on a single GPU (see the sketch below).
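A minimal numerical sketch of the LoRA update above; the dimensions and rank are arbitrary examples.

```python
# Sketch: LoRA combines a frozen weight matrix W with the product of two
# small rank-r matrices, B (d x r) and A (r x k).
import torch

d, k, r = 512, 512, 8                  # original weight is d x k, rank r << d, k
W_frozen = torch.randn(d, k)           # pre-trained weight, never updated
A = torch.randn(r, k) * 0.01           # trainable
B = torch.zeros(d, r)                  # trainable, initialized to zero

# For inference: matrix-multiply the low-rank matrices and add to the frozen weights.
W_effective = W_frozen + B @ A

# Trainable parameters vs. full fine-tuning of this matrix:
print("full:", W_frozen.numel(), "LoRA:", A.numel() + B.numel())
```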
Prompt tuning is not prompt engineering.
Prompt tuning prepends additional trainable tokens, called soft prompts, to the input while the LLM weights stay frozen (see the sketch below).
For larger models, prompt tuning can perform about as well as full fine-tuning.
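A conceptual sketch of how soft prompts are prepended to the input embeddings; the dimensions are arbitrary, and real implementations (for example the peft library) handle this inside the model.

```python
# Sketch: prompt tuning prepends a small matrix of trainable "soft prompt"
# embeddings to the frozen model's token embeddings.
import torch

embedding_dim, num_virtual_tokens = 768, 20
soft_prompt = torch.nn.Parameter(torch.randn(num_virtual_tokens, embedding_dim) * 0.01)

# token_embeddings would normally come from the frozen LLM's embedding layer.
token_embeddings = torch.randn(1, 12, embedding_dim)        # batch of 1, 12 tokens
inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeddings], dim=1)
print(inputs_embeds.shape)  # (1, 32, 768); only soft_prompt is trained
```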
Models behaving badly:
Toxic language
Aggressive responses
Providing dangerous information
HHH is a standard to evaluate the model's behavior.
HHH:
Helpful
Honest
Harmless
Fine-tuning with human feedback scores higher than instruct fine-tuning alone, the pretrained model with no fine-tuning, and even the reference human baseline.
RLHF can maximize helpfulness and relevance, minimize harm, and avoid dangerous topics.
The agent will get feedback from the environment.
The goal is to get the maximum cumulative reward from the environment.
Completions that meet the HHH criteria receive a higher reward than ones that do not.
Define the model alignment criterion, such as helpfulness, honesty, or harmlessness.
Prompt -> LLM -> several completions
Have human labelers rank the completions according to the chosen criterion.
Rankings can vary from person to person, so we collect labels from multiple labelers for the same completions.
Use the reward model as a binary classifier (positive or negative); the logit of the positive class serves as the reward (see the sketch after the RLHF loop below).
Prompt Dataset -> RL-updated LLM -> Reward Model -> RL algorithm -> (back to RL-updated LLM); iterate until the model aligns with human preferences.
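A minimal sketch of scoring a completion with a reward model; the checkpoint follows the course lab's toxicity classifier, but the label order is an assumption you should verify via model.config.id2label.

```python
# Sketch: a sequence-classification model as a binary reward model; the
# positive-class ("nothate") logit is used as the reward signal for PPO.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/roberta-hate-speech-dynabench-r4-target"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "You are a wonderful person and I appreciate your help."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits        # one logit per class
reward = logits[0, 0].item()                      # assumes the positive class is index 0
print(reward)
```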
PPO (Proximal Policy Optimization) is the RL algorithm typically used for this step.
The goal is to update the policy so that the reward is maximized.
The reward model scores each completion.
PPO also learns a value function that estimates the expected future reward of a completion; minimizing the value loss makes that estimate accurate, and it is used to compute the advantage.
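As a hedged reference, the standard squared-error form of the value loss (following the PPO paper; the course slides may include a factor of one half) is

$$L^{VF}(\theta) = \big(V_\theta(s_t) - V_t^{\text{target}}\big)^2$$

where $V_\theta(s_t)$ is the estimated future total reward and $V_t^{\text{target}}$ is the observed one.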
Calculate policy loss:
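Following the standard PPO formulation (notation may differ slightly from the course slides), the clipped surrogate policy loss is

$$L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}$$

where $\hat{A}_t$ is the estimated advantage and $\epsilon$ keeps each policy update inside a small trust region.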
Maximizing this expression results in a better human-aligned LLM.
Calculate entropy loss:
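In the standard formulation, the entropy bonus for a state is

$$S[\pi_\theta](s_t) = -\sum_{a}\pi_\theta(a\mid s_t)\,\log \pi_\theta(a\mid s_t)$$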
Entropy makes the model remain creative.
Higher entropy leads to higher creativity.
Objective function:
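Putting the pieces together, the combined PPO objective (as in the original PPO paper; $c_1$ and $c_2$ are hyperparameters, and the course's exact coefficients may differ) is

$$L^{PPO}(\theta) = \mathbb{E}_t\Big[L^{CLIP}_t(\theta) - c_1\,L^{VF}_t(\theta) + c_2\,S[\pi_\theta](s_t)\Big]$$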
After many iterations, we end up with a human-aligned model.
If the reward model itself rewards harmful behavior, or the policy learns to game it (reward hacking), we can end up training a very bad model.
How to avoid Reward hacking:
KL Divergence Shift Penalty
This penalty compares the RL-updated LLM's output distribution with that of a frozen reference model.
The KL divergence penalty is combined with (subtracted from) the reward-model score before it is passed to PPO, so the policy is penalized for drifting too far from the reference model.
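In the InstructGPT-style formulation (the course may write this slightly differently), the penalized reward passed to PPO is

$$R(x, y) = r_\theta(x, y) - \beta\,\log\frac{\pi^{RL}(y\mid x)}{\pi^{ref}(y\mid x)}$$

where $r_\theta$ is the reward-model score, $\pi^{RL}$ is the RL-updated LLM, $\pi^{ref}$ is the frozen reference model, and $\beta$ controls the strength of the KL penalty.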
How to evaluate the human-aligned LLM:
Use a test dataset of prompts.
Have both the instruct LLM and the human-aligned LLM generate completions.
Feed both sets of outputs to the reward model.
Compare the scores to see whether the human-aligned LLM improves on the instruct LLM.
RAG (retrieval-augmented generation) improves the model's accuracy and supports fact-checking by letting it access external resources at inference time (see the sketch below).
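A toy sketch of the RAG flow: retrieve the most relevant document and prepend it to the prompt so the LLM can ground its answer. The keyword-overlap retriever and the generate placeholder are stand-ins for a real vector store and a real LLM call.

```python
# Sketch: retrieval-augmented generation with a toy keyword-overlap retriever.
documents = [
    "The company's 2023 revenue was $4.2B, up 12% year over year.",
    "The return policy allows refunds within 30 days of purchase.",
    "Our headquarters moved to Austin, Texas in 2021.",
]

def retrieve(question, docs):
    """Pick the document sharing the most words with the question (toy scorer)."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def generate(prompt):
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"  # placeholder LLM call

question = "What was the revenue in 2023?"
context = retrieve(question, documents)
prompt = f"Answer using only the context below.\n\nContext: {context}\n\nQuestion: {question}"
print(generate(prompt))
```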
Chain-of-thought prompting: showing step-by-step reasoning in the prompt helps LLMs produce more accurate output (example below).
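A minimal one-shot chain-of-thought prompt; the worked example is the well-known one from the chain-of-thought paper (Wei et al., 2022) and simply demonstrates the format.

```python
# Sketch: a one-shot chain-of-thought prompt. The worked example shows the
# intermediate reasoning steps, nudging the model to reason step by step
# on the new question before giving its answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
print(cot_prompt)  # send this prompt to the LLM
```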