
Why does LLM supervised fine tuning only need small amounts of data?


I’ve taken some LLM courses, reproduced a small LLM from scratch, and trained it on Shakespeare data. Now I’m learning about supervised fine tuning, but I’m having difficulty understanding why it needs a much smaller dataset than pre-training does, i.e. why a model pre-trained on the entire open Internet can be fine-tuned with just a thousand sentences.

Here is my understanding of supervised fine tuning: it is simply the pre-training process again, but using a different set of data and training for fewer iterations. (Please correct me if I’m wrong.)
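To make that concrete, here is a minimal sketch of what I mean. All names are my own, a toy embedding-plus-linear model stands in for my small GPT, and random token tensors stand in for real corpora; the point is only that (in my understanding) the loss and the update step are identical in both phases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, CTX, DIM = 1000, 64, 128

# Stand-in for my small GPT: any causal LM mapping (B, T) token ids to (B, T, VOCAB) logits.
model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train_steps(get_batch, num_steps):
    """One shared loop: next-token cross-entropy, exactly as in pre-training."""
    for _ in range(num_steps):
        tokens = get_batch()                      # (B, CTX) token ids
        logits = model(tokens[:, :-1])            # predict token t+1 from tokens up to t
        loss = F.cross_entropy(
            logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy batch samplers standing in for "the open Internet" vs. "a thousand curated sentences".
pretrain_batch = lambda: torch.randint(0, VOCAB, (8, CTX))
sft_batch      = lambda: torch.randint(0, VOCAB, (8, CTX))

train_steps(pretrain_batch, num_steps=1000)   # in reality: millions of steps over a huge corpus
train_steps(sft_batch, num_steps=20)          # in reality: ~1k examples for a few epochs
```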

If the above understanding is correct, then I wonder:

  1. Does it mean that during the pre-training process (which has billions of iterations), the last 1k iterations must use examples that are diverse enough? If, unfortunately, the last 1k examples happen to all come from, say, a single book (since examples are randomly sampled from the full corpus), would that effectively fine-tune the model on that book (and therefore ruin it)? This seems too brittle to believe.
  2. Does that mean the most recent training examples have a higher impact than earlier ones? This seems to be the foundation of why fine-tuning works; otherwise the fine-tuning data would just be equivalent to any other pre-training data.
  3. Or, if 2 isn’t true, is coverage the key? E.g. if I have already pre-trained a model on a corpus of 1k books, does it make a big difference if I fine-tune on one of those books again? (See the sketch after this list for the kind of experiment I have in mind.)
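Reusing `model`, `train_steps`, `VOCAB`, and `CTX` from the sketch above (again with a toy random tensor in place of a real book), this is the kind of check I am imagining for question 3: measure the model’s loss on a book that was already in the pre-training corpus, fine-tune on that same book again, and see how much the loss moves.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def avg_loss(model, tokens):
    """Average next-token cross-entropy of the model on a (B, CTX) batch of token ids."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    ).item()

book_k = torch.randint(0, VOCAB, (8, CTX))     # stand-in for a book already in the 1k-book corpus
before = avg_loss(model, book_k)
train_steps(lambda: book_k, num_steps=20)      # "fine-tune" on that same book again
after = avg_loss(model, book_k)
print(f"loss on book_k: {before:.3f} -> {after:.3f}")
```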

Thanks!
