
Why does LLM supervised fine tuning only need small amounts of data?


I’ve taken some LLM courses, reproduced a small LLM from scratch, and trained it on Shakespeare data. Now I’m learning about supervised fine tuning, but I’m having difficulty understanding why it needs a much smaller dataset than pre-training does, i.e. why a model pre-trained on the entire open Internet can be fine-tuned with just a thousand sentences.

Here is my understanding of supervised fine tuning: it is simply the pre-training process again, but using a different set of data and training for fewer iterations. (Please correct me if I’m wrong.)
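To make that concrete, here is a minimal sketch of what I mean. All names are my own, a toy embedding-plus-linear model stands in for my small GPT, and random token tensors stand in for real corpora; the point is only that (in my understanding) the loss and the update step are identical in both phases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, CTX, DIM = 1000, 64, 128

# Stand-in for my small GPT: any causal LM mapping (B, T) token ids to (B, T, VOCAB) logits.
model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train_steps(get_batch, num_steps):
    """One shared loop: next-token cross-entropy, exactly as in pre-training."""
    for _ in range(num_steps):
        tokens = get_batch()                      # (B, CTX) token ids
        logits = model(tokens[:, :-1])            # predict token t+1 from tokens up to t
        loss = F.cross_entropy(
            logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy batch samplers standing in for "the open Internet" vs. "a thousand curated sentences".
pretrain_batch = lambda: torch.randint(0, VOCAB, (8, CTX))
sft_batch      = lambda: torch.randint(0, VOCAB, (8, CTX))

train_steps(pretrain_batch, num_steps=1000)   # in reality: millions of steps over a huge corpus
train_steps(sft_batch, num_steps=20)          # in reality: ~1k examples for a few epochs
```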

If the above understanding is correct, then I wonder:

  1. Does it mean that during the pre-training process (which has billions of iterations), the last 1k iterations must use examples that are diverse enough? If, unfortunately, the last 1k examples happen to all come from, say, a single book (since examples are randomly sampled from the full corpus), would that effectively fine-tune the model on that book (and therefore ruin it)? This seems too brittle to believe.
  2. Does that mean the most recent training examples have a higher impact than earlier ones? This seems to be the foundation of why fine-tuning works; otherwise the fine-tuning data would just be equivalent to any other pre-training data.
  3. Or, if 2 isn’t true, is coverage the key? E.g. if I have already pre-trained a model on a corpus of 1k books, does it make a big difference if I fine-tune on one of those books again? (See the sketch after this list for the kind of experiment I have in mind.)
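Reusing `model`, `train_steps`, `VOCAB`, and `CTX` from the sketch above (again with a toy random tensor in place of a real book), this is the kind of check I am imagining for question 3: measure the model’s loss on a book that was already in the pre-training corpus, fine-tune on that same book again, and see how much the loss moves.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def avg_loss(model, tokens):
    """Average next-token cross-entropy of the model on a (B, CTX) batch of token ids."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    ).item()

book_k = torch.randint(0, VOCAB, (8, CTX))     # stand-in for a book already in the 1k-book corpus
before = avg_loss(model, book_k)
train_steps(lambda: book_k, num_steps=20)      # "fine-tune" on that same book again
after = avg_loss(model, book_k)
print(f"loss on book_k: {before:.3f} -> {after:.3f}")
```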

Thanks!
