I’ve been exploring how large language models perform on relatively complex reasoning tasks and noticed something interesting: a larger model (hundreds of billions of parameters) excels at these tasks, while a smaller distilled model (tens of billions of parameters) struggles significantly. I’ve tried improving the smaller model with domain-specific distillation or fine-tuning, but the gains seem limited. I’d love to get your input on a few questions:
1. Is model size (parameter count) the primary factor determining the performance ceiling for complex reasoning tasks?
2. For a smaller model (e.g., tens of billions of parameters), can further training or optimization bring its performance close to that of a larger model on complex reasoning tasks, or is parameter count a hard limit?
3. Are there any papers or practical experiences you could share on this topic?

Thanks for any insights or discussion!
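For context on question 2: the distillation I tried follows the standard soft-label recipe (a KL term between temperature-softened teacher and student distributions, mixed with the usual cross-entropy on ground-truth labels). A minimal, framework-free sketch of that objective, with illustrative logits and hyperparameters (`temperature`, `alpha` values are just examples, not tuned settings):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    # Soft-label term: KL(teacher || student) on softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kd = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # Hard-label term: cross-entropy against the ground-truth class.
    ce = -math.log(softmax(student_logits)[true_label])
    # Soft-term gradients scale with 1/T^2, so T^2 rebalances the mix.
    return alpha * (temperature ** 2) * kd + (1 - alpha) * ce

# Toy example: 3-way next-token distribution from teacher and student.
loss = distillation_loss([2.0, 0.5, 0.1], [3.0, 1.0, 0.2], true_label=0)
```

My observation is that this drives the student's token-level distribution toward the teacher's, yet the gap on multi-step reasoning persists, which is what prompted the questions above.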