I am trying to budget for setting up an LLM-based RAG application that will serve a dynamic number of users (anything from 100 to 2000).
I can figure out the GPU memory requirement to host a given LLM [1]; for example, LLaMA 70B at half precision requires about 168 GB. But I cannot figure out how to calculate the token speed for a single user, then for multiple concurrent users, and how to select appropriate hardware for that.
How should I approach this problem?
Thanks for taking the time to read this.

[1]: https://www.substratus.ai/blog/calculating-gpu-memory-for-llm
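For context, here is a minimal sketch of how the 168 GB figure in [1] is typically arrived at: parameter count times bytes per parameter, plus roughly 20% overhead for KV cache and activations. The 1.2 overhead factor is an assumption from that rule of thumb, not a hard rule.

```python
def estimate_gpu_memory_gb(params_billion: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough GPU memory estimate: weights (params * bytes per param),
    plus ~20% overhead for KV cache and activations (assumed factor)."""
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

# LLaMA 70B at half precision (16-bit): 70 * 2 * 1.2 = 168 GB
print(estimate_gpu_memory_gb(70, bits=16))  # -> 168.0
```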
1 Answer
From experience - it is not so simple. You need to take into account:
- engine used for inference (TGI? pure transformers? llama-cpp?)
- card type (it really matters whether it is an H100, an L40S, or an A100)
- batch size
- is it a chatbot-like experience, or do you need to process offline?
- what is the maximum context you would like to process?
On the basis of this you need to run some benchmarks and generalize from them.
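As a concrete starting point, here is a minimal benchmark sketch against an OpenAI-compatible completions endpoint (as exposed by servers such as vLLM or TGI). The endpoint URL, model name, and prompt are placeholders, not values from the question; the idea is to measure aggregate and per-user tokens per second at different concurrency levels and extrapolate to your expected user count.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder values - point these at your own deployment.
ENDPOINT = "http://localhost:8000/v1/completions"  # OpenAI-compatible server (e.g. vLLM)
MODEL = "meta-llama/Llama-2-70b-hf"                # whatever name your server registers
PROMPT = "Summarize the following document: ..."   # use a RAG-sized prompt, not a toy one

def one_request() -> int:
    """Send a single completion request and return the number of generated tokens."""
    resp = requests.post(ENDPOINT, json={
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": 256,
    }, timeout=300)
    resp.raise_for_status()
    # Many OpenAI-compatible servers report token counts in a "usage" field;
    # fall back to max_tokens if yours does not.
    return resp.json().get("usage", {}).get("completion_tokens", 256)

def benchmark(concurrency: int, total_requests: int = 32) -> None:
    """Fire total_requests completions with the given concurrency and report throughput."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = list(pool.map(lambda _: one_request(), range(total_requests)))
    elapsed = time.time() - start
    print(f"concurrency={concurrency:3d}  "
          f"aggregate={sum(tokens) / elapsed:8.1f} tok/s  "
          f"per-user={sum(tokens) / elapsed / concurrency:6.1f} tok/s")

# Sweep concurrency to find where aggregate throughput saturates and
# per-user speed drops below what your chatbot experience can tolerate.
for c in (1, 4, 16, 64):
    benchmark(c)
```

Run this with your real prompt lengths and generation lengths, on the actual card and engine you plan to buy time on; the numbers vary enough across those factors that only a measurement like this gives you a defensible budget.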