python 3.x - Unable to figure out the hardware requirement (Cloud or on-prem) for open source inference for multiple users

I am trying to budget for setting up an LLM-based RAG application which will serve a dynamic number of users (anywhere from 100 to 2000).

I am able to figure out the GPU memory requirement to host a given LLM[1]; for example, Llama 70B at half precision will require about 168 GB. But I am unable to figure out how to calculate the token speed for a single user and then for multiple concurrent users, or how to pick appropriate hardware for that.

How should I approach this problem?

Thanks for taking the time to read this.

[1]: https://www.substratus.ai/blog/calculating-gpu-memory-for-llm
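For reference, the 168 GB figure follows from the rule of thumb described in the linked post: weights at the chosen precision plus roughly 20% overhead for inference. A minimal sketch of that arithmetic, where the 1.2 overhead factor is an assumption, not a vendor-validated number:

```python
# Back-of-the-envelope VRAM estimate: weights take (parameters x bytes per
# parameter), plus an assumed ~20% overhead for KV cache, activations and
# framework buffers. Treat the 1.2 factor as a rough rule of thumb.
def gpu_memory_gb(params_billion: float, bytes_per_param: float = 2.0,
                  overhead: float = 1.2) -> float:
    """Rough serving-memory estimate in GB for a model of the given size."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * overhead


# Llama 70B at half precision (2 bytes per parameter):
print(f"{gpu_memory_gb(70):.0f} GB")  # -> 168 GB, the figure quoted above
```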

asked Nov 19, 2024 at 17:50 by Bing

1 Answer


From experience, it is not so simple. You need to take into account:

  1. the engine used for inference (TGI? pure transformers? llama.cpp?)
  2. the card type (it really matters whether it is an H100, an L40S, or an A100)
  3. the batch size
  4. whether it is a chatbot-like experience or offline (batch) processing
  5. the maximum context length you would like to process

On the basis of this, you need to run some benchmarks and generalize from them; a sketch of such a benchmark follows.
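A minimal load-test sketch for that last step, assuming the model is already served behind an OpenAI-compatible completions endpoint (vLLM's server exposes one, for example). The endpoint URL, model id, prompt and concurrency levels below are placeholders to adapt to your deployment:

```python
# Concurrency benchmark sketch against an OpenAI-compatible completions API.
# ENDPOINT, MODEL and PROMPT are illustrative placeholders.
import concurrent.futures
import json
import statistics
import time
import urllib.request

ENDPOINT = "http://localhost:8000/v1/completions"   # hypothetical local server
MODEL = "meta-llama/Llama-3.1-70B-Instruct"          # placeholder model id
PROMPT = "Summarise the benefits of retrieval-augmented generation."
MAX_TOKENS = 256


def one_request() -> float:
    """Send one completion request and return generated tokens per second."""
    payload = json.dumps(
        {"model": MODEL, "prompt": PROMPT, "max_tokens": MAX_TOKENS}
    ).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"] / elapsed


def benchmark(concurrent_users: int) -> None:
    """Fire `concurrent_users` requests at once and report per-user speed."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        speeds = list(pool.map(lambda _: one_request(), range(concurrent_users)))
    print(
        f"{concurrent_users:4d} users: "
        f"median {statistics.median(speeds):6.1f} tok/s per user, "
        f"aggregate {sum(speeds):7.1f} tok/s"
    )


if __name__ == "__main__":
    # Sweep the concurrency levels you expect in production and watch where
    # per-user speed drops below your latency budget.
    for users in (1, 8, 32, 128):
        benchmark(users)
```

Repeating the sweep on each candidate card (H100 vs L40S vs A100), with your real context lengths and batch settings, gives you the per-user tokens/second numbers you need to size the hardware.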
