python 3.x - Unable to figure out the hardware requirement (Cloud or on-prem) for open source inference for multiple users

I am trying to budget for setting up an LLM-based RAG application which will serve a dynamic number of users (anywhere from 100 to 2000).

I am able to figure out the GPU memory requirement to host a given LLM[1]; for example, Llama 70B at half precision will require about 168 GB. But I am unable to figure out how to calculate the token speed for a single user and then for multiple concurrent users, or how to pick appropriate hardware for that.

How should I approach this problem?

Thanks for taking the time to read this.

[1]: https://www.substratus.ai/blog/calculating-gpu-memory-for-llm
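For reference, the 168 GB figure follows from the rule of thumb described in the linked post: weights at the chosen precision plus roughly 20% overhead for inference. A minimal sketch of that arithmetic, where the 1.2 overhead factor is an assumption, not a vendor-validated number:

```python
# Back-of-the-envelope VRAM estimate: weights take (parameters x bytes per
# parameter), plus an assumed ~20% overhead for KV cache, activations and
# framework buffers. Treat the 1.2 factor as a rough rule of thumb.
def gpu_memory_gb(params_billion: float, bytes_per_param: float = 2.0,
                  overhead: float = 1.2) -> float:
    """Rough serving-memory estimate in GB for a model of the given size."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * overhead


# Llama 70B at half precision (2 bytes per parameter):
print(f"{gpu_memory_gb(70):.0f} GB")  # -> 168 GB, the figure quoted above
```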

asked Nov 19, 2024 at 17:50 by Bing

1 Answer


From experience, it is not so simple. You need to take into account:

  1. the engine used for inference (TGI? pure transformers? llama.cpp?)
  2. the card type (it really matters whether it is an H100, an L40S, or an A100)
  3. the batch size
  4. whether it is a chatbot-like experience or offline (batch) processing
  5. the maximum context length you would like to process

On the basis of this, you need to run some benchmarks and generalize from them; a sketch of such a benchmark follows.
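A minimal load-test sketch for that last step, assuming the model is already served behind an OpenAI-compatible completions endpoint (vLLM's server exposes one, for example). The endpoint URL, model id, prompt and concurrency levels below are placeholders to adapt to your deployment:

```python
# Concurrency benchmark sketch against an OpenAI-compatible completions API.
# ENDPOINT, MODEL and PROMPT are illustrative placeholders.
import concurrent.futures
import json
import statistics
import time
import urllib.request

ENDPOINT = "http://localhost:8000/v1/completions"   # hypothetical local server
MODEL = "meta-llama/Llama-3.1-70B-Instruct"          # placeholder model id
PROMPT = "Summarise the benefits of retrieval-augmented generation."
MAX_TOKENS = 256


def one_request() -> float:
    """Send one completion request and return generated tokens per second."""
    payload = json.dumps(
        {"model": MODEL, "prompt": PROMPT, "max_tokens": MAX_TOKENS}
    ).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"] / elapsed


def benchmark(concurrent_users: int) -> None:
    """Fire `concurrent_users` requests at once and report per-user speed."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        speeds = list(pool.map(lambda _: one_request(), range(concurrent_users)))
    print(
        f"{concurrent_users:4d} users: "
        f"median {statistics.median(speeds):6.1f} tok/s per user, "
        f"aggregate {sum(speeds):7.1f} tok/s"
    )


if __name__ == "__main__":
    # Sweep the concurrency levels you expect in production and watch where
    # per-user speed drops below your latency budget.
    for users in (1, 8, 32, 128):
        benchmark(users)
```

Repeating the sweep on each candidate card (H100 vs L40S vs A100), with your real context lengths and batch settings, gives you the per-user tokens/second numbers you need to size the hardware.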
