I have a host with 2 CUDA devices and a lot of text chunks to embed. As part of my project I'm using SentenceTransformer with the model BAAI/bge-small-en. I'm having difficulty making use of the second device.
As I understand it, the following code instantiates a model and loads it onto the first CUDA device. With this implementation, model.encode can process about 100 chunks per second:

model = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True, device='cuda')
However, if I try to use multi-process encoding to engage both devices, throughput drops to about 0.5 chunks per second:

pool = model.start_multi_process_pool()

def encode(docs):
    return model.encode_multi_process(docs, pool=pool)

This rate held for batch sizes from 16 through 256, and for larger batch sizes performance was even worse.
Is there something else I can do to take advantage of the second CUDA device, or am I doing something incorrect with the encode_multi_process method?
- The term "cuda core" makes no sense in this context; there is no GPU with just 2 CUDA cores. See for example "Are GPU/CUDA cores SIMD ones?" for an explanation of that term. Do you mean you have two GPUs? – Homer512, Mar 12 at 21:00
- Yes, I mean I have 2 CUDA devices. I'll edit the question. – BBrooklyn, Mar 12 at 22:27
- Can you add the parameter model_kwargs={'device_map': "auto"} to SentenceTransformer? – rehaqds, Mar 14 at 21:37
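A minimal sketch of that last suggestion, assuming a sentence-transformers version that accepts model_kwargs (device_map='auto' is handled by accelerate and may shard the model across the GPUs rather than replicate it):

# Hypothetical illustration of the comment above; device_map='auto' requires accelerate
model = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True,
                            model_kwargs={'device_map': 'auto'})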
1 Answer
You can try initializing two models, one per CUDA device, and splitting the documents between them with a ThreadPoolExecutor or ProcessPoolExecutor.
Pseudocode:
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sentence_transformers import SentenceTransformer

# One model instance per GPU
model_gpu0 = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True, device='cuda:0')
model_gpu1 = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True, device='cuda:1')

def encode_docs(model, docs, batch_size=128):
    return model.encode(docs, batch_size=batch_size)

# Split the corpus in half, one half per device
mid_index = len(docs) // 2
docs0 = docs[:mid_index]
docs1 = docs[mid_index:]

batch_size = 128
with ThreadPoolExecutor(max_workers=2) as executor:
    future0 = executor.submit(encode_docs, model_gpu0, docs0, batch_size)
    future1 = executor.submit(encode_docs, model_gpu1, docs1, batch_size)
    embeddings0 = future0.result()
    embeddings1 = future1.result()

# encode() returns NumPy arrays, so concatenate along the row axis rather than add
output = np.concatenate([embeddings0, embeddings1], axis=0)
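A ThreadPoolExecutor is enough here despite the GIL, because the heavy work runs in CUDA kernels and PyTorch releases the GIL while they execute; a ProcessPoolExecutor avoids the GIL entirely but loads the model in separate processes.

Alternatively, the built-in multi-process path from the question can be pointed at both GPUs explicitly. A minimal sketch, assuming the current sentence-transformers API, where target_devices and chunk_size are the relevant knobs and the values shown are illustrative:

# Spawn one worker process per listed device, feed it large chunks of documents
pool = model.start_multi_process_pool(target_devices=['cuda:0', 'cuda:1'])
embeddings = model.encode_multi_process(docs, pool=pool, batch_size=128, chunk_size=1000)
model.stop_multi_process_pool(pool)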