I have a host with 2 CUDA devices and a lot of text chunks to embed. As part of my project I'm using SentenceTransformer with the model BAAI/bge-small-en. I'm having difficulty making use of the second device.
As I understand it, the following code instantiates a model and loads it onto the first CUDA device. With this implementation, model.encode can process about 100 chunks per second:

model = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True, device='cuda')
However, if I try to use multi-process encoding to engage both devices, throughput drops to about 0.5 chunks per second:

pool = model.start_multi_process_pool()

def encode(docs):
    return model.encode_multi_process(docs, pool=pool)

This rate held for batch sizes from 16 through 256, and for larger batch sizes performance was even worse.
Is there something else I can do to take advantage of the second CUDA device, or am I doing something incorrect with the encode_multi_process method?
- The term "cuda core" makes no sense in this context; there is no GPU with just 2 CUDA cores. See for example "Are GPU/CUDA cores SIMD ones?" for an explanation of that term. Do you mean you have two GPUs? – Homer512, Mar 12 at 21:00
- Yes, I mean I have 2 CUDA devices. I'll edit the question. – BBrooklyn, Mar 12 at 22:27
- Can you add the parameter model_kwargs={'device_map': "auto"} to SentenceTransformer? – rehaqds, Mar 14 at 21:37
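A minimal sketch of that last suggestion, assuming a sentence-transformers version that accepts model_kwargs (device_map='auto' is handled by accelerate and may shard the model across the GPUs rather than replicate it):

# Hypothetical illustration of the comment above; device_map='auto' requires accelerate
model = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True,
                            model_kwargs={'device_map': 'auto'})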
1 Answer
You can try initializing two models, one per CUDA device, and splitting the documents between them with a ThreadPoolExecutor or ProcessPoolExecutor.
Pseudocode:
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sentence_transformers import SentenceTransformer

# One model instance per GPU
model_gpu0 = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True, device='cuda:0')
model_gpu1 = SentenceTransformer('BAAI/bge-small-en', trust_remote_code=True, device='cuda:1')

def encode_docs(model, docs, batch_size=128):
    return model.encode(docs, batch_size=batch_size)

# Split the corpus in half, one half per device
mid_index = len(docs) // 2
docs0 = docs[:mid_index]
docs1 = docs[mid_index:]

batch_size = 128
with ThreadPoolExecutor(max_workers=2) as executor:
    future0 = executor.submit(encode_docs, model_gpu0, docs0, batch_size)
    future1 = executor.submit(encode_docs, model_gpu1, docs1, batch_size)
    embeddings0 = future0.result()
    embeddings1 = future1.result()

# encode() returns NumPy arrays, so concatenate along the row axis rather than add
output = np.concatenate([embeddings0, embeddings1], axis=0)
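A ThreadPoolExecutor is enough here despite the GIL, because the heavy work runs in CUDA kernels and PyTorch releases the GIL while they execute; a ProcessPoolExecutor avoids the GIL entirely but loads the model in separate processes.

Alternatively, the built-in multi-process path from the question can be pointed at both GPUs explicitly. A minimal sketch, assuming the current sentence-transformers API, where target_devices and chunk_size are the relevant knobs and the values shown are illustrative:

# Spawn one worker process per listed device, feed it large chunks of documents
pool = model.start_multi_process_pool(target_devices=['cuda:0', 'cuda:1'])
embeddings = model.encode_multi_process(docs, pool=pool, batch_size=128, chunk_size=1000)
model.stop_multi_process_pool(pool)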