I am trying to optimize the performance of a Node.js function that generates audio segments (for example, via OpenAI's TTS API) and then concatenates the resulting audio buffers. My goal is to speed up the reading of each audio response into a buffer and the final concatenation of the audio buffers.
Current Approach
Naively, I assumed it would be as simple as requesting all of the audio segments, reading each one into its own buffer, and joining them. Here is a simplified version of what I am doing now:
```javascript
const createAudioSegment = async (text) => {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "echo",
    input: text,
  });
  return response;
};

const audio_texts = ["text 1", "text 2"]; // list of text to turn into audio

const segmentsTimeStart = new Date().getTime();
// Process dialogue segments in parallel
const audioSegments = await Promise.all(
  audio_texts.map((text) => createAudioSegment(text))
);
const segmentsTimeEnd = new Date().getTime();
const segmentsTimeDiff = segmentsTimeEnd - segmentsTimeStart;
console.log(`Total Audio Segment Time: ${segmentsTimeDiff}ms`);

const audioBufferReadStart = new Date().getTime();
let responseTimes = [];
const audioBuffers = await Promise.all(
  audioSegments.map(async (segment) => {
    const responseStartTime = new Date().getTime();
    const arrayBuffer = await segment.arrayBuffer();
    const responseTimeDiff = responseEndTime - responseStartTime;
    responseTimes.push(responseTimeDiff);
    return Buffer.from(arrayBuffer);
  })
);
const audioBufferReadEnd = new Date().getTime();
console.log(`Audio Buffer Total Read Time: ${audioBufferReadEnd - audioBufferReadStart}ms`);
console.log(`Audio Buffer Individual Read Times (ms): ${responseTimes}`);

const concatStartTime = new Date().getTime();
const finalBuffer = Buffer.concat(audioBuffers);
const concatEndTime = new Date().getTime();
console.log(`Buffer Concatenation Time: ${concatEndTime - concatStartTime}ms`);
```
Performance Issues
Overall, this usually takes 40-60 seconds. However, when logging the individual operations, I noticed that reading each audio segment into its individual buffer takes the majority of the time. As an example, for a scenario with 12 audio segments, I see the following timing:
- Total Audio Segment Time: 2333 ms
- Audio Buffer Total Read Time: 27455 ms
- Audio Buffer Individual Read Times (ms): [1035,3420,6497,96,150,360,70,88,20344,32,83,254]
- Buffer Concatenation Time: 2 ms
It takes over 20 seconds to read all of the responses into their buffers. Why are some under 1 second and some over 20 seconds? I don't understand.
Question
What could be causing such a large discrepancy between the individual read times? Is this avoidable with a more efficient approach?
Answer
Note: This is not the final answer, but my attempt to narrow down the problem.
I have noticed a few issues with your code, the first being:
Issue 1
```javascript
const audioSegments = await Promise.all(
  audio_texts.map((text) => createAudioSegment(text))
);
```
Notice that you are passing a string (`text`) to `createAudioSegment`, but the definition expects an array of strings:
```javascript
const createAudioSegment = async ([text]) => {
```
Strings are iterable, so the destructuring grabs only the very first character (just `t`), which is all you were sending to OpenAI.
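This pitfall can be reproduced without any API call; array destructuring in a parameter list consumes the argument's iterator, and a string's iterator yields characters:

```javascript
// Strings are iterable, so an array pattern in the parameter list
// pulls off only the first character of a string argument.
const firstCharOnly = ([text]) => text;

console.log(firstCharOnly("hello")); // "h", not "hello"
```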
Issue 2
```javascript
const responseTimeDiff = responseEndTime - responseStartTime
```
`responseEndTime` is never defined.
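A corrected timing pattern captures the end timestamp after the awaited operation. This is a minimal, self-contained sketch (`timeAsync` is a hypothetical helper, and a `setTimeout` stands in for the network read):

```javascript
// Hypothetical helper: time any async operation by capturing both
// timestamps around the await, so the end time actually exists.
const timeAsync = async (fn) => {
  const responseStartTime = Date.now();
  const result = await fn();
  const responseEndTime = Date.now(); // defined here, after the await
  return { result, ms: responseEndTime - responseStartTime };
};

// Usage with a stand-in async operation instead of segment.arrayBuffer():
timeAsync(() => new Promise((resolve) => setTimeout(() => resolve("done"), 50)))
  .then(({ result, ms }) => console.log(result, ms));
```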
After simplifying, you get this:
Final solution
```javascript
import OpenAI from 'openai'

const openai = new OpenAI()

/** @param {string} text */
const createAudioSegment = (text) =>
  openai.audio.speech.create({
    model: 'tts-1',
    voice: 'echo',
    input: text,
  })

const audio_texts = ['I am text 1', 'I am text 2']

console.time('audioSegments')
const audioSegments = await Promise.all(audio_texts.map((text) => createAudioSegment(text)))
console.timeEnd('audioSegments')

console.time('audioArrayBuffers')
const audioArrayBuffers = await Promise.all(audioSegments.map((segment) => segment.arrayBuffer()))
console.timeEnd('audioArrayBuffers')

console.time('buffers')
const buffers = audioArrayBuffers.map((ab) => Buffer.from(ab))
console.timeEnd('buffers')

console.time('finalBuffer')
const finalBuffer = Buffer.concat(buffers)
console.timeEnd('finalBuffer')

console.log(finalBuffer.length, finalBuffer.byteLength)
```
Now run this code as-is and share the whole terminal output here. (I don't have access to OpenAI, so I could not test it.)
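Beyond the cleanup, one way to collapse the separate "read" phase entirely is to download each body inside the same promise that made the request, so the network cost stays attached to its request. This is only a sketch: `fakeClient` is a hypothetical stand-in so it runs without credentials (swap in the real `openai` client), and it assumes the SDK response exposes `arrayBuffer()` as in the question.

```javascript
// Stand-in client; `fakeClient` is hypothetical and simply echoes the
// input text back as bytes so the sketch runs without an API key.
const fakeClient = {
  audio: {
    speech: {
      create: async ({ input }) => ({
        arrayBuffer: async () => new TextEncoder().encode(input).buffer,
      }),
    },
  },
};

// Request and read in one step: arrayBuffer() waits for the full body,
// so each mapped promise resolves with a finished Buffer.
const createAudioBuffer = async (client, text) => {
  const response = await client.audio.speech.create({
    model: "tts-1",
    voice: "echo",
    input: text,
  });
  return Buffer.from(await response.arrayBuffer());
};

// Usage: concatenation happens once every download has completed.
Promise.all(["text 1", "text 2"].map((t) => createAudioBuffer(fakeClient, t)))
  .then((buffers) => console.log(Buffer.concat(buffers).toString())); // "text 1text 2"
```

With the real client this does not make the downloads faster, but it removes the idle gap in which finished responses sit unread while `Promise.all` waits for the request phase to end.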
Comments:
- James (Mar 13 at 18:05): In `const createAudioSegment = async ([i, segment])` the parameter is an array pattern, but when you call it with `createAudioSegment(text)` you are passing a string. Please fix/clarify.
- AKX (Mar 18 at 15:33): `response` there is just an HTTP response; you're essentially making `N` parallel calls to OpenAI, so sure, some of them might finish quicker, others more slowly.