I'm new to Swift and to native development in general. I'm working on an application that captures application and microphone audio, formats it to PCM16, and sends it to another service for processing. I'm having trouble combining the application and microphone audio; if you have experience with this, any advice would help.
Here is the code that processes the audio data:
import AVFoundation
import ScreenCaptureKit

class AudioStreamOutputHandler: NSObject, SCStreamOutput {
    private let audioEngine = AVAudioEngine()
    let deepgramService: DeepgramService

    init(deepgramService: DeepgramService) {
        self.deepgramService = deepgramService
        super.init()
    }

    nonisolated func stream(_ stream: SCStream, didStopWithError error: Error) {
        print("SCStream stopped with error: \(error.localizedDescription)")
    }

    // Called when a sample buffer is received.
    nonisolated func stream(
        _ stream: SCStream,
        didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
        of type: SCStreamOutputType
    ) {
        switch type {
        case .audio:
            guard let pcmBuffer = createPCMBuffer(from: sampleBuffer) else { return }
            if let convertedPCMData = convertTo16BitPCM(from: pcmBuffer) {
                deepgramService.sendAudioData(convertedPCMData)
            }
        case .microphone:
            // Microphone buffers are currently ignored; this is where they would
            // need to be converted and combined with the .audio stream.
            break
        default:
            break
        }
    }

    private func createPCMBuffer(from sampleBuffer: CMSampleBuffer) -> AVAudioPCMBuffer? {
        guard CMSampleBufferIsValid(sampleBuffer),
              let formatDescription = sampleBuffer.formatDescription,
              let absd = formatDescription.audioStreamBasicDescription
        else {
            NSLog("Invalid CMSampleBuffer or missing format description.")
            return nil
        }

        // Copy the AudioBufferList structure. The copied list still points at
        // sample data owned by the CMSampleBuffer, so the resulting PCM buffer
        // is only valid while the sample buffer is alive (which is the case
        // here, since it is consumed synchronously in the callback).
        var audioBufferListCopy: AudioBufferList?
        do {
            try sampleBuffer.withAudioBufferList { audioBufferList, _ in
                audioBufferListCopy = audioBufferList.unsafePointer.pointee
            }
        } catch {
            NSLog("Error accessing AudioBufferList: \(error.localizedDescription)")
            return nil
        }

        guard
            let format = AVAudioFormat(
                standardFormatWithSampleRate: absd.mSampleRate,
                channels: AVAudioChannelCount(absd.mChannelsPerFrame)
            )
        else {
            NSLog("Failed to create AVAudioFormat.")
            return nil
        }

        return AVAudioPCMBuffer(
            pcmFormat: format,
            bufferListNoCopy: &audioBufferListCopy!
        )
    }

    private func convertTo16BitPCM(from buffer: AVAudioPCMBuffer) -> Data? {
        guard let floatChannelData = buffer.floatChannelData else {
            NSLog("Failed to get floatChannelData.")
            return nil
        }
        let frameLength = Int(buffer.frameLength)
        let channelCount = Int(buffer.format.channelCount)
        var pcmData = Data(capacity: frameLength * channelCount * MemoryLayout<Int16>.size)

        // Iterate frame-by-frame so multichannel output comes out interleaved
        // (L R L R ...) rather than planar, which is what 16-bit PCM consumers
        // generally expect.
        for sampleIndex in 0..<frameLength {
            for channel in 0..<channelCount {
                let clamped = max(-1.0, min(1.0, floatChannelData[channel][sampleIndex]))
                let intSample = Int16(clamped * Float(Int16.max))
                pcmData.append(contentsOf: withUnsafeBytes(of: intSample.littleEndian) { Data($0) })
            }
        }
        return pcmData
    }
}
1 Answer
Yes! You can combine .audio and .microphone into a single PCM stream in Swift. You simply mix the two streams together by summing their samples.
With some caveats:
1. the sample timestamps may not line up
2. the stream formats may not match
3. the stream sample rates may not match
4. the sample channel counts may not match
5. the .microphone stream often contains a delayed copy of .audio
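To see which of these apply to your two streams, one quick diagnostic is to log each buffer's format as it arrives. A minimal sketch (the helper name is mine, not part of ScreenCaptureKit):
import AVFoundation
import CoreMedia

// Diagnostic sketch: log the basic stream format of an incoming sample buffer
// so the .audio and .microphone streams can be compared side by side.
func logFormat(of sampleBuffer: CMSampleBuffer, label: String) {
    guard let asbd = sampleBuffer.formatDescription?.audioStreamBasicDescription else {
        print("\(label): no audio format description")
        return
    }
    let isFloat = (asbd.mFormatFlags & kAudioFormatFlagIsFloat) != 0
    print("\(label): \(asbd.mSampleRate) Hz, \(asbd.mChannelsPerFrame) ch, "
        + "\(asbd.mBitsPerChannel)-bit \(isFloat ? "float" : "integer")")
}
Call it from your didOutputSampleBuffer handler with "audio" or "microphone" as the label.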
That's a lot of caveats! Here are some possibilities for dealing with them:
1. you could do your best to line the sample timestamps up, or you could make the simplifying assumption that the most recently arrived CMSampleBuffers are "close enough" to mix together, regardless of their timestamps.
2. you've already got code for converting float to integer samples for .audio, so you can do the same for .microphone. This is slightly odd because I thought ScreenCaptureKit .audio was often integer and .microphone was float. In any case, you can do something similar for .microphone. If you're mixing the samples as integers, take care to avoid overflow (see the mixing sketch after this list).
3. to match sample rates most people reach for an AVAudioConverter, and this can also cover points 2 and 4. Note that ScreenCaptureKit uses CMSampleBuffers instead of AVAudioPCMBuffers, so you'll need to convert. This is per stream, so you should convert to mono to allow mixing, although keep in mind that AVAudioConverter's idea of converting stereo to mono is to discard the right channel. This is probably a fine simplifying assumption (a converter sketch also follows this list).
4. see point 3.
5. When mixing microphone and system audio, if the microphone can hear the speakers then you get a second copy of the system audio in the result. This is called an echo. As far as I know, macOS doesn't have a general echo canceller available to 3rd-party apps, so I use a chunk of the webrtc code.
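For point 2, here's a minimal mixing sketch: sum two already-aligned Int16 buffers, widening to Int32 and clamping so the sum can't overflow. The function name and the assumption that both buffers share the same length, rate and channel layout are mine:
import Foundation

// Mixing sketch: sum two Int16 buffers sample-by-sample, clamping to the
// Int16 range to avoid overflow. Assumes both buffers are already at the
// same sample rate, channel layout and length.
func mixSamples(_ a: [Int16], _ b: [Int16]) -> [Int16] {
    precondition(a.count == b.count, "buffers must be the same length")
    return zip(a, b).map { x, y in
        Int16(clamping: Int32(x) + Int32(y))  // widen, sum, clamp back to Int16
    }
}
If clamping causes audible clipping, you could halve the sum instead of clamping.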
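For points 3 and 4, a rough AVAudioConverter sketch that takes an AVAudioPCMBuffer (for example the one you build in createPCMBuffer) and produces mono 16-bit PCM at a fixed rate. The 16 kHz target and the helper name are illustrative, not anything ScreenCaptureKit or Deepgram requires:
import AVFoundation

// Converter sketch: resample/remix a float PCM buffer to 16 kHz mono
// 16-bit integer PCM in a single AVAudioConverter pass.
func convertToMono16k(_ input: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    guard let outputFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                           sampleRate: 16_000,
                                           channels: 1,
                                           interleaved: true),
          let converter = AVAudioConverter(from: input.format, to: outputFormat)
    else { return nil }

    let ratio = outputFormat.sampleRate / input.format.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1
    guard let output = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: capacity)
    else { return nil }

    var fed = false
    var error: NSError?
    let status = converter.convert(to: output, error: &error) { _, outStatus in
        if fed {
            outStatus.pointee = .noDataNow   // single-buffer conversion: feed once
            return nil
        }
        fed = true
        outStatus.pointee = .haveData
        return input
    }
    return (status == .error || error != nil) ? nil : output
}
In a real app you'd keep one converter per stream rather than creating one per buffer, so the converter's internal resampler state carries over between buffers.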
One slight simplification could be not mixing the two streams at all, by using Deepgram's multichannel feature: put a mono version of .audio in the left channel and a mono version of .microphone in the right channel. This plus diarization may be useful to you. It's only a slight simplification because you still need to (maybe) synchronize, rate and format convert, and deal with echoes. A sketch of the interleaving follows.
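Here's what interleaving two already-aligned mono Int16 buffers into 2-channel PCM could look like. The function name is mine, and you'd still need to check Deepgram's docs for the exact multichannel/channel-count parameters on the live endpoint:
import Foundation

// Interleaving sketch: application audio goes to channel 0 (left) and the
// microphone to channel 1 (right), producing interleaved little-endian
// 16-bit PCM. Assumes both mono buffers are aligned and the same length.
func interleaveStereo(system: [Int16], mic: [Int16]) -> Data {
    precondition(system.count == mic.count, "channels must be the same length")
    var interleaved = [Int16]()
    interleaved.reserveCapacity(system.count * 2)
    for i in 0..<system.count {
        interleaved.append(system[i].littleEndian)  // channel 0: application audio
        interleaved.append(mic[i].littleEndian)     // channel 1: microphone
    }
    return interleaved.withUnsafeBytes { Data($0) }
}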
Good luck!