Getting Started with Audio Queues in Core Audio

Core Audio is Apple's framework for handling audio on its platforms (macOS, iOS, etc.). It contains many high-, mid-, and low-level services, each providing a different level of abstraction depending on the task being performed.

One of those services is Audio Queue Services. Audio queues stream buffers of audio samples to an audio device. The buffers can contain linear PCM (uncompressed audio samples) or compressed formats (AAC, MP3, etc.), and they are filled by a callback function whenever additional samples are needed.

Audio queue visualization, Apple

When I was integrating Core Audio into my game, I found the examples and documentation for Audio Queues to be lacking and out of date. There are books explaining Core Audio in great detail; however, I was only looking to initialize a queue and write a callback to fill samples, nothing fancy. To help anyone looking to do something similar, this post gives an example of how to set up Audio Queue Services for linear PCM playback.

The code samples are in Objective-C; however, they can easily be adapted to Swift. A full example is at the bottom of the post if you want to skip ahead.

Required Frameworks

Core Audio is split across several frameworks, depending on which services your application needs. For Audio Queue Services, two frameworks are needed: CoreAudio and AudioToolbox. CoreAudio contains the data types and interfaces shared by the other audio frameworks, while AudioToolbox contains Audio Queue Services.

The frameworks can be added to your Xcode project under "Frameworks and Libraries" in your target settings, or by adding -framework FRAMEWORK_NAME to your compile and linker commands if you are using the command line.
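For example, building a single-file test program from the Terminal might look like the following (the source and output file names here are just placeholders):

clang main.m -framework CoreAudio -framework AudioToolbox -o audioqueue_demo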

Audio Queue Setup

Audio Stream Description

Creating an Audio Queue requires a stream description:

AudioStreamBasicDescription streamDesc = {0}; // zero the fields we don't set explicitly, such as mReserved
streamDesc.mSampleRate = 48000.0f;
streamDesc.mFormatID = kAudioFormatLinearPCM;
streamDesc.mFormatFlags = kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked;
streamDesc.mBitsPerChannel = 16;
streamDesc.mChannelsPerFrame = 2;
streamDesc.mFramesPerPacket = 1;
streamDesc.mBytesPerFrame = streamDesc.mBitsPerChannel / 8 * streamDesc.mChannelsPerFrame;
streamDesc.mBytesPerPacket = streamDesc.mBytesPerFrame * streamDesc.mFramesPerPacket;

The sample rate can be any common value (48 kHz, 44.1 kHz, 16 kHz, etc.). Core Audio abstracts the sound hardware and converts to the requested sample rate as needed. This is not always the case in other frameworks, such as Windows WASAPI, where the sample rate must match the hardware. I like to use 48 kHz, since I've found most modern sound devices support that sample rate.

The format ID is set to linear PCM, which means the sample data will be played "as-is". This is the typical choice in games, where the engine mixes its audio sources itself and outputs uncompressed samples. Compressed formats (AAC, MP3, etc.) can be used instead of linear PCM, and Core Audio will decompress them before playback. This has advantages, such as hardware-accelerated decoding that can use less power on mobile devices, but compressed formats are beyond the scope of this post.

The format flags describe how the sample values are encoded: integer or float, big or little endian, signed or unsigned, and so on. Apple recommends using floats on the Mac and integers on iPhones and iPads, which are the native formats on those platforms. A non-native format requires a conversion, which can increase power usage. I use 16-bit signed integers per channel on all platforms to simplify my audio interface. With Apple's move to Apple Silicon, I suspect this recommendation will change to integers for Mac computers. The kLinearPCMFormatFlagIsPacked flag tells Core Audio that the sample bits completely fill each channel's allotted bytes, with no padding.
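If you do want to feed native floats on the Mac, only a few fields of the description change. A minimal sketch, assuming the rest of streamDesc is filled in as above (the callback would then need to write 32-bit float samples in the range -1.0 to 1.0 instead of 16-bit integers):

streamDesc.mFormatFlags = kLinearPCMFormatFlagIsFloat | kLinearPCMFormatFlagIsPacked;
streamDesc.mBitsPerChannel = 32; // one 32-bit float per channel
streamDesc.mBytesPerFrame = streamDesc.mBitsPerChannel / 8 * streamDesc.mChannelsPerFrame;
streamDesc.mBytesPerPacket = streamDesc.mBytesPerFrame * streamDesc.mFramesPerPacket;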

I found that most examples on the internet use the kLinearPCMFormatFlagIsBigEndian flag. The PowerPC architecture used big endian data types, so this flag was necessary to inform Core Audio that the samples were big endian. Intel and Apple Silicon architectures are little endian, so this flag is no longer needed.

The bits per channel and channels per frame can be set according to how you want to provide sample data. 16 bits per channel is plenty of resolution, and in this example we will provide stereo samples. The channels are provided interleaved; if your samples are not interleaved, you will have to interleave them first, as sketched below.
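A minimal sketch of that conversion, assuming hypothetical planar left[] and right[] arrays holding frameCount samples each:

// Interleave two planar channels into the packed L/R/L/R layout used in this post
for (int i = 0; i < frameCount; i++) {
    interleaved[2 * i + 0] = left[i];  // left channel
    interleaved[2 * i + 1] = right[i]; // right channel
}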

The frames per packet is set to 1. This value would be different for various compression formats, which won't be covered in this post.

The bytes per frame and bytes per packet can be calculated based on the previous values.

User Data

User data can be provided to the Audio Queue as a void * pointer, which is then accessible within the callback function. For this example, I am going to use a struct with six members: the number of frames that have been buffered, the sample rate, the volume, the left-channel frequency, the right-channel frequency, and the number of bytes per frame:

typedef struct soundState {
    float framesBuffered;
    float sampleRate;
    float volume;
    float leftFrequency;
    float rightFrequency;
    int bytesPerFrame;
} SoundState;
...
SoundState state;
state.framesBuffered = 0.0f;
state.sampleRate = streamDesc.mSampleRate;
state.volume = 0.1f;
state.leftFrequency = 150.0f;
state.rightFrequency = 250.0f;
state.bytesPerFrame = streamDesc.mBytesPerFrame;

Creating the Audio Queue

An Audio Queue for playback can be created using the AudioQueueNewOutput function:

AudioQueueRef audioQueue = NULL;
OSStatus err = AudioQueueNewOutput(&streamDesc,
    &streamCallback,
    &state,
    NULL, NULL, 0, // Callback run loop, run loop mode, and flags; NULL/0 runs the callback on the Audio Queue's internal thread
    &audioQueue);
assert(!err);

The first three arguments are the stream description, the callback function, and the user data pointer. The streamCallback function will be explained in the next section.

When the Audio Queue is no longer needed, it can be destroyed with the AudioQueueDispose function:

AudioQueueDispose(audioQueue, YES); // Destroy immediately

The second parameter controls whether the Audio Queue is immediately destroyed or if the destruction should occur after all the buffers have completed playing.
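For example, a deferred teardown (a sketch of the non-immediate case) would look like:

AudioQueueDispose(audioQueue, NO); // NO = dispose only after all queued buffers have finished playing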

The Callback Function

When an Audio Queue buffer needs samples, it will request samples from a callback function. This function is passed into AudioQueueNewOutput. The callback function looks like the following:

void streamCallback(void* userData, AudioQueueRef audioQueue, AudioQueueBufferRef buffer) {
    ...
}

The userData was supplied to AudioQueueNewOutput and can be cast back to a pointer of the original type:

SoundState* soundState = (SoundState*)userData;

We can compute the number of frames that the audio buffer can store by using the mAudioDataBytesCapacity member:

int bufferFrameCapacity = buffer->mAudioDataBytesCapacity / soundState->bytesPerFrame;

The buffer has an mAudioData member, which points to where the samples should be written. The example in this post generates a sine wave in each channel, with a different frequency per channel:

int16_t* sampleBuffer = (int16_t*)buffer->mAudioData;

for(int i = 0; i < bufferFrameCapacity; i++) {
    float t = soundState->framesBuffered / soundState->sampleRate;
    float x1 = 2.0f * 3.141592f * t * soundState->leftFrequency;
    float x2 = 2.0f * 3.141592f * t * soundState->rightFrequency;

    *(sampleBuffer++) = (int16_t)(soundState->volume * 32767.0f * sinf(x1));
    *(sampleBuffer++) = (int16_t)(soundState->volume * 32767.0f * sinf(x2));

    soundState->framesBuffered += 1;
}

The 32767.0f scales the sine wave to the range of a signed 16-bit integer, and multiplying by the volume from the sound state keeps the tones at a comfortable level.

After the samples have been written to the buffer, the buffer has to be notified how many bytes were written. This is done by setting the mAudioDataByteSize member:

buffer->mAudioDataByteSize = bufferFrameCapacity * soundState->bytesPerFrame;

If the buffer could not be filled completely because there are no more samples (e.g. a song or sound effect has finished), set mAudioDataByteSize to the number of bytes actually written and stop the Audio Queue by calling AudioQueueStop (see the sketch after the enqueue call below).

After the buffer is filled, it must be enqueued back into the Audio Queue:

AudioQueueEnqueueBuffer(audioQueue, buffer, 0, 0);

The third and fourth parameters are not used for linear PCM playback.
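Putting those pieces together, the end-of-stream case might look like the following sketch, where framesWritten is a hypothetical count of the frames the callback actually produced:

buffer->mAudioDataByteSize = framesWritten * soundState->bytesPerFrame;
AudioQueueEnqueueBuffer(audioQueue, buffer, 0, 0); // play out whatever was written
AudioQueueStop(audioQueue, NO);                    // NO = stop after all queued buffers have played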

Creating Audio Queue Buffers

After creating the Audio Queue, we need to create Audio Queue Buffers. In PortAudio, a callback-based audio abstraction library, the sample buffers are created by the library. In CoreAudio, this must be done manually using the AudioQueueAllocateBuffer function:

// Generate buffers holding at most 1/16th of a second of data
int bufferSize = streamDesc.mBytesPerFrame * (streamDesc.mSampleRate / 16);

AudioQueueBufferRef audioQueueBuffers[2];
err = AudioQueueAllocateBuffer(audioQueue, bufferSize, &(audioQueueBuffers[0]));
assert(!err);
err = AudioQueueAllocateBuffer(audioQueue, bufferSize, &(audioQueueBuffers[1]));
assert(!err);

Other Audio Queue tutorials use 1/16th of a second of samples per buffer as a good compromise between latency and reliability, and from my own experimentation that size works well for my projects. Larger buffers increase latency but are more forgiving; smaller buffers reduce latency but leave the callback less time to run, which can cause crackling if the callback takes too long to fill a buffer. The number of buffers can be experimented with as well, but there must be at least two, so one buffer can be filled while the other is playing. Additional buffers give the callback more headroom before an underrun occurs.
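If you want to experiment with the buffer duration, a small helper (a sketch; the function name is mine, not part of the API) keeps the calculation in one place:

// Bytes needed to hold `seconds` of audio for the given stream description
static int bufferSizeForDuration(const AudioStreamBasicDescription* desc, float seconds) {
    return (int)(desc->mBytesPerFrame * desc->mSampleRate * seconds);
}
...
int bufferSize = bufferSizeForDuration(&streamDesc, 1.0f / 16.0f);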

Before starting the Audio Queue, we have to fill the buffers with samples. We can do this by calling the callback function directly:

streamCallback(&state, audioQueue, audioQueueBuffers[0]);
streamCallback(&state, audioQueue, audioQueueBuffers[1]);

Starting the Audio Queue

Finally, we can begin the Audio Queue by calling AudioQueueStart:

AudioQueueStart(audioQueue, NULL);

The second parameter is an optional start time, given as a pointer to an AudioTimeStamp on the timeline of the audio device associated with the queue. Passing NULL starts playback as soon as possible.

Full Example

The example is working correctly if you hear two differently pitched tones, one in each ear.
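Below is a minimal sketch that assembles the snippets from this post into a single command-line program. It plays the two tones for a few seconds and then tears the queue down; the five-second sleep is just to keep the process alive while the audio plays, and error handling is kept to the asserts used above.

#import <AudioToolbox/AudioToolbox.h>
#import <CoreAudio/CoreAudio.h>
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <unistd.h>

typedef struct soundState {
    float framesBuffered;
    float sampleRate;
    float volume;
    float leftFrequency;
    float rightFrequency;
    int bytesPerFrame;
} SoundState;

// Fills a buffer with two sine tones (one per channel) and re-enqueues it
static void streamCallback(void* userData, AudioQueueRef audioQueue, AudioQueueBufferRef buffer) {
    SoundState* soundState = (SoundState*)userData;
    int bufferFrameCapacity = buffer->mAudioDataBytesCapacity / soundState->bytesPerFrame;
    int16_t* sampleBuffer = (int16_t*)buffer->mAudioData;

    for (int i = 0; i < bufferFrameCapacity; i++) {
        float t = soundState->framesBuffered / soundState->sampleRate;
        float x1 = 2.0f * 3.141592f * t * soundState->leftFrequency;
        float x2 = 2.0f * 3.141592f * t * soundState->rightFrequency;

        *(sampleBuffer++) = (int16_t)(soundState->volume * 32767.0f * sinf(x1));
        *(sampleBuffer++) = (int16_t)(soundState->volume * 32767.0f * sinf(x2));

        soundState->framesBuffered += 1;
    }

    buffer->mAudioDataByteSize = bufferFrameCapacity * soundState->bytesPerFrame;
    AudioQueueEnqueueBuffer(audioQueue, buffer, 0, 0);
}

int main(void) {
    AudioStreamBasicDescription streamDesc = {0};
    streamDesc.mSampleRate = 48000.0;
    streamDesc.mFormatID = kAudioFormatLinearPCM;
    streamDesc.mFormatFlags = kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked;
    streamDesc.mBitsPerChannel = 16;
    streamDesc.mChannelsPerFrame = 2;
    streamDesc.mFramesPerPacket = 1;
    streamDesc.mBytesPerFrame = streamDesc.mBitsPerChannel / 8 * streamDesc.mChannelsPerFrame;
    streamDesc.mBytesPerPacket = streamDesc.mBytesPerFrame * streamDesc.mFramesPerPacket;

    SoundState state;
    state.framesBuffered = 0.0f;
    state.sampleRate = streamDesc.mSampleRate;
    state.volume = 0.1f;
    state.leftFrequency = 150.0f;
    state.rightFrequency = 250.0f;
    state.bytesPerFrame = streamDesc.mBytesPerFrame;

    AudioQueueRef audioQueue = NULL;
    OSStatus err = AudioQueueNewOutput(&streamDesc, streamCallback, &state, NULL, NULL, 0, &audioQueue);
    assert(!err);

    // Two buffers, each holding roughly 1/16th of a second of audio
    int bufferSize = (int)(streamDesc.mBytesPerFrame * (streamDesc.mSampleRate / 16));
    AudioQueueBufferRef audioQueueBuffers[2];
    for (int i = 0; i < 2; i++) {
        err = AudioQueueAllocateBuffer(audioQueue, bufferSize, &audioQueueBuffers[i]);
        assert(!err);
        streamCallback(&state, audioQueue, audioQueueBuffers[i]); // prime the buffer before starting
    }

    err = AudioQueueStart(audioQueue, NULL);
    assert(!err);

    sleep(5); // let the tones play for a few seconds

    AudioQueueDispose(audioQueue, true); // destroy immediately
    return 0;
}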