Labels

Showing posts with label audio. Show all posts

Tuesday, 28 July 2015

Wave header format structure (audio files)...

WAVE PCM soundfile format

The WAVE file format is a subset of Microsoft's RIFF specification for the storage of multimedia files. A RIFF file starts out with a file header followed by a sequence of data chunks. A WAVE file is often just a RIFF file with a single "WAVE" chunk which consists of two sub-chunks -- a "fmt " chunk specifying the data format and a "data" chunk containing the actual sample data. Call this form the "canonical form". An almost complete description can be found at MSDN (it mostly describes the non-PCM, or registered proprietary, data formats).
 
The canonical WAVE format starts with the RIFF header:

0         4   ChunkID          Contains the letters "RIFF" in ASCII form
                               (0x52494646 big-endian form).
4         4   ChunkSize        36 + SubChunk2Size, or more precisely:
                               4 + (8 + SubChunk1Size) + (8 + SubChunk2Size)
                               This is the size of the rest of the chunk 
                               following this number.  This is the size of the 
                               entire file in bytes minus 8 bytes for the
                               two fields not included in this count:
                               ChunkID and ChunkSize.
8         4   Format           Contains the letters "WAVE"
                               (0x57415645 big-endian form).

The "WAVE" format consists of two subchunks: "fmt " and "data":
The "fmt " subchunk describes the sound data's format:

12        4   Subchunk1ID      Contains the letters "fmt "
                               (0x666d7420 big-endian form).
16        4   Subchunk1Size    16 for PCM.  This is the size of the
                               rest of the Subchunk which follows this number.
20        2   AudioFormat      PCM = 1 (i.e. Linear quantization)
                               Values other than 1 indicate some 
                               form of compression.
22        2   NumChannels      Mono = 1, Stereo = 2, etc.
24        4   SampleRate       8000, 44100, etc.
28        4   ByteRate         == SampleRate * NumChannels * BitsPerSample/8
32        2   BlockAlign       == NumChannels * BitsPerSample/8
                               The number of bytes for one sample frame,
                               i.e. one sample for each channel.
34        2   BitsPerSample    8 bits = 8, 16 bits = 16, etc.
          2   ExtraParamSize   if PCM, then doesn't exist
          X   ExtraParams      space for extra parameters

The "data" subchunk contains the size of the data and the actual sound:

36        4   Subchunk2ID      Contains the letters "data"
                               (0x64617461 big-endian form).
40        4   Subchunk2Size    == NumSamples * NumChannels * BitsPerSample/8
                               This is the number of bytes in the data.
                               You can also think of this as the size
                               of the rest of the subchunk following this
                               number.
44        *   Data             The actual sound data.
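The layout above maps directly onto a little-endian struct read. Here is a minimal sketch in Python (my illustration, not part of the original description; it assumes the canonical 44-byte PCM layout, with "fmt " immediately followed by "data" and no extra chunks in between):

```python
import struct

def parse_wav_header(data):
    """Parse the 44-byte canonical PCM WAVE header described above."""
    # RIFF chunk descriptor: ChunkID, ChunkSize, Format
    chunk_id, chunk_size, fmt = struct.unpack_from("<4sI4s", data, 0)
    assert chunk_id == b"RIFF" and fmt == b"WAVE"
    # "fmt " subchunk (numeric fields are little-endian)
    (sub1_id, sub1_size, audio_format, num_channels,
     sample_rate, byte_rate, block_align, bits) = struct.unpack_from(
        "<4sIHHIIHH", data, 12)
    assert sub1_id == b"fmt " and audio_format == 1  # PCM only
    # "data" subchunk header
    sub2_id, sub2_size = struct.unpack_from("<4sI", data, 36)
    assert sub2_id == b"data"
    return {"channels": num_channels, "sample_rate": sample_rate,
            "byte_rate": byte_rate, "block_align": block_align,
            "bits_per_sample": bits, "data_size": sub2_size}
```

Running this on the 72-byte example shown further below yields 2 channels, 22050 Hz, 16 bits per sample and a data size of 2048 bytes.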

Wave File Header - RIFF Type Chunk
Wave file headers follow the standard RIFF file format structure. The first 8 bytes in the file are a standard RIFF chunk header which has a chunk ID of "RIFF" and a chunk size equal to the file size minus the 8 bytes used by the header. The first 4 data bytes in the "RIFF" chunk determine the type of resource found in the RIFF chunk. Wave files always use "WAVE". After the RIFF type come all of the Wave file chunks that define the audio waveform.
Byte Number   Size   Description       Value
0-3           4      Chunk ID          "RIFF" (0x52494646)
4-7           4      Chunk Data Size   (file size) - 8
8-11          4      RIFF Type         "WAVE" (0x57415645)

  
Format Chunk - "fmt " 
The format chunk contains information about how the waveform data is stored and 
should be played back including the type of compression used, number of channels, 
sample rate, bits per sample and other attributes.

Byte Number   Size   Description        Value
0-3           4      Chunk ID           "fmt " (0x666D7420)
4-7           4      Chunk Data Size    Length of format chunk (always 0x10)
8-9           2      Compression code   Always 0x01 (PCM)
10-11         2      Channel Numbers    0x01 = Mono, 0x02 = Stereo
12-15         4      Sample Rate        Binary, in Hz
16-19         4      Bytes Per Second   SampleRate * BytesPerSample
20-21         2      Bytes Per Sample   1 = 8-bit Mono, 2 = 8-bit Stereo or 16-bit Mono, 4 = 16-bit Stereo
22-23         2      Bits Per Sample    8 or 16
 Data Chunk - "data"
 The Wave Data Chunk contains the digital audio sample data which can be decoded using the format and compression method specified in the Wave Format Chunk.
Byte Number   Size   Description       Value
0-3           4      Chunk ID          "data" (0x64617461)
4-7           4      Chunk Data Size   Length of data to follow
8-end         *      Data              Sound samples


 

As an example, here are the opening 72 bytes of a WAVE file with bytes shown as hexadecimal numbers:
52 49 46 46 24 08 00 00 57 41 56 45 66 6d 74 20 10 00 00 00 01 00 02 00 22 56 00 00 88 58 01 00 04 00 10 00 64 61 74 61 00 08 00 00 00 00 00 00 24 17 1e f3 3c 13 3c 14 16 f9 18 f9 34 e7 23 a6 3c f2 24 f2 11 ce 1a 0d
Here is the interpretation of these bytes as a WAVE soundfile:

ChunkID        "RIFF"
ChunkSize      2084 (0x00000824)
Format         "WAVE"
Subchunk1ID    "fmt "
Subchunk1Size  16
AudioFormat    1 (PCM)
NumChannels    2 (stereo)
SampleRate     22050 (0x00005622)
ByteRate       88200 (0x00015888)
BlockAlign     4
BitsPerSample  16
Subchunk2ID    "data"
Subchunk2Size  2048 (0x00000800)
Data           the first 28 bytes of sample data (7 stereo 16-bit sample frames)

Example2:
The easiest approach to this file format might be to look at an actual WAV file to see how data is stored. In this case, we examine DING.WAV, which is standard with all Windows packages. DING.WAV is an 8-bit, mono, 22.050 kHz WAV file, 11,598 bytes in length. Let's begin by looking at the header of the file (using DEBUG).

As expected, the file begins with the ASCII characters "RIFF", identifying it as a WAV file. The next four bytes tell us the length is 0x2D46 (11,590 in decimal), which is the length of the entire file minus the 8 bytes for the "RIFF" tag and length field (11598 - 11590 = 8 bytes).
The ASCII characters for "WAVE" and "fmt " follow. Next we find the value 0x00000010 in the first 4 bytes (length of format chunk: always constant at 0x10). The next four bytes are 0x0001 (compression code: always PCM) and 0x0001 (a mono WAV, one channel used).
Since this is an 8-bit mono WAV, the sample rate and the bytes/second are the same, at 0x00005622 or 22,050 in decimal. For a 16-bit stereo WAV the bytes/sec would be 4 times the sample rate. The next 2 bytes show the number of bytes per sample to be 0x0001 (8-bit mono) and the number of bits per sample to be 0x0008.
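The relationships used in that walk-through are easy to check. A small sketch (my illustration, not part of the original example):

```python
def wav_byte_rate(sample_rate, channels, bits_per_sample):
    """ByteRate == SampleRate * NumChannels * BitsPerSample/8."""
    return sample_rate * channels * bits_per_sample // 8

# DING.WAV: 8-bit mono at 22050 Hz -> bytes/second equals the sample rate
assert wav_byte_rate(22050, 1, 8) == 22050
# 16-bit stereo at the same rate -> four times the sample rate
assert wav_byte_rate(22050, 2, 16) == 4 * 22050
```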



Friday, 19 June 2015

what is ADTS?

Audio Data Transport Stream (ADTS) is a format used by MPEG-TS or Shoutcast to stream audio, usually AAC.

Structure

AAAAAAAA AAAABCCD EEFFFFGH HHIJKLMM MMMMMMMM MMMOOOOO OOOOOOPP (QQQQQQQQ QQQQQQQQ)
Header consists of 7 or 9 bytes (without or with CRC).
Letter Length (bits) Description
A 12 syncword 0xFFF, all bits must be 1
B 1 MPEG Version: 0 for MPEG-4, 1 for MPEG-2
C 2 Layer: always 0
D 1 protection absent: set to 1 if there is no CRC and 0 if there is a CRC
E 2 profile, the MPEG-4 Audio Object Type minus 1
F 4 MPEG-4 Sampling Frequency Index (15 is forbidden)
G 1 private bit, guaranteed never to be used by MPEG, set to 0 when encoding, ignore when decoding
H 3 MPEG-4 Channel Configuration (in the case of 0, the channel configuration is sent via an inband PCE)
I 1 originality, set to 0 when encoding, ignore when decoding
J 1 home, set to 0 when encoding, ignore when decoding
K 1 copyrighted id bit, the next bit of a centrally registered copyright identifier, set to 0 when encoding, ignore when decoding
L 1 copyright id start, signals that this frame's copyright id bit is the first bit of the copyright id, set to 0 when encoding, ignore when decoding
M 13 frame length, this value must include 7 or 9 bytes of header length: FrameLength = (ProtectionAbsent == 1 ? 7 : 9) + size(AACFrame)
O 11 Buffer fullness
P 2 Number of AAC frames (RDBs) in ADTS frame minus 1, for maximum compatibility always use 1 AAC frame per ADTS frame
Q 16 CRC if protection absent is 0
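To make the bit layout concrete, here is a minimal sketch of extracting the most useful fixed-header fields (my illustration; the single-letter names refer to the table above):

```python
def parse_adts_header(buf):
    """Parse the main fields of a 7-byte ADTS header (no CRC handling)."""
    if buf[0] != 0xFF or (buf[1] & 0xF0) != 0xF0:          # A: syncword 0xFFF
        raise ValueError("no ADTS syncword")
    mpeg_version      = (buf[1] >> 3) & 0x1                # B: 0=MPEG-4, 1=MPEG-2
    protection_absent = buf[1] & 0x1                       # D: 1 -> no CRC
    profile           = ((buf[2] >> 6) & 0x3) + 1          # E: object type = E + 1
    sampling_index    = (buf[2] >> 2) & 0xF                # F: sampling freq index
    channel_config    = ((buf[2] & 0x1) << 2) | ((buf[3] >> 6) & 0x3)  # H
    # M: 13-bit frame length, spread over bytes 3, 4 and 5
    frame_length      = ((buf[3] & 0x3) << 11) | (buf[4] << 3) | (buf[5] >> 5)
    return (mpeg_version, protection_absent, profile,
            sampling_index, channel_config, frame_length)
```

For example, a header of `FF F1 50 80 0C 80 00` decodes as MPEG-4, no CRC, AAC LC (object type 2), sampling index 4 (44100 Hz), 2 channels, frame length 100 bytes.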

Usage in MPEG-TS

An ADTS packet must be the content of a PES packet. Pack the AAC data inside an ADTS frame, then pack it inside a PES packet, then mux by the TS packetizer.

Usage in Shoutcast

ADTS frames go one after another in the TCP stream. Look for the syncword, parse the header, and look for the next syncword after it.

aac header formats

Raw AAC format contains only the audio data, with no header portions. The sampling rate, channels and object type have to be specified by the application.

ADIF: ADIF_HEADER FRAME1 FRAME2 FRAME3....

In ADIF the header is given only at the beginning; the rest is frame data. If the header portion is lost, we cannot decode the stream.
The ADIF header starts with a 32-bit ADIF code, 0x41444946 ("ADIF"), which helps the decoder know that it is an ADIF encoded stream. All the data required to decode the stream, such as sampling rate, channels, profile etc., is given inside the header.

ADTS : ADTS_HEADER FRAME1 ADTS_HEADER FRAME2 ADTS_HEADER FRAME3...

In ADTS, each frame data is preceded by a header. So even if the header portion of any frame is lost, we can still decode the stream. It is very helpful in streaming applications. ADTS header begins with a 12 bit header sync 0xFFF, which helps the decoder to know that it is an ADTS encoded stream.
ADTS header has a fixed and variable header. Fixed header consists of general stream information like sampling rate, channels, profile etc. which remains the same in every frame. Variable header has frame related information like encoded frame size, which varies with frames.

Wednesday, 3 June 2015

MP3 file format..


This is a brief and informal document targeted at those who want to deal with the MPEG format. If you are one of them, you probably already know what MPEG audio is. If not, jump to http://www.mp3.com/ or http://www.layer3.org/ where you will find more details and also more links. This document does not cover the compression and decompression algorithms.
NOTE: You cannot just search the Internet and find the MPEG audio specs. It is copyrighted and you will have to pay quite a bit to get the Paper. That's why I made this. Information I got is gathered from the Internet, and mostly originate from program sources I found available for free. Despite my intention to always specify the information sources, I am not able to do it this time. Sorry, I did not maintain the list. :-(
This is not a decoding spec; it just tells you how to read the MPEG headers and the MPEG TAG. MPEG Version 1, 2 and 2.5 and Layer I, II and III are supported, as is the MP3 TAG (ID3v1 and ID3v1.1). Those of you who use Delphi may find the MPGTools Delphi unit (freeware source) useful; it is where I implemented this stuff.
I do not claim information presented in this document is accurate. At first I just gathered it from different sources. It was not an easy task but I needed it. Later, I received lots of comments as feedback when I published this document. I think this last release is highly accurate due to comments and corrections I received.
This document is last updated on December 22, 1999.
MPEG Audio Compression Basics
This is one of many methods to compress audio in digital form trying to consume as little space as possible but keep audio quality as good as possible. MPEG compression showed up as one of the best achievements in this area.
This is a lossy compression, which means you will certainly lose some audio information when you use this compression method. But this loss can hardly be noticed because the compression method tries to control it. By using several quite complicated and demanding mathematical algorithms, it loses only those parts of the sound that are hard to hear even in the original form. This leaves more space for information that is important. This way you can compress audio up to 12 times (you may choose the compression ratio), which is really significant. Due to its quality, MPEG audio became very popular.
MPEG standards MPEG-1, MPEG-2 and MPEG-4 are known, but this document covers only the first two of them. There is an unofficial MPEG-2.5 which is rarely used. It is also covered.
MPEG-1 audio (described in ISO/IEC 11172-3) describes three Layers of audio coding with the following properties:

  • one or two audio channels
  • sample rate 32kHz, 44.1kHz or 48kHz
  • bit rates from 32kbps up to 448kbps

    MPEG Audio Frame Header
    An MPEG audio file is built up from smaller parts called frames. Generally, frames are independent items. Each frame has its own header and audio information. There is no file header. Therefore, you can cut any part of an MPEG file and play it correctly (this should be done on frame boundaries, but most applications will handle incorrect headers). For Layer III, this is not 100% correct. Due to the internal data organization of MPEG Version 1 Layer III files, frames are often dependent on each other and cannot be cut off just like that.
    When you want to read info about an MPEG file, it is usually enough to find the first frame, read its header and assume that the other frames are the same. This may not always be the case. Variable bitrate MPEG files may use so-called bitrate switching, which means that the bitrate changes according to the content of each frame. This way lower bitrates may be used in frames where doing so will not reduce sound quality. This allows better compression while keeping high sound quality.
    The frame header is constituted by the very first four bytes (32 bits) in a frame. The first eleven bits (or first twelve bits, see below about frame sync) of a frame header are always set and are called "frame sync". Therefore, you can search through the file for the first occurrence of frame sync (meaning that you have to find a byte with a value of 255, followed by a byte with its three (or four) most significant bits set). Then you read the whole header and check if the values are correct. You will see in the following table the exact meaning of each bit in the header, and which values may be checked for validity. Each value that is specified as reserved, invalid, bad, or not allowed should indicate an invalid header. Remember, this is not enough: frame sync can be easily (and very frequently) found in any binary file. Also, it is likely that an MPEG file contains garbage at its beginning which may also contain a false sync. Thus, you have to check two or more frames in a row to be sure you are really dealing with an MPEG audio file.
    Frames may have a CRC check. The CRC is 16 bits long and, if it exists, it follows the frame header. After the CRC comes the audio data. You may calculate the length of the frame and use it if you need to read other headers too or just want to calculate the CRC of the frame, to compare it with the one you read from the file. This is actually a very good method to check the MPEG header validity.
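The search described above (a byte of 255 followed by a byte with its three most significant bits set) can be sketched like this (my illustration; a real reader would then validate the whole header and check the following frames too):

```python
def find_frame_sync(data, start=0):
    """Return the offset of the first 11-bit frame sync candidate, or -1."""
    i = data.find(b"\xff", start)
    while i != -1 and i + 1 < len(data):
        # three most significant bits of the next byte must be set
        if data[i + 1] & 0xE0 == 0xE0:
            return i
        i = data.find(b"\xff", i + 1)
    return -1
```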
    Here is "graphical" presentation of the header content. Characters from A to M are used to indicate different fields. In the table, you can see details about the content of each field.
    AAAAAAAA AAABBCCD EEEEFFGH IIJJKLMM
    Sign   Length (bits)   Position (bits)   Description
    A      11              (31-21)           Frame sync (all bits set)
    B      2               (20,19)           MPEG Audio version ID
    00 - MPEG Version 2.5
    01 - reserved
    10 - MPEG Version 2 (ISO/IEC 13818-3)
    11 - MPEG Version 1 (ISO/IEC 11172-3). Note: MPEG Version 2.5 is not an official standard. Bit No 20 in the frame header is used to indicate Version 2.5. Applications that do not support this MPEG version expect this bit always to be set, meaning that frame sync (A) is twelve bits long, not eleven as stated here. Accordingly, B is one bit long (represents only bit No 19). I recommend using the methodology presented here, since it allows you to distinguish all three versions and keep full compatibility.
    C      2               (18,17)           Layer description
    00 - reserved
    01 - Layer III
    10 - Layer II
    11 - Layer I
    D      1               (16)              Protection bit
    0 - Protected by CRC (16bit crc follows header)
    1 - Not protected
    E      4               (15-12)           Bitrate index
    bits  V1,L1  V1,L2  V1,L3  V2,L1  V2,L2&L3
    0000  free   free   free   free   free
    0001  32     32     32     32     8
    0010  64     48     40     48     16
    0011  96     56     48     56     24
    0100  128    64     56     64     32
    0101  160    80     64     80     40
    0110  192    96     80     96     48
    0111  224    112    96     112    56
    1000  256    128    112    128    64
    1001  288    160    128    144    80
    1010  320    192    160    160    96
    1011  352    224    192    176    112
    1100  384    256    224    192    128
    1101  416    320    256    224    144
    1110  448    384    320    256    160
    1111  bad    bad    bad    bad    bad
    NOTES: All values are in kbps
    V1 - MPEG Version 1
    V2 - MPEG Version 2 and Version 2.5
    L1 - Layer I
    L2 - Layer II
    L3 - Layer III
    "free" means free format. If the correct fixed bitrate (such files cannot use variable bitrate) is different from those presented in the table above, it must be determined by the application. This may be implemented only for internal purposes, since third party applications have no means to find out the correct bitrate. However, this is not impossible to do, it just demands a lot of effort.
    "bad" means that this is not an allowed value
    MPEG files may have variable bitrate (VBR). This means that bitrate in the file may change. I have learned about two used methods:

  • bitrate switching. Each frame may be created with different bitrate. It may be used in all layers. Layer III decoders must support this method. Layer I & II decoders may support it.
  • bit reservoir. Bitrate may be borrowed (within limits) from previous frames in order to provide more bits to demanding parts of the input signal. This, however, means that the frames are no longer independent, so you should not cut such files. This is supported only in Layer III. You can find more about VBR on the Xing Tech site.
    For Layer II there are some combinations of bitrate and mode which are not allowed. Here is a list of allowed combinations.
    bitrate allowed modes
    free all
    32 single channel
    48 single channel
    56 single channel
    64 all
    80 single channel
    96 all
    112 all
    128 all
    160 all
    192 all
    224 stereo, intensity stereo, dual channel
    256 stereo, intensity stereo, dual channel
    320 stereo, intensity stereo, dual channel
    384 stereo, intensity stereo, dual channel
    F      2               (11,10)           Sampling rate frequency index (values are in Hz)
    bits  MPEG1     MPEG2     MPEG2.5
    00    44100     22050     11025
    01    48000     24000     12000
    10    32000     16000     8000
    11    reserved  reserved  reserved
    G      1               (9)               Padding bit
    0 - frame is not padded
    1 - frame is padded with one extra slot
    Padding is used to fit the bit rates exactly. For example: 128kbps 44.1kHz Layer II uses a lot of 418-byte frames and some 417-byte frames to get the exact 128k bitrate. For Layer I the slot is 32 bits long; for Layer II and Layer III the slot is 8 bits long.

    How to calculate frame length
    First, let's distinguish two terms: frame size and frame length. Frame size is the number of samples contained in a frame. It is constant: always 384 samples for Layer I and 1152 samples for Layer II and Layer III. Frame length is the length of a frame when compressed. It is calculated in slots. One slot is 4 bytes long for Layer I, and one byte long for Layer II and Layer III. When you are reading an MPEG file you must calculate this to be able to find each consecutive frame. Remember, frame length may change from frame to frame due to padding or bitrate switching.
    Read the BitRate, SampleRate and Padding of the frame header.
    For Layer I files use this formula:
    FrameLengthInBytes = (12 * BitRate / SampleRate + Padding) * 4
    For Layer II & III files use this formula:
    FrameLengthInBytes = 144 * BitRate / SampleRate + Padding
    Example:
    Layer III, BitRate=128000, SampleRate=44100, Padding=0
          ==>  FrameLengthInBytes=417 bytes
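The two formulas above (with integer truncation of the division) can be sketched as a small helper (my illustration):

```python
def mp3_frame_length(layer, bitrate, sample_rate, padding):
    """Frame length in bytes, using the formulas above (truncating division)."""
    if layer == 1:
        # Layer I: slots are 4 bytes long
        return (12 * bitrate // sample_rate + padding) * 4
    # Layers II and III: slots are 1 byte long
    return 144 * bitrate // sample_rate + padding

# the example above: Layer III, 128kbps at 44.1kHz, no padding -> 417 bytes
assert mp3_frame_length(3, 128000, 44100, 0) == 417
```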
    H      1               (8)               Private bit. It may be freely used for specific needs of an application, e.g. if it has to trigger some application specific events.
    I      2               (7,6)             Channel Mode
    00 - Stereo
    01 - Joint stereo (Stereo)
    10 - Dual channel (Stereo)
    11 - Single channel (Mono)
    J      2               (5,4)             Mode extension (only if Joint stereo). Mode extension is used to join information that is of no use for the stereo effect, thus reducing the needed resources. These bits are dynamically determined by an encoder in Joint stereo mode.
    The complete frequency range of an MPEG file is divided into subbands. There are 32 subbands. For Layer I & II these two bits determine the frequency range (bands) where intensity stereo is applied. For Layer III these two bits determine which type of joint stereo is used (intensity stereo or m/s stereo). The frequency range is determined within the decompression algorithm.
    Layer I and II:
    value  band range
    00     bands 4 to 31
    01     bands 8 to 31
    10     bands 12 to 31
    11     bands 16 to 31

    Layer III:
    value  Intensity stereo  MS stereo
    00     off               off
    01     on                off
    10     off               on
    11     on                on
    K      1               (3)               Copyright
    0 - Audio is not copyrighted
    1 - Audio is copyrighted
    L      1               (2)               Original
    0 - Copy of original media
    1 - Original media
    M      2               (1,0)             Emphasis
    00 - none
    01 - 50/15 ms
    10 - reserved
    11 - CCITT J.17
    MPEG Audio Tag ID3v1
    The TAG is used to describe the MPEG audio file. It contains information about artist, title, album, publishing year and genre. There is some extra space for comments. It is exactly 128 bytes long and is located at the very end of the audio data. You can get it by reading the last 128 bytes of the MPEG audio file.
    AAABBBBB BBBBBBBB BBBBBBBB BBBBBBBB
    BCCCCCCC CCCCCCCC CCCCCCCC CCCCCCCD
    DDDDDDDD DDDDDDDD DDDDDDDD DDDDDEEE
    EFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFG
    Sign   Length (bytes)   Position (bytes)   Description
    A      3                (0-2)              Tag identification. Must contain 'TAG' if tag exists and is correct.
    B      30               (3-32)             Title
    C      30               (33-62)            Artist
    D      30               (63-92)            Album
    E      4                (93-96)            Year
    F      30               (97-126)           Comment
    G      1                (127)              Genre
    The specification asks for all fields to be padded with null character (ASCII 0). However, not all applications respect this (an example is WinAmp which pads fields with <space>, ASCII 32).
    There is a small change proposed in ID3v1.1 structure. The last byte of the Comment field may be used to specify the track number of a song in an album. It should contain a null character (ASCII 0) if the information is unknown.
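Reading the tag is just slicing the last 128 bytes. A minimal sketch (my illustration; it tolerates both NUL and space padding, and applies the ID3v1.1 track-number convention described above):

```python
def read_id3v1(data):
    """Read an ID3v1/ID3v1.1 tag from the last 128 bytes of an MP3 file."""
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None  # no tag present

    def text(b):
        # fields are padded with NUL (or spaces, e.g. by WinAmp)
        return b.split(b"\x00")[0].decode("latin-1").rstrip()

    info = {"title": text(tag[3:33]), "artist": text(tag[33:63]),
            "album": text(tag[63:93]), "year": text(tag[93:97]),
            "comment": text(tag[97:127]), "genre": tag[127]}
    # ID3v1.1: if byte 28 of the comment is NUL, byte 29 holds the track number
    if tag[125] == 0 and tag[126] != 0:
        info["track"] = tag[126]
    return info
```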

    Each layer has its merits. MPEG-2 audio (described in ISO/IEC 13818-3) has two extensions to MPEG-1, usually referred to as MPEG-2/LSF and MPEG-2/Multichannel.
    MPEG-2/LSF has the following properties:

  • one or two audio channels
  • sample rates half those of MPEG-1
  • bit rates from 8kbps up to 256kbps

    MPEG-2/Multichannel has the following properties:

  • up to 5 full range audio channels and an LFE channel (Low Frequency Enhancement -- not the same as a subwoofer channel!)
  • sample rates the same as those of MPEG-1
  • bit rates up to about 1Mbps

Saturday, 30 May 2015

    RTP audio format



    Network Working Group C. Perkins
    Request for Comments: 2198 I. Kouvelas
    Category: Standards Track O. Hodson
    V. Hardman
    University College London
    M. Handley
    ISI
    J.C. Bolot
    A. Vega-Garcia
    S. Fosse-Parisis
    INRIA Sophia Antipolis
    September 1997
    RTP Payload for Redundant Audio Data
    Status of this Memo
    This document specifies an Internet standards track protocol for the
    Internet community, and requests discussion and suggestions for
    improvements. Please refer to the current edition of the "Internet
    Official Protocol Standards" (STD 1) for the standardization state
    and status of this protocol. Distribution of this memo is unlimited.
    Abstract
    This document describes a payload format for use with the real-time
    transport protocol (RTP), version 2, for encoding redundant audio
    data. The primary motivation for the scheme described herein is the
    development of audio conferencing tools for use with lossy packet
    networks such as the Internet Mbone, although this scheme is not
    limited to such applications.
    1. Introduction
    If multimedia conferencing is to become widely used by the Internet
    Mbone community, users must perceive the quality to be sufficiently
    good for most applications. We have identified a number of problems
    which impair the quality of conferences, the most significant of
    which is packet loss. This is a persistent problem, particularly
    given the increasing popularity, and therefore increasing load, of
    the Internet. The disruption of speech intelligibility even at low
    loss rates which is currently experienced may convince a whole
    generation of users that multimedia conferencing over the Internet is
    not viable. The addition of redundancy to the data stream is offered
    as a solution [1]. If a packet is lost then the missing information
    may be reconstructed at the receiver from the redundant data that
    arrives in the following packet(s), provided that the average number
    of consecutively lost packets is small. Recent work [4, 5] shows that
    packet loss patterns in the Internet are such that this scheme
    typically functions well.
    This document describes an RTP payload format for the transmission of
    audio data encoded in such a redundant fashion.
    Section 2 presents the requirements and motivation leading to the
    definition of this payload format, and does not form part of the
    payload format definition. Sections 3 onwards define the RTP payload
    format for redundant audio data.
    2. Requirements/Motivation
    The requirements for a redundant encoding scheme under RTP are as
    follows:
    o Packets have to carry a primary encoding and one or more
    redundant encodings.
    o As a multitude of encodings may be used for redundant
    information, each block of redundant encoding has to have an
    encoding type identifier.
    o As the use of variable size encodings is desirable, each encoded
    block in the packet has to have a length indicator.
    o The RTP header provides a timestamp field that corresponds to
    the time of creation of the encoded data. When redundant
    encodings are used this timestamp field can refer to the time of
    creation of the primary encoding data. Redundant blocks of data
    will correspond to different time intervals than the primary
    data, and hence each block of redundant encoding will require its
    own timestamp. To reduce the number of bytes needed to carry the
    timestamp, it can be encoded as the difference of the timestamp
    for the redundant encoding and the timestamp of the primary.
    There are two essential means by which redundant audio may be added
    to the standard RTP specification: a header extension may hold the
    redundancy, or one, or more, additional payload types may be defined.
    Including all the redundancy information for a packet in a header
    extension would make it easy for applications that do not implement
    redundancy to discard it and just process the primary encoding data.
    There are, however, a number of disadvantages with this scheme:
    o There is a large overhead from the number of bytes needed for
    the extension header (4) and the possible padding that is needed
    at the end of the extension to round up to a four byte boundary
    (up to 3 bytes). For many applications this overhead is
    unacceptable.
    o Use of the header extension limits applications to a single
    redundant encoding, unless further structure is introduced into
    the extension. This would result in further overhead.
    For these reasons, the use of RTP header extension to hold redundant
    audio encodings is disregarded.
    The RTP profile for audio and video conferences [3] lists a set of
    payload types and provides for a dynamic range of 32 encodings that
    may be defined through a conference control protocol. This leads to
    two possible schemes for assigning additional RTP payload types for
    redundant audio applications:
    1. A dynamic encoding scheme may be defined, for each combination
       of primary/redundant payload types, using the RTP dynamic payload
       type range.
    2. A single fixed payload type may be defined to represent a packet
       with redundancy. This may then be assigned to either a static
       RTP payload type, or the payload type for this may be assigned
       dynamically.
    It is possible to define a set of payload types that signify a
    particular combination of primary and secondary encodings for each of
    the 32 dynamic payload types provided. This would be a slightly
    restrictive yet feasible solution for packets with a single block of
    redundancy as the number of possible combinations is not too large.
    However the need for multiple blocks of redundancy greatly increases
    the number of encoding combinations and makes this solution not
    viable.
    A modified version of the above solution could be to decide prior to
    the beginning of a conference on a set of 32 encoding combinations
    that will be used for the duration of the conference. All tools in
    the conference can be initialized with this working set of encoding
    combinations. Communication of the working set could be made through
    the use of an external, out of band, mechanism. Setup is complicated
    as great care needs to be taken in starting tools with identical
    parameters. This scheme is more efficient as only one byte is used
    to identify combinations of encodings.
    It is felt that the complication inherent in distributing the mapping
    of payload types onto combinations of redundant data preclude the use
    of this mechanism.
    A more flexible solution is to have a single payload type which
    signifies a packet with redundancy. That packet then becomes a
    container, encapsulating multiple payloads into a single RTP packet.
    Such a scheme is flexible, since any amount of redundancy may be
    encapsulated within a single packet. There is, however, a small
    overhead since each encapsulated payload must be preceded by a header
    indicating the type of data enclosed. This is the preferred
    solution, since it is flexible and extensible, and has a relatively
    low overhead. The remainder of this document describes this
    solution.
    3. Payload Format Specification
    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
    document are to be interpreted as described in RFC 2119 [7].
    The assignment of an RTP payload type for this new packet format is
    outside the scope of this document, and will not be specified here.
    It is expected that the RTP profile for a particular class of
    applications will assign a payload type for this encoding, or if that
    is not done then a payload type in the dynamic range shall be chosen.
    An RTP packet containing redundant data shall have a standard RTP
    header, with payload type indicating redundancy. The other fields of
    the RTP header relate to the primary data block of the redundant
    data.
    Following the RTP header are a number of additional headers, defined
    in the figure below, which specify the contents of each of the
    encodings carried by the packet. Following these additional headers
    are a number of data blocks, which contain the standard RTP payload
    data for these encodings. It is noted that all the headers are
    aligned to a 32 bit boundary, but that the payload data will
    typically not be aligned. If multiple redundant encodings are
    carried in a packet, they should correspond to different time
    intervals: there is no reason to include multiple copies of data for
    a single time interval within a packet.
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |F|   block PT  |  timestamp offset         |   block length    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    Perkins, et. al. Standards Track [Page 4]


    RFC 2198
    RTP Payload for Redundant Audio Data September 1997
    The bits in the header are specified as follows:
    F: 1 bit First bit in header indicates whether another header block
    follows. If 1, further header blocks follow; if 0, this is the
    last header block.
    block PT: 7 bits RTP payload type for this block.
    timestamp offset: 14 bits Unsigned offset of timestamp of this block
    relative to timestamp given in RTP header. The use of an unsigned
    offset implies that redundant data must be sent after the primary
    data, and is hence a time to be subtracted from the current
    timestamp to determine the timestamp of the data for which this
    block is the redundancy.
    block length: 10 bits Length in bytes of the corresponding data
    block excluding header.
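    These fields pack into a single 32-bit word, so they can be extracted
    with ordinary shifts and masks. A minimal sketch in Python (the
    function name and the sample values are illustrative, not from the
    specification):

```python
import struct

def parse_redundancy_header(word: bytes):
    """Split one 4-byte additional header into its four fields."""
    (value,) = struct.unpack("!I", word)   # network byte order, 32 bits
    more_follows = bool(value >> 31)       # F: 1 bit
    block_pt     = (value >> 24) & 0x7F    # block PT: 7 bits
    ts_offset    = (value >> 10) & 0x3FFF  # timestamp offset: 14 bits
    block_length = value & 0x3FF           # block length: 10 bits
    return more_follows, block_pt, ts_offset, block_length

# Header for a 14-byte block of payload type 7, offset 160 samples,
# with at least one more header block following:
hdr = struct.pack("!I", (1 << 31) | (7 << 24) | (160 << 10) | 14)
print(parse_redundancy_header(hdr))  # (True, 7, 160, 14)
```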
    It is noted that the use of an unsigned timestamp offset limits the
    use of redundant data slightly: it is not possible to send
    redundancy before the primary encoding. This may affect schemes
    where a low bandwidth coding suitable for redundancy is produced
    early in the encoding process, and hence could feasibly be
    transmitted early. However, the addition of a sign bit would
    unacceptably reduce the range of the timestamp offset, and increasing
    the size of the field above 14 bits limits the block length field.
    It seems that limiting redundancy to be transmitted after the primary
    will cause fewer problems than limiting the size of the other fields.
    The timestamp offset for a redundant block is measured in the same
    units as the timestamp of the primary encoding (i.e., audio samples,
    with the same clock rate as the primary). The implication of this is
    that the redundant encoding MUST be sampled at the same rate as the
    primary.
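    As a worked example of the subtraction rule above, for an 8 kHz
    stream sending 20 ms (160-sample) frames (the timestamp values are
    hypothetical):

```python
# The redundant block's timestamp is the RTP header timestamp minus
# the (unsigned) offset carried in the block header.
primary_ts = 48160   # timestamp from the RTP header (illustrative)
ts_offset = 160      # one 20 ms frame at 8 kHz
redundant_ts = primary_ts - ts_offset
print(redundant_ts)  # 48000
```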
    It is further noted that the block length and timestamp offset are 10
    bits and 14 bits respectively, rather than the more obvious 8 and 16
    bits. Whilst such an encoding complicates parsing the header
    information slightly, and adds some additional processing overhead,
    there are a number of problems involved with the more obvious choice:
    An 8 bit block length field is sufficient for most, but not all,
    possible encodings: for example 80ms PCM and DVI audio packets
    comprise more than 256 bytes, and cannot be encoded with a single
    byte length field. It is possible to impose additional structure on
    the block length field (for example the high bit set could imply the
    lower 7 bits code a length in words, rather than bytes), however such
    schemes are complex. The use of a 10 bit block length field retains
    applications which require this information assume that the CSRC data
    in the RTP header may be applied to the reconstructed redundant data.
    5.  Relation to SDP
    When a redundant payload is used, it may need to be bound to an RTP
    dynamic payload type. This may be achieved through any out-of-band
    mechanism, but one common way is to communicate this binding using
    the Session Description Protocol (SDP) [6]. SDP has a mechanism for
    binding a dynamic payload type to a particular codec, sample rate,
    and number of channels using the "rtpmap" attribute. An example of
    its use (using the RTP audio/video profile [3]) is:
    m=audio 12345 RTP/AVP 121 0 5
    a=rtpmap:121 red/8000/1
    This specifies that an audio stream using RTP is using payload types
    121 (a dynamic payload type), 0 (PCM u-law) and 5 (DVI). The "rtpmap"
    attribute binds payload type 121 to the codec name "red", indicating
    that this payload is actually a redundancy format, with an 8 kHz
    clock and a single (mono) channel. When used with SDP, the name
    "red" indicates the redundancy format discussed in this document.
    In this case the additional formats of PCM and DVI are specified.
    The receiver must therefore be prepared to use these formats. Such a
    specification means the sender will send redundancy by default, but
    also may send PCM or DVI. However, with a redundant payload we
    additionally take this to mean that no codec other than PCM or DVI
    will be used in the redundant encodings. Note that the additional
    payload formats defined in the "m=" field may themselves be dynamic
    payload types, and if so a number of additional "a=" attributes may
    be required to describe these dynamic payload types.
    To receive a redundant stream, this is all that is required. However
    to send a redundant stream, the sender needs to know which codecs are
    recommended for the primary and secondary (and tertiary, etc)
    encodings. This information is specific to the redundancy format,
    and is specified using an additional attribute "fmtp" which conveys
    format-specific information. A session directory does not parse the
    values specified in an fmtp attribute but merely hands it to the
    media tool unchanged. For redundancy, we define the format
    parameters to be a slash "/" separated list of RTP payload types.
    Thus a complete example is:
    m=audio 12345 RTP/AVP 121 0 5
    a=rtpmap:121 red/8000/1
    a=fmtp:121 0/5
    This specifies that the default format for senders is redundancy with
    PCM as the primary encoding and DVI as the secondary encoding.
    Encodings cannot be specified in the fmtp attribute unless they are
    also specified as valid encodings on the media ("m=") line.
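    A small sketch of how a receiver might pull the redundancy payload
    type and its "fmtp" list out of such an SDP fragment (the helper
    name is hypothetical; only the "rtpmap" and "fmtp" attributes shown
    in the example above are handled):

```python
def parse_red_fmtp(sdp_lines):
    """Return (redundancy payload type, [primary PT, secondary PT, ...])."""
    rtpmap, fmtp = {}, {}
    for line in sdp_lines:
        if line.startswith("a=rtpmap:"):
            pt, codec = line[len("a=rtpmap:"):].split(None, 1)
            rtpmap[int(pt)] = codec
        elif line.startswith("a=fmtp:"):
            pt, params = line[len("a=fmtp:"):].split(None, 1)
            # fmtp value for "red" is a slash-separated payload type list
            fmtp[int(pt)] = [int(p) for p in params.split("/")]
    red_pt = next(pt for pt, codec in rtpmap.items()
                  if codec.startswith("red/"))
    return red_pt, fmtp[red_pt]

sdp = ["m=audio 12345 RTP/AVP 121 0 5",
       "a=rtpmap:121 red/8000/1",
       "a=fmtp:121 0/5"]
print(parse_red_fmtp(sdp))  # (121, [0, 5]) -> PCM primary, DVI secondary
```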
    6.  Security Considerations
    RTP packets containing redundant information are subject to the
    security considerations discussed in the RTP specification [2] and
    any appropriate RTP profile (for example [3]). This implies that
    confidentiality of the media streams is achieved by encryption.
    Encryption of a redundant data stream may occur in two ways:
    1. The entire stream is to be secured, and all participants are
       expected to have keys to decode the entire stream. In this case,
       nothing special need be done, and encryption is performed in the
       usual manner.

    2. A portion of the stream is to be encrypted with a different key
       from the remainder. In this case a redundant copy of the last
       packet of that portion cannot be sent, since there is no
       following packet encrypted with the correct key in which to send
       it. Similar limitations may occur when enabling or disabling
       encryption.
    The choice between these two is a matter for the encoder only.
    Decoders can decrypt either form without modification.
    Whilst the addition of low-bandwidth redundancy to an audio stream is
    an effective means by which that stream may be protected against
    packet loss, application designers should be aware that the addition
    of large amounts of redundancy will increase network congestion, and
    hence packet loss, leading to a worsening of the problem which the
    use of redundancy was intended to solve. At its worst, this can lead
    to excessive network congestion and may constitute a denial of
    service attack.
    7.  Example Packet
    An RTP audio data packet containing a DVI4 (8 kHz) primary encoding
    and a single block of redundancy encoded using 8 kHz LPC (both 20ms
    packets), as defined in the RTP audio/video profile [3], is
    illustrated below:
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |V=2|P|X| CC=0  |M|      PT     |   sequence number of primary  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |              timestamp of primary encoding                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |          synchronization source (SSRC) identifier             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |1| block PT=7  |  timestamp offset         |   block length    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |0| block PT=5  |                                               |
    +-+-+-+-+-+-+-+-+                                               +
    |                                                               |
    +                LPC encoded redundant data (PT=7)              +
    |                (14 bytes)                                     |
    +                                               +---------------+
    |                                               |               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
    |                                                               |
    +                                                               +
    |                                                               |
    +                                                               +
    |                                                               |
    +                                                               +
    |                DVI4 encoded primary data (PT=5)               |
    +                (84 bytes, not to scale)                       +
    /                                                               /
    +                                                               +
    |                                                               |
    +                                                               +
    |                                                               |
    +                                               +---------------+
    |                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
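    The packet above can be assembled byte-for-byte with fixed-format
    packing. The following sketch follows the field layout in the
    diagram; the function name and the zero-filled placeholder payloads
    are illustrative:

```python
import struct

def build_red_packet(seq, ts, ssrc, red_pt, primary, redundant):
    """Assemble a redundancy packet: 12-byte RTP header (PT = red_pt),
    a 4-byte header for the redundant block, a 1-byte header for the
    primary (F=0 plus payload type), then the two data blocks."""
    prim_pt, prim_data = primary
    red_block_pt, ts_offset, red_data = redundant
    pkt = struct.pack("!BBHII",
                      0x80,           # V=2, P=0, X=0, CC=0
                      red_pt & 0x7F,  # M=0, PT
                      seq, ts, ssrc)
    pkt += struct.pack("!I", (1 << 31) | (red_block_pt << 24) |
                             (ts_offset << 10) | len(red_data))
    pkt += struct.pack("!B", prim_pt & 0x7F)  # F=0, primary block PT
    return pkt + red_data + prim_data

pkt = build_red_packet(1000, 48160, 0x12345678, 121,
                       primary=(5, b"\x00" * 84),         # 20 ms DVI4
                       redundant=(7, 160, b"\x00" * 14))  # 20 ms LPC
print(len(pkt))  # 12 + 4 + 1 + 14 + 84 = 115
```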
    8.  Authors' Addresses
    Colin Perkins/Isidor Kouvelas/Orion Hodson/Vicky Hardman
    Department of Computer Science
    University College London
    London WC1E 6BT
    United Kingdom
    EMail: {c.perkins|i.kouvelas|o.hodson|v.hardman}@cs.ucl.ac.uk
    Mark Handley
    USC Information Sciences Institute
    c/o MIT Laboratory for Computer Science
    545 Technology Square
    Cambridge, MA 02139, USA
    EMail: mjh@isi.edu
    Jean-Chrysostome Bolot/Andres Vega-Garcia/Sacha Fosse-Parisis
    INRIA Sophia Antipolis
    2004 Route des Lucioles, BP 93
    06902 Sophia Antipolis
    France
    EMail: {bolot|avega|sfosse}@sophia.inria.fr
    9.  References
    [1] Hardman, V.J., Sasse, M.A., Handley, M., and A. Watson,
        "Reliable Audio for Use over the Internet", Proceedings of
        INET'95, Honolulu, Oahu, Hawaii, September 1995.
        http://www.isoc.org/in95prc/

    [2] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications",
        RFC 1889, January 1996.

    [3] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
        with Minimal Control", RFC 1890, January 1996.

    [4] Yajnik, M., Kurose, J., and D. Towsley, "Packet loss correlation
        in the MBone multicast network", IEEE Globecom Internet
        Workshop, London, November 1996.

    [5] Bolot, J.-C. and A. Vega-Garcia, "The case for FEC-based error
        control for packet audio in the Internet", ACM Multimedia
        Systems, 1997.

    [6] Handley, M., and V. Jacobson, "SDP: Session Description Protocol
        (draft 03.2)", Work in Progress.

    [7] Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, March 1997.