MPEG-TS outline specification:
- 1. Introduction
- 1.1. Packetized elementary streams (PES)
- 1.2. Time Stamp
- 1.3. MPEG transport stream (MPEG-TS)
- 2. Multiplexing
- 3. De-Multiplexing
- 3.1. Audio-Video synchronization
- 4. Results
- 4.1. Buffer fullness
- 4.2. Synchronization and skew calculation
- 5. Conclusions
1. Introduction
H.264[5,48,51] is the latest and most advanced video codec available today. It was jointly developed by the Video Coding Experts Group (VCEG) of the ITU-T (International Telecommunication Union) and the Moving Picture Experts Group (MPEG) of ISO/IEC (International Organization for Standardization / International Electrotechnical Commission). This standard achieves much greater compression than its predecessors such as MPEG-2 Video[37] and MPEG-4 Part 2 Visual[38], but the higher coding efficiency comes at the cost of increased complexity. H.264 has been adopted as the video standard for many applications around the world, including ATSC[21]. H.264 covers only video coding and is of little use unless the video is accompanied by audio. Hence it is relevant and practical to encode/decode and multiplex/demultiplex both video and audio for replay at the receiver.
HE AAC v2[49,50], or High Efficiency Advanced Audio Coding version 2, also known as enhanced aacPlus, is a low bit rate audio codec defined in the MPEG-4 audio profile[2] and belonging to the AAC family. It is specifically designed for low bit rate applications such as streaming and mobile broadcasting. HE AAC v2 is one of the most efficient audio compression tools available today. It comes with a fully featured toolset that enables coding in mono, stereo and multichannel modes (up to 48 channels). HE AAC v2[7] is the adopted audio standard for ATSC-M/H and many other systems around the world.
The encoded bit streams, or elementary streams, of H.264 and HE AAC v2 are arranged as sequences of access units. An access unit is the coded representation of one frame. Since each frame is coded differently, the size of each access unit also varies. In order to transmit multimedia content (audio and video) across a channel, the two streams have to be converted into a single stream of fixed-size packets. For this the elementary streams undergo two layers of packetization (Fig. 1). The first layer of packetization yields the Packetized Elementary Stream (PES), and the second layer, where the actual multiplexing takes place, results in a stream of fixed-size packets called the Transport Stream (TS). These TS packets are what are actually transmitted across the network using broadcast techniques such as those used in ATSC and DVB[16].
1.1. Packetized elementary streams (PES)
PES packets are obtained after the first layer of packetization of coded audio and coded video data. This packetization is carried out by sequentially separating the audio and video elementary streams into access units, so each PES packet is an encapsulation of one frame of coded data. Each PES packet contains a packet header and payload data from only one particular stream; the PES header contains information that distinguishes audio from video PES packets. Since the number of bits used to represent a frame in the bit stream varies (for both audio and video), the size of the PES packets also varies. Figure 2 shows how the elementary stream is converted into a PES stream.
The PES header format used is shown in Table 1. The PES header starts with a 3-byte packet start code prefix, which is always 0x000001, followed by a 1-byte stream id. The stream id uniquely identifies a particular stream; the stream id together with the start code prefix is known as the start code (4 bytes). The PES packet length varies and can go up to 65535 bytes, the largest value a 2-byte field can hold. For a longer elementary stream the packet length may be set to 0, meaning unbounded, but only in the case of a video stream. The next two bytes in the header form the time stamp field, which carries the playback time information. In the proposed method, the frame number is used to calculate the playback time, as explained next.
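As an illustration, the following Python sketch builds one PES packet in the simplified format of Table 1. It is a minimal sketch based on the description above, not the paper's implementation; the stream id constants and the handling of over-long frames are assumptions.

```python
import struct

START_CODE_PREFIX = b"\x00\x00\x01"
STREAM_ID_VIDEO = 0xE0   # video stream ids fall in 0xE0-0xEF (assumed example value)
STREAM_ID_AUDIO = 0xC0   # audio stream ids fall in 0xC0-0xDF (assumed example value)

def build_pes_packet(access_unit: bytes, stream_id: int, frame_number: int) -> bytes:
    """Wrap one coded frame (access unit) in the simplified 8-byte PES header."""
    length = len(access_unit)
    if length > 0xFFFF:
        length = 0  # "unbounded": allowed only for video elementary streams
    return (START_CODE_PREFIX
            + struct.pack(">B", stream_id)               # 1-byte stream id
            + struct.pack(">H", length)                  # 2-byte PES packet length
            + struct.pack(">H", frame_number & 0xFFFF)   # 2-byte time stamp
            + access_unit)
```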
1.2. Time Stamp
Time stamps indicate where a particular access unit belongs in time. Audio-video synchronization is obtained by incorporating time stamps into the headers of both video and audio PES packets.

Traditionally, to enable the decoder to maintain synchronization between the audio track and the video frames, a 33-bit encoder clock sample called the Program Clock Reference (PCR) is transmitted in the adaptation field of a TS packet from time to time (every 100 ms). This, along with the presentation time stamp (PTS) field that resides in the PES packet layer of the transport stream, is used to synchronize the audio and video elementary streams.
The proposed method instead uses the frame numbers of both audio and video as time stamps to synchronize the streams. As explained earlier, both H.264 and HE AAC v2 bit streams are organized into access units, i.e. frames, separated by their respective sync sequences. A particular video sequence has a fixed frame rate during playback, specified in frames per second (fps). So, assuming that the decoder has prior knowledge of the fps of the video sequence, the presentation time (PT), or playback time, of video frame number $n_v$ can be calculated using (1):

$PT_{video} = \dfrac{n_v}{fps}$ (1)
The AAC compression standard defines each audio frame to contain 1024 samples per channel, and this holds for HE AAC v2[2,3,7] as well. The sampling frequency of the audio stream can be extracted from the sampling frequency index field of the ADTS header and remains the same for a particular audio stream. Since both the samples per frame and the sampling frequency are fixed, the audio frame rate also remains constant throughout a particular audio stream. Hence the presentation time (PT) of audio frame number $n_a$ can be calculated as follows:
$PT_{audio} = \dfrac{n_a \times 1024}{f_s}$ (2)

where $f_s$ is the sampling frequency.
Because the 1024 samples per frame are counted per channel, the same expression applies unchanged to multichannel audio streams.
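A short sketch of (1) and (2); the fps and sampling rate values in the example are illustrative only:

```python
def video_pt(n: int, fps: float) -> float:
    """Presentation time of video frame n, per (1)."""
    return n / fps

def audio_pt(n: int, sampling_rate: int) -> float:
    """Presentation time of audio frame n, per (2): 1024 samples per channel per frame."""
    return n * 1024 / sampling_rate

# Example: at 25 fps, video frame 250 plays at 10.0 s;
# at 44100 Hz, audio frame 431 plays at ~10.008 s.
print(video_pt(250, 25.0), audio_pt(431, 44100))
```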
Table 1. PES header format[4]

| Name | Size (in bytes) | Description |
| --- | --- | --- |
| Packet start code prefix | 3 | Always 0x000001 |
| Stream id | 1 | Unique ID to distinguish audio and video PES packets. Examples: audio streams 0xC0-0xDF, video streams 0xE0-0xEF[3] |
| PES packet length | 2 | The PES packet can be of any length. A value of zero (unbounded) may be used only when the PES packet payload is a video elementary stream |
| Time stamp | 2 | Frame number |

Note: the first four bytes together (start code prefix plus stream id) are known as the start code.
Once the presentation time of one stream is calculated, the frame number of the second stream that has to be played at that particular time can be calculated. This approach is used at the decoder to achieve audio-video synchronization (lip synchronization), as explained in detail later on.

Using frame numbers as time stamps has many advantages over the traditional PCR approach: there is no need to send additional Transport Stream (TS) packets with PCR information, the overall complexity is reduced, clock jitter need not be considered during synchronization, and the time stamp field in the PES packet is smaller, just 16 bits to encode a frame number compared to 33 bits for the Presentation Time Stamp (PTS), which carries a sample of the encoder clock. The time stamp field in the proposed method is encoded in 2 bytes in the PES header, so it can carry frame numbers up to 65535. Once the frame number of either stream exceeds this limit, which is a possibility for long video and audio sequences, the frame numbers are reset to 1. The reset is applied simultaneously to both the audio and video frame counters as soon as either one crosses the limit. This does not create a frame number conflict at the de-multiplexer during synchronization, because the audio and video buffer sizes are much smaller than the maximum allowed frame number.
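A minimal sketch of the reset rule (the constant and function names are illustrative):

```python
MAX_FRAME_NUMBER = 0xFFFF  # largest frame number a 2-byte time stamp can carry

def advance_frame_counters(video_n: int, audio_n: int) -> tuple[int, int]:
    """Reset both counters together as soon as either crosses the 16-bit limit."""
    if video_n > MAX_FRAME_NUMBER or audio_n > MAX_FRAME_NUMBER:
        return 1, 1   # simultaneous reset keeps the two streams consistent
    return video_n, audio_n
```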
1.3. MPEG transport stream (MPEG-TS)
PES packets are of variable size and are difficult to multiplex and transmit over an error-prone network. Hence they undergo one more layer of packetization, which results in Transport Stream (TS) packets.
MPEG Transport Streams (MPEG-TS)[4] use a fixed packet size, and a packet identifier identifies each transport packet within the transport stream, indicating which packetized elementary stream (PES), audio or video, the packet belongs to. Each TS packet is 188 bytes long, including header and payload data. Each PES packet may be broken down into a number of TS packets, since a PES packet, which represents an access unit (a frame) of the elementary stream, is usually much larger than 188 bytes. Also, a particular TS packet must contain data from only one particular PES. The TS packet header (Fig. 3) is three bytes long; it has been slightly modified from the standard TS header format for simplicity, although the framework remains the same.
The sync byte (0x47) indicates the start of a new TS packet. It is followed by the payload unit start indicator (PUSI) flag, which when set indicates that the data payload contains the start of a new PES packet. The Adaptation Field Control (AFC) flag, when set, indicates that not all of the allotted 185 payload bytes are occupied by PES data; this occurs when the remaining PES data is less than 185 bytes. In that case the unoccupied bytes of the payload are filled with filler data (all zeros or all ones), and the length of the filler is stored in a byte called the offset, placed right after the TS header. The offset is calculated as 185 minus the length of the PES data. The Continuity Counter (CC) is a 4-bit field that is incremented by the multiplexer for each TS packet sent for a particular stream, i.e. audio PES or video PES; this information is used at the de-multiplexer to determine whether any packets are lost, repeated or out of sequence. The Packet ID (PID) is a unique 10-bit identifier that describes the particular stream to which the data payload of the TS packet belongs.
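The sketch below packs and parses this modified 3-byte header. The field widths follow the description above, but the exact bit positions are an assumption, since Fig. 3 is not reproduced here.

```python
SYNC_BYTE = 0x47

def pack_ts_header(pusi: int, afc: int, cc: int, pid: int) -> bytes:
    """3-byte header: sync byte, then PUSI(1) | AFC(1) | CC(4) | PID(10)."""
    word = ((pusi & 1) << 15) | ((afc & 1) << 14) | ((cc & 0xF) << 10) | (pid & 0x3FF)
    return bytes([SYNC_BYTE, word >> 8, word & 0xFF])

def parse_ts_header(packet: bytes):
    """Return the header fields, or None if the sync byte is wrong."""
    if packet[0] != SYNC_BYTE:
        return None          # invalid packet: the de-multiplexer skips it
    word = (packet[1] << 8) | packet[2]
    return {"pusi": word >> 15, "afc": (word >> 14) & 1,
            "cc": (word >> 10) & 0xF, "pid": word & 0x3FF}
```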
2. Multiplexing
Multiplexing is the process whereby Transport Stream (TS) packets are generated and transmitted in such a way that the data buffers at the decoder (de-multiplexer) neither overflow nor underflow. Overflow or underflow of the video and audio buffers can cause skips or freeze/mute errors in video and audio playback.
The flow chart of the proposed multiplexing scheme is shown in Figures 4 and 5. The basic logic relies on both the audio and the video sequence having a constant frame rate. For video, the frames-per-second value remains the same throughout the sequence. For audio, since the sampling frequency remains constant throughout the sequence and the samples per frame are fixed (1024 per channel), the frame duration also remains constant.
In the audio/video processing block (Fig. 5), the first step is to check whether the multiplexer is still in the middle of a frame or at the beginning of a new frame. If a new frame is being processed, (4) or (7) is executed as appropriate to find the TS duration; this information is used to update the TS presentation time at a later stage. Next, data is read from the PES packet concerned; if the PES packet is larger than 185 bytes, only the first 185 bytes are read out and the PES packet is adjusted accordingly. If the current TS packet is the last packet for that PES packet, a new PES packet for the next frame of that stream is generated. Now the 185-byte payload and all the remaining information are ready for generating the transport stream (TS) packet.
Once a TS packet is generated, the TS presentation time is updated using (5) or (8). Control then returns to the presentation time decision block, and the entire process is repeated until all the video and audio frames have been transmitted.
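Equations (4)-(8) are not reproduced here; the sketch below captures only the scheduling idea of Figs. 4 and 5 under the constant-frame-rate assumption: repeatedly service whichever stream's presentation clock is behind, emitting one TS payload (up to 185 bytes) at a time. The data structures and PID values are illustrative, not the paper's implementation.

```python
# Simplified multiplexer loop: interleave TS payloads so that the stream
# with the earlier presentation time is always serviced first.
def multiplex(video_frames, audio_frames, fps, sampling_rate):
    """video_frames / audio_frames: lists of coded frames (bytes objects)."""
    streams = [
        {"frames": video_frames, "pt": 0.0, "dur": 1.0 / fps,              # cf. (4)/(5)
         "idx": 0, "off": 0, "pid": 0x010},
        {"frames": audio_frames, "pt": 0.0, "dur": 1024.0 / sampling_rate, # cf. (7)/(8)
         "idx": 0, "off": 0, "pid": 0x011},
    ]
    packets = []
    while any(s["idx"] < len(s["frames"]) for s in streams):
        live = [s for s in streams if s["idx"] < len(s["frames"])]
        s = min(live, key=lambda t: t["pt"])        # presentation time decision
        frame = s["frames"][s["idx"]]
        chunk = frame[s["off"]:s["off"] + 185]      # one TS payload's worth
        packets.append((s["pid"], chunk))           # header packing omitted here
        s["off"] += len(chunk)
        if s["off"] >= len(frame):                  # last TS packet of this frame:
            s["idx"], s["off"] = s["idx"] + 1, 0    # move on to the next PES packet
            s["pt"] += s["dur"]                     # update the presentation time
    return packets
```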
3. De-Multiplexing
The Transport Stream (TS) input to a receiver is separated into a video elementary stream and an audio elementary stream by the de-multiplexer. At this stage, the video and audio elementary streams are temporarily stored in the video and audio buffers, respectively.
The basic flow chart of the de-multiplexer is shown in Figure 6. After a TS packet is received, it is checked for the sync byte (0x47) to determine whether the packet is valid. If it is invalid, that packet is skipped and de-multiplexing continues with the next packet. From a valid TS packet header, fields such as the packet ID (PID), the adaptation field control (AFC) flag, the payload unit start (PUS) flag and the 4-bit continuity counter are extracted. The payload is then prepared to be read into the appropriate buffer. The AFC flag tells whether an offset value has to be read or whether all 185 payload bytes in the TS packet carry PES data; if the AFC flag is set, the payload is extracted by skipping the stuffing bytes.
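A minimal sketch of this validation and payload-extraction step, reusing the bit layout assumed in the earlier header sketch (the offset interpretation follows the formula given above):

```python
SYNC_BYTE = 0x47

def extract_payload(packet: bytes):
    """Return the PES bytes carried by one 188-byte TS packet, or None if invalid."""
    if len(packet) != 188 or packet[0] != SYNC_BYTE:
        return None                        # invalid packet: skip it
    word = (packet[1] << 8) | packet[2]
    afc = (word >> 14) & 1
    if not afc:
        return packet[3:]                  # all 185 payload bytes are PES data
    offset = packet[3]                     # offset = 185 - length of PES data
    return packet[4:4 + 185 - offset]      # drop the stuffing bytes
```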
The payload unit start (PUS) bit is checked to see whether the present TS packet contains a PES header. If so, the PES header is first checked for the presence of the sync sequence (i.e. 0x000001); if it is absent, the packet is discarded and the next TS packet is processed. If the header is valid, it is read and fields such as the stream ID, the PES length and the frame number are extracted. The PID is then checked to see whether it is an audio or a video TS packet, and once this decision is made the payload is written into the corresponding buffer. If the TS packet payload contained a PES header, information such as the frame number, its location in the corresponding buffer and the PES length is stored in a separate array, which is later used for synchronizing the audio and video streams.
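The PES-level fields can then be pulled out of any payload that starts a new PES packet; this follows Table 1 directly (the field offsets assume the simplified 8-byte header sketched earlier):

```python
def parse_pes_header(payload: bytes):
    """Extract the Table 1 fields from a payload that starts a PES packet."""
    if payload[:3] != b"\x00\x00\x01":
        return None                     # bad sync sequence: discard this packet
    return {"stream_id": payload[3],
            "pes_length": int.from_bytes(payload[4:6], "big"),
            "frame_number": int.from_bytes(payload[6:8], "big")}
```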
Once the payload has been written into the audio or video buffer, the video buffer is checked for fullness. Since video files are always much larger than audio files, the video buffer fills up first. Once the video buffer is full, the next IDR frame occurring in the video buffer is searched for. Once found, its frame number is noted and used to calculate the corresponding audio frame number (AF) that has to be played at that time, given by (9):
$AF = \dfrac{n_{IDR} \times f_s}{fps \times 1024}$ (9)

where $n_{IDR}$ is the IDR frame number and the result is rounded to a whole audio frame; (9) follows directly from equating (1) and (2).
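A sketch of (9); rounding to the nearest frame is an assumption:

```python
def matching_audio_frame(idr_frame_number: int, fps: float, sampling_rate: int) -> int:
    """Audio frame to play at the IDR frame's presentation time, cf. (9)."""
    pt = idr_frame_number / fps                 # IDR presentation time, from (1)
    return round(pt * sampling_rate / 1024)     # invert (2); rounding mode assumed
```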
Equation (9) is used to synchronize the audio and video streams. Once the frame numbers are obtained, the audio and video elementary streams can be reconstructed by writing the audio and video buffer contents from that point (frame) onward into their respective elementary streams, i.e. .aac and .264 files respectively. The streams are then merged into a container format using mkvmerge[31], a freely available tool. The resulting container file can be played back by media players such as VLC media player[32] or GOM media player[33]. In the case of the video sequence, to ensure proper playback, the picture parameter set (PPS) and sequence parameter set (SPS) must be inserted before the first IDR frame, because both PPS and SPS are used by the decoder to determine the encoding parameters that were used.
The reason that de-multiplexing starts from an IDR (instantaneous decoder refresh) frame is that the IDR frame breaks the video sequence, guaranteeing that later frames such as P- or B-frames do not reference frames before the IDR frame for prediction. This is not true of a normal I-frame. So in a long sequence, the GOPs after an IDR frame are treated as new sequences by the H.264 decoder. In the case of audio, the HE AAC v2 decoder can play back the sequence from any audio frame.