Monday, 16 March 2015

How Video Syncs

So this whole time, we've had an essentially useless movie player. It plays the video, yeah, and it plays the audio, yeah, but it's not quite yet what we would call a movie. So what do we do?

PTS and DTS

Fortunately, both the audio and video streams carry information about how fast and when they are supposed to be played. Audio streams have a sample rate, and video streams have a frames-per-second value. However, if we simply synced the video by counting frames and multiplying by the frame duration (one over the frame rate), there is a chance that it would drift out of sync with the audio. Instead, packets from the stream might have what is called a decoding time stamp (DTS) and a presentation time stamp (PTS). To understand these two values, you need to know about the way movies are stored. Some formats, like MPEG, use what they call "B" frames (B stands for "bidirectional"). The two other kinds of frames are called "I" frames and "P" frames ("I" for "intra" and "P" for "predicted"). I frames contain a full image. P frames depend upon previous I and P frames and are like diffs or deltas. B frames are the same as P frames, but depend upon information found in frames that are displayed both before and after them! This explains why we might not have a finished frame after we call avcodec_decode_video2.
So let's say we had a movie, and the frames were displayed like: I B B P. Now, we need to know the information in P before we can display either B frame. Because of this, the frames might be stored like this: I P B B. This is why we have a decoding timestamp and a presentation timestamp on each frame. The decoding timestamp tells us when we need to decode something, and the presentation time stamp tells us when we need to display something. So, in this case, our stream might look like this:
   PTS: 1 4 2 3
   DTS: 1 2 3 4
Stream: I P B B
Generally the PTS and DTS will only differ when the stream we are playing has B frames in it.
When we get a packet from av_read_frame(), it will contain the PTS and DTS values for the information inside that packet. But what we really want is the PTS of our newly decoded raw frame, so we know when to display it.
Fortunately, FFmpeg supplies us with a "best effort" timestamp, which you can get via av_frame_get_best_effort_timestamp().
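To make the reordering concrete, here is a small sketch in plain C (no FFmpeg involved) that takes the four frames from the I P B B example in decode order and sorts them into display order by PTS. The DemoFrame struct and the sort_by_pts helper are hypothetical names made up for this illustration; a real decoder does this reordering internally and hands you frames that are already in presentation order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a decoded frame: a type letter plus timestamps. */
typedef struct DemoFrame {
  char type;  /* 'I', 'P' or 'B' */
  long dts;   /* decoding timestamp (order in the stream) */
  long pts;   /* presentation timestamp (order on screen) */
} DemoFrame;

/* qsort comparator: ascending PTS. */
static int compare_pts(const void *a, const void *b) {
  const DemoFrame *fa = (const DemoFrame *)a;
  const DemoFrame *fb = (const DemoFrame *)b;
  return (fa->pts > fb->pts) - (fa->pts < fb->pts);
}

/* Hypothetical helper: reorder frames from decode order into display order, in place. */
void sort_by_pts(DemoFrame *frames, size_t count) {
  assert(frames != NULL);
  qsort(frames, count, sizeof(frames[0]), compare_pts);
}
```

Feeding it the stream I(dts 1, pts 1), P(2, 4), B(3, 2), B(4, 3) gives back the display order I B B P.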

Synching

Now, it's all well and good to know when we're supposed to show a particular video frame, but how do we actually do so? Here's the idea: after we show a frame, we figure out when the next frame should be shown. Then we simply set a new timeout to refresh the video again after that amount of time. As you might expect, we check the value of the PTS of the next frame against the system clock to see how long our timeout should be. This approach works, but there are two issues that need to be dealt with.
First is the issue of knowing when the next PTS will be. Now, you might think that we can just add the frame duration (one over the video frame rate) to the current PTS, and you'd be mostly right. However, some kinds of video call for frames to be repeated. This means that we're supposed to repeat the current frame a certain number of times. This could cause the program to display the next frame too soon. So we need to account for that.
The second issue is that as the program stands now, the video and the audio are chugging away happily, not bothering to sync at all. We wouldn't have to worry about that if everything worked perfectly. But your computer isn't perfect, and a lot of video files aren't, either. So we have three choices: sync the audio to the video, sync the video to the audio, or sync both to an external clock (like your computer). For now, we're going to sync the video to the audio.

Coding it: getting the frame PTS

Now let's get into the code to do all this. We're going to need to add some more members to our big struct, but we'll do this as we need to. First let's look at our video thread. Remember, this is where we pick up the packets that were put on the queue by our decode thread. What we need to do in this part of the code is get the PTS of the frame given to us by avcodec_decode_video2. We do that with the best-effort timestamp mentioned above, falling back to 0 when the packet carries no timestamp, which is pretty easy:
  double pts;

  for(;;) {
    if(packet_queue_get(&is->videoq, packet, 1) < 0) {
      // means we quit getting packets
      break;
    }
    pts = 0;
    // Decode video frame
    len1 = avcodec_decode_video2(is->video_st->codec,
                                pFrame, &frameFinished, packet);
    if(packet->dts != AV_NOPTS_VALUE) {
      pts = av_frame_get_best_effort_timestamp(pFrame);
    } else {
      pts = 0;
    }
    pts *= av_q2d(is->video_st->time_base);
We set the PTS to 0 if we can't figure out what it is.
Well, that was easy. A technical note: the PTS is stored in the packet and the frame as a 64-bit integer (int64_t), even though we keep it in a double once it has been converted. This value is a timestamp measured in units of that stream's time_base. For example, if a stream has 24 frames per second, a PTS of 42 indicates that the frame should go where the 42nd frame would be if we had a frame every 1/24 of a second (certainly not necessarily true).
We can convert this value to seconds by dividing by the framerate. In this example the time_base of the stream is 1/framerate (which holds only for some fixed-fps content; many containers use a much finer time_base), so to get the PTS in seconds, we multiply by the time_base.
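Here is the conversion as a tiny standalone sketch. DemoRational, demo_q2d and pts_to_seconds are hypothetical stand-ins written for this example so it compiles without FFmpeg; in the real player the same job is done by AVRational and av_q2d, as in the snippet above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for FFmpeg's AVRational: the fraction num/den. */
typedef struct DemoRational {
  int num;
  int den;
} DemoRational;

/* Same idea as av_q2d(): turn the fraction into a double. */
double demo_q2d(DemoRational q) {
  assert(q.den != 0);
  return (double)q.num / (double)q.den;
}

/* Convert an integer timestamp in time_base units into seconds. */
double pts_to_seconds(int64_t pts, DemoRational time_base) {
  return (double)pts * demo_q2d(time_base);
}
```

With a time_base of 1/24, a PTS of 42 comes out as 1.75 seconds. With a time_base of 1/90000 (an assumed value here, but a common one in MPEG transport streams), the same instant would be stored as a PTS of 157500.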

Coding: Synching and using the PTS

So now we've got our PTS all set. Now we've got to take care of the two synchronization problems we talked about above. We're going to define a function called synchronize_video that will update the PTS to be in sync with everything. This function will also finally deal with cases where we don't get a PTS value for our frame. At the same time we need to keep track of when the next frame is expected so we can set our refresh rate properly. We can accomplish this by using an internal video_clock value which keeps track of how much time has passed according to the video. We add this value to our big struct.
typedef struct VideoState {
  double          video_clock; // pts of last decoded frame / predicted pts of next decoded frame
Here's the synchronize_video function, which is pretty self-explanatory:
double synchronize_video(VideoState *is, AVFrame *src_frame, double pts) {

  double frame_delay;

  if(pts != 0) {
    /* if we have pts, set video clock to it */
    is->video_clock = pts;
  } else {
    /* if we aren't given a pts, set it to the clock */
    pts = is->video_clock;
  }
  /* update the video clock */
  frame_delay = av_q2d(is->video_st->codec->time_base);
  /* if we are repeating a frame, adjust clock accordingly */
  frame_delay += src_frame->repeat_pict * (frame_delay * 0.5);
  is->video_clock += frame_delay;
  return pts;
}
You'll notice we account for repeated frames in this function, too.
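If the repeat_pict arithmetic looks opaque, this standalone sketch isolates just that step so you can plug numbers into it. advance_video_clock is a hypothetical helper written for this illustration, not part of the player; it only mirrors the clock update done inside synchronize_video.

```c
#include <assert.h>

/* Hypothetical helper: advance a video clock (in seconds) past one frame.
   frame_delay is the nominal duration of one frame; each unit of
   repeat_pict keeps the picture on screen for an extra half frame period,
   mirroring the arithmetic in synchronize_video. */
double advance_video_clock(double video_clock, double frame_delay, int repeat_pict) {
  assert(frame_delay > 0.0 && repeat_pict >= 0);
  frame_delay += repeat_pict * (frame_delay * 0.5);
  return video_clock + frame_delay;
}
```

For a 25 fps stream (frame_delay of 0.04), a normal frame moves the clock forward by 0.04 seconds, while a frame with repeat_pict set to 1 moves it forward by 0.06 seconds.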
Now let's get our proper PTS and queue up the frame using queue_picture, adding a new pts argument:
    // Did we get a video frame?
    if(frameFinished) {
      pts = synchronize_video(is, pFrame, pts);
      if(queue_picture(is, pFrame, pts) < 0) {
        break;
      }
    }
The only thing that changes about queue_picture is that we save that pts value to the VideoPicture structure that we queue up. So we have to add a pts variable to the struct and add a line of code:
typedef struct VideoPicture {
  ...
  double pts;
} VideoPicture;
int queue_picture(VideoState *is, AVFrame *pFrame, double pts) {
  ... stuff ...
  if(vp->bmp) {
    ... convert picture ...
    vp->pts = pts;
    ... alert queue ...
  }
So now we've got pictures lining up onto our picture queue with proper PTS values, so let's take a look at our video refreshing function. You may recall from last time that we just faked it and put a refresh of 80ms. Well, now we're going to find out how to actually figure it out.
Our strategy is going to be to predict the time of the next PTS by simply measuring the time between the previous pts and this one. At the same time, we need to sync the video to the audio. We're going to make an audio clock: an internal value that keeps track of what position the audio we're playing is at. It's like the digital readout on any mp3 player. Since we're synching the video to the audio, the video thread uses this value to figure out if it's too far ahead or too far behind.
We'll get to the implementation later; for now let's assume we have a get_audio_clock function that will give us the time on the audio clock. Once we have that value, though, what do we do if the video and audio are out of sync? It would be silly to simply try and leap to the correct packet through seeking or something. Instead, we're just going to adjust the value we've calculated for the next refresh: if the PTS is too far behind the audio time, we refresh as quickly as possible so the video can catch up; if the PTS is too far ahead of the audio time, we double our calculated delay. Now that we have our adjusted refresh time, or delay, we're going to compare that with our computer's clock by keeping a running frame_timer. This frame timer will sum up all of our calculated delays while playing the movie. In other words, this frame_timer is what time it should be when we display the next frame. We simply add the new delay to the frame timer, compare it to the time on our computer's clock, and use that value to schedule the next refresh. This might be a bit confusing, so study the code carefully:
void video_refresh_timer(void *userdata) {

  VideoState *is = (VideoState *)userdata;
  VideoPicture *vp;
  double actual_delay, delay, sync_threshold, ref_clock, diff;
  
  if(is->video_st) {
    if(is->pictq_size == 0) {
      schedule_refresh(is, 1);
    } else {
      vp = &is->pictq[is->pictq_rindex];

      delay = vp->pts - is->frame_last_pts; /* the pts from last time */
      if(delay <= 0 || delay >= 1.0) {
        /* if incorrect delay, use previous one */
        delay = is->frame_last_delay;
      }
      /* save for next time */
      is->frame_last_delay = delay;
      is->frame_last_pts = vp->pts;

      /* update delay to sync to audio */
      ref_clock = get_audio_clock(is);
      diff = vp->pts - ref_clock;

      /* Skip or repeat the frame. Take delay into account
         FFPlay still doesn't "know if this is the best guess." */
      sync_threshold = (delay > AV_SYNC_THRESHOLD) ? delay : AV_SYNC_THRESHOLD;
      if(fabs(diff) < AV_NOSYNC_THRESHOLD) {
        if(diff <= -sync_threshold) {
          delay = 0;
        } else if(diff >= sync_threshold) {
          delay = 2 * delay;
        }
      }
      is->frame_timer += delay;
      /* compute the REAL delay */
      actual_delay = is->frame_timer - (av_gettime() / 1000000.0);
      if(actual_delay < 0.010) {
        /* Really it should skip the picture instead */
        actual_delay = 0.010;
      }
      schedule_refresh(is, (int)(actual_delay * 1000 + 0.5));
      /* show the picture! */
      video_display(is);
      
      /* update queue for next picture! */
      if(++is->pictq_rindex == VIDEO_PICTURE_QUEUE_SIZE) {
        is->pictq_rindex = 0;
      }
      SDL_LockMutex(is->pictq_mutex);
      is->pictq_size--;
      SDL_CondSignal(is->pictq_cond);
      SDL_UnlockMutex(is->pictq_mutex);
    }
  } else {
    schedule_refresh(is, 100);
  }
}
There are a few checks we make: first, we make sure that the delay between the PTS and the previous PTS makes sense. If it doesn't, we just guess and use the last delay. Next, we make sure we have a synch threshold because things are never going to be perfectly in synch. ffplay uses 0.01 for its value. We also make sure that the synch threshold is never smaller than the gaps in between PTS values. Finally, we make the minimum refresh value 10 milliseconds. Really here we should skip the frame, but we're not going to bother.
We added a bunch of variables to the big struct so don't forget to check the code. Also, don't forget to initialize the frame timer and the initial previous frame delay in stream_component_open:
    is->frame_timer = (double)av_gettime() / 1000000.0;
    is->frame_last_delay = 40e-3;
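The delay adjustment inside video_refresh_timer is easier to reason about once it is pulled out into a pure function you can test by hand. The sketch below does only that. adjust_delay and the two DEMO_ constants are hypothetical names for this example: the 0.01 sync threshold is the ffplay value mentioned above, while the 10 second no-sync threshold is an assumption, so check your own source for the value you actually use.

```c
#include <assert.h>

/* Assumed values for this sketch; see the note in the text. */
#define DEMO_SYNC_THRESHOLD   0.01
#define DEMO_NOSYNC_THRESHOLD 10.0

/* Hypothetical helper: given the nominal delay until the next frame, the
   PTS of the frame about to be shown, and the audio clock (all in
   seconds), return the delay adjusted toward the audio. */
double adjust_delay(double delay, double video_pts, double audio_clock) {
  double diff = video_pts - audio_clock;      /* > 0: video is ahead */
  double abs_diff = (diff < 0.0) ? -diff : diff;
  double sync_threshold =
      (delay > DEMO_SYNC_THRESHOLD) ? delay : DEMO_SYNC_THRESHOLD;

  assert(delay >= 0.0);
  if (abs_diff < DEMO_NOSYNC_THRESHOLD) {
    if (diff <= -sync_threshold) {
      return 0.0;            /* video is behind: show the next frame at once */
    } else if (diff >= sync_threshold) {
      return 2.0 * delay;    /* video is ahead: wait twice as long */
    }
  }
  return delay;              /* in sync, or too far off to bother correcting */
}
```

With a nominal delay of 0.04 seconds, a frame half a second behind the audio gets a delay of 0, a frame half a second ahead gets 0.08, and a frame that is 99 seconds off is left alone because the difference exceeds the no-sync threshold.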

Synching: The Audio Clock

Now it's time for us to implement the audio clock. We can update the clock time in our audio_decode_frame function, which is where we decode the audio. Now, remember that we don't always process a new packet every time we call this function, so there are two places where we have to update the clock. The first place is where we get the new packet: we simply set the audio clock to the packet's PTS. Then, if a packet has multiple frames, we keep track of the audio play time by counting the number of samples and dividing by the given samples-per-second rate. So once we have the packet:
    /* if update, update the audio clock w/pts */
    if(pkt->pts != AV_NOPTS_VALUE) {
      is->audio_clock = av_q2d(is->audio_st->time_base)*pkt->pts;
    }
And once we are processing the packet:
      /* Keep audio_clock up-to-date */
      pts = is->audio_clock;
      *pts_ptr = pts;
      n = 2 * is->audio_st->codec->channels;
      is->audio_clock += (double)data_size /
        (double)(n * is->audio_st->codec->sample_rate);
A few fine details: the prototype of the function has changed to include pts_ptr, so make sure you change that. pts_ptr is a pointer we use to tell audio_callback the pts of the audio packet. This will be used next time for synchronizing the audio with the video.
Now we can finally implement our get_audio_clock function. It's not as simple as getting the is->audio_clock value, though. Notice that we set the audio PTS every time we process it, but if you look at the audio_callback function, it takes time to move all the data from our audio packet into our output buffer. That means that the value in our audio clock could be too far ahead. So we have to check how much we have left to write. Here's the complete code:

double get_audio_clock(VideoState *is) {
  double pts;
  int hw_buf_size, bytes_per_sec, n;
  
  pts = is->audio_clock; /* maintained in the audio thread */
  hw_buf_size = is->audio_buf_size - is->audio_buf_index;
  bytes_per_sec = 0;
  if(is->audio_st) {
    /* 2 bytes per sample per channel; only touch audio_st after the NULL check */
    n = is->audio_st->codec->channels * 2;
    bytes_per_sec = is->audio_st->codec->sample_rate * n;
  }
  if(bytes_per_sec) {
    pts -= (double)hw_buf_size / bytes_per_sec;
  }
  return pts;
}
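To check the arithmetic in get_audio_clock without a running player, here is the same correction as a standalone function that takes plain numbers. audio_clock_now is a hypothetical name for this sketch, and like the code above it assumes 16-bit samples (2 bytes per sample per channel).

```c
#include <assert.h>

/* Hypothetical helper: subtract the playing time of the bytes that have
   been decoded but not yet copied to the audio device. */
double audio_clock_now(double audio_clock, int buf_size, int buf_index,
                       int channels, int sample_rate) {
  int bytes_pending = buf_size - buf_index;
  int bytes_per_sec = sample_rate * channels * 2;  /* assumes 16-bit samples */

  assert(bytes_pending >= 0);
  if (bytes_per_sec <= 0) {
    return audio_clock;  /* no usable stream info: leave the clock alone */
  }
  return audio_clock - (double)bytes_pending / (double)bytes_per_sec;
}
```

For 44100 Hz stereo, one second of audio is 176400 bytes. If the clock says 10.0 seconds and 88200 bytes are still waiting in the buffer, the position actually being heard is 9.5 seconds.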

Friday, 6 March 2015

How to Wrap H264 Frames into FLV

First, the analysis of the FLV data
Start by taking an FLV file and doing a simple analysis of the FLV data format, using:
the FLV standard document (download)
the FLV file analyzer flvparse (download)
This is not the focus of this article, so it is skipped here. With the FLV standard document in hand you should be able to understand the FLV data format, and if you want to learn this material, completing this step first is highly recommended.
The H.264 video stream uploaded over RTMP
The server saves the uploaded video stream as a binary file (note: this is hexadecimal data, and binary data must be saved in binary mode). The original post showed it as a hex dump, 16 bytes per row, from 0 to f.
The tool I use is Notepad++ (with the hex viewer plug-in installed).
If you have done the first step, it is not hard to see that the RTMP video stream is a sequence of FLV video tags with the FLV tag header removed, leaving only the video tag body.
Let's go through it against the FLV standard document:
17: 1 = keyframe, 7 = AVC
00: AVC packet type = AVC sequence header
00 00 00: composition time; all zeros here, so it carries no information
Because the AVC packet type is AVC sequence header, what follows is the content of the AVCDecoderConfigurationRecord:
configurationVersion = 01
AVCProfileIndication = 42
profile_compatibility = 00
AVCLevelIndication = 1f
lengthSizeMinusOne = ff - the number of bytes used for each NALU length field in FLV is (lengthSizeMinusOne & 3) + 1; in practice this byte is ff, which evaluates to 4. This value is used again below.
numOfSequenceParameterSets = E1 - the number of SPS is numOfSequenceParameterSets & 0x1F; in practice this byte is E1, which evaluates to 1
sequenceParameterSetLength = 00 31 - the SPS length, 2 bytes; evaluates to 49
sequenceParameterSetNALUnits = 67 42 80 1f 96 54 05 01 ed 80 a8 40 00 00 03 00 40 00 00 07 b8 00 00 20 00 00 03 01 00 01 fc 63 8c 00 00 10 00 00 03 00 80 00 fe 31 c3 b4 24 4d 40 - the SPS, 49 bytes as just calculated; the SPS contains the video width and height information
numOfPictureParameterSets = 01 - the number of PPS; here it is 1
pictureParameterSetLength = 00 04 - PPS length
pictureParameterSetNALUnits = 68 ce 35 20 - PPS
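The field-by-field walk through the AVCDecoderConfigurationRecord can be turned into a short parser. This is a minimal sketch assuming the layout described above: four header bytes, lengthSizeMinusOne, the SPS count, each SPS with a 2-byte length, then the PPS count and each PPS with a 2-byte length. DemoAvcConfig and parse_avc_config are hypothetical names for this example, and the sketch only remembers the first SPS and the first PPS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result struct; sps/pps point into the caller's buffer. */
typedef struct DemoAvcConfig {
  int profile;          /* AVCProfileIndication */
  int level;            /* AVCLevelIndication */
  int nal_length_size;  /* (lengthSizeMinusOne & 3) + 1 */
  int sps_count;
  const uint8_t *sps;
  size_t sps_len;
  int pps_count;
  const uint8_t *pps;
  size_t pps_len;
} DemoAvcConfig;

/* Skip `count` parameter sets starting at *pos, recording the first one.
   Returns 0 on success, -1 if the buffer is truncated. */
static int read_param_sets(const uint8_t *buf, size_t len, size_t *pos,
                           int count, const uint8_t **first, size_t *first_len) {
  int i;
  for (i = 0; i < count; i++) {
    size_t n;
    if (len - *pos < 2) return -1;
    n = ((size_t)buf[*pos] << 8) | buf[*pos + 1];  /* big-endian length */
    *pos += 2;
    if (len - *pos < n) return -1;
    if (i == 0) { *first = buf + *pos; *first_len = n; }
    *pos += n;
  }
  return 0;
}

/* Hypothetical helper: parse the record; 0 on success, -1 if truncated. */
int parse_avc_config(const uint8_t *buf, size_t len, DemoAvcConfig *out) {
  size_t pos = 6;
  assert(buf != NULL && out != NULL);
  if (len < 6) return -1;
  out->profile = buf[1];
  out->level = buf[3];
  out->nal_length_size = (buf[4] & 0x03) + 1;
  out->sps_count = buf[5] & 0x1F;
  out->sps = NULL; out->sps_len = 0;
  out->pps = NULL; out->pps_len = 0;
  out->pps_count = 0;
  if (read_param_sets(buf, len, &pos, out->sps_count, &out->sps, &out->sps_len) < 0)
    return -1;
  if (len - pos < 1) return -1;
  out->pps_count = buf[pos++];
  return read_param_sets(buf, len, &pos, out->pps_count, &out->pps, &out->pps_len);
}
```

Run against the record bytes listed above (starting at configurationVersion = 01), it should report profile 0x42, level 0x1f, a NALU length size of 4, one SPS of 49 bytes starting with 67, and one PPS of 4 bytes starting with 68.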
This is followed by a new video tag:
17: 1 = keyframe, 7 = AVC
01: AVC packet type = AVC NALU
00 00 00: composition time; all zeros here, so it carries no information
Because the AVC packet type is AVC NALU, what follows is one or more NALUs.
Each NALU is preceded by (lengthSizeMinusOne & 3) + 1 bytes giving the NALU length (as mentioned earlier); here that is four bytes.
00 00 00 02: NALU length = 2
09 10: the NALU itself
A quick note on NALUs: the low five bits of the first byte of each NALU give the NAL unit type, nal_unit_type:
  #define NALU_TYPE_SLICE    1
  #define NALU_TYPE_DPA      2
  #define NALU_TYPE_DPB      3
  #define NALU_TYPE_DPC      4
  #define NALU_TYPE_IDR      5
  #define NALU_TYPE_SEI      6
  #define NALU_TYPE_SPS      7
  #define NALU_TYPE_PPS      8
  #define NALU_TYPE_AUD      9  // access unit delimiter
  #define NALU_TYPE_EOSEQ    10
  #define NALU_TYPE_EOSTREAM 11
  #define NALU_TYPE_FILL     12
09 & 0x1f = 9, the access unit delimiter
Checking the SPS and PPS from earlier: the SPS starts with byte 67, and 67 & 0x1f = 7; the PPS starts with byte 68, and 68 & 0x1f = 8. Both match the table.
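The masking step is a one-liner in code. In this sketch nal_unit_type and nal_type_name are hypothetical helper names, and the name table only covers the types that show up in this walkthrough.

```c
#include <assert.h>
#include <stdint.h>

/* nal_unit_type is the low five bits of the first byte of the NALU. */
int nal_unit_type(uint8_t header) {
  return header & 0x1F;
}

/* Hypothetical helper: a readable name for the types seen in this article. */
const char *nal_type_name(int type) {
  assert(type >= 0 && type <= 31);
  switch (type) {
    case 1: return "non-IDR slice";
    case 5: return "IDR slice";
    case 6: return "SEI";
    case 7: return "SPS";
    case 8: return "PPS";
    case 9: return "access unit delimiter";
    default: return "other";
  }
}
```

For example, nal_unit_type(0x67) is 7 (SPS), nal_unit_type(0x68) is 8 (PPS) and nal_unit_type(0x09) is 9 (access unit delimiter).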
00 00 00 29: the length of the next NALU, 41 bytes
06 00 11 80 00 af c8 00 00 03 00 00 03 00 00 af c8 00 00 03 00 00 40 01 0c 00 00 03 00 00 03 00 90 80 08 00 00 03 00 08 80: 06 & 0x1f = 6 - SEI
00 00 3c d0: the length of the next NALU
65 88 80 ......: 65 & 0x1f = 5 - I frame data
That ends the analysis of this video tag. It is followed by the P-frame data that depends on this I frame.
In the hex dump, see the row at offset 00003d80: everything up to 53 4f 7f still belongs to the previous video tag (the I frame data starting with 65 88 80 discussed above), and the 27 starts a new video tag.
27: 2 = inter frame (P frame), 7 = CodecID AVC
01: AVC packet type = AVC NALU
00 00 00: composition time; all zeros here, so it carries no information
00 00 00 02 09 30: as in the analysis above, a 2-byte NALU, the access unit delimiter
00 00 00 11: the length of the next NALU, 17 bytes
06 01 0c 00 00 80 00 00 90 80 18 00 00 03 00 08 80: 06 & 0x1f = 6 - SEI
00 00 46 85: the length of the next NALU
41 9a 02 ......: 41 & 0x1f = 1 - P frame data

Third, the conversion
To summarize roughly, the H.264 stream inside FLV is laid out in this order:
1. A video tag containing: SPS, PPS, access unit delimiter, SEI, I-frame packet
2. One or more video tags, each containing: access unit delimiter, SEI, P-frame packet
Repeat 1 and 2.
Note that for this conversion we only need to extract the NALUs from each video tag. Whether a NALU is an I frame, a P frame or some other type does not actually matter; the types are discussed here only to make the data easier to analyze.
In a raw H.264 stream, consecutive NALUs are separated by the start code 00 00 01 (or 00 00 00 01), so the H.264 format is composed as:
1. 00 00 00 01 SPS 00 00 00 01 PPS 00 00 00 01 access unit delimiter 00 00 00 01 SEI 00 00 00 01 I frame 00 00 00 01 P frame 00 00 00 01 P frame ...... (a variable number of P frames)
Repeat 1.
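The conversion described above, swapping each length prefix for a start code, can be sketched as follows. avcc_to_annexb is a hypothetical helper written for this article; it handles the body of one AVC NALU video tag (the bytes after the 5-byte 17/27, 01, composition time header), and the caller supplies the length size taken from lengthSizeMinusOne.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rewrite length-prefixed NALUs as start-code
   separated NALUs. Returns the number of bytes written to out, or -1 if
   the input is truncated or out is too small. */
long avcc_to_annexb(const uint8_t *in, size_t in_len, int nal_length_size,
                    uint8_t *out, size_t out_cap) {
  static const uint8_t start_code[4] = {0x00, 0x00, 0x00, 0x01};
  size_t pos = 0, written = 0;

  assert(in != NULL && out != NULL);
  assert(nal_length_size >= 1 && nal_length_size <= 4);

  while (pos < in_len) {
    size_t nal_len = 0;
    int i;

    /* Read the big-endian length prefix. */
    if (in_len - pos < (size_t)nal_length_size) return -1;
    for (i = 0; i < nal_length_size; i++) {
      nal_len = (nal_len << 8) | in[pos++];
    }
    if (in_len - pos < nal_len) return -1;

    /* Make sure the start code and the payload both fit. */
    if (out_cap - written < 4 || out_cap - written - 4 < nal_len) return -1;

    memcpy(out + written, start_code, 4);
    written += 4;
    memcpy(out + written, in + pos, nal_len);
    written += nal_len;
    pos += nal_len;
  }
  return (long)written;
}
```

The SPS and PPS live in the sequence header tag rather than in a NALU tag, so they need to be written out separately, each with its own 00 00 00 01 start code, before the first I frame.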

The access unit delimiter and SEI are not required. Write the H.264 data to a file in binary mode, and it can be played with Elecard StreamEye.

Wednesday, 4 March 2015

H.264 Frame Types (about I, B, P Frames)

H.264 Frame Types:

 H.264 streams include three types of frames (see Figure 5):
  • I-frames: Also known as key frames, I-frames are completely self-referential and don't use information from any other frames. These are the largest frames of the three, and the highest-quality, but the least efficient from a compression perspective.
  • P-frames: P-frames are "predicted" frames. When producing a P-frame, the encoder can look backwards to previous I or P-frames for redundant picture information. P-frames are more efficient than I-frames, but less efficient than B-frames.
  • B-frames: B-frames are bi-directional predicted frames. As you can see in Figure 5, this means that when producing B-frames, the encoder can look both forwards and backwards for redundant picture information. This makes B-frames the most efficient frame of the three. Note that B-frames are not available when producing using H.264's Baseline Profile.
Figure 5. I, P, and B-frames in an H.264-encoded stream
Now that you know the function of each frame type, I'll show you how to optimize their usage.

Working with I-frames

Though I-frames are the least efficient from a compression perspective, they do perform two invaluable functions. First, all playback of an H.264 video file has to start at an I-frame because it's the only frame type that doesn't refer to any other frames during encoding.
Since almost all streaming video may be played interactively, with the viewer dragging a slider around to different sections, you should include regular I-frames to ensure responsive playback. This is true when playing a video streamed from Flash Media Server, or one distributed via progressive download. While there is no magic number, I typically use an I-frame interval of 10 seconds, which means one I-frame every 300 frames when producing at 30 frames per second (and 240 and 150 for 24 fps and 15 fps video, respectively).
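The interval arithmetic in that paragraph is just seconds times frame rate. As a trivial sketch (keyframe_interval_frames is a hypothetical helper name, not an encoder setting):

```c
#include <assert.h>

/* Hypothetical helper: number of frames from one I-frame to the next,
   for a given I-frame interval in seconds. */
int keyframe_interval_frames(int interval_seconds, int frames_per_second) {
  assert(interval_seconds > 0 && frames_per_second > 0);
  return interval_seconds * frames_per_second;
}
```

A 10 second interval gives 300 frames at 30 fps, 240 at 24 fps and 150 at 15 fps, matching the figures above.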
The other function of an I-frame is to help reset quality at a scene change. Imagine a sharp cut from one scene to another. If the first frame of the new scene is an I-frame, it's the best possible frame, which is a better starting point for all subsequent P and B-frames looking for redundant information. For this reason, most encoding tools offer a feature called "scene change detection," or "natural key frames," which you should always enable.
Figure 6 shows the I-frame related controls from Flash Media Encoding Server. You can see that Enable Scene Change detection is enabled, and that the size of the Coded Video Sequence is 300, as in 300 frames. This would be simpler to understand if it simply said "I-frame interval," but it's easy enough to figure out.
Figure 6. I-frame related controls from Flash Media Encoding Server
Specifically, the Coded Video Sequence refers to a "Group of Pictures" or GOP, which is the building block of the H.264 stream—that is, each H.264 stream is composed of multiple GOPs. Each GOP starts with an I-frame and includes all frames up to, but not including, the next I-frame. By choosing a Coded Video Sequence size of 300, you're telling Flash Media Encoding Server to create a GOP of 300 frames, or basically the same as an I-frame interval of 300.

IDR frames

I'll describe the Number of B-Pictures setting further on, and I've addressed Entropy Coding Mode already; but I wanted to explain the Minimum IDR interval and IDR frequency. I'll start by defining an IDR frame.
Briefly, the H.264 specification enables two types of I-frames: normal I-frames and IDR frames. With IDR frames, no frame after the IDR frame can refer back to any frame before the IDR frame. In contrast, with regular I-frames, B and P-frames located after the I-frame can refer back to reference frames located before the I-frame.
In terms of random access within the video stream, playback can always start on an IDR frame because no frame refers to any frames behind it. However, playback cannot always start on a non-IDR I-frame because subsequent frames may reference previous frames.
Since one of the key reasons to insert I-frames into your video is to enable interactivity, I use the default setting of 1, which makes every I-frame an IDR frame. If you use a setting of 0, only the first I-frame in the video file will be an IDR frame, which could make the file sluggish during random access. A setting of 2 makes every second I-frame an IDR frame, while a setting of 3 makes every third I-frame an IDR frame, and so on. Again, I just use the default setting of 1.
Minimum IDR interval defines the minimum number of frames in a group of pictures. Though you've set the size of the Coded Video Sequence at 300, you also enabled Scene Change Detection, which allows the encoder to insert an I-frame at scene changes. In a very dynamic MTV-like sequence, this could result in very frequent I-frames, which could degrade overall video quality. For these types of videos, you could experiment with extending the minimum IDR interval to 30–60 frames, to see if this improves quality. For most videos, however, the default interval of 1 provides the encoder with the necessary flexibility to insert frequent I-frames in short, highly dynamic periods, like an opening or closing logo. For this reason, I also use the default option of 1 for this control.

Working with B-frames

B-frames are the most efficient frames because they can search both ways for redundancies. Though controls and control nomenclature vary from encoder to encoder, the most common B-frame related control is simply the number of B-frames, or "B-Pictures" as shown in Figure 6. Note that the number in Figure 6 actually refers to the number of B-frames between consecutive I-frames or P-frames.
Using the value of 2 found in Figure 6, you would create a GOP that looks like this:
IBBPBBPBBPBB... ...all the way to frame 300. If the number of B-Pictures was 3, the encoder would insert three B-frames between each I-frame and/or P-frame. While there is no magic number, I typically use two sequential B-frames.
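To see how the B-Pictures value shapes a GOP, here is a small sketch that writes out the frame-type pattern in display order. build_gop_pattern is a hypothetical helper for illustration only; it is a simplified model that ignores adaptive B-frame placement, scene-change I-frames, and however a particular encoder chooses to close the GOP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill out[] with one letter per frame for a single
   GOP in display order: an I-frame first, then runs of b_frames B-frames
   each followed by a P-frame. Writes gop_size letters plus a terminating
   NUL. Returns 0 on success, -1 if out is too small. */
int build_gop_pattern(int gop_size, int b_frames, char *out, size_t out_cap) {
  int i;
  assert(out != NULL && gop_size > 0 && b_frames >= 0);
  if (out_cap < (size_t)gop_size + 1) return -1;
  for (i = 0; i < gop_size; i++) {
    if (i == 0) {
      out[i] = 'I';
    } else if (i % (b_frames + 1) == 0) {
      out[i] = 'P';
    } else {
      out[i] = 'B';
    }
  }
  out[gop_size] = '\0';
  return 0;
}
```

With 2 B-frames, the first 12 frames come out as IBBPBBPBBPBB; with 3 B-frames the pattern starts IBBBPBBBP.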
How much can B-frames improve the quality of your video? Figure 7 tells the tale. By way of background, this is a frame at the end of a very-high-motion skateboard sequence, and also has significant detail, particularly in the fencing behind the skater. This combination of high motion and high detail is unusual, and makes this frame very hard to encode. As you can see in the figure, the video file encoded using B-frames retains noticeably more detail than the file produced without B-frames. In short, B-frames do improve quality.
Figure 7. File encoded with B-frames (left) and no B-frames (right)
What's the performance penalty on the decode side? I ran a battery of cross-platform tests, primarily on older, lower-power computers, measuring the CPU load required to play back a file produced with the Baseline Profile (no B-frames), and a file produced using the High Profile with B-frames. The maximum differential that I saw was 10 percent, which isn't enough to affect my recommendation to always use the High Profile except when producing for devices that support only the Baseline Profile.

Advanced B-frame options

Adobe Flash Media Encoding Server also includes the B and P-frame related controls shown in Figure 8. Adaptive B-frame placement allows the encoder to override the Number of B-Pictures value when it will enhance the quality of the encoded stream; for instance, when it detects a scene change and substitutes an I-frame for the B. I always enable this setting.
Figure 8. Other B-frame related options
Reference B-Pictures lets the encoder use B-frames as reference frames for P-frames, while Allow pyramid B-frame coding lets the encoder use B-frames as references for other B-frames. I typically don't enable these options because the quality difference is negligible, and I've noticed that these options can cause playback to become unstable in some environments.
Reference frames is the number of frames that the encoder can search for redundancies while encoding, which can impact both encoding time and decoding complexity; that is, when producing a B-frame or P-frame, if you used a setting of 10, the encoder would search until it found up to 10 frames with redundant information, increasing the search time. Moreover, if the encoder found redundancies in 10 frames, each of those frames would have to be decoded and in memory during playback, which increases decode complexity.
Intuitively, for most videos, the vast majority of redundancies are located in the frames most proximate to the frame being encoded. This means that values in excess of 4 or 5 increase encoding time while providing little value. I typically use a value of 4.
Finally, though it's not technically related to B-frames, consider the number of Slices per picture, which can be 1, 2, or 4. At a value of 4, the encoder divides each frame into four regions and searches for redundancies in other frames only within the respective region. This can accelerate encoding on multicore computers because the encoder can assign the regions to different cores. However, since redundant information may have moved to a different region between frames—say in a panning or tilting motion—encoding with multiple slices may miss some redundancies, decreasing the overall quality of the video.
In contrast, at the default value of 1, the encoder treats each frame as a whole, and searches for redundancies in the entire frame of potential reference frames. Since it's harder to split this task among multiple cores, this setting is slower, but also maximizes quality. Unless you're in a real hurry, I recommend the default value of 1.

How Flash Video Streams, and the Format of FLV

Flash Video (FLV)

Flash Video is the name of a file format used to deliver video over the Internet using Adobe Flash Player version 6 or newer. Flash Video content may also be embedded within SWF files. Until version 9 update 3 of the Flash Player, Flash Video referred to a proprietary file format with the extension .FLV, but Adobe has since introduced new file extensions and MIME types and suggests using those instead of the old FLV:

File Extension   FTYP     MIME Type     Description
.f4v             'F4V '   video/mp4     Video for Adobe Flash Player
.f4p             'F4P '   video/mp4     Protected Media for Adobe Flash Player
.f4a             'F4A '   video/mp4     Audio for Adobe Flash Player
.f4b             'F4B '   video/mp4     Audio Book for Adobe Flash Player
.flv             -        video/x-flv   Flash Video

It is possible to place H.264 and AAC streams into the traditional FLV file, but Adobe strongly encourages everyone to embrace the new standard file format. There are functional limits with the FLV structure when streaming H.264 which couldn't be overcome without a redesign of the file format. This is one of the reasons Adobe is moving away from the traditional FLV file structure. Specifically dealing with sequence headers and enders is tricky with FLV streams. Adobe is still working out if it's possible to place On2 VP6 streams into the new file format.

Overview

  • File format parser implementing parts of ISO 14496-12 (very limited sub set of MPEG-4, 3GP and QuickTime movie support).
  • Support for the 3GPP timed text specification 3GPP TS 26.245. Essentially this is a standardized subtitle format within 3GP files. Any number of text tracks are supported, and all the information, including esoteric stuff like karaoke metadata, is dumped in 'onMetaData' and a new 'onTextData' NetStream callback. Language information in the individual tracks is also reported. That means you can have subtitles in several languages. Check the 3GPP TS 26.245 specification to see what information is available. Note that you have to take care of the formatting and placement of the text yourself; the Flash Player will do nothing here. You can use MP4Box to inject text data into existing files.
  • Partial parsing support for the 'ilst' atom which is the ID3 equivalent iTunes uses to store meta data. This is usually present in iTunes files. It contains ID3 like information and is reported in the onMetaData callback as key/value pairs in a mixed array with the name 'tags'. ID3V2 is not supported right now.
  • A software based H.264 codec with the ability to decode Base, Mainline and High profiles.
  • An AAC decoder supporting AAC Main, AAC LC and SBR (also known as HE-AAC ((The support of AAC allows you to encode audio to 64Kbit/s with the same quality of a 128Kbit/s encoded MP3. Further more, for other use more susceptible to bandwidth usage, like Internet Radio, HE-AAC v2 gives the possibility to encode audio to 32Kbit/s or lower with a surprisingly good final result. In low bitrate streaming scenarios this can make the difference.)).

Issues

Tools to solve FLV-related issues:

Video

Overview

You load and play .mp4, .m4v, .m4a, .mov and .3gp files using the same NetStream API you use to load FLV files. There are a few things to be aware of:
  • Video needs to be in H.264 format only. MPEG-4 Part 2 (Xvid, DivX etc.) video is not supported, H.263 video is not supported, and Sorenson Video is not supported. A lot of podcasts still use MPEG-4 Part 2, so do not be surprised if you do not see any video.
  • The Flash Player is close to 100% compliant with the H.264 standard; all Baseline, Main, High and High 10 bit streams should play.
  • Extended, High 4:2:2 and High 4:4:4 profiles are not officially supported at this time. They might or might not work depending on what features are used. There is no artificial limit on B-frames, and B-pyramids do not cause the problems they do in some other players.
  • Since these files contain an index, unlike old FLV files, the Flash Player provides a list of safe seek points, i.e. times you can seek to without having the play head jump around. You get this information through the onMetaData callback in an array with the name 'seekpoints'. On the downside, some files are missing this information, which means that these files are not seekable at all. This is very different from the traditional FLV file format, which instead relies on key frames to determine the seek points.

Codecs


Codec | Introduced in Flash Player version | Introduced in Flash Lite version | Container Formats | ISO Specification | Codec Id
Sorenson Spark [1] | 6 | 3 | FLV | | 2
Macromedia Screen Video [2] | 6 | - | FLV | | 3
Macromedia ScreenVideo 2 [3] | 8 | - | FLV | | 6
On2 TrueMotion VP6-E | 8 | 3 | MOV | | 4
On2 TrueMotion VP6-S | 9.0.115.0 | - | MP4V, M4V | | 5
H.264 (MPEG-4 Part 10) | 9.0.115.0 | - | MP4, F4V, 3GP, 3G2 | ISO 14496-10 |

[1] Flash documentation does not state a number for "their" version of Sorenson but describes the codec as a variant of ITU-T (International Telecommunications Union-Telecommunication Standardization Sector) recommendation H.263 (MPEG-4_V). In early 2006, one of Sorenson's compression applications to produce content for Flash offered the Sorenson_3 codec, described by experts as a variant of ITU-T H.264 (MPEG-4_AVC). By late 2006, Sorenson offered new compression applications with other outputs.
[2] This codec divides the screen into large macroblocks (e.g. 64x64 pixels), reduces the number of colors, and transmits the changed blocks after compressing them with zlib. This is very similar to what VNC does. The format is bitmap-tile based, can be lossy by reducing color depth, and is compressed.
[3] This codec can use two different types of macroblock: Iblock and Kblock. The Kblock works like a keyframe and is archived for future reference. The Iblock is encoded as differences from a previous block. This approach, similar to the usual compression of generic video content, gives a much better compression ratio, especially in a standard "moving windows" scenario.

Adobe Tech Note

Audio

Overview

  • Audio can be either AAC Main, AAC LC or SBR, corresponding to audio object types 0, 1 and 2.
  • The '.mp3' sample type for tracks with mp3 audio is also supported.
  • MP3inMP4 which intends to do multi-channel mp3 playback within mp4 files is not supported.
  • The old QuickTime-specific style of embedding AAC and MP3 data is not supported. It is unlikely, though, that you will run into this kind of file.
  • Unencrypted audio book files contain chapter information. This information is exposed in the onMetaData callback as an array of objects with name 'chapters'.
  • The Flash Player can play back multi-channel AAC files, though the sound is mixed down to two channels and resampled to 44.1 kHz. Multi-channel playback is targeted for one of the next major revisions of the Flash Player. This requires a complete redesign of the sound engine in the Flash Player, which dates from circa 1996 and has not been improved since.
  • All sampling rates from 8 kHz to 96 kHz are supported. A 32-tap Kaiser-Bessel-based FIR filter resamples the sound to 44.1 kHz, retaining high quality. The most common sample rate combinations have a hard-coded number of phases. For a 48000 to 44100 Hz conversion, for example, the filter has 147 phases.
  • Flash Player Update 3 Beta 2 can now play back any MP3 sampling rate, leveraging the same implementation used for AAC. No more chipmunks. Ever.
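
The 147-phase figure quoted for the 48000 to 44100 Hz case matches the numerator of the reduced rate ratio. The following is a minimal sketch of that arithmetic only; it assumes the phase count of a polyphase resampler equals that numerator, which is a common design but is not confirmed for the Flash Player. The function name `resample_ratio` is hypothetical.

```python
from fractions import Fraction


def resample_ratio(src_hz: int, dst_hz: int) -> Fraction:
    """Return dst/src as a reduced fraction (upsample factor / downsample factor)."""
    return Fraction(dst_hz, src_hz)


ratio = resample_ratio(48000, 44100)
# Assumption: one filter phase per unit of the upsample factor (the numerator).
print(ratio.numerator, ratio.denominator)
```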

Codecs


Codec | Introduced in Flash Player version | Container Formats | ISO Specification | Codec Id
MP3 | 6 | MP3 | | 2
Nellymoser ASAO Codec (speech compression) audio content | 6 | FLV | | 5, 6
Raw PCM sampled audio content | 6 | WAV | | 0
ADPCM (Adaptive Delta Pulse Code Modulation) audio content | 6 | | | 1
AAC (HE-AAC/AAC SBR, AAC Main Profile, and AAC-LC) | 9.0.115.0 | M4A, MP4 | ISO 14496-3 |
Speex | 10 | FLV | Wiki | 11

Image

  • Image tracks encoded in JPEG, GIF and PNG are accessible in AS3 as a byte array through the callback 'onImageData'. You can simply take that byte array and use the Loader class to display the images. Most often these images represent cover artwork for audio files.
  • TIFF image tracks are not supported, although you might come across files that use them.
  • Support for the 'covr' meta data stored in iTunes files, accessible as byte arrays.

Metadata


Property | Value | Notes
duration | | Obvious. Unlike for FLV files, this field will always be present.
videocodecid | | For H.264 it reports 'avc1'.
audiocodecid | | For AAC it reports 'mp4a', for MP3 it reports '.mp3'.
avcprofile | 66, 77, 88, 100, 110, 122 or 144 | Corresponds to the H.264 profiles.
avclevel | A number between 10 and 51 | Consult this list to find out more.
aottype | Either 0, 1 or 2 | This corresponds to AAC Main, AAC LC and SBR audio types.
moovposition | int | The offset in bytes of the moov atom in a file.
trackinfo | Array | An array of objects containing various information about all the tracks in a file.
chapters | Array | Information about chapters in audiobooks.
seekpoints | Array | Times you can directly feed into NetStream.seek();
videoframerate | int | The frame rate of the video if a monotone frame rate is used. Most videos will have a monotone frame rate.
audiosamplerate | | The original sampling rate of the audio track.
audiochannels | | The original number of channels of the audio track.
tags | | ID3-like tag information.
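
As a rough illustration of how the numeric codec fields might be turned into readable labels, here is a minimal sketch. The profile names are the standard H.264 names for those profile numbers, and the level is formatted by the usual convention of dividing by ten (31 becomes 3.1); neither the mapping tables nor the function `describe_codec_fields` come from the Flash Player, and the metadata is assumed to arrive as a plain dictionary.

```python
# Assumed mapping from the avcprofile values listed above to H.264 profile names.
AVC_PROFILES = {
    66: "Baseline",
    77: "Main",
    88: "Extended",
    100: "High",
    110: "High 10",
    122: "High 4:2:2",
    144: "High 4:4:4",
}

# aottype values as described in the table above.
AAC_TYPES = {0: "AAC Main", 1: "AAC LC", 2: "SBR"}


def describe_codec_fields(meta: dict) -> dict:
    """Translate avcprofile, avclevel and aottype into human-readable strings."""
    out = {}
    if "avcprofile" in meta:
        out["profile"] = AVC_PROFILES.get(meta["avcprofile"], "unknown")
    if "avclevel" in meta:
        level = int(meta["avclevel"])
        out["level"] = f"{level // 10}.{level % 10}"
    if "aottype" in meta:
        out["audio"] = AAC_TYPES.get(meta["aottype"], "unknown")
    return out


print(describe_codec_fields({"avcprofile": 100, "avclevel": 31, "aottype": 1}))
```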

FLV Format

A Flash Video file (.FLV file extension) consists of a short header followed by interleaved audio, video, and metadata packets. The audio and video packets are stored very similarly to those in SWF, and the metadata packets consist of AMF data.

FLV Header


Field | Data Type | Example | Description
Signature | byte[3] | "FLV" | Always "FLV"
Version | uint8 | "\x01" (1) | Currently 1 for known FLV files
Flags | uint8 bitmask | "\x05" (5, audio+video) | Bitmask: 4 is audio, 1 is video
Offset | uint32_be | "\x00\x00\x00\x09" (9) | Total size of header (always 9 for known FLV files)
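
A minimal sketch of reading these four fields, assuming the first nine bytes of the file are already in memory. The names `FlvHeader` and `parse_flv_header` are hypothetical and used only for illustration.

```python
import struct
from typing import NamedTuple


class FlvHeader(NamedTuple):
    version: int
    has_audio: bool
    has_video: bool
    data_offset: int


def parse_flv_header(data: bytes) -> FlvHeader:
    """Parse the 9-byte FLV header described in the table above."""
    if len(data) < 9:
        raise ValueError("FLV header needs at least 9 bytes")
    # ">" selects big-endian with no padding: 3s + B + B + I = 9 bytes.
    signature, version, flags, offset = struct.unpack(">3sBBI", data[:9])
    if signature != b"FLV":
        raise ValueError("missing FLV signature")
    # Flags bitmask: 4 is audio, 1 is video.
    return FlvHeader(version, bool(flags & 0x04), bool(flags & 0x01), offset)


print(parse_flv_header(b"FLV\x01\x05\x00\x00\x00\x09"))
```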

FLV Stream


Field | Data Type | Example | Description
PreviousTagSize | uint32_be | "\x00\x00\x00\x00" (0) | Always 0

Then a sequence of tags followed by their size until EOF.

FLV Tag


Field | Data Type | Example | Description
Type | uint8 | "\x12" (0x12, META) | Determines the layout of Body, see below for tag types
BodyLength | uint24_be | "\x00\x00\xe0" (224) | Size of Body (total tag size - 11)
Timestamp | uint24_be | "\x00\x00\x00" (0) | Timestamp of tag (in milliseconds)
TimestampExtended | uint8 | "\x00" (0) | Timestamp extension to form a uint32_be. This field has the upper 8 bits.
StreamId | uint24_be | "\x00\x00\x00" (0) | Always 0
Body | byte[BodyLength] | ... | Dependent on the value of Type
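
The 24-bit fields and the split timestamp are the awkward part of this layout, so a minimal sketch of decoding the 11 bytes that precede Body may help. `FlvTagHeader` and `parse_tag_header` are hypothetical names; the input is assumed to start exactly at the Type byte.

```python
from typing import NamedTuple


class FlvTagHeader(NamedTuple):
    tag_type: int
    body_length: int
    timestamp_ms: int
    stream_id: int


def parse_tag_header(data: bytes) -> FlvTagHeader:
    """Decode the fixed 11-byte part of an FLV tag."""
    if len(data) < 11:
        raise ValueError("FLV tag header needs at least 11 bytes")
    tag_type = data[0]
    body_length = int.from_bytes(data[1:4], "big")
    timestamp_low = int.from_bytes(data[4:7], "big")
    timestamp_ext = data[7]  # upper 8 bits of the 32-bit timestamp
    stream_id = int.from_bytes(data[8:11], "big")
    timestamp_ms = (timestamp_ext << 24) | timestamp_low
    return FlvTagHeader(tag_type, body_length, timestamp_ms, stream_id)


# The example values from the table: a META tag with a 224-byte body.
print(parse_tag_header(b"\x12\x00\x00\xe0\x00\x00\x00\x00\x00\x00\x00"))
```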

Previous tag size


Field | Data Type | Example | Description
PreviousTagSize | uint32_be | "\x00\x00\x00\x00" (0) | Total size of previous tag, or 0 for first tag
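
Putting the pieces together, the stream is a header followed by alternating PreviousTagSize fields and tags. The sketch below walks a complete FLV file held in memory and checks each PreviousTagSize against the tag before it. The generator name `iter_flv_tags` is hypothetical, and raising an error on a size mismatch is a design choice made here for illustration; a more tolerant reader might only warn.

```python
from typing import Iterator, Tuple


def iter_flv_tags(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (tag_type, timestamp_ms, body) for every tag in an FLV byte string."""
    if data[:3] != b"FLV":
        raise ValueError("missing FLV signature")
    pos = int.from_bytes(data[5:9], "big")  # header Offset field, normally 9
    expected_prev = 0  # the first PreviousTagSize is always 0
    while pos + 4 <= len(data):
        prev_size = int.from_bytes(data[pos:pos + 4], "big")
        if prev_size != expected_prev:
            raise ValueError("PreviousTagSize mismatch at offset %d" % pos)
        pos += 4
        if pos + 11 > len(data):
            break  # only the trailing PreviousTagSize was left
        tag_type = data[pos]
        body_len = int.from_bytes(data[pos + 1:pos + 4], "big")
        timestamp = (data[pos + 7] << 24) | int.from_bytes(data[pos + 4:pos + 7], "big")
        body = data[pos + 11:pos + 11 + body_len]
        if len(body) != body_len:
            raise ValueError("truncated tag body at offset %d" % pos)
        yield tag_type, timestamp, body
        expected_prev = 11 + body_len  # total size of the tag just read
        pos += 11 + body_len
```

Typical use would be `for tag_type, ts, body in iter_flv_tags(open(path, "rb").read()): ...`, where `path` is whatever file you want to inspect.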

FLV Tag Types


Tag code | Name | Description
0x08 | AUDIO | Contains an audio packet similar to a SWF SoundStreamBlock plus codec information
0x09 | VIDEO | Contains a video packet similar to a SWF VideoFrame plus codec information
0x12 | META | Contains two AMF packets, the name of the event and the data to go with it

FLV Tag 0x08: AUDIO

The first byte of an audio packet contains bitflags that describe the codec used, with the following layout:

Name | Expression | Description
soundType | (byte & 0x01) >> 0 | 0: mono, 1: stereo
soundSize | (byte & 0x02) >> 1 | 0: 8-bit, 1: 16-bit
soundRate | (byte & 0x0C) >> 2 | 0: 5.5 kHz (or Speex 16 kHz), 1: 11 kHz, 2: 22 kHz, 3: 44 kHz
soundFormat | (byte & 0xf0) >> 4 | 0: Uncompressed, 1: ADPCM, 2: MP3, 5: Nellymoser 8 kHz mono, 6: Nellymoser, 11: Speex

The rest of the audio packet is simply the relevant data for that format, as per a SWF SoundStreamBlock.
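
The bit expressions above translate directly into code. This is a minimal sketch; the function name `decode_audio_flags` and the dictionary-based return value are illustrative choices, and values not listed in the table are reported as "unknown".

```python
SOUND_FORMATS = {
    0: "Uncompressed",
    1: "ADPCM",
    2: "MP3",
    5: "Nellymoser 8 kHz mono",
    6: "Nellymoser",
    11: "Speex",
}
SOUND_RATES = {0: "5.5 kHz", 1: "11 kHz", 2: "22 kHz", 3: "44 kHz"}


def decode_audio_flags(byte: int) -> dict:
    """Split the first byte of an AUDIO tag body into its four fields."""
    return {
        "soundType": "stereo" if (byte & 0x01) else "mono",
        "soundSize": "16-bit" if (byte & 0x02) >> 1 else "8-bit",
        "soundRate": SOUND_RATES[(byte & 0x0C) >> 2],
        "soundFormat": SOUND_FORMATS.get((byte & 0xF0) >> 4, "unknown"),
    }


# 0x2F = 0010 1111: MP3, 44 kHz, 16-bit, stereo.
print(decode_audio_flags(0x2F))
```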

FLV Tag 0x09: VIDEO

The first byte of a video packet contains bitflags that describe the codec used and the type of frame:

Name | Expression | Description
codecID | (byte & 0x0f) >> 0 | 2: Sorenson H.263, 3: Screen video, 4: On2 VP6, 5: On2 VP6 Alpha, 6: ScreenVideo 2
frameType | (byte & 0xf0) >> 4 | 1: keyframe, 2: inter frame, 3: disposable inter frame

In some cases it is also useful to decode some of the body of the video packet, such as to acquire its resolution (if the initial onMetaData META tag is missing, for example).
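
Decoding the flag byte itself is straightforward; here is a minimal sketch in the same spirit as the table above. The function name `decode_video_flags` is hypothetical. Extracting the resolution from the packet body is codec-specific and is not attempted here.

```python
VIDEO_CODECS = {
    2: "Sorenson H.263",
    3: "Screen video",
    4: "On2 VP6",
    5: "On2 VP6 Alpha",
    6: "ScreenVideo 2",
}
FRAME_TYPES = {1: "keyframe", 2: "inter frame", 3: "disposable inter frame"}


def decode_video_flags(byte: int) -> dict:
    """Split the first byte of a VIDEO tag body into codec and frame type."""
    return {
        "codecID": VIDEO_CODECS.get(byte & 0x0F, "unknown"),
        "frameType": FRAME_TYPES.get((byte & 0xF0) >> 4, "unknown"),
    }


# 0x12: high nibble 1 (keyframe), low nibble 2 (Sorenson H.263).
print(decode_video_flags(0x12))
```

A keyframe check such as `decode_video_flags(body[0])["frameType"] == "keyframe"` is one way a reader could collect seek points from a traditional FLV file.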

FLV Tag 0x12: META

The contents of a meta packet are two AMF packets. The first is almost always a short uint16_be length-prefixed UTF-8 string (AMF type 0x02), and the second is typically a mixed array (AMF type 0x08). However, the second chunk typically contains a variety of types, so a full AMF parser should be used.
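
Reading the first of the two packets, the event name, needs only a few lines. The sketch below handles just the AMF string type (0x02) with its uint16_be length prefix and returns the remaining bytes untouched, so that they can be handed to a full AMF parser. The function name `read_amf_string` is hypothetical.

```python
from typing import Tuple


def read_amf_string(data: bytes) -> Tuple[str, bytes]:
    """Read one AMF string (type 0x02) and return (text, remaining_bytes)."""
    if len(data) < 3 or data[0] != 0x02:
        raise ValueError("expected an AMF string (type 0x02)")
    length = int.from_bytes(data[1:3], "big")
    end = 3 + length
    if end > len(data):
        raise ValueError("AMF string is truncated")
    return data[3:end].decode("utf-8"), data[end:]


# A META body typically starts with the event name, e.g. "onMetaData".
body = bytes([0x02, 0x00, 0x0A]) + b"onMetaData" + bytes([0x08])
name, rest = read_amf_string(body)
print(name, rest)
```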