Tuesday, April 19, 2005

Parallel by force

Suppose you've built the multithreaded MPEG decoder outlined in the previous entry. Remember, the good reason for multithreading it was:
  • The task is inherently multithreaded so a multithreaded solution results in simpler code.

In fact the MPEG decoder almost begs to be multithreaded.

So one day your multithreaded MPEG decoder is happily zipping through an MPEG stream that contains just one video track and one audio track. The following threads are running (sketched in code after the list):

#1 Accept and demux input
#2 Decode video substream
#3 Decode audio substream
#4 Mix and present substreams.
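
Here's a rough sketch of that shape in Go (goroutines standing in for the threads). The Packet, VideoFrame, and AudioFrame types and the decode bodies are made-up placeholders, not a real decoder's API:

    package main

    import "fmt"

    type Packet struct {
        isVideo bool
        data    []byte
    }
    type VideoFrame struct{ pixels []byte }
    type AudioFrame struct{ samples []byte }

    // Thread #1: accept and demux input, routing packets to the right substream.
    func demux(in <-chan Packet, video, audio chan<- Packet) {
        for p := range in {
            if p.isVideo {
                video <- p
            } else {
                audio <- p
            }
        }
        close(video)
        close(audio)
    }

    // Threads #2 and #3: decode a substream (the real work is elided).
    func decodeVideo(in <-chan Packet, out chan<- VideoFrame) {
        for p := range in {
            out <- VideoFrame{pixels: p.data}
        }
        close(out)
    }

    func decodeAudio(in <-chan Packet, out chan<- AudioFrame) {
        for p := range in {
            out <- AudioFrame{samples: p.data}
        }
        close(out)
    }

    // Thread #4: mix and present the substreams (grossly simplified).
    func mixAndPresent(v <-chan VideoFrame, a <-chan AudioFrame) {
        for frame := range v {
            <-a // pull the matching audio
            fmt.Println("presenting a frame of", len(frame.pixels), "bytes")
        }
    }

    func main() {
        input := make(chan Packet, 64)
        video := make(chan Packet, 64)
        audio := make(chan Packet, 64)
        vOut := make(chan VideoFrame, 8)
        aOut := make(chan AudioFrame, 8)

        go demux(input, video, audio)
        go decodeVideo(video, vOut)
        go decodeAudio(audio, aOut)

        input <- Packet{isVideo: true, data: make([]byte, 4)}
        input <- Packet{isVideo: false, data: make([]byte, 4)}
        close(input)

        mixAndPresent(vOut, aOut)
    }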

Then your boss shows up and says, "I spent all this money on a 16-CPU superserver and your application is only keeping it 25% busy. I want you to increase the parallelism so all the CPUs will be kept busy. NOW!"

* * *
Now what do you do (other than look for a new job with a new boss)?

You've already added the "natural" multithreading that is inherent in the problem. How can you increase parallelism even further?

It's time to try to apply the other good reason.

  • The task can be cleanly decomposed into multiple sub-tasks that are highly independent; the independent tasks can use resources in parallel; and the benefits of this parallel usage outweigh the overhead of multithreading. (All three conditions must be true.)

Hmmm....

A video stream is a series of frames. Maybe we can create multiple threads and have each thread decode a separate frame. So we add a component that separates the stream into a series of undecoded frames (yes, this is fairly easy to do without actually decoding the frames) and a pool of threads that processes these frames. Each thread from the pool picks up the next undecoded frame, decodes it, and adds the result to a collection of decoded frames. Since frame-decode time varies with the complexity of the image, we also need a component to shuffle the decoded frames back into the correct order.
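
As a sketch (again in Go, with a stubbed-out decode and made-up frame types), the pool-plus-reshuffle idea looks something like this -- each result carries its sequence number so the reordering component can put the output back into stream order:

    package main

    import (
        "fmt"
        "sync"
    )

    type EncodedFrame struct {
        seq  int // position in the original stream
        data []byte
    }

    type DecodedFrame struct {
        seq    int
        pixels []byte
    }

    // decode stands in for the real (expensive) frame decoder.
    func decode(f EncodedFrame) DecodedFrame {
        return DecodedFrame{seq: f.seq, pixels: f.data}
    }

    func main() {
        const workers = 16
        frames := make(chan EncodedFrame, workers)
        decoded := make(chan DecodedFrame, workers)

        // The pool: each thread picks up the next undecoded frame.
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for f := range frames {
                    decoded <- decode(f)
                }
            }()
        }
        go func() { wg.Wait(); close(decoded) }()

        // Feed some fake undecoded frames, then close the input.
        go func() {
            for seq := 0; seq < 50; seq++ {
                frames <- EncodedFrame{seq: seq}
            }
            close(frames)
        }()

        // The reshuffling component: frames finish out of order, so hold them
        // until the next expected sequence number turns up.
        pending := map[int]DecodedFrame{}
        next := 0
        for f := range decoded {
            pending[f.seq] = f
            for out, ok := pending[next]; ok; out, ok = pending[next] {
                fmt.Println("presenting frame", out.seq)
                delete(pending, next)
                next++
            }
        }
    }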

Voila, we can keep as many CPUs busy as we want by looking far enough ahead in the stream. Makes sense, right?

Nice theory, anyway. When you start coding the frame decoder, you'll quickly run into a major stumbling block. One of the techniques MPEG uses to compress the video image is to send most frames as a diff from the previous frame. This is very effective -- especially when the movie is showing relatively static scenery (it doesn't work so well during explosions). Thus as you decode frame #n you regularly have to refer back to frame #n-1 to apply the diff and thereby create the final result. Even more interesting, sometimes you have to look *forward* to frame #n+1! (Don't ask, the MPEG folks are a twisted bunch.)

So the thread-per-frame solution sounds plausible (you can probably sell it to your boss) but fails the "independence" test. Back to the drawing board.

Fortunately, for DVDs there's another approach. In order to support fast forward, slow motion, jump to scene, etc., the video on a DVD is carved up into chunks called video objects (VOBs). A VOB contains about half a second's worth of video, audio, subtitles, etc., and, more important, each VOB is independent of the VOBs that precede and follow it. So, although the thread-per-frame idea was a bust, a thread-per-VOB approach will work nicely. You may need a priority scheme to ensure that the thread decoding the VOB scheduled to show up next on the screen gets all the resources it needs, but other than that you've found a clean division of the main task into subtasks that can take advantage of the available CPUs by running in parallel.
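
One possible shape for that priority scheme, sketched in Go with a hypothetical VOB type: keep the pending VOBs in a min-heap ordered by presentation sequence, so an idle decoder thread always grabs the chunk that's due on screen soonest:

    package main

    import (
        "container/heap"
        "fmt"
        "sync"
    )

    type VOB struct {
        seq  int // presentation order: lower means needed on screen sooner
        data []byte
    }

    // vobQueue is a min-heap keyed on seq, so Pop always yields the VOB that
    // is scheduled to show up next.
    type vobQueue []VOB

    func (q vobQueue) Len() int            { return len(q) }
    func (q vobQueue) Less(i, j int) bool  { return q[i].seq < q[j].seq }
    func (q vobQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
    func (q *vobQueue) Push(x interface{}) { *q = append(*q, x.(VOB)) }
    func (q *vobQueue) Pop() interface{} {
        old := *q
        v := old[len(old)-1]
        *q = old[:len(old)-1]
        return v
    }

    // scheduler hands pending VOBs to decoder threads, most urgent first.
    type scheduler struct {
        mu    sync.Mutex
        ready vobQueue
    }

    func (s *scheduler) Add(v VOB) {
        s.mu.Lock()
        heap.Push(&s.ready, v)
        s.mu.Unlock()
    }

    func (s *scheduler) Next() (VOB, bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.ready.Len() == 0 {
            return VOB{}, false
        }
        return heap.Pop(&s.ready).(VOB), true
    }

    func main() {
        s := &scheduler{}
        for _, n := range []int{3, 1, 2} { // VOBs can become ready out of order
            s.Add(VOB{seq: n})
        }
        for v, ok := s.Next(); ok; v, ok = s.Next() {
            fmt.Println("would decode VOB", v.seq) // a real worker decodes here
        }
    }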

The new high level design for video looks like:
  1. Accept and demux input
  2. Queue packets for decoder(s) [MT]
  3. Separate into VOBs
  4. Hold VOBs for processing [MT]
  5. Decode VOB
  6. Hold decoded VOBs for reordering [MT]
  7. Reorder decoded VOBs into the decoded stream
  8. Queue decoded streams for mixer [MT]
  9. Mix and present substreams
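
Read as code, each step tagged [MT] becomes a thread-safe queue between stages, and each remaining step becomes a loop running on its own thread (step 5 gets a whole pool of them). Here's a skeleton of that wiring in Go, with the stage bodies stubbed out and all the types made up:

    package main

    import "sync"

    type Packet struct{ data []byte }

    type VOB struct {
        seq  int
        data []byte
    }

    type DecodedVOB struct {
        seq    int
        frames [][]byte
    }

    func main() {
        packets := make(chan Packet, 256)    // 2. queue packets for decoder(s)
        vobs := make(chan VOB, 32)           // 4. hold VOBs for processing
        decoded := make(chan DecodedVOB, 32) // 6. hold decoded VOBs for reordering
        ordered := make(chan DecodedVOB, 32) // 8. queue decoded streams for mixer

        go func() { // 1. accept and demux input (stub)
            close(packets)
        }()

        go func() { // 3. separate into VOBs (stub)
            for range packets {
            }
            close(vobs)
        }()

        var pool sync.WaitGroup // 5. decode VOBs, roughly one worker per CPU
        for i := 0; i < 16; i++ {
            pool.Add(1)
            go func() {
                defer pool.Done()
                for v := range vobs {
                    decoded <- DecodedVOB{seq: v.seq} // stub decode
                }
            }()
        }
        go func() { pool.Wait(); close(decoded) }()

        go func() { // 7. reorder decoded VOBs into the decoded stream (stub)
            for d := range decoded {
                ordered <- d
            }
            close(ordered)
        }()

        for range ordered { // 9. mix and present substreams (stub)
        }
    }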

This approach has added some more synchronization spots -- one to hold the separated VOBs waiting to be decoded, and one to hold the decoded VOBs until they can be placed in the correct sequence and passed on to the mixer. It might be tempting to try to merge the demuxer with the VOB separator, or the decoded-VOB holder with the decoded-stream queue, but don't give in to temptation. Solve one problem at a time and let the inherent parallelism take care of improving performance. [Or at least get it working correctly and profile it before optimizing.]

The moral of the story:
  • Finding the right decomposition into independent subtasks needs to be done carefully, based on a detailed understanding of the domain. An obvious solution may not be the right solution.
