I was working on a graphics editor for the Zenith Z100 computer. This was a shrink-wrap product intended to be distributed by the thousands (well, millions actually, but Zenith didn't sell enough Z100's (sigh)).
The Z100 had 640 x 225 x 8 color graphics capability (or 640 x 450 if you used an undocumented video mode).
The problem
In a word: speed. One of my test patterns took 40 seconds to render to the screen. This is painfully slow -- probably slow enough to kill the project.
Profiling the C program showed that writing pixels to the screen dominated the program's run time. Profiling the pixel-writing code showed that the divide by nine was dog-slow. This was on an Intel 8088 processor (an 8 bit processor with delusions of 16 bitness).
The Hacker
Dale (that would be me) Wilson
The Hack
I can't claim exclusivity for this hack. Other people found it and later it became pretty common, but I can claim that I found it myself.
The DIV instruction on the early Intel processors was a real cycle-burner. Not only that, but it took a lot of setup and tied up a lot of registers on a machine that is extremely register-poor. It had to go. The good news is that the DIV instruction was a 32-bit/16-bit divide, which is way more than necessary for this problem. In fact an 8-bit/4-bit divide would have been sufficient.
My first approach to solving the problem was to implement Divide-By-Nine as a series of shifts and subtracts. Since the divisor was hard-wired the division could be done in straight line code. No branches, no testing the carry bit (oops, this is Intel, I mean the borrow bit). The result was about ten times faster than the DIV based code. The test pattern now showed up in four seconds vs the original 40 seconds. Good, but four seconds is still a bit sluggish.
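Here's a minimal sketch of that idea in C rather than in 8086 assembly (the name div9 and the exact unrolling are my reconstruction, not the original code). With the divisor hard-wired at nine, every trial subtraction compares against a pre-shifted multiple of nine; in assembly the same sequence can be arranged as straight-line code.

    #include <stdint.h>

    /* Divide an 8-bit value by a hard-wired 9 using only pre-shifted
       constants and subtracts.  The largest multiple of 9 that fits in
       8 bits is 9 << 4 = 144, so five trial steps cover any input. */
    static void div9(uint8_t n, uint8_t *quot, uint8_t *rem)
    {
        uint8_t q = 0, r = n;
        if (r >= 144) { r -= 144; q |= 16; }   /* 9 << 4 */
        if (r >= 72)  { r -= 72;  q |= 8;  }   /* 9 << 3 */
        if (r >= 36)  { r -= 36;  q |= 4;  }   /* 9 << 2 */
        if (r >= 18)  { r -= 18;  q |= 2;  }   /* 9 << 1 */
        if (r >= 9)   { r -= 9;   q |= 1;  }   /* 9 << 0 */
        *quot = q;
        *rem  = r;
    }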
Studying the Intel instruction set revealed all sorts of oddities. One interesting one was the AAM gap. The AAM instruction itself was a somewhat dorky instruction intended to be used as one step in multiplying BCD numbers. The acronym stands for ASCII Adjust after Multiply. Once you get past the noise about BCD multiplies (who multiplies BCD numbers? Maybe COBOL programmers (grin)) and look at what it actually does to the bits, you discover that it's really a divide instruction. It divides the AL register by ten and puts the quotient in AH and the remainder in AL.
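Modeled in C (a sketch of the documented behavior, not code from the product), the whole instruction boils down to this:

    #include <stdint.h>

    /* AAM with its documented encoding: divide AL by ten,
       quotient to AH, remainder to AL. */
    uint16_t aam(uint16_t ax)
    {
        uint8_t al = ax & 0xFF;
        return (uint16_t)(((al / 10) << 8) | (al % 10));
    }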
The opcode for AAM is D4 0A. But looking at the Intel instruction set docs reveals that AAM is the only instruction with D4 as the first byte. That seems wasteful. And it is interesting that the second byte happens to be an 0A (i.e. a ten). A little experimenting (can you say self-modifying code (with a straight face?)) reveals that AAM is really an 8-bit "divide immediate" instruction. If you put a 09 in the second byte it divides by 9. If you put a 00 there, you get a divide-by-zero fault!
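You can still try this today. Here's a hedged demo of the undocumented form (my code, not the original): it uses GCC-style inline assembly to emit the raw bytes D4 09, and it assumes a 32-bit x86 build, since AAM was dropped in 64-bit mode.

    #include <stdint.h>
    #include <stdio.h>

    /* Emit AAM with an immediate of 9 (bytes D4 09).  AL goes in;
       the quotient comes back in AH, the remainder in AL. */
    static void aam9(uint8_t n, uint8_t *quot, uint8_t *rem)
    {
        uint16_t ax = n;
        __asm__ volatile (".byte 0xD4, 0x09" : "+a"(ax) : : "cc");
        *quot = ax >> 8;
        *rem  = ax & 0xFF;
    }

    int main(void)
    {
        uint8_t q, r;
        aam9(100, &q, &r);
        printf("100 / 9 = %u remainder %u\n", q, r);  /* 11 remainder 1 */
        return 0;
    }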
Back to the write-the-pixel-fast algorithm. Out comes the shift-and-subtract sequence. In goes an AAM9 and voilà, the render time is significantly under a second. Life is good. Ship the product. Everyone sees what a marvelous machine the Z100 is; Zenith rules; and we all become rich-n-famous (oops -- except for that last sentence (sigh)).
Ok, judges. What's the score for this one on the sleazy(0) to righteous(10) scale?
Extenuating circumstances
I left one point of information out of the discussion above: I left the shift-and-subtract code in the program. At startup the program tested to be sure AAM9 did the right thing. If the test failed, it fell back to shift-and-subtract.
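As a sketch, reusing the hypothetical div9 and aam9 helpers from above (the original check's details are mine to guess at), the startup test might look like this:

    /* Probe the undocumented opcode against known answers before
       trusting it; fall back to the portable routine if it lies. */
    typedef void (*div9_fn)(uint8_t, uint8_t *, uint8_t *);
    static div9_fn divide_by_9;   /* selected once, at startup */

    static int aam9_works(void)
    {
        static const uint8_t probes[] = { 0, 8, 9, 100, 255 };
        for (unsigned i = 0; i < sizeof probes; i++) {
            uint8_t q, r;
            aam9(probes[i], &q, &r);
            if (q != probes[i] / 9 || r != probes[i] % 9)
                return 0;   /* the undocumented form misbehaved */
        }
        return 1;
    }

    static void init_divide(void)
    {
        divide_by_9 = aam9_works() ? aam9 : div9;
    }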
My Rating
Since this is my own hack it's hard for me to be objective, but without the startup check I'd give it about a 3. Using undocumented features of the processor in commercial software is bad news. With the check, however, I'd go as high as a 6. After all, if success in the market requires speed and this is the only way to get the speed...
As it turns out, however, this wasn't the only way to get the speed. In a later version of the program I switched from thinking about the screen as a collection of pixels to thinking about it as a collection of horizontal lines (some of which were very short). This gained almost as much of a speed increase as the switch from shift-and-subtract to AAM9, and it didn't involve any undocumented ops.
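To see why lines help so much, here's a hypothetical sketch (set_pixel_at stands in for the Z100 address math, which I haven't spelled out here): a horizontal run needs the divide only once, at its left edge, after which the quotient and remainder can be stepped incrementally.

    /* Draw the run [x0, x1] on row y.  Instead of dividing for every
       pixel, divide once and maintain quotient and remainder as x
       steps right. */
    void draw_span(int x0, int x1, int y, uint8_t color)
    {
        int q = x0 / 9, r = x0 % 9;       /* one divide per span */
        for (int x = x0; x <= x1; x++) {
            set_pixel_at(q, r, y, color); /* hypothetical address math */
            if (++r == 9) { r = 0; q++; } /* step instead of re-dividing */
        }
    }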
And one last point.
You may wonder why I didn't use the "normal" technique of rendering into an off-screen buffer and then blitting to the actual video buffer. Good question, and the answer comes back to available memory. Zenith sold Z100's with as little as 128K bytes of RAM for programs + operating system (the video buffer was separate). A screen buffer takes 54K (640 x 225 pixels at 3 bits per pixel), which means there just wasn't enough RAM available to do an off-screen buffer right.