Monday, February 28, 2005

Stealth wins

I finished weaving the "stealth scarf" (see previous discussion) this weekend, and just as I suspected, the pattern is pretty much invisible. If you hold it at just the right angle in just the right light you can see it, but that's not what I was hoping for.

For my next bright idea.....

Wednesday, February 23, 2005

The AAM hack

Background
I was working on a graphics editor for the Zenith Z100 computer. This was a shrink-wrap product intended to be distributed by the thousands (well, millions actually, but Zenith didn't sell enough Z100s (sigh)).

The Z100 had 640 x 225 x 8 color graphics capability (or 640 x 450 if you used an undocumented video mode). (Compare and contrast with CGA.) The Z100's graphics were memory mapped, but the layout of the memory was somewhat arcane. Converting an (X,Y) pixel address to a byte and bit address involved some calculations, including a divide by 9.

The problem

In a word: speed. One of my test patterns took 40 seconds to render to the screen. This is painfully slow -- probably slow enough to kill the project.
Profiling the C program showed that writing pixels to the screen dominated the performance of the program. Profiling the pixel-writing code showed that the divide by nine was dog-slow. This was on an Intel 8086 processor (an 8-bit processor with delusions of 16-bitness).

The Hacker
Dale (that would be me) Wilson

The Hack
I can't claim exclusivity for this hack. Other people found it and later it became pretty common, but I can claim that I found it myself.

The DIV instruction on the early Intel processors was a real cycle burner. Not only that, but it took a lot of setup and tied up a lot of registers on a machine that was extremely register-poor. It had to go. The good news is that the DIV instruction was a 32-bit-by-16-bit divide, which is way more than necessary for this problem. In fact an 8-bit-by-4-bit divide would have been sufficient.

My first approach to solving the problem was to implement divide-by-nine as a series of shifts and subtracts. Since the divisor was hard-wired, the division could be done in straight-line code. No branches, no testing the carry bit (oops, this is Intel, I mean the borrow bit). The result was about ten times faster than the DIV-based code. The test pattern now showed up in four seconds vs. the original 40 seconds. Good, but four seconds is still a bit sluggish.
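For the curious, here's a hedged sketch in C of what the unrolled divide-by-nine looks like. It is not the original 8086 assembly -- the real thing used the borrow flag to stay branch-free where this sketch uses comparisons -- but the shape is the same. It assumes the dividend is less than 1024, which is plenty for a 640-pixel-wide screen.

unsigned div9(unsigned n, unsigned *remainder)
{
  unsigned q = 0;
  if (n >= 9u << 6) { n -= 9u << 6; q |= 1u << 6; }   // 576
  if (n >= 9u << 5) { n -= 9u << 5; q |= 1u << 5; }   // 288
  if (n >= 9u << 4) { n -= 9u << 4; q |= 1u << 4; }   // 144
  if (n >= 9u << 3) { n -= 9u << 3; q |= 1u << 3; }   //  72
  if (n >= 9u << 2) { n -= 9u << 2; q |= 1u << 2; }   //  36
  if (n >= 9u << 1) { n -= 9u << 1; q |= 1u << 1; }   //  18
  if (n >= 9u)      { n -= 9u;      q |= 1u;      }   //   9
  *remainder = n;
  return q;
}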

Studying the Intel instruction set revealed all sorts of oddities. One interesting one was the AAM gap. The AAM instruction itself was a somewhat dorky instruction intended to be used as one step in multiplying BCD numbers. The acronym stands for ASCII Adjust after Multiply. Once you get past the noise about BCD multiplies (who multiplies BCD numbers? Maybe COBOL programmers (grin)) and look at what it actually does to the bits, you discover that it's really a divide instruction. It divides the AL register by ten and puts the quotient in AH and the remainder in AL.

The opcode for AAM is D4 0A. But looking at the Intel instruction set docs reveals that AAM is the only instruction with D4 as the first byte. That seems wasteful. And it is interesting that the second byte happens to be an "A" (i.e., a ten). A little experimenting (can you say self-modifying code (with a straight face?)) reveals that AAM is really an 8-bit "divide immediate" instruction. If you put a 09 in the second byte, it divides by 9. If you put a 00 there, you get a divide-by-zero fault!
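Just to make the trick concrete, here's a minimal sketch of the patched encoding in modern dress. It assumes a GCC-style compiler targeting 16- or 32-bit x86 (AAM was dropped from 64-bit mode), and it hard-codes the byte pair instead of using self-modifying code:

unsigned aam9(unsigned char value)
{
  unsigned short ax = value;          // dividend goes in AL (AH cleared)
  asm volatile (".byte 0xD4, 0x09"    // the undocumented "AAM 9" encoding
                : "+a"(ax)            // AH = quotient, AL = remainder
                :
                : "cc");              // AAM alters the flags
  return ax;
}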

Back to the write-the-pixel-fast algorithm. Out comes the shift-and-subtract sequence. In goes an AAM 9 and voila, the render time is significantly under a second. Life is good. Ship the product. Everyone sees what a marvelous machine the Z100 is; Zenith rules; and we all become rich-n-famous (oops -- except for that last part (sigh)).


Ok, judges. What's the score for this one on the sleazy(0) to righteous(10) scale?


Extenuating circumstances

I left one point of information out of the discussion above. I left the shift-and-subtract code in the program. At startup the program tested to be sure AAM 9 did the right thing. If the test failed, it fell back to shift-and-subtract.
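In other words, something along these lines (a sketch, using the aam9() routine sketched above):

bool aam9_is_trustworthy()
{
  for (unsigned v = 0; v < 256; ++v)
  {
    unsigned expected = ((v / 9) << 8) | (v % 9);        // AH:AL packed into AX
    if (aam9(static_cast<unsigned char>(v)) != expected)
      return false;                                      // fall back to shift-and-subtract
  }
  return true;
}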

My Rating
Since this is my own hack it's hard for me to be objective, but without the startup check I'll give it about a 3. Using undocumented features of the processor in commercial software is bad news. With the check, however, I'd go as high as a 6. After all, if success in the market requires speed and this is the only way to get the speed...

As it turns out, however, this wasn't the only way to get the speed. In a later version of the program I switched from thinking about the screen as a collection of pixels to thinking about it as a collection of horizontal lines (some of which were very short). This gained almost as much of a speed increase as the switch from shift-and-subtract to AAM 9, and it didn't involve any undocumented ops.

And one last point.
You may wonder why I didn't use the "normal" technique of rendering into an off-screen buffer and then blitting to the actual video buffer. Good question, and the answer comes back to available memory. Zenith sold Z100s with as little as 128K bytes of RAM for programs plus operating system. (The video buffer was separate.) A screen buffer takes 54K (640 x 225 pixels x 3 bits per pixel / 8 bits per byte = 54,000 bytes), which means there just wasn't enough RAM available to do an off-screen buffer right.

Monday, February 21, 2005

The FNO Hack

Background
Like the previous hack, this took place on a Honeywell 6000 computer running GCOS. We had an application that used a random access file for persistence in a high traffic environment (~50 allocations/second for sustained periods of time.)

To understand the hack you need to understand how floating point math was implemented on the Honeywell 6000 machine.

The H6000 worked with floating point numbers in the EAQ register where E is an eight bit exponent register and AQ is a double-word (72 bit) general purpose register used as a mantissa in this case. Floating point operations did not always produce normalized results, so a special instruction, FNO, normalized the current contents of the EAQ register.

Floating point normalization
Consider Avogadro's number, 6.022 x 10**23. Another way to express this in the same four digits is 0.602 x 10**24. The first version is normalized (base 10). The second is not. Notice the loss of precision. Binary normalization works the same way. To normalize a number you shift the mantissa left to get rid of leading zeroes, decreasing the exponent by one for each position shifted.
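A toy sketch of the mechanism, with a 64-bit mantissa standing in for the Honeywell's 72 bits (the names and widths here are mine, not the H6000's):

#include <cstdint>

void normalize(uint64_t &mantissa, int &exponent)
{
  if (mantissa == 0)
    return;                               // zero has no normalized form
  while ((mantissa & (1ULL << 63)) == 0)  // leading zero bit?
  {
    mantissa <<= 1;                       // shift it out...
    --exponent;                           // ...and compensate in the exponent
  }
}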

The problem
The challenge was to control the allocation of blocks in the random file. Remember, memory was tight, so a free list was not a good solution. Instead we used an allocation table with each bit corresponding to a block in the file.

How do you find an available block as fast as possible?

The hacker
Bob ("If you need comments to understand my code, you shouldn't be reading it!") Miller

The hack
An important decision was to represent available blocks as one bits and used blocks as zero bits. This allowed testing 72 bits at a time when looking for an available block: load the AQ register and loop back while the result is zero. Thus the algorithm was:

loop:
  load AQ from the table pointer (auto-incremented by 2 words)
  jump to loop if the result is zero

and now the hack:

  load the E register with zero
  FNO (floating point normalize)
  copy the E register to a general purpose register
  add 36 * the word offset into the allocation table

to get the available block number.
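A rough modern analogue, for anyone who wants to see the idea in C++ rather than H6000 assembly: the exponent adjustment that FNO performs is, in effect, a count of leading zero bits, so a count-leading-zeros builtin can play its part. Sixty-four-bit words stand in for the H6000's 36-bit words, and the GCC/Clang builtin is my assumption, not Bob's code:

#include <cstdint>
#include <cstddef>

long find_free_block(const uint64_t *table, std::size_t words)
{
  for (std::size_t i = 0; i < words; ++i)
    if (table[i] != 0)                       // a 1 bit means "free"
      return static_cast<long>(i * 64        // blocks in earlier words
             + __builtin_clzll(table[i]));   // leading zeros = first free bit
  return -1;                                 // no free block available
}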




Rate this hack from 0 (slimy) to 10 (righteous).



Important factors in my rating of this hack are:

Although it uses an instruction for "the wrong purpose," if you read the description of the instruction in terms of how it manipulates bits, it does all the right things. In other words, as long as Honeywell doesn't decide to change how they represent floating point, the hack should continue to work.

This is a high use function that has a direct and visible impact on performance.

Rating
I give it a 9 out of 10 (if Bob had believed in comments, it might actually be a 10).

Friday, February 18, 2005

The IOCW hack

Setting the scene

The time is the early 1970's. The computer, a Honeywell 6000 running the GCOS operating system. Some characteristics of this system:

The address space was 18 bits, but fortunately the addressable unit was a four-byte word, so the addressable space was one megabyte (where a byte had nine bits, but that's irrelevant). The OS had virtual memory, but individual processes were still limited to the 1M space.

The system had a separate I/O processor (known as the IOC) that handled all input and output. The IOC was programmable using a very limited instruction set (I/O control words, or IOCWs). To perform I/O you created one of these programs in your memory and asked the IOC to execute it. You could do scatter/gather I/O using IOCWs. You could also branch (but not conditionally) from one set of IOCWs to another.

When an application requested I/O, the OS validated the IOCWs, then started the I/O. The OS handled interrupts and status coming back from the IOC, then passed the results (and a limited form of interrupts) back to the application.

The Problem


Memory was at a premium, and dynamic memory allocation is tough in assembly language. Where do you put the IOCWs?

The Hacker

Jim Mettes (who happened to be my boss at the time.)

The Hack

Put an IOCW directly in the read buffer. The IOC fetches the instruction from the buffer first, then reads the data into the buffer -- overwriting the IOCW, which is no longer needed. Four bytes (the size of an IOCW) saved!
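In rough C++ terms (purely illustrative -- the original was assembly, and the sizes here are made up), the layout amounts to:

#include <cstdint>

union read_area
{
  std::uint32_t iocw;        // the I/O "instruction," consumed before the read starts
  char          data[1280];  // made-up buffer size; the incoming record lands on top of the IOCW
};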



At this point you should decide for yourself where this hack rates on the slimy-to-righteous scale. Zero is ultra-slime and 10 is mega-righteous.



The rest of the story.

Consider what happens when a recoverable I/O error occurs. The IOC reports the error. The OS determines that it is retryable and asks that age-old question: "Abort, Retry, or Ignore?" In this case the question goes to the console operator, not the person running the program. "Retry," says the console operator, because that's what his procedure manual says to answer.

The OS tells the IOC to try again. The IOC refetches the IOCW -- but it's been overwritten by the data from the previous attempt. Remember the point about the OS validating the IOCWs before issuing the I/O request? Guess what the designers of the OS forgot to do during the retry? If you were lucky you ended up with a core dump of your application. If you weren't quite so lucky the system administrators ended up with a core dump of the whole machine and came looking for your head!

The Score

I give it a 1. High danger, low payoff. And how do you tell your boss about it?

Wednesday, February 16, 2005

Thought #2 for the day

Some code idioms are inherently bug-prone. High on my list of offenders:

abc()->widget();


abc() presumably returns a pointer to an object that has a widget method. But what if there's no object available and abc() returns a null pointer?

Programmer: Doctor, it hurts when I do this.
Doctor: So don't do that!


Now, if abc() returned a reference rather than a pointer....
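A hypothetical sketch (all names invented) of the difference:

#include <iostream>

class Thing
{
public:
  void widget() { std::cout << "widget\n"; }
};

static Thing the_thing;

Thing* abc()     { return &the_thing; }  // could just as easily return 0
Thing& abc_ref() { return the_thing; }   // there is no null reference to return

void careful_caller()
{
  if (Thing* t = abc())   // the null case is forced into view
    t->widget();

  abc_ref().widget();     // nothing to check (or to forget)
}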

Dale

Fragile programming

There was a time when I practiced defensive programming. The theory was that code should be able to handle anything thrown at it.

What a bad idea!

If you're writing a square root function and someone hands you a negative number, DON'T fix it for them. Abort, throw an exception, trigger an assert, whatever! Just be sure that they know as soon as possible that they've done you wrong. (A compile-time error would be ideal.)

This rant was triggered by code throughout ACE and TAO that looks like:

if (do_some_function () != 0)
  return -1; // this should not happen

(Yes, the comments are really there in the code.)

Adding an ACE_ASSERT (and the missing braces which should have been there in the first place) to one instance of this "idiom" revealed a longstanding design error in ACE's handling of Thread Specific Storage.

I'd rather work with fragile code than helpful code!

Just to be thorough, the code should be:

if (do_some_function () != 0)
{
  ACE_ASSERT(false);
  return -1;
}

not:

ACE_ASSERT (do_some_function () == 0);

because when asserts are compiled out of a release build, the call to do_some_function disappears along with the assert.

Tuesday, February 15, 2005

Relativity values

In thinking about historical hacks I realized I needed to set the scene to make some of them understandable.

This started me thinking about the computing world and culture then vs. now. Everybody knows about the dramatic drop in price of electronics, but I'm not sure everyone has a gut feeling for the magnitude of the change. In fact, I often forget, and I lived thru it.

So: thought for today.

When I started working as a programmer, if I saved my entire salary (no food, no clothing, etc.) I could buy the computer I worked on in only 333 years.

Today I could buy a much more powerful machine every other day!

That helps to explain why we were willing to invest months and sometimes years of programmer time for relatively small gains in computer time. Thereby motivating hacks...

Friday, February 11, 2005

The Hack Spectrum

Near the end of an otherwise excellent presentation on the Boost shared pointer (boost::shared_ptr), Jonathan showed a slide with the title "Tricks." The slide explained how to create a smart pointer to a stack-based or static object. This immediately raised my "hack"les.

Pointing a smart pointer at an auto variable violates the semantics of the pointer. The fact that it is syntactically correct is not an excuse. In fact I consider this a defect in the design of the smart pointer -- a well-designed smart pointer would not allow such nonsense. [Note: what constitutes a well-designed smart pointer [IMO] is another topic that I might get into later...]

Jonathan defended the trick by saying "as long as you know what you are doing, it's ok." Alas, that's not correct. A more accurate statement would be: "As long as you and every programmer who touches the system either before or after you is aware of all of the implications of this trick, then it's not quite so bad." What makes this hack so insidious is that it doesn't break your code; it breaks mine, months or years later. What's worse, once I find the true cause of the "bug" in my code, I have lost confidence in the reliability of the smart pointer. This means that every time I use a smart pointer I have to consider and either eliminate or allow for the possibility that some programmer has violated the semantics of the pointer somewhere in the system.

In Jonathan's defense, the technique he described comes straight from the Smart Pointer Programming Techniques page on the Boost web site. My quibble is not with Jonathan, but with the author(s) of boost::shared_ptr and that web page.
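For reference, the trick in question looks more or less like this sketch (the null_deleter idea is the one that page describes; the rest of the details are mine):

#include <boost/shared_ptr.hpp>

struct null_deleter
{
  void operator()(void const*) const {}   // deliberately deletes nothing
};

boost::shared_ptr<int> dangerous()
{
  int local = 42;                                    // automatic storage
  boost::shared_ptr<int> p(&local, null_deleter());  // a "smart" pointer to it
  return p;  // local dies here; every copy of p now dangles
}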

This has made me ruminate for the last few days on the various hacks I have encountered, or in some cases perpetrated, over the years I have been developing software. Using the terminology from my early days as a programmer, I realized that there is a spectrum of hacks, ranging from a "slimy hack" (one that works but makes you want to wash your mind out with Lysol after you read it) to a "righteous hack" (one that makes you step back in awe at its beauty and clarity).

So I thought I'd document some of these historic hacks and explain where I think they fit on the slime-to-righteous scale. Watch this space...

Thursday, February 10, 2005

Swig vs C#

I spent some time recently evaluating SWIG. It's a nice way to expose your C or C++ interfaces to Perl or Python (and possibly other languages) -- although it has some perfectly understandable limitations on what types you can use. These limitations are surmountable by writing some helper functions.

Unfortunately, what the customer wanted was to access the interface "from .NET" (which I translated to: from C#).

SWIG has a -csharp option. [Digression: SWIG's arguments are order-sensitive! That is, "swig -csharp -c++" doesn't work! You have to say "swig -c++ -csharp". Not necessarily a confidence builder.] Unfortunately, the entire documentation for the -csharp option consists of about 20 lines in the manual, and these lines are not terribly helpful, since they show how to access global C variables using Mono's version of C#.

"Google swig csharp" or "google swig c#" didn't produce anything more helpful. (actually this blog may now show up in that search and I've already explained more than any of the other hits did (chuckle))

There's a SWIG wiki page, but when I rummaged around on it, the only relevant thing I found was a FAQ entry with no answer -- pretty useless.

Bottom line, I think it'll be easier to write a managed C++ bridge layer to expose the C++ interfaces to the .NET world. We'll see (if I ever get back to this issue.)

Monday, February 07, 2005

Linux on the Desktop

Mitch Kapor, as quoted in InfoWeek, says:

A desktop version of Linux hasn't advanced as rapidly as the server version for a number of reasons. Microsoft's dominance in the desktop operating-system market is a big reason, but Kapor suggested another: the nature of desktop Linux development. Each component that comprises a desktop operating environment, whether the graphical interface, productivity applications, or browser, is being developed by different groups with little collaboration.

"It's not principally a technical issue," Kapor said. Rather, it's been a lack of motivation for these groups of developers to create a unified interface for users.

I say the reason Linux on the desktop isn't more widespread is that the word processor in Open Office sucks (ahem, I mean has some fundamental flaws) -- even more so than Microsoft Word does!

Real people (i.e., the rest of them) don't care about operating systems. They care about reading mail and writing documents and browsing the web. Two outta three ain't good enuf.