Building a 3D Portal Engine - Issue 10 - Intermezzo - 8/15/16/32 Bit Color Mixing by (05 March 1999) Return to The Archives
 Introduction
This time something completely different: Color mixing, transparency and getting those fast. Cool for your textured windows, and projected lights. Getting the job done in 24 bit is easy, but what about 16, 15 and 8 bit displays?

When you first see the topic of this article, you might ask yourself: Isn't that a bit off-topic? The answer is: Yes and no - Yes, your average portal renderer doesn't need it. And no, because color mixing techniques are needed for a lot of purposes: Gouraud shading, phong shading, transparant polygons and projected ligths all need to do something with two colors. In the case of a gouraud shaded texture mapper you want to mix an R/G/B value with a texel before plotting it on the screen. In the case of transparent polygons you want to mix a pixel with pixels already in the screen buffer.

There are many ways to combine two colors:
• Add color two to color one
• Multiply color one by color two
• Add a percentage of color two to a percentage of color one.
• The first way may seem the easiest: Simply add red, green and blue values, and voila, there's your final color. It's not that simple however, since each color component has a maximum value, after wich it either goes back to black, or, worse, overflows in the next color component. So, you have to perform some sort of clipping. The second way is great for shading: If you interpolate an intensity over a gouraud shaded polygon, you want to be able to make it completely black, or to give it it's original color, or light it up to pure white. Here are two problems: First, three multiplications for the three color components are exceptionally expensive. Second, if you multiply by more than one (to make the pixel brighter than the original texture), you have to perform clipping again. The last way is perfect for transparent surfaces. You can do it very easy by dividing both source colors by two and adding them: This is almost 50% of color one, plus 50% of color two. I said almost: You lose one bit precision here, so if you mix two identical colors, the result is not always the same as the source colors.

Here's some sample code for the three techniques. It works for 15, 16, 24 and 32 bit, as long as you define the constants REDMASK, GREENMASK, BLUEMASK and the shifts for the colors correctly. The 'shifts' represent the position of each color component in the 32 bit value. For 24 and 32 bit those should be:

 ``` #define REDMASK 0x0F00 #define GREENMASK 0x00F0 #define BLUEMASK 0x000F #define REDSHIFT 16 #define GREENSHIFT 8 #define BLUESHIFT 0 Adding colors: long addColors (long color1, long color2) { int red=(color1 & REDMASK) + (color2 & REDMASK); int green=(color1 & GREENMASK) + (color2 & REDMASK); int blue=(color1 & BLUEMASK) + (color2 & BLUEMASK); if (red > REDMASK ) red = REDMASK; if (green > GREENMASK) green = GREENMASK; if (blue > BLUEMASK ) blue = BLUEMASK; return (red+green+blue); } Multiply: long multiplyColors (long color1, long color2) { int red=(color1 & REDMASK) >> REDSHIFT; int green=(color1 & GREENMASK) >> GREENSHIFT; int blue=(color1 & BLUEMASK) >> BLUESHIFT; int redmul=(color2 & REDMASK) >> REDSHIFT; int greenmul=(color2 & GREENMASK) >> GREENSHIFT; int bluemul=(color2 & BLUEMASK) >> BLUESHIFT; red = (red * redmul) >> GREENSHIFT; green = (green * greenmul) >> (REDSHIFT-GREENSHIFT); blue = (blue * bluemul) >> GREENSHIFT; return (red + green + blue); } Mix 50% of both colors: long mixColors (long color1, long color2) { color1 >>= 1; color2 >>= 1; color1 &= (REDMASK>>1)+(GREENMASK>>1)+(BLUEMASK>>1); color2 &= (REDMASK>>1)+(GREENMASK>>1)+(BLUEMASK>>1); return (color1 + color2); } ```

Those are the basics. Now, let's get it fast.

The first algorithm I showed (adding two colors) is already as fast as I could get it, without reverting to assembler. In case you want to know how to get it faster using some assembler code, here's a hint: Use the carry, and if you have a PII, use the conditional move instruction. That way you get your clipping without branches. I would appreciate it if someone could send me the MMX equivalent, by the way.

The third algorithm is also as good as it gets. Even assembler can't help you here, if you are willing to take the small accuracy penalty, this is the way to go. Especially for 15 bit graphics, you might want to add the colors one component at a time, and do the shift afterwards. This doesn't make much sense for 32bit graphics, especially not when they are moving.

The second algorithm can be done much better. First thing you should notice as a good graphics coder is that we multiply by small values: 0..255, for the 24 bit version. So, this can be sped up perfectly using a small LOT (look-up table). Just build an with 256*256 bytes, and calculate each element by multiplying the row index by the column index, divided by 256. A 65536 element array might be a bit too much, especially for the cache, so you might want to use a 256*64 or even 256*32 array (8192 bytes), because having more levels of each colors doesn't really make sense anyway.

Now that I've introduced look-up tables, we can safely move to 8 bit graphics, because look-up tables are THE way to achieve good results at low color depths.

A friend of mine once implemented a cross-fader that would fluently interpolate between two 8 bit images, without changing the palette. The overall effect was like it was rendered using hi-color, but it ran fine on a 486. How did he do that? Simple, using a couple of LOT's.

First, he constructed some arrays, containing 16 bit versions of the colors in the palette of the image. Each array had 256 elements, each holding a word. He constructed multiple arrays like that, one for each intensity that he needed. He wanted to have transitions in 8 steps, so he needed 8 tables. Next, he constructed a color cube, with 65536 elements. This is a three-dimensional array, 32*64*32. Each element of this cube represents a 16 bit color. Now comes the good part: Each of these 65536 color combinations was then linked to a color in the actual palette, by searching the palette for the 'best match': If you compare the red, green and blue components of a color in the color cube to each color in the existing palette, the best match is the color with the smallest squared distance:

 ` distance = (red1-red2)^2 + (green1-green2)^2 + (blue1-blue2)^2; `

This 'best match' is the value that is stored in the color cube.

Now the main loop is simple. He took two colors, and used the intensity lookup tables to determine for each color the scaled down 16 bit value. Because the intensities of both colors always summed to 100% (e.g., 25% of color 1, 75% of color 2) he could always be sure that the summed color would be in the range 0..65535, without color components bleeding into each other. The resulting value was then looked up in the color cube, resulting in the 'best match' for the mixed color in the existing palette. This color is then plotted on the screen...

So, look-up tables are good. But what about hi-color and true-color?

When dealing with higher color depths, you have to be careful: Tables get huge, and besides taking up large amounts of precious memory, they are bad for the cache. A Pentium II that suffers a cache miss has to fetch data from the main memory, and this takes a lot of cycles. If you have a large table, you can be sure that almost every look-up causes a cache miss, unless you can arrange that you read from the same region in the table a couple of times. If you can't, it might be better to do an expensive calculation instead. An example: Focus used to mix light intensities for the beams with the textures using a look-up table. There where 32 intensities for the light, and and that time Focus rendered only in 16 and 15 bit. So, a look-up table for a single light color was 65536*32 words, wich is 4 megabyte. An advantage was that instead of having to add red, green and blue separately (with clipping) to the texel colors, I could look-up the final color immediately. In this case, there was no significant speedup... Just a big waste of memory, plus the limitation of only a single light color.

Focus does however make good use of look-up tables too: The bilinear interpolation heavily depends on it. To get the bilerp fast, I restricted each texture to a palette of 256 colors, wich is not bad, since each texture can have it's own palette. Then, for each palette entry I calculate 16 intensities. So, instead of having a palette table with 256*4 bytes, I have a palette table with 256*4*16 bytes, in wich I can directly look-up any of the sixteen intensities of the palette colors. The bilinear interpolation algorithm works as follows: When a texel is fetched at coordinate (U,V), the fractional parts of U and V are used to determine how much of the texel at (U,V) is used for the final color, and how much for the three neighbours, (U+1, V), (U, V+1) and (U+1, V+1). If you scale the fraction for each of the four texels to a value between 0 and 15 (make sure that they sum to 15), you can look-up the four colors in the intensity table, and add them without further hassle. The fraction for each component is determined using another look-up: I store U and V as fixed point values, meaning that effectively each value is multiplied by 256 (so 8 bits are added on the right side), so that the 8 extra bits can be used to represent the fractional part. Now I can simply take the fractional parts of U and V (wich are values between 0 and 255), divide them by 16 (so they are between 0 and 15) and use them as an index in a two-dimensional array, wich holds four values:
• (U_frac * V_frac) : The weight of texel (U, V)
• ((1-U_frac) * V_frac) : The weight of texel (U+1, V)
• ((1-U_frac) * (1-V_frac)) : The weight of texel (U+1, V+1)
• (U_frac * (1-V_frac)) : The weight of texel (U, V+1)
• Note that the look-up of the weightfactors is cache-friendly: The first lookup probably causes a cache-miss, but the other three are right there, since the cache always reads 32 bytes at once. This results in true bilinear interpolation (well, almost 'true', since there are only 16 levels in both horizontal and vertical directions), at a very acceptable frame rate. In fact, my code is 25% faster than code that Intel presents on their MMX technology pages. Their MMX implementation of bilinear interpolation takes 47 ticks per pixel, while I get away with less than 30... Add a few cache misses, and your Pentium code still blows an MMX away. And not just with 24 bit graphics, like the MMX code demands.

That was an example of more advanced color mixing. My advice is: Use your brains, use look-ups, but keep an eye on the cache. Clever algorithms make graphics soar, and you don't need a 3D accelerator to do it for you.