
Bilinear Filtering (Interpolation)
by (13 January 1999)



Introduction


Bilinear interpolation is a process that enhances textures that appear larger on screen than they are in texture memory, i.e. when a single texel covers more than one screen pixel. Regular texture mapping looks 'blocky' in this case. Bilinear interpolation reduces the blockiness by interpolating between texels.

To be more precise: The texel coordinates are internally kept at a certain precision (usually 8:8 fixed point). During texel fetching, normally only the integer part is used to index the texels. With bilinear interpolation, 4 texels are fetched, and the fractional part of the texel coordinate is used to calculate a weight factor for each of these four texels, so that the final color of the pixel on screen is a mix of the four fetched texels. This process enhances enlarged texture maps, but also rotated texture maps, because for these the pixels on screen are effectively anti-aliased.

Obviously, this process requires substantial calculations per pixel. First, four weight factors must be determined. These are:

  1. Uf * Vf
  2. (1 - Uf) * Vf
  3. (1 - Uf) * (1 - Vf)
  4. Uf * (1 - Vf)

where Uf and Vf are the fractional parts of the texel coordinates (U,V).

Then the 4 texels must be fetched and multiplied by these factors. Finally, the four weighted colors must be added and drawn on the screen.
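The steps above can be sketched as a straightforward reference routine, before any of the table tricks described next. This is a minimal sketch of my own, not code from the article: it assumes 8:8 fixed-point texel coordinates and, for brevity, a single-channel (grayscale) texture.

```c
#include <stdint.h>

/* Reference bilinear fetch: u and v are 8:8 fixed-point texel
   coordinates, texels is a width x height array of 8-bit values.
   All names here are hypothetical, chosen for illustration. */
static uint8_t bilinear_sample(const uint8_t *texels, int width, int height,
                               uint32_t u, uint32_t v)
{
    uint32_t ui = u >> 8,   vi = v >> 8;    /* integer parts    */
    uint32_t uf = u & 0xFF, vf = v & 0xFF;  /* fractional parts */

    /* Wrap at the texture edges. */
    uint32_t u0 = ui % width,  u1 = (ui + 1) % width;
    uint32_t v0 = vi % height, v1 = (vi + 1) % height;

    /* Fetch the four texels. */
    uint32_t t00 = texels[v0 * width + u0];
    uint32_t t10 = texels[v0 * width + u1];
    uint32_t t01 = texels[v1 * width + u0];
    uint32_t t11 = texels[v1 * width + u1];

    /* The four weight factors from the list above; they sum to 65536. */
    uint32_t w11 = uf * vf;                  /* Uf * Vf             */
    uint32_t w01 = (256 - uf) * vf;          /* (1 - Uf) * Vf       */
    uint32_t w00 = (256 - uf) * (256 - vf);  /* (1 - Uf) * (1 - Vf) */
    uint32_t w10 = uf * (256 - vf);          /* Uf * (1 - Vf)       */

    return (uint8_t)((t00 * w00 + t10 * w10 + t01 * w01 + t11 * w11) >> 16);
}
```

With Uf = Vf = 0 this degenerates to an ordinary point-sampled fetch; everything that follows in the article is about making the weighted version cheap.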

To speed up this process, I decided to use only the four most significant bits of the fractional parts of my U/V coordinates, so Uf and Vf together fill 8 bits. For each of these 256 combinations I then precalculate the weight factors. Each weight factor is multiplied by 16 and stored as a byte, so the 'multiply table' (called multab in the sources) contains 256*4 bytes.
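A possible reconstruction of that table setup looks like this. The exact index layout (Uf in the high nibble, Vf in the low nibble) and entry order are my assumptions; the article only fixes the sizes (256 entries of 4 bytes, weights scaled by 16):

```c
#include <stdint.h>

/* 'multab': for every combination of the top 4 bits of Uf and Vf,
   the four bilinear weights scaled to sum to (at most) 16, one byte
   each -- 256*4 bytes total, as described in the article. */
static uint8_t multab[256][4];

static void init_multab(void)
{
    for (int uf = 0; uf < 16; uf++) {
        for (int vf = 0; vf < 16; vf++) {
            int i = (uf << 4) | vf;  /* assumed index layout: UUUUVVVV */
            /* Raw products are in 0..256; >>4 scales them to 0..16.
               The truncation here is one source of the accuracy loss
               mentioned later in the article. */
            multab[i][0] = (uint8_t)((uf * vf) >> 4);               /* Uf*Vf         */
            multab[i][1] = (uint8_t)(((16 - uf) * vf) >> 4);        /* (1-Uf)*Vf     */
            multab[i][2] = (uint8_t)(((16 - uf) * (16 - vf)) >> 4); /* (1-Uf)*(1-Vf) */
            multab[i][3] = (uint8_t)((uf * (16 - vf)) >> 4);        /* Uf*(1-Vf)     */
        }
    }
}
```

Note that because of the truncation the four weights in an entry sum to 16 only approximately (never more, sometimes less), which slightly darkens some mixes.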

To make the multiplication of the weight factor and the color fast, I decided to use palettized textures. Each texture can have up to 256 distinct colors, which is usually enough. Note that EACH texture can have its own palette, so the display will certainly not look as monochrome as Quake. Besides, these are only source colors; shading and transparency can increase the actual number of different colors used to draw a texture.

The 256 colors are stored in a table in the texture class. Each color is represented by its 5/6/5 or 5/5/5 value. Besides this original value there are 15 scaled versions of the color (one for each possible weight factor). So, to retrieve a color and multiply it by the correct weight factor, I can now simply look up the right color in the palette table. The palette table is, by the way, 256*16*2 bytes, that's 8Kb for each texture. Because I retrieve scaled versions of each color, two things should draw your attention:
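Building that 8Kb table could look roughly like this. The article doesn't give the layout, so the indexing scheme (16 consecutive entries per color, entry w-1 holding the color scaled by w/16, the last one being the original) is an assumption of mine; the sketch uses the 5/6/5 case:

```c
#include <stdint.h>

/* Per-texture scaled-palette table: 256 colors x 16 weight levels,
   packed 5/6/5 -- 256*16*2 = 8Kb, matching the size in the article.
   The entry layout is assumed, not taken from the sources. */
static uint16_t scaled_pal[256 * 16];

static void init_scaled_palette(const uint16_t *pal565)
{
    for (int c = 0; c < 256; c++) {
        int r = (pal565[c] >> 11) & 0x1F;
        int g = (pal565[c] >> 5)  & 0x3F;
        int b =  pal565[c]        & 0x1F;
        for (int w = 1; w <= 16; w++) {
            /* Scale each channel by w/16 and repack. Since the four
               weights of a pixel sum to at most 16, four such entries
               can be added without any channel overflowing its field. */
            uint16_t sr = (uint16_t)((r * w) >> 4);
            uint16_t sg = (uint16_t)((g * w) >> 4);
            uint16_t sb = (uint16_t)((b * w) >> 4);
            scaled_pal[c * 16 + (w - 1)] =
                (uint16_t)((sr << 11) | (sg << 5) | sb);
        }
    }
}
```

The no-overflow property is exactly point 1 below: even four fully white texels sum to at most the original white.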

1. The four scaled colors can be added and will never exceed the maximum R/G/B values, even when all four fetched texels were completely white.

2. Mixing four scaled versions of the same color results in potentially lower quality, because we lose some accuracy here and there.

The final pixel can now be drawn to the screen without further processing.
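Put together, the per-pixel work reduces to table lookups and additions. This is my sketch of that inner step, assuming a weight table and a scaled-palette table laid out as described above (the article's actual assembly inner loop is not given, and zero weights are simply skipped here since the table holds entries for weights 1..16 only):

```c
#include <stdint.h>

/* One bilinear pixel from pre-built tables: w4 holds the four weights
   (from a multab entry, sum <= 16), idx the four fetched palette
   indices, scaled_pal the 256*16 table of pre-scaled 5/6/5 colors.
   Because the entries are pre-scaled, packed colors can be added
   directly with no per-channel clamping. */
static uint16_t bilerp_pixel(const uint8_t w4[4],
                             const uint8_t idx[4],
                             const uint16_t *scaled_pal)
{
    uint16_t color = 0;
    for (int i = 0; i < 4; i++)
        if (w4[i])  /* weight 0 contributes nothing */
            color = (uint16_t)(color + scaled_pal[idx[i] * 16 + (w4[i] - 1)]);
    return color;
}
```

Four identical white texels with weights 4+4+4+4 come out slightly darker than pure white, which is the accuracy loss of point 2 above made concrete.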


So This Is Fast?


Not quite. There's an awful lot of texel fetching and weight factor retrieving going on, so AGI stalls are a disaster, not to mention cache misses... Some heavy VTune sessions helped me to get rid of the AGI stalls. The cache misses were reduced by storing the four weight factors in a single DWORD and retrieving them with a single 'mov eax,[]'. That way, the first two weight factors can easily be accessed using 'mov bl,al' and 'mov bl,ah', the other two after a 'shr eax,16'. The clock count is now down to 43 ticks per 2 pixels. There are no nasty stalls anymore, so the code is theoretically almost as fast as it can get.


You Said 'Almost'?


Yup. There's a cool document, called 'fatmap', that describes a technique to speed up texel fetching for texture mappers by altering the source texture maps. The texture maps are pre-processed so that the texels are stored in tiles of 4x3 pixels (or any other appropriate size). Each tile is stored linearly, so that when the first pixel of the tile is read (probably causing a cache miss) all the other pixels of the tile are available in the cache. Hence, three pixels in a row will be in the cache, no matter in which direction you are reading texels. When all your texels are stored sequentially, however, only horizontal texel reads benefit from the cache. I will probably implement this technique to be sure that it is as fast as it can possibly get...
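The address calculation for such a tiled layout can be sketched as follows. I use 4x4 tiles rather than the 4x3 mentioned above (the article allows "any other appropriate size"), because a power-of-two tile keeps the index math down to shifts and masks:

```c
#include <stdint.h>

/* Linear offset of texel (u,v) in a texture rearranged into 4x4 tiles,
   each tile stored as 16 consecutive texels. 'width' is assumed to be
   a multiple of 4. A hypothetical helper, not code from the article. */
static uint32_t tiled_offset(uint32_t u, uint32_t v, uint32_t width)
{
    uint32_t tiles_per_row = width >> 2;              /* tiles are 4 texels wide  */
    uint32_t tile  = (v >> 2) * tiles_per_row + (u >> 2); /* which tile           */
    uint32_t local = ((v & 3) << 2) | (u & 3);        /* position inside the tile */
    return tile * 16 + local;                         /* 16 texels per tile       */
}
```

The payoff is that the 2x2 quad fetched for one bilinear sample usually lands inside a single tile, i.e. a single cache line, instead of straddling two distant rows.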


And How Fast Is That?


That's twice as fast as Intel's code (as described on the MMX developer pages; that code needs 47 ticks per pixel). It's still 8 ticks faster than the FPU routine that a dude on the comp.alt.algorithms newsgroup mentioned (he didn't give the code though, just like me). But speed comes at a price, and the price is (as usual) accuracy. The bilinear interpolation is based on 4 bits for U and 4 bits for V. That means that between two adjacent texels only 16 gradients are calculated. Due to the way the weight factors are handled internally, it's actually even worse. But for bitmaps that are not zoomed in too far it looks FAR better than no bilinear interpolation at all, and it also looks FAR better than the 'bilinear interpolation' that Unreal uses in its software renderer (a dithered effect).


Closing


Greets to all cool coders out there. And one last note: if you are hardcore, and you know how to improve this stuff beyond the incredible level that I reached, let me know!

THE PHANTOM

     

Copyright 1999-2008 (C) FLIPCODE.COM and/or the original content author(s). All rights reserved.
Please read our Terms, Conditions, and Privacy information.