flipcode - Dirtypunk's Column

Dirtypunk's Column - Issue 01 - Hardware Rendering Optimizations
by Conor "DirtyPunk" Stokes (07 March 2000)

Introduction

Well, for the sake of all involved, I have decided to start putting together a few tutorials here at uni while waiting for my lectures. Kurt and I had discussed a column, and now I have the time. I'm planning to surprise him with my first update, on a subject he suggested: optimizing hardware rendering through your friendly neighborhood graphics API.

You see, you may think there is a great deal of information on this on the net, and if you look hard enough, I suppose you would be justified in saying that. But unfortunately not all of this information is valid, valuable or even explanatory.

Here I'm first going to start off with a few principles -

Avoid unnecessary state changes.
Avoid unnecessary calls to your primitive drawing routine - pass as much as you can at once.
Write stuff to the accelerator, but never read it. Readbacks have a tendency to stall the rather fragile pipeline/FIFO that has been implemented.

Now, these seem rather general so let me start explaining one by one.

State Changes

Renderer state changes are things which involve, funnily enough, changing the state of the renderer. These include such wonderful items as changing textures, render modes and blend modes. These are slow operations which usually change hardware registers. The usual way to optimize this is to sort by material, which means you can set each material once and then render all the primitives with that material. A quick way to do this sort is with a texture hash.


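Eg:
#include <list>
#include <vector>

class Polygon;

namespace RenderObjects {
      // One bucket of polygons per texture handle (std::list stands in for
      // a hand-rolled linked list). Resize it to the texture count at startup.
      std::vector<std::list<Polygon *>> Hash;
}

class Polygon {
      // Coordinates and stuff
      unsigned int m_texture_handle;

public:
      void AddObject();
};

// Queue this polygon in the bucket for the texture it uses.
void Polygon::AddObject() {
      RenderObjects::Hash[m_texture_handle].push_back(this);
}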

This should be fairly obvious - The polygons are inserted into the hash based on the texture handle they use. This can be extended to do entire materials like shaders, including blend mode information etc.
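As a rough illustration of extending the texture hash to whole materials, the sketch below keys the buckets on everything that forces a state change. All the names here (MaterialKey, BlendMode, Poly, RenderQueue) are made up for the example, and the counter only marks where a real renderer would issue its state-setting and draw calls.

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical blend modes; a real engine would have its own set.
enum class BlendMode { Opaque, Additive, Alpha };

// Everything that forces a state change goes into the key.
struct MaterialKey {
    unsigned int texture;
    BlendMode blend;
    bool operator<(const MaterialKey &o) const {
        return std::tie(texture, blend) < std::tie(o.texture, o.blend);
    }
};

struct Poly {
    MaterialKey material;
    // Coordinates and stuff
};

class RenderQueue {
    // Polygons bucketed by material; the map keeps equal materials together.
    std::map<MaterialKey, std::vector<const Poly *>> buckets_;

public:
    void Add(const Poly &p) { buckets_[p.material].push_back(&p); }

    // Walks the buckets, "setting" each material once, then empties the queue.
    // Returns the number of material changes that were needed.
    std::size_t Flush() {
        std::size_t state_changes = 0;
        for (const auto &bucket : buckets_) {
            ++state_changes;                 // set texture + blend mode here
            for (const Poly *p : bucket.second) {
                (void)p;                     // draw p here
            }
        }
        buckets_.clear();
        return state_changes;
    }
};
```

However many polygons share a material, Flush touches the renderer state only once for that material, which is the whole point of the sort.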

Now lightmaps can be more of a problem. Lightmaps produce lots of little maps. The obvious solution is just to put them into a bigger texture, and change your texture coordinates. This is part of the solution, but how do you hash sort them, and how do you guarantee efficiency?

If you just randomly throw the lightmaps in together you will probably end up doing just as many state changes anyway. If you just group them by material, you end up with big maps full of lightmaps that will never be visible at the same time.

The way to get your combination right is to put together lightmaps from polygons that share both good geometric proximity AND surface material. Surface material has the higher priority, but geometric proximity must be taken into account if you have more small maps than will fit into one big map.
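One possible way to sketch that grouping is below. It assumes every lightmap is the same size, that a big map is a square grid of slots, and that each lightmap carries a "cell" number standing in for geometric proximity (a grid cell or leaf index, say). Lightmap, AssignAtlases, UV and RemapToAtlas are names invented for this example, and starting a fresh big map on every material change is a simplification that wastes some space.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Lightmap {
    unsigned int material;  // surface material of the owning polygon
    unsigned int cell;      // hypothetical spatial key (grid cell, leaf number...)
    int atlas;              // which big map it lands in (filled in below)
    int slot;               // slot index inside that big map (filled in below)
};

// Sorts by material first and spatial cell second, then fills big maps in
// that order. Returns the number of big maps used.
int AssignAtlases(std::vector<Lightmap> &maps, int slots_per_atlas) {
    assert(slots_per_atlas > 0);
    std::sort(maps.begin(), maps.end(),
              [](const Lightmap &a, const Lightmap &b) {
                  if (a.material != b.material) return a.material < b.material;
                  return a.cell < b.cell;
              });
    int atlas = -1;
    int used = slots_per_atlas;  // forces a new big map for the first lightmap
    for (std::size_t i = 0; i < maps.size(); ++i) {
        bool new_material = i > 0 && maps[i].material != maps[i - 1].material;
        if (used == slots_per_atlas || new_material) {
            ++atlas;
            used = 0;
        }
        maps[i].atlas = atlas;
        maps[i].slot = used++;
    }
    return atlas + 1;
}

struct UV { float u, v; };

// Remaps a 0..1 lightmap coordinate into the rectangle of the given slot,
// for a big map laid out as slots_per_row x slots_per_row slots.
UV RemapToAtlas(UV local, int slot, int slots_per_row) {
    float scale = 1.0f / slots_per_row;
    return { (slot % slots_per_row + local.u) * scale,
             (slot / slots_per_row + local.v) * scale };
}
```

Because the sort puts material first, a material whose lightmaps overflow one big map gets split along the cell order, so neighbouring surfaces tend to end up sharing a texture.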

Then at render time, to get your state sorting right for these big maps, you use a double hash (that is, a table within a table). The first array is indexed by material, and the second by lighting, so each polygon to be rendered has a material number and a light number associated with it. Then you can set your base material states once per material, and set your lighting texture states once per entry in the secondary hash. Often you may have two lots of multitexture passes, and may only use the light hash info to build offset information for rendering the light pass.
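A minimal sketch of such a table within a table might look like the following. DoubleHash, Poly and Counts are hypothetical names, and the counters only mark the spots where the real state-setting and draw calls would go.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Poly {
    unsigned int material;  // index into the first table
    unsigned int light;     // index into the second table (which big lightmap)
};

class DoubleHash {
    // table_[material][light] lists the polygons sharing both.
    std::vector<std::vector<std::vector<const Poly *>>> table_;

public:
    DoubleHash(std::size_t materials, std::size_t lightmaps)
        : table_(materials, std::vector<std::vector<const Poly *>>(lightmaps)) {}

    void Add(const Poly &p) {
        assert(p.material < table_.size());
        assert(p.light < table_[p.material].size());
        table_[p.material][p.light].push_back(&p);
    }

    struct Counts { std::size_t material_sets, light_sets, polys; };

    // Walks material by material, then lightmap by lightmap within each.
    Counts Render() const {
        Counts c = {0, 0, 0};
        for (const auto &by_light : table_) {
            bool material_set = false;
            for (const auto &polys : by_light) {
                if (polys.empty()) continue;
                if (!material_set) {
                    ++c.material_sets;     // set base material states here
                    material_set = true;
                }
                ++c.light_sets;            // bind the lightmap texture here
                c.polys += polys.size();   // draw the polygons here
            }
        }
        return c;
    }
};
```

Empty slots cost nothing but a check, so sizing the table to the full material and lightmap counts keeps the lookup a plain pair of array indexes.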

Avoiding Unnecessary Draw Calls

Well, this may not be obvious. It's just a procedure call, right? Not much overhead? Wrong. Draw calls are extremely expensive, with setup etc., and profiling will often show this is where most of your time is spent. The best way to maximize rendering speed is to pass as many triangles as you can through the single most optimal path, in as few calls as possible :)

But, we have some problems. Draw calls in OGL and D3D can draw only in one renderer state, and if drawing strips or fans they need a call per strip/fan. The best way to pass all your triangles is to use an array or vertex buffer, and some form of indexed primitive, usually discrete triangles. This way you can copy all your indices and vertices into a single vertex array/buffer, lock it and draw it.

But, and here is the catch - remember that in the previous paragraph I said you can only draw in one render state per call? Ooh, that makes it tougher. So, what you do is reuse those lovely hash tables again, and put your geometry into arrays for each material.
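To make the per-material arrays concrete, here is a small sketch that gathers triangle indices into one list per material, so each material needs a single indexed draw. Triangle, Batches, BuildBatches and Submit are names made up for this example, and Submit only counts the calls a real renderer would make (for instance one glDrawElements per batch).

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Triangle {
    unsigned int material;
    unsigned int v[3];      // indices into one shared vertex array
};

// One index list per material, all referring to the same vertex array.
typedef std::map<unsigned int, std::vector<unsigned int>> Batches;

Batches BuildBatches(const std::vector<Triangle> &tris) {
    Batches batches;
    for (const Triangle &t : tris) {
        std::vector<unsigned int> &idx = batches[t.material];
        idx.insert(idx.end(), t.v, t.v + 3);
    }
    return batches;
}

// Stand-in for the real submission: one draw call per non-empty batch.
// Returns how many draw calls would be issued.
std::size_t Submit(const Batches &batches) {
    std::size_t calls = 0;
    for (const auto &b : batches) {
        if (b.second.empty()) continue;
        // set material b.first, then draw b.second.size() / 3 triangles
        ++calls;
    }
    return calls;
}
```

The number of calls then depends on how many materials are in view rather than how many polygons.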

In some cases, it is better to put your geometry into long strips or fans. In many cases it isn't, but on various cards (namely the GeForce, which loves strips) you can get more benefit from lots of long strips passed through multiple calls than from passing the whole kit and caboodle through a single call to, say, glDrawElements. For example, tessellated bezier surfaces are extremely easy to strip, so at higher tessellation levels it might be worth using strips.
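As an example of how easily a tessellated surface strips, the sketch below builds one triangle strip per row of quads. It assumes the tessellated vertices are stored row by row in a single array; BuildGridStrips is a hypothetical helper, and which winding counts as front-facing depends on your own conventions.

```cpp
#include <cassert>
#include <vector>

// Returns one triangle strip (as vertex indices) per row of quads, for a
// patch tessellated into rows_of_verts rows of verts_per_row vertices each.
std::vector<std::vector<unsigned int>> BuildGridStrips(unsigned int verts_per_row,
                                                       unsigned int rows_of_verts) {
    assert(verts_per_row >= 2);
    std::vector<std::vector<unsigned int>> strips;
    for (unsigned int y = 0; y + 1 < rows_of_verts; ++y) {
        std::vector<unsigned int> strip;
        for (unsigned int x = 0; x < verts_per_row; ++x) {
            strip.push_back(y * verts_per_row + x);        // vertex on this row
            strip.push_back((y + 1) * verts_per_row + x);  // vertex on the next row
        }
        strips.push_back(strip);
    }
    return strips;
}
```

Each strip of 2 * verts_per_row indices covers 2 * (verts_per_row - 1) triangles and would go through its own strip draw call, so the higher the tessellation, the longer the strips and the better the trade against one big triangle list.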

Also note that passing triangles in strippable order is a good optimization. Auto strip joining can be done simply by checking for double edge use, and I imagine it is done in some cases, as Q3 is optimized for it. Even if it isn't, you do get higher cache efficiency, and it is better for cards like the GeForce, which have an actual vertex cache that stores transformed vertices in on-board memory.
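To get a feel for why ordering matters, here is a toy model of a transformed-vertex cache. The FIFO replacement policy and the cache size are assumptions made for illustration, not a description of any particular card, and CountCacheMisses is a made-up helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Counts how many vertices would need transforming if the card kept the
// last cache_size transformed vertices around in a simple FIFO.
std::size_t CountCacheMisses(const std::vector<unsigned int> &indices,
                             std::size_t cache_size) {
    assert(cache_size > 0);
    std::deque<unsigned int> cache;
    std::size_t misses = 0;
    for (unsigned int index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end())
            continue;                    // already transformed: nothing to do
        ++misses;                        // has to be transformed
        cache.push_back(index);
        if (cache.size() > cache_size)
            cache.pop_front();           // oldest entry falls out
    }
    return misses;
}
```

Feeding it the same four triangles of a small band of geometry, once in strip order and once shuffled, gives fewer misses for the strip order, because each triangle reuses vertices that are still in the cache.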

Also, for those systems with lovely SIMD instructions, it is best to pass each vertex as 3 floats, aligned to 16 bytes. This works well for SIMD, as the data can be thrown straight at the processor for transformation. It is a simple stride, and you should be doing it with a single 4-byte pad between verts. Again, this is an optimization Q3 uses, so I would expect it to be used.
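In C++ the padded layout can be written down directly. This is only a sketch: PaddedVertex and kVertexStride are illustrative names, and alignas needs a C++11 compiler.

```cpp
#include <cstddef>

// Position as 3 floats plus one float of padding, so every vertex in an
// array starts on a 16-byte boundary.
struct alignas(16) PaddedVertex {
    float x, y, z;
    float pad;      // unused; keeps the stride at 16 bytes
};

static_assert(sizeof(PaddedVertex) == 16, "stride should be 16 bytes");
static_assert(alignof(PaddedVertex) == 16, "vertices should be 16-byte aligned");

// The stride to hand to the API when describing the vertex array.
constexpr std::size_t kVertexStride = sizeof(PaddedVertex);
```

With OpenGL vertex arrays this would be passed along the lines of glVertexPointer(3, GL_FLOAT, kVertexStride, verts), which tells the API to read 3 floats per vertex and skip the pad.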

Write But Don't Read

Reading data back from your accelerator is going to make your performance bite worse than Mike Tyson. These no-nos include reading from the frame buffer across the bus, reading from the z-buffer across the bus, reading back vertices stored on a T&L card, etc. While this may seem to be a pain in your large and hairy, in most cases a readback is not needed. After all, the AGP bus and your graphics card are made for writing things. The occasional readback may be necessary for specific applications (for example, if you are using graphics hardware to pre-compute visibility using extended projection methods), but it should not be used during in-engine rendering. Visible surface testing gets away with it because it is an application which pushes a lot more polygons, and the readbacks are all done at the end, after rendering.

The End

Okay, I hope this will be enough for you guys to get your brains ticking over for now. Just remember, stay out of trouble and run like hell if someone on #flipcode catches you with their momma.

Conor "DirtyPunk" Stokes

Article Series:

Dirtypunk's Column - Issue 01 - Hardware Rendering Optimizations
Dirtypunk's Column - Issue 02 - Line Dualities: Plucker Space And You
Dirtypunk's Column - Issue 03 - Visibility Theory
Dirtypunk's Column - Issue 04 - View Independent Progressive Meshes
Dirtypunk's Column - Issue 05 - AABB Trees - Back To Playing With Blocks