Instancing

Oliver Marsh
When we ask the GPU to render a sprite or mesh to the screen, the biggest bottleneck isn't the actual rendering of the pixels, but sending the data from the CPU to the GPU. What this comes down to is how many draw calls we issue each frame (a draw call in OpenGL being, for example, glDrawElements). Instancing is an optimisation technique in which we bundle as many render objects as possible (a render object being a sprite or mesh in our game world) into the same draw call.

In a perfect world we would have a draw call for each render object and set uniforms unique to that object: the color, the MVP matrix, the object's material, etc. However, with even a handful of render objects in our scene, our program would slow to an unplayable halt. So we want to share as many things as possible across render objects. The only things we strictly can't share are the vertex & index arrays we are drawing and the shader we are drawing with. The rest is up for grabs. Even for these two hard parameters there are workarounds, like having an uber mesh or an uber shader. The good news is that with 2D games we only use one vertex buffer in the game, a quad, and in this game one shader for sprites. Given this, there is no reason why we can't render our whole scene with one draw call. And this is what we're going to do!

To make our game code as simple as possible, the game code knows nothing about instancing. We can call 'pushSprite' or 'pushMesh' anywhere in the code, which adds the object to a render buffer. Then, when we are ready to issue the draw commands to the GPU, we call 'executeRenderGroup', which does all the heavy lifting.
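
As a sketch of what that render buffer might look like (the RenderItem fields mirror the snippets later in the post, but the fixed-size array and the exact pushSprite signature are my own simplification; the real engine uses a growable InfiniteAlloc):

```c
#include <assert.h>
#include <stddef.h>

// Simplified render buffer: game code appends items, no GL work happens here.
typedef struct {
    float PVM[16];          // per-object model-view-projection matrix
    float color[4];         // per-object tint
    float textureUVs[4];    // atlas rectangle, unused if untextured
    unsigned bufferHandles; // vertex/index buffers (the quad)
    unsigned program;       // shader program handle
    unsigned textureHandle; // 0 for untextured quads
} RenderItem;

#define MAX_RENDER_ITEMS 4096

typedef struct {
    RenderItem items[MAX_RENDER_ITEMS];
    int count;
} RenderGroup;

// Called from anywhere in game code; executeRenderGroup drains it later.
static RenderItem *pushSprite(RenderGroup *group) {
    assert(group->count < MAX_RENDER_ITEMS);
    return &group->items[group->count++];
}
```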

Since the render objects arrive unsorted with respect to which sprite or shader they use, we first have to sort them. This is the first part of our instancing algorithm. Our sort function is a quicksort that groups the objects based on a set of criteria. The criteria depend on how your game engine is set up, but my game looks for:

1. Same vertex handle (our quad)
2. Same shader program handle
3. Same texture handle (where a texture atlas comes into play)
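
These three criteria can be sketched as an ordinary qsort comparator (SortKey is an illustrative struct carrying only the sort keys; the field names mirror the snippets below):

```c
#include <assert.h>
#include <stdlib.h>

// Only the three sort keys; the real RenderItem carries much more.
typedef struct {
    unsigned bufferHandles; // vertex/index buffer pair (the quad)
    unsigned program;       // shader program handle
    unsigned textureHandle; // atlas page, 0 for untextured
} SortKey;

// Orders items so equal (buffer, program, texture) triples end up
// adjacent -- the property the batching loop below relies on.
static int compareRenderItems(const void *pa, const void *pb) {
    const SortKey *a = pa, *b = pb;
    if (a->bufferHandles != b->bufferHandles)
        return a->bufferHandles < b->bufferHandles ? -1 : 1;
    if (a->program != b->program)
        return a->program < b->program ? -1 : 1;
    if (a->textureHandle != b->textureHandle)
        return a->textureHandle < b->textureHandle ? -1 : 1;
    return 0;
}
```

With this, `qsort(items, count, sizeof(SortKey), compareRenderItems)` leaves each batch contiguous.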

It groups render objects that match on these criteria together. After this we can loop through them and collect the information unique to each render object, like its position and color. This is part two of our instancing algorithm. The loop looks like this:
```c
for(int i = 0; i < renderGroup->items.count; ++i) {
    // 'info' is the first item of the current batch; collect its data,
    // then gather every following item that matches it.
    RenderItem *info = getRenderItem(renderGroup, i);
    addElementInifinteAllocWithCount_(&pvms, info->PVM.val, 16);
    addElementInifinteAllocWithCount_(&colors, info->color.E, 4);
    if(info->textureHandle) {
        addElementInifinteAllocWithCount_(&uvs, info->textureUVs.E, 4);
    }

    bool collecting = true;
    while(collecting) {
        RenderItem *nextItem = getRenderItem(renderGroup, i + 1);
        if(nextItem &&
           info->bufferHandles == nextItem->bufferHandles &&
           info->textureHandle == nextItem->textureHandle &&
           info->program == nextItem->program) {
            //collect data
            addElementInifinteAllocWithCount_(&pvms, nextItem->PVM.val, 16);
            addElementInifinteAllocWithCount_(&colors, nextItem->color.E, 4);

            if(nextItem->textureHandle) {
                addElementInifinteAllocWithCount_(&uvs, nextItem->textureUVs.E, 4);
            } else {
                assert(uvs.count == 0);
            }
            i++;
        } else {
            collecting = false;
        }
    }

    // ...issue one instanced draw call for this batch, then clear the arrays...
}
```



So here we are looping through, collecting all the unique information, and putting it into a stretchy array. In our engine the three things we need to send to the shader are:
1. PVM matrix to position the object on screen
2. Color tint
3. The texture UV coords since we are using a texture atlas
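
The collection step amounts to flattening those three pieces of per-instance data into contiguous float arrays. A sketch with plain fixed-size arrays standing in for the engine's InfiniteAlloc (InstanceData and pushInstance are illustrative names):

```c
#include <assert.h>
#include <string.h>

#define MAX_INSTANCES 1024

// Flat per-instance data, ready to be uploaded to the GPU in one go:
// 16 floats of PVM, 4 of color, 4 of atlas UVs per instance.
typedef struct {
    float pvms[MAX_INSTANCES * 16];
    float colors[MAX_INSTANCES * 4];
    float uvs[MAX_INSTANCES * 4];
    int count;
} InstanceData;

static void pushInstance(InstanceData *d, const float pvm[16],
                         const float color[4], const float uv[4]) {
    assert(d->count < MAX_INSTANCES);
    memcpy(d->pvms   + d->count * 16, pvm,   16 * sizeof(float));
    memcpy(d->colors + d->count * 4,  color,  4 * sizeof(float));
    memcpy(d->uvs    + d->count * 4,  uv,     4 * sizeof(float));
    d->count++;
}
```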

The final part of our instancing algorithm is converting this array into a form that the GPU can access. Since arrays can't be very big in GLSL, I've gone for storing this info in a texture buffer. What this looks like is:

```c
typedef struct {
    GLuint tbo;    // the buffer object that holds the instance data
    GLuint buffer; // the texture object the shader reads it through
} BufferStorage;

BufferStorage createBufferStorage(InfiniteAlloc *array) {
    BufferStorage result = {};

    // Upload the array into a buffer bound to GL_TEXTURE_BUFFER
    glGenBuffers(1, &result.tbo);
    glBindBuffer(GL_TEXTURE_BUFFER, result.tbo);
    glBufferData(GL_TEXTURE_BUFFER, array->sizeOfMember*array->count, array->memory, GL_DYNAMIC_DRAW);

    // Create the texture and attach the buffer to it, four floats per texel
    glGenTextures(1, &result.buffer);
    glBindTexture(GL_TEXTURE_BUFFER, result.buffer);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, result.tbo);

    return result;
}
```


So it creates a GL_TEXTURE_BUFFER object, which we then copy the array data into. We create a buffer for each piece of per-instance data (one for the PVM, one for color & one for UV coords), then set the buffer handle via a uniform. We delete each buffer the next frame, to make sure we aren't deleting anything in the middle of rendering.
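
The delete-next-frame bookkeeping can be sketched as a two-generation rotation (FrameBuffers, beginFrame, and Handle are illustrative names, not the engine's real code; the retired handles are what you would pass to glDeleteBuffers/glDeleteTextures):

```c
#include <assert.h>

typedef unsigned Handle; // stands in for a GL buffer/texture id

// Two generations of the three instance buffers (PVM, color, UV):
// the previous frame's set is only freed once the new frame begins,
// so we never delete a buffer the GPU may still be reading.
typedef struct {
    Handle live[3];    // this frame's buffers
    Handle retired[3]; // last frame's buffers, safe to delete now
} FrameBuffers;

static void beginFrame(FrameBuffers *fb, const Handle fresh[3]) {
    for (int i = 0; i < 3; ++i) {
        fb->retired[i] = fb->live[i]; // delete these via glDeleteBuffers etc.
        fb->live[i] = fresh[i];
    }
}
```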

There is one more piece to the puzzle. To draw this way I use a modified version of glDrawElements called glDrawElementsInstanced. It is made specifically for instancing: we pass the usual glDrawElements arguments, plus one more specifying how many instances of this vertex array we want to draw. Then in our shader we have a handy predefined variable called gl_InstanceID. This gives us the index of the instance being drawn, which we use to pull that instance's info out of our arrays.

For accessing our PVM matrix we use the following code:

	int offset = 4 * int(gl_InstanceID);
	vec4 a = texelFetch(PVMArray, offset + 0);
	vec4 b = texelFetch(PVMArray, offset + 1);
	vec4 c = texelFetch(PVMArray, offset + 2);
	vec4 d = texelFetch(PVMArray, offset + 3);
	
	mat4 PVM = mat4(a, b, c, d);
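
The shader's indexing mirrors how the CPU side laid the data out: instance i's matrix occupies texels 4*i .. 4*i+3, one vec4 column per texel. The same arithmetic in plain C, just to make the layout explicit (fetchColumn is an illustrative name, not engine code):

```c
#include <assert.h>

// Equivalent of the shader's texelFetch indexing: pvmArray is the flat
// float array uploaded to the texture buffer; each texel is 4 floats.
static const float *fetchColumn(const float *pvmArray, int instance, int column) {
    return pvmArray + (4 * instance + column) * 4;
}
```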


This wraps up how I'm doing instancing for the game. There are many ways to make use of instancing; it comes down to how your engine is set up and what you can share amongst render objects. Can you use a texture atlas instead of separate textures? Can you use an uber shader instead of separate shaders? And for the unique data, what's the best way to retrieve it on the GPU?

Some more info on the topic:
Opengl Instancing tutorial
Randy Gaul lecture at Digipen

NOTE:
I tried using a standard glTexImage2D(GL_TEXTURE_2D… texture to store the unique data instead of a GL_TEXTURE_BUFFER, but I couldn't get it to work. I think it's because of the implicit remapping of values that happens behind the scenes, whereas GL_TEXTURE_BUFFER doesn't do this.

NOTE: To my dismay, GL_TEXTURE_BUFFER isn't supported on iOS. However, this isn't applicable now since they've moved to Metal (even more dismay :().


IMHO a much cleaner way to use matrices in instancing is using glVertexAttribDivisor for your matrices. Stick them in a regular vertex array buffer, one for each instance, and call glVertexAttribDivisor (4 times, because mat4 takes 4 attribute slots) with the divisor set to 1. Then you'll have a regular "in mat4 PVM;" attribute in the vertex shader. No need to deal with texture fetches. Same for other non-PVM matrix attributes that are unique per mesh/model.
That does sound cleaner. So is that a separate vao buffer than the one that is used for the mesh?
VAO does not matter here. What you should be thinking is about GL_ARRAY_BUFFER buffer. You can put it in same buffer as your main mesh data as long as you set up glVertexAttribPointer offset correctly. Or just put it in new buffer with offset 0, then update it whole and remember to bind it before setting attrib pointers.
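
A minimal sketch of the offsets involved, assuming a tightly packed GL_ARRAY_BUFFER of per-instance mat4s (MAT4_STRIDE and mat4ColumnOffset are illustrative names):

```c
#include <assert.h>
#include <stddef.h>

// A mat4 attribute takes four vec4 slots. With a tightly packed buffer
// of matrices, the stride per instance is 64 bytes, and column c lives
// at byte offset c * 16 within it.
enum { MAT4_STRIDE = 16 * sizeof(float) };

static size_t mat4ColumnOffset(size_t baseOffset, int column) {
    return baseOffset + (size_t)column * 4 * sizeof(float);
}
```

In GL terms you would loop c = 0..3 calling glVertexAttribPointer(loc + c, 4, GL_FLOAT, GL_FALSE, MAT4_STRIDE, (void *)mat4ColumnOffset(base, c)) followed by glVertexAttribDivisor(loc + c, 1).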
And each frame would I do a new glBufferData call for the same VBO to update it with the new matrix values? And am I right to assume this new glBufferData call deletes the old buffer data? Thanks mmozeiko for the help.
Yes, glBufferData will delete the old data. glBufferSubData will overwrite the old data keeping the allocation (though the buffer may get a shadow copy to avoid needing to wait on the GPU).
I was also wondering with this approach: since a VAO might be used to draw more than once in the frame, would I have to unbind the last VBO and bind a new one before each glDrawElementsInstanced call? Since if I use the same VBO handle I would overwrite the last instance batch's data.
When is a safe place to delete the buffer data: after the glDrawElements call or after the swapWindow call?
You can safely overwrite the previous draw call's data like that in OpenGL. That guarantee is part of its immediate mode roots.

If the draw call isn't done yet the driver will duplicate the buffer transparently.

Thank you for the help ratchetfreak, I'll see how I go.
Also you can take this approach and add several buffers, revolving through them each frame, so you don't get sucky perf hits when OpenGL starts to do copies behind your back. Although I am not sure if you can avoid this easily in OpenGL.
Avoiding that overhead means diving into AZDO. But it kinda means poking at a black box and hoping the internals work out like you want them to.
I moved over to using the above method, and it works fine. However I did run into an interesting bug. I draw with two different shaders: one for drawing textures and one for drawing colored quads. The colored quad shader doesn't have an attribute for the UV coordinates (used for looking up into the texture atlas), but they both have one for the PVM matrix & color tint.

in mat4 PVM;
in vec4 color1;
in vec4 uvAtlas; //not in the colored quad shader


On my Windows computer the scene rendered fine and there were no problems. But then I ran it on my Mac, and the textures were rendering but the colored quads weren't. The only difference is that I branch to see if it is a texture or not. If it is, I set the UV attribute; if it isn't, I don't.

if(isTexture) {
    GLint UVattrib = getAttribFromProgram(program, "uvAtlas").handle;
    glEnableVertexAttribArray(UVattrib);
    glVertexAttribPointer(UVattrib, 4, GL_FLOAT, GL_FALSE, offsetForStruct, ((char *)0) + sizeof(float)*20);
    glVertexAttribDivisor(UVattrib, 1);
}


So with this branch it didn't draw the quads on Mac. I put a dummy attribute in the quad shader for the UV coordinates, commented out the if statement, and it worked.
in mat4 PVM;
in vec4 color1;
in vec4 uvAtlas; //now in both shaders

So I'm not sure what caused the bug.
This code seems to be correct. By "colored quads not rendering" do you mean they are not showing up on screen at all? Maybe you can use transform feedback to capture the varying output from the vertex shader, to verify the values seem reasonable. Do you have the ARB_debug_output extension available on Mac? If yes, enable it and check if it shows some error.

It is better not to query attribute locations at runtime. If you are on GL 3.3 or newer, specify them directly in the shader: "layout (location=2) in vec4 uvAtlas;". On lower GL versions, bind the location before linking the shader with glBindAttribLocation. Basically, avoid glGet... functions as much as possible.

This way your shaders will have exactly the same attribute locations between them, so you can share the same vertex array buffers between them.

Another option is to not use different shaders for this. Create a dummy 1x1 texture with (255,255,255) color. Then use your textured shader to draw colored polygons: multiply the sampled texture result with the color passed in. The performance could potentially be better because you'll have fewer state changes in GL.
Thanks Martins for the suggestions.

You said:
This way your shaders will have exactly the same attribute locations between them, so you can share the same vertex array buffers between them.


Just to clarify one point: I'm sharing the same VAO handle and the same VBO handle for both shaders. Should I be doing this if they have different attribute locations, i.e. if I just use glGetAttribLocation?

As well, is there a perf hit if I use glGetAttribLocation at init time and cache the locations, vs using glBindAttribLocation?

The white dummy texture sounds like a good idea, I'll definitely do that.

I have debug output turned on, but there don't seem to be any errors. As well, by "not drawing" I mean not appearing on the screen where they normally are. I'll have a go using transform feedback to see if the values are reasonable. I did hard-code the color in & didn't get any changes.

With glBindAttribLocation you can make the locations the same for all shaders, which allows you to stop rebinding them every time.
This has more to do with shaders than with the VAO or VBO. When you compile different shaders, the attributes (and also uniforms) get assigned "random" locations, which by default are not controlled by you. This means that even if you name them the same way ("in vec2 texcoord;"), and even if you put them in the same order, different shaders can have different locations assigned to these attributes.

So whenever you want to switch to a different shader, you'll need to bind a different VAO, or in the case of the same VAO you'll need to do glVertexAttribPointer + enable/disable them, which is potentially expensive. This is what specifying fixed locations in the shader gives you: you won't need to change attrib pointers anymore. You do this with "layout (location=N) in vec2 texcoord;" on GL 3.3+, or with glBindAttribLocation on lower versions.
Ah, got it. Thank you Martins.
RenderDoc can sometimes help when something doesn't work.
But does it work on macOS?
Oops, didn't think about that.
I think the bug might be due to what Martins was saying about using glGetAttribLocation vs glBindAttribLocation, and some misunderstanding of OpenGL on my behalf.

Just to clarify how opengl works:

When I call:
glBufferData(GL_ARRAY_BUFFER, ...)

OpenGL knows to put that data in the buffer from the last call to:
glBindBuffer(GL_ARRAY_BUFFER, vboHandle)

And when I call glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ...) for the indices, how does OpenGL know which GL_ARRAY_BUFFER I'm referring to? Is it the last previous call to glBindBuffer? And when I do a call to glVertexAttribPointer, is it again the last GL_ARRAY_BUFFER bound that that attrib pointer is talking about?

GL_ELEMENT_ARRAY_BUFFER and GL_ARRAY_BUFFER can be bound at the same time. That's the whole point of binding targets. There are actually many more buffer targets that can be bound: http://docs.gl/gl4/glBindBuffer

Think of glBindBuffer as the following function:
static std::map<GLenum , GLuint> GlobalBufferMapping;
void glBindBuffer(GLenum target, GLuint buffer)
{
    GlobalBufferMapping[target] = buffer;
}


Then whenever you use a function such as glBufferData, it works like this:
void glBufferData(GLenum target, GLsizeiptr size, const GLvoid* data, GLenum usage)
{
    GLuint actualBuffer = GlobalBufferMapping[target];
    ActualCallToBufferData(actualBuffer, size, data, usage);
}


glVertexAttribPointer always works on the buffer currently bound to the GL_ARRAY_BUFFER slot, because vertex data is stored in a buffer called the "[vertex] array buffer".

This all goes away with the GL_ARB_direct_state_access extension (in core starting with version 4.5). No more glBind... calls; you just pass the GLuint buffer handle directly to functions.

Thanks Martins again, that made a lot more sense.