72. Gotta Save Fast

Been a little busy with various things, including dealing with my girlfriend’s dog having gotten run over. He’s okay but with a broken hip. Somehow, getting a dev log post out in time for the end of November totally slipped my mind. In any case, if you haven’t been keeping up with the video dev logs, now would be a good time. The latest one is a bit beefy:

With that said, I’d like to dive a bit more deeply into what was involved in getting the save game hitch down to an acceptable level. As of writing this post, there’s still a hitch doing the actual saving, so the job is really only halfway done. However, I did do some work to get screenshot saving down to an acceptable level, and I think that’s a problem that many other games have, so it may be more useful to cover it in detail.

If you just want a decent solution to the problem, then feel free to skip to the end of the article to the section labeled “The Solution” and take a look at the code posted there. For the rest of you interested in hearing my journey to get there, read on!

The Problem

So why are we wanting to save screenshots as part of our save game routine? Well, the answer here is pretty simple, which is that we want the load game menu to show a screenshot of where the player was when they last saved. That way they can easily tell whether they are picking the right file to load:

Unity’s API really only gives you a few functions to handle taking screenshots. The first one, and the easiest to use, is ScreenCapture.CaptureScreenshot. It’s a simple-to-use function that takes a screenshot and saves it to a PNG file. This is great, right?

The problem is that it causes a massive lag spike. The peak of this single frame spike is close to 66 milliseconds. Needless to say, this is unacceptable for a game that is targeting under 16.66ms per frame (and don’t even get me started about 8.33ms for 120FPS, which is apparently so impossible that Unity didn’t see fit to label it in this graph).
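For reference, those frame budgets are just 1000ms divided by the target frame rate. Here is a tiny sketch in plain C# (no Unity required; the FrameBudget name is made up for illustration) that also shows roughly how many 60FPS frames a 66ms spike eats:

```csharp
using System;

static class FrameBudget
{
    // Milliseconds available per frame at a given target frame rate.
    public static double Milliseconds(double fps) => 1000.0 / fps;

    // Number of whole frames (rounded up) that a spike of the given length covers.
    public static int FramesCovered(double spikeMs, double fps) =>
        (int)Math.Ceiling(spikeMs / Milliseconds(fps));

    static void Main()
    {
        foreach (double fps in new[] { 30.0, 60.0, 120.0 })
            Console.WriteLine($"{fps} FPS: {Milliseconds(fps):F2} ms per frame");

        Console.WriteLine($"A 66 ms spike covers {FramesCovered(66.0, 60.0)} frames at 60 FPS");
    }
}
```

So a single 66ms capture costs about four frames’ worth of time at 60FPS, which is why it reads as a visible hitch.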

A Million and One Ways to Save a Frame

Okay okay, so let’s look at some of our other alternatives. We also have CaptureScreenshotAsTexture and CaptureScreenshotIntoRenderTexture. The second one will come back much later, but for now let’s take a look at the first one:

That’s an improvement, however we are still exceeding 33ms, which is slow enough to cause a noticeable hitch even if the game were locked at 30fps. Additionally, this doesn’t even include the hit from writing out the file, as we only have a texture in memory.

Alright, so we should look into writing out this screenshot into a file. The easiest way to do this is the following:

System.IO.File.WriteAllBytes("test1.jpg", ImageConversion.EncodeToJPG(ScreenCapture.CaptureScreenshotAsTexture()));

Here’s the performance graph for that:

So, obviously this is even worse than ScreenCapture.CaptureScreenshot but we should probably have expected that because the fine folks at Unity HQ know what they’re doing (perhaps), and because we’re just taking a function designed for putting a screenshot into a Texture2D and then we’re writing that out to disk.

However, this isn’t actually the approach that I used, because I was unaware of both CaptureScreenshotAsTexture and CaptureScreenshotIntoRenderTexture when I initially went about writing this code. I don’t know if they are new API additions or if I am just a bit slow. Either way they both got ignored for no good reason. They are perfectly fine approaches for certain cases, but with that said I probably wouldn’t have stumbled upon my ultimate solution if I had known about them. Instead, my approach was using Texture2D.ReadPixels.

ReadPixels has to be called from a coroutine that first yields on WaitForEndOfFrame(), so that the frame has finished rendering before we read it. This seems alright as far as performance goes, not quite under the 60FPS threshold here, but an improvement. However, this graph is simply calling ReadPixels on a 3840×2160 framebuffer, without even writing that out to a file. If we take a similar approach to the way we did earlier, then we’d get the following code and the following performance characteristics:

    IEnumerator CoWriteScreenshotTemp(int width, int height)
    {
        //Wait until the frame has finished rendering before reading it back
        yield return new WaitForEndOfFrame();
        if(screenshot==null) screenshot = new Texture2D(width, height, TextureFormat.RGB24, false);
        //Rect takes (x, y, width, height), so we pass the full size rather than width-1/height-1
        screenshot.ReadPixels(new Rect(0, 0, width, height), 0, 0, false);
        screenshot_stored = true;
        System.IO.File.WriteAllBytes("test2.jpg", ImageConversion.EncodeToJPG(screenshot));
    }

Damn, so now we’re almost as bad as CaptureScreenshotAsTexture, while having made the code significantly more complicated. However, we already need to do this ReadPixels process in a coroutine, so perhaps we can do it over multiple frames. What might happen if we tossed a few yield returns in there to spread the read out over a couple of frames? And while we’re at it, why don’t we put the file writing on its own frame?

Okay, performance implications are not always obvious in these high-level languages. We have almost an equivalent hitch across two frames now. Still, it seems like there has to be some fertile ground in this direction, even with all the obfuscation of a high-level language. So perhaps the issue is that we’re saving out the file on the main thread. Perhaps if we spin up a thread to write out the file?

Thread _encodeThread;

IEnumerator CoWriteScreenshotTempMultiFrameWithFileWriteThread(int width, int height)
{
    if(screenshot==null) screenshot = new Texture2D(width, height, TextureFormat.RGB24, false);
    yield return new WaitForEndOfFrame();
    //Left half of the framebuffer (Rect takes x, y, width, height)
    screenshot.ReadPixels(new Rect(0, 0, width/2, height), 0, 0, false);
    yield return new WaitForEndOfFrame();
    //Right half, read one frame later
    screenshot.ReadPixels(new Rect(width/2, 0, width-(width/2), height), width/2, 0, false);
    yield return 0;
    rawBytes = screenshot.GetRawTextureData();
    QueuedPath = "test4.jpg";
    //Encoding and file writing happen off the main thread
    _encodeThread = new Thread(EncodeAndWriteFileWithArray);
    _encodeThread.Start();
}

void EncodeAndWriteFileWithArray()
{
    byte[] jpgBytes = ImageConversion.EncodeArrayToJPG(rawBytes,
                      UnityEngine.Experimental.Rendering.GraphicsFormat.R8G8B8_UInt,
                      (uint)Screen.width, (uint)Screen.height, 0, 75);
    File.WriteAllBytes(QueuedPath, jpgBytes);
    Globals.updateLoadGameMenu = true;
}
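The two hard-coded halves above generalize to any number of vertical strips. Here is a rough sketch of just the rectangle math in plain C# (no Unity types; the SliceMath and Columns names are made up for illustration). Each (x, w) pair would be passed to ReadPixels as new Rect(x, 0, w, height) with a destination x of x, one strip per frame:

```csharp
using System;
using System.Collections.Generic;

static class SliceMath
{
    // Splits a framebuffer of the given width into `slices` vertical strips.
    // Strip edges are computed from the running total, so the strips always
    // cover the full width with no gaps or overlaps, even when the width
    // does not divide evenly.
    public static List<(int x, int w)> Columns(int width, int slices)
    {
        var result = new List<(int x, int w)>();
        for (int i = 0; i < slices; i++)
        {
            int x0 = (int)((long)width * i / slices);
            int x1 = (int)((long)width * (i + 1) / slices);
            result.Add((x0, x1 - x0));
        }
        return result;
    }

    static void Main()
    {
        foreach (var (x, w) in Columns(3840, 5))
            Console.WriteLine($"x={x} width={w}");
    }
}
```

Computing each strip from its two edges, rather than using a fixed width/slices step, avoids dropping the last column or two of pixels on odd resolutions.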

Well, in spite of the fact that the above graph looks very similar, the scale on the left has changed, and now we are peaking a bit past 33ms rather than all the way to 66ms. So this is definitely a large improvement. However, we’re still not even halfway to 60FPS… One thing we could do pretty easily is to just add in even more yield return new WaitForEndOfFrame() calls and read from the framebuffer over even more frames. However, because we are just reading from whatever the most recent frame was, if the scene changes much, this pretty easily can result in problems such as below:

Notice the visible seam down the middle of the frame. The issue is that the framebuffer is changing while we are trying to read from it. The performance of spreading this out over 5 different slices would look like the following:

Admittedly, I was a bit stuck here, but then someone suggested that I take a look into doing a copy GPU side so that I could simply read from the same framebuffer and not have any of these seams. I apologize for not remembering who exactly it was who suggested it. In retrospect it seems pretty obvious, but as you might imagine I was pretty lost in the details at this point.

The easiest way to do this would be to use our earlier friend CaptureScreenshotIntoRenderTexture (I told you it’d come back!), as it does exactly what the doctor ordered: it captures the framebuffer into a RenderTexture. Unfortunately (or perhaps fortunately as the case may be), I didn’t know about this function and so I proceeded to dive into several ways of doing this type of thing, ultimately settling on CommandBuffers. (In the future, this will be the most outdated part of this article, as CommandBuffers are part of the legacy graphics pipeline for Unity and will most likely be removed.)

Okay, I’m going to save the code breakdown for a bit later, because otherwise I’d be putting a bunch of code in here that’s marginally different from the final code. But we’ll just say that we set up a CommandBuffer to copy the contents of the screen over to a second RenderTexture (which is really just a GPU-side texture). As part of that process, we have to blit it to a temporary buffer so that we can flip the image vertically. Due to some vagaries in how Unity handles its graphics pipeline, the image gets flipped after it is post-processed but before it is fed into the final display. We come out the other end with a RenderTexture that is actually oriented correctly, unlike CaptureScreenshotIntoRenderTexture, which just leaves it upside down.

This means that, unlike with our earlier staggered read, we can split the read from this RenderTexture across as many frames as we want without having any seams show up. The performance characteristics of doing this across 5 frames look like the following:

Okay, so another improvement, but there still seems to be a large hitch at the end associated with getting the raw bytes from a Texture2D, and even the 5-frame-staggered readback from the framebuffer into the Texture2D is exceeding our 16ms frame boundary….enter:

The Solution

So, the ideal way to avoid this readback and conversion hitch is…maybe you can guess?

Well, if you didn’t want to guess, I’m gonna tell you anyway: tell the GPU to do the conversion and readback for us asynchronously. The fact is that we will always end up with a bit of a hitch if we are using the CPU to tell the GPU to give us some texture data right now, because that means it has to stop or finish what it’s doing and focus entirely on that task. Instead, we can tell the GPU to give us that texture data “whenever it has a moment”. This is called an asynchronous readback request, and it can be done in Unity using the AsyncGPUReadback.Request function.
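As a minimal sketch of what the basic request looks like (this is not the code I ended up shipping; the AsyncReadbackSketch component and its source field are made up for illustration, and source is assumed to already hold the captured frame):

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component, for illustration only
public class AsyncReadbackSketch : MonoBehaviour
{
    // Assumption: some other code has already rendered the frame into this texture
    public RenderTexture source;

    public void RequestReadback()
    {
        // Returns immediately. The callback is invoked on a later frame,
        // once the GPU has gotten around to copying the data back.
        AsyncGPUReadback.Request(source, 0, TextureFormat.RGB24, OnReadback);
    }

    void OnReadback(AsyncGPUReadbackRequest request)
    {
        if (request.hasError) return;

        // The NativeArray returned here is owned by the request and is only
        // valid for a short time, so copy it if you need to keep the pixels.
        NativeArray<byte> pixels = request.GetData<byte>();
        byte[] copy = pixels.ToArray(); // this copy is a managed allocation
        Debug.Log("Read back " + copy.Length + " bytes");
    }
}
```

The important part is that nothing on the CPU waits for the GPU: the request is queued, the game keeps running, and the pixels show up a few frames later.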

Because AsyncGPUReadback.Request generates garbage, I instead chose to use the variant that writes into a NativeArray we allocate ourselves: AsyncGPUReadback.RequestIntoNativeArray. Note that this does mean that we will want to make sure we dispose of the NativeArray when we are done with it. For my purposes, I really only dispose of the array and reallocate it if the resolution changes.
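To give a sense of how big that NativeArray needs to be, here is the arithmetic as a plain C# sketch (the ReadbackBufferSize name is made up for illustration, and it assumes tightly packed RGB24 data with no row padding):

```csharp
using System;

static class ReadbackBufferSize
{
    // RGB24 stores three 8-bit channels per pixel (no alpha).
    public static long Rgb24Bytes(int width, int height) => (long)width * height * 3;

    static void Main()
    {
        long bytes = Rgb24Bytes(3840, 2160);
        double mebibytes = bytes / (1024.0 * 1024.0);
        Console.WriteLine($"{bytes} bytes, about {mebibytes:F1} MiB");
    }
}
```

At 3840×2160 that works out to 24,883,200 bytes, which is a good argument for allocating the buffer once and reusing it rather than creating a fresh one for every screenshot.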

First the performance characteristics of this approach:

Ah, we are finally at something acceptable. You may notice the peak shows slightly above the 16ms line, so it’s clear there is still some hit from the async readback, however in practice the stutter is not noticeable. I may revisit this further before release to see if I can squeeze out a bit more performance here. But for now I am happy enough with this part of the equation. For Taiji, I still have to do some work to improve the actual writing of the save file, but the screenshot taking is “good enough to ship”, as far as I’m concerned.

The actual save file performance is the left spike, and the right is the screenshot taking

So, here’s a snippet of something resembling the final code that I used. This is not exactly a plug and play class, but should be a good basis for getting something working in your own projects. I tried to at least have everything necessary in the snippet:

using System.Collections;
using System.IO;
using System.Threading;
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

string QueuedPath;
NativeArray<byte> imageBytes;
byte[] rawBytes;
Thread _encodeThread;

public Material screenshotFlipperMat;

Vector2 previousScreenSize;

CommandBuffer cBuff;
public RenderTexture screenRT, tempRT;

//Call this function whenever you want to take a screenshot
public void TakeScreenshot(string fileOutPath)
{
    StartCoroutine(SaveScreenshot(fileOutPath, Screen.width, Screen.height));
}

//Helper function that will be called later
void InitializeBuffers()
{
    //We do some checks to see if the resolution changed and we need to recreate our RenderTextures
    bool resChanged = (previousScreenSize.x != Screen.width) || (previousScreenSize.y != Screen.height);
    previousScreenSize = new Vector2(Screen.width, Screen.height);
    //The array is Screen width*height*3 because we are going to use a 24bit RGB format (TextureFormat.RGB24)
    if(imageBytes.IsCreated == false) imageBytes = new NativeArray<byte>(Screen.width*Screen.height*3, Allocator.Persistent);
    else if(resChanged)
    {
        //Dispose of the old array before reallocating, otherwise it leaks
        imageBytes.Dispose();
        imageBytes = new NativeArray<byte>(Screen.width*Screen.height*3, Allocator.Persistent);
    }
    if(tempRT == null || resChanged) tempRT = new RenderTexture(Screen.width, Screen.height, 24);
    if(screenRT == null || resChanged) screenRT = new RenderTexture(Screen.width, Screen.height, 24);
    //We build our command buffer, which includes a double blit using a special
    //material (shader in article) so that we can flip the output from the backbuffer
    //This double blit seems to be necessary because CommandBuffer.Blit will not allow us to blit using a material
    //if we are blitting from the backbuffer (BuiltinRenderTextureType.CurrentActive)
    if(cBuff == null || resChanged)
    {
        cBuff = new CommandBuffer();
        cBuff.name = "ScreenshotCapture";
        cBuff.Blit(BuiltinRenderTextureType.CurrentActive, tempRT);
        cBuff.Blit(tempRT, screenRT, screenshotFlipperMat);
    }
    //This assumes the script lives on the main camera, since we remove the buffer via Camera.main below
    GetComponent<Camera>().AddCommandBuffer(CameraEvent.AfterImageEffects, cBuff);
}

//Function to dispose of our imageBytes from external classes
public void Dispose()
{
    if(imageBytes.IsCreated) imageBytes.Dispose();
}

//It may be possible to do this in one frame instead of as a coroutine, but I have not tested
IEnumerator SaveScreenshot(string fileOutPath, int width, int height)
{
    //Set up the buffers and attach the command buffer, then wait a frame for it to execute
    InitializeBuffers();
    yield return 0;
    Camera.main.RemoveCommandBuffer(CameraEvent.AfterImageEffects, cBuff);
    QueuedPath = fileOutPath;
    AsyncGPUReadback.RequestIntoNativeArray(ref imageBytes, screenRT, 0, TextureFormat.RGB24, ReadbackCompleted);
}

void ReadbackCompleted(AsyncGPUReadbackRequest request)
{
    if(request.hasError) return; //We just won't write out a screenshot, not a huge deal
    //Encode and write the file on a separate thread so the main thread never stalls on it
    _encodeThread = new Thread(EncodeAndWriteFile);
    _encodeThread.Start();
}

void EncodeAndWriteFile()
{
    rawBytes = imageBytes.ToArray();
    byte[] jpgBytes = ImageConversion.EncodeArrayToJPG(rawBytes, UnityEngine.Experimental.Rendering.GraphicsFormat.R8G8B8_UInt, (uint)Screen.width, (uint)Screen.height, 0, 75);
    File.WriteAllBytes(QueuedPath, jpgBytes);
    Globals.updateLoadGameMenu = true;
}

Additionally, you’ll need a shader to do the backbuffer flipping, so here’s something for that. Better could perhaps be done, but this is based off of an internal Unity copy shader, so maybe not much better:

Shader "Custom/ScreenshotFlipper" {
    Properties{ _MainTex("Texture", any) = "" {} }
    SubShader {
        Pass {
            ZTest Always Cull Off ZWrite Off
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag
            #pragma target 2.0
            #include "UnityCG.cginc"
            sampler2D _MainTex;
            uniform float4 _MainTex_ST;
            struct appdata_t {
                float4 vertex : POSITION;
                float2 texcoord : TEXCOORD0;
            };
            struct v2f {
                float4 vertex : SV_POSITION;
                float2 texcoord : TEXCOORD0;
            };
            v2f vert(appdata_t v)
            {
                v2f o;
                o.vertex = UnityObjectToClipPos(v.vertex);
                float2 uv = v.texcoord.xy;
                // Flip vertically so the saved image comes out right side up
                uv.y = 1-uv.y;
                o.texcoord = TRANSFORM_TEX(uv, _MainTex);
                return o;
            }
            fixed4 frag(v2f i) : SV_Target
            {
                return tex2D(_MainTex, i.texcoord);
            }
            ENDCG
        }
    }
    Fallback Off
}

Hopefully this has been a fun deep dive for you more technically-minded folks. Lord knows it was a lot of work to write it, but I had fun too. 🙂
