This repository was archived by the owner on Dec 18, 2018. It is now read-only.

Kestrel stops processing requests for a period #2986

Closed
peppy opened this issue Oct 5, 2018 · 10 comments

@peppy

peppy commented Oct 5, 2018

I have been migrating a .NET Framework 4.x project to .NET Core. In the process I ran into an issue with HttpListener not working as expected, so I migrated my code to use Kestrel.

The server application is a high-performance game server that handles request and response bodies in a completely custom way. It processes around 3,000 req/s. After around 5-30 minutes of correct execution, the server goes into a state where it is no longer processing incoming requests. Eventually it recovers from this state, but by that point clients believe they have lost their connections.

I have been debugging over the last few days and have ruled out my code being the direct issue, but I have also been unable to reproduce it under non-production load. I did manage to get a core dump in the hanging state, and you can find the log of sos EEStack here. I am able to share any further lldb/sos output should it be relevant.

While I fully intend to continue investigation on my own, I want to post this here for visibility in case someone working on the project can guide me in the right direction based on the thread call stacks above.

The project was built against .NET Core 2.1.4 / Kestrel 2.2.0-preview2-35157 on a macOS host using dotnet publish, running on an Ubuntu host (RID linux-x64).

Note that I was seeing similar behaviour before migrating from HttpListener, so it is quite possibly something at a lower level or unrelated to Kestrel directly.

As the project this is being used in is private, I cannot link to the source, but here is an extract showing how I am using Kestrel:

private static void StartHttpListener()
{
    var host = new WebHostBuilder().UseKestrel(options =>
        {
            options.ConfigureEndpointDefaults(o => o.NoDelay = true);
        })
        .Configure(app =>
        {
            app.Run(context =>
            {
                DogStatsd.Increment("http_requests");
                return HandleConnection(context);
            });
        })
        .UseUrls("http://*:80")
        .Build();

    host.Run();
}

private static Task HandleConnection(HttpContext context)
{
    var request = context.Request;
    var response = context.Response;
    
    var client = UserManager.GetClientFromToken(incomingToken);
    if (client == null)
    {
        response.StatusCode = 403;
        return Task.CompletedTask;
    }

    var tcs = new TaskCompletionSource<bool>();
    
    client.SetStreams(request.Body, response.Body, (int)(request.ContentLength ?? 0), tcs);
    // the request is read by the client synchronously here.

    // the response is processed and written to by a separate worker pool within one second of this call.
    // after the worker finishes using the response, it will set tcs.SetResult(true) to complete the returned task below.
    return tcs.Task;
}

Thanks in advance for any help with this issue.

@davidfowl
Member

davidfowl commented Oct 5, 2018

It looks like you're blocking thread pool threads by calling Task.Wait in a bunch of places:

00007F45C63B6158 00007f45ef0aa192 (MethodDesc 00007f45eeb82790 + 0x292 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)), calling 00007f46689696d0 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
00007F45C63B6250 00007f45ef0aa192 (MethodDesc 00007f45eeb82790 + 0x292 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)), calling 00007f46689696d0 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
00007F45C63B62E0 00007f45ef0609e9 (MethodDesc 00007f45eeb7b470 + 0xb9 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007f45eeb82790 + 0 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken))
00007F45C63B6340 00007f45ef060879 (MethodDesc 00007f45eeb7b460 + 0x159 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007f45eeb7b470 + 0 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken))
00007F45C63B6360 00007f45f21a2d03 (MethodDesc 00007f45f210ee70 + 0xb3 /root/bancho-publish/Microsoft.AspNetCore.Server.Kestrel.Core.dll!Unknown), calling (MethodDesc 00007f45eeb8b7f8 + 0 System.Threading.Tasks.ValueTask`1[[System.Int32, System.Private.CoreLib]].AsTask())

Are you doing synchronous reads/writes anywhere? Is your custom body logic reading from the request body Stream synchronously (Stream.Read/Stream.Write)? If so, it sounds like classic thread pool starvation.
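
For illustration, here is a standalone sketch (not code from this issue) of how blocking calls on pool threads starve queued work:

using System;
using System.Threading;
using System.Threading.Tasks;

class StarvationDemo
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int minWorker, out _);

        // Occupy every readily available worker thread with a blocking wait,
        // which is effectively what a synchronous Stream.Read does while it
        // waits for the client to send data.
        for (int i = 0; i < minWorker; i++)
            Task.Run(() => Thread.Sleep(Timeout.Infinite));

        // This callback now has to wait for the pool's throttled thread
        // injection before it can run - the same delay Kestrel sees when it
        // tries to schedule request-parsing work.
        Task.Run(() => Console.WriteLine("finally scheduled"));

        Console.ReadLine();
    }
}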

@peppy
Author

peppy commented Oct 5, 2018

Yes, there is one instance of a non-async read (I believe there was a reason to do it this way originally). To ensure there is no thread pool starvation, I do have the thread pool set to a very large size.

Thanks for the advice. I will remove the synchronous stream read logic and see if it helps.

@davidfowl
Member

Maybe look at the number of items queued to the thread pool:

0:016> !dumpheap -stat -type QueueUserWorkItem
Statistics:
              MT    Count    TotalSize Class Name
00007ffd3d76d8b0        1           24 System.Threading.QueueUserWorkItemCallbackDefaultContext+<>c
00007ffd3d76d860     1000        32000 System.Threading.QueueUserWorkItemCallbackDefaultContext

@davidfowl
Member

Hmm, that may not work for queued tasks. Checking something.

@peppy
Author

peppy commented Oct 5, 2018

(lldb) dumpheap -stat -type QueueUserWorkItem

Statistics:
              MT    Count    TotalSize Class Name
00007f45f20c5bf0        1           24 System.Threading.QueueUserWorkItemCallbackDefaultContext`1+<>c[[System.Object, System.Private.CoreLib]]
00007f45ef38da00     3301       132040 System.Threading.QueueUserWorkItemCallback
00007f45f20c5b28     6250       200000 System.Threading.QueueUserWorkItemCallbackDefaultContext`1[[System.Object, System.Private.CoreLib]]
Total 9552 objects

This is not looking too happy, that's for sure.

@davidfowl
Member

davidfowl commented Oct 5, 2018

Starvation it is! It's likely Kestrel can't serve requests because it can't even schedule the callbacks that parse incoming HTTP requests. If you can't change the calling code, you can buffer the incoming body so that you don't kill the thread pool.

Option 1: Change the Stream reading to be async.
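
A minimal sketch of what that conversion might look like (hypothetical read loop; the actual client code is private):

private static async Task<byte[]> ReadRequestBodyAsync(HttpRequest request)
{
    int length = (int)(request.ContentLength ?? 0);
    var buffer = new byte[length];
    int offset = 0;

    while (offset < length)
    {
        // ReadAsync hands the thread pool thread back while waiting for data.
        int read = await request.Body.ReadAsync(buffer, offset, length - offset);
        if (read == 0)
            break; // client disconnected before sending the full body

        offset += read;
    }

    return buffer;
}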

Option 2: Buffer the incoming Streams so you can keep the synchronous stream reading code.

private static async Task HandleConnection(HttpContext context)
{
    var request = context.Request;
    var response = context.Response;
    
    var client = UserManager.GetClientFromToken(incomingToken);
    if (client == null)
    {
        response.StatusCode = 403;
        return; // async method: just return, the completed Task is produced by the compiler
    }

    context.Request.EnableBuffering();
    await context.Request.Body.DrainAsync(CancellationToken.None);

    context.Request.Body.Seek(0L, SeekOrigin.Begin);

    var tcs = new TaskCompletionSource<bool>();
    
    client.SetStreams(request.Body, response.Body, (int)(request.ContentLength ?? 0), tcs);
    // the request is read by the client synchronously here.

    // the response is processed and written to by a separate worker pool within one second of this call.
    // after the worker finishes using the response, it will set tcs.SetResult(true) to complete the returned task below.
    await tcs.Task;
}

@davidfowl
Member

DrainAsync is in the Microsoft.AspNetCore.WebUtilities namespace.
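
In practice that means adding the corresponding using directive to the handler above (a trivial sketch, assuming the package is already referenced):

using Microsoft.AspNetCore.WebUtilities;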

@peppy
Author

peppy commented Oct 5, 2018

Thanks again for the options. I will make the changes and report back after testing with production workload tomorrow!

@JunTaoLuo JunTaoLuo added this to the Discussions milestone Oct 5, 2018
@peppy
Author

peppy commented Oct 6, 2018

On looking back over the data I collected, I noticed that I was graphing the AvailableThreads from the worker pool rather than correctly calculating how many were free/spare using (Min - (Max - Available)) (I used this as a general reference). I also noticed we were setting MaxThreads to increase the pool size, rather than MinThreads (which could only ever have had a negative effect).
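
A minimal sketch of that calculation (the helper name is illustrative, not from the linked reference):

// Spare worker threads relative to the pool's minimum: Min - (Max - Available).
// A value <= 0 means new work has to wait for the pool's slow thread injection.
static int SpareWorkerThreads()
{
    ThreadPool.GetMinThreads(out int minWorker, out _);
    ThreadPool.GetMaxThreads(out int maxWorker, out _);
    ThreadPool.GetAvailableThreads(out int availableWorker, out _);

    int busyWorker = maxWorker - availableWorker; // threads currently in use
    return minWorker - busyWorker;
}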

I am also kicking myself for not realising the cause here, as I spent a good month working on a large refactor to async code in another project that encountered similar issues.

I have since re-deployed and things are looking to be 100% stable (I converted the one .Read to .ReadAsync and called SetMinThreads with a sane 16 * ProcessorCount just to catch any potential lingering issues).
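
For reference, a minimal sketch of that change (assuming it runs once at startup, before the host starts):

// Raise the worker-thread floor so bursts of blocking work are absorbed
// without waiting for the pool's slow thread injection.
int workers = Environment.ProcessorCount * 16;
ThreadPool.GetMinThreads(out _, out int ioMin);
ThreadPool.SetMinThreads(workers, ioMin);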

It's a wonder that in this async world we live in there are no debug-level diagnostics to warn the developer of thread pool starvation, as it's easy to run into and hard to diagnose if you don't know what you're looking for.
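
A rough watchdog along these lines (a sketch, not a built-in diagnostic) can approximate such a warning by probing how long a trivial queued work item waits before it runs:

// Runs on a dedicated (non-pool) thread so the watchdog itself cannot be starved.
static void StartThreadPoolWatchdog(TimeSpan interval, TimeSpan threshold)
{
    var watchdog = new Thread(() =>
    {
        while (true)
        {
            var scheduled = new ManualResetEventSlim(false);

            // Queue a trivial work item; under starvation it sits in the queue.
            ThreadPool.UnsafeQueueUserWorkItem(_ => scheduled.Set(), null);

            if (!scheduled.Wait(threshold))
                Console.WriteLine($"Thread pool starvation suspected: probe was not scheduled within {threshold.TotalMilliseconds:F0} ms.");

            Thread.Sleep(interval);
        }
    })
    { IsBackground = true, Name = "threadpool-watchdog" };

    watchdog.Start();
}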

Anyway, all fixed. Happily serving ~2,000 req/s with no signs of struggling 👍.

@peppy peppy closed this as completed Oct 6, 2018
@davidfowl
Member

https://twitter.com/davidfowl/status/1048441309696909312 it's coming
