C# Native AOT performance

How fast are .NET Native AOT applications compared to regular managed code? Can AOT outperform JIT? How do you benchmark Native AOT applications?

Native AOT performance comparison

This article is part of a series about Native AOT in .NET. If you are not familiar with Native AOT, read the How to develop Native AOT applications in .NET part first.

This article compares .NET and Native AOT performance. First, we will review the official Microsoft benchmarks, which compare different .NET deployment options for simple ASP.NET applications.

Then, you will learn how to run your own benchmarks using the BenchmarkDotNet and hyperfine tools. Such benchmarks let you measure code speed in your environment.

ASP.NET benchmarks

The ASP.NET team maintains a solid infrastructure for performance testing. They test various scenarios in different environments.

We are most interested in the Native AOT benchmarks. The primary source of information is the following PowerBI dashboard. The data there rests on three pillars: test applications, deployment scenarios, and metrics.

Test applications

You can find the source code of benchmarks and test applications in the aspnet/Benchmarks repository.

Native AOT benchmarks compare 3 application types:

  • Stage1 - a minimal API based on HTTP and JSON (a sketch follows this list). The application source code is located in /src/BenchmarksApps/BasicMinimalApi.
  • Stage1Grpc - a similar API based on gRPC (/src/BenchmarksApps/Grpc/BasicGrpc)
  • Stage2 - a full web app involving a database and authentication (/src/BenchmarksApps/TodosApi)
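
For reference, here is a minimal sketch of a Stage1-style application. It is not the actual benchmark source, but it shows the typical shape of an AOT-friendly minimal API in .NET 8 (CreateSlimBuilder plus a source-generated JSON context):

// A minimal HTTP + JSON API, similar in spirit to Stage1.
// The JSON source generator keeps serialization trimming-safe for Native AOT.
using System.Text.Json.Serialization;

var builder = WebApplication.CreateSlimBuilder(args);
builder.Services.ConfigureHttpJsonOptions(options =>
    options.SerializerOptions.TypeInfoResolverChain.Insert(0, AppJsonContext.Default));

var app = builder.Build();
app.MapGet("/json", () => new Message("Hello, World!"));
app.Run();

record Message(string Text);

[JsonSerializable(typeof(Message))]
partial class AppJsonContext : JsonSerializerContext
{
}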

.NET Deployment scenarios

Test applications are run in different environments. At the moment, benchmarks use Windows and Linux virtual machines with 28 cores. There are also separate Linux environments for ARM and Intel processors.

Applications are also tested in different configurations. Each combination of an application and a configuration defines a "scenario".

You can hold down the Ctrl (or ⌘) key to select multiple scenarios or environments on the PowerBI dashboard.

Metrics

Benchmarks collect fundamental metrics for every deployed application. For example, tests measure requests per second (RPS), startup time, and the maximum memory working set.

That allows us to compare metric values for various configurations of the same application.

Performance comparison

We will compare StageX scenarios with StageXAot and StageXAotSpeedOpt. They use the following configurations:

Scenario            dotnet publish build arguments
StageX              PublishAot=false
                    EnableRequestDelegateGenerator=false
StageXAot           PublishAot=true
                    StripSymbols=true
StageXAotSpeedOpt   PublishAot=true
                    StripSymbols=true
                    OptimizationPreference=Speed

All scenarios above also use the DOTNET_GCDynamicAdaptationMode=1 environment variable.
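
For example, to publish an application with the same build settings as StageXAotSpeedOpt, a command along these lines should work. Note that DOTNET_GCDynamicAdaptationMode=1 is an environment variable for the running application, not a publish argument:

dotnet publish -c Release -p:PublishAot=true -p:StripSymbols=true -p:OptimizationPreference=Speed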

StageXAotSpeedOpt scenarios allow you to estimate the impact of the OptimizationPreference=Speed setting.

You may review the StageXTrimR2RSingleFile scenarios too. They correspond to trimmed ReadyToRun deployment, which is another form of ahead-of-time compilation in .NET and sometimes a good alternative to Native AOT.

Here are the current performance comparison results for .NET 9 Release Candidate (September 2024):

Startup time

AOT applications start much faster than managed versions. That's true for both Stage1 and Stage2 applications and for all environments. Sample results:

Scenario           Startup time (ms)
Stage2AotSpeedOpt  100
Stage2Aot          109
Stage2             528

Working set

The max working set for Native AOT applications is smaller than for managed versions. On Linux, managed versions use about 1.5-2 times more RAM than AOT versions. For example:

Scenario           Max working set (MB)
Stage1Aot          56
Stage1AotSpeedOpt  57
Stage1             126

On Windows, the difference is smaller, especially for Stage2:

Scenario           Max working set (MB)
Stage2Aot          152
Stage2AotSpeedOpt  150
Stage2             167

Requests per second

Larger RPS values mean a faster application. The lightweight Stage1 application usually handles about 800-900K requests per second. The larger Stage2 application handles only about 200K requests.

For the Stage2 application, the .NET version handles more requests than the AOT versions in all environments. The Stage2AotSpeedOpt version sometimes comes close, but its speed usually lies between Stage2 and Stage2Aot. Here are typical results:

Scenario           RPS
Stage2             235,008
Stage2AotSpeedOpt  215,637
Stage2Aot          194,264

The results for the Stage1 application are similar on Intel Linux and Intel Windows. However, on Ampere Linux, AOT beats the managed version. Sample results from Ampere Linux:

Scenario           RPS
Stage1AotSpeedOpt  929,524
Stage1Aot          912,344
Stage1             844,659

So, the environment and the application code may significantly affect speed. It makes sense to run your own benchmarks to estimate the Native AOT benefits for your project. Let's write custom benchmarks without the Microsoft testing infrastructure.

Benchmarking Native AOT applications

We will use 2 types of benchmarks. The first is based on BenchmarkDotNet, a popular library for benchmarking .NET code. These benchmarks compare pure speed, excluding startup time.

The second is based on the hyperfine tool, which compares the execution time of shell commands. These benchmarks compare overall speed, including startup time.

We will not compare memory consumption here. At the moment, the NativeMemoryProfiler diagnoser in BenchmarkDotNet does not support the Native AOT runtime, and hyperfine does not currently track memory usage either.

You can download the source code from the NativeAotBenchmarks repository on GitHub. We encourage you to try the benchmarks in your environment. This article describes results from a Windows 11 laptop with an Intel Core i9-13900H processor and 16 GB RAM.

Make sure you run benchmarks properly. Here are the common recommendations:

  • Use the Release build.
  • Turn off all applications except the benchmark process. For example, disable antivirus software and close Visual Studio and web browsers.
  • Keep your laptop plugged in and use the best performance mode.
  • Use the same input data in the scenarios being compared.

Test cases

We will benchmark 2 scenarios in .NET 8:

1. Simple C# code for string compression using the counts of repeated characters. For example, the string "aabcccccaaa" becomes "a2b1c5a3":

using System.Text;

string Compress(string s)
{
    StringBuilder compressed = new(s.Length);

    for (int i = 0; i < s.Length; ++i)
    {
        char c = s[i];

        // Find where the run of the current character ends.
        for (int j = i + 1; j <= s.Length; ++j)
        {
            if (j == s.Length || s[j] != c)
            {
                // Append the character followed by the length of its run.
                compressed.Append(c + $"{j - i}");
                i = j - 1;

                // Bail out early if "compression" already made the string longer.
                if (compressed.Length > s.Length)
                    return s;

                break;
            }
        }
    }

    if (compressed.Length <= s.Length)
        return compressed.ToString();

    return s;
}

2. A heavier PDF to PNG conversion task that uses Docotic.Pdf.
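
The conversion code itself is compact. Here is a rough sketch based on the rasterization pattern from the Docotic.Pdf samples (the file names and resolution are placeholders, not the actual benchmark code):

using BitMiracle.Docotic.Pdf;

// Rasterize every page of a PDF document to a PNG image.
using (var pdf = new PdfDocument("input.pdf"))
{
    PdfDrawOptions options = PdfDrawOptions.Create();
    options.HorizontalResolution = 300;
    options.VerticalResolution = 300;

    for (int i = 0; i < pdf.PageCount; ++i)
        pdf.Pages[i].Save($"page_{i}.png", options);
}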

Prerequisites

Install prerequisites for .NET Native AOT deployment.

Install hyperfine to run corresponding benchmarks.

For the PDF to PNG benchmarks, get a free time-limited license key on the Download C# .NET PDF library page. You need to apply the license key in Helper.cs.
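
Applying the key is a single call (a sketch; the exact placement in Helper.cs may differ):

// Replace the argument with your actual license key.
BitMiracle.Docotic.LicenseManager.AddLicenseData("your-license-key");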

BenchmarkDotNet

These benchmarks are located in the NativeAotBenchmarks project. We compare results for RuntimeMoniker.NativeAot80 and RuntimeMoniker.Net80. By default, BenchmarkDotNet builds Native AOT code with the OptimizationPreference=Speed setting.
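
A minimal sketch of such a two-runtime benchmark class (not the repository's exact code; StringCompressor is a hypothetical wrapper for the Compress method shown earlier):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

// Each SimpleJob attribute adds a separate job, so BenchmarkDotNet runs
// the same benchmark on regular .NET 8 and on Native AOT.
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.NativeAot80)]
public class CompressStringBenchmark
{
    private readonly string input = new string('a', 100_000);

    [Benchmark]
    public string Compress() => StringCompressor.Compress(input);
}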

BenchmarkDotNet performs 6 or more warmup iterations. That helps the JIT pre-compile code and collect some statistics. Thus, such benchmarks exclude startup time from the comparison.

String compression

The CompressString benchmark for string compression uses a long string with duplicate characters. A common mistake would be to generate a random string: in that case, the Native AOT and .NET 8 benchmarks would use different input strings. It is possible to use random strings, but you need to initialize the random generator with the same seed.
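
For instance, a deterministic generator like this hypothetical helper produces identical input in both benchmark processes:

using System.Text;

// The fixed seed guarantees that every run compresses the same string.
static string CreateInput(int length, int seed = 42)
{
    var random = new Random(seed);
    var result = new StringBuilder(length);
    while (result.Length < length)
    {
        char c = (char)('a' + random.Next(26)); // random letter...
        int run = random.Next(1, 10);           // ...repeated 1-9 times
        for (int i = 0; i < run && result.Length < length; ++i)
            result.Append(c);
    }
    return result.ToString();
}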

The Native AOT version runs about 1.08 times faster than the .NET 8 version:

Method    Runtime        Mean      Error      StdDev
Compress  .NET 8.0       4.117 ms  0.0553 ms  0.0517 ms
Compress  NativeAOT 8.0  3.809 ms  0.0403 ms  0.0377 ms

PDF to PNG

PDF to PNG benchmarks process PDF documents in memory. That excludes interaction with the file system, because disk I/O operations can skew benchmark results.

We test speed with two PDF documents. The first one, Banner Edulink One.pdf, is more complex. It is converted to a 72 dpi PNG and requires more time for processing. The .NET 8 version is slightly faster for this document:

Method   Runtime        Mean     Error     StdDev
Convert  .NET 8.0       1.103 s  0.0156 s  0.0146 s
Convert  NativeAOT 8.0  1.167 s  0.0160 s  0.0149 s

The second document is smaller and simpler. It is converted to a 300 dpi PNG, and the speed is almost equal:

Method   Runtime        Mean      Error    StdDev
Convert  .NET 8.0       290.1 ms  5.78 ms  6.88 ms
Convert  NativeAOT 8.0  288.3 ms  4.44 ms  3.94 ms

hyperfine

These benchmarks are located in the NativeAotTestApp project. The project does not use the OptimizationPreference=Speed setting. You can enable it in NativeAotTestApp.csproj: <OptimizationPreference>Speed</OptimizationPreference>

Use the benchmark.bat script to run tests on Windows. You can convert it to Bash for Unix/Linux-based operating systems (a rough sketch follows below). The script builds the .NET 8 and Native AOT versions of the same app. Then, it compares their performance with a command like: hyperfine --warmup 3 "net8-app.exe" "native-aot-app.exe"
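
A possible Bash equivalent (the project file name, output paths, runtime identifier, and iteration argument are assumptions; adjust them for your system):

#!/bin/bash
# Publish both versions of the test app, then compare them with hyperfine.
dotnet publish NativeAotTestApp.csproj -c Release -o net8-app
dotnet publish NativeAotTestApp.csproj -c Release -r linux-x64 -p:PublishAot=true -o native-aot-app
hyperfine --warmup 3 "./net8-app/NativeAotTestApp 100000" "./native-aot-app/NativeAotTestApp 100000"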

Warmup runs in hyperfine help start the test applications with "warm" disk caches. Unlike BenchmarkDotNet, the hyperfine warmup does not help the JIT. Therefore, hyperfine benchmarks compare total application speed, including startup time.

Our test application supports an iteration count argument. It lets you repeat the same code multiple times in a simple loop:

for (int i = 0; i < iterationCount; ++i)
    CompressString(args);

The idea is to decrease the impact of the startup time difference. Repeating the same code gives the JIT a chance to collect more runtime statistics and generate faster code.

A common situation is the following: the first time, you run benchmarks with a single iteration, and the Native AOT version works much faster. Then, you run the same benchmarks with multiple iterations, and the total speed of both versions becomes equal. It means that after startup, the managed version is actually faster.

String compression

For 100,000 iterations of compressing the same input string, Native AOT performance is better:

Benchmark 1: .NET 8 version (100000 iterations)
  Time (mean ± σ):     151.5 ms ±   2.6 ms    [User: 32.1 ms, System: 1.6 ms]
  Range (min … max):   148.0 ms … 157.5 ms    19 runs

Benchmark 2: Native AOT version (100000 iterations)
  Time (mean ± σ):      55.1 ms ±   3.1 ms    [User: 15.0 ms, System: 2.1 ms]
  Range (min … max):    51.6 ms …  65.9 ms    51 runs

Summary
  Native AOT version ran 2.75 ± 0.16 times faster than .NET 8 version

But the speed becomes almost the same for 10,000,000 iterations:

Benchmark 1: .NET 8 version (10000000 iterations)
  Time (mean ± σ):      3.984 s ±  0.139 s    [User: 2.946 s, System: 0.009 s]
  Range (min … max):    3.790 s …  4.182 s    10 runs

Benchmark 2: Native AOT version (10000000 iterations)
  Time (mean ± σ):      3.956 s ±  0.041 s    [User: 2.848 s, System: 0.004 s]
  Range (min … max):    3.888 s …  4.016 s    10 runs

Summary
  Native AOT version ran 1.01 ± 0.04 times faster than .NET 8 version

PDF to PNG

For a single iteration of Banner Edulink One.pdf to PNG conversion, the AOT version runs about 1.88 times faster than the .NET 8 version:

Benchmark 1: .NET 8 version (1 iteration)
  Time (mean ± σ):      2.417 s ±  0.104 s    [User: 1.334 s, System: 0.116 s]
  Range (min … max):    2.295 s …  2.629 s    10 runs

Benchmark 2: Native AOT version (1 iteration)
  Time (mean ± σ):      1.288 s ±  0.011 s    [User: 0.573 s, System: 0.123 s]
  Range (min … max):    1.274 s …  1.310 s    10 runs

For 20 iterations, the speed difference is negligible:

Benchmark 1: .NET 8 version (20 iterations)
  Time (mean ± σ):     25.048 s ±  0.223 s    [User: 13.278 s, System: 2.312 s]
  Range (min … max):   24.751 s … 25.423 s    10 runs

Benchmark 2: Native AOT version (20 iterations)
  Time (mean ± σ):     25.213 s ±  0.114 s    [User: 12.661 s, System: 2.275 s]
  Range (min … max):   25.042 s … 25.350 s    10 runs

Summary
  .NET 8 version ran 1.01 ± 0.01 times faster than Native AOT version

For 3BigPreview.pdf, the Native AOT version is faster even with 100 iterations:

Benchmark 1: .NET 8 version (100 iterations)
  Time (mean ± σ):     10.009 s ±  0.152 s    [User: 5.298 s, System: 0.567 s]
  Range (min … max):    9.677 s … 10.189 s    10 runs

Benchmark 2: Native AOT version (100 iterations)
  Time (mean ± σ):      8.336 s ±  0.070 s    [User: 3.405 s, System: 0.505 s]
  Range (min … max):    8.247 s …  8.459 s    10 runs

Summary
  Native AOT version ran 1.20 ± 0.02 times faster than .NET 8 version

Conclusion

Native AOT applications start faster compared to regular .NET applications. The official benchmarks also show that AOT applications have smaller memory footprints.

But after startup, managed applications usually show better speed. That happens because the JIT has access to runtime information. In long-running applications, it can regenerate more efficient code based on dynamic profile-guided optimization and other techniques.

ASP.NET benchmarks allow you to compare different configurations from a performance perspective. However, results depend on the operating system and processor architecture. You need to run your own benchmarks in your target environment to find the optimal deployment configuration.