How buffer size affects MD5 hashing speed

小警犬戴墨镜 · Posted on 2023-6-10 23:12:02

Recently I needed to show progress while computing the MD5 of a large file, so I wrote the following code:

public long Length { get; private set; }
public long Position { get; private set; }

public async Task<byte[]> ComputeMD5Async(string file, CancellationToken cancellationToken)
{
    using var fs = File.OpenRead(file);
    Length = fs.Length;
    // Start hashing, then poll the stream position every 10 ms as a progress indicator.
    var task = MD5.HashDataAsync(fs, cancellationToken).AsTask();
    using var timer = new PeriodicTimer(TimeSpan.FromMilliseconds(10));
    while (!task.IsCompleted && await timer.WaitForNextTickAsync(cancellationToken))
    {
        Position = fs.Position;
    }
    Position = fs.Position;
    // Await the task so exceptions surface and the hash is not discarded.
    return await task;
}

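When I ran it, something was off: my verification speed topped out at 350 MB/s, while someone else's tool reached 500 MB/s. Why such a large gap on the same machine? With that question in mind I went to read their source code, and found it was written like this: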

protected long _progressPerFileSizeCurrent;

protected byte[] CheckHash(Stream stream, HashAlgorithm hashProvider, CancellationToken token)
{
    byte[] buffer = new byte[1 << 20]; // 1 MB read buffer
    int read;
    while ((read = stream.Read(buffer)) > 0)
    {
        token.ThrowIfCancellationRequested();
        // Feed this chunk into the running hash state.
        hashProvider.TransformBlock(buffer, 0, read, buffer, 0);
        _progressPerFileSizeCurrent += read;
    }
    // read is 0 here, so this finishes the hash with an empty final block.
    hashProvider.TransformFinalBlock(buffer, 0, read);
    return hashProvider.Hash;
}

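This uses the HashAlgorithm.TransformBlock method, which hashes the specified region of the input byte array and keeps the intermediate state internally; a final call to HashAlgorithm.TransformFinalBlock finishes the computation. The buffer in the code above is 1 MB, and I had a hunch that MD5 speed might depend on this value, so I went on to dig through the source of MD5.HashDataAsync.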
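To make the TransformBlock / TransformFinalBlock pattern concrete, here is a minimal sketch that hashes an in-memory array in 1 MB chunks and compares the result with the one-shot MD5.HashData. The test data, the seed and the names chunkSize, chunked and oneShot are made up for illustration and do not come from either program above.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical demo data: a bit over 3 MB of pseudo-random bytes.
byte[] data = new byte[3 * 1024 * 1024 + 123];
new Random(42).NextBytes(data);

using var md5 = MD5.Create();
const int chunkSize = 1 << 20; // 1 MB, matching the buffer in the quoted code
int offset = 0;
while (offset < data.Length)
{
    int count = Math.Min(chunkSize, data.Length - offset);
    // Feed one chunk into the running hash; a null output buffer skips the copy.
    md5.TransformBlock(data, offset, count, null, 0);
    offset += count;
}
// Finish with an empty final block, as the quoted code does.
md5.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
byte[] chunked = md5.Hash!;

// One-shot hash of the same bytes for comparison.
byte[] oneShot = MD5.HashData(data);
Console.WriteLine(Convert.ToHexString(chunked));
Console.WriteLine(chunked.AsSpan().SequenceEqual(oneShot)); // True
```

The chunk size only changes how the input is sliced, not the resulting digest, which is why the buffer size can be tuned freely for speed.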

// System.Security.Cryptography.LiteHashProvider
private static async ValueTask<int> ProcessStreamAsync<T>(T hash, Stream source, Memory<byte> destination, CancellationToken cancellationToken) where T : ILiteHash
{
    using (hash)
    {
        byte[] rented = CryptoPool.Rent(4096);
        int maxRead = 0;
        int read;
        try
        {
            while ((read = await source.ReadAsync(rented, cancellationToken).ConfigureAwait(false)) > 0)
            {
                maxRead = Math.Max(maxRead, read);
                hash.Append(rented.AsSpan(0, read));
            }
            return hash.Finalize(destination.Span);
        }
        finally
        {
            CryptoPool.Return(rented, clearSize: maxRead);
        }
    }
}

The key part of the source is the snippet above: the buffer rented is 4 KB, far smaller than 1 MB, and that may well be the cause.
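If the 4 KB buffer really is the bottleneck, one possible way to keep asynchronous reads and progress reporting while choosing the buffer size yourself is to drive IncrementalHash by hand. The sketch below is only an illustration under that assumption: the class Md5WithProgress, the method HashAsync and the parameters bufferSize and progress are hypothetical names, not part of .NET or of the programs quoted above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;

public static class Md5WithProgress
{
    // Hypothetical helper: bufferSize and progress are illustrative parameters.
    public static async Task<byte[]> HashAsync(
        Stream source,
        int bufferSize = 1 << 20,
        IProgress<long>? progress = null,
        CancellationToken cancellationToken = default)
    {
        using var hash = IncrementalHash.CreateHash(HashAlgorithmName.MD5);
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = await source.ReadAsync(buffer.AsMemory(), cancellationToken).ConfigureAwait(false)) > 0)
        {
            // Append only the bytes actually read in this iteration.
            hash.AppendData(buffer, 0, read);
            total += read;
            progress?.Report(total); // bytes hashed so far
        }
        return hash.GetHashAndReset();
    }
}
```

A caller could open the file with File.OpenRead and pass a Progress<long> whose callback updates the UI. Whether this is actually faster than MD5.HashDataAsync depends on the buffer size and the storage device, so it needs to be measured.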
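To find the best buffer size, I ran a large batch of benchmarks covering the range from 32 B to 64 MB. Nothing technically sophisticated, but a lot of work. The tests use a 1 GB file; the baseline calls MD5.HashData directly on a 1 GB array. The actual test code is below, taking a MemoryStream and a FileStream in turn as the Stream argument, to compare the speed without disk IO against the speed when actually reading the file.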
[code]public async Task HashDataAsync(Stream stream){ var hash = MD5.Create(); byte[] buffer = new byte[1