Posted by 爱酒爱足球的大叔 on 2023-9-11 16:23:14

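A Performance Showdown Between Different CPUs

Preface

This post compares the performance of Hygon 7280, Intel, AMD, Kunpeng 920, and Phytium 2500 CPUs.

| CPU model | Hygon 7280 | AMD 7H12 | AMD 7T83 | Intel 8163 | Kunpeng 920 | Phytium 2500 | Yitian 710 |
|---|---|---|---|---|---|---|---|
| Physical cores | 32 | 32 | 64 | 24 | 48 | 64 | 128 |
| Hyper-threads per core | 2 | 2 | 2 | 2 | | | |
| Sockets | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
| NUMA nodes | 8 | 2 | 4 | 2 | 4 | 16 | 2 |
| L1d | 32K | 32K | 32K | 32K | 64K | 32K | 64K |
| L2 | 512K | 512K | 512K | 1024K | 512K | 2048K | 1024K |

The AMD 7T83 has 8 dies. Each die has 8 cores, 32 MB of L3, 4 MiB of L2 in total, and 256 KiB each of L1I and L1D. Both the 2nd- and 3rd-generation parts carry a separate I/O die.
The Yitian 710 is a single-socket server part; one chip holds 2 symmetric dies.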

Parameters of the CPUs being compared

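About IPC:
IPC (instructions per cycle, insns/cycles) is the number of instructions executed per clock cycle; the higher it is, the faster the program runs.
Program execution time = instruction count / (clock frequency * IPC) // single core; with multiple cores, divide again by the number of cores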
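As a quick sanity check of the formula above, here is a minimal Python sketch. The function name is hypothetical, and the multi-core division assumes perfectly parallel work. The input figures are the instruction count, clock frequency, and IPC that perf stat reports for the AMD EPYC 7T83 nop test later in this post.

```python
def execution_time_seconds(instructions, frequency_hz, ipc, cores=1):
    """Execution time = instructions / (frequency * IPC), divided by the core count.

    Dividing by `cores` assumes the work parallelizes perfectly.
    """
    return instructions / (frequency_hz * ipc * cores)


# Figures from the 7T83 perf stat output shown later in this post.
instructions = 53_711_907_863
frequency_hz = 3.491e9
ipc = 5.98

estimated = execution_time_seconds(instructions, frequency_hz, ipc)
print(f"estimated single-core time: {estimated:.3f} s")
```

The estimate comes out at about 2.57 s, in line with the 2,574.29 msec task-clock that perf stat reports for that run.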
Hygon 7280

The Hygon 7280 is based on the AMD Zen architecture; its maximum IPC can reach 5.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 8
Vendor ID: HygonGenuine
CPU family: 24
Model: 1
Model name: Hygon C86 7280 32-core Processor
Stepping: 1
CPU MHz: 2194.586
BogoMIPS: 3999.63
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 4 MiB
L2 cache: 32 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127

Architecture notes:
Each CPU has 4 dies, and each die has two CCXs (core complexes). Each CCX has up to 4 cores (as on the 7280/7285) sharing one L3 cache. Each die has two memory channels, so each CPU has 8 memory channels, and each memory channel supports up to 2 DIMMs.
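The topology described above can be cross-checked against the lscpu output with a few multiplications. This is only a sketch: the constant names are made up for illustration, and it assumes one NUMA node per die and that the 128 MiB L3 figure from lscpu is the total across both sockets.

```python
# Topology figures as stated in the architecture notes (names are illustrative).
DIES_PER_CPU = 4
CCX_PER_DIE = 2
CORES_PER_CCX = 4
CHANNELS_PER_DIE = 2
DIMMS_PER_CHANNEL = 2

# Figures from lscpu; the L3 value is assumed to be the two-socket total.
SOCKETS = 2
L3_TOTAL_MIB = 128

cores_per_cpu = DIES_PER_CPU * CCX_PER_DIE * CORES_PER_CCX
channels_per_cpu = DIES_PER_CPU * CHANNELS_PER_DIE
max_dimms_per_cpu = channels_per_cpu * DIMMS_PER_CHANNEL
numa_nodes = SOCKETS * DIES_PER_CPU  # assumes one NUMA node per die
l3_per_ccx_mib = L3_TOTAL_MIB / (SOCKETS * DIES_PER_CPU * CCX_PER_DIE)

print(f"cores per CPU:       {cores_per_cpu}")
print(f"memory channels:     {channels_per_cpu}")
print(f"max DIMMs per CPU:   {max_dimms_per_cpu}")
print(f"NUMA nodes (system): {numa_nodes}")
print(f"L3 per CCX:          {l3_per_ccx_mib:.0f} MiB")
```

Under these assumptions the numbers agree with lscpu: 32 cores per socket and 8 NUMA nodes of 8 cores each.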
Hygon 7-series architecture diagram:

Hardware layout of the Sugon H620-G30A, whose CPU is the Hygon 7280 (the screenshot covers Socket0 only)

AMD EPYC 7T83(NC)

Two-socket server, 4 NUMA nodes, Zen 3 architecture



Details:

#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7T83 64-Core Processor
Stepping: 1
CPU MHz: 2154.005
CPU max MHz: 2550.0000
CPU min MHz: 1500.0000
BogoMIPS: 5090.93
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-31,128-159
NUMA node1 CPU(s): 32-63,160-191
NUMA node2 CPU(s): 64-95,192-223
NUMA node3 CPU(s): 96-127,224-255

#cat /sys/devices/system/cpu/cpu{0,1,8,16,30,31,32,128}/cache/index3/shared_cpu_map
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
00000000,00000000,00000000,0000ff00,00000000,00000000,00000000,0000ff00
00000000,00000000,00000000,00ff0000,00000000,00000000,00000000,00ff0000
00000000,00000000,00000000,ff000000,00000000,00000000,00000000,ff000000
00000000,00000000,00000000,ff000000,00000000,00000000,00000000,ff000000
00000000,00000000,000000ff,00000000,00000000,00000000,000000ff,00000000
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff

#cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map
00000000,00000000,00000000,00000001,00000000,00000000,00000000,00000001

Each L3 is shared by 8 physical cores (16 hyper-threads), which works out to about 2 MB per logical core. One CPU has 8 L3 caches, 256 MB in total.
#cat cpu0/cache/index3/shared_cpu_list
0-7,128-135
#cat cpu0/cache/index3/size
32768K
#cat cpu0/cache/index2/shared_cpu_list
0,128

#cat /sys/devices/system/cpu/cpu{0,1,8,16,30,31,32,128}/cache/index3/shared_cpu_list
0-7,128-135
0-7,128-135
8-15,136-143
16-23,144-151
24-31,152-159
24-31,152-159
32-39,160-167
0-7,128-135

L1D and L1I each total 2 MiB, which is 32 KB per physical core.
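The shared_cpu_map masks and the shared_cpu_list ranges shown above carry the same information. The following Python sketch (helper names are hypothetical) decodes a mask into the list form, assuming the usual sysfs layout of comma-separated 32-bit hex groups with the most significant group first.

```python
def cpus_from_mask(mask):
    """Decode a sysfs shared_cpu_map hex mask into a sorted list of CPU ids."""
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def to_ranges(cpus):
    """Compress a sorted list of CPU ids into shared_cpu_list notation."""
    if not cpus:
        return ""
    ranges = []
    start = prev = cpus[0]
    for cpu in cpus[1:]:
        if cpu != prev + 1:  # gap found: close the current range
            ranges.append((start, prev))
            start = cpu
        prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# index3 mask of cpu0 and index2 mask of cpu0, copied from the output above.
l3_mask_cpu0 = "00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff"
l2_mask_cpu0 = "00000000,00000000,00000000,00000001,00000000,00000000,00000000,00000001"

print(to_ranges(cpus_from_mask(l3_mask_cpu0)))  # L3 sharing set of cpu0
print(to_ranges(cpus_from_mask(l2_mask_cpu0)))  # L2 sharing set of cpu0
```

For cpu0 this yields 0-7,128-135 for L3 and 0,128 for L2, matching what shared_cpu_list reports directly.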
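Running nothing but nop instructions gives an IPC of 6 (a little scary).

#perf stat ./cpu/test
Performance counter stats for process id '449650':

2,574.29 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
8,985,622,182 cycles # 3.491 GHz (83.33%)
4,390,929 stalled-cycles-frontend # 0.05% frontend cycles idle (83.34%)
4,387,560,442 stalled-cycles-backend # 48.83% backend cycles idle (83.34%)
53,711,907,863 instructions # 5.98 insn per cycle
# 0.08 stalled cycles per insn (83.34%)
418,902,363 branches # 162.725 M/sec (83.34%)
15,036 branch-misses # 0.00% of all branches (83.32%)

2.574347594 seconds time elapsed

In sysbench tests the 7T83 does slightly better than the 7H12; the difference may come from ECS, the OS, and similar factors.
Test environment: kernel 4.19.91-011.ali4000.alios7.x86_64, 5.7.34-log MySQL Community Server (GPL)

| Cores tested | AMD EPYC 7H12 2.5G (QPS, IPC) | Notes |
|---|---|---|
| Single core | 24363 0.58 | CPU saturated |
| One HT pair | 33519 0.40 | CPU saturated |
| 2 physical cores (0-1) | 48423 0.57 | CPU saturated |
| 2 physical cores (0,32), cross-node | 46232 0.55 | CPU saturated |
| 2 physical cores (0,64), cross-socket | 45072 0.52 | CPU saturated |
| 4 physical cores (0-3) | 97759 0.58 | CPU saturated |
| 16 physical cores (0-15) | 367992 0.55 | CPU saturated, sys 20%, si 10% |
| 32 physical cores (0-31) | 686998 0.51 | CPU saturated, sys 20%, si 12% |
| 64 physical cores (0-63) | 1161079 0.50 | CPU above 95%, sys 20%, si 12% |
| 64 physical cores (0-31,64-95) | 964441 0.49 | The 32 cores on socket2 stayed fairly idle, so this figure is not meaningful |
| 64 physical cores (0-31,64-95) | 1147846 0.48 | mysqld restarted and pinned immediately; sysbench ran on 32-63, so CPUs 0-31 only reached 89% |

Note: cores were re-pinned dynamically with taskset during the stress test, so data left behind in other cores' caches can affect the results.
When pinning across sockets with taskset, the load has to run for quite a while before tasks migrate to the other socket; in other words, right after taskset the CPUs cannot be saturated.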
#numastat -p 437803

Per-node process memory usage (in MBs) for PID 437803 (mysqld)
Node 0 Node 1 Node 2
--------------- --------------- ---------------
Huge 0.00 0.00 0.00
Heap 1.15 0.00 5403.27
Stack 0.00 0.00 0.09
Private 1921.60 16.22 10647.66
---------------- --------------- --------------- ---------------
Total 1922.75 16.22 16051.02

Node 3 Total
--------------- ---------------
Huge 0.00 0.00
Heap 0.03 5404.45
Stack 0.00 0.09
Private 16.20 12601.68
---------------- --------------- ---------------
Total 16.23 18006.22

AMD EPYC 7H12 (ECS)

AMD EPYC 7H12 64-Core (ECS instance, not a physical machine); maximum IPC can reach 5.

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7H12 64-Core Processor
Stepping: 0
CPU MHz: 2595.124
BogoMIPS: 5190.24
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63

AMD EPYC 7T83 ECS

# cd /sys/devices/system/cpu/cpu0
# cat cache/index0/size
32K
# cat cache/index1/size
32K
# cat cache/index2/size
512K
# cat cache/index3/size
32768K
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7T83 64-Core Processor
Stepping: 1
CPU MHz: 2545.218
BogoMIPS: 5090.43
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15

stream:
# for i in $(seq 0 15); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 0.68 nanoseconds
STREAM copy bandwidth: 23509.84 MB/sec
STREAM scale latency: 0.69 nanoseconds
STREAM scale bandwidth: 23285.51 MB/sec
STREAM add latency: 0.96 nanoseconds
STREAM add bandwidth: 25043.73 MB/sec
STREAM triad latency: 1.40 nanoseconds
STREAM triad bandwidth: 17121.79 MB/sec
1
STREAM copy latency: 0.68 nanoseconds
STREAM copy bandwidth: 23513.96 MB/sec
STREAM scale latency: 0.68 nanoseconds
STREAM scale bandwidth: 23580.06 MB/sec
STREAM add latency: 0.96 nanoseconds
STREAM add bandwidth: 25049.96 MB/sec
STREAM triad latency: 1.35 nanoseconds
STREAM triad bandwidth: 17741.93 MB/sec

Intel 8163

The Intel 8163 used in this comparison is described below; its maximum IPC is 4:

#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2499.121
CPU max MHz: 3100.0000
CPU min MHz: 1000.0000
BogoMIPS: 4998.90
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-95

-----8269CY
#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping: 7
CPU MHz: 3200.000
CPU max MHz: 3800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4998.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103

Differences between Intel models

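The figure below shows how MySQL behaves on the 8269CY and the E5-2682 under the same workload and the same traffic:

CPU utilization difference (in the figure below, 8051C is the E5-2682 and the others are 8269CY; the clock frequencies also differ by 30%)

Kunpeng 920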

#numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 192832 MB
node 0 free: 146830 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 193533 MB
node 1 free: 175354 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 193533 MB
node 2 free: 175718 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 193532 MB
node 3 free: 183643 MB
node distances:
node 0 1 2 3
0: 10 12 20 22
1: 12 10 22 24
2: 20 22 10 12
3: 22 24 12 10

#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Model: 0
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 24576K
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

Phytium 2500

Running nops on the Phytium 2500 yields an IPC of only 1, whereas other code can reach 2.33.

#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 16
Model: 3
BogoMIPS: 100.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 65536K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
NUMA node8 CPU(s): 64-71
NUMA node9 CPU(s): 72-79
NUMA node10 CPU(s): 80-87
NUMA node11 CPU(s): 88-95
NUMA node12 CPU(s): 96-103
NUMA node13 CPU(s): 104-111
NUMA node14 CPU(s): 112-119
NUMA node15 CPU(s): 120-127
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

#perf stat ./nop
failed to read counter stalled-cycles-frontend
failed to read counter stalled-cycles-backend
failed to read counter branches

Performance counter stats for './nop':

78638.700540 task-clock (msec) # 0.999 CPUs utilized
1479 context-switches # 0.019 K/sec
55 cpu-migrations # 0.001 K/sec
37 page-faults # 0.000 K/sec
165127619524 cycles # 2.100 GHz
stalled-cycles-frontend
stalled-cycles-backend
165269372437 instructions # 1.00 insns per cycle
branches
3057191 branch-misses # 0.00% of all branches

78.692839007 seconds time elapsed

#dmidecode -t processor
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
# SMBIOS implementations newer than version 3.0 are not
# fully supported by this version of dmidecode.

Handle 0x0004, DMI type 4, 48 bytes
Processor Information
Socket Designation: BGA3576
Type: Central Processor
Family:
Manufacturer: PHYTIUM
ID: 00 00 00 00 70 1F 66 22
Version: S2500
Voltage: 0.8 V
External Clock: 50 MHz
Max Speed: 2100 MHz
Current Speed: 2100 MHz
Status: Populated, Enabled
Upgrade: Other
L1 Cache Handle: 0x0005
L2 Cache Handle: 0x0007
L3 Cache Handle: 0x0008
Serial Number: N/A
Asset Tag: No Asset Tag
Part Number: NULL
Core Count: 64
Core Enabled: 64
Thread Count: 64
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control

Other

2 dies, 2 NUMA nodes

#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 128
Socket(s): 1
NUMA node(s): 2
Model: 0
BogoMIPS: 100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 65536K //64core share
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

#cat cpu{0,1,8,16,30,31,32,127}/cache/index3/shared_cpu_list
0-63
0-63
0-63
0-63
0-63
0-63
0-63
64-127

#grep -E "core|64.000" lat.log
core:0
64.00000 59.653
core:8
64.00000 62.265
core:16
64.00000 59.411
core:24
64.00000 55.836
core:32
64.00000 55.909
core:40
64.00000 56.176
core:48
64.00000 57.240
core:56
64.00000 59.485
core:64
64.00000 131.818
core:72
64.00000 127.182
core:80
64.00000 122.452
core:88
64.00000 121.673
core:96
64.00000 126.533
core:104
64.00000 125.673
core:112
64.00000 124.188
core:120
64.00000 130.202

#numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 515652 MB
node 0 free: 514913 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 516086 MB
node 1 free: 514815 MB
node distances:
node 0 1
0: 10 15
1: 15 10

Single-core and HT prime-computation performance

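Judging purely by physical specifications, the AMD side of the two CPUs above looks far better, and AMD is also a generation (about 2 years) ahead in process technology. In single-core terms it is 2.0 GHz vs 2.5 GHz, but with a peak IPC of 5 vs 4 the ideal single-core performance comes out exactly equal (2.0 * 5 = 2.5 * 4).
Published third-party benchmark results also favor AMD, but how does it perform in practice?
The test command is shown below. On every CPU tested, this command takes half as long on 2 physical cores as on 1, so the computation is fully parallelizable.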
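taskset -c 1 /usr/bin/sysbench --num-threads=1 --test=cpu --cpu-max-prime=50000 run // single core: 1 thread, pinned to one core; HT: 2 threads, pinned to a pair of HT siblings

Results are elapsed time in seconds.

| Test | AMD EPYC 7H12 2.5G CentOS 7.9 | Hygon 7280 2.1GHz CentOS | Hygon 7280 2.1GHz Kylin | Intel 8269 2.50G | Intel 8163 CPU @ 2.50GHz | Intel E5-2682 v4 @ 2.50GHz |
|---|---|---|---|---|---|---|
| Single core, prime 50000, elapsed | 59 s, IPC 0.56 | 77 s, IPC 0.55 | 89 s, IPC 0.56 | 83 s, IPC 0.41 | 105 s, IPC 0.41 | 109 s, IPC 0.39 |
| HT, prime 50000, elapsed | 57 s, IPC 0.31 | 74 s, IPC 0.29 | 87 s, IPC 0.29 | 48 s, IPC 0.35 | 60 s, IPC 0.36 | 74 s, IPC 0.29 |

On the same CPU, instruction count ≈ elapsed time * IPC * number of cores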
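The relation above can be checked against the table. The sketch below multiplies elapsed time, IPC, and the number of hardware threads used for the single-core and HT rows of three CPUs; with the clock frequency held constant per CPU, the two products should come out roughly equal. The dictionary layout and function name are illustrative only, and the HT rows are treated as 2 hardware threads, each running at the listed IPC.

```python
# (row label, elapsed seconds, IPC, hardware threads used), copied from the table above.
runs = {
    "AMD EPYC 7H12": [("single core", 59, 0.56, 1), ("HT pair", 57, 0.31, 2)],
    "Hygon 7280 (CentOS)": [("single core", 77, 0.55, 1), ("HT pair", 74, 0.29, 2)],
    "Intel 8163": [("single core", 105, 0.41, 1), ("HT pair", 60, 0.36, 2)],
}


def relative_work(elapsed, ipc, threads):
    """Instruction count up to the constant clock-frequency factor."""
    return elapsed * ipc * threads


ratios = {}
for cpu, rows in runs.items():
    single, ht = (relative_work(t, ipc, n) for _, t, ipc, n in rows)
    ratios[cpu] = ht / single
    print(f"{cpu}: single={single:.2f} HT={ht:.2f} ratio={ratios[cpu]:.2f}")
```

The ratios land within a few percent of 1, which is consistent with the same prime workload executing about the same number of instructions in both configurations.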
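These results show that a single Hygon 7280 core out-computes a single Intel 8163 core, but hyper-threading contributes so little in this scenario that it might as well not exist.
Of course, computing primes is too simple a workload to represent complex business scenarios, so the next step is to look at MySQL query workloads.
ARM chips are clearly better than x86 at computing primes; the guess is that their division and remainder instructions are better optimized.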
#taskset -c 11 sysbench cpu --threads=1 --events=50000 run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

The result is the number of events completed in 10 seconds.

| Test | FT2500 2.1G | Kunpeng 920-4826 2.6GHz | Intel 8163 CPU @ 2.50GHz | Hygon C86 7280 2.1GHz | AMD 7T83 |
|---|---|---|---|---|---|
| Single core, prime, events in 10 s | 21626, IPC 0.89 | 30299, IPC 1.01 | 8435, IPC 0.41 | 10349, IPC 0.63 | 40112, IPC 1.38 |

Comparing MySQL sysbench and tpcc performance

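MySQL 5.7.34 Community Edition was deployed on Intel + AliOS and on Hygon 7280 + CentOS, with mysqld pinned to a single core. The same load settings drive the CPU to 100% in both cases, and sysbench then runs a point-select test. HT means mysqld is pinned to a pair of HT siblings.
sysbench point select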

The test command looks like this:

sysbench --test='/usr/share/doc/sysbench/tests/db/select.lua' --oltp_tables_count=1 --report-interval=1 --oltp-table-size=10000000 --mysql-port=3307 --mysql-db=sysbench_single --mysql-user=root --mysql-password='Bj6f9g96!@#' --max-requests=0 --oltp_skip_trx=on --oltp_auto_inc=on --oltp_range_size=5 --mysql-table-engine=innodb --rand-init=on --max-time=300 --mysql-host=x86.51 --num-threads=4 run

Test results (differences between setups: the AMD and Hygon CPUs run CentOS 7.9, the Intel CPUs and Kunpeng 920 run AliOS; xdb means the community MySQL Server was replaced with the group's in-house xdb; Kylin is a Chinese domestic OS). Each cell lists QPS followed by IPC:

| Cores tested | AMD EPYC 7H12 2.5G | Hygon 7280 2.1G | Hygon 7280 2.1GHz Kylin | Intel 8269 2.50G | Intel 8163 2.50G | Intel 8163 2.50G XDB5.7 | Kunpeng 920-4826 2.6G | Kunpeng 920-4826 2.6G XDB8.0 | FT2500 alisql 8.0, local socket |
|---|---|---|---|---|---|---|---|---|---|
| Single core | 24674 0.54 | 13441 0.46 | 10236 0.39 | 28208 0.75 | 25474 0.84 | 29376 0.89 | 9694 0.49 | 8301 0.46 | 3602 0.53 |
| One HT pair | 36157 0.42 | 21747 0.38 | 19417 0.37 | 36754 0.49 | 35894 0.6 | 40601 0.65 | no HT | no HT | no HT |
| 4 physical cores | 94132 0.52 | 49822 0.46 | 38033 0.37 | 90434 0.69 350% | 87254 0.73 | 106472 0.83 | 34686 0.42 | 28407 0.39 | 14232 0.53 |
| 16 physical cores | 325409 0.48 | 171630 0.38 | 134980 0.34 | 371718 0.69 1500% | 332967 0.72 | 446290 0.85 // better at 16 cores than at 4! | 116122 0.35 | 94697 0.33 | 59199 0.6; 8 cores: 31210 0.59 |
| 32 physical cores | 542192 0.43 | 298716 0.37 | 255586 0.33 | 642548 0.64 2700% | 588318 0.67 | 598637 0.81 CPU 2400% | 228601 0.36 | 177424 0.32 | 114020 0.65 |

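[*]Under Kylin OS it is hard to saturate the CPU; it only reaches roughly 90%-95%. The Kylin machine runs MySQL Community 5.7.29. With Phytium, pay close attention to which socket mysqld runs on. All Phytium figures above were measured through a local socket; with 32 cores over the network the QPS is 99496 (about 15% network overhead).
Performance impact of the page cache that holds the mysqld binary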
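One way to read the point-select table is to ask how close each CPU gets to linear scaling from 1 to 32 cores. The following Python sketch does that for a few columns; the function and dictionary names are illustrative, and the comparison is only a rough one because the 32-core runs were not all fully saturated.

```python
# (single-core QPS, 32-core QPS), copied from the point-select table above.
qps = {
    "AMD EPYC 7H12": (24674, 542192),
    "Hygon 7280": (13441, 298716),
    "Intel 8163": (25474, 588318),
    "Kunpeng 920": (9694, 228601),
    "FT2500": (3602, 114020),
}


def scaling_efficiency(single_qps, multi_qps, cores):
    """Fraction of perfectly linear scaling achieved at `cores` cores."""
    return multi_qps / (single_qps * cores)


efficiency = {cpu: scaling_efficiency(s, m, 32) for cpu, (s, m) in qps.items()}
for cpu, eff in efficiency.items():
    print(f"{cpu}: {eff:.0%} of linear scaling at 32 cores")

# FT2500 at 32 cores: local socket (114020) versus over the network (99496).
network_loss = 1 - 99496 / 114020
print(f"FT2500 network overhead: {network_loss:.1%}")
```

With these inputs the x86 parts and Kunpeng reach roughly 70% of linear scaling, while FT2500 scales almost linearly from a much lower single-core baseline. The FT2500 network gap computes to about 12.7%, the same order as the roughly 15% quoted in the note.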

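On Phytium the cross-socket impact is large: performance drops by more than 30% when the mysqld binary sits on the other socket.
For Kunpeng 920, the test was run on a two-socket server: mysqld is pinned to node0, the mysqld binary is loaded into the page cache of each node in turn, and a point-select test is run each time.

| mysqld binary on | node0 | node1 | node2 | node3 |
|---|---|---|---|---|
| QPS | 190120, IPC 0.40 | 182518, IPC 0.39 | 189046, IPC 0.40 | 186533, IPC 0.40 |

The data shows that node0 to node1 is rather slow here, surprisingly even slower than crossing sockets; put the other way, Kunpeng's cross-socket performance is very good.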
Steps for placing the mysqld binary in a given node's page cache

#systemctl stop mysql-server

#vmtouch -e /usr/local/mysql/bin/mysqld
Files: 1
Directories: 0
Evicted Pages: 5916 (23M)
Elapsed: 0.00322 seconds

#vmtouch -v /usr/local/mysql/bin/mysqld
/usr/local/mysql/bin/mysqld
[ ] 0/5916

Files: 1
Directories: 0
Resident Pages: 0/5916 0/23M 0%
Elapsed: 0.000204 seconds

#taskset -c 24 md5sum /usr/local/mysql/bin/mysqld

#grep mysqld /proc/`pidof mysqld`/numa_maps // check which node the mysqld binary actually landed on
00400000 default file=/usr/local/mysql/bin/mysqld mapped=3392 active=1 N0=3392 kernelpagesize_kB=4
0199b000 default file=/usr/local/mysql/bin/mysqld anon=10 dirty=10 mapped=134 active=10 N0=134 kernelpagesize_kB=4
01a70000 default file=/usr/local/mysql/bin/mysqld anon=43 dirty=43 mapped=120 active=43 N0=120 kernelpagesize_kB=4

Performance differences caused by the NIC and node distance

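On Kunpeng 920 + MySQL 5.7 + AliOS, memory allocation was locked to node0 and mysqld was pinned in turn to cores 1, 24, 48, and 72 for a sysbench point-select comparison.

| | Core1 | Core24 | Core48 | Core72 |
|---|---|---|---|---|
| QPS | 10800 | 10400 | 7700 | 7700 |

In the tests above, all memory allocated by the service process was restricted to node0 (the NIC interrupt tests below use the same memory layout).

#/root/numa-maps-summary.pl &1

The results are fully consistent with the node distances reported by numactl -H; the chip vendor presumably ran this kind of test and recorded the measured latency as the distance.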
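To make that correspondence concrete, the sketch below lines up the four QPS figures with the node distances from node 0 in the Kunpeng 920 numactl -H output shown earlier. The variable names are illustrative, and the core-to-node mapping assumes 24 consecutive cores per node, as in that output.

```python
# Distances from node 0, from the Kunpeng 920 `numactl -H` output earlier in this post.
distance_from_node0 = {0: 10, 1: 12, 2: 20, 3: 22}
CORES_PER_NODE = 24  # node0: 0-23, node1: 24-47, node2: 48-71, node3: 72-95

# QPS with memory locked to node 0 and mysqld pinned to the given core.
qps_by_core = {1: 10800, 24: 10400, 48: 7700, 72: 7700}

rows = []
for core, qps in qps_by_core.items():
    node = core // CORES_PER_NODE
    rows.append((core, node, distance_from_node0[node], qps))

local_qps = rows[0][3]
for core, node, distance, qps in rows:
    print(f"core {core:>2} (node {node}, distance {distance}): "
          f"{qps} QPS, {qps / local_qps:.0%} of local")
```

QPS never increases as the distance grows, and the large step between distance 12 and 20 coincides with the socket boundary.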
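AMD EPYC 7T83 (4 NUMA nodes) shows relatively large latency jitter, which is related to an architecture that assembles one CPU from many small dies.

#grep -E "core|64.00000" lat.log
core:0
64.00000 71.656
core:32
64.00000 80.129
core:64
64.00000 131.334
core:88
64.00000 136.774
core:96
64.00000 129.563
core:120
64.00000 140.151

AMD EPYC 7T83 (4 NUMA nodes) has higher latency than the Intel 8269, but its bandwidth is also much higher.
Loongson test data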

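The 3A5000 is a Loongson CPU. The command run is ./lat_mem_rd 128M 4096, where 4096 is the stride. The basic principle is to repeatedly read a memory region of a given size at a given stride and measure the average time per read. If the region is smaller than the L1 cache, the time should be close to the L1 access latency; if it is larger than L1 but smaller than L2, it approaches the L2 access latency; and so on. In the figure, the horizontal axis is the number of bytes accessed and the vertical axis is the access latency in cycles.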
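The "and so on" rule above can be written down as a small classifier. This Python sketch is illustrative only: the function name is made up, the cache sizes are the Intel 8163 (Skylake) values from the lscpu output earlier in this post, and real lat_mem_rd curves bend gradually near each boundary instead of switching sharply.

```python
KIB = 1024
MIB = 1024 * KIB


def expected_level(working_set_bytes, l1d_bytes, l2_bytes, l3_bytes):
    """Return which level of the hierarchy should dominate the measured latency."""
    if working_set_bytes <= l1d_bytes:
        return "L1"
    if working_set_bytes <= l2_bytes:
        return "L2"
    if working_set_bytes <= l3_bytes:
        return "L3"
    return "memory"


# Cache sizes of the Intel 8163 as reported by lscpu in this post.
SKYLAKE_8163 = dict(l1d_bytes=32 * KIB, l2_bytes=1024 * KIB, l3_bytes=33792 * KIB)

for size in (16 * KIB, 256 * KIB, 8 * MIB, 128 * MIB):
    level = expected_level(size, **SKYLAKE_8163)
    print(f"{size // KIB:>7} KiB working set -> {level} latency expected")
```

With a 128M upper bound, as in the command above, the largest working sets fall well past the L3 size, so the tail of the curve reflects main-memory latency.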

Stride-access latency at each level for the 3A5000, Zen1, and Skylake (cycles)

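The figure below shows the memory-access parallelism measured by LMbench; the command run is ./par_mem. Memory-access parallelism is the ability of each cache level and of main memory to serve concurrent accesses. LMbench measures it by building a linked list and repeatedly chasing to the next element, with the hop distance chosen according to the cache capacity being measured. Over a period of time it issues chases on several lists concurrently, that is, many lists are traversed at once. If N list chases can complete together within that period, the memory-access parallelism is taken to be N.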

The figure below lists the functional-unit operation latencies of the three processors; the command used is ./lat_ops.

Loongson stream data

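LMbench includes the STREAM bandwidth test, which measures sustainable memory bandwidth. Table 12.25 lists the STREAM bandwidth of the three processors. The STREAM array size is set to 100 million elements, and the OpenMP version runs four threads at once to measure full-load bandwidth. On each test platform, both of the CPU's memory controllers have one DIMM attached; the 3A5000 and Zen1 use DDR4 3200 DIMMs and Skylake uses DDR4 2400 DIMMs (the highest speed it supports).

The data shows that although both the 3A5000 and Zen1 implement DDR4 3200 in hardware, the 3A5000's measured sustainable bandwidth still lags behind. The memory bandwidth a user program sees depends not only on the physical memory frequency but also on the processor's internal memory-access queues, the memory controller's scheduling policy, the prefetchers, and the memory timing settings, so more analysis is needed to locate the specific bottleneck. Software tools such as STREAM reflect the combined capability of a subsystem well, which is why they are so widely used.
Conclusions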


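[*]AMD's single-core benchmark scores are relatively good
[*]In the MySQL query scenario Intel performs much better
[*]xdb outperforms the community edition
[*]MySQL 8.0 outperforms 5.7 under multi-core lock contention
[*]Intel is best and AMD is close to Intel; Hygon trails by a wide margin yet is still far ahead of Kunpeng; Phytium is worst, and cross-socket access in particular is a disaster
[*]Kylin OS is also slightly slower than CentOS
[*]From the perf metrics, Kunpeng 920's L1d hit rate is higher than the 8163's because Kunpeng's L1 is larger; its L2 hit rate is lower than the 8163's, likewise because Kunpeng's L2 is smaller. Kunpeng's L1i is also larger than the 8163's, yet in practice its L1i miss rate is higher, which suggests that ARM uses L1d less efficiently
Overall, with a process one generation ahead (7nm vs 14nm), AMD can finally come close to Intel in the MySQL query scenario, while Hygon, Kunpeng, and Phytium still fall short.
Appendix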

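perf metrics of the Kunpeng 920 and the 8163 in the MySQL scenario
Overall comparison

| Metric | X86 | ARM | Increase |
|---|---|---|---|
| IPC | 0.4979 | 0.495 | -0.6% |
| Branches | 237606414772 | 415979894985 | 75.1% |
| Branch-misses | 8104247620 | 28983836845 | 257.6% |
| Branch-miss rate | 0.034 | 0.070 | 104.3% |
| Memory read bandwidth (GB/s) | 25.0 | 25.0 | -0.2% |
| Memory write bandwidth (GB/s) | 24.6 | 67.8 | 175.5% |
| Memory read+write bandwidth (GB/s) | 49.7 | 92.8 | 86.8% |
| UNALIGNED_ACCESS | 1329146645 | 13686011901 | 929.7% |
| L1d_MISS_RATIO | 0.06055 | 0.04281 | -29.3% |
| L1d_MISS_RATE | 0.01645 | 0.01711 | 4.0% |
| L2_MISS_RATIO | 0.34824 | 0.47162 | 35.4% |
| L2_MISS_RATE | 0.00577 | 0.03493 | 504.8% |
| L1_ITLB_MISS_RATE | 0.0028 | 0.005 | 78.6% |
| L1_DTLB_MISS_RATE | 0.0025 | 0.0102 | 308.0% |
| context-switches | 8407195 | 11614981 | 38.2% |
| Pagefault | 228371 | 741189 | 224.6% |

References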

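CPU manufacturing and concepts
CPU performance and cache lines
Perf, IPC, and CPU performance
CPU performance comparison of Intel, Hygon, Kunpeng 920, and Phytium 2500
Performance testing of the Phytium ARM chip (FT2500)
Ten years on, do databases still not dare to embrace NUMA?
Notes from a resource-contention stress test on a Hygon physical machine
How the change to Intel's PAUSE instruction affects spin locks and MySQL performance
lmbench tests need to take caches into account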
comment:
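The Intel 8163 IPC is 0.67, which is basically consistent with what was measured under PostgreSQL. Oracle can reach a higher IPC. The 8163 perf results do not show what share of total cycles goes to memory access. Adding events such as cycle_activity.cycles_l1d_miss and cycle_activity.stalls_mem_any would show the share of cycles spent on memory access.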
 
Author | plantegg

Source: https://www.cnblogs.com/88223100/p/Large-PK-of-different-CPU-performance.html