Intel, Hygon, Kunpeng 920, and Phytium 2500: a CPU performance comparison
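To make programs run faster, I set out to understand how CPUs work: multi-core, hyper-threading, NUMA, turbo boost, power consumption, GPUs, big/little cores, and on to branch prediction, cache line invalidation, the cost of locking, IPC, and other metrics (each with accompanying code and test data). This series of articles answers those questions. It also covers the cases programmers care most about: branch prediction, the lock-free Disruptor, cache line false sharing, and so on.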
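Starting from the very bottom, the sand itself, this series uses eight articles to answer these questions, backed by many comparative experiments and test data.
The articles are organized around the following questions:
[*]Why can the same program run slower at 800% CPU than at 200% CPU?
[*]What is the principle behind IPC, and how does it relate to program efficiency?
[*]Why does the database world like to turn NUMA off, and is that the right call?
[*]How do several Chinese domestic chips actually perform?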
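Articles in this series
CPU manufacturing and concepts
Perf, IPC, and CPU performance
CPU performance and cache
CPU performance and cache lines
Ten years on, do databases still not dare to embrace NUMA?
How changes to the Intel PAUSE instruction affect spinlocks and MySQL performance
Intel, Hygon, Kunpeng 920, and Phytium 2500: a CPU performance comparison
Notes from a resource-contention stress test on a Hygon physical machine
Performance testing of the Phytium ARM chip (FT2500)
This article closes the series with a side-by-side comparison of x86 and ARM chips, and of performance under different design trade-offs.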
Basic CPU information
Hygon
    #lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                64
    On-line CPU(s) list:   0-63
    Thread(s) per core:    2     // two hyperthreads per physical core
    Core(s) per socket:    16    // 16 physical cores per socket
    Socket(s):             2     // 2 sockets
    NUMA node(s):          4
    Vendor ID:             HygonGenuine
    CPU family:            24
    Model:                 1
    Model name:            Hygon C86 5280 16-core Processor
    Stepping:              1
    CPU MHz:               2455.552
    CPU max MHz:           2500.0000
    CPU min MHz:           1600.0000
    BogoMIPS:              4999.26
    Virtualization:        AMD-V
    L1d cache:             32K
    L1i cache:             64K
    L2 cache:              512K
    L3 cache:              8192K
    NUMA node0 CPU(s):     0-7,32-39
    NUMA node1 CPU(s):     8-15,40-47
    NUMA node2 CPU(s):     16-23,48-55
    NUMA node3 CPU(s):     24-31,56-63
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

    #numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
    node 0 size: 128854 MB
    node 0 free: 89350 MB
    node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
    node 1 size: 129019 MB
    node 1 free: 89326 MB
    node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
    node 2 size: 128965 MB
    node 2 free: 86542 MB
    node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
    node 3 size: 129020 MB
    node 3 free: 98227 MB
    node distances:
    node   0   1   2   3
      0:  10  16  28  22
      1:  16  10  22  28
      2:  28  22  10  16
      3:  22  28  16  10

This CPU is reportedly a "glued" multi-die design: two dies are packaged together as one CPU, so latency between dies inside a single CPU package is still high.
How the 64 logical cores are numbered
    physical    core     processor
    0           0~15     0~15
    1           0~15     16~31
    0           0~15     32~47
    1           0~15     48~63
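The numbering above can be sketched as a small mapping function. This is only an illustration derived from the table and from the `NUMA nodeN CPU(s)` lines of the Hygon `lscpu` output. The function name `hygon_topology` is hypothetical. The hyperthread sibling rule (`cpu` and `cpu + 32` share a physical core) is an assumption read off the table, not something confirmed by `/sys` topology files.

```python
def hygon_topology(cpu: int) -> dict:
    """Map a logical CPU number (0-63) on the 2-socket Hygon 5280 box
    to its socket, physical core, NUMA node and hyperthread sibling.

    Assumption: layout follows the table in the article
    (processors 0~15 and 32~47 on socket 0, 16~31 and 48~63 on socket 1).
    """
    if cpu not in range(64):
        raise ValueError("logical CPU must be in 0..63")
    socket = (cpu // 16) % 2        # blocks of 16 alternate between sockets
    core = cpu % 16                 # core index within the socket
    numa_node = (cpu % 32) // 8     # 8 physical cores per NUMA node (die)
    sibling = (cpu + 32) % 64       # assumed hyperthread sibling
    return {"socket": socket, "core": core,
            "numa_node": numa_node, "sibling": sibling}


if __name__ == "__main__":
    for cpu in (0, 8, 40, 63):
        print(cpu, hygon_topology(cpu))
```

When pinning a benchmark, a mapping like this helps avoid putting two busy threads on the same physical core by accident (for example CPUs 0 and 32).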
Intel CPU
Compared with Broadwell, the Cascade Lake architecture keeps L1 unchanged, quadruples L2 from 256K to 1M, and cuts L3 from 2.5M to 1.38M per core.
    #lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                104
    On-line CPU(s) list:   0-103
    Thread(s) per core:    2
    Core(s) per socket:    26
    Socket(s):             2
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 85
    Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
    Stepping:              7
    CPU MHz:               1200.000
    CPU max MHz:           2501.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              5000.00
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              1024K
    L3 cache:              36608K
    NUMA node0 CPU(s):     0-103
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni spec_ctrl intel_stibp flush_l1d arch_capabilities

    # numactl -H
    available: 1 nodes (0)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
    node 0 size: 785826 MB
    node 0 free: 108373 MB
    node distances:
    node   0
      0:  10

    // Xeon E5
    #lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                64
    On-line CPU(s) list:   0-63
    Thread(s) per core:    2
    Core(s) per socket:    16
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 79
    Model name:            Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
    Stepping:              1
    CPU MHz:               2500.000
    CPU max MHz:           3000.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              5000.06
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              40960K
    NUMA node0 CPU(s):     0-15,32-47
    NUMA node1 CPU(s):     16-31,48-63
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local cat_l3

    #numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
    node 0 size: 262008 MB
    node 0 free: 240846 MB
    node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
    node 1 size: 262144 MB
    node 1 free: 242774 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10

Kunpeng 920
The Kunpeng 920-4826 has twice the L1 of the 8269CY (64K vs 32K) but half the L2 (512K vs 1024K). L3 per physical core is 1M on the Kunpeng and 1.38M on the 8269CY.
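The per-core L3 figures quoted in this article can be re-derived from the `L3 cache` and `Core(s) per socket` values that `lscpu` reports for each machine. The snippet below is a minimal sketch of that arithmetic. The dictionary name `CPUS` and the helper `l3_per_core_mib` are illustrative, and it assumes the `lscpu` L3 value is the total shared by all cores of one socket.

```python
# (L3 size in KiB as printed by lscpu, physical cores per socket)
CPUS = {
    "Xeon E5-2682 v4 (Broadwell)": (40960, 16),
    "Xeon Platinum 8269CY (Cascade Lake)": (36608, 26),
    "Kunpeng 920-4826": (49152, 48),
}


def l3_per_core_mib(l3_kib: int, cores: int) -> float:
    """Return the L3 share per physical core in MiB.

    Assumption: the whole L3 is shared evenly by the cores of one socket.
    """
    return l3_kib / cores / 1024


if __name__ == "__main__":
    for name, (l3_kib, cores) in CPUS.items():
        print(f"{name}: {l3_per_core_mib(l3_kib, cores):.3f} MiB/core")
```

The 8269CY value comes out at 1.375 MiB, which the text rounds to 1.38M.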
    #lscpu
    Architecture:          aarch64
    Byte Order:            Little Endian
    CPU(s):                96
    On-line CPU(s) list:   0-95
    Thread(s) per core:    1
    Core(s) per socket:    48
    Socket(s):             2
    NUMA node(s):          1
    Model:                 0
    CPU max MHz:           2600.0000
    CPU min MHz:           200.0000
    BogoMIPS:              200.00
    L1d cache:             64K
    L1i cache:             64K
    L2 cache:              512K
    L3 cache:              49152K
    NUMA node0 CPU(s):     0-95
    Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

    #numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    node 0 size: 192832 MB
    node 0 free: 187693 MB
    node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
    node 1 size: 193533 MB
    node 1 free: 191827 MB
    node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
    node 2 size: 193533 MB
    node 2 free: 192422 MB
    node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 3 size: 193532 MB
    node 3 free: 193139 MB
    node distances:
    node   0   1   2   3
      0:  10  12  20  22
      1:  12  10  22  24
      2:  20  22  10  12
      3:  22  24  12  10

    #dmidecode -t processor | grep Version
    Version: Kunpeng 920-4826
    Version: Kunpeng 920-4826

The distances between the four NUMA nodes of the Kunpeng 920 above can be pictured as follows:

    node 0                node 2
      | (die distance)      | (die distance)
    node 1                node 3

Note that the distance from node 1 to node 3 (24) is larger than from node 0 to node 3 (22). Since node 0 to node 2 is the shortest cross-socket distance (20), my guess is that the inter-socket link attaches only to node 0 and node 2.
For the Kunpeng 920 architecture, see the following description:
Though Huawei has been keeping a tight lip on the chip design itself, the Hi1620 is actually a multi-chip design. Actually, we believe are three dies. The chip itself comprise two compute dies called the Super CPU cluster (SCCL), each one packing 32 cores. It’s also possible the SCCL only have 24 cores, in which case there are three such dies with a theoretical maximum core count of 72 cores possible but are not offered for yield reasons. Regardless of this, there are at least two SCCL dies for sure. Additionally, there is also an I/O die called the Super IO Cluster (SICL) which contains all the high-speed SerDes and low-speed I/Os.
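The figure shows the 6426 model. I tested the 4826 model, which has 48 cores per CPU. Each CPU package contains three dies: two compute dies holding the cores, plus one Super IO Cluster.
Kunpeng naming convention:
Kunpeng roadmap
Kunpeng 920-4826 cross-NUMA performance comparison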
Bind to 24 cores spanning NUMA node 0 and node 3, one of the farther distances reported by numactl -H. Two-minute Current tpmC: 69660
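To see where the node 0 / node 3 pair sits among all node pairs, the distance matrix can be parsed and ranked. The sketch below hard-codes the `node distances` block from the Kunpeng 920 `numactl -H` output shown earlier. The function names `parse_distances` and `ranked_pairs` are illustrative helpers, not part of any tool.

```python
# "node distances" block copied from the Kunpeng 920 numactl -H output.
NUMACTL_DISTANCES = """\
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10
"""


def parse_distances(text: str) -> list:
    """Turn the numactl distance table into a square list of ints."""
    rows = [line.split() for line in text.strip().splitlines()[1:]]
    # row[0] is the "N:" label, the rest are distances
    return [[int(value) for value in row[1:]] for row in rows]


def ranked_pairs(matrix: list) -> list:
    """Return ((i, j), distance) for every i < j, nearest pair first."""
    n = len(matrix)
    pairs = [((i, j), matrix[i][j]) for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    for (a, b), dist in ranked_pairs(parse_distances(NUMACTL_DISTANCES)):
        print(f"node {a} <-> node {b}: {dist}")
```

On this matrix the node 0 / node 3 pair (22) is farther than any same-socket pair (12) and the node 0 / node 2 pair (20), while node 1 / node 3 (24) is the farthest.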
    #taskset -a -cp 12-23,72-83 20799

    #perf stat -e branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,stalled-cycles-backend,stalled-cycles-frontend,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-store-misses,L1-dcache-stores,L1-icache-load-misses,L1-icache-loads,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads,cpu-migrations -p 20799
    ^C
    Performance counter stats for process id '20799':

    2,866,418,154  branch-misses  (59.84%)
    549,673,215,827  bus-cycles  (59.89%)
    2,179,816,578  cache-misses  # 2.360 % of all cache refs  (59.93%)
    92,377,674,343  cache-references  (60.04%)
    549,605,057,475  cpu-cycles  (65.05%)
    229,958,980,614  instructions  # 0.42 insn per cycle
                                   # 1.31 stalled cycles per insn  (65.05%)
    146,201,062,116  stalled-cycles-backend  # 26.60% backend cycles idle  (65.08%)
    301,814,831,043  stalled-cycles-frontend  # 54.91% frontend cycles idle  (65.08%)
    2,177,062,319  L1-dcache-load-misses  # 2.35% of all L1-dcache hits  (65.04%)
    92,481,797,426  L1-dcache-loads  (65.11%)
    2,175,030,428  L1-dcache-store-misses  (65.15%)
    92,507,474,710  L1-dcache-stores  (65.14%)
    9,299,812,249  L1-icache-load-misses  # 12.47% of all L1-icache hits  (65.20%)
    74,579,909,037  L1-icache-loads  (65.16%)
    2,862,664,443  branch-load-misses  (65.08%)
    52,826,930,842  branch-loads  (65.04%)
    3,729,265,130  dTLB-load-misses  # 3.11% of all dTLB cache hits  (64.95%)
    119,896,014,498  dTLB-loads  (59.90%)
    1,350,782,047  iTLB-load-misses  # 1.83% of all iTLB cache hits  (59.84%)
    74,005,620,378  iTLB-loads  (59.82%)
    510  cpu-migrations

    9.483137760 seconds time elapsed

Bind to cores 72-95, all within one NUMA node. The process was not restarted, however, so half of its memory still sits on NUMA node 0. Two-minute Current tpmC: 75900
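The gap between the two bindings is easier to judge as a percentage. The sketch below does only that arithmetic on the two tpmC figures reported in this section. The function name `relative_gain` is illustrative, and two-minute runs like these say nothing about run-to-run variance.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100


# tpmC figures reported in the article for the Kunpeng 920-4826
CROSS_NUMA_TPMC = 69660   # cores 12-23 + 72-83 (node 0 and node 3)
SAME_NUMA_TPMC = 75900    # cores 72-95 (node 3 only, memory partly on node 0)

if __name__ == "__main__":
    gain = relative_gain(CROSS_NUMA_TPMC, SAME_NUMA_TPMC)
    print(f"same-node binding is {gain:.2f}% faster")
```

Because half of the memory was still on node 0 during the second run, this roughly 9% difference probably understates what a fully local run would show.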
    #taskset -a -cp 72-95 20799

    #perf stat -e branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,stalled-cycles-backend,stalled-cycles-frontend,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-store-misses,L1-dcache-stores,L1-icache-load-misses,L1-icache-loads,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads,cpu-migrations -p 20799
    ^C
    Performance counter stats for process id '20799':

    2,665,583,722  branch-misses  (59.90%)
    500,184,789,050  bus-cycles  (59.95%)
    1,997,726,097  cache-misses  # 2.254 % of all cache refs  (59.94%)
    88,628,013,529  cache-references  (59.93%)
    500,111,712,450  cpu-cycles  (64.98%)
    221,098,464,920  instructions  # 0.44 insn per cycle
                                   # 1.35 stalled cycles per insn  (65.02%)
    105,957,124,479  stalled-cycles-backend  # 21.19% backend cycles idle  (65.02%)
    298,186,439,955  stalled-cycles-frontend  # 59.62% frontend cycles idle  (65.02%)
    1,996,313,908  L1-dcache-load-misses  # 2.25% of all L1-dcache hits  (65.04%)
    88,701,699,646  L1-dcache-loads  (65.09%)
    1,997,851,364  L1-dcache-store-misses  (65.10%)
    88,614,658,960  L1-dcache-stores  (65.10%)
    8,635,807,737  L1-icache-load-misses  # 12.30% of all L1-icache hits  (65.13%)
    70,233,323,630  L1-icache-loads  (65.16%)
    2,665,567,783  branch-load-misses  (65.10%)
    50,482,936,168  branch-loads  (65.09%)
    3,614,564,473  dTLB-load-misses  # 3.15% of all dTLB cache hits  (65.04%)
    114,619,822,486  dTLB-loads  (59.96%)
    1,270,926,362  iTLB-load-misses  # 1.81% of all iTLB cache hits  (59.97%)
    70,248,645,721  iTLB-loads  (59.94%)
    128  cpu-migrations

    8.610934700 seconds time elapsed

    #/root/numa-maps-summary.pl /dev/null ‘

                    Execution time (s)    Clock frequency
    Hygon           31.061s               2.5G
    Kunpeng 920     23.521s               2.6G
    Phytium 2500                          2.1G
    Intel           22.979s (8163)        2.5G
    710             15.570s               2.75G

To run on all cores at once, you can do this:
for i in {0..95}; do time echo "scale=5000; 4*a(1)" | bc -l -q >/dev/null & done
perf stat -e branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,stalled-cycles-backend,stalled-cycles-frontend,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-store-misses,L1-dcache-stores,L1-icache-load-misses,L1-icache-loads,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads --
710
Elapsed 15.83 s, IPC 2.64
    perf stat -e branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,stalled-cycles-backend,stalled-cycles-frontend,alignment-faults,bpf-output,context-switches,cpu-clock,cpu-migrations,dummy,emulation-faults,major-faults,minor-faults,page-faults,task-clock,L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses,L1-icache-loads,LLC-load-misses,LLC-loads,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads -- bash -c 'echo "7^999999" | bc > /dev/null'

    Performance counter stats for 'bash -c echo "7^999999" | bc > /dev/null':

    985,496,277  branch-misses  (29.97%)
    43,509,183,948  bus-cycles  # 2748.210 M/sec  (29.97%)
    7,068,868  cache-misses  # 0.020 % of all cache refs  (29.96%)
    35,165,185,942  cache-references  # 2221.170 M/sec  (29.97%)
    43,508,579,063  cpu-cycles  # 2.748 GHz  (34.97%)
    114,779,081,188  instructions  # 2.64 insn per cycle
                                   # 0.04 stalled cycles per insn  (34.99%)
    4,913,750,141  stalled-cycles-backend  # 11.29% backend cycles idle  (35.02%)
    4,255,139,235  stalled-cycles-frontend  # 9.78% frontend cycles idle  (35.02%)
    0  alignment-faults  # 0.000 K/sec
    0  bpf-output  # 0.000 K/sec
    24  context-switches  # 0.002 K/sec
    15,831.82 msec  cpu-clock  # 1.000 CPUs utilized

Intel
Elapsed 18.60 s, IPC 2.19
    # sudo perf stat -e branch-instructions,branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,ref-cycles,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses,LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads,node-load-misses,node-loads,node-store-misses,node-stores -- bash -c 'echo "7^999999" | bc > /dev/null'

    Performance counter stats for 'bash -c echo "7^999999" | bc > /dev/null':

    25,130,886,211  branch-instructions  (10.72%)
    1,200,086,175  branch-misses  # 4.78% of all branches  (14.29%)
    460,824,074  bus-cycles  (14.29%)
    1,983,459  cache-misses  # 46.066 % of all cache refs  (14.30%)
    4,305,730  cache-references  (14.30%)
    58,626,314,801  cpu-cycles  (17.87%)
    128,284,870,917  instructions  # 2.19 insn per cycle  (21.45%)
    46,040,656,499  ref-cycles  (25.02%)
    22,821,794  L1-dcache-load-misses  # 0.10% of all L1-dcache hits  (25.02%)
    23,041,732,649  L1-dcache-loads  (25.01%)
    5,386,243,625  L1-dcache-stores  (25.00%)
    12,443,154  L1-icache-load-misses  (25.00%)
    178,790  LLC-load-misses  # 30.52% of all LL-cache hits  (14.28%)
    585,724  LLC-loads  (14.28%)
    469,381  LLC-store-misses  (7.14%)
    664,865  LLC-stores  (7.14%)
    1,201,547,113  branch-load-misses  (10.71%)
    25,139,625,428  branch-loads  (14.28%)
    63,334  dTLB-load-misses  # 0.00% of all dTLB cache hits  (14.28%)
    23,023,969,089  dTLB-loads  (14.28%)
    17,355  dTLB-store-misses  (14.28%)
    5,378,496,562  dTLB-stores  (14.28%)
    341,119  iTLB-load-misses  # 119.92% of all iTLB cache hits  (14.28%)
    284,445  iTLB-loads  (14.28%)
    151,608  node-load-misses  (14.28%)
    37,553  node-loads  (14.29%)
    434,537  node-store-misses  (7.14%)
    65,709  node-stores  (7.14%)

    18.603323495 seconds time elapsed

    18.525904000 seconds user
    0.015197000 seconds sys

Kunpeng 920
Elapsed 24.6 s, IPC 1.84
    #perf stat -e branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,stalled-cycles-backend,stalled-cycles-frontend,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-store-misses,L1-dcache-stores,L1-icache-load-misses,L1-icache-loads,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads -- bash -c 'echo "7^999999" | bc > /dev/null'

    Performance counter stats for 'bash -c echo "7^999999" | bc > /dev/null':

    1,467,769,425  branch-misses  (59.94%)
    63,866,536,853  bus-cycles  (59.94%)
    6,571,273  cache-misses  # 0.021 % of all cache refs  (59.94%)
    30,768,754,927  cache-references  (59.96%)
    63,865,354,560  cpu-cycles  (64.97%)
    117,790,453,518  instructions  # 1.84 insns per cycle
                                   # 0.07 stalled cycles per insn  (64.98%)
    833,090,930  stalled-cycles-backend  # 1.30% backend cycles idle  (65.00%)
    7,918,227,782  stalled-cycles-frontend  # 12.40% frontend cycles idle  (65.01%)
    6,962,902  L1-dcache-load-misses  # 0.02% of all L1-dcache hits  (65.03%)
    30,804,266,645  L1-dcache-loads  (65.05%)
    6,960,157  L1-dcache-store-misses  (65.06%)
    30,807,954,068  L1-dcache-stores  (65.06%)
    1,012,171  L1-icache-load-misses  (65.06%)
    45,256,066,296  L1-icache-loads  (65.04%)
    1,470,467,198  branch-load-misses  (65.03%)
    27,108,794,972  branch-loads  (65.01%)
    475,707  dTLB-load-misses  # 0.00% of all dTLB cache hits  (65.00%)
    35,159,826,836  dTLB-loads  (59.97%)
    912  iTLB-load-misses  # 0.00% of all iTLB cache hits  (59.96%)
    45,325,885,822  iTLB-loads  (59.94%)

    24.604603640 seconds time elapsed

Hygon
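Elapsed 26.73 s, IPC 0.92. This run was collected with -a (system wide), so the counters cover everything running on the machine, not just the bc process.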
    sudo perf stat -e branch-instructions,branch-misses,cache-references,cpu-cycles,instructions,stalled-cycles-backend,stalled-cycles-frontend,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetches,L1-icache-load-misses,L1-icache-loads,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads -a -- bash -c 'echo "7^999999" | bc > /dev/null'

    Performance counter stats for 'system wide':

    57,795,675,025  branch-instructions  (27.78%)
    2,459,509,459  branch-misses  # 4.26% of all branches  (27.78%)
    12,171,133,272  cache-references  (27.79%)
    317,353,262,523  cpu-cycles  (27.79%)
    293,162,940,548  instructions  # 0.92 insn per cycle
                                   # 0.19 stalled cycles per insn  (27.79%)
    55,152,807,029  stalled-cycles-backend  # 17.38% backend cycles idle  (27.79%)
    44,410,732,991  stalled-cycles-frontend  # 13.99% frontend cycles idle  (27.79%)
    4,065,273,083  L1-dcache-load-misses  # 3.58% of all L1-dcache hits  (27.79%)
    113,699,208,151  L1-dcache-loads  (27.79%)
    1,351,513,191  L1-dcache-prefetches  (27.79%)
    2,091,035,340  L1-icache-load-misses  # 4.43% of all L1-icache hits  (27.79%)
    47,240,289,316  L1-icache-loads  (27.79%)
    2,459,838,728  branch-load-misses  (27.79%)
    57,855,156,991  branch-loads  (27.78%)
    69,731,473  dTLB-load-misses  # 20.40% of all dTLB cache hits  (27.78%)
    341,773,319  dTLB-loads  (27.78%)
    26,351,132  iTLB-load-misses  # 15.91% of all iTLB cache hits  (27.78%)
    165,656,863  iTLB-loads  (27.78%)

    26.729972414 seconds time elapsed

Phytium
    time perf stat -e branch-misses,bus-cycles,cache-misses,cache-references,cpu-cycles,instructions,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-store-misses,L1-dcache-stores,L1-icache-load-misses,L1-icache-loads,branch-load-misses,branch-loads,dTLB-load-misses,iTLB-load-misses -a -- bash -c 'echo "7^999999" | bc > /dev/null'

    Performance counter stats for 'system wide':

    2552812813  branch-misses  (38.08%)
    602038279874  bus-cycles  (37.54%)
    1742826523  cache-misses  # 2.017 % of all cache refs  (37.54%)
    86400294181  cache-references  (37.55%)
    612467194375  cpu-cycles  (43.79%)
    263691445872  instructions  # 0.43 insns per cycle  (43.79%)
    1706247569  L1-dcache-load-misses  # 2.00% of all L1-dcache hits  (43.78%)
    85122454139  L1-dcache-loads  (43.77%)
    1711243358  L1-dcache-store-misses  (39.38%)
    86288158984  L1-dcache-stores  (37.52%)
    2006641212  L1-icache-load-misses  (37.51%)
    146380907111  L1-icache-loads  (37.51%)
    2560208048  branch-load-misses  (37.52%)
    63127187342  branch-loads  (41.38%)
    768494735  dTLB-load-misses  (43.77%)
    124424415  iTLB-load-misses  (43.77%)

    39.654819568 seconds time elapsed

    real 0m39.763s
    user 0m39.635s
    sys 0m0.127s

Perf data comparison
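As a cross-check, the IPC that perf prints for each bc run is simply instructions divided by cpu-cycles. The sketch below recomputes it from the counter values reported for the five runs. The names `RUNS` and `ipc` are illustrative. Keep in mind that the Hygon and Phytium runs were collected with -a (system wide), so their figures include whatever else was running on those machines.

```python
# (instructions, cpu-cycles) copied from the perf stat outputs in this article
RUNS = {
    "710": (114_779_081_188, 43_508_579_063),
    "Intel": (128_284_870_917, 58_626_314_801),
    "Kunpeng 920": (117_790_453_518, 63_865_354_560),
    "Hygon (system wide)": (293_162_940_548, 317_353_262_523),
    "Phytium (system wide)": (263_691_445_872, 612_467_194_375),
}


def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle."""
    return instructions / cycles


if __name__ == "__main__":
    # Print the runs from highest to lowest IPC
    ordered = sorted(RUNS.items(), key=lambda kv: ipc(*kv[1]), reverse=True)
    for name, (instructions, cycles) in ordered:
        print(f"{name:24s} IPC {ipc(instructions, cycles):.2f}")
```

The recomputed values match the `insn per cycle` figures in the perf outputs, which is a quick way to confirm that the counters were transcribed correctly.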
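Intel
On Intel CPUs, IPC falls steadily as the number of threads increases, but not linearly.
Hygon
The data below show that IPC stays stable until all 32 physical cores are in use. Beyond 32 cores, IPC falls as concurrency rises and performance no longer improves.
Kunpeng 920
The Kunpeng 920 shows essentially no contention when running openssl on multiple cores, so it still scales fully linearly.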
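Summary
Intel's pipeline suits high-bandwidth applications better than compute-intensive ones: Intel has fewer pipelines, but its memory read/write path and out-of-order execution are well optimized. Pure computation is not Intel's strong suit.
In database workloads the Kunpeng 920 delivers roughly 70% of the x86 capability.
Prime-number computation generally runs on the FPU rather than the CPU's integer units.
Intel x86 CPU-bound and memory-bound data
Test code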
    #include
    #include
    #include
    #include

    char a = 1;

    void memory_bound() {
        register unsigned i=0;
        register char b;

        for (i=0;i