そして最近の分岐予測について2
先日の日記で最近のIntel CPUでは間接分岐の分岐予測がほとんどミスしなくなっているという話を紹介しましたが、Branch Prediction and the Performance of Interpreters - Don't Trust Folkloreという論文にまさに同じことが書かれているのを見つけました。ていうか、この論文わたし見た形跡がある……。
去年にこの論文を見かけたときは「Direct threaded codeとかオワコン」って話までしか見てなかったんですが、今改めて見ると分岐予測が世代ごとに進化していてすごいって話に加えて、ITTAGEという分岐予測手法を使うと同じくらい当たるって書いてありますね。
ITTAGEはTAGE (TAgged GEometric length predictor)の間接分岐版で、TAGEは原論文がA case for (partially) tagged Geometric History Length Branch Predictionにあります。また、Why TAGE is the bestという解説がありました。
なお、少なくともSandy Bridge世代では最適化マニュアルに"This unit predicts the target address not only based on the EIP of the branch but also based on the execution path through which execution reached this EIP. "とあり、これはTAGEの動きと異なるように見えます。Haswellでは再設計と共にこの記述は消え、"Next generation branch prediction"に変わりましたが、Intelからの発表は"Improved performance and saves wasted work"にとどまっています。Agner氏のまとめでは"These observations may indicate that there are two branch prediction methods: a fast method tied to the µop cache and the instruction cache, and a slower method using a branch target buffer."とあるので、TAGEとはまた違う手法なのかもしれません。
魔法が現実なのはわかったが、ブラックボックスなのは相変わらずなのであった。
それだけだとアレなので、optcarrotでのperf statの結果を載せておきます。optcarrotとは何かという話はRuby で高速なプログラムを書くを。なお、mameさんのマシンはBroadwellだと聞きました。
Sandy Bridge
$ perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 13.28153785633246 checksum: 59662 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 17860.390687 task-clock (msec) # 0.998 CPUs utilized 214 context-switches # 0.012 K/sec 0 cpu-migrations # 0.000 K/sec 15,470 page-faults # 0.866 K/sec 40,509,030,133 cycles # 2.268 GHz [83.34%] 18,724,702,913 stalled-cycles-frontend # 46.22% frontend cycles idle [83.33%] 11,614,302,978 stalled-cycles-backend # 28.67% backend cycles idle [66.67%] 52,111,018,708 instructions # 1.29 insns per cycle # 0.36 stalled cycles per insn [83.34%] 9,668,028,188 branches # 541.311 M/sec [83.32%] 505,219,551 branch-misses # 5.23% of all branches [83.33%] 17.897844545 seconds time elapsed $ perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 13.027601466360375 checksum: 59662 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 14796.375448 task-clock (msec) # 0.998 CPUs utilized 185 context-switches # 0.013 K/sec 0 cpu-migrations # 0.000 K/sec 14,404 page-faults # 0.973 K/sec 33,417,895,565 cycles # 2.259 GHz [83.29%] 11,061,511,903 stalled-cycles-frontend # 33.10% frontend cycles idle [83.34%] 7,225,689,220 stalled-cycles-backend # 21.62% backend cycles idle [66.70%] 51,166,479,085 instructions # 1.53 insns per cycle # 0.22 stalled cycles per insn [83.38%] 9,698,878,651 branches # 655.490 M/sec [83.35%] 507,996,700 branch-misses # 5.24% of all branches [83.34%] 14.820478694 seconds time elapsed
Ivy Bridge
% perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 17.73951805406009 checksum: 59662 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 11017.367463 task-clock (msec) # 0.997 CPUs utilized 273 context-switches # 0.025 K/sec 3 cpu-migrations # 0.000 K/sec 14877 page-faults # 0.001 M/sec 31898032583 cycles # 2.895 GHz 9862135088 stalled-cycles-frontend # 30.92% frontend cycles idle <not supported> stalled-cycles-backend 52153024400 instructions # 1.63 insns per cycle # 0.19 stalled cycles per insn 9663177149 branches # 877.086 M/sec 492985311 branch-misses # 5.10% of all branches 11.046473500 seconds time elapsed # 1472224481 0:14:41 naruse@tk2-243-31075:~ % perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 16.002392702765267 checksum: 59662 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 11865.744103 task-clock (msec) # 0.999 CPUs utilized 41 context-switches # 0.003 K/sec 1 cpu-migrations # 0.000 K/sec 14884 page-faults # 0.001 M/sec 34584790642 cycles # 2.915 GHz 11820020205 stalled-cycles-frontend # 34.18% frontend cycles idle <not supported> stalled-cycles-backend 51054848410 instructions # 1.48 insns per cycle # 0.23 stalled cycles per insn 9677225367 branches # 815.560 M/sec 567975115 branch-misses # 5.87% of all branches 11.871797668 seconds time elapsed
Haswell
% perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 28.363176942909437 checksum: 59662 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 7021.035983 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 14846 page-faults:u # 0.002 M/sec 18991845254 cycles:u # 2.705 GHz 52072400309 instructions:u # 2.74 insn per cycle 9648278758 branches:u # 1374.196 M/sec 61237531 branch-misses:u # 0.63% of all branches 7.026493931 seconds time elapsed % perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 27.269785989915373 checksum: 59662 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 7214.329827 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 14849 page-faults:u # 0.002 M/sec 19775865899 cycles:u # 2.741 GHz 50975216372 instructions:u # 2.58 insn per cycle 9662536352 branches:u # 1339.353 M/sec 67884021 branch-misses:u # 0.70% of all branches 7.218712639 seconds time elapsed
Skylake
% perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 45.18173733234584 checksum: 59662 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 4496.992449 task-clock (msec) # 1.000 CPUs utilized 6 context-switches # 0.001 K/sec 1 cpu-migrations # 0.000 K/sec 14857 page-faults # 0.003 M/sec 17488535114 cycles # 3.889 GHz 52123910662 instructions # 2.98 insn per cycle 9657149412 branches # 2147.468 M/sec 54148235 branch-misses # 0.56% of all branches 4.498014828 seconds time elapsed % perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 42.543931298125635 checksum: 59662 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 4662.130752 task-clock (msec) # 1.000 CPUs utilized 4 context-switches # 0.001 K/sec 1 cpu-migrations # 0.000 K/sec 13765 page-faults # 0.003 M/sec 18093681660 cycles # 3.881 GHz 51137911290 instructions # 2.83 insn per cycle 9698686892 branches # 2080.312 M/sec 74335218 branch-misses # 0.77% of all branches 4.662798673 seconds time elapsed