そして最近の分岐予測について2

先日の日記で最近のIntel CPUでは間接分岐の分岐予測がほとんどミスしなくなっているという話を紹介しましたが、Branch Prediction and the Performance of Interpreters - Don't Trust Folkloreという論文にまさに同じことが書かれているのを見つけました。ていうか、この論文わたし見た形跡がある……。

去年にこの論文を見かけたときは「Direct threaded codeとかオワコン」って話までしか見てなかったんですが、今改めて見ると分岐予測が世代ごとに進化していてすごいって話に加えて、ITTAGEという分岐予測手法を使うと同じくらい当たるって書いてありますね。

ITTAGEはTAGE (TAgged GEometric length predictor)の間接分岐版で、TAGEは原論文がA case for (partially) tagged Geometric History Length Branch Predictionにあります。また、Why TAGE is the bestという解説がありました。

なお、少なくともSandy Bridge世代では最適化マニュアルに"This unit predicts the target address not only based on the EIP of the branch but also based on the execution path through which execution reached this EIP. "とあり、これはTAGEの動きと異なるように見えます。Haswellでは再設計と共にこの記述は消え、"Next generation branch prediction"に変わりましたが、Intelからの発表は"Improved performance and saves wasted work"にとどまっています。Agner氏のまとめでは"These observations may indicate that there are two branch prediction methods: a fast method tied to the µop cache and the instruction cache, and a slower method using a branch target buffer."とあるので、TAGEとはまた違う手法なのかもしれません。

魔法が現実なのはわかったが、ブラックボックスなのは相変わらずなのであった。

それだけだとアレなので、optcarrotでのperf statの結果を載せておきます。optcarrotとは何かという話はRuby で高速なプログラムを書くを。なお、mameさんのマシンはBroadwellだと聞きました。

Sandy Bridge

$ perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 13.28153785633246
checksum: 59662

 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

      17860.390687 task-clock (msec)         #    0.998 CPUs utilized
               214 context-switches          #    0.012 K/sec
                 0 cpu-migrations            #    0.000 K/sec
            15,470 page-faults               #    0.866 K/sec
    40,509,030,133 cycles                    #    2.268 GHz                     [83.34%]
    18,724,702,913 stalled-cycles-frontend   #   46.22% frontend cycles idle    [83.33%]
    11,614,302,978 stalled-cycles-backend    #   28.67% backend  cycles idle    [66.67%]
    52,111,018,708 instructions              #    1.29  insns per cycle
                                             #    0.36  stalled cycles per insn [83.34%]
     9,668,028,188 branches                  #  541.311 M/sec                   [83.32%]
       505,219,551 branch-misses             #    5.23% of all branches         [83.33%]

      17.897844545 seconds time elapsed

$ perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 13.027601466360375
checksum: 59662

 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

      14796.375448 task-clock (msec)         #    0.998 CPUs utilized
               185 context-switches          #    0.013 K/sec
                 0 cpu-migrations            #    0.000 K/sec
            14,404 page-faults               #    0.973 K/sec
    33,417,895,565 cycles                    #    2.259 GHz                     [83.29%]
    11,061,511,903 stalled-cycles-frontend   #   33.10% frontend cycles idle    [83.34%]
     7,225,689,220 stalled-cycles-backend    #   21.62% backend  cycles idle    [66.70%]
    51,166,479,085 instructions              #    1.53  insns per cycle
                                             #    0.22  stalled cycles per insn [83.38%]
     9,698,878,651 branches                  #  655.490 M/sec                   [83.35%]
       507,996,700 branch-misses             #    5.24% of all branches         [83.34%]

      14.820478694 seconds time elapsed

Ivy Bridge

%  perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 17.73951805406009
checksum: 59662

 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

      11017.367463      task-clock (msec)         #    0.997 CPUs utilized
               273      context-switches          #    0.025 K/sec
                 3      cpu-migrations            #    0.000 K/sec
             14877      page-faults               #    0.001 M/sec
       31898032583      cycles                    #    2.895 GHz
        9862135088      stalled-cycles-frontend   #   30.92% frontend cycles idle
   <not supported>      stalled-cycles-backend
       52153024400      instructions              #    1.63  insns per cycle
                                                  #    0.19  stalled cycles per insn
        9663177149      branches                  #  877.086 M/sec
         492985311      branch-misses             #    5.10% of all branches

      11.046473500 seconds time elapsed

# 1472224481 0:14:41 naruse@tk2-243-31075:~
%  perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 16.002392702765267
checksum: 59662

 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

      11865.744103      task-clock (msec)         #    0.999 CPUs utilized
                41      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
             14884      page-faults               #    0.001 M/sec
       34584790642      cycles                    #    2.915 GHz
       11820020205      stalled-cycles-frontend   #   34.18% frontend cycles idle
   <not supported>      stalled-cycles-backend
       51054848410      instructions              #    1.48  insns per cycle
                                                  #    0.23  stalled cycles per insn
        9677225367      branches                  #  815.560 M/sec
         567975115      branch-misses             #    5.87% of all branches

      11.871797668 seconds time elapsed

Haswell

%  perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 28.363176942909437
checksum: 59662

 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

       7021.035983      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             14846      page-faults:u             #    0.002 M/sec
       18991845254      cycles:u                  #    2.705 GHz
       52072400309      instructions:u            #    2.74  insn per cycle
        9648278758      branches:u                # 1374.196 M/sec
          61237531      branch-misses:u           #    0.63% of all branches

       7.026493931 seconds time elapsed

%  perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 27.269785989915373
checksum: 59662

 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

       7214.329827      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             14849      page-faults:u             #    0.002 M/sec
       19775865899      cycles:u                  #    2.741 GHz
       50975216372      instructions:u            #    2.58  insn per cycle
        9662536352      branches:u                # 1339.353 M/sec
          67884021      branch-misses:u           #    0.70% of all branches

       7.218712639 seconds time elapsed

Skylake

%  perf stat ./miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 45.18173733234584
checksum: 59662

 Performance counter stats for './miniruby-gcc48 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

       4496.992449      task-clock (msec)         #    1.000 CPUs utilized
                 6      context-switches          #    0.001 K/sec
                 1      cpu-migrations            #    0.000 K/sec
             14857      page-faults               #    0.003 M/sec
       17488535114      cycles                    #    3.889 GHz
       52123910662      instructions              #    2.98  insn per cycle
        9657149412      branches                  # 2147.468 M/sec
          54148235      branch-misses             #    0.56% of all branches

       4.498014828 seconds time elapsed

%  perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes
ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
fps: 42.543931298125635
checksum: 59662

 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':

       4662.130752      task-clock (msec)         #    1.000 CPUs utilized
                 4      context-switches          #    0.001 K/sec
                 1      cpu-migrations            #    0.000 K/sec
             13765      page-faults               #    0.003 M/sec
       18093681660      cycles                    #    3.881 GHz
       51137911290      instructions              #    2.83  insn per cycle
        9698686892      branches                  # 2080.312 M/sec
          74335218      branch-misses             #    0.77% of all branches

       4.662798673 seconds time elapsed