An Introduction to Performance Profiling Tools: Valgrind + KCachegrind, Intel VTune, and Perf

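This post introduces the basic usage of three performance profiling tools and tries each of them on a real program. The program used as the profiling target is MCML: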

Executable : mcml
Test data : wcy_lo.mci (No. of photons: 102400)

[About the test program] MCML: Monte Carlo Simulation of Light Transport in Tissue
http://mropengate.blogspot.tw/2016/09/mcml-and-conv-monte-carlo-simulation-of.html

I. Profiling tools: Valgrind + KCachegrind

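Valgrind is a software development tool for memory debugging, memory leak detection, and performance profiling. It is free software released under the GNU General Public License.

Two of Valgrind's tools, callgrind and cachegrind, analyze a program's execution efficiency and help locate the bottlenecks that slow it down. Both are commonly used profilers.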

1. Installation

$ sudo apt-get install valgrind kcachegrind graphviz

2. Usage

$ valgrind --tool=callgrind [program] [program_options]

//in our example
$ valgrind --tool=callgrind ./mcml wcy_lo.mci

If the program runs successfully, output like the following is printed to the screen.

==4486==
==4486== Events    : Ir
==4486== Collected : 16493899159
==4486==
==4486== I   refs:      16,493,899,159
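The summary printed above can also be read programmatically, for example to compare the Ir total of two builds. Below is a minimal Python sketch, assuming the Valgrind output has already been captured as text; the function name parse_collected_ir is made up for illustration and is not part of Valgrind.

```python
import re


def parse_collected_ir(valgrind_output):
    """Return the first number on Callgrind's 'Collected :' line, or None.

    With the default options the only event is Ir, so the first number
    on that line is the total instruction count.
    """
    match = re.search(
        r"^==\d+==\s+Collected\s*:\s*(\d+)", valgrind_output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


# The sample text is the summary shown in this post.
sample = """\
==4486==
==4486== Events    : Ir
==4486== Collected : 16493899159
==4486==
==4486== I   refs:      16,493,899,159
"""

collected_ir = parse_collected_ir(sample)
print(collected_ir)
```

Returning None when no summary line is found lets the caller tell a failed run apart from a run that executed zero instructions.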

The result is stored in a file named callgrind.out.XXX, where XXX is the process ID.

KCachegrind can then display the profile graphically.

$ kcachegrind [callgrind.out.xxx]

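Pay attention to the Ir counts: Ir is essentially the number of machine instructions executed. Callgrind records only Ir by default; the cache events listed below are collected when it is run with --cache-sim=yes, and recent Valgrind releases report the L2 events as last-level (LL) events, e.g. ILmr and DLmr.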

Ir: I cache reads (instructions executed)
I1mr: I1 cache read misses (instruction wasn't in the I1 cache)
I2mr: L2 cache instruction read misses (instruction wasn't in the I1 or L2 cache and had to be fetched from memory)
Dr: D cache reads (memory reads)
D1mr: D1 cache read misses (data location wasn't in the D1 cache)
D2mr: L2 cache data read misses (location wasn't in the D1 or L2 cache)
Dw: D cache writes (memory writes)
D1mw: D1 cache write misses (location wasn't in the D1 cache)
D2mw: L2 cache data write misses (location wasn't in the D1 or L2 cache)
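These raw counts are easier to interpret as miss rates, i.e. misses divided by the matching number of accesses. The Python sketch below shows that arithmetic. The event totals in it are hypothetical, not measured from mcml, and the helper name miss_rates is an assumption made for this example.

```python
def miss_rates(counts):
    """Turn raw event totals into miss rates (misses / accesses)."""

    def ratio(misses, accesses):
        # Guard against division by zero when an event was never counted.
        return misses / accesses if accesses else 0.0

    return {
        "I1 miss rate": ratio(counts["I1mr"], counts["Ir"]),
        "L2 instruction miss rate": ratio(counts["I2mr"], counts["Ir"]),
        "D1 read miss rate": ratio(counts["D1mr"], counts["Dr"]),
        "L2 data read miss rate": ratio(counts["D2mr"], counts["Dr"]),
        "D1 write miss rate": ratio(counts["D1mw"], counts["Dw"]),
        "L2 data write miss rate": ratio(counts["D2mw"], counts["Dw"]),
    }


# Hypothetical totals, for illustration only.
example_counts = {
    "Ir": 1_000_000, "I1mr": 2_000, "I2mr": 500,
    "Dr": 400_000, "D1mr": 8_000, "D2mr": 1_000,
    "Dw": 200_000, "D1mw": 1_000, "D2mw": 200,
}

rates = miss_rates(example_counts)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

A high first-level miss rate with a low L2 miss rate suggests the working set still fits in L2, whereas a high L2 miss rate means many accesses go all the way to main memory.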

In the KCachegrind view, it is easy to spot which function is the "notorious pest" in our program.

The Call Graph view is also a great way to see how the Ir counts are distributed across functions.

II. Profiling tools: Intel VTune

Intel VTune Amplifier is a commercial performance profiler for software running on 32-bit and 64-bit x86-based machines.

1. Installation

Download Intel Parallel Studio:
https://registrationcenter.intel.com/en/forms/?licensetype=2&productid=2486

$ tar -zxvf parallel_studio_xe_2016_update3.tgz
$ ./install_GUI.sh
// an installation GUI will appear

2. Usage

$ source /opt/intel/vtune_amplifier_xe_2016/amplxe-vars.sh

$ amplxe-cl --collect hotspots ./mcml wcy_lo.mci

When the program finishes, a directory named "r000hs" is created under the MCML directory.

$ amplxe-gui

Once the GUI opens, choose [Open Result] and open [rXXXhs.amplxe] inside the [rXXXhs] directory.

III. Profiling tools: Perf

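Perf, short for "Performance Event", is a system profiling tool built into Linux since version 2.6.31 and released together with the kernel. Through perf, an application can be measured using the PMU (Performance Monitoring Unit), tracepoints, and special counters inside the kernel. Perf can also profile running kernel code at the same time, which gives a more complete picture of an application's performance bottlenecks.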

1. Installation

$ sudo apt-get install linux-tools-generic

2. Usage

$ perf record [program] [program_options]
//our example
$ perf record ./mcml wcy_lo.mci
$ perf report

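Running the commands above brings up the perf report menu.

From there you can inspect the sample percentage of individual instructions, taking the most time-consuming function, __cos_avx(), as an example.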

IV. MCML profiling results

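The call tree shows that HopDropSpin() is the bottleneck of the program.

However, the three tools express "amount of computation (or time spent)" in slightly different ways: Valgrind uses Ir counts, VTune uses CPU time, and Perf uses CPU cycles (samples). Their results therefore differ slightly as well.

All three tools identify Spin() as the main bottleneck. Valgrind, however, gives the three sub-functions inside Spin() similar weight, whereas VTune and Perf single out __cos_avx() inside Spin() as the main bottleneck. One possible explanation is that instruction count and execution time are not strictly proportional, since different instructions can take different numbers of cycles. Either way, whether the bottleneck shows up in instruction count or in execution time, the program is compute-intensive (CPU bound).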
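The gap between instruction share and time share can be illustrated with the usual estimate: time ≈ instructions × average cycles per instruction / clock frequency. The Python sketch below applies it to two made-up functions. The instruction counts, cycles-per-instruction values, and the 3 GHz clock are all assumptions chosen for illustration, not measurements from mcml.

```python
def estimated_seconds(instructions, cycles_per_instruction, clock_hz):
    """Rough run time estimate: instructions * CPI / clock frequency."""
    return instructions * cycles_per_instruction / clock_hz


CLOCK_HZ = 3.0e9  # assumed clock frequency

# name -> (instructions executed, average cycles per instruction); hypothetical
functions = {
    "cheap_arithmetic": (1_000_000_000, 0.5),
    "expensive_math": (1_000_000_000, 2.0),
}

times = {
    name: estimated_seconds(ir, cpi, CLOCK_HZ)
    for name, (ir, cpi) in functions.items()
}
all_instructions = sum(ir for ir, _ in functions.values())
all_time = sum(times.values())

for name, (ir, _) in functions.items():
    ir_share = ir / all_instructions
    time_share = times[name] / all_time
    print(f"{name}: {ir_share:.0%} of instructions, {time_share:.0%} of time")
```

Both functions execute the same number of instructions, so an instruction-counting tool weights them equally, while a time-based or cycle-based tool attributes most of the cost to the function with the higher cycles per instruction.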

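In addition, by default Perf attributes samples only to the function that was executing and does not count time spent in its callees. HopDropSpin() and HopDropSpinInTissue() therefore both rank low, and __cos_avx() ends up highest.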
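The difference between a function's own ("self") cost and its cost including callees ("inclusive") can be sketched in a few lines of Python. Only the function names below come from this post. The sample counts and the simplified single-chain call tree are assumptions made for illustration.

```python
# Samples attributed to each function's own code (hypothetical numbers).
self_samples = {
    "HopDropSpin": 5,
    "HopDropSpinInTissue": 10,
    "Spin": 25,
    "__cos_avx": 60,
}

# Simplified call tree: each function and the functions it calls (assumed).
callees = {
    "HopDropSpin": ["HopDropSpinInTissue"],
    "HopDropSpinInTissue": ["Spin"],
    "Spin": ["__cos_avx"],
    "__cos_avx": [],
}


def inclusive_samples(name):
    """Self samples of a function plus the inclusive samples of its callees."""
    return self_samples[name] + sum(inclusive_samples(c) for c in callees[name])


for name in self_samples:
    print(f"{name}: self={self_samples[name]}, inclusive={inclusive_samples(name)}")
```

Ranked by self cost the leaf function comes first, while ranked by inclusive cost the top-level caller does, which is why a self-time report and a call-tree view can appear to name different bottlenecks for the same run.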

V. Comparison of the profiling tools

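Valgrind : Combined with KCachegrind, Valgrind offers excellent visualization. I personally like the Call Graph view, which is very intuitive. It is also free software, so it is easy to obtain.

VTune : Provides report-style analysis and feels like the more commercial (Intel) product. (Note: VTune supports only Intel hardware, so it can draw on more hardware resources for deeper performance measurement, and its results are generally more precise.)

Perf : Measures cost in a slightly different way and offers a lightweight profiling solution. It also supports other uses such as real-time profiling.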

Tuesday, October 4, 2016