程序分析利器perf 简介 全称为performance counter,是一个linux内核实现的框架来进行性能分析,覆盖了硬件CPU/PMU和(软件计数器,监测点)之类的特性。
本文重点介绍perf的使用和实践,不对原理性知识进行讲解
背景知识 硬件 cpu流水线 cache tick 时钟周期,假设一台处理器的处理频率为2GHZ,那么时钟周期为2GHZ,一次tick的时间间隔为1/2ns
软件 clock tick 时钟滴答,在经过多个时钟周期后,机器内部会产生一次时钟滴答,即一次时钟中断,用来对任务进行管理。
tracepoints 追踪点,由ftrace框架实现,一些函数在内核中被特殊标记,在运行到这些函数时进行函数记录并记录在内核环形缓冲区,消耗时间很小,并可以通过文件节点打印出来,详见/sys/kernel/debug/tracing
原理 perf通过进行事件采样来对性能进行分析,事件根据类型分为几类:
Kernel PMU event Hardware event 比如 branch-miss Hardware cache event dTLB和iTLB相关,即数据转换后备缓存区和指令转换后备缓冲区 software event 比如 cpu-migrations tracepoints 缩写 IPC:每个时钟周期内的指令数 PMU:performance monitoring units 使用 安装 1 sudo apt-get install perf-tools-unstable
命令 参数 常用的参数如下:
-e 指定事件 -p 指定pid perf help 通过perf --help
打印全部的命令,perf 的全部选项命令如下
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 bench General framework for benchmark suites 基准测试 test Runs sanity tests. 检查系统是否支持perf的全部功能 list List all symbolic event types top System profiling tool. 实时热点事件分析 stat Run a command and gather performance counter statistics record Run a command and record its profile into perf.data report Read perf.data (created by perf record) and display the profile probe Define new dynamic tracepoints 动态追踪点 trace strace inspired tool 同strace功能 kmem Tool to trace/measure kernel memory properties kernel内存 lock Analyze lock events mem Profile memory accesses sched Tool to trace/measure scheduler properties (latencies) 调度子系统 kvm Tool to trace/measure kvm guest os timechart Tool to visualize total system behavior during a workload evlist List the event names in a perf.data file 列出record文件的所有事件
perf bench 系统基准测试,通过对4个子系统的运行来评估系统性能
sched 包括进程调度和IPC进程通信(socket和pipe)
1 2 3 4 5 6 7 8 9 10 11 12 13 kernel_tool_test_C/ $ perf bench sched all # Running sched/messaging benchmark... >>20个进程对通信 # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.031 [sec] # Running sched /pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 6.182 [sec] 6.182868 usecs/op 161737 ops/sec
mem 内存copy性能评估,包括两种方式memcpy和memset,结果显示为吞吐率
1 2 3 4 5 kernel_tool_test_C/ $ perf bench mem all # Running mem/memcpy benchmark... # function 'default' (Default memcpy() provided by glibc) # Copying 1MB bytes ... 25.699013 GB/sec
futex 锁压力评估
perf list 打印系统支持的事件 sudo perf list
, 之后在 perf 的其他命令中可以-e 指定事件
这些事件中hw,PMU, cache属于硬件, sw属于内核计数器,Tracepoints event属于ftrace捕捉的内核追踪点
软件evnet 1 2 3 4 5 6 7 8 9 10 11 alignment-faults [Software event] bpf-output [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event]
硬件event 1 2 3 4 5 6 7 8 branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event]
cache evnet 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware cache event] L1-dcache-stores [Hardware cache event] L1-icache-load-misses [Hardware cache event] LLC-load-misses [Hardware cache event] LLC-loads [Hardware cache event] LLC-store-misses [Hardware cache event] LLC-stores [Hardware cache event] branch-load-misses [Hardware cache event] branch-loads [Hardware cache event] dTLB-load-misses [Hardware cache event] dTLB-loads [Hardware cache event] dTLB-store-misses [Hardware cache event] dTLB-stores [Hardware cache event] iTLB-load-misses [Hardware cache event] iTLB-loads [Hardware cache event] node-load-misses [Hardware cache event] node-loads [Hardware cache event] node-store-misses [Hardware cache event] node-stores [Hardware cache event]
PMU event 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 branch-instructions OR cpu/branch-instructions/ [Kernel PMU event] branch-misses OR cpu/branch-misses/ [Kernel PMU event] bus-cycles OR cpu/bus-cycles/ [Kernel PMU event] cache-misses OR cpu/cache-misses/ [Kernel PMU event] cache-references OR cpu/cache-references/ [Kernel PMU event] cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event] cycles-ct OR cpu/cycles-ct/ [Kernel PMU event] cycles-t OR cpu/cycles-t/ [Kernel PMU event] el-abort OR cpu/el-abort/ [Kernel PMU event] el-capacity OR cpu/el-capacity/ [Kernel PMU event] el-commit OR cpu/el-commit/ [Kernel PMU event] el-conflict OR cpu/el-conflict/ [Kernel PMU event] el-start OR cpu/el-start/ [Kernel PMU event] instructions OR cpu/instructions/ [Kernel PMU event] intel_bts// [Kernel PMU event] intel_cqm/llc_occupancy/ [Kernel PMU event] intel_cqm/local_bytes/ [Kernel PMU event] intel_cqm/total_bytes/ [Kernel PMU event] intel_pt// [Kernel PMU event] mem-loads OR cpu/mem-loads/ [Kernel PMU event] mem-stores OR cpu/mem-stores/ [Kernel PMU event] ...
tracepoint evnet 1 2 3 4 5 alarmtimer:alarmtimer_cancel [Tracepoint event] alarmtimer:alarmtimer_fired [Tracepoint event] alarmtimer:alarmtimer_start [Tracepoint event] alarmtimer:alarmtimer_suspend [Tracepoint event] block:block_bio_backmerge [Tracepoint event]
perf top 打印说明 可以实时生成和显示系统中事件计数的函数排名,新版内核可以查看函数对应的汇编指令
第一列 Overhead指事件比例
第二列 DOS (Dynamic Shared Object) 可以为kernel,应用程序perf,动态链接库libc-2.23.so,模块
第三列 DOS类型 [k]表示对象属于内核或者模块, [.]表示对象为二进制程序,为应用程序或者动态链接库
第四列 符号名,没有symbols的显示为地址
ubuntu服务器上执行如下,默认为cycles,即cpu周期占用比例
1 2 3 4 5 6 7 8 9 10 11 12 13 Samples: 2K of event 'cycles', Event count (approx.): 507514780 Overhead Shared Object Symbol 4.11% [kernel] [k] delay_tsc 2.78% [kernel] [k] kstat_irqs 2.51% perf [.] 0x00000000000a4784 2.32% libc-2.23.so [.] _int_malloc 2.03% [kernel] [k] _find_next_bit.part.0 1.84% [kernel] [k] ixgbe_read_reg 1.52% [kernel] [k] vsnprintf 1.31% [kernel] [k] kallsyms_expand_symbol.constprop.1 1.31% [kernel] [k] update_blocked_averages 1.28% [kernel] [k] format_decode 1.25% perf [.] 0x0000000000099837
手机上执行如下
1 2 3 4 5 6 7 8 9 10 11 12 PerfTop: 418 irqs/sec kernel:72.2% exact: 0.0% [4000Hz cycles:ppp], (all, 8 CPUs) --------------------------------------------------------------------------------------------- 14.42% [kernel] [k] 0xffffff8008bcb854 >>地址 9.80% [kernel] [k] 0xffffff8008fc6acc 9.08% [kernel] [k] 0xffffff8008fc6a68 1.90% [kernel] [k] 0xffffff80080820a8 0.94% [kernel] [k] 0xffffff8008301c00 0.84% perf [.] 0x0000000000088250 0.75% [kernel] [k] 0xffffff8008bf9928 0.63% [kernel] [k] 0xffffff8008fa5550 0.55% perf [.] 0x00000000000d87b8
因为权限设置,无法获取symbols,因为无法将地址转化为函数
1 2 3 4 5 6 7 Failed to open [kernel.kallsyms], continuing without symbols Warning: Kernel address maps (/proc/{kallsyms,modules}) are restricted. Check /proc/sys/kernel/kptr_restrict. Kernel samples will not be resolved.
常用命令 参数 说明 -e 指定事件 -d 指定屏幕刷新间隔 -K 不显示内核或者模块对象 -U 不显示上层应用或者动态链接库 -p 对指定的进程和其创建的线程采样
perf stat 对一个命令或者程序进行事件统计
打印说明 1 2 3 4 5 6 7 8 9 10 11 12 13 14 kernel_tool_test_C/ $ sudo perf stat ./test-run ..... Performance counter stats for './test-run': 1.206903 task-clock (msec) # 0.000 CPUs utilized >>CPU利用率 4 context-switches # 0.003 M/sec >>上下文切换 1 cpu-migrations # 0.829 K/sec >>CPU调度迁移 119 page-faults # 0.099 M/sec >>缺页异常 2,804,835 cycles # 2.324 GHz >>处理器周期数 1,627,455 instructions # 0.58 insn per cycle >>指令数 338,915 branches # 280.814 M/sec >>分支指令数 14,436 branch-misses # 4.26% of all branches >>预测错误的分支指令数 12.012281176 seconds time elapsed
常用命令 -e和-p为通用选项
参数 说明 -r 重复指定命令n次,结果显示标准偏差和期望的比值 -d 打印更多信息,包括L1和LLC(L3,三级缓存,数据共享)数据缓存 -d -d 打印更多信息, 包括dTLB和iTLB
perf record 运行命令并将其事件记录在per.data中,之后通过perf report来展示。一般需要加上-g参数来进行函数调用显示
1 2 3 4 5 :kernel_tool_test_C/ $ sudo perf record -g ls [11:39:23] Makefile perf.data perf.data.old README.md src test-run [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.013 MB perf.data (1 samples) ]
perf report 读取per.data文件并给出分析显示,可以看出函数的调用逻辑顺序
perf probe perf probe可针对内核和ELF进行动态 tracepoint的追踪,更加详细见 man perf-probe
增加动态追踪点 1 2 kernel_tool_test_C/ $ sudo perf probe -x test-run test_blk Added new event: probe_test:test_blk (on test_blk in /work/chejian/test/kernel_tool_test_C/test-run)
抓取动态追踪点 1 sudo perf record -e probe_test:test_blk -gaR ./test-run
删除动态追踪点 1 kernel_tool_test_C/ $ sudo perf probe -d test_blk Removed event: probe_test:test_blk
列出动态追踪点 1 2 3 kernel_tool_test_C/ $ sudo perf probe -l probe:schedule (on schedule) >>内核 probe_test:test_blk (on test_blk@src/modules/strace/strace.c in /work/chejian/test/kernel_tool_test_C/test-run) >>ELF程序
perf trace 同strace,包含抓取syscall的功能
1 2 3 4 5 6 7 8 9 10 sudo perf trace -o output.txt ./test-run 11993.175 ( 0.003 ms): test-run/267526 rt_sigaction(sig: CHLD, act: 0x7ffc7d5ad570, oact: 0x7ffc7d5ad610, sigsetsize: 8) = 0 11993.180 ( 0.012 ms): test-run/267526 open(filename: 0x6031c0, flags: RDWR ) = 4 11993.195 ( 0.005 ms): test-run/267526 stat(filename: 0x6031c0, statbuf: 0x7ffc7d5ad828 ) = 0 11993.203 ( 0.023 ms): test-run/267526 mmap(len: 32, prot: READ|WRITE, flags: SHARED, fd: 4 ) = 0x7fdbf0da0000 >>共享内存映射地址, fd为 4 11995.233 ( 0.002 ms): test-run/267526 close(fd: 4 ) = 0 11995.239 ( 0.009 ms): test-run/267526 munmap(addr: 0x7fdbf0da0000, len: 32 ) = 0
-s
和-S
选项和strace 下的 -c -w
功能类似,都是对调用次数和消耗时间的统计
1 2 3 4 5 6 7 8 9 10 11 12 Summary of events: test-run (269586), 8 events, 3.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ nanosleep 1 4995.994 4995.994 4995.994 4995.994 0.00% write 2 0.020 0.008 0.010 0.012 18.33% clone 1 0.000 0.000 0.000 0.000 0.00%
但是发现write时的字符串参数没有具体打印是啥,只是显示了一个地址,因此strace的write参数可以更好的确定源代码位置
perf kmem 内核内存的分配统计,一般为slab和page存取两类,操作为record和stat搭配使用
slab slab的解析分为两部分,一部分为--caller
导致的Callsite部分,一部分为–alloc导致的Alloc Ptr地址打印部分,具体的打印解释如下:
Callsite,表示谁调用了kmalloc来进行slab申请 Total_alloc/Per 表示调用者一共分配的内存和每次平均分配的内存 Total_req/Per 表示调用者一共申请的内存和每次平均申请的内存 Hit 表示调用者调用kmalloc的次数 Ping-pong 多cpu架构kfree和kmalloc的共享内存对应的cpu不同,这看做是一次ping-pong,理想是越小越好。 Frag 碎片率,为(实际分配-申请)/实际分配 Alloc Ptr 为申请的地址 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 kernel_tool_test_C/ $ sudo perf kmem record --slab ls Makefile output.svg output.txt output.txt.old perf.data perf.data.old README.md src test-run [ perf record: Woken up 0 times to write data ] [ perf record: Captured and wrote 2.225 MB perf.data (12859 samples) ] kernel_tool_test_C/ $ sudo perf kmem --alloc --caller --slab stat --------------------------------------------------------------------------------------------------------- Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag --------------------------------------------------------------------------------------------------------- proc_reg_open+32 | 64/64 | 40/40 | 1 | 0 | 37.500% apparmor_file_alloc_security+5c | 608/32 | 456/24 | 19 | 0 | 25.000% ext4_readdir+91e | 64/64 | 48/48 | 1 | 0 | 25.000% ext4_htree_store_dirent+3e | 800/66 | 698/58 | 12 | 0 | 12.750% .... --------------------------------------------------------------------------------------------------------- Alloc Ptr | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag --------------------------------------------------------------------------------------------------------- 0xffff995a4bb4c000 | 1024/1024 | 640/640 | 1 | 0 | 37.500% 0xffff995b83910f80 | 128/64 | 88/44 | 2 | 0 | 31.250% 0xffff995c561c3ec0 | 96/32 | 72/24 | 3 | 0 | 25.000%
page(to do) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 kernel_tool_test_C/ $ sudo perf kmem record --page --live ls # # GFP flags # --------- # 002152c0: I|F|NWR|NR|C|NMA|NT: __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK # 01000200: NW|NWR: GFP_NOWAIT|__GFP_NOWARN # 014000c0: K: GFP_KERNEL # 014200ca: HUM: GFP_HIGHUSER_MOVABLE # 014280ca: HUM|Z: GFP_HIGHUSER_MOVABLE|__GFP_ZERO # 015080c0: KAC|Z: GFP_KERNEL_ACCOUNT|__GFP_ZERO # 015200c2: HU|AC: GFP_HIGHUSER|__GFP_ACCOUNT # 017080c0: KAC|Z|NT: GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK # 01c200ca: HUM|WR: GFP_HIGHUSER_MOVABLE|__GFP_WRITE --------------------------------------------------------------------------------------------------------- Total alloc (KB) | Hits | Order | Mig.type | GFP flags | Callsite --------------------------------------------------------------------------------------------------------- 120 | 30 | 0 | UNMOVABL | KAC|Z|NT | pte_alloc_one 84 | 21 | 0 | RECLAIM | HUM|Z | handle_mm_fault 40 | 10 | 0 | RECLAIM | HUM | wp_page_copy 32 | 8 | 0 | RECLAIM | HUM | handle_mm_fault 20 | 5 | 0 | UNMOVABL | NW|NWR | tlb_next_batch.isra.42 20 | 5 | 0 | UNMOVABL | KAC|Z | __pmd_alloc 12 | 3 | 0 | UNMOVABL | K | __pollwait 8 | 2 | 0 | UNMOVABL | KAC|Z | __pud_alloc 8 | 1 | 1 | UNMOVABL | I|F|NWR|NR|C|NMA|NT | new_slab 4 | 1 | 0 | UNMOVABL | KAC|Z|NT | pgd_alloc 4 | 1 | 0 | RECLAIM | HUM|WR | __page_cache_alloc 4 | 1 | 0 | RECLAIM | HUM|Z | wp_page_copy 4 | 1 | 0 | UNMOVABL | HU|AC | pipe_write --------------------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------- PFN | Total alloc (KB) | Hits | Order | Mig.type | GFP flags | Callsite --------------------------------------------------------------------------------------------------------- 23073150 | 8 | 2 | 0 | UNMOVABL | KAC|Z | __pmd_alloc 28346272 | 8 | 1 | 1 | UNMOVABL | I|F|NWR|NR|C|NMA|NT | new_slab 17327464 | 4 | 1 | 0 | UNMOVABL | KAC|Z|NT | pte_alloc_one
perf lock(todo) 内核锁事件分析
使用:perf lock {record|report|script|info}
条件: 内核需要使能CONFIG_LOCKDEP和CONFIG_LOCK_STAT
1 sudo perf lock record ls tracepoint lock:lock_acquire is not enabled. Are CONFIG_LOCKDEP and CONFIG_LOCK_STAT enabled?
perf mem 内存访问分析,内存操作分为两类,为load和store,对应取和存,对应的事件为cpu/mem-loads,ldlat=30/P
和cpu/mem-stores/P
1 2 3 4 5 6 7 8 9 kernel_tool_test_C/ $ sudo perf mem record ./test-run Samples: 22 of event 'cpu/mem-loads,ldlat=30/P', Event count (approx.): 4449 Overhead Samples Local Weight Memory access Symbol Shared Object Data Symbol Data Object Snoop TLB access Locked 12.00% 1 534 LFB hit [k] filemap_map_pages [kernel.kallsyms] [k] 0xffffd75181feaaa0 [kernel.kallsyms] None L1 or L2 hit No 11.80% 1 525 LFB hit [k] filemap_map_pages [kernel.kallsyms] [k] 0xffffd751809b0888 [kernel.kallsyms] None L1 or L2 hit No 9.78% 1 435 L3 miss [k] tty_write [kernel.kallsyms] [k] 0xffff995caac199c0 [kernel.kallsyms] N/A L1 or L2 hit No 9.44% 1 420 LFB hit [k] unmap_page_range [kernel.kallsyms] [k] 0xffffd7518094fca0 [kernel.kallsyms] None L1 or L2 hit No 9.42% 1 419 L1 hit [k] page_add_file_rmap [kernel.kallsyms] [k] 0xffffd75181fe0d58 [kernel.kallsyms] None L1 or L2 hit Yes
perf sched 调度器分析
使用:perf sched {record|latency|map|replay|script}
抓取调度相关事件 1 kernel_tool_test_C/ $ sudo perf sched record ./test-run
查看调度延迟 1 2 3 4 5 kernel_tool_test_C/ $sudo perf sched latency -s runtime ---------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | ---------------------------------------------------------------------------------------------------- test-run:(3) | 1.436 ms | 7 | avg: 0.020 ms | max: 0.068 ms | max at: 9477856.501406 s
各列的含义如下:
Task 进程名 Runtime ms 运行时间 Switches 进程切换次数 Average delay 调度平均延迟 Maximum delay ms 调度最大延迟,是影响较大的因素 Maximum delay at 调度延迟发生时间点 打印抓取的调度事件 各列的含义为:进程名, 进程号, 所在的cpu,时间戳, 调度事件
1 2 3 kernel_tool_test_C/ $ sudo perf sched script swapper 0 [005] 9477854.475353: sched:sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_sched next_pid=8 next_prio=120 rcu_sched 8 [005] 9477854.475357: sched:sched_switch: prev_comm=rcu_sched prev_pid=8 prev_prio=120 prev_state=S ==> next_comm=swapper/5 next_pid=0 next_prio=120
复现调度场景 重放record文件中的调度场景,但是当进程有sleep函数时重现就会卡主
1 kernel_tool_test_C/ $ sudo perf sched replay >>复现调度打印卡住,暂无发现实践意义
调度迁移(上下文切换) perf sched map
可以打印任务在cpu之间直观的调度状态
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 kernel_tool_test_C/ $ sudo perf sched map -f D0 A0 H0 G0 *R0 L0 . D0 A0 H0 G0 R0 L0 . *S0 D0 A0 H0 G0 R0 L0 *T0 S0 D0 A0 H0 G0 R0 L0 T0 *.4 D0 A0 H0 *P0 R0 L0 T0 . D0 A0 H0 P0 R0 L0 T0 *S0 D0 A0 H0 P0 R0 L0 *. S0 D0 *U0 H0 P0 R0 L0 . S0 *Q0 U0 H0 P0 R0 L0 . S0 Q0 U0 H0 *G0 R0 L0 . S0 Q0 U0 *. G0 R0 L0 . S0 Q0 U0 . G0 R0 *V0 . S0 Q0 *A0 . G0 R0 V0 . S0 Q0 A0 . G0 R0 V0 *W0 S0
*标示调度事件发生所在的cpu,.表示CPU处于IDEL状态 ,一个正常的map可以显示一个任务的调度会尽力给到闲置的cpu,而不是忙碌的cpu
perf evlist 显示record文件中的events
1 2 3 4 5 6 7 8 9 kernel_tool_test_C/ $ sudo perf evlist sched:sched_switch sched:sched_stat_wait sched:sched_stat_sleep sched:sched_stat_iowait sched:sched_stat_runtime sched:sched_process_fork sched:sched_wakeup sched:sched_wakeup_new sched:sched_migrate_task
perf timechart 一个通过抓取事件转化为直观的视图svg文件来显示调度器,CPU, IO等事件的工具
1 2 kernel_tool_test_C/ $ sudo perf timechart record -I -T ./test-run kernel_tool_test_C/ $ sudo perf timechart -p test-run Written 12.0 seconds of trace to output.svg. >>需要将文件拖入到浏览器