ION内存泄漏导致稳定性排查 这是一次ion接口操作不规范引起内存泄漏,导致binder通信失败,最终定时器超时触发crash的稳定性问题排查
现象 从tombstones来看,是aps interface对aps算法线程设置了超时定时器,线程超时导致的进程自杀 从视频来看,夜景连续拍照会导致视频预览长时间卡死,最终退出
1 2 #00 pc 00000000000 c76b8 /apex/com.android.runtime/lib64/bionic/libc.so (tgkill+8 ) (BuildId: f24f65acf653c409ca4332aced817a71) #01 pc 00000000000 c4d58 /odm/lib64/libAlgoProcess.so (android::AlgoInterface::thirdPartyAlgoTimerFunc(sigval)+512 ) (BuildId: ec87928e98c05a86c6d90024c9541a5e)
Log分析 既然aps算法线程超时,查看从定时器开始到超时期间,算法做了什么超时。超时时间定时器设置为20s-30s,因此不太可能是cpu分配不到,更怀疑是阻塞型I/O或者线程挂起,内核态陷入。这里注意如果线程被挂起,因为定时器属于中断,也可以使得线程开始执行超时句柄。如果线程陷入内核态,也是可以中断触发执行的 。
从log 线程来看:
1 2 3 Line 161635 : 01 -01 17 :14 :18.052863 15936 15976 I APS_CORE: [INFO][ ALGO_INTERFACE ] AlgoInterface.cpp: 796 apsTimerCreate() timer create ret: 0 , type: 1 , timeout: 30 s ... Line 214095 : 01 -01 17 :14 :22.359111 15936 15976 I APS_CORE: [INFO][ ALGO_ISPDEVICE ] ISPDeviceMgr.cpp: 716 algoProcess() requestNo: 9670 , timestamp: 14982536252000 , userCropRegion: <8 , 2 , 4060 , 3056 >,
从这里开始,之后就再无打印,怀疑陷入内核态,检查系统调用
1 2 3 16 :43 :37.759339 ioctl(102 , BINDER_WRITE_READ, 0x7de92d0c00 ) = -1 EINTR (Interrupted system call) <26.456231 >16 :44 :04.215725 --- SIGRT_1 {si_signo=SIGRT_1, si_code=SI_TKILL, si_pid=32270 , si_uid=10123 } ---
明确为ioctl通过binder通信陷入内核态,直到被定时器杀死
Binder分析 这种binder通信陷入内核态,属于被binder通信挂起,等待对端处理完毕,再唤醒的,因此怀疑对端处理超时,检查对端线程为哪一个?需要通过debugfs下的binder接口去分析,在连续拍照卡死之后,进行binder通信失败分析
1 2 3 4 RE8DDCL1:/ # cat /d/binder/failed_transaction_log ./transactions:7 : thread 15976 : l 10 need_return 0 tr 0 ./transactions:8 : outgoing transaction 3727983 : 0000000000000000 from 15936 :15976 to 14964 :16752 code 8 flags 10 pri 0 :120 r1 ./transactions:21 : incoming transaction 3727983 : 0000000000000000 from 15936 :15976 to 14964 :16752 code 8 flags 10 pri 0 :120 r1 node 3228233 size 572 :104 data 0000000000000000
从上可以看出,和ap线程15976进行通信的hal线程为16752。检查该线程的行为:
1 2 3 Line 144686 : <5 >[14990.319328 ][T416752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][cam_mem_ioctl] CAM_MEM_ION_MAP_PA: map pa failed, memID(663 ) ... Line 147028 : <5 >[14990.530975 ][T716752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][cam_mem_ioctl] CAM_MEM_ION_MAP_PA: map pa failed, memID(663 )
该线程在尝试CAM_MEM_ION_MAP_PA十次之后导致进程自杀,如下
1 2 3 4 5 6 7 8 9 while (ioctl(this ->m_cam_drv_Fd, CAM_MEM_ION_MAP_PA, &MapTable) < 0 ){ ... tryCnt++; if (tryCnt < 10 ) usleep(20 * 1000 ); else AEE_ASSERT("\nCRDISPATCH_KEY:ioctl (CAM_MEM_ION_MAP_PA) fail!" ); }
目前可以确认为CAM_MEM_ION_MAP_PA导致失败。
ION分析 ioctl进行CAM_MEM_ION_MAP_PA通信失败,检查kernel log
1 2 3 4 5 6 7 8 9 10 11 12 13 <6 >[14990.389533 ][T516752] HwBinder:14964 _: [name:iova&][iommu_debug] __alloc_and_insert_iova_range fail, size:0x1800 ,limit:0x300000 , new :0x2f2200 , start:0x200000 <6 >[14990.389592 ][T516752] HwBinder:14964 _: [name:iova&][iommu_debug] __alloc_and_insert_iova_range fail, size:0x1800 ,limit:0x300000 , new :0x2f2200 , start:0x200000 <6 >[14990.389602 ][T516752] HwBinder:14964 _: mtk_iommu: debug [name:iommu_debug&]mtk_iova_dbg_alloc fail! dev:cam_mem, size:0x1800000 <6 >[14990.389760 ][T516752] HwBinder:14964 _: mtk_iommu: debug [name:iommu_debug&]iommu iova alloc total:(449 /3645120 KB), dom:(410 /3415980 KB,0 ,2 ) top 10 user: <6 >[14990.389767 ][T516752] HwBinder:14964 _: mtk_iommu: debug [name:iommu_debug&]tab_id dom_id count size dev <6 >[14990.389775 ][T516752] HwBinder:14964 _: mtk_iommu: debug [name:iommu_debug&] 0 2 410 3415980 KB cam_mem <6 >[14990.389787 ][T516752] HwBinder:14964 _: dma_heap: system [name:system_heap&]mtk_mm_heap_map_dma_buf map fail tab:0 , dom:2 , dev:cam_mem <5 >[14990.389795 ][T516752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][cam_mem_mmu_get_dma_buffer] dma_buf_map_attachment failed! memID(663 ) size(0x1800000 ) <5 >[14990.389809 ][T516752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][cam_mem_ioctl] CAM_MEM_ION_MAP_PA: map pa failed, memID(663 ) <5 >[14990.389837 ][T516752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][dumpAllBufferList] #000 14 0x2ffb90000 0x0006000 1 1536 x16-STA_BYTE-P2_0 14392.370720010 ...其余的ion 散链表区域 <5 >[14990.392135 ][T416752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][dumpAllBufferList] #245 364 0x2cda00000 0x024c000 1 1440 x1080-NV21-halbuf:b4 14990.345927354 <5 >[14990.392179 ][T416752] HwBinder:14964 _: [name:camera_mem&][CAM_MEM][dumpAllBufferList] Total Size= 0x1627b000 (362988 KBytes)
发现统计不一致,iommu已经统计为3.6G左右,而dump统计的散链表统计只有362M,查看mtk_iommu驱动实现。
目前已经明确是ION出了问题,因此检查系统调用是否存在大内存的申请未释放。
系统调用检查确认,发现确实有25M ion内存的mmap,但是每次拍照完成后,在下一次拍照触发时都会unmap掉,正常逻辑是不太存在内存泄漏的。
通过申请的虚拟地址在log中匹配,发现如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf8c00000/0x2d9400000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf7400000/0x2d5400000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf5c00000/0x2d2000000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf4400000/0x2cec00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf1d0c000/0x2cae00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cef618000/0x2c6c00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ceb927000/0x29e200000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ccbd58000/0x29ca00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf5c00000/0x2d2000000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf8c00000/0x2d9400000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf7400000/0x2d5400000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf4400000/0x2cec00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf1d0c000/0x2cae00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cef618000/0x2c6c00000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ceb927000/0x29e200000) to output port MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ccbd58000/0x297400000) to output port
这是两次拍照的log,发现最后两帧或者最后一帧对应PA一直在变,而且在持续变小,因此怀疑ION存在泄漏的可能性更大。
ION Debug 拍照前后,对比dma_buf,确实存在多出来的25M*n的部分,可以通过如下确认
1 adb pull /proc/dma_heap/heaps/mtk_mm
查看iommu驱动,发现接口供debug
1 2 3 4 5 6 7 8 9 10 4162 4163 DEFINE_PROC_FOPS_RO(mtk_iommu_help_fops);4164 4165 DEFINE_PROC_FOPS_RO(mtk_iommu_dump_fops);4166 4167 DEFINE_PROC_FOPS_RO(mtk_iommu_iova_alloc_fops);4168 4169 DEFINE_PROC_FOPS_RO(mtk_iommu_iova_map_fops);
iommu pa分布 查看help并进行使能接口,并每次拍照后进行iova_map查看,发现25M的pa 个数持续增加,且和实际拍照P2A log打印的PA地址一致
1 2 3 4 RE8DDCL1:/proc/iommu_debug # cat iova_map | grep -n "0x1800000" 43 :0 0x297400000 0x1800000 48468.066908 55 :0 0x29ca00000 0x1800000 48446.930479 57 :0 0x29e200000 0x1800000 48446.113458
iommu行为 同步检查ION申请行为,发现0x29ca00000申请未释放
1 2 3 4 5 adb pull /proc/iommu_debug/iommu_dump Line 7416 : alloc 0 0x29ca00000 0x1800000 0x29e1fffff 48446.929254 cam_mem Line 7417 : map 0 0x29ca00000 0x1800000 0x29e1fffff 48446.930479 Line 7418 : sync 1 0x29ca00000 0x1800000 0x29e1fffff 48446.930494
iommu相关信息 检查0x29ca00000这块ion的相关信息
1 2 3 4 5 6 adb pull /proc/dma_heap/heaps/mtk_mm inode:5771744 size(Byte):25165824 count:1 cache_sg:1 exp :mtk_mm name:25165824 x-BLOB-halbuf:b5 buf_priv: uncached:0 alloc_pid:784 (allocator@4.0 -s)tid:784 (allocator@4.0 -s) alloc_time:48446499954u s buf_priv: tab:0 dom:2 map :1 iova:0x29ca00000 attr:0x20 dir:0 dev:cam_mem pid:29202 (camerahalserver) inode:5771744 size:25165824 fd_cnt:1 mmap_cnt:0 fd_list:400
发现此时hal依旧持有fd 400,奇怪,通过fdinfo确定确实持有
1 2 3 4 5 6 7 8 9 10 RE8DDCL1:/proc/29202 /fdinfo # cat 400 pos: 0 flags: 02 mnt_id: 8 ino: 5771744 size: 25165824 count: 1 exp_name: mtk_mm name: 25165824 x-BLOB-halbuf:b5
原因确认 系统调用检查,发现mmap的fd时447,之后被 munmap释放,为什么这块ion被fd 400持有,检查fd 400的来源发现存在dup,且没有被close
1 2 30714 12 :12 :34.791556 dup(447 <unfinished ...>30714 12 :12 :34.791640 <... dup resumed> ) = 400 <0.000076 >
根据时间log跟踪代码发现dpframework 存在dup调用
1 2 3 4 5 6 7 8 9 10 11 12 实现如下: bool function (int32_t FD) { m_shareFD = dup(FD); if (-1 == m_shareFD) { . .. return false ; } return true ; }
确认问题根源为:mtk中对共享内存中的fd 在dup调用后未close,导致共享内存未正确释放,为mtk平台原生问题