Troubleshooting a Stability Issue Caused by an ION Memory Leak

This is a walkthrough of a stability issue in which non-standard use of the ION interfaces caused a memory leak; the leak then made binder communication fail, and a watchdog timer eventually expired and crashed the process.

Symptoms

From the tombstone: the APS interface had armed a timeout timer on the APS algorithm thread, and the thread timing out caused the process to kill itself.
From the screen recording: continuous night-mode captures leave the video preview frozen for a long time, and the app finally exits.

#00 pc 00000000000c76b8  /apex/com.android.runtime/lib64/bionic/libc.so (tgkill+8) (BuildId: f24f65acf653c409ca4332aced817a71)
#01 pc 00000000000c4d58 /odm/lib64/libAlgoProcess.so (android::AlgoInterface::thirdPartyAlgoTimerFunc(sigval)+512) (BuildId: ec87928e98c05a86c6d90024c9541a5e)

Log Analysis

Since the APS algorithm thread timed out, the first question is what the algorithm was doing between the timer being armed and the timeout firing. The timeout is set to 20–30 s, so failing to get CPU time is unlikely; blocking I/O, a suspended thread, or being stuck in a kernel-mode call is far more plausible. Note that even if the thread is suspended, the timer is delivered as an interrupt/signal, so the timeout handler can still run; the same holds if the thread is trapped in the kernel.
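
For context, a watchdog of this shape can be built with a POSIX timer whose expiry callback runs on its own thread and then signals the watched thread. The sketch below is an assumption about the pattern, not the actual AlgoInterface code, but it shows why the handler fires even while the algorithm thread is blocked in the kernel:

// Minimal watchdog sketch (hypothetical; not the real apsTimerCreate()).
// A SIGEV_THREAD expiry callback runs on its own thread, so it fires even
// while the watched thread is blocked inside a syscall; tgkill() then
// delivers a signal that interrupts that syscall (see the EINTR in strace).
#include <csignal>
#include <ctime>
#include <sys/syscall.h>
#include <unistd.h>

static void watchdogExpired(union sigval sv) {
    pid_t algoTid = static_cast<pid_t>(sv.sival_int);
    // Signal the stuck algorithm thread; the tombstone shows
    // thirdPartyAlgoTimerFunc() ending in tgkill() the same way.
    syscall(SYS_tgkill, getpid(), algoTid, SIGRTMIN + 1);
}

timer_t armWatchdog(pid_t algoTid, int timeoutSec) {
    struct sigevent sev = {};
    sev.sigev_notify = SIGEV_THREAD;            // callback on a separate thread
    sev.sigev_notify_function = watchdogExpired;
    sev.sigev_value.sival_int = algoTid;

    timer_t id;
    timer_create(CLOCK_MONOTONIC, &sev, &id);   // cf. the "timer create ret: 0" log

    struct itimerspec its = {};
    its.it_value.tv_sec = timeoutSec;           // one-shot, e.g. 30 s
    timer_settime(id, 0, &its, nullptr);
    return id;
}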

From the thread's log:

Line 161635: 01-01 17:14:18.052863 15936 15976 I APS_CORE: [INFO][ ALGO_INTERFACE ] AlgoInterface.cpp: 796  apsTimerCreate()  timer create ret: 0, type: 1, timeout: 30s
...
Line 214095: 01-01 17:14:22.359111 15936 15976 I APS_CORE: [INFO][ ALGO_ISPDEVICE ] ISPDeviceMgr.cpp: 716 algoProcess() requestNo: 9670, timestamp: 14982536252000, userCropRegion: <8, 2, 4060, 3056>,

From this point on there is no further output from the thread, so the suspicion is that it is stuck in the kernel; check its system calls:

// Timestamps differ from the tombstone because this trace comes from a separate reproduction
16:43:37.759339 ioctl(102, BINDER_WRITE_READ, 0x7de92d0c00) = -1 EINTR (Interrupted system call) <26.456231>
16:44:04.215725 --- SIGRT_1 {si_signo=SIGRT_1, si_code=SI_TKILL, si_pid=32270, si_uid=10123} ---

This makes it clear: the thread entered the kernel through a binder ioctl and stayed there until the timer's signal killed it.
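
This also matches how libbinder drives the binder fd: its talk-with-driver loop retries on EINTR, so a thread can sit in BINDER_WRITE_READ indefinitely as long as the remote end never replies. A simplified sketch of that pattern (not the verbatim IPCThreadState code; it assumes the kernel uapi binder header is available):

#include <cerrno>
#include <sys/ioctl.h>
#include <linux/android/binder.h>  // BINDER_WRITE_READ, struct binder_write_read

int talkWithDriver(int binderFd, struct binder_write_read* bwr) {
    int err;
    do {
        // Blocks in the kernel until the remote side replies; a signal
        // (such as the watchdog's SIGRT_1) only pops it out with EINTR,
        // after which the loop immediately re-enters the wait.
        if (ioctl(binderFd, BINDER_WRITE_READ, bwr) >= 0)
            err = 0;
        else
            err = -errno;
    } while (err == -EINTR);
    return err;
}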

Binder Analysis

A binder call that traps into the kernel like this is suspended until the remote end finishes processing and wakes it up, so the suspicion shifts to the remote end taking too long. Which thread is the remote end? That can be worked out from the binder interface under debugfs: after reproducing the continuous-capture freeze, analyze the failed binder transactions:

RE8DDCL1:/ # cat /d/binder/failed_transaction_log
./transactions:7: thread 15976: l 10 need_return 0 tr 0
./transactions:8: outgoing transaction 3727983: 0000000000000000 from 15936:15976 to 14964:16752 code 8 flags 10 pri 0:120 r1
./transactions:21: incoming transaction 3727983: 0000000000000000 from 15936:15976 to 14964:16752 code 8 flags 10 pri 0:120 r1 node 3228233 size 572:104 data 0000000000000000

From this we can see that the HAL thread communicating with AP thread 15976 is 16752. Check what that thread was doing:

Line 144686: <5>[14990.319328][T416752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][cam_mem_ioctl] CAM_MEM_ION_MAP_PA: map pa failed, memID(663)
...
Line 147028: <5>[14990.530975][T716752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][cam_mem_ioctl] CAM_MEM_ION_MAP_PA: map pa failed, memID(663)

That thread retried CAM_MEM_ION_MAP_PA ten times and then made the process kill itself, per the code below:

while (ioctl(this->m_cam_drv_Fd, CAM_MEM_ION_MAP_PA, &MapTable) < 0)
{
    ...
    tryCnt++;
    if (tryCnt < 10)
        usleep(20 * 1000);  // back off 20 ms and retry
    else
        AEE_ASSERT("\nCRDISPATCH_KEY:ioctl (CAM_MEM_ION_MAP_PA) fail!");  // give up: raise an AEE exception, killing the process
}

So the failure is confirmed to be CAM_MEM_ION_MAP_PA; the ten retries at 20 ms intervals also line up with the roughly 200 ms spread of the map-failure logs above.

ION Analysis

The CAM_MEM_ION_MAP_PA ioctl is failing; check the kernel log:

13
<6>[14990.389533][T516752] HwBinder:14964_: [name:iova&][iommu_debug] __alloc_and_insert_iova_range fail, size:0x1800,limit:0x300000, new:0x2f2200, start:0x200000
<6>[14990.389592][T516752] HwBinder:14964_: [name:iova&][iommu_debug] __alloc_and_insert_iova_range fail, size:0x1800,limit:0x300000, new:0x2f2200, start:0x200000
<6>[14990.389602][T516752] HwBinder:14964_: mtk_iommu: debug [name:iommu_debug&]mtk_iova_dbg_alloc fail! dev:cam_mem, size:0x1800000
<6>[14990.389760][T516752] HwBinder:14964_: mtk_iommu: debug [name:iommu_debug&]iommu iova alloc total:(449/3645120KB), dom:(410/3415980KB,0,2) top 10 user:
<6>[14990.389767][T516752] HwBinder:14964_: mtk_iommu: debug [name:iommu_debug&]tab_id dom_id count size dev
<6>[14990.389775][T516752] HwBinder:14964_: mtk_iommu: debug [name:iommu_debug&] 0 2 410 3415980KB cam_mem
<6>[14990.389787][T516752] HwBinder:14964_: dma_heap: system [name:system_heap&]mtk_mm_heap_map_dma_buf map fail tab:0, dom:2, dev:cam_mem
<5>[14990.389795][T516752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][cam_mem_mmu_get_dma_buffer] dma_buf_map_attachment failed! memID(663) size(0x1800000)
<5>[14990.389809][T516752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][cam_mem_ioctl] CAM_MEM_ION_MAP_PA: map pa failed, memID(663)
<5>[14990.389837][T516752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][dumpAllBufferList] #000 14 0x2ffb90000 0x0006000 1 1536x16-STA_BYTE-P2_0 14392.370720010
... (the remaining ION buffer-list entries are elided)
<5>[14990.392135][T416752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][dumpAllBufferList] #245 364 0x2cda00000 0x024c000 1 1440x1080-NV21-halbuf:b4 14990.345927354
<5>[14990.392179][T416752] HwBinder:14964_: [name:camera_mem&][CAM_MEM][dumpAllBufferList] Total Size= 0x1627b000 (362988 KBytes)

The statistics do not agree: the iommu side already accounts for about 3.6 GB, while the dumped buffer list totals only about 362 MB. Note that the __alloc_and_insert_iova_range failure reports sizes in page frames: size 0x1800 pages is the 25 MB buffer, and start 0x200000 / limit 0x300000 bound roughly a 4 GB iova window for cam_mem, of which ~3.4 GB is already consumed. Time to look at the mtk_iommu driver implementation.

ION is now clearly the problem, so check the system calls for large allocations that are never freed.

The syscall trace does show a 25 MB ION mmap per capture, but each mapping is munmap'ed when the next capture is triggered, so on the face of it this path should not leak.
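
For reference, the expected per-capture lifecycle looks roughly like the sketch below (illustrative helper names; 25 MB matches the allocation size in the trace). The detail that matters later: munmap alone does not release a dma-buf, every fd referencing it must be closed too.

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

constexpr size_t kCaptureBufSize = 25 * 1024 * 1024;  // 25 MB per capture

void* mapCaptureBuffer(int bufFd) {
    // Map the ION/dma-buf for this capture.
    return mmap(nullptr, kCaptureBufSize, PROT_READ | PROT_WRITE,
                MAP_SHARED, bufFd, 0);
}

void releaseCaptureBuffer(void* va, int bufFd) {
    munmap(va, kCaptureBufSize);  // drops the mapping only
    close(bufFd);                 // the buffer is freed only once *every*
                                  // fd referencing it has been closed
}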

Matching the mapped virtual addresses against the log turns up the following:

MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf8c00000/0x2d9400000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf7400000/0x2d5400000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf5c00000/0x2d2000000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf4400000/0x2cec00000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf1d0c000/0x2cae00000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cef618000/0x2c6c00000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ceb927000/0x29e200000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ccbd58000/0x29ca00000) to output port

MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf5c00000/0x2d2000000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf8c00000/0x2d9400000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf7400000/0x2d5400000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf4400000/0x2cec00000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cf1d0c000/0x2cae00000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7cef618000/0x2c6c00000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ceb927000/0x29e200000) to output port
MtkCam/CapturePipe/P2ANode: [run]add RawImg Fmt(0x2214) VA/PA(0x7ccbd58000/0x297400000) to output port

These are the logs of two captures: the PA of the last frame or two keeps changing and keeps drifting downward, which is what you would expect if old buffers are never freed and each new allocation lands lower in the iova window, so an ION leak looks even more likely.

ION Debug

Comparing dma_buf before and after a capture, there really is an extra 25 MB × n, which can be confirmed with:

adb pull /proc/dma_heap/heaps/mtk_mm

Reading the iommu driver, it turns out to expose interfaces for debugging:

// file: kernel-5.10/drivers/misc/mediatek/iommu/iommu_debug.c
/* adb shell cat /proc/iommu_debug/help */
DEFINE_PROC_FOPS_RO(mtk_iommu_help_fops);
/* adb shell cat /proc/iommu_debug/iommu_dump */
DEFINE_PROC_FOPS_RO(mtk_iommu_dump_fops);
/* adb shell cat /proc/iommu_debug/iova_alloc */
DEFINE_PROC_FOPS_RO(mtk_iommu_iova_alloc_fops);
/* adb shell cat /proc/iommu_debug/iova_map */
DEFINE_PROC_FOPS_RO(mtk_iommu_iova_map_fops);

iommu PA distribution

Read help, enable the relevant switches, and dump iova_map after each capture: the number of 25 MB PAs keeps growing, and they match the PA addresses printed in the P2ANode capture log:

RE8DDCL1:/proc/iommu_debug # cat iova_map | grep -n "0x1800000"
43:0 0x297400000 0x1800000 48468.066908
55:0 0x29ca00000 0x1800000 48446.930479
57:0 0x29e200000 0x1800000 48446.113458

iommu behavior

Cross-checking the ION allocation records shows that 0x29ca00000 was allocated and mapped but never freed:

adb pull /proc/iommu_debug/iommu_dump

Line 7416: alloc 0 0x29ca00000 0x1800000 0x29e1fffff 48446.929254 cam_mem
Line 7417: map 0 0x29ca00000 0x1800000 0x29e1fffff 48446.930479
Line 7418: sync 1 0x29ca00000 0x1800000 0x29e1fffff 48446.930494

iommu buffer details

Check the details of the ION buffer at 0x29ca00000:

adb pull /proc/dma_heap/heaps/mtk_mm

inode:5771744 size(Byte):25165824 count:1 cache_sg:1 exp:mtk_mm name:25165824x-BLOB-halbuf:b5
buf_priv: uncached:0 alloc_pid:784(allocator@4.0-s)tid:784(allocator@4.0-s) alloc_time:48446499954us
buf_priv: tab:0 dom:2 map:1 iova:0x29ca00000 attr:0x20 dir:0 dev:cam_mem
pid:29202(camerahalserver) inode:5771744 size:25165824 fd_cnt:1 mmap_cnt:0 fd_list:400

Strangely, the HAL still holds fd 400 on this buffer (fd_cnt:1 with mmap_cnt:0, i.e. no mapping left but one fd still open); fdinfo confirms it:

// Not the same capture run, so don't read anything into the differing pid
RE8DDCL1:/proc/29202/fdinfo # cat 400
pos: 0
flags: 02
mnt_id: 8
ino: 5771744
size: 25165824
count: 1
exp_name: mtk_mm
name: 25165824x-BLOB-halbuf:b5
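
To spot this kind of leak without eyeballing each fd, the dma-buf sizes a process holds can be totalled from fdinfo. A rough sketch (hypothetical helper; it relies on the exp_name field that only dma-buf fdinfo entries carry, as in the dump above):

#include <dirent.h>
#include <fstream>
#include <string>

// Sum the sizes of all dma-buf fds held by a process: the same data the
// manual `cat /proc/<pid>/fdinfo/400` check above shows one fd at a time.
long long dmabufBytesHeld(int pid) {
    std::string dir = "/proc/" + std::to_string(pid) + "/fdinfo";
    long long total = 0;
    if (DIR* d = opendir(dir.c_str())) {
        while (dirent* e = readdir(d)) {
            std::ifstream f(dir + "/" + e->d_name);
            std::string key;
            bool isDmabuf = false;
            long long size = 0;
            while (f >> key) {
                if (key == "exp_name:") isDmabuf = true;  // only dma-bufs have this
                else if (key == "size:") f >> size;
            }
            if (isDmabuf) total += size;
        }
        closedir(d);
    }
    return total;
}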

Root Cause

The syscall trace shows that the fd passed to mmap was 447, and that mapping was indeed released by munmap afterwards. So why is the buffer held by fd 400? Tracing where fd 400 came from reveals a dup() with no matching close():

30714 12:12:34.791556 dup(447 <unfinished ...>
30714 12:12:34.791640 <... dup resumed> ) = 400 <0.000076>

Following the timestamps back into the code shows that dpframework makes the dup call. Its implementation looks like this:

bool function(int32_t FD)
{
    m_shareFD = dup(FD);  // dup() creates a second fd referencing the same ION buffer

    if (-1 == m_shareFD)
    {
        ...
        return false;
    }
    return true;  // m_shareFD is never close()'d anywhere
}

Root cause confirmed: in MTK's code, the shared-memory fd is dup()'ed but never close()'d, so the shared ION memory is never actually released. This is a native MTK platform issue.
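
The natural fix is to make the dup'ed fd's lifetime explicit, e.g. close it when its owner is destroyed. A minimal RAII sketch (a hypothetical wrapper, not MTK's actual patch):

#include <cstdint>
#include <unistd.h>

// Owns the dup()'ed fd and guarantees the matching close(), so the ION
// buffer's refcount can actually drop to zero when the owner goes away.
class ScopedShareFd {
public:
    explicit ScopedShareFd(int32_t fd) : m_shareFD(dup(fd)) {}
    ~ScopedShareFd() {
        if (m_shareFD >= 0)
            close(m_shareFD);  // the close() the original code was missing
    }
    // An fd must not be close()'d twice, so forbid copies.
    ScopedShareFd(const ScopedShareFd&) = delete;
    ScopedShareFd& operator=(const ScopedShareFd&) = delete;

    bool valid() const { return m_shareFD >= 0; }
    int  get()   const { return m_shareFD; }

private:
    int m_shareFD;
};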