There are a few issues that I spotted in the code:
1. The code can mistake a kernel thread for a normal user space process
2. If it thinks it is a kernel thread it puts the task name in [ ] brackets
3. The hashing is performed on the modified task name
So it may be that the crude kernel thread detection code failed on some samples, which changed the task name and so gave the same timer a different hash. We then get a "new" timer hash entry, hence the large spike for a "new item": the old history sits under the incorrectly hashed timer stat and no longer matches.
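To illustrate why the bracketed name matters, here is a minimal sketch. It assumes a djb2 string hash as a stand-in (the hash the tool really uses is not shown here), and hash_djb2() and decorate_kernel_thread() are hypothetical helper names, not functions from the code:

```c
#include <stdio.h>
#include <string.h>

/* djb2 string hash: an assumed stand-in for the tool's real hash */
static unsigned long hash_djb2(const char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + (unsigned long)c;
    return hash;
}

/* Wrap a task name in [ ] as is done for detected kernel threads */
static void decorate_kernel_thread(const char *name, char *out, size_t len)
{
    snprintf(out, len, "[%s]", name);
}
```

If detection succeeds on one sample and fails on the next, the same thread is hashed once as "[kworker/0:1]" and once as "kworker/0:1", and the two hashes do not match, so it lands in a different entry.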
I think the fixes needed are:
1. Add more levels of smarts into the kernel thread detection, namely:
a) is cmdline zero length -> it is a kernel thread (should always work)
b) is pgid of the pid zero -> it is a kernel thread (useful fallback)
c) if can't determine above, compare against a database of known kernel threads (hacky, but consistent)
2. Keep two copies of the task name:
a) The original
b) The modified task name if it is a kernel thread
3. Hash on the pid, taskname, callback and timer function
- a process dying and a new one then matching the same hash is possible, but I consider that very unlikely.
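A rough sketch of the three fixes, under stated assumptions: all function names are hypothetical, the known-thread table is only an illustrative sample, djb2 stands in for the real hash, and the order of checks (cmdline, then pgid, then the table) is my reading of the list above rather than tested behaviour:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* 1a: returns 1 if /proc/<pid>/cmdline is empty, 0 if not, -1 if unknown */
static int cmdline_is_empty(pid_t pid)
{
    char path[64], byte;
    FILE *fp;
    size_t n;
    int err;

    snprintf(path, sizeof(path), "/proc/%ld/cmdline", (long)pid);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    n = fread(&byte, 1, 1, fp);
    err = ferror(fp);
    fclose(fp);
    return err ? -1 : (n == 0);
}

/* 1c: prefix match against a small, illustrative (not complete) table */
static bool known_kernel_thread(const char *name)
{
    static const char *const known[] = {
        "kthreadd", "ksoftirqd", "kworker", "migration", "watchdog",
    };
    size_t i;

    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++)
        if (strncmp(name, known[i], strlen(known[i])) == 0)
            return true;
    return false;
}

/* Fix 1: try cmdline, then pgid, then the table of known names */
static bool is_kernel_thread(pid_t pid, const char *name)
{
    int empty = cmdline_is_empty(pid);
    pid_t pgid;

    if (empty >= 0)
        return empty == 1;
    pgid = getpgid(pid);            /* 1b: pgid of zero -> kernel thread */
    if (pgid >= 0)
        return pgid == 0;
    return known_kernel_thread(name);
}

/* Fix 2: build the display copy, leaving the original name untouched */
static void display_name(const char *name, bool kthread, char *out, size_t len)
{
    snprintf(out, len, kthread ? "[%s]" : "%s", name);
}

/* djb2 string hash: an assumed stand-in for the tool's real hash */
static unsigned long hash_djb2(const char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + (unsigned long)c;
    return hash;
}

/* Fix 3: hash on pid, original task name, callback and timer function */
static unsigned long timer_key_hash(pid_t pid, const char *task,
                                    const char *callback, const char *func)
{
    char key[256];

    snprintf(key, sizeof(key), "%ld|%s|%s|%s", (long)pid, task, callback, func);
    return hash_djb2(key);
}
```

The point of the split is that timer_key_hash() only ever sees the original task name, so a wrong answer from is_kernel_thread() changes what gets displayed but not which hash entry the timer lands in.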