Fixed MTP to work with TWRP

This commit is contained in:
awab228 2018-06-19 23:16:04 +02:00
commit f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions

View file

@ -0,0 +1,107 @@
Subsystem Trace Points: kmem
The kmem tracing system captures events related to object and page allocation
within the kernel. Broadly speaking there are five major subheadings.
o Slab allocation of small objects of unknown type (kmalloc)
o Slab allocation of small objects of known type
o Page allocation
o Per-CPU Allocator Activity
o External Fragmentation
This document describes what each of the tracepoints is and why they
might be useful.
1. Slab allocation of small objects of unknown type
===================================================
kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
kfree call_site=%lx ptr=%p
Heavy activity for these events may indicate that a specific cache is
justified, particularly if kmalloc slab pages are getting significantly
internal fragmented as a result of the allocation pattern. By correlating
kmalloc with kfree, it may be possible to identify memory leaks and where
the allocation sites were.
2. Slab allocation of small objects of known type
=================================================
kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
kmem_cache_free call_site=%lx ptr=%p
These events are similar in usage to the kmalloc-related events except that
it is likely easier to pin the event down to a specific cache. At the time
of writing, no information is available on what slab is being allocated from,
but the call_site can usually be used to extrapolate that information.
3. Page allocation
==================
mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
mm_page_free page=%p pfn=%lu order=%d
mm_page_free_batched page=%p pfn=%lu order=%d cold=%d
These four events deal with page allocation and freeing. mm_page_alloc is
a simple indicator of page allocator activity. Pages may be allocated from
the per-CPU allocator (high performance) or the buddy allocator.
If pages are allocated directly from the buddy allocator, the
mm_page_alloc_zone_locked event is triggered. This event is important as high
amounts of activity imply high activity on the zone->lock. Taking this lock
impairs performance by disabling interrupts, dirtying cache lines between
CPUs and serialising many CPUs.
When a page is freed directly by the caller, the only mm_page_free event
is triggered. Significant amounts of activity here could indicate that the
callers should be batching their activities.
When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
contention on the zone->lru_lock.
4. Per-CPU Allocator Activity
=============================
mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
In front of the page allocator is a per-cpu page allocator. It exists only
for order-0 pages, reduces contention on the zone->lock and reduces the
amount of writing on struct page.
When a per-CPU list is empty or pages of the wrong type are allocated,
the zone->lock will be taken once and the per-CPU list refilled. The event
triggered is mm_page_alloc_zone_locked for each page allocated with the
event indicating whether it is for a percpu_refill or not.
When the per-CPU list is too full, a number of pages are freed, each one
which triggers a mm_page_pcpu_drain event.
The individual nature of the events is so that pages can be tracked
between allocation and freeing. A number of drain or refill pages that occur
consecutively imply the zone->lock being taken once. Large amounts of per-CPU
refills and drains could imply an imbalance between CPUs where too much work
is being concentrated in one place. It could also indicate that the per-CPU
lists should be a larger size. Finally, large amounts of refills on one CPU
and drains on another could be a factor in causing large amounts of cache
line bounces due to writes between CPUs and worth investigating if pages
can be allocated and freed on the same CPU through some algorithm change.
5. External Fragmentation
=========================
mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
External fragmentation affects whether a high-order allocation will be
successful or not. For some types of hardware, this is important although
it is avoided where possible. If the system is using huge pages and needs
to be able to resize the pool over the lifetime of the system, this value
is important.
Large numbers of this event implies that memory is fragmenting and
high-order allocations will start failing at some time in the future. One
means of reducing the occurrence of this event is to increase the size of
min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
pageblock_size is usually the size of the default hugepage size.

View file

@ -0,0 +1,43 @@
NMI Trace Events
These events normally show up here:
/sys/kernel/debug/tracing/events/nmi
--
nmi_handler:
You might want to use this tracepoint if you suspect that your
NMI handlers are hogging large amounts of CPU time. The kernel
will warn if it sees long-running handlers:
INFO: NMI handler took too long to run: 9.207 msecs
and this tracepoint will allow you to drill down and get some
more details.
Let's say you suspect that perf_event_nmi_handler() is causing
you some problems and you only want to trace that handler
specifically. You need to find its address:
$ grep perf_event_nmi_handler /proc/kallsyms
ffffffff81625600 t perf_event_nmi_handler
Let's also say you are only interested in when that function is
really hogging a lot of CPU time, like a millisecond at a time.
Note that the kernel's output is in milliseconds, but the input
to the filter is in nanoseconds! You can filter on 'delta_ns':
cd /sys/kernel/debug/tracing/events/nmi/nmi_handler
echo 'handler==0xffffffff81625600 && delta_ns>1000000' > filter
echo 1 > enable
Your output would then look like:
$ cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [000] d.h3 505.397558: nmi_handler: perf_event_nmi_handler() delta_ns: 3236765 handled: 1
<idle>-0 [000] d.h3 505.805893: nmi_handler: perf_event_nmi_handler() delta_ns: 3174234 handled: 1
<idle>-0 [000] d.h3 506.158206: nmi_handler: perf_event_nmi_handler() delta_ns: 3084642 handled: 1
<idle>-0 [000] d.h3 506.334346: nmi_handler: perf_event_nmi_handler() delta_ns: 3080351 handled: 1

View file

@ -0,0 +1,97 @@
Subsystem Trace Points: power
The power tracing system captures events related to power transitions
within the kernel. Broadly speaking there are three major subheadings:
o Power state switch which reports events related to suspend (S-states),
cpuidle (C-states) and cpufreq (P-states)
o System clock related changes
o Power domains related changes and transitions
This document describes what each of the tracepoints is and why they
might be useful.
Cf. include/trace/events/power.h for the events definitions.
1. Power state switch events
============================
1.1 Trace API
-----------------
A 'cpu' event class gathers the CPU-related events: cpuidle and
cpufreq.
cpu_idle "state=%lu cpu_id=%lu"
cpu_frequency "state=%lu cpu_id=%lu"
cpu_frequency_limits "min=%lu max=%lu cpu_id=%lu"
A suspend event is used to indicate the system going in and out of the
suspend mode:
machine_suspend "state=%lu"
Note: the value of '-1' or '4294967295' for state means an exit from the current state,
i.e. trace_cpu_idle(4, smp_processor_id()) means that the system
enters the idle state 4, while trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id())
means that the system exits the previous idle state.
The event which has 'state=4294967295' in the trace is very important to the user
space tools which are using it to detect the end of the current state, and so to
correctly draw the states diagrams and to calculate accurate statistics etc.
2. Clocks events
================
The clock events are used for clock enable/disable and for
clock rate change.
clock_enable "%s state=%lu cpu_id=%lu"
clock_disable "%s state=%lu cpu_id=%lu"
clock_set_rate "%s state=%lu cpu_id=%lu"
The first parameter gives the clock name (e.g. "gpio1_iclk").
The second parameter is '1' for enable, '0' for disable, the target
clock rate for set_rate.
3. Power domains events
=======================
The power domain events are used for power domains transitions
power_domain_target "%s state=%lu cpu_id=%lu"
The first parameter gives the power domain name (e.g. "mpu_pwrdm").
The second parameter is the power domain target state.
4. PM QoS events
================
The PM QoS events are used for QoS add/update/remove request and for
target/flags update.
pm_qos_add_request "pm_qos_class=%s value=%d"
pm_qos_update_request "pm_qos_class=%s value=%d"
pm_qos_remove_request "pm_qos_class=%s value=%d"
pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld"
The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY").
The second parameter is value to be added/updated/removed.
The third parameter is timeout value in usec.
pm_qos_update_target "action=%s prev_value=%d curr_value=%d"
pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x"
The first parameter gives the QoS action name (e.g. "ADD_REQ").
The second parameter is the previous QoS value.
The third parameter is the current QoS value to update.
And, there are also events used for device PM QoS add/update/remove request.
dev_pm_qos_add_request "device=%s type=%s new_value=%d"
dev_pm_qos_update_request "device=%s type=%s new_value=%d"
dev_pm_qos_remove_request "device=%s type=%s new_value=%d"
The first parameter gives the device name which tries to add/update/remove
QoS requests.
The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY").
The third parameter is value to be added/updated/removed.

View file

@ -0,0 +1,496 @@
Event Tracing
Documentation written by Theodore Ts'o
Updated by Li Zefan and Tom Zanussi
1. Introduction
===============
Tracepoints (see Documentation/trace/tracepoints.txt) can be used
without creating custom kernel modules to register probe functions
using the event tracing infrastructure.
Not all tracepoints can be traced using the event tracing system;
the kernel developer must provide code snippets which define how the
tracing information is saved into the tracing buffer, and how the
tracing information should be printed.
2. Using Event Tracing
======================
2.1 Via the 'set_event' interface
---------------------------------
The events which are available for tracing can be found in the file
/sys/kernel/debug/tracing/available_events.
To enable a particular event, such as 'sched_wakeup', simply echo it
to /sys/kernel/debug/tracing/set_event. For example:
# echo sched_wakeup >> /sys/kernel/debug/tracing/set_event
[ Note: '>>' is necessary, otherwise it will firstly disable
all the events. ]
To disable an event, echo the event name to the set_event file prefixed
with an exclamation point:
# echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
To disable all events, echo an empty line to the set_event file:
# echo > /sys/kernel/debug/tracing/set_event
To enable all events, echo '*:*' or '*:' to the set_event file:
# echo *:* > /sys/kernel/debug/tracing/set_event
The events are organized into subsystems, such as ext4, irq, sched,
etc., and a full event name looks like this: <subsystem>:<event>. The
subsystem name is optional, but it is displayed in the available_events
file. All of the events in a subsystem can be specified via the syntax
"<subsystem>:*"; for example, to enable all irq events, you can use the
command:
# echo 'irq:*' > /sys/kernel/debug/tracing/set_event
2.2 Via the 'enable' toggle
---------------------------
The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy
of directories.
To enable event 'sched_wakeup':
# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
To disable it:
# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
To enable all events in sched subsystem:
# echo 1 > /sys/kernel/debug/tracing/events/sched/enable
To enable all events:
# echo 1 > /sys/kernel/debug/tracing/events/enable
When reading one of these enable files, there are four results:
0 - all events this file affects are disabled
1 - all events this file affects are enabled
X - there is a mixture of events enabled and disabled
? - this file does not affect any event
2.3 Boot option
---------------
In order to facilitate early boot debugging, use boot option:
trace_event=[event-list]
event-list is a comma separated list of events. See section 2.1 for event
format.
3. Defining an event-enabled tracepoint
=======================================
See The example provided in samples/trace_events
4. Event formats
================
Each trace event has a 'format' file associated with it that contains
a description of each field in a logged event. This information can
be used to parse the binary trace stream, and is also the place to
find the field names that can be used in event filters (see section 5).
It also displays the format string that will be used to print the
event in text mode, along with the event name and ID used for
profiling.
Every event has a set of 'common' fields associated with it; these are
the fields prefixed with 'common_'. The other fields vary between
events and correspond to the fields defined in the TRACE_EVENT
definition for that event.
Each field in the format has the form:
field:field-type field-name; offset:N; size:N;
where offset is the offset of the field in the trace record and size
is the size of the data item, in bytes.
For example, here's the information displayed for the 'sched_wakeup'
event:
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
name: sched_wakeup
ID: 60
format:
field:unsigned short common_type; offset:0; size:2;
field:unsigned char common_flags; offset:2; size:1;
field:unsigned char common_preempt_count; offset:3; size:1;
field:int common_pid; offset:4; size:4;
field:int common_tgid; offset:8; size:4;
field:char comm[TASK_COMM_LEN]; offset:12; size:16;
field:pid_t pid; offset:28; size:4;
field:int prio; offset:32; size:4;
field:int success; offset:36; size:4;
field:int cpu; offset:40; size:4;
print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
REC->prio, REC->success, REC->cpu
This event contains 10 fields, the first 5 common and the remaining 5
event-specific. All the fields for this event are numeric, except for
'comm' which is a string, a distinction important for event filtering.
5. Event filtering
==================
Trace events can be filtered in the kernel by associating boolean
'filter expressions' with them. As soon as an event is logged into
the trace buffer, its fields are checked against the filter expression
associated with that event type. An event with field values that
'match' the filter will appear in the trace output, and an event whose
values don't match will be discarded. An event with no filter
associated with it matches everything, and is the default when no
filter has been set for an event.
5.1 Expression syntax
---------------------
A filter expression consists of one or more 'predicates' that can be
combined using the logical operators '&&' and '||'. A predicate is
simply a clause that compares the value of a field contained within a
logged event with a constant value and returns either 0 or 1 depending
on whether the field value matched (1) or didn't match (0):
field-name relational-operator value
Parentheses can be used to provide arbitrary logical groupings and
double-quotes can be used to prevent the shell from interpreting
operators as shell metacharacters.
The field-names available for use in filters can be found in the
'format' files for trace events (see section 4).
The relational-operators depend on the type of the field being tested:
The operators available for numeric fields are:
==, !=, <, <=, >, >=, &
And for string fields they are:
==, !=, ~
The glob (~) only accepts a wild card character (*) at the start and or
end of the string. For example:
prev_comm ~ "*sh"
prev_comm ~ "sh*"
prev_comm ~ "*sh*"
But does not allow for it to be within the string:
prev_comm ~ "ba*sh" <-- is invalid
5.2 Setting filters
-------------------
A filter for an individual event is set by writing a filter expression
to the 'filter' file for the given event.
For example:
# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
# echo "common_preempt_count > 4" > filter
A slightly more involved example:
# cd /sys/kernel/debug/tracing/events/signal/signal_generate
# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
If there is an error in the expression, you'll get an 'Invalid
argument' error when setting it, and the erroneous string along with
an error message can be seen by looking at the filter e.g.:
# cd /sys/kernel/debug/tracing/events/signal/signal_generate
# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
-bash: echo: write error: Invalid argument
# cat filter
((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
^
parse_error: Field not found
Currently the caret ('^') for an error always appears at the beginning of
the filter string; the error message should still be useful though
even without more accurate position info.
5.3 Clearing filters
--------------------
To clear the filter for an event, write a '0' to the event's filter
file.
To clear the filters for all events in a subsystem, write a '0' to the
subsystem's filter file.
5.3 Subsystem filters
---------------------
For convenience, filters for every event in a subsystem can be set or
cleared as a group by writing a filter expression into the filter file
at the root of the subsystem. Note however, that if a filter for any
event within the subsystem lacks a field specified in the subsystem
filter, or if the filter can't be applied for any other reason, the
filter for that event will retain its previous setting. This can
result in an unintended mixture of filters which could lead to
confusing (to the user who might think different filters are in
effect) trace output. Only filters that reference just the common
fields can be guaranteed to propagate successfully to all events.
Here are a few subsystem filter examples that also illustrate the
above points:
Clear the filters on all events in the sched subsystem:
# cd /sys/kernel/debug/tracing/events/sched
# echo 0 > filter
# cat sched_switch/filter
none
# cat sched_wakeup/filter
none
Set a filter using only common fields for all events in the sched
subsystem (all events end up with the same filter):
# cd /sys/kernel/debug/tracing/events/sched
# echo common_pid == 0 > filter
# cat sched_switch/filter
common_pid == 0
# cat sched_wakeup/filter
common_pid == 0
Attempt to set a filter using a non-common field for all events in the
sched subsystem (all events but those that have a prev_pid field retain
their old filters):
# cd /sys/kernel/debug/tracing/events/sched
# echo prev_pid == 0 > filter
# cat sched_switch/filter
prev_pid == 0
# cat sched_wakeup/filter
common_pid == 0
6. Event triggers
=================
Trace events can be made to conditionally invoke trigger 'commands'
which can take various forms and are described in detail below;
examples would be enabling or disabling other trace events or invoking
a stack trace whenever the trace event is hit. Whenever a trace event
with attached triggers is invoked, the set of trigger commands
associated with that event is invoked. Any given trigger can
additionally have an event filter of the same form as described in
section 5 (Event filtering) associated with it - the command will only
be invoked if the event being invoked passes the associated filter.
If no filter is associated with the trigger, it always passes.
Triggers are added to and removed from a particular event by writing
trigger expressions to the 'trigger' file for the given event.
A given event can have any number of triggers associated with it,
subject to any restrictions that individual commands may have in that
regard.
Event triggers are implemented on top of "soft" mode, which means that
whenever a trace event has one or more triggers associated with it,
the event is activated even if it isn't actually enabled, but is
disabled in a "soft" mode. That is, the tracepoint will be called,
but just will not be traced, unless of course it's actually enabled.
This scheme allows triggers to be invoked even for events that aren't
enabled, and also allows the current event filter implementation to be
used for conditionally invoking triggers.
The syntax for event triggers is roughly based on the syntax for
set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands'
section of Documentation/trace/ftrace.txt), but there are major
differences and the implementation isn't currently tied to it in any
way, so beware about making generalizations between the two.
6.1 Expression syntax
---------------------
Triggers are added by echoing the command to the 'trigger' file:
# echo 'command[:count] [if filter]' > trigger
Triggers are removed by echoing the same command but starting with '!'
to the 'trigger' file:
# echo '!command[:count] [if filter]' > trigger
The [if filter] part isn't used in matching commands when removing, so
leaving that off in a '!' command will accomplish the same thing as
having it in.
The filter syntax is the same as that described in the 'Event
filtering' section above.
For ease of use, writing to the trigger file using '>' currently just
adds or removes a single trigger and there's no explicit '>>' support
('>' actually behaves like '>>') or truncation support to remove all
triggers (you have to use '!' for each one added.)
6.2 Supported trigger commands
------------------------------
The following commands are supported:
- enable_event/disable_event
These commands can enable or disable another trace event whenever
the triggering event is hit. When these commands are registered,
the other trace event is activated, but disabled in a "soft" mode.
That is, the tracepoint will be called, but just will not be traced.
The event tracepoint stays in this mode as long as there's a trigger
in effect that can trigger it.
For example, the following trigger causes kmalloc events to be
traced when a read system call is entered, and the :1 at the end
specifies that this enablement happens only once:
# echo 'enable_event:kmem:kmalloc:1' > \
/sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
The following trigger causes kmalloc events to stop being traced
when a read system call exits. This disablement happens on every
read system call exit:
# echo 'disable_event:kmem:kmalloc' > \
/sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
The format is:
enable_event:<system>:<event>[:count]
disable_event:<system>:<event>[:count]
To remove the above commands:
# echo '!enable_event:kmem:kmalloc:1' > \
/sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
# echo '!disable_event:kmem:kmalloc' > \
/sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
Note that there can be any number of enable/disable_event triggers
per triggering event, but there can only be one trigger per
triggered event. e.g. sys_enter_read can have triggers enabling both
kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc
versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if
bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they
could be combined into a single filter on kmem:kmalloc though).
- stacktrace
This command dumps a stacktrace in the trace buffer whenever the
triggering event occurs.
For example, the following trigger dumps a stacktrace every time the
kmalloc tracepoint is hit:
# echo 'stacktrace' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
The following trigger dumps a stacktrace the first 5 times a kmalloc
request happens with a size >= 64K
# echo 'stacktrace:5 if bytes_req >= 65536' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
The format is:
stacktrace[:count]
To remove the above commands:
# echo '!stacktrace' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
# echo '!stacktrace:5 if bytes_req >= 65536' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
The latter can also be removed more simply by the following (without
the filter):
# echo '!stacktrace:5' > \
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
Note that there can be only one stacktrace trigger per triggering
event.
- snapshot
This command causes a snapshot to be triggered whenever the
triggering event occurs.
The following command creates a snapshot every time a block request
queue is unplugged with a depth > 1. If you were tracing a set of
events or functions at the time, the snapshot trace buffer would
capture those events when the trigger event occurred:
# echo 'snapshot if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
To only snapshot once:
# echo 'snapshot:1 if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
To remove the above commands:
# echo '!snapshot if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
# echo '!snapshot:1 if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
Note that there can be only one snapshot trigger per triggering
event.
- traceon/traceoff
These commands turn tracing on and off when the specified events are
hit. The parameter determines how many times the tracing system is
turned on and off. If unspecified, there is no limit.
The following command turns tracing off the first time a block
request queue is unplugged with a depth > 1. If you were tracing a
set of events or functions at the time, you could then examine the
trace buffer to see the sequence of events that led up to the
trigger event:
# echo 'traceoff:1 if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
To always disable tracing when nr_rq > 1 :
# echo 'traceoff if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
To remove the above commands:
# echo '!traceoff:1 if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
# echo '!traceoff if nr_rq > 1' > \
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
Note that there can be only one traceon or traceoff trigger per
triggering event.

View file

@ -0,0 +1,382 @@
function tracer guts
====================
By Mike Frysinger
Introduction
------------
Here we will cover the architecture pieces that the common function tracing
code relies on for proper functioning. Things are broken down into increasing
complexity so that you can start simple and at least get basic functionality.
Note that this focuses on architecture implementation details only. If you
want more explanation of a feature in terms of common code, review the common
ftrace.txt file.
Ideally, everyone who wishes to retain performance while supporting tracing in
their kernel should make it all the way to dynamic ftrace support.
Prerequisites
-------------
Ftrace relies on these features being implemented:
STACKTRACE_SUPPORT - implement save_stack_trace()
TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h
HAVE_FUNCTION_TRACER
--------------------
You will need to implement the mcount and the ftrace_stub functions.
The exact mcount symbol name will depend on your toolchain. Some call it
"mcount", "_mcount", or even "__mcount". You can probably figure it out by
running something like:
$ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount
call mcount
We'll make the assumption below that the symbol is "mcount" just to keep things
nice and simple in the examples.
Keep in mind that the ABI that is in effect inside of the mcount function is
*highly* architecture/toolchain specific. We cannot help you in this regard,
sorry. Dig up some old documentation and/or find someone more familiar than
you to bang ideas off of. Typically, register usage (argument/scratch/etc...)
is a major issue at this point, especially in relation to the location of the
mcount call (before/after function prologue). You might also want to look at
how glibc has implemented the mcount function for your architecture. It might
be (semi-)relevant.
The mcount function should check the function pointer ftrace_trace_function
to see if it is set to ftrace_stub. If it is, there is nothing for you to do,
so return immediately. If it isn't, then call that function in the same way
the mcount function normally calls __mcount_internal -- the first argument is
the "frompc" while the second argument is the "selfpc" (adjusted to remove the
size of the mcount call that is embedded in the function).
For example, if the function foo() calls bar(), when the bar() function calls
mcount(), the arguments mcount() will pass to the tracer are:
"frompc" - the address bar() will use to return to foo()
"selfpc" - the address bar() (with mcount() size adjustment)
Also keep in mind that this mcount function will be called *a lot*, so
optimizing for the default case of no tracer will help the smooth running of
your system when tracing is disabled. So the start of the mcount function is
typically the bare minimum with checking things before returning. That also
means the code flow should usually be kept linear (i.e. no branching in the nop
case). This is of course an optimization and not a hard requirement.
Here is some pseudo code that should help (these functions should actually be
implemented in assembly):
void ftrace_stub(void)
{
return;
}
void mcount(void)
{
/* save any bare state needed in order to do initial checking */
extern void (*ftrace_trace_function)(unsigned long, unsigned long);
if (ftrace_trace_function != ftrace_stub)
goto do_trace;
/* restore any bare state */
return;
do_trace:
/* save all state needed by the ABI (see paragraph above) */
unsigned long frompc = ...;
unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
ftrace_trace_function(frompc, selfpc);
/* restore all state needed by the ABI */
}
Don't forget to export mcount for modules !
extern void mcount(void);
EXPORT_SYMBOL(mcount);
HAVE_FUNCTION_GRAPH_TRACER
--------------------------
Deep breath ... time to do some real work. Here you will need to update the
mcount function to check ftrace graph function pointers, as well as implement
some functions to save (hijack) and restore the return address.
The mcount function should check the function pointers ftrace_graph_return
(compare to ftrace_stub) and ftrace_graph_entry (compare to
ftrace_graph_entry_stub). If either of those is not set to the relevant stub
function, call the arch-specific function ftrace_graph_caller which in turn
calls the arch-specific function prepare_ftrace_return. Neither of these
function names is strictly required, but you should use them anyway to stay
consistent across the architecture ports -- easier to compare & contrast
things.
The arguments to prepare_ftrace_return are slightly different than what are
passed to ftrace_trace_function. The second argument "selfpc" is the same,
but the first argument should be a pointer to the "frompc". Typically this is
located on the stack. This allows the function to hijack the return address
temporarily to have it point to the arch-specific function return_to_handler.
That function will simply call the common ftrace_return_to_handler function and
that will return the original return address with which you can return to the
original call site.
Here is the updated mcount pseudo code:
void mcount(void)
{
...
if (ftrace_trace_function != ftrace_stub)
goto do_trace;
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+ extern void (*ftrace_graph_return)(...);
+ extern void (*ftrace_graph_entry)(...);
+ if (ftrace_graph_return != ftrace_stub ||
+ ftrace_graph_entry != ftrace_graph_entry_stub)
+ ftrace_graph_caller();
+#endif
/* restore any bare state */
...
Here is the pseudo code for the new ftrace_graph_caller assembly function:
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
void ftrace_graph_caller(void)
{
/* save all state needed by the ABI */
unsigned long *frompc = &...;
unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
/* passing frame pointer up is optional -- see below */
prepare_ftrace_return(frompc, selfpc, frame_pointer);
/* restore all state needed by the ABI */
}
#endif
For information on how to implement prepare_ftrace_return(), simply look at the
x86 version (the frame pointer passing is optional; see the next section for
more information). The only architecture-specific piece in it is the setup of
the fault recovery table (the asm(...) code). The rest should be the same
across architectures.
Here is the pseudo code for the new return_to_handler assembly function. Note
that the ABI that applies here is different from what applies to the mcount
code. Since you are returning from a function (after the epilogue), you might
be able to skimp on things saved/restored (usually just registers used to pass
return values).
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
void return_to_handler(void)
{
/* save all state needed by the ABI (see paragraph above) */
void (*original_return_point)(void) = ftrace_return_to_handler();
/* restore all state needed by the ABI */
/* this is usually either a return or a jump */
original_return_point();
}
#endif
HAVE_FUNCTION_GRAPH_FP_TEST
---------------------------
An arch may pass in a unique value (frame pointer) to both the entering and
exiting of a function. On exit, the value is compared and if it does not
match, then it will panic the kernel. This is largely a sanity check for bad
code generation with gcc. If gcc for your port sanely updates the frame
pointer under different optimization levels, then ignore this option.
However, adding support for it isn't terribly difficult. In your assembly code
that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument.
Then in the C version of that function, do what the x86 port does and pass it
along to ftrace_push_return_trace() instead of a stub value of 0.
Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer.
HAVE_FTRACE_NMI_ENTER
---------------------
If you can't trace NMI functions, then skip this option.
<details to be filled>
HAVE_SYSCALL_TRACEPOINTS
------------------------
You need very few things to get the syscalls tracing in an arch.
- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig).
- Have a NR_syscalls variable in <asm/unistd.h> that provides the number
of syscalls supported by the arch.
- Support the TIF_SYSCALL_TRACEPOINT thread flags.
- Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace
in the ptrace syscalls tracing path.
- If the system call table on this arch is more complicated than a simple array
of addresses of the system calls, implement an arch_syscall_addr to return
the address of a given system call.
- If the symbol names of the system calls do not match the function names on
this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and
implement arch_syscall_match_sym_name with the appropriate logic to return
true if the function name corresponds with the symbol name.
- Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
HAVE_FTRACE_MCOUNT_RECORD
-------------------------
See scripts/recordmcount.pl for more info. Just fill in the arch-specific
details for how to locate the addresses of mcount call sites via objdump.
This option doesn't make much sense without also implementing dynamic ftrace.
HAVE_DYNAMIC_FTRACE
-------------------
You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so
scroll your reader back up if you got over eager.
Once those are out of the way, you will need to implement:
- asm/ftrace.h:
- MCOUNT_ADDR
- ftrace_call_adjust()
- struct dyn_arch_ftrace{}
- asm code:
- mcount() (new stub)
- ftrace_caller()
- ftrace_call()
- ftrace_stub()
- C code:
- ftrace_dyn_arch_init()
- ftrace_make_nop()
- ftrace_make_call()
- ftrace_update_ftrace_func()
First you will need to fill out some arch details in your asm/ftrace.h.
Define MCOUNT_ADDR as the address of your mcount symbol similar to:
#define MCOUNT_ADDR ((unsigned long)mcount)
Since no one else will have a decl for that function, you will need to:
extern void mcount(void);
You will also need the helper function ftrace_call_adjust(). Most people
will be able to stub it out like so:
static inline unsigned long ftrace_call_adjust(unsigned long addr)
{
return addr;
}
<details to be filled>
Lastly you will need the custom dyn_arch_ftrace structure. If you need
some extra state when runtime patching arbitrary call sites, this is the
place. For now though, create an empty struct:
struct dyn_arch_ftrace {
/* No extra data needed */
};
With the header out of the way, we can fill out the assembly code. While we
did already create a mcount() function earlier, dynamic ftrace only wants a
stub function. This is because the mcount() will only be used during boot
and then all references to it will be patched out never to return. Instead,
the guts of the old mcount() will be used to create a new ftrace_caller()
function. Because the two are hard to merge, it will most likely be a lot
easier to have two separate definitions split up by #ifdefs. Same goes for
the ftrace_stub() as that will now be inlined in ftrace_caller().
Before we get confused anymore, let's check out some pseudo code so you can
implement your own stuff in assembly:
void mcount(void)
{
return;
}
void ftrace_caller(void)
{
/* save all state needed by the ABI (see paragraph above) */
unsigned long frompc = ...;
unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
ftrace_call:
ftrace_stub(frompc, selfpc);
/* restore all state needed by the ABI */
ftrace_stub:
return;
}
This might look a little odd at first, but keep in mind that we will be runtime
patching multiple things. First, only functions that we actually want to trace
will be patched to call ftrace_caller(). Second, since we only have one tracer
active at a time, we will patch the ftrace_caller() function itself to call the
specific tracer in question. That is the point of the ftrace_call label.
With that in mind, let's move on to the C code that will actually be doing the
runtime patching. You'll need a little knowledge of your arch's opcodes in
order to make it through the next section.
Every arch has an init callback function. If you need to do something early on
to initialize some state, this is the time to do that. Otherwise, this simple
function below should be sufficient for most people:
int __init ftrace_dyn_arch_init(void)
{
return 0;
}
There are two functions that are used to do runtime patching of arbitrary
functions. The first is used to turn the mcount call site into a nop (which
is what helps us retain runtime performance when not tracing). The second is
used to turn the mcount call site into a call to an arbitrary location (but
typically that is ftracer_caller()). See the general function definition in
linux/ftrace.h for the functions:
ftrace_make_nop()
ftrace_make_call()
The rec->ip value is the address of the mcount call site that was collected
by the scripts/recordmcount.pl during build time.
The last function is used to do runtime patching of the active tracer. This
will be modifying the assembly code at the location of the ftrace_call symbol
inside of the ftrace_caller() function. So you should have sufficient padding
at that location to support the new function calls you'll be inserting. Some
people will be using a "call" type instruction while others will be using a
"branch" type instruction. Specifically, the function is:
ftrace_update_ftrace_func()
HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER
------------------------------------------------
The function grapher needs a few tweaks in order to work with dynamic ftrace.
Basically, you will need to:
- update:
- ftrace_caller()
- ftrace_graph_call()
- ftrace_graph_caller()
- implement:
- ftrace_enable_ftrace_graph_caller()
- ftrace_disable_ftrace_graph_caller()
<details to be filled>
Quick notes:
- add a nop stub after the ftrace_call location named ftrace_graph_call;
stub needs to be large enough to support a call to ftrace_graph_caller()
- update ftrace_graph_caller() to work with being called by the new
ftrace_caller() since some semantics may have changed
- ftrace_enable_ftrace_graph_caller() will runtime patch the
ftrace_graph_call location with a call to ftrace_graph_caller()
- ftrace_disable_ftrace_graph_caller() will runtime patch the
ftrace_graph_call location with nops

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,42 @@
" Enable folding for ftrace function_graph traces.
"
" To use, :source this file while viewing a function_graph trace, or use vim's
" -S option to load from the command-line together with a trace. You can then
" use the usual vim fold commands, such as "za", to open and close nested
" functions. While closed, a fold will show the total time taken for a call,
" as would normally appear on the line with the closing brace. Folded
" functions will not include finish_task_switch(), so folding should remain
" relatively sane even through a context switch.
"
" Note that this will almost certainly only work well with a
" single-CPU trace (e.g. trace-cmd report --cpu 1).
function! FunctionGraphFoldExpr(lnum)
let line = getline(a:lnum)
if line[-1:] == '{'
if line =~ 'finish_task_switch() {$'
return '>1'
endif
return 'a1'
elseif line[-1:] == '}'
return 's1'
else
return '='
endif
endfunction
function! FunctionGraphFoldText()
let s = split(getline(v:foldstart), '|', 1)
if getline(v:foldend+1) =~ 'finish_task_switch() {$'
let s[2] = ' task switch '
else
let e = split(getline(v:foldend), '|', 1)
let s[2] = e[2]
endif
return join(s, '|')
endfunction
setlocal foldexpr=FunctionGraphFoldExpr(v:lnum)
setlocal foldtext=FunctionGraphFoldText()
setlocal foldcolumn=12
setlocal foldmethod=expr

View file

@ -0,0 +1,172 @@
Kprobe-based Event Tracing
==========================
Documentation is written by Masami Hiramatsu
Overview
--------
These events are similar to tracepoint based events. Instead of Tracepoint,
this is based on kprobes (kprobe and kretprobe). So it can probe wherever
kprobes can probe (this means, all functions body except for __kprobes
functions). Unlike the Tracepoint based event, this can be added and removed
dynamically, on the fly.
To enable this feature, build your kernel with CONFIG_KPROBE_EVENT=y.
Similar to the events tracer, this doesn't need to be activated via
current_tracer. Instead of that, add probe points via
/sys/kernel/debug/tracing/kprobe_events, and enable it via
/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled.
Synopsis of kprobe_events
-------------------------
p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
-:[GRP/]EVENT : Clear a probe
GRP : Group name. If omitted, use "kprobes" for it.
EVENT : Event name. If omitted, the event name is generated
based on SYM+offs or MEMADDR.
MOD : Module name which has given SYM.
SYM[+offs] : Symbol+offset where the probe is inserted.
MEMADDR : Address where the probe is inserted.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address.
$retval : Fetch return value.(*)
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
are supported.
(*) only for return probe.
(**) this is useful for fetching a field of data structures.
Types
-----
Several types are supported for fetch-args. Kprobe tracer will access memory
by given type. Prefix 's' and 'u' means those types are signed and unsigned
respectively. Traced arguments are shown in decimal (signed) or hex (unsigned).
String type is a special type, which fetches a "null-terminated" string from
kernel space. This means it will fail and store NULL if the string container
has been paged out.
Bitfield is another special type, which takes 3 parameters, bit-width, bit-
offset, and container-size (usually 32). The syntax is;
b<bit-width>@<bit-offset>/<container-size>
Per-Probe Event Filtering
-------------------------
Per-probe event filtering feature allows you to set different filter on each
probe and gives you what arguments will be shown in trace buffer. If an event
name is specified right after 'p:' or 'r:' in kprobe_events, it adds an event
under tracing/events/kprobes/<EVENT>, at the directory you can see 'id',
'enabled', 'format' and 'filter'.
enabled:
You can enable/disable the probe by writing 1 or 0 on it.
format:
This shows the format of this probe event.
filter:
You can write filtering rules of this event.
id:
This shows the id of this probe event.
Event Profiling
---------------
You can check the total number of probe hits and probe miss-hits via
/sys/kernel/debug/tracing/kprobe_profile.
The first column is event name, the second is the number of probe hits,
the third is the number of probe miss-hits.
Usage examples
--------------
To add a probe as a new event, write a new definition to kprobe_events
as below.
echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events
This sets a kprobe on the top of do_sys_open() function with recording
1st to 4th arguments as "myprobe" event. Note, which register/stack entry is
assigned to each function argument depends on arch-specific ABI. If you unsure
the ABI, please try to use probe subcommand of perf-tools (you can find it
under tools/perf/).
As this example shows, users can choose more familiar names for each arguments.
echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events
This sets a kretprobe on the return point of do_sys_open() function with
recording return value as "myretprobe" event.
You can see the format of these events via
/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format.
cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
name: myprobe
ID: 780
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __probe_ip; offset:12; size:4; signed:0;
field:int __probe_nargs; offset:16; size:4; signed:1;
field:unsigned long dfd; offset:20; size:4; signed:0;
field:unsigned long filename; offset:24; size:4; signed:0;
field:unsigned long flags; offset:28; size:4; signed:0;
field:unsigned long mode; offset:32; size:4; signed:0;
print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip,
REC->dfd, REC->filename, REC->flags, REC->mode
You can see that the event has 4 arguments as in the expressions you specified.
echo > /sys/kernel/debug/tracing/kprobe_events
This clears all probe points.
Or,
echo -:myprobe >> kprobe_events
This clears probe points selectively.
Right after definition, each event is disabled by default. For tracing these
events, you need to enable it.
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
And you can see the traced information via /sys/kernel/debug/tracing/trace.
cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0
<...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe
<...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6
<...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
<...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10
<...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
Each line shows when the kernel hits an event, and <- SYMBOL means kernel
returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel
returns from do_sys_open to sys_open+0x1b).

View file

@ -0,0 +1,164 @@
In-kernel memory-mapped I/O tracing
Home page and links to optional user space tools:
http://nouveau.freedesktop.org/wiki/MmioTrace
MMIO tracing was originally developed by Intel around 2003 for their Fault
Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau
project in mind. Since then many people have contributed.
Mmiotrace was built for reverse engineering any memory-mapped IO device with
the Nouveau project as the first real user. Only x86 and x86_64 architectures
are supported.
Out-of-tree mmiotrace was originally modified for mainline inclusion and
ftrace framework by Pekka Paalanen <pq@iki.fi>.
Preparation
-----------
Mmiotrace feature is compiled in by the CONFIG_MMIOTRACE option. Tracing is
disabled by default, so it is safe to have this set to yes. SMP systems are
supported, but tracing is unreliable and may miss events if more than one CPU
is on-line, therefore mmiotrace takes all but one CPU off-line during run-time
activation. You can re-enable CPUs by hand, but you have been warned, there
is no way to automatically detect if you are losing events due to CPUs racing.
Usage Quick Reference
---------------------
$ mount -t debugfs debugfs /sys/kernel/debug
$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
Start X or whatever.
$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
$ echo nop > /sys/kernel/debug/tracing/current_tracer
Check for lost events.
Usage
-----
Make sure debugfs is mounted to /sys/kernel/debug.
If not (requires root privileges):
$ mount -t debugfs debugfs /sys/kernel/debug
Check that the driver you are about to trace is not loaded.
Activate mmiotrace (requires root privileges):
$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
Start storing the trace:
$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
The 'cat' process should stay running (sleeping) in the background.
Load the driver you want to trace and use it. Mmiotrace will only catch MMIO
accesses to areas that are ioremapped while mmiotrace is active.
During tracing you can place comments (markers) into the trace by
$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
This makes it easier to see which part of the (huge) trace corresponds to
which action. It is recommended to place descriptive markers about what you
do.
Shut down mmiotrace (requires root privileges):
$ echo nop > /sys/kernel/debug/tracing/current_tracer
The 'cat' process exits. If it does not, kill it by issuing 'fg' command and
pressing ctrl+c.
Check that mmiotrace did not lose events due to a buffer filling up. Either
$ grep -i lost mydump.txt
which tells you exactly how many events were lost, or use
$ dmesg
to view your kernel log and look for "mmiotrace has lost events" warning. If
events were lost, the trace is incomplete. You should enlarge the buffers and
try again. Buffers are enlarged by first seeing how large the current buffers
are:
$ cat /sys/kernel/debug/tracing/buffer_size_kb
gives you a number. Approximately double this number and write it back, for
instance:
$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb
Then start again from the top.
If you are doing a trace for a driver project, e.g. Nouveau, you should also
do the following before sending your results:
$ lspci -vvv > lspci.txt
$ dmesg > dmesg.txt
$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt
and then send the .tar.gz file. The trace compresses considerably. Replace
"pciid" and "nick" with the PCI ID or model name of your piece of hardware
under investigation and your nickname.
How Mmiotrace Works
-------------------
Access to hardware IO-memory is gained by mapping addresses from PCI bus by
calling one of the ioremap_*() functions. Mmiotrace is hooked into the
__ioremap() function and gets called whenever a mapping is created. Mapping is
an event that is recorded into the trace log. Note that ISA range mappings
are not caught, since the mapping always exists and is returned directly.
MMIO accesses are recorded via page faults. Just before __ioremap() returns,
the mapped pages are marked as not present. Any access to the pages causes a
fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace
marks the page present, sets TF flag to achieve single stepping and exits the
fault handler. The instruction that faulted is executed and debug trap is
entered. Here mmiotrace again marks the page as not present. The instruction
is decoded to get the type of operation (read/write), data width and the value
read or written. These are stored to the trace log.
Setting the page present in the page fault handler has a race condition on SMP
machines. During the single stepping other CPUs may run freely on that page
and events can be missed without a notice. Re-enabling other CPUs during
tracing is discouraged.
Trace Log Format
----------------
The raw log is text and easily filtered with e.g. grep and awk. One record is
one line in the log. A record starts with a keyword, followed by keyword-
dependent arguments. Arguments are separated by a space, or continue until the
end of line. The format for version 20070824 is as follows:
Explanation Keyword Space-separated arguments
---------------------------------------------------------------------------
read event R width, timestamp, map id, physical, value, PC, PID
write event W width, timestamp, map id, physical, value, PC, PID
ioremap event MAP timestamp, map id, physical, virtual, length, PC, PID
iounmap event UNMAP timestamp, map id, PC, PID
marker MARK timestamp, text
version VERSION the string "20070824"
info for reader LSPCI one line from lspci -v
PCI address map PCIDEV space-separated /proc/bus/pci/devices data
unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID
Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual
is a kernel virtual address. Width is the data width in bytes and value is the
data value. Map id is an arbitrary id number identifying the mapping that was
used in an operation. PC is the program counter and PID is process id. PC is
zero if it is not recorded. PID is always zero as tracing MMIO accesses
originating in user space memory is not yet supported.
For instance, the following awk filter will pass all 32-bit writes that target
physical addresses in the range [0xfb73ce40, 0xfb800000[
$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 &&
adr < 0xfb800000) print; }'
Tools for Developers
--------------------
The user space tools include utilities for:
- replacing numeric addresses and values with hardware register names
- replaying MMIO logs, i.e., re-executing the recorded writes

View file

@ -0,0 +1,418 @@
#!/usr/bin/perl
# This is a POC (proof of concept or piece of crap, take your pick) for reading the
# text representation of trace output related to page allocation. It makes an attempt
# to extract some high-level information on what is going on. The accuracy of the parser
# may vary considerably
#
# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
# other options
# --prepend-parent Report on the parent proc and PID
# --read-procstat If the trace lacks process info, get it from /proc
# --ignore-pid Aggregate processes of the same name together
#
# Copyright (c) IBM Corporation 2009
# Author: Mel Gorman <mel@csn.ul.ie>
use strict;
use Getopt::Long;
# Tracepoint events
use constant MM_PAGE_ALLOC => 1;
use constant MM_PAGE_FREE => 2;
use constant MM_PAGE_FREE_BATCHED => 3;
use constant MM_PAGE_PCPU_DRAIN => 4;
use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
use constant MM_PAGE_ALLOC_EXTFRAG => 6;
use constant EVENT_UNKNOWN => 7;
# Constants used to track state
use constant STATE_PCPU_PAGES_DRAINED => 8;
use constant STATE_PCPU_PAGES_REFILLED => 9;
# High-level events extrapolated from tracepoints
use constant HIGH_PCPU_DRAINS => 10;
use constant HIGH_PCPU_REFILLS => 11;
use constant HIGH_EXT_FRAGMENT => 12;
use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
my %perprocesspid;
my %perprocess;
my $opt_ignorepid;
my $opt_read_procstat;
my $opt_prepend_parent;
# Catch sigint and exit on request
my $sigint_report = 0;
my $sigint_exit = 0;
my $sigint_pending = 0;
my $sigint_received = 0;
sub sigint_handler {
my $current_time = time;
if ($current_time - 2 > $sigint_received) {
print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
$sigint_report = 1;
} else {
if (!$sigint_exit) {
print "Second SIGINT received quickly, exiting\n";
}
$sigint_exit++;
}
if ($sigint_exit > 3) {
print "Many SIGINTs received, exiting now without report\n";
exit;
}
$sigint_received = $current_time;
$sigint_pending = 1;
}
$SIG{INT} = "sigint_handler";
# Parse command line options
GetOptions(
'ignore-pid' => \$opt_ignorepid,
'read-procstat' => \$opt_read_procstat,
'prepend-parent' => \$opt_prepend_parent,
);
# Defaults for dynamically discovered regex's
my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
# Dyanically discovered regex
my $regex_fragdetails;
# Static regex used. Specified like this for readability and for use with /o
# (process_pid) (cpus ) ( time ) (tpoint ) (details)
my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
my $regex_statname = '[-0-9]*\s\((.*)\).*';
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
sub generate_traceevent_regex {
my $event = shift;
my $default = shift;
my $regex;
# Read the event format or use the default
if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
$regex = $default;
} else {
my $line;
while (!eof(FORMAT)) {
$line = <FORMAT>;
if ($line =~ /^print fmt:\s"(.*)",.*/) {
$regex = $1;
$regex =~ s/%p/\([0-9a-f]*\)/g;
$regex =~ s/%d/\([-0-9]*\)/g;
$regex =~ s/%lu/\([0-9]*\)/g;
}
}
}
# Verify fields are in the right order
my $tuple;
foreach $tuple (split /\s/, $regex) {
my ($key, $value) = split(/=/, $tuple);
my $expected = shift;
if ($key ne $expected) {
print("WARNING: Format not as expected '$key' != '$expected'");
$regex =~ s/$key=\((.*)\)/$key=$1/;
}
}
if (defined shift) {
die("Fewer fields than expected in format");
}
return $regex;
}
$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
$regex_fragdetails_default,
"page", "pfn",
"alloc_order", "fallback_order", "pageblock_order",
"alloc_migratetype", "fallback_migratetype",
"fragmenting", "change_ownership");
sub read_statline($) {
my $pid = $_[0];
my $statline;
if (open(STAT, "/proc/$pid/stat")) {
$statline = <STAT>;
close(STAT);
}
if ($statline eq '') {
$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
}
return $statline;
}
sub guess_process_pid($$) {
my $pid = $_[0];
my $statline = $_[1];
if ($pid == 0) {
return "swapper-0";
}
if ($statline !~ /$regex_statname/o) {
die("Failed to math stat line for process name :: $statline");
}
return "$1-$pid";
}
sub parent_info($$) {
my $pid = $_[0];
my $statline = $_[1];
my $ppid;
if ($pid == 0) {
return "NOPARENT-0";
}
if ($statline !~ /$regex_statppid/o) {
die("Failed to match stat line process ppid:: $statline");
}
# Read the ppid stat line
$ppid = $1;
return guess_process_pid($ppid, read_statline($ppid));
}
sub process_events {
my $traceevent;
my $process_pid;
my $cpus;
my $timestamp;
my $tracepoint;
my $details;
my $statline;
# Read each line of the event log
EVENT_PROCESS:
while ($traceevent = <STDIN>) {
if ($traceevent =~ /$regex_traceevent/o) {
$process_pid = $1;
$tracepoint = $4;
if ($opt_read_procstat || $opt_prepend_parent) {
$process_pid =~ /(.*)-([0-9]*)$/;
my $process = $1;
my $pid = $2;
$statline = read_statline($pid);
if ($opt_read_procstat && $process eq '') {
$process_pid = guess_process_pid($pid, $statline);
}
if ($opt_prepend_parent) {
$process_pid = parent_info($pid, $statline) . " :: $process_pid";
}
}
# Unnecessary in this script. Uncomment if required
# $cpus = $2;
# $timestamp = $3;
} else {
next;
}
# Perl Switch() sucks majorly
if ($tracepoint eq "mm_page_alloc") {
$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
} elsif ($tracepoint eq "mm_page_free") {
$perprocesspid{$process_pid}->{MM_PAGE_FREE}++
} elsif ($tracepoint eq "mm_page_free_batched") {
$perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}++;
} elsif ($tracepoint eq "mm_page_pcpu_drain") {
$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
# Extract the details of the event now
$details = $5;
my ($page, $pfn);
my ($alloc_order, $fallback_order, $pageblock_order);
my ($alloc_migratetype, $fallback_migratetype);
my ($fragmenting, $change_ownership);
if ($details !~ /$regex_fragdetails/o) {
print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
next;
}
$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
$page = $1;
$pfn = $2;
$alloc_order = $3;
$fallback_order = $4;
$pageblock_order = $5;
$alloc_migratetype = $6;
$fallback_migratetype = $7;
$fragmenting = $8;
$change_ownership = $9;
if ($fragmenting) {
$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
if ($fallback_order <= 3) {
$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
} else {
$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
}
}
if ($change_ownership) {
$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
}
} else {
$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
}
# Catch a full pcpu drain event
if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
$tracepoint ne "mm_page_pcpu_drain") {
$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
}
# Catch a full pcpu refill event
if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
$tracepoint ne "mm_page_alloc_zone_locked") {
$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
}
if ($sigint_pending) {
last EVENT_PROCESS;
}
}
}
sub dump_stats {
my $hashref = shift;
my %stats = %$hashref;
# Dump per-process stats
my $process_pid;
my $max_strlen = 0;
# Get the maximum process name
foreach $process_pid (keys %perprocesspid) {
my $len = length($process_pid);
if ($len > $max_strlen) {
$max_strlen = $len;
}
}
$max_strlen += 2;
printf("\n");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
"Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
"details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
"", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
foreach $process_pid (keys %stats) {
# Dump final aggregates
if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
}
if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
}
printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
$process_pid,
$stats{$process_pid}->{MM_PAGE_ALLOC},
$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
$stats{$process_pid}->{MM_PAGE_FREE},
$stats{$process_pid}->{MM_PAGE_FREE_BATCHED},
$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
$stats{$process_pid}->{HIGH_PCPU_DRAINS},
$stats{$process_pid}->{HIGH_PCPU_REFILLS},
$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
$stats{$process_pid}->{HIGH_EXT_FRAG},
$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
$stats{$process_pid}->{EVENT_UNKNOWN});
}
}
sub aggregate_perprocesspid() {
my $process_pid;
my $process;
undef %perprocess;
foreach $process_pid (keys %perprocesspid) {
$process = $process_pid;
$process =~ s/-([0-9])*$//;
if ($process eq '') {
$process = "NO_PROCESS_NAME";
}
$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
$perprocess{$process}->{MM_PAGE_FREE} += $perprocesspid{$process_pid}->{MM_PAGE_FREE};
$perprocess{$process}->{MM_PAGE_FREE_BATCHED} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED};
$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
}
}
sub report() {
if (!$opt_ignorepid) {
dump_stats(\%perprocesspid);
} else {
aggregate_perprocesspid();
dump_stats(\%perprocess);
}
}
# Process events or signals until neither is available
sub signal_loop() {
my $sigint_processed;
do {
$sigint_processed = 0;
process_events();
# Handle pending signals if any
if ($sigint_pending) {
my $current_time = time;
if ($sigint_exit) {
print "Received exit signal\n";
$sigint_pending = 0;
}
if ($sigint_report) {
if ($current_time >= $sigint_received + 2) {
report();
$sigint_report = 0;
$sigint_pending = 0;
$sigint_processed = 1;
}
}
}
} while ($sigint_pending || $sigint_processed);
}
signal_loop();
report();

View file

@ -0,0 +1,757 @@
#!/usr/bin/perl
# This is a POC for reading the text representation of trace output related to
# page reclaim. It makes an attempt to extract some high-level information on
# what is going on. The accuracy of the parser may vary
#
# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
# other options
# --read-procstat If the trace lacks process info, get it from /proc
# --ignore-pid Aggregate processes of the same name together
#
# Copyright (c) IBM Corporation 2009
# Author: Mel Gorman <mel@csn.ul.ie>
use strict;
use Getopt::Long;
# Tracepoint events
use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1;
use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2;
use constant MM_VMSCAN_KSWAPD_WAKE => 3;
use constant MM_VMSCAN_KSWAPD_SLEEP => 4;
use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5;
use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6;
use constant MM_VMSCAN_LRU_ISOLATE => 7;
use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8;
use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9;
use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10;
use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11;
use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12;
use constant EVENT_UNKNOWN => 13;
# Per-order events
use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12;
use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13;
use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14;
# Constants used to track state
use constant STATE_DIRECT_BEGIN => 15;
use constant STATE_DIRECT_ORDER => 16;
use constant STATE_KSWAPD_BEGIN => 17;
use constant STATE_KSWAPD_ORDER => 18;
# High-level events extrapolated from tracepoints
use constant HIGH_DIRECT_RECLAIM_LATENCY => 19;
use constant HIGH_KSWAPD_LATENCY => 20;
use constant HIGH_KSWAPD_REWAKEUP => 21;
use constant HIGH_NR_SCANNED => 22;
use constant HIGH_NR_TAKEN => 23;
use constant HIGH_NR_RECLAIMED => 24;
use constant HIGH_NR_FILE_SCANNED => 25;
use constant HIGH_NR_ANON_SCANNED => 26;
use constant HIGH_NR_FILE_RECLAIMED => 27;
use constant HIGH_NR_ANON_RECLAIMED => 28;
my %perprocesspid;
my %perprocess;
my %last_procmap;
my $opt_ignorepid;
my $opt_read_procstat;
my $total_wakeup_kswapd;
my ($total_direct_reclaim, $total_direct_nr_scanned);
my ($total_direct_nr_file_scanned, $total_direct_nr_anon_scanned);
my ($total_direct_latency, $total_kswapd_latency);
my ($total_direct_nr_reclaimed);
my ($total_direct_nr_file_reclaimed, $total_direct_nr_anon_reclaimed);
my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
my ($total_kswapd_nr_scanned, $total_kswapd_wake);
my ($total_kswapd_nr_file_scanned, $total_kswapd_nr_anon_scanned);
my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
my ($total_kswapd_nr_reclaimed);
my ($total_kswapd_nr_file_reclaimed, $total_kswapd_nr_anon_reclaimed);
# Catch sigint and exit on request
my $sigint_report = 0;
my $sigint_exit = 0;
my $sigint_pending = 0;
my $sigint_received = 0;
sub sigint_handler {
my $current_time = time;
if ($current_time - 2 > $sigint_received) {
print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
$sigint_report = 1;
} else {
if (!$sigint_exit) {
print "Second SIGINT received quickly, exiting\n";
}
$sigint_exit++;
}
if ($sigint_exit > 3) {
print "Many SIGINTs received, exiting now without report\n";
exit;
}
$sigint_received = $current_time;
$sigint_pending = 1;
}
$SIG{INT} = "sigint_handler";
# Parse command line options
GetOptions(
'ignore-pid' => \$opt_ignorepid,
'read-procstat' => \$opt_read_procstat,
);
# Defaults for dynamically discovered regex's
my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) file=([0-9]*)';
my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
# Dyanically discovered regex
my $regex_direct_begin;
my $regex_direct_end;
my $regex_kswapd_wake;
my $regex_kswapd_sleep;
my $regex_wakeup_kswapd;
my $regex_lru_isolate;
my $regex_lru_shrink_inactive;
my $regex_lru_shrink_active;
my $regex_writepage;
# Static regex used. Specified like this for readability and for use with /o
# (process_pid) (cpus ) ( time ) (tpoint ) (details)
my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])(\s*[dX.][Nnp.][Hhs.][0-9a-fA-F.]*|)\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
my $regex_statname = '[-0-9]*\s\((.*)\).*';
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
sub generate_traceevent_regex {
my $event = shift;
my $default = shift;
my $regex;
# Read the event format or use the default
if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
print("WARNING: Event $event format string not found\n");
return $default;
} else {
my $line;
while (!eof(FORMAT)) {
$line = <FORMAT>;
$line =~ s/, REC->.*//;
if ($line =~ /^print fmt:\s"(.*)".*/) {
$regex = $1;
$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
$regex =~ s/%p/\([0-9a-f]*\)/g;
$regex =~ s/%d/\([-0-9]*\)/g;
$regex =~ s/%ld/\([-0-9]*\)/g;
$regex =~ s/%lu/\([0-9]*\)/g;
}
}
}
# Can't handle the print_flags stuff but in the context of this
# script, it really doesn't matter
$regex =~ s/\(REC.*\) \? __print_flags.*//;
# Verify fields are in the right order
my $tuple;
foreach $tuple (split /\s/, $regex) {
my ($key, $value) = split(/=/, $tuple);
my $expected = shift;
if ($key ne $expected) {
print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
$regex =~ s/$key=\((.*)\)/$key=$1/;
}
}
if (defined shift) {
die("Fewer fields than expected in format");
}
return $regex;
}
$regex_direct_begin = generate_traceevent_regex(
"vmscan/mm_vmscan_direct_reclaim_begin",
$regex_direct_begin_default,
"order", "may_writepage",
"gfp_flags");
$regex_direct_end = generate_traceevent_regex(
"vmscan/mm_vmscan_direct_reclaim_end",
$regex_direct_end_default,
"nr_reclaimed");
$regex_kswapd_wake = generate_traceevent_regex(
"vmscan/mm_vmscan_kswapd_wake",
$regex_kswapd_wake_default,
"nid", "order");
$regex_kswapd_sleep = generate_traceevent_regex(
"vmscan/mm_vmscan_kswapd_sleep",
$regex_kswapd_sleep_default,
"nid");
$regex_wakeup_kswapd = generate_traceevent_regex(
"vmscan/mm_vmscan_wakeup_kswapd",
$regex_wakeup_kswapd_default,
"nid", "zid", "order");
$regex_lru_isolate = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_isolate",
$regex_lru_isolate_default,
"isolate_mode", "order",
"nr_requested", "nr_scanned", "nr_taken",
"file");
$regex_lru_shrink_inactive = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_shrink_inactive",
$regex_lru_shrink_inactive_default,
"nid", "zid",
"nr_scanned", "nr_reclaimed", "priority",
"flags");
$regex_lru_shrink_active = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_shrink_active",
$regex_lru_shrink_active_default,
"nid", "zid",
"lru",
"nr_scanned", "nr_rotated", "priority");
$regex_writepage = generate_traceevent_regex(
"vmscan/mm_vmscan_writepage",
$regex_writepage_default,
"page", "pfn", "flags");
sub read_statline($) {
my $pid = $_[0];
my $statline;
if (open(STAT, "/proc/$pid/stat")) {
$statline = <STAT>;
close(STAT);
}
if ($statline eq '') {
$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
}
return $statline;
}
sub guess_process_pid($$) {
my $pid = $_[0];
my $statline = $_[1];
if ($pid == 0) {
return "swapper-0";
}
if ($statline !~ /$regex_statname/o) {
die("Failed to math stat line for process name :: $statline");
}
return "$1-$pid";
}
# Convert sec.usec timestamp format
sub timestamp_to_ms($) {
my $timestamp = $_[0];
my ($sec, $usec) = split (/\./, $timestamp);
return ($sec * 1000) + ($usec / 1000);
}
sub process_events {
my $traceevent;
my $process_pid;
my $cpus;
my $timestamp;
my $tracepoint;
my $details;
my $statline;
# Read each line of the event log
EVENT_PROCESS:
while ($traceevent = <STDIN>) {
if ($traceevent =~ /$regex_traceevent/o) {
$process_pid = $1;
$timestamp = $4;
$tracepoint = $5;
$process_pid =~ /(.*)-([0-9]*)$/;
my $process = $1;
my $pid = $2;
if ($process eq "") {
$process = $last_procmap{$pid};
$process_pid = "$process-$pid";
}
$last_procmap{$pid} = $process;
if ($opt_read_procstat) {
$statline = read_statline($pid);
if ($opt_read_procstat && $process eq '') {
$process_pid = guess_process_pid($pid, $statline);
}
}
} else {
next;
}
# Perl Switch() sucks majorly
if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
$timestamp = timestamp_to_ms($timestamp);
$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
$details = $6;
if ($details !~ /$regex_direct_begin/o) {
print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
print " $details\n";
print " $regex_direct_begin\n";
next;
}
my $order = $1;
$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
# Count the event itself
my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
# Record how long direct reclaim took this time
if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
$timestamp = timestamp_to_ms($timestamp);
my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
}
} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
$details = $6;
if ($details !~ /$regex_kswapd_wake/o) {
print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
print " $details\n";
print " $regex_kswapd_wake\n";
next;
}
my $order = $2;
$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
$timestamp = timestamp_to_ms($timestamp);
$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
} else {
$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
}
} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
# Count the event itself
my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
# Record how long kswapd was awake
$timestamp = timestamp_to_ms($timestamp);
my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
$details = $6;
if ($details !~ /$regex_wakeup_kswapd/o) {
print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
print " $details\n";
print " $regex_wakeup_kswapd\n";
next;
}
my $order = $3;
$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
$details = $6;
if ($details !~ /$regex_lru_isolate/o) {
print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
print " $details\n";
print " $regex_lru_isolate/o\n";
next;
}
my $isolate_mode = $1;
my $nr_scanned = $4;
my $file = $6;
# To closer match vmstat scanning statistics, only count isolate_both
# and isolate_inactive as scanning. isolate_active is rotation
# isolate_inactive == 1
# isolate_active == 2
# isolate_both == 3
if ($isolate_mode != 2) {
$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
if ($file == 1) {
$perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED} += $nr_scanned;
} else {
$perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED} += $nr_scanned;
}
}
} elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") {
$details = $6;
if ($details !~ /$regex_lru_shrink_inactive/o) {
print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n";
print " $details\n";
print " $regex_lru_shrink_inactive/o\n";
next;
}
my $nr_reclaimed = $4;
my $flags = $6;
my $file = 0;
if ($flags =~ /RECLAIM_WB_FILE/) {
$file = 1;
}
$perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed;
if ($file) {
$perprocesspid{$process_pid}->{HIGH_NR_FILE_RECLAIMED} += $nr_reclaimed;
} else {
$perprocesspid{$process_pid}->{HIGH_NR_ANON_RECLAIMED} += $nr_reclaimed;
}
} elsif ($tracepoint eq "mm_vmscan_writepage") {
$details = $6;
if ($details !~ /$regex_writepage/o) {
print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
print " $details\n";
print " $regex_writepage\n";
next;
}
my $flags = $3;
my $file = 0;
my $sync_io = 0;
if ($flags =~ /RECLAIM_WB_FILE/) {
$file = 1;
}
if ($flags =~ /RECLAIM_WB_SYNC/) {
$sync_io = 1;
}
if ($sync_io) {
if ($file) {
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++;
} else {
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++;
}
} else {
if ($file) {
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++;
} else {
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++;
}
}
} else {
$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
}
if ($sigint_pending) {
last EVENT_PROCESS;
}
}
}
sub dump_stats {
my $hashref = shift;
my %stats = %$hashref;
# Dump per-process stats
my $process_pid;
my $max_strlen = 0;
# Get the maximum process name
foreach $process_pid (keys %perprocesspid) {
my $len = length($process_pid);
if ($len > $max_strlen) {
$max_strlen = $len;
}
}
$max_strlen += 2;
# Work out latencies
printf("\n") if !$opt_ignorepid;
printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
foreach $process_pid (keys %stats) {
if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
next;
}
printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
my $index = 0;
while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
$total_direct_latency += $latency;
} else {
printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
$total_kswapd_latency += $latency;
}
$index++;
}
print "\n" if !$opt_ignorepid;
}
# Print out process activity
printf("\n");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled");
foreach $process_pid (keys %stats) {
if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
next;
}
$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
$total_direct_nr_file_scanned += $stats{$process_pid}->{HIGH_NR_FILE_SCANNED};
$total_direct_nr_anon_scanned += $stats{$process_pid}->{HIGH_NR_ANON_SCANNED};
$total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
$total_direct_nr_file_reclaimed += $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED};
$total_direct_nr_anon_reclaimed += $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED};
$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
$total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
my $index = 0;
my $this_reclaim_delay = 0;
while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
$this_reclaim_delay += $latency;
$index++;
}
printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f",
$process_pid,
$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
$stats{$process_pid}->{HIGH_NR_SCANNED},
$stats{$process_pid}->{HIGH_NR_FILE_SCANNED},
$stats{$process_pid}->{HIGH_NR_ANON_SCANNED},
$stats{$process_pid}->{HIGH_NR_RECLAIMED},
$stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED},
$stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
$this_reclaim_delay / 1000);
if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
print " ";
for (my $order = 0; $order < 20; $order++) {
my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
if ($count != 0) {
print "direct-$order=$count ";
}
}
}
if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
print " ";
for (my $order = 0; $order < 20; $order++) {
my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
if ($count != 0) {
print "wakeup-$order=$count ";
}
}
}
print "\n";
}
# Print out kswapd activity
printf("\n");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages");
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO");
foreach $process_pid (keys %stats) {
if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
next;
}
$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
$total_kswapd_nr_file_scanned += $stats{$process_pid}->{HIGH_NR_FILE_SCANNED};
$total_kswapd_nr_anon_scanned += $stats{$process_pid}->{HIGH_NR_ANON_SCANNED};
$total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
$total_kswapd_nr_file_reclaimed += $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED};
$total_kswapd_nr_anon_reclaimed += $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED};
$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u",
$process_pid,
$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
$stats{$process_pid}->{HIGH_NR_SCANNED},
$stats{$process_pid}->{HIGH_NR_FILE_SCANNED},
$stats{$process_pid}->{HIGH_NR_ANON_SCANNED},
$stats{$process_pid}->{HIGH_NR_RECLAIMED},
$stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED},
$stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
print " ";
for (my $order = 0; $order < 20; $order++) {
my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
if ($count != 0) {
print "wake-$order=$count ";
}
}
}
if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
print " ";
for (my $order = 0; $order < 20; $order++) {
my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
if ($count != 0) {
print "rewake-$order=$count ";
}
}
}
printf("\n");
}
# Print out summaries
$total_direct_latency /= 1000;
$total_kswapd_latency /= 1000;
print "\nSummary\n";
print "Direct reclaims: $total_direct_reclaim\n";
print "Direct reclaim pages scanned: $total_direct_nr_scanned\n";
print "Direct reclaim file pages scanned: $total_direct_nr_file_scanned\n";
print "Direct reclaim anon pages scanned: $total_direct_nr_anon_scanned\n";
print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n";
print "Direct reclaim file pages reclaimed: $total_direct_nr_file_reclaimed\n";
print "Direct reclaim anon pages reclaimed: $total_direct_nr_anon_reclaimed\n";
print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n";
print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n";
print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n";
print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n";
print "Wake kswapd requests: $total_wakeup_kswapd\n";
printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency;
print "\n";
print "Kswapd wakeups: $total_kswapd_wake\n";
print "Kswapd pages scanned: $total_kswapd_nr_scanned\n";
print "Kswapd file pages scanned: $total_kswapd_nr_file_scanned\n";
print "Kswapd anon pages scanned: $total_kswapd_nr_anon_scanned\n";
print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n";
print "Kswapd file pages reclaimed: $total_kswapd_nr_file_reclaimed\n";
print "Kswapd anon pages reclaimed: $total_kswapd_nr_anon_reclaimed\n";
print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n";
print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n";
print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n";
print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n";
printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency;
}
sub aggregate_perprocesspid() {
my $process_pid;
my $process;
undef %perprocess;
foreach $process_pid (keys %perprocesspid) {
$process = $process_pid;
$process =~ s/-([0-9])*$//;
if ($process eq '') {
$process = "NO_PROCESS_NAME";
}
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
$perprocess{$process}->{HIGH_NR_FILE_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED};
$perprocess{$process}->{HIGH_NR_ANON_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED};
$perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED};
$perprocess{$process}->{HIGH_NR_FILE_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_FILE_RECLAIMED};
$perprocess{$process}->{HIGH_NR_ANON_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_ANON_RECLAIMED};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
for (my $order = 0; $order < 20; $order++) {
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
}
# Aggregate direct reclaim latencies
my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
my $rd_index = 0;
while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
$rd_index++;
$wr_index++;
}
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
# Aggregate kswapd latencies
my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
my $rd_index = 0;
while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
$rd_index++;
$wr_index++;
}
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
}
}
sub report() {
if (!$opt_ignorepid) {
dump_stats(\%perprocesspid);
} else {
aggregate_perprocesspid();
dump_stats(\%perprocess);
}
}
# Process events or signals until neither is available
sub signal_loop() {
my $sigint_processed;
do {
$sigint_processed = 0;
process_events();
# Handle pending signals if any
if ($sigint_pending) {
my $current_time = time;
if ($sigint_exit) {
print "Received exit signal\n";
$sigint_pending = 0;
}
if ($sigint_report) {
if ($current_time >= $sigint_received + 2) {
report();
$sigint_report = 0;
$sigint_pending = 0;
$sigint_processed = 1;
}
}
}
} while ($sigint_pending || $sigint_processed);
}
signal_loop();
report();

View file

@ -0,0 +1,955 @@
Lockless Ring Buffer Design
===========================
Copyright 2009 Red Hat Inc.
Author: Steven Rostedt <srostedt@redhat.com>
License: The GNU Free Documentation License, Version 1.2
(dual licensed under the GPL v2)
Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
and Frederic Weisbecker.
Written for: 2.6.31
Terminology used in this Document
---------------------------------
tail - where new writes happen in the ring buffer.
head - where new reads happen in the ring buffer.
producer - the task that writes into the ring buffer (same as writer)
writer - same as producer
consumer - the task that reads from the buffer (same as reader)
reader - same as consumer.
reader_page - A page outside the ring buffer used solely (for the most part)
by the reader.
head_page - a pointer to the page that the reader will use next
tail_page - a pointer to the page that will be written to next
commit_page - a pointer to the page with the last finished non-nested write.
cmpxchg - hardware-assisted atomic transaction that performs the following:
A = B iff previous A == C
R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
current A is equal to C, and we put the old (current) A into R
R gets the previous A regardless if A is updated with B or not.
To see if the update was successful a compare of R == C may be used.
The Generic Ring Buffer
-----------------------
The ring buffer can be used in either an overwrite mode or in
producer/consumer mode.
Producer/consumer mode is where if the producer were to fill up the
buffer before the consumer could free up anything, the producer
will stop writing to the buffer. This will lose most recent events.
Overwrite mode is where if the producer were to fill up the buffer
before the consumer could free up anything, the producer will
overwrite the older data. This will lose the oldest events.
No two writers can write at the same time (on the same per-cpu buffer),
but a writer may interrupt another writer, but it must finish writing
before the previous writer may continue. This is very important to the
algorithm. The writers act like a "stack". The way interrupts works
enforces this behavior.
writer1 start
<preempted> writer2 start
<preempted> writer3 start
writer3 finishes
writer2 finishes
writer1 finishes
This is very much like a writer being preempted by an interrupt and
the interrupt doing a write as well.
Readers can happen at any time. But no two readers may run at the
same time, nor can a reader preempt/interrupt another reader. A reader
cannot preempt/interrupt a writer, but it may read/consume from the
buffer at the same time as a writer is writing, but the reader must be
on another processor to do so. A reader may read on its own processor
and can be preempted by a writer.
A writer can preempt a reader, but a reader cannot preempt a writer.
But a reader can read the buffer at the same time (on another processor)
as a writer.
The ring buffer is made up of a list of pages held together by a linked list.
At initialization a reader page is allocated for the reader that is not
part of the ring buffer.
The head_page, tail_page and commit_page are all initialized to point
to the same page.
The reader page is initialized to have its next pointer pointing to
the head page, and its previous pointer pointing to a page before
the head page.
The reader has its own page to use. At start up time, this page is
allocated but is not attached to the list. When the reader wants
to read from the buffer, if its page is empty (like it is on start-up),
it will swap its page with the head_page. The old reader page will
become part of the ring buffer and the head_page will be removed.
The page after the inserted page (old reader_page) will become the
new head page.
Once the new page is given to the reader, the reader could do what
it wants with it, as long as a writer has left that page.
A sample of how the reader page is swapped: Note this does not
show the head page in the buffer, it is for demonstrating a swap
only.
+------+
|reader| RING BUFFER
|page |
+------+
+---+ +---+ +---+
| |-->| |-->| |
| |<--| |<--| |
+---+ +---+ +---+
^ | ^ |
| +-------------+ |
+-----------------+
+------+
|reader| RING BUFFER
|page |-------------------+
+------+ v
| +---+ +---+ +---+
| | |-->| |-->| |
| | |<--| |<--| |<-+
| +---+ +---+ +---+ |
| ^ | ^ | |
| | +-------------+ | |
| +-----------------+ |
+------------------------------------+
+------+
|reader| RING BUFFER
|page |-------------------+
+------+ <---------------+ v
| ^ +---+ +---+ +---+
| | | |-->| |-->| |
| | | | | |<--| |<-+
| | +---+ +---+ +---+ |
| | | ^ | |
| | +-------------+ | |
| +-----------------------------+ |
+------------------------------------+
+------+
|buffer| RING BUFFER
|page |-------------------+
+------+ <---------------+ v
| ^ +---+ +---+ +---+
| | | | | |-->| |
| | New | | | |<--| |<-+
| | Reader +---+ +---+ +---+ |
| | page ----^ | |
| | | |
| +-----------------------------+ |
+------------------------------------+
It is possible that the page swapped is the commit page and the tail page,
if what is in the ring buffer is less than what is held in a buffer page.
reader page commit page tail page
| | |
v | |
+---+ | |
| |<----------+ |
| |<------------------------+
| |------+
+---+ |
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
This case is still valid for this algorithm.
When the writer leaves the page, it simply goes into the ring buffer
since the reader page still points to the next location in the ring
buffer.
The main pointers:
reader page - The page used solely by the reader and is not part
of the ring buffer (may be swapped in)
head page - the next page in the ring buffer that will be swapped
with the reader page.
tail page - the page where the next write will take place.
commit page - the page that last finished a write.
The commit page only is updated by the outermost writer in the
writer stack. A writer that preempts another writer will not move the
commit page.
When data is written into the ring buffer, a position is reserved
in the ring buffer and passed back to the writer. When the writer
is finished writing data into that position, it commits the write.
Another write (or a read) may take place at anytime during this
transaction. If another write happens it must finish before continuing
with the previous write.
Write reserve:
Buffer page
+---------+
|written |
+---------+ <--- given back to writer (current commit)
|reserved |
+---------+ <--- tail pointer
| empty |
+---------+
Write commit:
Buffer page
+---------+
|written |
+---------+
|written |
+---------+ <--- next position for write (current commit)
| empty |
+---------+
If a write happens after the first reserve:
Buffer page
+---------+
|written |
+---------+ <-- current commit
|reserved |
+---------+ <--- given back to second writer
|reserved |
+---------+ <--- tail pointer
After second writer commits:
Buffer page
+---------+
|written |
+---------+ <--(last full commit)
|reserved |
+---------+
|pending |
|commit |
+---------+ <--- tail pointer
When the first writer commits:
Buffer page
+---------+
|written |
+---------+
|written |
+---------+
|written |
+---------+ <--(last full commit and tail pointer)
The commit pointer points to the last write location that was
committed without preempting another write. When a write that
preempted another write is committed, it only becomes a pending commit
and will not be a full commit until all writes have been committed.
The commit page points to the page that has the last full commit.
The tail page points to the page with the last write (before
committing).
The tail page is always equal to or after the commit page. It may
be several pages ahead. If the tail page catches up to the commit
page then no more writes may take place (regardless of the mode
of the ring buffer: overwrite and produce/consumer).
The order of pages is:
head page
commit page
tail page
Possible scenario:
tail page
head page commit page |
| | |
v v v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
There is a special case that the head page is after either the commit page
and possibly the tail page. That is when the commit (and tail) page has been
swapped with the reader page. This is because the head page is always
part of the ring buffer, but the reader page is not. Whenever there
has been less than a full page that has been committed inside the ring buffer,
and a reader swaps out a page, it will be swapping out the commit page.
reader page commit page tail page
| | |
v | |
+---+ | |
| |<----------+ |
| |<------------------------+
| |------+
+---+ |
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
^
|
head page
In this case, the head page will not move when the tail and commit
move back into the ring buffer.
The reader cannot swap a page into the ring buffer if the commit page
is still on that page. If the read meets the last commit (real commit
not pending or reserved), then there is nothing more to read.
The buffer is considered empty until another full commit finishes.
When the tail meets the head page, if the buffer is in overwrite mode,
the head page will be pushed ahead one. If the buffer is in producer/consumer
mode, the write will fail.
Overwrite mode:
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
^
|
head page
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
^
|
head page
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
^
|
head page
Note, the reader page will still point to the previous head page.
But when a swap takes place, it will use the most recent head page.
Making the Ring Buffer Lockless:
--------------------------------
The main idea behind the lockless algorithm is to combine the moving
of the head_page pointer with the swapping of pages with the reader.
State flags are placed inside the pointer to the page. To do this,
each page must be aligned in memory by 4 bytes. This will allow the 2
least significant bits of the address to be used as flags, since
they will always be zero for the address. To get the address,
simply mask out the flags.
MASK = ~3
address & MASK
Two flags will be kept by these two bits:
HEADER - the page being pointed to is a head page
UPDATE - the page being pointed to is being updated by a writer
and was or is about to be a head page.
reader page
|
v
+---+
| |------+
+---+ |
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The above pointer "-H->" would have the HEADER flag set. That is
the next page is the next page to be swapped out by the reader.
This pointer means the next page is the head page.
When the tail page meets the head pointer, it will use cmpxchg to
change the pointer to the UPDATE state:
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
"-U->" represents a pointer in the UPDATE state.
Any access to the reader will need to take some sort of lock to serialize
the readers. But the writers will never take a lock to write to the
ring buffer. This means we only need to worry about a single reader,
and writes only preempt in "stack" formation.
When the reader tries to swap the page with the ring buffer, it
will also use cmpxchg. If the flag bit in the pointer to the
head page does not have the HEADER flag set, the compare will fail
and the reader will need to look for the new head page and try again.
Note, the flags UPDATE and HEADER are never set at the same time.
The reader swaps the reader page as follows:
+------+
|reader| RING BUFFER
|page |
+------+
+---+ +---+ +---+
| |--->| |--->| |
| |<---| |<---| |
+---+ +---+ +---+
^ | ^ |
| +---------------+ |
+-----H-------------+
The reader sets the reader page next pointer as HEADER to the page after
the head page.
+------+
|reader| RING BUFFER
|page |-------H-----------+
+------+ v
| +---+ +---+ +---+
| | |--->| |--->| |
| | |<---| |<---| |<-+
| +---+ +---+ +---+ |
| ^ | ^ | |
| | +---------------+ | |
| +-----H-------------+ |
+--------------------------------------+
It does a cmpxchg with the pointer to the previous head page to make it
point to the reader page. Note that the new pointer does not have the HEADER
flag set. This action atomically moves the head page forward.
+------+
|reader| RING BUFFER
|page |-------H-----------+
+------+ v
| ^ +---+ +---+ +---+
| | | |-->| |-->| |
| | | |<--| |<--| |<-+
| | +---+ +---+ +---+ |
| | | ^ | |
| | +-------------+ | |
| +-----------------------------+ |
+------------------------------------+
After the new head page is set, the previous pointer of the head page is
updated to the reader page.
+------+
|reader| RING BUFFER
|page |-------H-----------+
+------+ <---------------+ v
| ^ +---+ +---+ +---+
| | | |-->| |-->| |
| | | | | |<--| |<-+
| | +---+ +---+ +---+ |
| | | ^ | |
| | +-------------+ | |
| +-----------------------------+ |
+------------------------------------+
+------+
|buffer| RING BUFFER
|page |-------H-----------+ <--- New head page
+------+ <---------------+ v
| ^ +---+ +---+ +---+
| | | | | |-->| |
| | New | | | |<--| |<-+
| | Reader +---+ +---+ +---+ |
| | page ----^ | |
| | | |
| +-----------------------------+ |
+------------------------------------+
Another important point: The page that the reader page points back to
by its previous pointer (the one that now points to the new head page)
never points back to the reader page. That is because the reader page is
not part of the ring buffer. Traversing the ring buffer via the next pointers
will always stay in the ring buffer. Traversing the ring buffer via the
prev pointers may not.
Note, the way to determine a reader page is simply by examining the previous
pointer of the page. If the next pointer of the previous page does not
point back to the original page, then the original page is a reader page:
+--------+
| reader | next +----+
| page |-------->| |<====== (buffer page)
+--------+ +----+
| | ^
| v | next
prev | +----+
+------------->| |
+----+
The way the head page moves forward:
When the tail page meets the head page and the buffer is in overwrite mode
and more writes take place, the head page must be moved forward before the
writer may move the tail page. The way this is done is that the writer
performs a cmpxchg to convert the pointer to the head page from the HEADER
flag to have the UPDATE flag set. Once this is done, the reader will
not be able to swap the head page from the buffer, nor will it be able to
move the head page, until the writer is finished with the move.
This eliminates any races that the reader can have on the writer. The reader
must spin, and this is why the reader cannot preempt the writer.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The following page will be made into the new head page.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
After the new head page has been set, we can set the old head page
pointer back to NORMAL.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
After the head page has been moved, the tail page may now move forward.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The above are the trivial updates. Now for the more complex scenarios.
As stated before, if enough writes preempt the first write, the
tail page may make it all the way around the buffer and meet the commit
page. At this time, we must start dropping writes (usually with some kind
of warning to the user). But what happens if the commit was still on the
reader page? The commit page is not part of the ring buffer. The tail page
must account for this.
reader page commit page
| |
v |
+---+ |
| |<----------+
| |
| |------+
+---+ |
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
^
|
tail page
If the tail page were to simply push the head page forward, the commit when
leaving the reader page would not be pointing to the correct page.
The solution to this is to test if the commit page is on the reader page
before pushing the head page. If it is, then it can be assumed that the
tail page wrapped the buffer, and we must drop new writes.
This is not a race condition, because the commit page can only be moved
by the outermost writer (the writer that was preempted).
This means that the commit will not move while a writer is moving the
tail page. The reader cannot swap the reader page if it is also being
used as the commit page. The reader can simply check that the commit
is off the reader page. Once the commit page leaves the reader page
it will never go back on it unless a reader does another swap with the
buffer page that is also the commit page.
Nested writes
-------------
In the pushing forward of the tail page we must first push forward
the head page if the head page is the next page. If the head page
is not the next page, the tail page is simply updated with a cmpxchg.
Only writers move the tail page. This must be done atomically to protect
against nested writers.
temp_page = tail_page
next_page = temp_page->next
cmpxchg(tail_page, temp_page, next_page)
The above will update the tail page if it is still pointing to the expected
page. If this fails, a nested write pushed it forward, the current write
does not need to push it.
temp page
|
v
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Nested write comes in and moves the tail page forward:
tail page (moved by nested writer)
temp page |
| |
v v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The above would fail the cmpxchg, but since the tail page has already
been moved forward, the writer will just try again to reserve storage
on the new tail page.
But the moving of the head page is a bit more complex.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The write converts the head page pointer to UPDATE.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
But if a nested writer preempts here, it will see that the next
page is a head page, but it is also nested. It will detect that
it is nested and will save that information. The detection is the
fact that it sees the UPDATE flag instead of a HEADER or NORMAL
pointer.
The nested writer will set the new head page pointer.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
But it will not reset the update back to normal. Only the writer
that converted a pointer from HEAD to UPDATE will convert it back
to NORMAL.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
After the nested writer finishes, the outermost writer will convert
the UPDATE pointer to NORMAL.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
It can be even more complex if several nested writes came in and moved
the tail page ahead several pages:
(first writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The write converts the head page pointer to UPDATE.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Next writer comes in, and sees the update and sets up the new
head page.
(second writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The nested writer moves the tail page forward. But does not set the old
update page to NORMAL because it is not the outermost writer.
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Another writer preempts and sees the page after the tail page is a head page.
It changes it from HEAD to UPDATE.
(third writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-U->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The writer will move the head page forward:
(third writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-U->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
But now that the third writer did change the HEAD flag to UPDATE it
will convert it to normal:
(third writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Then it will move the tail page, and return back to the second writer.
(second writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The second writer will fail to move the tail page because it was already
moved, so it will try again and add its data to the new tail page.
It will return to the first writer.
(first writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
The first writer cannot know atomically if the tail page moved
while it updates the HEAD page. It will then update the head page to
what it thinks is the new head page.
(first writer)
tail page
|
v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Since the cmpxchg returns the old value of the pointer the first writer
will see it succeeded in updating the pointer from NORMAL to HEAD.
But as we can see, this is not good enough. It must also check to see
if the tail page is either where it use to be or on the next page:
(first writer)
A B tail page
| | |
v v v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |-H->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
If tail page != A and tail page != B, then it must reset the pointer
back to NORMAL. The fact that it only needs to worry about nested
writers means that it only needs to check this after setting the HEAD page.
(first writer)
A B tail page
| | |
v v v
+---+ +---+ +---+ +---+
<---| |--->| |-U->| |--->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Now the writer can update the head page. This is also why the head page must
remain in UPDATE and only reset by the outermost writer. This prevents
the reader from seeing the incorrect head page.
(first writer)
A B tail page
| | |
v v v
+---+ +---+ +---+ +---+
<---| |--->| |--->| |--->| |-H->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+

View file

@ -0,0 +1,327 @@
Notes on Analysing Behaviour Using Events and Tracepoints
Documentation written by Mel Gorman
PCL information heavily based on email from Ingo Molnar
1. Introduction
===============
Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
creating custom kernel modules to register probe functions using the event
tracing infrastructure.
Simplistically, tracepoints represent important events that can be
taken in conjunction with other tracepoints to build a "Big Picture" of
what is going on within the system. There are a large number of methods for
gathering and interpreting these events. Lacking any current Best Practises,
this document describes some of the methods that can be used.
This document assumes that debugfs is mounted on /sys/kernel/debug and that
the appropriate tracing options have been configured into the kernel. It is
assumed that the PCL tool tools/perf has been installed and is in your path.
2. Listing Available Events
===========================
2.1 Standard Utilities
----------------------
All possible events are visible from /sys/kernel/debug/tracing/events. Simply
calling
$ find /sys/kernel/debug/tracing/events -type d
will give a fair indication of the number of events available.
2.2 PCL (Performance Counters for Linux)
-------
Discovery and enumeration of all counters and events, including tracepoints,
are available with the perf tool. Getting a list of available events is a
simple case of:
$ perf list 2>&1 | grep Tracepoint
ext4:ext4_free_inode [Tracepoint event]
ext4:ext4_request_inode [Tracepoint event]
ext4:ext4_allocate_inode [Tracepoint event]
ext4:ext4_write_begin [Tracepoint event]
ext4:ext4_ordered_write_end [Tracepoint event]
[ .... remaining output snipped .... ]
3. Enabling Events
==================
3.1 System-Wide Event Enabling
------------------------------
See Documentation/trace/events.txt for a proper description on how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like:
$ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
3.2 System-Wide Event Enabling with SystemTap
---------------------------------------------
In SystemTap, tracepoints are accessible using the kernel.trace() function
call. The following is an example that reports every 5 seconds what processes
were allocating the pages.
global page_allocs
probe kernel.trace("mm_page_alloc") {
page_allocs[execname()]++
}
function print_count() {
printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
foreach (proc in page_allocs-)
printf("%-25d %s\n", page_allocs[proc], proc)
printf ("\n")
delete page_allocs
}
probe timer.s(5) {
print_count()
}
3.3 System-Wide Event Enabling with PCL
---------------------------------------
By specifying the -a switch and analysing sleep, the system-wide events
for a duration of time can be examined.
$ perf stat -a \
-e kmem:mm_page_alloc -e kmem:mm_page_free \
-e kmem:mm_page_free_batched \
sleep 10
Performance counter stats for 'sleep 10':
9630 kmem:mm_page_alloc
2143 kmem:mm_page_free
7424 kmem:mm_page_free_batched
10.002577764 seconds time elapsed
Similarly, one could execute a shell and exit it as desired to get a report
at that point.
3.4 Local Event Enabling
------------------------
Documentation/trace/ftrace.txt describes how to enable events on a per-thread
basis using set_ftrace_pid.
3.5 Local Event Enablement with PCL
-----------------------------------
Events can be activated and tracked for the duration of a process on a local
basis using PCL such as follows.
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
-e kmem:mm_page_free_batched ./hackbench 10
Time: 0.909
Performance counter stats for './hackbench 10':
17803 kmem:mm_page_alloc
12398 kmem:mm_page_free
4827 kmem:mm_page_free_batched
0.973913387 seconds time elapsed
4. Event Filtering
==================
Documentation/trace/ftrace.txt covers in-depth how to filter events in
ftrace. Obviously using grep and awk of trace_pipe is an option as well
as any script reading trace_pipe.
5. Analysing Event Variances with PCL
=====================================
Any workload can exhibit variances between runs and it can be important
to know what the standard deviation is. By and large, this is left to the
performance analyst to do it by hand. In the event that the discrete event
occurrences are useful to the performance analyst, then perf can be used.
$ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free
-e kmem:mm_page_free_batched ./hackbench 10
Time: 0.890
Time: 0.895
Time: 0.915
Time: 1.001
Time: 0.899
Performance counter stats for './hackbench 10' (5 runs):
16630 kmem:mm_page_alloc ( +- 3.542% )
11486 kmem:mm_page_free ( +- 4.771% )
4730 kmem:mm_page_free_batched ( +- 2.325% )
0.982653002 seconds time elapsed ( +- 1.448% )
In the event that some higher-level event is required that depends on some
aggregation of discrete events, then a script would need to be developed.
Using --repeat, it is also possible to view how events are fluctuating over
time on a system-wide basis using -a and sleep.
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
-e kmem:mm_page_free_batched \
-a --repeat 10 \
sleep 1
Performance counter stats for 'sleep 1' (10 runs):
1066 kmem:mm_page_alloc ( +- 26.148% )
182 kmem:mm_page_free ( +- 5.464% )
890 kmem:mm_page_free_batched ( +- 30.079% )
1.002251757 seconds time elapsed ( +- 0.005% )
6. Higher-Level Analysis with Helper Scripts
============================================
When events are enabled the events that are triggering can be read from
/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
options exist as well. By post-processing the output, further information can
be gathered on-line as appropriate. Examples of post-processing might include
o Reading information from /proc for the PID that triggered the event
o Deriving a higher-level event from a series of lower-level events.
o Calculating latencies between two events
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
script that can read trace_pipe from STDIN or a copy of a trace. When used
on-line, it can be interrupted once to generate a report without exiting
and twice to exit.
Simplistically, the script just reads STDIN and counts up events but it
also can do more such as
o Derive high-level events from many low-level events. If a number of pages
are freed to the main allocator from the per-CPU lists, it recognises
that as one per-CPU drain even though there is no specific tracepoint
for that event
o It can aggregate based on PID or individual process number
o In the event memory is getting externally fragmented, it reports
on whether the fragmentation event was severe or moderate.
o When receiving an event about a PID, it can record who the parent was so
that if large numbers of events are coming from very short-lived
processes, the parent process responsible for creating all the helpers
can be identified
7. Lower-Level Analysis with PCL
================================
There may also be a requirement to identify what functions within a program
were generating events within the kernel. To begin this sort of analysis, the
data must be recorded. At the time of writing, this required root:
$ perf record -c 1 \
-e kmem:mm_page_alloc -e kmem:mm_page_free \
-e kmem:mm_page_free_batched \
./hackbench 10
Time: 0.894
[ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
Note the use of '-c 1' to set the event period to sample. The default sample
period is quite high to minimise overhead but the information collected can be
very coarse as a result.
This record outputted a file called perf.data which can be analysed using
perf report.
$ perf report
# Samples: 30922
#
# Overhead Command Shared Object
# ........ ......... ................................
#
87.27% hackbench [vdso]
6.85% hackbench /lib/i686/cmov/libc-2.9.so
2.62% hackbench /lib/ld-2.9.so
1.52% perf [vdso]
1.22% hackbench ./hackbench
0.48% hackbench [kernel]
0.02% perf /lib/i686/cmov/libc-2.9.so
0.01% perf /usr/bin/perf
0.01% perf /lib/ld-2.9.so
0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
#
# (For more details, try: perf report --sort comm,dso,symbol)
#
According to this, the vast majority of events triggered on events
within the VDSO. With simple binaries, this will often be the case so let's
take a slightly different example. In the course of writing this, it was
noticed that X was generating an insane amount of page allocations so let's look
at it:
$ perf record -c 1 -f \
-e kmem:mm_page_alloc -e kmem:mm_page_free \
-e kmem:mm_page_free_batched \
-p `pidof X`
This was interrupted after a few seconds and
$ perf report
# Samples: 27666
#
# Overhead Command Shared Object
# ........ ....... .......................................
#
51.95% Xorg [vdso]
47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
0.09% Xorg /lib/i686/cmov/libc-2.9.so
0.01% Xorg [kernel]
#
# (For more details, try: perf report --sort comm,dso,symbol)
#
So, almost half of the events are occurring in a library. To get an idea which
symbol:
$ perf report --sort comm,dso,symbol
# Samples: 27666
#
# Overhead Command Shared Object Symbol
# ........ ....... ....................................... ......
#
51.95% Xorg [vdso] [.] 0x000000ffffe424
47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
0.01% Xorg [kernel] [k] read_hpet
0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
0.00% Xorg [kernel] [k] ftrace_trace_userstack
To see where within the function pixmanFillsse2 things are going wrong:
$ perf annotate pixmanFillsse2
[ ... ]
0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
: }
:
: extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
: _mm_store_si128 (__m128i *__P, __m128i __B) : {
: *__P = __B;
12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
0.00 : 34ef5: ff
12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
0.00 : 34efd: ff
12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
0.00 : 34f05: ff
12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
0.00 : 34f0d: ff
12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much but a starting point would be to take an
ancient build of libpixmap out of the library path where it was totally
forgotten about from months ago!

View file

@ -0,0 +1,145 @@
Using the Linux Kernel Tracepoints
Mathieu Desnoyers
This document introduces Linux Kernel Tracepoints and their use. It
provides examples of how to insert tracepoints in the kernel and
connect probe functions to them and provides some examples of probe
functions.
* Purpose of tracepoints
A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty
(checking a condition for a branch) and space penalty (adding a few
bytes for the function call at the end of the instrumented function
and adds a data structure in a separate section). When a tracepoint
is "on", the function you provide is called each time the tracepoint
is executed, in the execution context of the caller. When the function
provided ends its execution, it returns to the caller (continuing from
the tracepoint site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters,
which prototypes are described in a tracepoint declaration placed in a
header file.
They can be used for tracing and performance accounting.
* Usage
Two elements are required for tracepoints :
- A tracepoint definition, placed in a header file.
- The tracepoint statement, in C code.
In order to use tracepoints, you should include linux/tracepoint.h.
In include/trace/events/subsys.h :
#undef TRACE_SYSTEM
#define TRACE_SYSTEM subsys
#if !defined(_TRACE_SUBSYS_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SUBSYS_H
#include <linux/tracepoint.h>
DECLARE_TRACE(subsys_eventname,
TP_PROTO(int firstarg, struct task_struct *p),
TP_ARGS(firstarg, p));
#endif /* _TRACE_SUBSYS_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
In subsys/file.c (where the tracing statement must be added) :
#include <trace/events/subsys.h>
#define CREATE_TRACE_POINTS
DEFINE_TRACE(subsys_eventname);
void somefct(void)
{
...
trace_subsys_eventname(arg, task);
...
}
Where :
- subsys_eventname is an identifier unique to your event
- subsys is the name of your subsystem.
- eventname is the name of the event to trace.
- TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the
function called by this tracepoint.
- TP_ARGS(firstarg, p) are the parameters names, same as found in the
prototype.
- if you use the header in multiple source files, #define CREATE_TRACE_POINTS
should appear only in one source file.
Connecting a function (probe) to a tracepoint is done by providing a
probe (function to call) for the specific tracepoint through
register_trace_subsys_eventname(). Removing a probe is done through
unregister_trace_subsys_eventname(); it will remove the probe.
tracepoint_synchronize_unregister() must be called before the end of
the module exit function to make sure there is no caller left using
the probe. This, and the fact that preemption is disabled around the
probe call, make sure that probe removal and module unload are safe.
The tracepoint mechanism supports inserting multiple instances of the
same tracepoint, but a single definition must be made of a given
tracepoint name over all the kernel to make sure no type conflict will
occur. Name mangling of the tracepoints is done using the prototypes
to make sure typing is correct. Verification of probe type correctness
is done at the registration site by the compiler. Tracepoints can be
put in inline functions, inlined static functions, and unrolled loops
as well as regular functions.
The naming scheme "subsys_event" is suggested here as a convention
intended to limit collisions. Tracepoint names are global to the
kernel: they are considered as being the same whether they are in the
core kernel image or in modules.
If the tracepoint has to be used in kernel modules, an
EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be
used to export the defined tracepoints.
If you need to do a bit of work for a tracepoint parameter, and
that work is only used for the tracepoint, that work can be encapsulated
within an if statement with the following:
if (trace_foo_bar_enabled()) {
int i;
int tot = 0;
for (i = 0; i < count; i++)
tot += calculate_nuggets();
trace_foo_bar(tot);
}
All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled()
function defined that returns true if the tracepoint is enabled and
false otherwise. The trace_<tracepoint>() should always be within the
block of the if (trace_<tracepoint>_enabled()) to prevent races between
the tracepoint being enabled and the check being seen.
The advantage of using the trace_<tracepoint>_enabled() is that it uses
the static_key of the tracepoint to allow the if statement to be implemented
with jump labels and avoid conditional branches.
Note: The convenience macro TRACE_EVENT provides an alternative way to
define tracepoints. Check http://lwn.net/Articles/379903,
http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
for a series of articles with more details.

View file

@ -0,0 +1,159 @@
Uprobe-tracer: Uprobe-based Event Tracing
=========================================
Documentation written by Srikar Dronamraju
Overview
--------
Uprobe based trace events are similar to kprobe based trace events.
To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y.
Similar to the kprobe-event tracer, this doesn't need to be activated via
current_tracer. Instead of that, add probe points via
/sys/kernel/debug/tracing/uprobe_events, and enable it via
/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
However unlike kprobe-event tracer, the uprobe event interface expects the
user to calculate the offset of the probepoint in the object.
Synopsis of uprobe_tracer
-------------------------
p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
-:[GRP/]EVENT : Clear uprobe or uretprobe event
GRP : Group name. If omitted, "uprobes" is the default value.
EVENT : Event name. If omitted, the event name is generated based
on PATH+OFFSET.
PATH : Path to an executable or a library.
OFFSET : Offset where the probe is inserted.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
@ADDR : Fetch memory at ADDR (ADDR should be in userspace)
@+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH)
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address.
$retval : Fetch return value.(*)
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
are supported.
(*) only for return probe.
(**) this is useful for fetching a field of data structures.
Types
-----
Several types are supported for fetch-args. Uprobe tracer will access memory
by given type. Prefix 's' and 'u' means those types are signed and unsigned
respectively. Traced arguments are shown in decimal (signed) or hex (unsigned).
String type is a special type, which fetches a "null-terminated" string from
user space.
Bitfield is another special type, which takes 3 parameters, bit-width, bit-
offset, and container-size (usually 32). The syntax is;
b<bit-width>@<bit-offset>/<container-size>
Event Profiling
---------------
You can check the total number of probe hits and probe miss-hits via
/sys/kernel/debug/tracing/uprobe_profile.
The first column is event name, the second is the number of probe hits,
the third is the number of probe miss-hits.
Usage examples
--------------
* Add a probe as a new uprobe event, write a new definition to uprobe_events
as below: (sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash)
echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
* Add a probe as a new uretprobe event:
echo 'r: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
* Unset registered event:
echo '-:bash_0x4245c0' >> /sys/kernel/debug/tracing/uprobe_events
* Print out the events that are registered:
cat /sys/kernel/debug/tracing/uprobe_events
* Clear all events:
echo > /sys/kernel/debug/tracing/uprobe_events
Following example shows how to dump the instruction pointer and %ax register
at the probed text address. Probe zfree function in /bin/zsh:
# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
0x46420 is the offset of zfree in object /bin/zsh that is loaded at
0x00400000. Hence the command to uprobe would be:
# echo 'p:zfree_entry /bin/zsh:0x46420 %ip %ax' > uprobe_events
And the same for the uretprobe would be:
# echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> uprobe_events
Please note: User has to explicitly calculate the offset of the probe-point
in the object. We can see the events that are registered by looking at the
uprobe_events file.
# cat uprobe_events
p:uprobes/zfree_entry /bin/zsh:0x00046420 arg1=%ip arg2=%ax
r:uprobes/zfree_exit /bin/zsh:0x00046420 arg1=%ip arg2=%ax
Format of events can be seen by viewing the file events/uprobes/zfree_entry/format
# cat events/uprobes/zfree_entry/format
name: zfree_entry
ID: 922
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_padding; offset:8; size:4; signed:1;
field:unsigned long __probe_ip; offset:12; size:4; signed:0;
field:u32 arg1; offset:16; size:4; signed:0;
field:u32 arg2; offset:20; size:4; signed:0;
print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2
Right after definition, each event is disabled by default. For tracing these
events, you need to enable it by:
# echo 1 > events/uprobes/enable
Lets disable the event after sleeping for some time.
# sleep 20
# echo 0 > events/uprobes/enable
And you can see the traced information via /sys/kernel/debug/tracing/trace.
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79
zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79
zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
Output shows us uprobe was triggered for a pid 24842 with ip being 0x446420
and contents of ax register being 79. And uretprobe was triggered with ip at
0x446540 with counterpart function entry at 0x446420.