Mirror of https://github.com/AetherDroid/android_kernel_samsung_on5xelte.git
Fixed MTP to work with TWRP
Commit f6dfaef42e: 50820 changed files with 20846062 additions and 0 deletions
Documentation/trace/events-kmem.txt (new file, 107 lines)
Subsystem Trace Points: kmem
|
||||
|
||||
The kmem tracing system captures events related to object and page allocation
|
||||
within the kernel. Broadly speaking there are five major subheadings.
|
||||
|
||||
o Slab allocation of small objects of unknown type (kmalloc)
|
||||
o Slab allocation of small objects of known type
|
||||
o Page allocation
|
||||
o Per-CPU Allocator Activity
|
||||
o External Fragmentation
|
||||
|
||||
This document describes what each of the tracepoints is and why they
|
||||
might be useful.
|
||||
|
||||
1. Slab allocation of small objects of unknown type
|
||||
===================================================
|
||||
kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
|
||||
kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
|
||||
kfree call_site=%lx ptr=%p
|
||||
|
||||
Heavy activity for these events may indicate that a specific cache is
|
||||
justified, particularly if kmalloc slab pages are getting significantly
|
||||
internally fragmented as a result of the allocation pattern. By correlating
|
||||
kmalloc with kfree, it may be possible to identify memory leaks and where
|
||||
the allocation sites were.
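For example, a minimal session for watching these events might look like the
following sketch (assuming debugfs is mounted at /sys/kernel/debug; the
bytes_req threshold is purely illustrative):

# cd /sys/kernel/debug/tracing
# echo 'bytes_req > 4096' > events/kmem/kmalloc/filter
# echo kmem:kmalloc >> set_event
# echo kmem:kfree >> set_event
# cat trace_pipe

Matching the ptr value of a kfree record against earlier kmalloc records is one
way to spot allocations that are never freed.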
|
||||
|
||||
|
||||
2. Slab allocation of small objects of known type
|
||||
=================================================
|
||||
kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
|
||||
kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
|
||||
kmem_cache_free call_site=%lx ptr=%p
|
||||
|
||||
These events are similar in usage to the kmalloc-related events except that
|
||||
it is likely easier to pin the event down to a specific cache. At the time
|
||||
of writing, no information is available on what slab is being allocated from,
|
||||
but the call_site can usually be used to extrapolate that information.
|
||||
|
||||
3. Page allocation
|
||||
==================
|
||||
mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
|
||||
mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
|
||||
mm_page_free page=%p pfn=%lu order=%d
|
||||
mm_page_free_batched page=%p pfn=%lu order=%d cold=%d
|
||||
|
||||
These four events deal with page allocation and freeing. mm_page_alloc is
|
||||
a simple indicator of page allocator activity. Pages may be allocated from
|
||||
the per-CPU allocator (high performance) or the buddy allocator.
|
||||
|
||||
If pages are allocated directly from the buddy allocator, the
|
||||
mm_page_alloc_zone_locked event is triggered. This event is important as high
|
||||
amounts of activity imply high activity on the zone->lock. Taking this lock
|
||||
impairs performance by disabling interrupts, dirtying cache lines between
|
||||
CPUs and serialising many CPUs.
|
||||
|
||||
When a page is freed directly by the caller, only the mm_page_free event
|
||||
is triggered. Significant amounts of activity here could indicate that the
|
||||
callers should be batching their activities.
|
||||
|
||||
When pages are freed in batch, the mm_page_free_batched event is also triggered.
|
||||
Broadly speaking, pages are taken off the LRU list in bulk and
|
||||
freed in batch with a page list. Significant amounts of activity here could
|
||||
indicate that the system is under memory pressure and can also indicate
|
||||
contention on the zone->lru_lock.
|
||||
|
||||
4. Per-CPU Allocator Activity
|
||||
=============================
|
||||
mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
|
||||
mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
|
||||
|
||||
In front of the page allocator is a per-cpu page allocator. It exists only
|
||||
for order-0 pages, reduces contention on the zone->lock and reduces the
|
||||
amount of writing on struct page.
|
||||
|
||||
When a per-CPU list is empty or pages of the wrong type are allocated,
|
||||
the zone->lock will be taken once and the per-CPU list refilled. The event
|
||||
triggered is mm_page_alloc_zone_locked for each page allocated with the
|
||||
event indicating whether it is for a percpu_refill or not.
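As a rough sketch, one way to watch only the refill activity is to filter on
that flag (assuming the percpu_refill field is exposed in this event's 'format'
file on your kernel):

# cd /sys/kernel/debug/tracing
# echo 'percpu_refill == 1' > events/kmem/mm_page_alloc_zone_locked/filter
# echo 1 > events/kmem/mm_page_alloc_zone_locked/enable
# cat trace_pipe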
|
||||
|
||||
When the per-CPU list is too full, a number of pages are freed, each of
|
||||
which triggers a mm_page_pcpu_drain event.
|
||||
|
||||
The events are emitted per page so that individual pages can be tracked
between allocation and freeing. A run of drain or refill events occurring
consecutively implies that the zone->lock was taken once. Large amounts of per-CPU
|
||||
refills and drains could imply an imbalance between CPUs where too much work
|
||||
is being concentrated in one place. It could also indicate that the per-CPU
|
||||
lists should be a larger size. Finally, large amounts of refills on one CPU
|
||||
and drains on another could be a factor in causing large amounts of cache
|
||||
line bounces due to writes between CPUs; it is worth investigating whether pages
|
||||
can be allocated and freed on the same CPU through some algorithm change.
|
||||
|
||||
5. External Fragmentation
|
||||
=========================
|
||||
mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
|
||||
|
||||
External fragmentation affects whether a high-order allocation will be
|
||||
successful or not. For some types of hardware, this is important although
|
||||
it is avoided where possible. If the system is using huge pages and needs
|
||||
to be able to resize the pool over the lifetime of the system, this value
|
||||
is important.
|
||||
|
||||
A large number of these events implies that memory is fragmenting and
|
||||
high-order allocations will start failing at some time in the future. One
|
||||
means of reducing the occurrence of this event is to increase the size of
|
||||
min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
|
||||
pageblock_size is usually the default hugepage size.
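As an illustrative (not prescriptive) sketch, the event can be enabled and
min_free_kbytes bumped by one such increment; the 2048 kB pageblock size and
the single online node are assumptions, so check your own system first:

# echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable
# old=$(cat /proc/sys/vm/min_free_kbytes)
# echo $((old + 3 * 2048 * 1)) > /proc/sys/vm/min_free_kbytes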
|
Documentation/trace/events-nmi.txt (new file, 43 lines)
NMI Trace Events
|
||||
|
||||
These events normally show up here:
|
||||
|
||||
/sys/kernel/debug/tracing/events/nmi
|
||||
|
||||
--
|
||||
|
||||
nmi_handler:
|
||||
|
||||
You might want to use this tracepoint if you suspect that your
|
||||
NMI handlers are hogging large amounts of CPU time. The kernel
|
||||
will warn if it sees long-running handlers:
|
||||
|
||||
INFO: NMI handler took too long to run: 9.207 msecs
|
||||
|
||||
and this tracepoint will allow you to drill down and get some
|
||||
more details.
|
||||
|
||||
Let's say you suspect that perf_event_nmi_handler() is causing
|
||||
you some problems and you only want to trace that handler
|
||||
specifically. You need to find its address:
|
||||
|
||||
$ grep perf_event_nmi_handler /proc/kallsyms
|
||||
ffffffff81625600 t perf_event_nmi_handler
|
||||
|
||||
Let's also say you are only interested in when that function is
|
||||
really hogging a lot of CPU time, like a millisecond at a time.
|
||||
Note that the kernel's output is in milliseconds, but the input
|
||||
to the filter is in nanoseconds! You can filter on 'delta_ns':
|
||||
|
||||
cd /sys/kernel/debug/tracing/events/nmi/nmi_handler
|
||||
echo 'handler==0xffffffff81625600 && delta_ns>1000000' > filter
|
||||
echo 1 > enable
|
||||
|
||||
Your output would then look like:
|
||||
|
||||
$ cat /sys/kernel/debug/tracing/trace_pipe
|
||||
<idle>-0 [000] d.h3 505.397558: nmi_handler: perf_event_nmi_handler() delta_ns: 3236765 handled: 1
|
||||
<idle>-0 [000] d.h3 505.805893: nmi_handler: perf_event_nmi_handler() delta_ns: 3174234 handled: 1
|
||||
<idle>-0 [000] d.h3 506.158206: nmi_handler: perf_event_nmi_handler() delta_ns: 3084642 handled: 1
|
||||
<idle>-0 [000] d.h3 506.334346: nmi_handler: perf_event_nmi_handler() delta_ns: 3080351 handled: 1
|
||||
|
Documentation/trace/events-power.txt (new file, 97 lines)
Subsystem Trace Points: power
|
||||
|
||||
The power tracing system captures events related to power transitions
|
||||
within the kernel. Broadly speaking there are three major subheadings:
|
||||
|
||||
o Power state switch which reports events related to suspend (S-states),
|
||||
cpuidle (C-states) and cpufreq (P-states)
|
||||
o System clock related changes
|
||||
o Power domains related changes and transitions
|
||||
|
||||
This document describes what each of the tracepoints is and why they
|
||||
might be useful.
|
||||
|
||||
Cf. include/trace/events/power.h for the events definitions.
|
||||
|
||||
1. Power state switch events
|
||||
============================
|
||||
|
||||
1.1 Trace API
|
||||
-----------------
|
||||
|
||||
A 'cpu' event class gathers the CPU-related events: cpuidle and
|
||||
cpufreq.
|
||||
|
||||
cpu_idle "state=%lu cpu_id=%lu"
|
||||
cpu_frequency "state=%lu cpu_id=%lu"
|
||||
cpu_frequency_limits "min=%lu max=%lu cpu_id=%lu"
|
||||
|
||||
A suspend event is used to indicate the system going in and out of the
|
||||
suspend mode:
|
||||
|
||||
machine_suspend "state=%lu"
|
||||
|
||||
|
||||
Note: the value of '-1' or '4294967295' for state means an exit from the current state,
|
||||
i.e. trace_cpu_idle(4, smp_processor_id()) means that the system
|
||||
enters the idle state 4, while trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id())
|
||||
means that the system exits the previous idle state.
|
||||
|
||||
The event which has 'state=4294967295' in the trace is very important to the user
|
||||
space tools which are using it to detect the end of the current state, and so to
|
||||
correctly draw the states diagrams and to calculate accurate statistics etc.
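For example, a minimal way to watch idle state transitions, including the exit
records described above, might look like this (mount point assumed to be
/sys/kernel/debug; the optional filter on the state field hides the exit
records and is purely illustrative):

# cd /sys/kernel/debug/tracing
# echo 'state != 4294967295' > events/power/cpu_idle/filter
# echo 1 > events/power/cpu_idle/enable
# cat trace_pipe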
|
||||
|
||||
2. Clocks events
|
||||
================
|
||||
The clock events are used for clock enable/disable and for
|
||||
clock rate change.
|
||||
|
||||
clock_enable "%s state=%lu cpu_id=%lu"
|
||||
clock_disable "%s state=%lu cpu_id=%lu"
|
||||
clock_set_rate "%s state=%lu cpu_id=%lu"
|
||||
|
||||
The first parameter gives the clock name (e.g. "gpio1_iclk").
|
||||
The second parameter is '1' for enable, '0' for disable, or the target
|
||||
clock rate for set_rate.
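A small usage sketch, assuming these clock tracepoints are exposed under the
power subsystem directory on your kernel:

# cd /sys/kernel/debug/tracing
# echo 'power:clock_set_rate' >> set_event
# cat trace_pipe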
|
||||
|
||||
3. Power domains events
|
||||
=======================
|
||||
The power domain events are used for power domain transitions.
|
||||
|
||||
power_domain_target "%s state=%lu cpu_id=%lu"
|
||||
|
||||
The first parameter gives the power domain name (e.g. "mpu_pwrdm").
|
||||
The second parameter is the power domain target state.
|
||||
|
||||
4. PM QoS events
|
||||
================
|
||||
The PM QoS events are used for QoS add/update/remove requests and for
|
||||
target/flags update.
|
||||
|
||||
pm_qos_add_request "pm_qos_class=%s value=%d"
|
||||
pm_qos_update_request "pm_qos_class=%s value=%d"
|
||||
pm_qos_remove_request "pm_qos_class=%s value=%d"
|
||||
pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld"
|
||||
|
||||
The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY").
|
||||
The second parameter is the value to be added/updated/removed.
The third parameter is the timeout value in usec.
|
||||
|
||||
pm_qos_update_target "action=%s prev_value=%d curr_value=%d"
|
||||
pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x"
|
||||
|
||||
The first parameter gives the QoS action name (e.g. "ADD_REQ").
|
||||
The second parameter is the previous QoS value.
|
||||
The third parameter is the current QoS value to update.
|
||||
|
||||
There are also events used for device PM QoS add/update/remove requests.
|
||||
|
||||
dev_pm_qos_add_request "device=%s type=%s new_value=%d"
|
||||
dev_pm_qos_update_request "device=%s type=%s new_value=%d"
|
||||
dev_pm_qos_remove_request "device=%s type=%s new_value=%d"
|
||||
|
||||
The first parameter gives the name of the device that adds/updates/removes
|
||||
QoS requests.
|
||||
The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY").
|
||||
The third parameter is the value to be added/updated/removed.
|
Documentation/trace/events.txt (new file, 496 lines)
Event Tracing
|
||||
|
||||
Documentation written by Theodore Ts'o
|
||||
Updated by Li Zefan and Tom Zanussi
|
||||
|
||||
1. Introduction
|
||||
===============
|
||||
|
||||
Tracepoints (see Documentation/trace/tracepoints.txt) can be used
|
||||
without creating custom kernel modules to register probe functions
|
||||
using the event tracing infrastructure.
|
||||
|
||||
Not all tracepoints can be traced using the event tracing system;
|
||||
the kernel developer must provide code snippets which define how the
|
||||
tracing information is saved into the tracing buffer, and how the
|
||||
tracing information should be printed.
|
||||
|
||||
2. Using Event Tracing
|
||||
======================
|
||||
|
||||
2.1 Via the 'set_event' interface
|
||||
---------------------------------
|
||||
|
||||
The events which are available for tracing can be found in the file
|
||||
/sys/kernel/debug/tracing/available_events.
|
||||
|
||||
To enable a particular event, such as 'sched_wakeup', simply echo it
|
||||
to /sys/kernel/debug/tracing/set_event. For example:
|
||||
|
||||
# echo sched_wakeup >> /sys/kernel/debug/tracing/set_event
|
||||
|
||||
[ Note: '>>' is necessary; otherwise it will first disable
|
||||
all the events. ]
|
||||
|
||||
To disable an event, echo the event name to the set_event file prefixed
|
||||
with an exclamation point:
|
||||
|
||||
# echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
|
||||
|
||||
To disable all events, echo an empty line to the set_event file:
|
||||
|
||||
# echo > /sys/kernel/debug/tracing/set_event
|
||||
|
||||
To enable all events, echo '*:*' or '*:' to the set_event file:
|
||||
|
||||
# echo *:* > /sys/kernel/debug/tracing/set_event
|
||||
|
||||
The events are organized into subsystems, such as ext4, irq, sched,
|
||||
etc., and a full event name looks like this: <subsystem>:<event>. The
|
||||
subsystem name is optional, but it is displayed in the available_events
|
||||
file. All of the events in a subsystem can be specified via the syntax
|
||||
"<subsystem>:*"; for example, to enable all irq events, you can use the
|
||||
command:
|
||||
|
||||
# echo 'irq:*' > /sys/kernel/debug/tracing/set_event
|
||||
|
||||
2.2 Via the 'enable' toggle
|
||||
---------------------------
|
||||
|
||||
The available events are also listed in the /sys/kernel/debug/tracing/events/ hierarchy
|
||||
of directories.
|
||||
|
||||
To enable event 'sched_wakeup':
|
||||
|
||||
# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
|
||||
|
||||
To disable it:
|
||||
|
||||
# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
|
||||
|
||||
To enable all events in sched subsystem:
|
||||
|
||||
# echo 1 > /sys/kernel/debug/tracing/events/sched/enable
|
||||
|
||||
To enable all events:
|
||||
|
||||
# echo 1 > /sys/kernel/debug/tracing/events/enable
|
||||
|
||||
When reading one of these enable files, there are four results:
|
||||
|
||||
0 - all events this file affects are disabled
|
||||
1 - all events this file affects are enabled
|
||||
X - there is a mixture of events enabled and disabled
|
||||
? - this file does not affect any event
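For example, enabling a single sched event and then reading the subsystem's
enable file would be expected to show the mixed state 'X' (paths as used
elsewhere in this document):

# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
# cat /sys/kernel/debug/tracing/events/sched/enable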
|
||||
|
||||
2.3 Boot option
|
||||
---------------
|
||||
|
||||
In order to facilitate early boot debugging, use the boot option:
|
||||
|
||||
trace_event=[event-list]
|
||||
|
||||
event-list is a comma-separated list of events. See section 2.1 for the event
|
||||
format.
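For example, a kernel command line enabling a couple of sched events from early
boot might look like this (the choice of events is purely illustrative):

trace_event=sched:sched_switch,sched:sched_wakeup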
|
||||
|
||||
3. Defining an event-enabled tracepoint
|
||||
=======================================
|
||||
|
||||
See the example provided in samples/trace_events.
|
||||
|
||||
4. Event formats
|
||||
================
|
||||
|
||||
Each trace event has a 'format' file associated with it that contains
|
||||
a description of each field in a logged event. This information can
|
||||
be used to parse the binary trace stream, and is also the place to
|
||||
find the field names that can be used in event filters (see section 5).
|
||||
|
||||
It also displays the format string that will be used to print the
|
||||
event in text mode, along with the event name and ID used for
|
||||
profiling.
|
||||
|
||||
Every event has a set of 'common' fields associated with it; these are
|
||||
the fields prefixed with 'common_'. The other fields vary between
|
||||
events and correspond to the fields defined in the TRACE_EVENT
|
||||
definition for that event.
|
||||
|
||||
Each field in the format has the form:
|
||||
|
||||
field:field-type field-name; offset:N; size:N;
|
||||
|
||||
where offset is the offset of the field in the trace record and size
|
||||
is the size of the data item, in bytes.
|
||||
|
||||
For example, here's the information displayed for the 'sched_wakeup'
|
||||
event:
|
||||
|
||||
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
|
||||
|
||||
name: sched_wakeup
|
||||
ID: 60
|
||||
format:
|
||||
field:unsigned short common_type; offset:0; size:2;
|
||||
field:unsigned char common_flags; offset:2; size:1;
|
||||
field:unsigned char common_preempt_count; offset:3; size:1;
|
||||
field:int common_pid; offset:4; size:4;
|
||||
field:int common_tgid; offset:8; size:4;
|
||||
|
||||
field:char comm[TASK_COMM_LEN]; offset:12; size:16;
|
||||
field:pid_t pid; offset:28; size:4;
|
||||
field:int prio; offset:32; size:4;
|
||||
field:int success; offset:36; size:4;
|
||||
field:int cpu; offset:40; size:4;
|
||||
|
||||
print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
|
||||
REC->prio, REC->success, REC->cpu
|
||||
|
||||
This event contains 10 fields, the first 5 common and the remaining 5
|
||||
event-specific. All the fields for this event are numeric, except for
|
||||
'comm' which is a string, a distinction important for event filtering.
|
||||
|
||||
5. Event filtering
|
||||
==================
|
||||
|
||||
Trace events can be filtered in the kernel by associating boolean
|
||||
'filter expressions' with them. As soon as an event is logged into
|
||||
the trace buffer, its fields are checked against the filter expression
|
||||
associated with that event type. An event with field values that
|
||||
'match' the filter will appear in the trace output, and an event whose
|
||||
values don't match will be discarded. An event with no filter
|
||||
associated with it matches everything, and is the default when no
|
||||
filter has been set for an event.
|
||||
|
||||
5.1 Expression syntax
|
||||
---------------------
|
||||
|
||||
A filter expression consists of one or more 'predicates' that can be
|
||||
combined using the logical operators '&&' and '||'. A predicate is
|
||||
simply a clause that compares the value of a field contained within a
|
||||
logged event with a constant value and returns either 0 or 1 depending
|
||||
on whether the field value matched (1) or didn't match (0):
|
||||
|
||||
field-name relational-operator value
|
||||
|
||||
Parentheses can be used to provide arbitrary logical groupings and
|
||||
double-quotes can be used to prevent the shell from interpreting
|
||||
operators as shell metacharacters.
|
||||
|
||||
The field-names available for use in filters can be found in the
|
||||
'format' files for trace events (see section 4).
|
||||
|
||||
The relational-operators depend on the type of the field being tested:
|
||||
|
||||
The operators available for numeric fields are:
|
||||
|
||||
==, !=, <, <=, >, >=, &
|
||||
|
||||
And for string fields they are:
|
||||
|
||||
==, !=, ~
|
||||
|
||||
The glob (~) only accepts a wildcard character (*) at the start and/or
|
||||
end of the string. For example:
|
||||
|
||||
prev_comm ~ "*sh"
|
||||
prev_comm ~ "sh*"
|
||||
prev_comm ~ "*sh*"
|
||||
|
||||
But it does not allow the wildcard to appear within the string:
|
||||
|
||||
prev_comm ~ "ba*sh" <-- is invalid
|
||||
|
||||
5.2 Setting filters
|
||||
-------------------
|
||||
|
||||
A filter for an individual event is set by writing a filter expression
|
||||
to the 'filter' file for the given event.
|
||||
|
||||
For example:
|
||||
|
||||
# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
|
||||
# echo "common_preempt_count > 4" > filter
|
||||
|
||||
A slightly more involved example:
|
||||
|
||||
# cd /sys/kernel/debug/tracing/events/signal/signal_generate
|
||||
# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
|
||||
|
||||
If there is an error in the expression, you'll get an 'Invalid
|
||||
argument' error when setting it, and the erroneous string along with
|
||||
an error message can be seen by looking at the filter e.g.:
|
||||
|
||||
# cd /sys/kernel/debug/tracing/events/signal/signal_generate
|
||||
# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
|
||||
-bash: echo: write error: Invalid argument
|
||||
# cat filter
|
||||
((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
|
||||
^
|
||||
parse_error: Field not found
|
||||
|
||||
Currently the caret ('^') for an error always appears at the beginning of
|
||||
the filter string; the error message should still be useful though
|
||||
even without more accurate position info.
|
||||
|
||||
5.3 Clearing filters
|
||||
--------------------
|
||||
|
||||
To clear the filter for an event, write a '0' to the event's filter
|
||||
file.
|
||||
|
||||
To clear the filters for all events in a subsystem, write a '0' to the
|
||||
subsystem's filter file.
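For example (paths as used elsewhere in this document):

# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
# echo 0 > /sys/kernel/debug/tracing/events/sched/filter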
|
||||
|
||||
5.4 Subsystem filters
|
||||
---------------------
|
||||
|
||||
For convenience, filters for every event in a subsystem can be set or
|
||||
cleared as a group by writing a filter expression into the filter file
|
||||
at the root of the subsystem. Note however, that if a filter for any
|
||||
event within the subsystem lacks a field specified in the subsystem
|
||||
filter, or if the filter can't be applied for any other reason, the
|
||||
filter for that event will retain its previous setting. This can
|
||||
result in an unintended mixture of filters which could lead to
|
||||
confusing (to the user who might think different filters are in
|
||||
effect) trace output. Only filters that reference just the common
|
||||
fields can be guaranteed to propagate successfully to all events.
|
||||
|
||||
Here are a few subsystem filter examples that also illustrate the
|
||||
above points:
|
||||
|
||||
Clear the filters on all events in the sched subsystem:
|
||||
|
||||
# cd /sys/kernel/debug/tracing/events/sched
|
||||
# echo 0 > filter
|
||||
# cat sched_switch/filter
|
||||
none
|
||||
# cat sched_wakeup/filter
|
||||
none
|
||||
|
||||
Set a filter using only common fields for all events in the sched
|
||||
subsystem (all events end up with the same filter):
|
||||
|
||||
# cd /sys/kernel/debug/tracing/events/sched
|
||||
# echo common_pid == 0 > filter
|
||||
# cat sched_switch/filter
|
||||
common_pid == 0
|
||||
# cat sched_wakeup/filter
|
||||
common_pid == 0
|
||||
|
||||
Attempt to set a filter using a non-common field for all events in the
|
||||
sched subsystem (all events but those that have a prev_pid field retain
|
||||
their old filters):
|
||||
|
||||
# cd /sys/kernel/debug/tracing/events/sched
|
||||
# echo prev_pid == 0 > filter
|
||||
# cat sched_switch/filter
|
||||
prev_pid == 0
|
||||
# cat sched_wakeup/filter
|
||||
common_pid == 0
|
||||
|
||||
6. Event triggers
|
||||
=================
|
||||
|
||||
Trace events can be made to conditionally invoke trigger 'commands'
|
||||
which can take various forms and are described in detail below;
|
||||
examples would be enabling or disabling other trace events or invoking
|
||||
a stack trace whenever the trace event is hit. Whenever a trace event
|
||||
with attached triggers is invoked, the set of trigger commands
|
||||
associated with that event is invoked. Any given trigger can
|
||||
additionally have an event filter of the same form as described in
|
||||
section 5 (Event filtering) associated with it - the command will only
|
||||
be invoked if the event being invoked passes the associated filter.
|
||||
If no filter is associated with the trigger, it always passes.
|
||||
|
||||
Triggers are added to and removed from a particular event by writing
|
||||
trigger expressions to the 'trigger' file for the given event.
|
||||
|
||||
A given event can have any number of triggers associated with it,
|
||||
subject to any restrictions that individual commands may have in that
|
||||
regard.
|
||||
|
||||
Event triggers are implemented on top of "soft" mode, which means that
|
||||
whenever a trace event has one or more triggers associated with it,
|
||||
the event is activated even if it isn't actually enabled, but is
|
||||
disabled in a "soft" mode. That is, the tracepoint will be called,
|
||||
but just will not be traced, unless of course it's actually enabled.
|
||||
This scheme allows triggers to be invoked even for events that aren't
|
||||
enabled, and also allows the current event filter implementation to be
|
||||
used for conditionally invoking triggers.
|
||||
|
||||
The syntax for event triggers is roughly based on the syntax for
|
||||
set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands'
|
||||
section of Documentation/trace/ftrace.txt), but there are major
|
||||
differences and the implementation isn't currently tied to it in any
|
||||
way, so beware about making generalizations between the two.
|
||||
|
||||
6.1 Expression syntax
|
||||
---------------------
|
||||
|
||||
Triggers are added by echoing the command to the 'trigger' file:
|
||||
|
||||
# echo 'command[:count] [if filter]' > trigger
|
||||
|
||||
Triggers are removed by echoing the same command but starting with '!'
|
||||
to the 'trigger' file:
|
||||
|
||||
# echo '!command[:count] [if filter]' > trigger
|
||||
|
||||
The [if filter] part isn't used in matching commands when removing, so
|
||||
leaving that off in a '!' command will accomplish the same thing as
|
||||
having it in.
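As a sketch, a trigger added with a filter can therefore be removed without
repeating the filter (the sched_switch event and the prev_state value are
illustrative assumptions):

# cd /sys/kernel/debug/tracing/events/sched/sched_switch
# echo 'stacktrace:2 if prev_state == 0' > trigger
# echo '!stacktrace:2' > trigger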
|
||||
|
||||
The filter syntax is the same as that described in the 'Event
|
||||
filtering' section above.
|
||||
|
||||
For ease of use, writing to the trigger file using '>' currently just
|
||||
adds or removes a single trigger and there's no explicit '>>' support
|
||||
('>' actually behaves like '>>') or truncation support to remove all
|
||||
triggers (you have to use '!' for each one added.)
|
||||
|
||||
6.2 Supported trigger commands
|
||||
------------------------------
|
||||
|
||||
The following commands are supported:
|
||||
|
||||
- enable_event/disable_event
|
||||
|
||||
These commands can enable or disable another trace event whenever
|
||||
the triggering event is hit. When these commands are registered,
|
||||
the other trace event is activated, but disabled in a "soft" mode.
|
||||
That is, the tracepoint will be called, but just will not be traced.
|
||||
The event tracepoint stays in this mode as long as there's a trigger
|
||||
in effect that can trigger it.
|
||||
|
||||
For example, the following trigger causes kmalloc events to be
|
||||
traced when a read system call is entered, and the :1 at the end
|
||||
specifies that this enablement happens only once:
|
||||
|
||||
# echo 'enable_event:kmem:kmalloc:1' > \
|
||||
/sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
|
||||
|
||||
The following trigger causes kmalloc events to stop being traced
|
||||
when a read system call exits. This disablement happens on every
|
||||
read system call exit:
|
||||
|
||||
# echo 'disable_event:kmem:kmalloc' > \
|
||||
/sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
|
||||
|
||||
The format is:
|
||||
|
||||
enable_event:<system>:<event>[:count]
|
||||
disable_event:<system>:<event>[:count]
|
||||
|
||||
To remove the above commands:
|
||||
|
||||
# echo '!enable_event:kmem:kmalloc:1' > \
|
||||
/sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
|
||||
|
||||
# echo '!disable_event:kmem:kmalloc' > \
|
||||
/sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
|
||||
|
||||
Note that there can be any number of enable/disable_event triggers
|
||||
per triggering event, but there can only be one trigger per
|
||||
triggered event. e.g. sys_enter_read can have triggers enabling both
|
||||
kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc
|
||||
versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if
|
||||
bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they
|
||||
could be combined into a single filter on kmem:kmalloc though).
|
||||
|
||||
- stacktrace
|
||||
|
||||
This command dumps a stacktrace in the trace buffer whenever the
|
||||
triggering event occurs.
|
||||
|
||||
For example, the following trigger dumps a stacktrace every time the
|
||||
kmalloc tracepoint is hit:
|
||||
|
||||
# echo 'stacktrace' > \
|
||||
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
|
||||
|
||||
The following trigger dumps a stacktrace the first 5 times a kmalloc
|
||||
request happens with a size >= 64K:
|
||||
|
||||
# echo 'stacktrace:5 if bytes_req >= 65536' > \
|
||||
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
|
||||
|
||||
The format is:
|
||||
|
||||
stacktrace[:count]
|
||||
|
||||
To remove the above commands:
|
||||
|
||||
# echo '!stacktrace' > \
|
||||
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
|
||||
|
||||
# echo '!stacktrace:5 if bytes_req >= 65536' > \
|
||||
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
|
||||
|
||||
The latter can also be removed more simply by the following (without
|
||||
the filter):
|
||||
|
||||
# echo '!stacktrace:5' > \
|
||||
/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
|
||||
|
||||
Note that there can be only one stacktrace trigger per triggering
|
||||
event.
|
||||
|
||||
- snapshot
|
||||
|
||||
This command causes a snapshot to be triggered whenever the
|
||||
triggering event occurs.
|
||||
|
||||
The following command creates a snapshot every time a block request
|
||||
queue is unplugged with a depth > 1. If you were tracing a set of
|
||||
events or functions at the time, the snapshot trace buffer would
|
||||
capture those events when the trigger event occurred:
|
||||
|
||||
# echo 'snapshot if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
To only snapshot once:
|
||||
|
||||
# echo 'snapshot:1 if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
To remove the above commands:
|
||||
|
||||
# echo '!snapshot if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
# echo '!snapshot:1 if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
Note that there can be only one snapshot trigger per triggering
|
||||
event.
|
||||
|
||||
- traceon/traceoff
|
||||
|
||||
These commands turn tracing on and off when the specified events are
|
||||
hit. The parameter determines how many times the tracing system is
|
||||
turned on and off. If unspecified, there is no limit.
|
||||
|
||||
The following command turns tracing off the first time a block
|
||||
request queue is unplugged with a depth > 1. If you were tracing a
|
||||
set of events or functions at the time, you could then examine the
|
||||
trace buffer to see the sequence of events that led up to the
|
||||
trigger event:
|
||||
|
||||
# echo 'traceoff:1 if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
To always disable tracing when nr_rq > 1:
|
||||
|
||||
# echo 'traceoff if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
To remove the above commands:
|
||||
|
||||
# echo '!traceoff:1 if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
# echo '!traceoff if nr_rq > 1' > \
|
||||
/sys/kernel/debug/tracing/events/block/block_unplug/trigger
|
||||
|
||||
Note that there can be only one traceon or traceoff trigger per
|
||||
triggering event.
|
Documentation/trace/ftrace-design.txt (new file, 382 lines)
function tracer guts
|
||||
====================
|
||||
By Mike Frysinger
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
Here we will cover the architecture pieces that the common function tracing
|
||||
code relies on for proper functioning. Things are broken down into increasing
|
||||
complexity so that you can start simple and at least get basic functionality.
|
||||
|
||||
Note that this focuses on architecture implementation details only. If you
|
||||
want more explanation of a feature in terms of common code, review the common
|
||||
ftrace.txt file.
|
||||
|
||||
Ideally, everyone who wishes to retain performance while supporting tracing in
|
||||
their kernel should make it all the way to dynamic ftrace support.
|
||||
|
||||
|
||||
Prerequisites
|
||||
-------------
|
||||
|
||||
Ftrace relies on these features being implemented:
|
||||
STACKTRACE_SUPPORT - implement save_stack_trace()
|
||||
TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h
|
||||
|
||||
|
||||
HAVE_FUNCTION_TRACER
|
||||
--------------------
|
||||
|
||||
You will need to implement the mcount and the ftrace_stub functions.
|
||||
|
||||
The exact mcount symbol name will depend on your toolchain. Some call it
|
||||
"mcount", "_mcount", or even "__mcount". You can probably figure it out by
|
||||
running something like:
|
||||
$ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount
|
||||
call mcount
|
||||
We'll make the assumption below that the symbol is "mcount" just to keep things
|
||||
nice and simple in the examples.
|
||||
|
||||
Keep in mind that the ABI that is in effect inside of the mcount function is
|
||||
*highly* architecture/toolchain specific. We cannot help you in this regard,
|
||||
sorry. Dig up some old documentation and/or find someone more familiar than
|
||||
you to bang ideas off of. Typically, register usage (argument/scratch/etc...)
|
||||
is a major issue at this point, especially in relation to the location of the
|
||||
mcount call (before/after function prologue). You might also want to look at
|
||||
how glibc has implemented the mcount function for your architecture. It might
|
||||
be (semi-)relevant.
|
||||
|
||||
The mcount function should check the function pointer ftrace_trace_function
|
||||
to see if it is set to ftrace_stub. If it is, there is nothing for you to do,
|
||||
so return immediately. If it isn't, then call that function in the same way
|
||||
the mcount function normally calls __mcount_internal -- the first argument is
|
||||
the "frompc" while the second argument is the "selfpc" (adjusted to remove the
|
||||
size of the mcount call that is embedded in the function).
|
||||
|
||||
For example, if the function foo() calls bar(), when the bar() function calls
|
||||
mcount(), the arguments mcount() will pass to the tracer are:
|
||||
"frompc" - the address bar() will use to return to foo()
|
||||
"selfpc" - the address bar() (with mcount() size adjustment)
|
||||
|
||||
Also keep in mind that this mcount function will be called *a lot*, so
|
||||
optimizing for the default case of no tracer will help the smooth running of
|
||||
your system when tracing is disabled. So the start of the mcount function is
|
||||
typically the bare minimum with checking things before returning. That also
|
||||
means the code flow should usually be kept linear (i.e. no branching in the nop
|
||||
case). This is of course an optimization and not a hard requirement.
|
||||
|
||||
Here is some pseudo code that should help (these functions should actually be
|
||||
implemented in assembly):
|
||||
|
||||
void ftrace_stub(void)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
void mcount(void)
|
||||
{
|
||||
/* save any bare state needed in order to do initial checking */
|
||||
|
||||
extern void (*ftrace_trace_function)(unsigned long, unsigned long);
|
||||
if (ftrace_trace_function != ftrace_stub)
|
||||
goto do_trace;
|
||||
|
||||
/* restore any bare state */
|
||||
|
||||
return;
|
||||
|
||||
do_trace:
|
||||
|
||||
/* save all state needed by the ABI (see paragraph above) */
|
||||
|
||||
unsigned long frompc = ...;
|
||||
unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
|
||||
ftrace_trace_function(frompc, selfpc);
|
||||
|
||||
/* restore all state needed by the ABI */
|
||||
}
|
||||
|
||||
Don't forget to export mcount for modules!
|
||||
extern void mcount(void);
|
||||
EXPORT_SYMBOL(mcount);
|
||||
|
||||
|
||||
HAVE_FUNCTION_GRAPH_TRACER
|
||||
--------------------------
|
||||
|
||||
Deep breath ... time to do some real work. Here you will need to update the
|
||||
mcount function to check ftrace graph function pointers, as well as implement
|
||||
some functions to save (hijack) and restore the return address.
|
||||
|
||||
The mcount function should check the function pointers ftrace_graph_return
|
||||
(compare to ftrace_stub) and ftrace_graph_entry (compare to
|
||||
ftrace_graph_entry_stub). If either of those is not set to the relevant stub
|
||||
function, call the arch-specific function ftrace_graph_caller which in turn
|
||||
calls the arch-specific function prepare_ftrace_return. Neither of these
|
||||
function names is strictly required, but you should use them anyway to stay
|
||||
consistent across the architecture ports -- easier to compare & contrast
|
||||
things.
|
||||
|
||||
The arguments to prepare_ftrace_return are slightly different from those
|
||||
passed to ftrace_trace_function. The second argument "selfpc" is the same,
|
||||
but the first argument should be a pointer to the "frompc". Typically this is
|
||||
located on the stack. This allows the function to hijack the return address
|
||||
temporarily to have it point to the arch-specific function return_to_handler.
|
||||
That function will simply call the common ftrace_return_to_handler function and
|
||||
that will return the original return address with which you can return to the
|
||||
original call site.
|
||||
|
||||
Here is the updated mcount pseudo code:
|
||||
void mcount(void)
|
||||
{
|
||||
...
|
||||
if (ftrace_trace_function != ftrace_stub)
|
||||
goto do_trace;
|
||||
|
||||
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
|
||||
+ extern void (*ftrace_graph_return)(...);
|
||||
+ extern void (*ftrace_graph_entry)(...);
|
||||
+ if (ftrace_graph_return != ftrace_stub ||
|
||||
+ ftrace_graph_entry != ftrace_graph_entry_stub)
|
||||
+ ftrace_graph_caller();
|
||||
+#endif
|
||||
|
||||
/* restore any bare state */
|
||||
...
|
||||
|
||||
Here is the pseudo code for the new ftrace_graph_caller assembly function:
|
||||
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
|
||||
void ftrace_graph_caller(void)
|
||||
{
|
||||
/* save all state needed by the ABI */
|
||||
|
||||
unsigned long *frompc = &...;
|
||||
unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
|
||||
/* passing frame pointer up is optional -- see below */
|
||||
prepare_ftrace_return(frompc, selfpc, frame_pointer);
|
||||
|
||||
/* restore all state needed by the ABI */
|
||||
}
|
||||
#endif
|
||||
|
||||
For information on how to implement prepare_ftrace_return(), simply look at the
|
||||
x86 version (the frame pointer passing is optional; see the next section for
|
||||
more information). The only architecture-specific piece in it is the setup of
|
||||
the fault recovery table (the asm(...) code). The rest should be the same
|
||||
across architectures.
|
||||
|
||||
Here is the pseudo code for the new return_to_handler assembly function. Note
|
||||
that the ABI that applies here is different from what applies to the mcount
|
||||
code. Since you are returning from a function (after the epilogue), you might
|
||||
be able to skimp on things saved/restored (usually just registers used to pass
|
||||
return values).
|
||||
|
||||
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
|
||||
void return_to_handler(void)
|
||||
{
|
||||
/* save all state needed by the ABI (see paragraph above) */
|
||||
|
||||
void (*original_return_point)(void) = ftrace_return_to_handler();
|
||||
|
||||
/* restore all state needed by the ABI */
|
||||
|
||||
/* this is usually either a return or a jump */
|
||||
original_return_point();
|
||||
}
|
||||
#endif
|
||||
|
||||
|
||||
HAVE_FUNCTION_GRAPH_FP_TEST
|
||||
---------------------------
|
||||
|
||||
An arch may pass in a unique value (frame pointer) to both the entering and
|
||||
exiting of a function. On exit, the value is compared and if it does not
|
||||
match, then it will panic the kernel. This is largely a sanity check for bad
|
||||
code generation with gcc. If gcc for your port sanely updates the frame
|
||||
pointer under different optimization levels, then ignore this option.
|
||||
|
||||
However, adding support for it isn't terribly difficult. In your assembly code
|
||||
that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument.
|
||||
Then in the C version of that function, do what the x86 port does and pass it
|
||||
along to ftrace_push_return_trace() instead of a stub value of 0.
|
||||
|
||||
Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer.
|
||||
|
||||
|
||||
HAVE_FTRACE_NMI_ENTER
|
||||
---------------------
|
||||
|
||||
If you can't trace NMI functions, then skip this option.
|
||||
|
||||
<details to be filled>
|
||||
|
||||
|
||||
HAVE_SYSCALL_TRACEPOINTS
|
||||
------------------------
|
||||
|
||||
You need very few things to get syscall tracing working on an arch.
|
||||
|
||||
- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig).
|
||||
- Have a NR_syscalls variable in <asm/unistd.h> that provides the number
|
||||
of syscalls supported by the arch.
|
||||
- Support the TIF_SYSCALL_TRACEPOINT thread flags.
|
||||
- Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace
|
||||
in the ptrace syscalls tracing path.
|
||||
- If the system call table on this arch is more complicated than a simple array
|
||||
of addresses of the system calls, implement an arch_syscall_addr to return
|
||||
the address of a given system call.
|
||||
- If the symbol names of the system calls do not match the function names on
|
||||
this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and
|
||||
implement arch_syscall_match_sym_name with the appropriate logic to return
|
||||
true if the function name corresponds with the symbol name.
|
||||
- Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
|
||||
|
||||
|
||||
HAVE_FTRACE_MCOUNT_RECORD
|
||||
-------------------------
|
||||
|
||||
See scripts/recordmcount.pl for more info. Just fill in the arch-specific
|
||||
details for how to locate the addresses of mcount call sites via objdump.
|
||||
This option doesn't make much sense without also implementing dynamic ftrace.
|
||||
|
||||
|
||||
HAVE_DYNAMIC_FTRACE
|
||||
-------------------
|
||||
|
||||
You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so
|
||||
scroll your reader back up if you got over eager.
|
||||
|
||||
Once those are out of the way, you will need to implement:
|
||||
- asm/ftrace.h:
|
||||
- MCOUNT_ADDR
|
||||
- ftrace_call_adjust()
|
||||
- struct dyn_arch_ftrace{}
|
||||
- asm code:
|
||||
- mcount() (new stub)
|
||||
- ftrace_caller()
|
||||
- ftrace_call()
|
||||
- ftrace_stub()
|
||||
- C code:
|
||||
- ftrace_dyn_arch_init()
|
||||
- ftrace_make_nop()
|
||||
- ftrace_make_call()
|
||||
- ftrace_update_ftrace_func()
|
||||
|
||||
First you will need to fill out some arch details in your asm/ftrace.h.
|
||||
|
||||
Define MCOUNT_ADDR as the address of your mcount symbol similar to:
|
||||
#define MCOUNT_ADDR ((unsigned long)mcount)
|
||||
Since no one else will have a decl for that function, you will need to:
|
||||
extern void mcount(void);
|
||||
|
||||
You will also need the helper function ftrace_call_adjust(). Most people
|
||||
will be able to stub it out like so:
|
||||
static inline unsigned long ftrace_call_adjust(unsigned long addr)
|
||||
{
|
||||
return addr;
|
||||
}
|
||||
<details to be filled>
|
||||
|
||||
Lastly you will need the custom dyn_arch_ftrace structure. If you need
|
||||
some extra state when runtime patching arbitrary call sites, this is the
|
||||
place. For now though, create an empty struct:
|
||||
struct dyn_arch_ftrace {
|
||||
/* No extra data needed */
|
||||
};
|
||||
|
||||
With the header out of the way, we can fill out the assembly code. While we
|
||||
did already create a mcount() function earlier, dynamic ftrace only wants a
|
||||
stub function. This is because the mcount() will only be used during boot
|
||||
and then all references to it will be patched out never to return. Instead,
|
||||
the guts of the old mcount() will be used to create a new ftrace_caller()
|
||||
function. Because the two are hard to merge, it will most likely be a lot
|
||||
easier to have two separate definitions split up by #ifdefs. Same goes for
|
||||
the ftrace_stub() as that will now be inlined in ftrace_caller().
|
||||
|
||||
Before we get any more confused, let's check out some pseudo code so you can
|
||||
implement your own stuff in assembly:
|
||||
|
||||
void mcount(void)
|
||||
{
|
||||
return;
|
||||
}
|
||||
|
||||
void ftrace_caller(void)
|
||||
{
|
||||
/* save all state needed by the ABI (see paragraph above) */
|
||||
|
||||
unsigned long frompc = ...;
|
||||
unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
|
||||
|
||||
ftrace_call:
|
||||
ftrace_stub(frompc, selfpc);
|
||||
|
||||
/* restore all state needed by the ABI */
|
||||
|
||||
ftrace_stub:
|
||||
return;
|
||||
}
|
||||
|
||||
This might look a little odd at first, but keep in mind that we will be runtime
|
||||
patching multiple things. First, only functions that we actually want to trace
|
||||
will be patched to call ftrace_caller(). Second, since we only have one tracer
|
||||
active at a time, we will patch the ftrace_caller() function itself to call the
|
||||
specific tracer in question. That is the point of the ftrace_call label.
|
||||
|
||||
With that in mind, let's move on to the C code that will actually be doing the
|
||||
runtime patching. You'll need a little knowledge of your arch's opcodes in
|
||||
order to make it through the next section.
|
||||
|
||||
Every arch has an init callback function. If you need to do something early on
|
||||
to initialize some state, this is the time to do that. Otherwise, this simple
|
||||
function below should be sufficient for most people:
|
||||
|
||||
int __init ftrace_dyn_arch_init(void)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
There are two functions that are used to do runtime patching of arbitrary
|
||||
functions. The first is used to turn the mcount call site into a nop (which
|
||||
is what helps us retain runtime performance when not tracing). The second is
|
||||
used to turn the mcount call site into a call to an arbitrary location (but
|
||||
typically that is ftrace_caller()). See the general function definition in
|
||||
linux/ftrace.h for the functions:
|
||||
ftrace_make_nop()
|
||||
ftrace_make_call()
|
||||
The rec->ip value is the address of the mcount call site that was collected
|
||||
by scripts/recordmcount.pl at build time.
|
||||
|
||||
The last function is used to do runtime patching of the active tracer. This
|
||||
will be modifying the assembly code at the location of the ftrace_call symbol
|
||||
inside of the ftrace_caller() function. So you should have sufficient padding
|
||||
at that location to support the new function calls you'll be inserting. Some
|
||||
people will be using a "call" type instruction while others will be using a
|
||||
"branch" type instruction. Specifically, the function is:
|
||||
ftrace_update_ftrace_func()
|
||||
|
||||
|
||||
HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER
|
||||
------------------------------------------------
|
||||
|
||||
The function grapher needs a few tweaks in order to work with dynamic ftrace.
|
||||
Basically, you will need to:
|
||||
- update:
|
||||
- ftrace_caller()
|
||||
- ftrace_graph_call()
|
||||
- ftrace_graph_caller()
|
||||
- implement:
|
||||
- ftrace_enable_ftrace_graph_caller()
|
||||
- ftrace_disable_ftrace_graph_caller()
|
||||
|
||||
<details to be filled>
|
||||
Quick notes:
|
||||
- add a nop stub after the ftrace_call location named ftrace_graph_call;
|
||||
stub needs to be large enough to support a call to ftrace_graph_caller()
|
||||
- update ftrace_graph_caller() to work with being called by the new
|
||||
ftrace_caller() since some semantics may have changed
|
||||
- ftrace_enable_ftrace_graph_caller() will runtime patch the
|
||||
ftrace_graph_call location with a call to ftrace_graph_caller()
|
||||
- ftrace_disable_ftrace_graph_caller() will runtime patch the
|
||||
ftrace_graph_call location with nops
|
Documentation/trace/ftrace.txt (new file, 2866 lines; diff suppressed because it is too large)
Documentation/trace/function-graph-fold.vim (new file, 42 lines)
" Enable folding for ftrace function_graph traces.
|
||||
"
|
||||
" To use, :source this file while viewing a function_graph trace, or use vim's
|
||||
" -S option to load from the command-line together with a trace. You can then
|
||||
" use the usual vim fold commands, such as "za", to open and close nested
|
||||
" functions. While closed, a fold will show the total time taken for a call,
|
||||
" as would normally appear on the line with the closing brace. Folded
|
||||
" functions will not include finish_task_switch(), so folding should remain
|
||||
" relatively sane even through a context switch.
|
||||
"
|
||||
" Note that this will almost certainly only work well with a
|
||||
" single-CPU trace (e.g. trace-cmd report --cpu 1).
|
||||
|
||||
function! FunctionGraphFoldExpr(lnum)
|
||||
let line = getline(a:lnum)
|
||||
if line[-1:] == '{'
|
||||
if line =~ 'finish_task_switch() {$'
|
||||
return '>1'
|
||||
endif
|
||||
return 'a1'
|
||||
elseif line[-1:] == '}'
|
||||
return 's1'
|
||||
else
|
||||
return '='
|
||||
endif
|
||||
endfunction
|
||||
|
||||
function! FunctionGraphFoldText()
|
||||
let s = split(getline(v:foldstart), '|', 1)
|
||||
if getline(v:foldend+1) =~ 'finish_task_switch() {$'
|
||||
let s[2] = ' task switch '
|
||||
else
|
||||
let e = split(getline(v:foldend), '|', 1)
|
||||
let s[2] = e[2]
|
||||
endif
|
||||
return join(s, '|')
|
||||
endfunction
|
||||
|
||||
setlocal foldexpr=FunctionGraphFoldExpr(v:lnum)
|
||||
setlocal foldtext=FunctionGraphFoldText()
|
||||
setlocal foldcolumn=12
|
||||
setlocal foldmethod=expr
|
Documentation/trace/kprobetrace.txt (new file, 172 lines)
Kprobe-based Event Tracing
|
||||
==========================
|
||||
|
||||
Documentation is written by Masami Hiramatsu
|
||||
|
||||
|
||||
Overview
|
||||
--------
|
||||
These events are similar to tracepoint based events. Instead of Tracepoint,
|
||||
this is based on kprobes (kprobe and kretprobe). So it can probe wherever
|
||||
kprobes can probe (this means, any function body except for __kprobes
|
||||
functions). Unlike the Tracepoint based event, this can be added and removed
|
||||
dynamically, on the fly.
|
||||
|
||||
To enable this feature, build your kernel with CONFIG_KPROBE_EVENT=y.
|
||||
|
||||
Similar to the events tracer, this doesn't need to be activated via
|
||||
current_tracer. Instead of that, add probe points via
|
||||
/sys/kernel/debug/tracing/kprobe_events, and enable it via
|
||||
/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enable.
|
||||
|
||||
|
||||
Synopsis of kprobe_events
|
||||
-------------------------
|
||||
p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
|
||||
r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
|
||||
-:[GRP/]EVENT : Clear a probe
|
||||
|
||||
GRP : Group name. If omitted, use "kprobes" for it.
|
||||
EVENT : Event name. If omitted, the event name is generated
|
||||
based on SYM+offs or MEMADDR.
|
||||
MOD : Module name which has given SYM.
|
||||
SYM[+offs] : Symbol+offset where the probe is inserted.
|
||||
MEMADDR : Address where the probe is inserted.
|
||||
|
||||
FETCHARGS : Arguments. Each probe can have up to 128 args.
|
||||
%REG : Fetch register REG
|
||||
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
|
||||
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
|
||||
$stackN : Fetch Nth entry of stack (N >= 0)
|
||||
$stack : Fetch stack address.
|
||||
$retval : Fetch return value.(*)
|
||||
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
|
||||
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
|
||||
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
|
||||
(u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
|
||||
are supported.
|
||||
|
||||
(*) only for return probe.
|
||||
(**) this is useful for fetching a field of data structures.
|
||||
|
||||
Types
|
||||
-----
|
||||
Several types are supported for fetch-args. The kprobe tracer will access memory
according to the given type. The prefixes 's' and 'u' mean signed and unsigned types,
respectively. Traced arguments are shown in decimal (signed) or hex (unsigned).
|
||||
String type is a special type, which fetches a "null-terminated" string from
|
||||
kernel space. This means it will fail and store NULL if the string container
|
||||
has been paged out.
|
||||
Bitfield is another special type, which takes 3 parameters, bit-width, bit-
|
||||
offset, and container-size (usually 32). The syntax is:
|
||||
|
||||
b<bit-width>@<bit-offset>/<container-size>
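For example, a hypothetical probe fetching the low 4 bits of a register as a
bitfield might look like this (the symbol, register and bit positions are
purely illustrative and arch-specific):

echo 'p:mybits do_sys_open lowbits=%dx:b4@0/32' >> /sys/kernel/debug/tracing/kprobe_events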
|
||||
|
||||
|
||||
Per-Probe Event Filtering
|
||||
-------------------------
|
||||
The per-probe event filtering feature allows you to set a different filter on each
probe and to choose which arguments will be shown in the trace buffer. If an event
name is specified right after 'p:' or 'r:' in kprobe_events, an event is added
under tracing/events/kprobes/<EVENT>; in that directory you can see 'id',
'enable', 'format' and 'filter'.
|
||||
|
||||
enable:
|
||||
You can enable/disable the probe by writing 1 or 0 on it.
|
||||
|
||||
format:
|
||||
This shows the format of this probe event.
|
||||
|
||||
filter:
|
||||
You can write filtering rules of this event.
|
||||
|
||||
id:
|
||||
This shows the id of this probe event.
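For example, using the "myprobe" event defined in the usage example below, a
filter on one of its named arguments could be set like this (a sketch; the dfd
value is arbitrary):

echo 'dfd == 1' > /sys/kernel/debug/tracing/events/kprobes/myprobe/filter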
|
||||
|
||||
|
||||
Event Profiling
|
||||
---------------
|
||||
You can check the total number of probe hits and probe miss-hits via
|
||||
/sys/kernel/debug/tracing/kprobe_profile.
|
||||
The first column is event name, the second is the number of probe hits,
|
||||
the third is the number of probe miss-hits.
|
||||
|
||||
|
||||
Usage examples
|
||||
--------------
|
||||
To add a probe as a new event, write a new definition to kprobe_events
|
||||
as below.
|
||||
|
||||
echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events
|
||||
|
||||
This sets a kprobe on the top of the do_sys_open() function, recording the
1st to 4th arguments as the "myprobe" event. Note that which register/stack entry is
assigned to each function argument depends on the arch-specific ABI. If you are
unsure of the ABI, please try the probe subcommand of perf tools (you can find it
under tools/perf/).
As this example shows, users can choose more familiar names for each argument.
|
||||
|
||||
echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events
|
||||
|
||||
This sets a kretprobe on the return point of the do_sys_open() function,
recording the return value as the "myretprobe" event.
|
||||
You can see the format of these events via
|
||||
/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format.
|
||||
|
||||
cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
|
||||
name: myprobe
|
||||
ID: 780
|
||||
format:
|
||||
field:unsigned short common_type; offset:0; size:2; signed:0;
|
||||
field:unsigned char common_flags; offset:2; size:1; signed:0;
|
||||
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
|
||||
field:int common_pid; offset:4; size:4; signed:1;
|
||||
|
||||
field:unsigned long __probe_ip; offset:12; size:4; signed:0;
|
||||
field:int __probe_nargs; offset:16; size:4; signed:1;
|
||||
field:unsigned long dfd; offset:20; size:4; signed:0;
|
||||
field:unsigned long filename; offset:24; size:4; signed:0;
|
||||
field:unsigned long flags; offset:28; size:4; signed:0;
|
||||
field:unsigned long mode; offset:32; size:4; signed:0;
|
||||
|
||||
|
||||
print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip,
|
||||
REC->dfd, REC->filename, REC->flags, REC->mode
|
||||
|
||||
You can see that the event has 4 arguments as in the expressions you specified.
|
||||
|
||||
echo > /sys/kernel/debug/tracing/kprobe_events
|
||||
|
||||
This clears all probe points.
|
||||
|
||||
Or,
|
||||
|
||||
echo -:myprobe >> kprobe_events
|
||||
|
||||
This clears probe points selectively.
|
||||
|
||||
Right after definition, each event is disabled by default. For tracing these
|
||||
events, you need to enable it.
|
||||
|
||||
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
|
||||
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
|
||||
|
||||
And you can see the traced information via /sys/kernel/debug/tracing/trace.
|
||||
|
||||
cat /sys/kernel/debug/tracing/trace
|
||||
# tracer: nop
|
||||
#
|
||||
# TASK-PID CPU# TIMESTAMP FUNCTION
|
||||
# | | | | |
|
||||
<...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0
|
||||
<...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe
|
||||
<...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6
|
||||
<...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
|
||||
<...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10
|
||||
<...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
|
||||
|
||||
|
||||
Each line shows when the kernel hits an event, and <- SYMBOL means kernel
|
||||
returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel
|
||||
returns from do_sys_open to sys_open+0x1b).
|
||||
|
164
Documentation/trace/mmiotrace.txt
Normal file
@ -0,0 +1,164 @@
		In-kernel memory-mapped I/O tracing


Home page and links to optional user space tools:

	http://nouveau.freedesktop.org/wiki/MmioTrace

MMIO tracing was originally developed by Intel around 2003 for their Fault
Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau
project in mind. Since then many people have contributed.

Mmiotrace was built for reverse engineering any memory-mapped IO device with
the Nouveau project as the first real user. Only x86 and x86_64 architectures
are supported.

Out-of-tree mmiotrace was originally modified for mainline inclusion and the
ftrace framework by Pekka Paalanen <pq@iki.fi>.


Preparation
-----------

The mmiotrace feature is compiled in via the CONFIG_MMIOTRACE option. Tracing
is disabled by default, so it is safe to have this set to yes. SMP systems are
supported, but tracing is unreliable and may miss events if more than one CPU
is on-line, therefore mmiotrace takes all but one CPU off-line during run-time
activation. You can re-enable CPUs by hand, but you have been warned: there
is no way to automatically detect if you are losing events due to CPUs racing.


Usage Quick Reference
---------------------

$ mount -t debugfs debugfs /sys/kernel/debug
$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
Start X or whatever.
$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
$ echo nop > /sys/kernel/debug/tracing/current_tracer
Check for lost events.


Usage
-----

Make sure debugfs is mounted to /sys/kernel/debug.
If not (requires root privileges):
$ mount -t debugfs debugfs /sys/kernel/debug

Check that the driver you are about to trace is not loaded.

Activate mmiotrace (requires root privileges):
$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer

Start storing the trace:
$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
The 'cat' process should stay running (sleeping) in the background.

Load the driver you want to trace and use it. Mmiotrace will only catch MMIO
accesses to areas that are ioremapped while mmiotrace is active.

During tracing you can place comments (markers) into the trace by
$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
This makes it easier to see which part of the (huge) trace corresponds to
which action. It is recommended to place descriptive markers about what you
do.

Shut down mmiotrace (requires root privileges):
$ echo nop > /sys/kernel/debug/tracing/current_tracer
The 'cat' process exits. If it does not, kill it by issuing the 'fg' command
and pressing ctrl+c.

Check that mmiotrace did not lose events due to a buffer filling up. Either
$ grep -i lost mydump.txt
which tells you exactly how many events were lost, or use
$ dmesg
to view your kernel log and look for the "mmiotrace has lost events" warning.
If events were lost, the trace is incomplete. You should enlarge the buffers
and try again. Buffers are enlarged by first seeing how large the current
buffers are:
$ cat /sys/kernel/debug/tracing/buffer_size_kb
gives you a number. Approximately double this number and write it back, for
instance:
$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb
Then start again from the top.

If you are doing a trace for a driver project, e.g. Nouveau, you should also
do the following before sending your results:
$ lspci -vvv > lspci.txt
$ dmesg > dmesg.txt
$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt
and then send the .tar.gz file. The trace compresses considerably. Replace
"pciid" and "nick" with the PCI ID or model name of your piece of hardware
under investigation and your nickname.


How Mmiotrace Works
-------------------

Access to hardware IO-memory is gained by mapping addresses from the PCI bus
by calling one of the ioremap_*() functions. Mmiotrace is hooked into the
__ioremap() function and gets called whenever a mapping is created. Mapping is
an event that is recorded into the trace log. Note that ISA range mappings
are not caught, since the mapping always exists and is returned directly.

MMIO accesses are recorded via page faults. Just before __ioremap() returns,
the mapped pages are marked as not present. Any access to the pages causes a
fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace
marks the page present, sets the TF flag to achieve single stepping and exits
the fault handler. The instruction that faulted is executed and the debug
trap is entered. Here mmiotrace again marks the page as not present. The
instruction is decoded to get the type of operation (read/write), data width
and the value read or written. These are stored to the trace log.

Setting the page present in the page fault handler has a race condition on SMP
machines. During the single stepping other CPUs may run freely on that page
and events can be missed without notice. Re-enabling other CPUs during
tracing is discouraged.


Trace Log Format
----------------

The raw log is text and easily filtered with e.g. grep and awk. One record is
one line in the log. A record starts with a keyword, followed by keyword-
dependent arguments. Arguments are separated by a space, or continue until the
end of line. The format for version 20070824 is as follows:

Explanation            Keyword   Space-separated arguments
---------------------------------------------------------------------------

read event             R         width, timestamp, map id, physical, value, PC, PID
write event            W         width, timestamp, map id, physical, value, PC, PID
ioremap event          MAP       timestamp, map id, physical, virtual, length, PC, PID
iounmap event          UNMAP     timestamp, map id, PC, PID
marker                 MARK      timestamp, text
version                VERSION   the string "20070824"
info for reader        LSPCI     one line from lspci -v
PCI address map        PCIDEV    space-separated /proc/bus/pci/devices data
unk. opcode            UNKNOWN   timestamp, map id, physical, data, PC, PID

Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual
is a kernel virtual address. Width is the data width in bytes and value is the
data value. Map id is an arbitrary id number identifying the mapping that was
used in an operation. PC is the program counter and PID is the process id. PC
is zero if it is not recorded. PID is always zero as tracing MMIO accesses
originating in user space memory is not yet supported.

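Purely as an illustration of the field order described above (all numbers are
made up, not taken from a real trace), a 4-byte write record could look like:

W 4 2150.093533 1 0xfb73ce40 0x12345678 0x0 0
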
For instance, the following awk filter will pass all 32-bit writes that target
physical addresses in the range [0xfb73ce40, 0xfb800000[

$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 &&
                adr < 0xfb800000) print; }'
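
To run the same filter over a previously saved dump (re-using the mydump.txt
file name from the Usage section; note that strtonum() needs GNU awk), the
invocation would be along the lines of:

$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 &&
                adr < 0xfb800000) print; }' mydump.txt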

Tools for Developers
--------------------

The user space tools include utilities for:
- replacing numeric addresses and values with hardware register names
- replaying MMIO logs, i.e., re-executing the recorded writes

418
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
Normal file
@ -0,0 +1,418 @@
#!/usr/bin/perl
|
||||
# This is a POC (proof of concept or piece of crap, take your pick) for reading the
|
||||
# text representation of trace output related to page allocation. It makes an attempt
|
||||
# to extract some high-level information on what is going on. The accuracy of the parser
|
||||
# may vary considerably
|
||||
#
|
||||
# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
|
||||
# other options
|
||||
# --prepend-parent Report on the parent proc and PID
|
||||
# --read-procstat If the trace lacks process info, get it from /proc
|
||||
# --ignore-pid Aggregate processes of the same name together
|
||||
#
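# Illustrative invocation (options as listed above; trace_pipe is the same
# debugfs location used elsewhere in this documentation):
#   trace-pagealloc-postprocess.pl --ignore-pid --read-procstat < /sys/kernel/debug/tracing/trace_pipe
#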
|
||||
# Copyright (c) IBM Corporation 2009
|
||||
# Author: Mel Gorman <mel@csn.ul.ie>
|
||||
use strict;
|
||||
use Getopt::Long;
|
||||
|
||||
# Tracepoint events
|
||||
use constant MM_PAGE_ALLOC => 1;
|
||||
use constant MM_PAGE_FREE => 2;
|
||||
use constant MM_PAGE_FREE_BATCHED => 3;
|
||||
use constant MM_PAGE_PCPU_DRAIN => 4;
|
||||
use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
|
||||
use constant MM_PAGE_ALLOC_EXTFRAG => 6;
|
||||
use constant EVENT_UNKNOWN => 7;
|
||||
|
||||
# Constants used to track state
|
||||
use constant STATE_PCPU_PAGES_DRAINED => 8;
|
||||
use constant STATE_PCPU_PAGES_REFILLED => 9;
|
||||
|
||||
# High-level events extrapolated from tracepoints
|
||||
use constant HIGH_PCPU_DRAINS => 10;
|
||||
use constant HIGH_PCPU_REFILLS => 11;
|
||||
use constant HIGH_EXT_FRAGMENT => 12;
|
||||
use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
|
||||
use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
|
||||
use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
|
||||
|
||||
my %perprocesspid;
|
||||
my %perprocess;
|
||||
my $opt_ignorepid;
|
||||
my $opt_read_procstat;
|
||||
my $opt_prepend_parent;
|
||||
|
||||
# Catch sigint and exit on request
|
||||
my $sigint_report = 0;
|
||||
my $sigint_exit = 0;
|
||||
my $sigint_pending = 0;
|
||||
my $sigint_received = 0;
|
||||
sub sigint_handler {
|
||||
my $current_time = time;
|
||||
if ($current_time - 2 > $sigint_received) {
|
||||
print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
|
||||
$sigint_report = 1;
|
||||
} else {
|
||||
if (!$sigint_exit) {
|
||||
print "Second SIGINT received quickly, exiting\n";
|
||||
}
|
||||
$sigint_exit++;
|
||||
}
|
||||
|
||||
if ($sigint_exit > 3) {
|
||||
print "Many SIGINTs received, exiting now without report\n";
|
||||
exit;
|
||||
}
|
||||
|
||||
$sigint_received = $current_time;
|
||||
$sigint_pending = 1;
|
||||
}
|
||||
$SIG{INT} = "sigint_handler";
|
||||
|
||||
# Parse command line options
|
||||
GetOptions(
|
||||
'ignore-pid' => \$opt_ignorepid,
|
||||
'read-procstat' => \$opt_read_procstat,
|
||||
'prepend-parent' => \$opt_prepend_parent,
|
||||
);
|
||||
|
||||
# Defaults for dynamically discovered regex's
|
||||
my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
|
||||
|
||||
# Dynamically discovered regex
|
||||
my $regex_fragdetails;
|
||||
|
||||
# Static regex used. Specified like this for readability and for use with /o
|
||||
# (process_pid) (cpus ) ( time ) (tpoint ) (details)
|
||||
my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
|
||||
my $regex_statname = '[-0-9]*\s\((.*)\).*';
|
||||
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
|
||||
|
||||
sub generate_traceevent_regex {
|
||||
my $event = shift;
|
||||
my $default = shift;
|
||||
my $regex;
|
||||
|
||||
# Read the event format or use the default
|
||||
if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
|
||||
$regex = $default;
|
||||
} else {
|
||||
my $line;
|
||||
while (!eof(FORMAT)) {
|
||||
$line = <FORMAT>;
|
||||
if ($line =~ /^print fmt:\s"(.*)",.*/) {
|
||||
$regex = $1;
|
||||
$regex =~ s/%p/\([0-9a-f]*\)/g;
|
||||
$regex =~ s/%d/\([-0-9]*\)/g;
|
||||
$regex =~ s/%lu/\([0-9]*\)/g;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Verify fields are in the right order
|
||||
my $tuple;
|
||||
foreach $tuple (split /\s/, $regex) {
|
||||
my ($key, $value) = split(/=/, $tuple);
|
||||
my $expected = shift;
|
||||
if ($key ne $expected) {
|
||||
print("WARNING: Format not as expected '$key' != '$expected'");
|
||||
$regex =~ s/$key=\((.*)\)/$key=$1/;
|
||||
}
|
||||
}
|
||||
|
||||
if (defined shift) {
|
||||
die("Fewer fields than expected in format");
|
||||
}
|
||||
|
||||
return $regex;
|
||||
}
|
||||
$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
|
||||
$regex_fragdetails_default,
|
||||
"page", "pfn",
|
||||
"alloc_order", "fallback_order", "pageblock_order",
|
||||
"alloc_migratetype", "fallback_migratetype",
|
||||
"fragmenting", "change_ownership");
|
||||
|
||||
sub read_statline($) {
|
||||
my $pid = $_[0];
|
||||
my $statline;
|
||||
|
||||
if (open(STAT, "/proc/$pid/stat")) {
|
||||
$statline = <STAT>;
|
||||
close(STAT);
|
||||
}
|
||||
|
||||
if ($statline eq '') {
|
||||
$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
|
||||
}
|
||||
|
||||
return $statline;
|
||||
}
|
||||
|
||||
sub guess_process_pid($$) {
|
||||
my $pid = $_[0];
|
||||
my $statline = $_[1];
|
||||
|
||||
if ($pid == 0) {
|
||||
return "swapper-0";
|
||||
}
|
||||
|
||||
if ($statline !~ /$regex_statname/o) {
|
||||
die("Failed to math stat line for process name :: $statline");
|
||||
}
|
||||
return "$1-$pid";
|
||||
}
|
||||
|
||||
sub parent_info($$) {
|
||||
my $pid = $_[0];
|
||||
my $statline = $_[1];
|
||||
my $ppid;
|
||||
|
||||
if ($pid == 0) {
|
||||
return "NOPARENT-0";
|
||||
}
|
||||
|
||||
if ($statline !~ /$regex_statppid/o) {
|
||||
die("Failed to match stat line process ppid:: $statline");
|
||||
}
|
||||
|
||||
# Read the ppid stat line
|
||||
$ppid = $1;
|
||||
return guess_process_pid($ppid, read_statline($ppid));
|
||||
}
|
||||
|
||||
sub process_events {
|
||||
my $traceevent;
|
||||
my $process_pid;
|
||||
my $cpus;
|
||||
my $timestamp;
|
||||
my $tracepoint;
|
||||
my $details;
|
||||
my $statline;
|
||||
|
||||
# Read each line of the event log
|
||||
EVENT_PROCESS:
|
||||
while ($traceevent = <STDIN>) {
|
||||
if ($traceevent =~ /$regex_traceevent/o) {
|
||||
$process_pid = $1;
|
||||
$tracepoint = $4;
|
||||
|
||||
if ($opt_read_procstat || $opt_prepend_parent) {
|
||||
$process_pid =~ /(.*)-([0-9]*)$/;
|
||||
my $process = $1;
|
||||
my $pid = $2;
|
||||
|
||||
$statline = read_statline($pid);
|
||||
|
||||
if ($opt_read_procstat && $process eq '') {
|
||||
$process_pid = guess_process_pid($pid, $statline);
|
||||
}
|
||||
|
||||
if ($opt_prepend_parent) {
|
||||
$process_pid = parent_info($pid, $statline) . " :: $process_pid";
|
||||
}
|
||||
}
|
||||
|
||||
# Unnecessary in this script. Uncomment if required
|
||||
# $cpus = $2;
|
||||
# $timestamp = $3;
|
||||
} else {
|
||||
next;
|
||||
}
|
||||
|
||||
# Perl Switch() sucks majorly
|
||||
if ($tracepoint eq "mm_page_alloc") {
|
||||
$perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
|
||||
} elsif ($tracepoint eq "mm_page_free") {
|
||||
$perprocesspid{$process_pid}->{MM_PAGE_FREE}++
|
||||
} elsif ($tracepoint eq "mm_page_free_batched") {
|
||||
$perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}++;
|
||||
} elsif ($tracepoint eq "mm_page_pcpu_drain") {
|
||||
$perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
|
||||
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
|
||||
} elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
|
||||
$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
|
||||
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
|
||||
} elsif ($tracepoint eq "mm_page_alloc_extfrag") {
|
||||
|
||||
# Extract the details of the event now
|
||||
$details = $5;
|
||||
|
||||
my ($page, $pfn);
|
||||
my ($alloc_order, $fallback_order, $pageblock_order);
|
||||
my ($alloc_migratetype, $fallback_migratetype);
|
||||
my ($fragmenting, $change_ownership);
|
||||
|
||||
if ($details !~ /$regex_fragdetails/o) {
|
||||
print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
|
||||
next;
|
||||
}
|
||||
|
||||
$perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
|
||||
$page = $1;
|
||||
$pfn = $2;
|
||||
$alloc_order = $3;
|
||||
$fallback_order = $4;
|
||||
$pageblock_order = $5;
|
||||
$alloc_migratetype = $6;
|
||||
$fallback_migratetype = $7;
|
||||
$fragmenting = $8;
|
||||
$change_ownership = $9;
|
||||
|
||||
if ($fragmenting) {
|
||||
$perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
|
||||
if ($fallback_order <= 3) {
|
||||
$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
|
||||
}
|
||||
}
|
||||
if ($change_ownership) {
|
||||
$perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
|
||||
}
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
|
||||
}
|
||||
|
||||
# Catch a full pcpu drain event
|
||||
if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
|
||||
$tracepoint ne "mm_page_pcpu_drain") {
|
||||
|
||||
$perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
|
||||
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
|
||||
}
|
||||
|
||||
# Catch a full pcpu refill event
|
||||
if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
|
||||
$tracepoint ne "mm_page_alloc_zone_locked") {
|
||||
$perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
|
||||
$perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
|
||||
}
|
||||
|
||||
if ($sigint_pending) {
|
||||
last EVENT_PROCESS;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
sub dump_stats {
|
||||
my $hashref = shift;
|
||||
my %stats = %$hashref;
|
||||
|
||||
# Dump per-process stats
|
||||
my $process_pid;
|
||||
my $max_strlen = 0;
|
||||
|
||||
# Get the maximum process name
|
||||
foreach $process_pid (keys %perprocesspid) {
|
||||
my $len = length($process_pid);
|
||||
if ($len > $max_strlen) {
|
||||
$max_strlen = $len;
|
||||
}
|
||||
}
|
||||
$max_strlen += 2;
|
||||
|
||||
printf("\n");
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
|
||||
"Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
|
||||
"details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
|
||||
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
|
||||
"", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
|
||||
|
||||
foreach $process_pid (keys %stats) {
|
||||
# Dump final aggregates
|
||||
if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
|
||||
$stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
|
||||
$stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
|
||||
}
|
||||
if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
|
||||
$stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
|
||||
$stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
|
||||
}
|
||||
|
||||
printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
|
||||
$process_pid,
|
||||
$stats{$process_pid}->{MM_PAGE_ALLOC},
|
||||
$stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
|
||||
$stats{$process_pid}->{MM_PAGE_FREE},
|
||||
$stats{$process_pid}->{MM_PAGE_FREE_BATCHED},
|
||||
$stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
|
||||
$stats{$process_pid}->{HIGH_PCPU_DRAINS},
|
||||
$stats{$process_pid}->{HIGH_PCPU_REFILLS},
|
||||
$stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
|
||||
$stats{$process_pid}->{HIGH_EXT_FRAG},
|
||||
$stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
|
||||
$stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
|
||||
$stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
|
||||
$stats{$process_pid}->{EVENT_UNKNOWN});
|
||||
}
|
||||
}
|
||||
|
||||
sub aggregate_perprocesspid() {
|
||||
my $process_pid;
|
||||
my $process;
|
||||
undef %perprocess;
|
||||
|
||||
foreach $process_pid (keys %perprocesspid) {
|
||||
$process = $process_pid;
|
||||
$process =~ s/-([0-9])*$//;
|
||||
if ($process eq '') {
|
||||
$process = "NO_PROCESS_NAME";
|
||||
}
|
||||
|
||||
$perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
|
||||
$perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
|
||||
$perprocess{$process}->{MM_PAGE_FREE} += $perprocesspid{$process_pid}->{MM_PAGE_FREE};
|
||||
$perprocess{$process}->{MM_PAGE_FREE_BATCHED} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED};
|
||||
$perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
|
||||
$perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
|
||||
$perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
|
||||
$perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
|
||||
$perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
|
||||
$perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
|
||||
$perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
|
||||
$perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
|
||||
$perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
|
||||
}
|
||||
}
|
||||
|
||||
sub report() {
|
||||
if (!$opt_ignorepid) {
|
||||
dump_stats(\%perprocesspid);
|
||||
} else {
|
||||
aggregate_perprocesspid();
|
||||
dump_stats(\%perprocess);
|
||||
}
|
||||
}
|
||||
|
||||
# Process events or signals until neither is available
|
||||
sub signal_loop() {
|
||||
my $sigint_processed;
|
||||
do {
|
||||
$sigint_processed = 0;
|
||||
process_events();
|
||||
|
||||
# Handle pending signals if any
|
||||
if ($sigint_pending) {
|
||||
my $current_time = time;
|
||||
|
||||
if ($sigint_exit) {
|
||||
print "Received exit signal\n";
|
||||
$sigint_pending = 0;
|
||||
}
|
||||
if ($sigint_report) {
|
||||
if ($current_time >= $sigint_received + 2) {
|
||||
report();
|
||||
$sigint_report = 0;
|
||||
$sigint_pending = 0;
|
||||
$sigint_processed = 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
} while ($sigint_pending || $sigint_processed);
|
||||
}
|
||||
|
||||
signal_loop();
|
||||
report();
|
757
Documentation/trace/postprocess/trace-vmscan-postprocess.pl
Normal file
@ -0,0 +1,757 @@
#!/usr/bin/perl
|
||||
# This is a POC for reading the text representation of trace output related to
|
||||
# page reclaim. It makes an attempt to extract some high-level information on
|
||||
# what is going on. The accuracy of the parser may vary
|
||||
#
|
||||
# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
|
||||
# other options
|
||||
# --read-procstat If the trace lacks process info, get it from /proc
|
||||
# --ignore-pid Aggregate processes of the same name together
|
||||
#
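# Illustrative invocation (aggregating by process name and filling in missing
# process names from /proc):
#   trace-vmscan-postprocess.pl --ignore-pid --read-procstat < /sys/kernel/debug/tracing/trace_pipe
#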
|
||||
# Copyright (c) IBM Corporation 2009
|
||||
# Author: Mel Gorman <mel@csn.ul.ie>
|
||||
use strict;
|
||||
use Getopt::Long;
|
||||
|
||||
# Tracepoint events
|
||||
use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1;
|
||||
use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2;
|
||||
use constant MM_VMSCAN_KSWAPD_WAKE => 3;
|
||||
use constant MM_VMSCAN_KSWAPD_SLEEP => 4;
|
||||
use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5;
|
||||
use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6;
|
||||
use constant MM_VMSCAN_LRU_ISOLATE => 7;
|
||||
use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8;
|
||||
use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9;
|
||||
use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10;
|
||||
use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11;
|
||||
use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12;
|
||||
use constant EVENT_UNKNOWN => 13;
|
||||
|
||||
# Per-order events
|
||||
use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
|
||||
use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12;
|
||||
use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13;
|
||||
use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14;
|
||||
|
||||
# Constants used to track state
|
||||
use constant STATE_DIRECT_BEGIN => 15;
|
||||
use constant STATE_DIRECT_ORDER => 16;
|
||||
use constant STATE_KSWAPD_BEGIN => 17;
|
||||
use constant STATE_KSWAPD_ORDER => 18;
|
||||
|
||||
# High-level events extrapolated from tracepoints
|
||||
use constant HIGH_DIRECT_RECLAIM_LATENCY => 19;
|
||||
use constant HIGH_KSWAPD_LATENCY => 20;
|
||||
use constant HIGH_KSWAPD_REWAKEUP => 21;
|
||||
use constant HIGH_NR_SCANNED => 22;
|
||||
use constant HIGH_NR_TAKEN => 23;
|
||||
use constant HIGH_NR_RECLAIMED => 24;
|
||||
use constant HIGH_NR_FILE_SCANNED => 25;
|
||||
use constant HIGH_NR_ANON_SCANNED => 26;
|
||||
use constant HIGH_NR_FILE_RECLAIMED => 27;
|
||||
use constant HIGH_NR_ANON_RECLAIMED => 28;
|
||||
|
||||
my %perprocesspid;
|
||||
my %perprocess;
|
||||
my %last_procmap;
|
||||
my $opt_ignorepid;
|
||||
my $opt_read_procstat;
|
||||
|
||||
my $total_wakeup_kswapd;
|
||||
my ($total_direct_reclaim, $total_direct_nr_scanned);
|
||||
my ($total_direct_nr_file_scanned, $total_direct_nr_anon_scanned);
|
||||
my ($total_direct_latency, $total_kswapd_latency);
|
||||
my ($total_direct_nr_reclaimed);
|
||||
my ($total_direct_nr_file_reclaimed, $total_direct_nr_anon_reclaimed);
|
||||
my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
|
||||
my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
|
||||
my ($total_kswapd_nr_scanned, $total_kswapd_wake);
|
||||
my ($total_kswapd_nr_file_scanned, $total_kswapd_nr_anon_scanned);
|
||||
my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
|
||||
my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
|
||||
my ($total_kswapd_nr_reclaimed);
|
||||
my ($total_kswapd_nr_file_reclaimed, $total_kswapd_nr_anon_reclaimed);
|
||||
|
||||
# Catch sigint and exit on request
|
||||
my $sigint_report = 0;
|
||||
my $sigint_exit = 0;
|
||||
my $sigint_pending = 0;
|
||||
my $sigint_received = 0;
|
||||
sub sigint_handler {
|
||||
my $current_time = time;
|
||||
if ($current_time - 2 > $sigint_received) {
|
||||
print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
|
||||
$sigint_report = 1;
|
||||
} else {
|
||||
if (!$sigint_exit) {
|
||||
print "Second SIGINT received quickly, exiting\n";
|
||||
}
|
||||
$sigint_exit++;
|
||||
}
|
||||
|
||||
if ($sigint_exit > 3) {
|
||||
print "Many SIGINTs received, exiting now without report\n";
|
||||
exit;
|
||||
}
|
||||
|
||||
$sigint_received = $current_time;
|
||||
$sigint_pending = 1;
|
||||
}
|
||||
$SIG{INT} = "sigint_handler";
|
||||
|
||||
# Parse command line options
|
||||
GetOptions(
|
||||
'ignore-pid' => \$opt_ignorepid,
|
||||
'read-procstat' => \$opt_read_procstat,
|
||||
);
|
||||
|
||||
# Defaults for dynamically discovered regex's
|
||||
my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
|
||||
my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
|
||||
my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
|
||||
my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
|
||||
my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
|
||||
my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) file=([0-9]*)';
|
||||
my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
|
||||
my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
|
||||
my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
|
||||
|
||||
# Dynamically discovered regex
|
||||
my $regex_direct_begin;
|
||||
my $regex_direct_end;
|
||||
my $regex_kswapd_wake;
|
||||
my $regex_kswapd_sleep;
|
||||
my $regex_wakeup_kswapd;
|
||||
my $regex_lru_isolate;
|
||||
my $regex_lru_shrink_inactive;
|
||||
my $regex_lru_shrink_active;
|
||||
my $regex_writepage;
|
||||
|
||||
# Static regex used. Specified like this for readability and for use with /o
|
||||
# (process_pid) (cpus ) ( time ) (tpoint ) (details)
|
||||
my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])(\s*[dX.][Nnp.][Hhs.][0-9a-fA-F.]*|)\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
|
||||
my $regex_statname = '[-0-9]*\s\((.*)\).*';
|
||||
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
|
||||
|
||||
sub generate_traceevent_regex {
|
||||
my $event = shift;
|
||||
my $default = shift;
|
||||
my $regex;
|
||||
|
||||
# Read the event format or use the default
|
||||
if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
|
||||
print("WARNING: Event $event format string not found\n");
|
||||
return $default;
|
||||
} else {
|
||||
my $line;
|
||||
while (!eof(FORMAT)) {
|
||||
$line = <FORMAT>;
|
||||
$line =~ s/, REC->.*//;
|
||||
if ($line =~ /^print fmt:\s"(.*)".*/) {
|
||||
$regex = $1;
|
||||
$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
|
||||
$regex =~ s/%p/\([0-9a-f]*\)/g;
|
||||
$regex =~ s/%d/\([-0-9]*\)/g;
|
||||
$regex =~ s/%ld/\([-0-9]*\)/g;
|
||||
$regex =~ s/%lu/\([0-9]*\)/g;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Can't handle the print_flags stuff but in the context of this
|
||||
# script, it really doesn't matter
|
||||
$regex =~ s/\(REC.*\) \? __print_flags.*//;
|
||||
|
||||
# Verify fields are in the right order
|
||||
my $tuple;
|
||||
foreach $tuple (split /\s/, $regex) {
|
||||
my ($key, $value) = split(/=/, $tuple);
|
||||
my $expected = shift;
|
||||
if ($key ne $expected) {
|
||||
print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
|
||||
$regex =~ s/$key=\((.*)\)/$key=$1/;
|
||||
}
|
||||
}
|
||||
|
||||
if (defined shift) {
|
||||
die("Fewer fields than expected in format");
|
||||
}
|
||||
|
||||
return $regex;
|
||||
}
|
||||
|
||||
$regex_direct_begin = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_direct_reclaim_begin",
|
||||
$regex_direct_begin_default,
|
||||
"order", "may_writepage",
|
||||
"gfp_flags");
|
||||
$regex_direct_end = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_direct_reclaim_end",
|
||||
$regex_direct_end_default,
|
||||
"nr_reclaimed");
|
||||
$regex_kswapd_wake = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_kswapd_wake",
|
||||
$regex_kswapd_wake_default,
|
||||
"nid", "order");
|
||||
$regex_kswapd_sleep = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_kswapd_sleep",
|
||||
$regex_kswapd_sleep_default,
|
||||
"nid");
|
||||
$regex_wakeup_kswapd = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_wakeup_kswapd",
|
||||
$regex_wakeup_kswapd_default,
|
||||
"nid", "zid", "order");
|
||||
$regex_lru_isolate = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_lru_isolate",
|
||||
$regex_lru_isolate_default,
|
||||
"isolate_mode", "order",
|
||||
"nr_requested", "nr_scanned", "nr_taken",
|
||||
"file");
|
||||
$regex_lru_shrink_inactive = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_lru_shrink_inactive",
|
||||
$regex_lru_shrink_inactive_default,
|
||||
"nid", "zid",
|
||||
"nr_scanned", "nr_reclaimed", "priority",
|
||||
"flags");
|
||||
$regex_lru_shrink_active = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_lru_shrink_active",
|
||||
$regex_lru_shrink_active_default,
|
||||
"nid", "zid",
|
||||
"lru",
|
||||
"nr_scanned", "nr_rotated", "priority");
|
||||
$regex_writepage = generate_traceevent_regex(
|
||||
"vmscan/mm_vmscan_writepage",
|
||||
$regex_writepage_default,
|
||||
"page", "pfn", "flags");
|
||||
|
||||
sub read_statline($) {
|
||||
my $pid = $_[0];
|
||||
my $statline;
|
||||
|
||||
if (open(STAT, "/proc/$pid/stat")) {
|
||||
$statline = <STAT>;
|
||||
close(STAT);
|
||||
}
|
||||
|
||||
if ($statline eq '') {
|
||||
$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
|
||||
}
|
||||
|
||||
return $statline;
|
||||
}
|
||||
|
||||
sub guess_process_pid($$) {
|
||||
my $pid = $_[0];
|
||||
my $statline = $_[1];
|
||||
|
||||
if ($pid == 0) {
|
||||
return "swapper-0";
|
||||
}
|
||||
|
||||
if ($statline !~ /$regex_statname/o) {
|
||||
die("Failed to math stat line for process name :: $statline");
|
||||
}
|
||||
return "$1-$pid";
|
||||
}
|
||||
|
||||
# Convert sec.usec timestamp format
|
||||
sub timestamp_to_ms($) {
|
||||
my $timestamp = $_[0];
|
||||
|
||||
my ($sec, $usec) = split (/\./, $timestamp);
|
||||
return ($sec * 1000) + ($usec / 1000);
|
||||
}
|
||||
|
||||
sub process_events {
|
||||
my $traceevent;
|
||||
my $process_pid;
|
||||
my $cpus;
|
||||
my $timestamp;
|
||||
my $tracepoint;
|
||||
my $details;
|
||||
my $statline;
|
||||
|
||||
# Read each line of the event log
|
||||
EVENT_PROCESS:
|
||||
while ($traceevent = <STDIN>) {
|
||||
if ($traceevent =~ /$regex_traceevent/o) {
|
||||
$process_pid = $1;
|
||||
$timestamp = $4;
|
||||
$tracepoint = $5;
|
||||
|
||||
$process_pid =~ /(.*)-([0-9]*)$/;
|
||||
my $process = $1;
|
||||
my $pid = $2;
|
||||
|
||||
if ($process eq "") {
|
||||
$process = $last_procmap{$pid};
|
||||
$process_pid = "$process-$pid";
|
||||
}
|
||||
$last_procmap{$pid} = $process;
|
||||
|
||||
if ($opt_read_procstat) {
|
||||
$statline = read_statline($pid);
|
||||
if ($opt_read_procstat && $process eq '') {
|
||||
$process_pid = guess_process_pid($pid, $statline);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
next;
|
||||
}
|
||||
|
||||
# Perl Switch() sucks majorly
|
||||
if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
|
||||
$timestamp = timestamp_to_ms($timestamp);
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
|
||||
$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
|
||||
|
||||
$details = $6;
|
||||
if ($details !~ /$regex_direct_begin/o) {
|
||||
print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
|
||||
print " $details\n";
|
||||
print " $regex_direct_begin\n";
|
||||
next;
|
||||
}
|
||||
my $order = $1;
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
|
||||
$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
|
||||
} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
|
||||
# Count the event itself
|
||||
my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
|
||||
|
||||
# Record how long direct reclaim took this time
|
||||
if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
|
||||
$timestamp = timestamp_to_ms($timestamp);
|
||||
my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
|
||||
my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
|
||||
$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
|
||||
}
|
||||
} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
|
||||
$details = $6;
|
||||
if ($details !~ /$regex_kswapd_wake/o) {
|
||||
print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
|
||||
print " $details\n";
|
||||
print " $regex_kswapd_wake\n";
|
||||
next;
|
||||
}
|
||||
|
||||
my $order = $2;
|
||||
$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
|
||||
if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
|
||||
$timestamp = timestamp_to_ms($timestamp);
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
|
||||
$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
|
||||
$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
|
||||
}
|
||||
} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
|
||||
|
||||
# Count the event itself
|
||||
my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
|
||||
|
||||
# Record how long kswapd was awake
|
||||
$timestamp = timestamp_to_ms($timestamp);
|
||||
my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
|
||||
my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
|
||||
$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
|
||||
$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
|
||||
} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
|
||||
|
||||
$details = $6;
|
||||
if ($details !~ /$regex_wakeup_kswapd/o) {
|
||||
print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
|
||||
print " $details\n";
|
||||
print " $regex_wakeup_kswapd\n";
|
||||
next;
|
||||
}
|
||||
my $order = $3;
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
|
||||
} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
|
||||
$details = $6;
|
||||
if ($details !~ /$regex_lru_isolate/o) {
|
||||
print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
|
||||
print " $details\n";
|
||||
print " $regex_lru_isolate/o\n";
|
||||
next;
|
||||
}
|
||||
my $isolate_mode = $1;
|
||||
my $nr_scanned = $4;
|
||||
my $file = $6;
|
||||
|
||||
# To closer match vmstat scanning statistics, only count isolate_both
|
||||
# and isolate_inactive as scanning. isolate_active is rotation
|
||||
# isolate_inactive == 1
|
||||
# isolate_active == 2
|
||||
# isolate_both == 3
|
||||
if ($isolate_mode != 2) {
|
||||
$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
|
||||
if ($file == 1) {
|
||||
$perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED} += $nr_scanned;
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED} += $nr_scanned;
|
||||
}
|
||||
}
|
||||
} elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") {
|
||||
$details = $6;
|
||||
if ($details !~ /$regex_lru_shrink_inactive/o) {
|
||||
print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n";
|
||||
print " $details\n";
|
||||
print " $regex_lru_shrink_inactive/o\n";
|
||||
next;
|
||||
}
|
||||
|
||||
my $nr_reclaimed = $4;
|
||||
my $flags = $6;
|
||||
my $file = 0;
|
||||
if ($flags =~ /RECLAIM_WB_FILE/) {
|
||||
$file = 1;
|
||||
}
|
||||
$perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed;
|
||||
if ($file) {
|
||||
$perprocesspid{$process_pid}->{HIGH_NR_FILE_RECLAIMED} += $nr_reclaimed;
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{HIGH_NR_ANON_RECLAIMED} += $nr_reclaimed;
|
||||
}
|
||||
} elsif ($tracepoint eq "mm_vmscan_writepage") {
|
||||
$details = $6;
|
||||
if ($details !~ /$regex_writepage/o) {
|
||||
print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
|
||||
print " $details\n";
|
||||
print " $regex_writepage\n";
|
||||
next;
|
||||
}
|
||||
|
||||
my $flags = $3;
|
||||
my $file = 0;
|
||||
my $sync_io = 0;
|
||||
if ($flags =~ /RECLAIM_WB_FILE/) {
|
||||
$file = 1;
|
||||
}
|
||||
if ($flags =~ /RECLAIM_WB_SYNC/) {
|
||||
$sync_io = 1;
|
||||
}
|
||||
if ($sync_io) {
|
||||
if ($file) {
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++;
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++;
|
||||
}
|
||||
} else {
|
||||
if ($file) {
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++;
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
|
||||
}
|
||||
|
||||
if ($sigint_pending) {
|
||||
last EVENT_PROCESS;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
sub dump_stats {
|
||||
my $hashref = shift;
|
||||
my %stats = %$hashref;
|
||||
|
||||
# Dump per-process stats
|
||||
my $process_pid;
|
||||
my $max_strlen = 0;
|
||||
|
||||
# Get the maximum process name
|
||||
foreach $process_pid (keys %perprocesspid) {
|
||||
my $len = length($process_pid);
|
||||
if ($len > $max_strlen) {
|
||||
$max_strlen = $len;
|
||||
}
|
||||
}
|
||||
$max_strlen += 2;
|
||||
|
||||
# Work out latencies
|
||||
printf("\n") if !$opt_ignorepid;
|
||||
printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
|
||||
foreach $process_pid (keys %stats) {
|
||||
|
||||
if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
|
||||
!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
|
||||
next;
|
||||
}
|
||||
|
||||
printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
|
||||
my $index = 0;
|
||||
while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
|
||||
defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
|
||||
|
||||
if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
|
||||
printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
|
||||
my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
|
||||
$total_direct_latency += $latency;
|
||||
} else {
|
||||
printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
|
||||
my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
|
||||
$total_kswapd_latency += $latency;
|
||||
}
|
||||
$index++;
|
||||
}
|
||||
print "\n" if !$opt_ignorepid;
|
||||
}
|
||||
|
||||
# Print out process activity
|
||||
printf("\n");
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time");
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled");
|
||||
foreach $process_pid (keys %stats) {
|
||||
|
||||
if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
|
||||
next;
|
||||
}
|
||||
|
||||
$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
|
||||
$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
|
||||
$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
|
||||
$total_direct_nr_file_scanned += $stats{$process_pid}->{HIGH_NR_FILE_SCANNED};
|
||||
$total_direct_nr_anon_scanned += $stats{$process_pid}->{HIGH_NR_ANON_SCANNED};
|
||||
$total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
|
||||
$total_direct_nr_file_reclaimed += $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED};
|
||||
$total_direct_nr_anon_reclaimed += $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED};
|
||||
$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
|
||||
$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
|
||||
$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
|
||||
|
||||
$total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
|
||||
|
||||
my $index = 0;
|
||||
my $this_reclaim_delay = 0;
|
||||
while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
|
||||
my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
|
||||
$this_reclaim_delay += $latency;
|
||||
$index++;
|
||||
}
|
||||
|
||||
printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f",
|
||||
$process_pid,
|
||||
$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
|
||||
$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
|
||||
$stats{$process_pid}->{HIGH_NR_SCANNED},
|
||||
$stats{$process_pid}->{HIGH_NR_FILE_SCANNED},
|
||||
$stats{$process_pid}->{HIGH_NR_ANON_SCANNED},
|
||||
$stats{$process_pid}->{HIGH_NR_RECLAIMED},
|
||||
$stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED},
|
||||
$stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED},
|
||||
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
|
||||
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
|
||||
$this_reclaim_delay / 1000);
|
||||
|
||||
if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
|
||||
print " ";
|
||||
for (my $order = 0; $order < 20; $order++) {
|
||||
my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
|
||||
if ($count != 0) {
|
||||
print "direct-$order=$count ";
|
||||
}
|
||||
}
|
||||
}
|
||||
if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
|
||||
print " ";
|
||||
for (my $order = 0; $order < 20; $order++) {
|
||||
my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
|
||||
if ($count != 0) {
|
||||
print "wakeup-$order=$count ";
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
print "\n";
|
||||
}
|
||||
|
||||
# Print out kswapd activity
|
||||
printf("\n");
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages");
|
||||
printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO");
|
||||
foreach $process_pid (keys %stats) {
|
||||
|
||||
if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
|
||||
next;
|
||||
}
|
||||
|
||||
$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
|
||||
$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
|
||||
$total_kswapd_nr_file_scanned += $stats{$process_pid}->{HIGH_NR_FILE_SCANNED};
|
||||
$total_kswapd_nr_anon_scanned += $stats{$process_pid}->{HIGH_NR_ANON_SCANNED};
|
||||
$total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
|
||||
$total_kswapd_nr_file_reclaimed += $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED};
|
||||
$total_kswapd_nr_anon_reclaimed += $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED};
|
||||
$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
|
||||
$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
|
||||
$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
|
||||
$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
|
||||
|
||||
printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u",
|
||||
$process_pid,
|
||||
$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
|
||||
$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
|
||||
$stats{$process_pid}->{HIGH_NR_SCANNED},
|
||||
$stats{$process_pid}->{HIGH_NR_FILE_SCANNED},
|
||||
$stats{$process_pid}->{HIGH_NR_ANON_SCANNED},
|
||||
$stats{$process_pid}->{HIGH_NR_RECLAIMED},
|
||||
$stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED},
|
||||
$stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED},
|
||||
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
|
||||
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
|
||||
|
||||
if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
|
||||
print " ";
|
||||
for (my $order = 0; $order < 20; $order++) {
|
||||
my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
|
||||
if ($count != 0) {
|
||||
print "wake-$order=$count ";
|
||||
}
|
||||
}
|
||||
}
|
||||
if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
|
||||
print " ";
|
||||
for (my $order = 0; $order < 20; $order++) {
|
||||
my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
|
||||
if ($count != 0) {
|
||||
print "rewake-$order=$count ";
|
||||
}
|
||||
}
|
||||
}
|
||||
printf("\n");
|
||||
}
|
||||
|
||||
# Print out summaries
|
||||
$total_direct_latency /= 1000;
|
||||
$total_kswapd_latency /= 1000;
|
||||
print "\nSummary\n";
|
||||
print "Direct reclaims: $total_direct_reclaim\n";
|
||||
print "Direct reclaim pages scanned: $total_direct_nr_scanned\n";
|
||||
print "Direct reclaim file pages scanned: $total_direct_nr_file_scanned\n";
|
||||
print "Direct reclaim anon pages scanned: $total_direct_nr_anon_scanned\n";
|
||||
print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n";
|
||||
print "Direct reclaim file pages reclaimed: $total_direct_nr_file_reclaimed\n";
|
||||
print "Direct reclaim anon pages reclaimed: $total_direct_nr_anon_reclaimed\n";
|
||||
print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n";
|
||||
print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n";
|
||||
print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n";
|
||||
print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n";
|
||||
print "Wake kswapd requests: $total_wakeup_kswapd\n";
|
||||
printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency;
|
||||
print "\n";
|
||||
print "Kswapd wakeups: $total_kswapd_wake\n";
|
||||
print "Kswapd pages scanned: $total_kswapd_nr_scanned\n";
|
||||
print "Kswapd file pages scanned: $total_kswapd_nr_file_scanned\n";
|
||||
print "Kswapd anon pages scanned: $total_kswapd_nr_anon_scanned\n";
|
||||
print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n";
|
||||
print "Kswapd file pages reclaimed: $total_kswapd_nr_file_reclaimed\n";
|
||||
print "Kswapd anon pages reclaimed: $total_kswapd_nr_anon_reclaimed\n";
|
||||
print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n";
|
||||
print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n";
|
||||
print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n";
|
||||
print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n";
|
||||
printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency;
|
||||
}
|
||||
|
||||
sub aggregate_perprocesspid() {
|
||||
my $process_pid;
|
||||
my $process;
|
||||
undef %perprocess;
|
||||
|
||||
foreach $process_pid (keys %perprocesspid) {
|
||||
$process = $process_pid;
|
||||
$process =~ s/-([0-9])*$//;
|
||||
if ($process eq '') {
|
||||
$process = "NO_PROCESS_NAME";
|
||||
}
|
||||
|
||||
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
|
||||
$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
|
||||
$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
|
||||
$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
|
||||
$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
|
||||
$perprocess{$process}->{HIGH_NR_FILE_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED};
|
||||
$perprocess{$process}->{HIGH_NR_ANON_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED};
|
||||
$perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED};
|
||||
$perprocess{$process}->{HIGH_NR_FILE_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_FILE_RECLAIMED};
|
||||
$perprocess{$process}->{HIGH_NR_ANON_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_ANON_RECLAIMED};
|
||||
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
|
||||
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
|
||||
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
|
||||
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
|
||||
|
||||
for (my $order = 0; $order < 20; $order++) {
|
||||
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
|
||||
$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
|
||||
$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
|
||||
|
||||
}
|
||||
|
||||
# Aggregate direct reclaim latencies
|
||||
my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
|
||||
my $rd_index = 0;
|
||||
while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
|
||||
$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
|
||||
$rd_index++;
|
||||
$wr_index++;
|
||||
}
|
||||
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
|
||||
|
||||
# Aggregate kswapd latencies
|
||||
my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
|
||||
my $rd_index = 0;
|
||||
while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
|
||||
$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
|
||||
$rd_index++;
|
||||
$wr_index++;
|
||||
}
|
||||
$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
|
||||
}
|
||||
}
|
||||
|
||||
sub report() {
|
||||
if (!$opt_ignorepid) {
|
||||
dump_stats(\%perprocesspid);
|
||||
} else {
|
||||
aggregate_perprocesspid();
|
||||
dump_stats(\%perprocess);
|
||||
}
|
||||
}
|
||||
|
||||
# Process events or signals until neither is available
|
||||
sub signal_loop() {
|
||||
my $sigint_processed;
|
||||
do {
|
||||
$sigint_processed = 0;
|
||||
process_events();
|
||||
|
||||
# Handle pending signals if any
|
||||
if ($sigint_pending) {
|
||||
my $current_time = time;
|
||||
|
||||
if ($sigint_exit) {
|
||||
print "Received exit signal\n";
|
||||
$sigint_pending = 0;
|
||||
}
|
||||
if ($sigint_report) {
|
||||
if ($current_time >= $sigint_received + 2) {
|
||||
report();
|
||||
$sigint_report = 0;
|
||||
$sigint_pending = 0;
|
||||
$sigint_processed = 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
} while ($sigint_pending || $sigint_processed);
|
||||
}
|
||||
|
||||
signal_loop();
|
||||
report();
|
955
Documentation/trace/ring-buffer-design.txt
Normal file

Lockless Ring Buffer Design
|
||||
===========================
|
||||
|
||||
Copyright 2009 Red Hat Inc.
|
||||
Author: Steven Rostedt <srostedt@redhat.com>
|
||||
License: The GNU Free Documentation License, Version 1.2
|
||||
(dual licensed under the GPL v2)
|
||||
Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
|
||||
and Frederic Weisbecker.
|
||||
|
||||
|
||||
Written for: 2.6.31
|
||||
|
||||
Terminology used in this Document
|
||||
---------------------------------
|
||||
|
||||
tail - where new writes happen in the ring buffer.
|
||||
|
||||
head - where new reads happen in the ring buffer.
|
||||
|
||||
producer - the task that writes into the ring buffer (same as writer)
|
||||
|
||||
writer - same as producer
|
||||
|
||||
consumer - the task that reads from the buffer (same as reader)
|
||||
|
||||
reader - same as consumer.
|
||||
|
||||
reader_page - A page outside the ring buffer used solely (for the most part)
|
||||
by the reader.
|
||||
|
||||
head_page - a pointer to the page that the reader will use next
|
||||
|
||||
tail_page - a pointer to the page that will be written to next
|
||||
|
||||
commit_page - a pointer to the page with the last finished non-nested write.
|
||||
|
||||
cmpxchg - hardware-assisted atomic transaction that performs the following:
|
||||
|
||||
A = B iff previous A == C
|
||||
|
||||
R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
|
||||
current A is equal to C, and we put the old (current) A into R
|
||||
|
||||
R gets the previous A regardless if A is updated with B or not.
|
||||
|
||||
To see if the update was successful, a compare of R == C may be used.
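
The following is a small, non-atomic C sketch of these semantics
(illustrative only; a real cmpxchg performs the compare and the store
as a single atomic step in hardware):

	static long cmpxchg_model(long *A, long C, long B)
	{
		long R = *A;	/* R always receives the previous value of A */

		if (R == C)	/* replace A with B only if A still equals C */
			*A = B;

		return R;	/* the caller tests R == C to see if it won  */
	}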
|
||||
|
||||
The Generic Ring Buffer
|
||||
-----------------------
|
||||
|
||||
The ring buffer can be used in either an overwrite mode or in
|
||||
producer/consumer mode.
|
||||
|
||||
Producer/consumer mode is where if the producer were to fill up the
|
||||
buffer before the consumer could free up anything, the producer
|
||||
will stop writing to the buffer. This will lose the most recent events.
|
||||
|
||||
Overwrite mode is where if the producer were to fill up the buffer
|
||||
before the consumer could free up anything, the producer will
|
||||
overwrite the older data. This will lose the oldest events.
|
||||
|
||||
No two writers can write at the same time (on the same per-cpu buffer).
A writer may interrupt another writer, but it must finish writing
before the previous writer may continue. This is very important to the
algorithm. The writers act like a "stack". The way interrupts work
enforces this behavior.
|
||||
|
||||
|
||||
writer1 start
|
||||
<preempted> writer2 start
|
||||
<preempted> writer3 start
|
||||
writer3 finishes
|
||||
writer2 finishes
|
||||
writer1 finishes
|
||||
|
||||
This is very much like a writer being preempted by an interrupt and
|
||||
the interrupt doing a write as well.
|
||||
|
||||
Readers can happen at any time. But no two readers may run at the
|
||||
same time, nor can a reader preempt/interrupt another reader. A reader
|
||||
cannot preempt/interrupt a writer, but it may read/consume from the
|
||||
buffer at the same time as a writer is writing, but the reader must be
|
||||
on another processor to do so. A reader may read on its own processor
|
||||
and can be preempted by a writer.
|
||||
|
||||
A writer can preempt a reader, but a reader cannot preempt a writer.
|
||||
But a reader can read the buffer at the same time (on another processor)
|
||||
as a writer.
|
||||
|
||||
The ring buffer is made up of a list of pages held together by a linked list.
|
||||
|
||||
At initialization a reader page is allocated for the reader that is not
|
||||
part of the ring buffer.
|
||||
|
||||
The head_page, tail_page and commit_page are all initialized to point
|
||||
to the same page.
|
||||
|
||||
The reader page is initialized to have its next pointer pointing to
|
||||
the head page, and its previous pointer pointing to a page before
|
||||
the head page.
|
||||
|
||||
The reader has its own page to use. At start up time, this page is
|
||||
allocated but is not attached to the list. When the reader wants
|
||||
to read from the buffer, if its page is empty (like it is on start-up),
|
||||
it will swap its page with the head_page. The old reader page will
|
||||
become part of the ring buffer and the head_page will be removed.
|
||||
The page after the inserted page (old reader_page) will become the
|
||||
new head page.
|
||||
|
||||
Once the new page is given to the reader, the reader could do what
|
||||
it wants with it, as long as a writer has left that page.
|
||||
|
||||
A sample of how the reader page is swapped: Note this does not
|
||||
show the head page in the buffer, it is for demonstrating a swap
|
||||
only.
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |
|
||||
+------+
|
||||
+---+ +---+ +---+
|
||||
| |-->| |-->| |
|
||||
| |<--| |<--| |
|
||||
+---+ +---+ +---+
|
||||
^ | ^ |
|
||||
| +-------------+ |
|
||||
+-----------------+
|
||||
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------------------+
|
||||
+------+ v
|
||||
| +---+ +---+ +---+
|
||||
| | |-->| |-->| |
|
||||
| | |<--| |<--| |<-+
|
||||
| +---+ +---+ +---+ |
|
||||
| ^ | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------------------+
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | |-->| |-->| |
|
||||
| | | | | |<--| |<-+
|
||||
| | +---+ +---+ +---+ |
|
||||
| | | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
+------+
|
||||
|buffer| RING BUFFER
|
||||
|page |-------------------+
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | | | |-->| |
|
||||
| | New | | | |<--| |<-+
|
||||
| | Reader +---+ +---+ +---+ |
|
||||
| | page ----^ | |
|
||||
| | | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
|
||||
|
||||
It is possible that the page swapped is the commit page and the tail page,
|
||||
if what is in the ring buffer is less than what is held in a buffer page.
|
||||
|
||||
|
||||
reader page commit page tail page
|
||||
| | |
|
||||
v | |
|
||||
+---+ | |
|
||||
| |<----------+ |
|
||||
| |<------------------------+
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
This case is still valid for this algorithm.
|
||||
When the writer leaves the page, it simply goes into the ring buffer
|
||||
since the reader page still points to the next location in the ring
|
||||
buffer.
|
||||
|
||||
|
||||
The main pointers:
|
||||
|
||||
reader page - The page used solely by the reader and is not part
|
||||
of the ring buffer (may be swapped in)
|
||||
|
||||
head page - the next page in the ring buffer that will be swapped
|
||||
with the reader page.
|
||||
|
||||
tail page - the page where the next write will take place.
|
||||
|
||||
commit page - the page that last finished a write.
|
||||
|
||||
The commit page is updated only by the outermost writer in the
|
||||
writer stack. A writer that preempts another writer will not move the
|
||||
commit page.
|
||||
|
||||
When data is written into the ring buffer, a position is reserved
|
||||
in the ring buffer and passed back to the writer. When the writer
|
||||
is finished writing data into that position, it commits the write.
|
||||
|
||||
Another write (or a read) may take place at any time during this
|
||||
transaction. If another write happens it must finish before continuing
|
||||
with the previous write.
|
||||
|
||||
|
||||
Write reserve:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--- given back to writer (current commit)
|
||||
|reserved |
|
||||
+---------+ <--- tail pointer
|
||||
| empty |
|
||||
+---------+
|
||||
|
||||
Write commit:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--- next position for write (current commit)
|
||||
| empty |
|
||||
+---------+
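
In kernel code the reserve and commit steps pictured above roughly
correspond to the sketch below; ring_buffer_lock_reserve(),
ring_buffer_event_data() and ring_buffer_unlock_commit() are the actual
entry points, while the surrounding function and the my_payload record
are made up for illustration and error handling is trimmed:

	#include <linux/ring_buffer.h>

	struct my_payload {			/* made-up example record */
		int value;
	};

	static void write_sample(struct ring_buffer *buffer, int value)
	{
		struct ring_buffer_event *event;
		struct my_payload *entry;

		/* reserve: a position in the buffer is handed back to the writer */
		event = ring_buffer_lock_reserve(buffer, sizeof(*entry));
		if (!event)
			return;		/* e.g. full in producer/consumer mode */

		entry = ring_buffer_event_data(event);
		entry->value = value;	/* fill in the reserved slot */

		/* commit: the write becomes visible (or pending, if it nested) */
		ring_buffer_unlock_commit(buffer, event);
	}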
|
||||
|
||||
|
||||
If a write happens after the first reserve:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <-- current commit
|
||||
|reserved |
|
||||
+---------+ <--- given back to second writer
|
||||
|reserved |
|
||||
+---------+ <--- tail pointer
|
||||
|
||||
After second writer commits:
|
||||
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--(last full commit)
|
||||
|reserved |
|
||||
+---------+
|
||||
|pending |
|
||||
|commit |
|
||||
+---------+ <--- tail pointer
|
||||
|
||||
When the first writer commits:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+
|
||||
|written |
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--(last full commit and tail pointer)
|
||||
|
||||
|
||||
The commit pointer points to the last write location that was
|
||||
committed without preempting another write. When a write that
|
||||
preempted another write is committed, it only becomes a pending commit
|
||||
and will not be a full commit until all writes have been committed.
|
||||
|
||||
The commit page points to the page that has the last full commit.
|
||||
The tail page points to the page with the last write (before
|
||||
committing).
|
||||
|
||||
The tail page is always equal to or after the commit page. It may
|
||||
be several pages ahead. If the tail page catches up to the commit
|
||||
page then no more writes may take place (regardless of the mode
|
||||
of the ring buffer: overwrite or producer/consumer).
|
||||
|
||||
The order of pages is:
|
||||
|
||||
head page
|
||||
commit page
|
||||
tail page
|
||||
|
||||
Possible scenario:
|
||||
tail page
|
||||
head page commit page |
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
There is a special case where the head page is after the commit page
and possibly the tail page. That is when the commit (and tail) page has been
|
||||
swapped with the reader page. This is because the head page is always
|
||||
part of the ring buffer, but the reader page is not. Whenever there
|
||||
has been less than a full page that has been committed inside the ring buffer,
|
||||
and a reader swaps out a page, it will be swapping out the commit page.
|
||||
|
||||
|
||||
reader page commit page tail page
|
||||
| | |
|
||||
v | |
|
||||
+---+ | |
|
||||
| |<----------+ |
|
||||
| |<------------------------+
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
|
||||
In this case, the head page will not move when the tail and commit
|
||||
move back into the ring buffer.
|
||||
|
||||
The reader cannot swap a page into the ring buffer if the commit page
|
||||
is still on that page. If the read meets the last commit (real commit
|
||||
not pending or reserved), then there is nothing more to read.
|
||||
The buffer is considered empty until another full commit finishes.
|
||||
|
||||
When the tail meets the head page, if the buffer is in overwrite mode,
|
||||
the head page will be pushed ahead one. If the buffer is in producer/consumer
|
||||
mode, the write will fail.
|
||||
|
||||
Overwrite mode:
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
Note, the reader page will still point to the previous head page.
|
||||
But when a swap takes place, it will use the most recent head page.
|
||||
|
||||
|
||||
Making the Ring Buffer Lockless:
|
||||
--------------------------------
|
||||
|
||||
The main idea behind the lockless algorithm is to combine the moving
|
||||
of the head_page pointer with the swapping of pages with the reader.
|
||||
State flags are placed inside the pointer to the page. To do this,
|
||||
each page must be aligned in memory by 4 bytes. This will allow the 2
|
||||
least significant bits of the address to be used as flags, since
|
||||
they will always be zero for the address. To get the address,
|
||||
simply mask out the flags.
|
||||
|
||||
MASK = ~3
|
||||
|
||||
address & MASK
|
||||
|
||||
Two flags will be kept by these two bits:
|
||||
|
||||
HEADER - the page being pointed to is a head page
|
||||
|
||||
UPDATE - the page being pointed to is being updated by a writer
|
||||
and was or is about to be a head page.
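
A minimal C sketch of taking such a tagged pointer apart follows; the
constants mirror the description above, but the helper names are
illustrative and not necessarily those used in kernel/trace/ring_buffer.c:

	struct rb_page;				/* opaque for this sketch */

	#define RB_FLAG_MASK	3UL		/* two low bits hold the state   */
	#define RB_PAGE_HEAD	1UL		/* HEADER: next page is the head */
	#define RB_PAGE_UPDATE	2UL		/* UPDATE: a writer is moving it */

	/* recover the real page address by masking out the state bits */
	static inline struct rb_page *rb_ptr(unsigned long next)
	{
		return (struct rb_page *)(next & ~RB_FLAG_MASK);
	}

	/* read just the state bits of a next pointer */
	static inline unsigned long rb_state(unsigned long next)
	{
		return next & RB_FLAG_MASK;
	}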
|
||||
|
||||
|
||||
reader page
|
||||
|
|
||||
v
|
||||
+---+
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
The above pointer "-H->" would have the HEADER flag set. That is
|
||||
the next page is the next page to be swapped out by the reader.
|
||||
This pointer means the next page is the head page.
|
||||
|
||||
When the tail page meets the head pointer, it will use cmpxchg to
|
||||
change the pointer to the UPDATE state:
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
"-U->" represents a pointer in the UPDATE state.
|
||||
|
||||
Any access to the reader will need to take some sort of lock to serialize
|
||||
the readers. But the writers will never take a lock to write to the
|
||||
ring buffer. This means we only need to worry about a single reader,
|
||||
and writes only preempt in "stack" formation.
|
||||
|
||||
When the reader tries to swap the page with the ring buffer, it
|
||||
will also use cmpxchg. If the flag bit in the pointer to the
|
||||
head page does not have the HEADER flag set, the compare will fail
|
||||
and the reader will need to look for the new head page and try again.
|
||||
Note, the flags UPDATE and HEADER are never set at the same time.
|
||||
|
||||
The reader swaps the reader page as follows:
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |
|
||||
+------+
|
||||
+---+ +---+ +---+
|
||||
| |--->| |--->| |
|
||||
| |<---| |<---| |
|
||||
+---+ +---+ +---+
|
||||
^ | ^ |
|
||||
| +---------------+ |
|
||||
+-----H-------------+
|
||||
|
||||
The reader sets the reader page next pointer as HEADER to the page after
|
||||
the head page.
|
||||
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------H-----------+
|
||||
+------+ v
|
||||
| +---+ +---+ +---+
|
||||
| | |--->| |--->| |
|
||||
| | |<---| |<---| |<-+
|
||||
| +---+ +---+ +---+ |
|
||||
| ^ | ^ | |
|
||||
| | +---------------+ | |
|
||||
| +-----H-------------+ |
|
||||
+--------------------------------------+
|
||||
|
||||
It does a cmpxchg with the pointer to the previous head page to make it
|
||||
point to the reader page. Note that the new pointer does not have the HEADER
|
||||
flag set. This action atomically moves the head page forward.
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------H-----------+
|
||||
+------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | |-->| |-->| |
|
||||
| | | |<--| |<--| |<-+
|
||||
| | +---+ +---+ +---+ |
|
||||
| | | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
After the new head page is set, the previous pointer of the head page is
|
||||
updated to the reader page.
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------H-----------+
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | |-->| |-->| |
|
||||
| | | | | |<--| |<-+
|
||||
| | +---+ +---+ +---+ |
|
||||
| | | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
+------+
|
||||
|buffer| RING BUFFER
|
||||
|page |-------H-----------+ <--- New head page
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | | | |-->| |
|
||||
| | New | | | |<--| |<-+
|
||||
| | Reader +---+ +---+ +---+ |
|
||||
| | page ----^ | |
|
||||
| | | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
Another important point: The page that the reader page points back to
|
||||
by its previous pointer (the one that now points to the new head page)
|
||||
never points back to the reader page. That is because the reader page is
|
||||
not part of the ring buffer. Traversing the ring buffer via the next pointers
|
||||
will always stay in the ring buffer. Traversing the ring buffer via the
|
||||
prev pointers may not.
|
||||
|
||||
Note, the way to determine a reader page is simply by examining the previous
|
||||
pointer of the page. If the next pointer of the previous page does not
|
||||
point back to the original page, then the original page is a reader page:
|
||||
|
||||
|
||||
+--------+
|
||||
| reader | next +----+
|
||||
| page |-------->| |<====== (buffer page)
|
||||
+--------+ +----+
|
||||
| | ^
|
||||
| v | next
|
||||
prev | +----+
|
||||
+------------->| |
|
||||
+----+
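
Expressed as a small C sketch under the same assumptions (the structure
and helper are made up; the state bits kept in the low bits of the
pointers are masked off before comparing):

	struct rb_page {
		struct rb_page *next;
		struct rb_page *prev;
	};

	/* a page is the reader page when its neighbour does not point back at it */
	static int page_is_reader(struct rb_page *page)
	{
		struct rb_page *back = (struct rb_page *)
			((unsigned long)page->prev->next & ~3UL);  /* strip state bits */

		return back != page;
	}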
|
||||
|
||||
The way the head page moves forward:
|
||||
|
||||
When the tail page meets the head page and the buffer is in overwrite mode
|
||||
and more writes take place, the head page must be moved forward before the
|
||||
writer may move the tail page. The way this is done is that the writer
|
||||
performs a cmpxchg to convert the pointer to the head page from the HEADER
|
||||
flag to have the UPDATE flag set. Once this is done, the reader will
|
||||
not be able to swap the head page from the buffer, nor will it be able to
|
||||
move the head page, until the writer is finished with the move.
|
||||
|
||||
This eliminates any races that the reader can have on the writer. The reader
|
||||
must spin, and this is why the reader cannot preempt the writer.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The following page will be made into the new head page.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
After the new head page has been set, we can set the old head page
|
||||
pointer back to NORMAL.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
After the head page has been moved, the tail page may now move forward.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
The above are the trivial updates. Now for the more complex scenarios.
|
||||
|
||||
|
||||
As stated before, if enough writes preempt the first write, the
|
||||
tail page may make it all the way around the buffer and meet the commit
|
||||
page. At this time, we must start dropping writes (usually with some kind
|
||||
of warning to the user). But what happens if the commit was still on the
|
||||
reader page? The commit page is not part of the ring buffer. The tail page
|
||||
must account for this.
|
||||
|
||||
|
||||
reader page commit page
|
||||
| |
|
||||
v |
|
||||
+---+ |
|
||||
| |<----------+
|
||||
| |
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
tail page
|
||||
|
||||
If the tail page were to simply push the head page forward, the commit when
|
||||
leaving the reader page would not be pointing to the correct page.
|
||||
|
||||
The solution to this is to test if the commit page is on the reader page
|
||||
before pushing the head page. If it is, then it can be assumed that the
|
||||
tail page wrapped the buffer, and we must drop new writes.
|
||||
|
||||
This is not a race condition, because the commit page can only be moved
|
||||
by the outermost writer (the writer that was preempted).
|
||||
This means that the commit will not move while a writer is moving the
|
||||
tail page. The reader cannot swap the reader page if it is also being
|
||||
used as the commit page. The reader can simply check that the commit
|
||||
is off the reader page. Once the commit page leaves the reader page
|
||||
it will never go back on it unless a reader does another swap with the
|
||||
buffer page that is also the commit page.
|
||||
|
||||
|
||||
Nested writes
|
||||
-------------
|
||||
|
||||
In the pushing forward of the tail page we must first push forward
|
||||
the head page if the head page is the next page. If the head page
|
||||
is not the next page, the tail page is simply updated with a cmpxchg.
|
||||
|
||||
Only writers move the tail page. This must be done atomically to protect
|
||||
against nested writers.
|
||||
|
||||
temp_page = tail_page
|
||||
next_page = temp_page->next
|
||||
cmpxchg(tail_page, temp_page, next_page)
|
||||
|
||||
The above will update the tail page if it is still pointing to the expected
|
||||
page. If this fails, a nested write pushed it forward, and the current write
|
||||
does not need to push it.
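
A user-space C rendering of the pseudo-code above might look as follows;
the GCC __sync builtin stands in for the kernel's cmpxchg and none of the
names are the kernel's own:

	struct rb_page {
		struct rb_page *next;
	};

	static void push_tail_page(struct rb_page **tail_page)
	{
		struct rb_page *temp_page = *tail_page;
		struct rb_page *next_page = temp_page->next;

		/*
		 * Succeeds only if no nested writer moved the tail page in the
		 * meantime; on failure nothing more is needed, the tail already
		 * points past temp_page.
		 */
		__sync_val_compare_and_swap(tail_page, temp_page, next_page);
	}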
|
||||
|
||||
|
||||
temp page
|
||||
|
|
||||
v
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Nested write comes in and moves the tail page forward:
|
||||
|
||||
tail page (moved by nested writer)
|
||||
temp page |
|
||||
| |
|
||||
v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The above would fail the cmpxchg, but since the tail page has already
|
||||
been moved forward, the writer will just try again to reserve storage
|
||||
on the new tail page.
|
||||
|
||||
But the moving of the head page is a bit more complex.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The write converts the head page pointer to UPDATE.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
But if a nested writer preempts here, it will see that the next
|
||||
page is a head page, but it is also nested. It will detect that
|
||||
it is nested and will save that information. The detection is the
|
||||
fact that it sees the UPDATE flag instead of a HEADER or NORMAL
|
||||
pointer.
|
||||
|
||||
The nested writer will set the new head page pointer.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
But it will not reset the update back to normal. Only the writer
|
||||
that converted a pointer from HEAD to UPDATE will convert it back
|
||||
to NORMAL.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
After the nested writer finishes, the outermost writer will convert
|
||||
the UPDATE pointer to NORMAL.
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
It can be even more complex if several nested writes came in and moved
|
||||
the tail page ahead several pages:
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The write converts the head page pointer to UPDATE.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Next writer comes in, and sees the update and sets up the new
|
||||
head page.
|
||||
|
||||
(second writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The nested writer moves the tail page forward. But does not set the old
|
||||
update page to NORMAL because it is not the outermost writer.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Another writer preempts and sees the page after the tail page is a head page.
|
||||
It changes it from HEAD to UPDATE.
|
||||
|
||||
(third writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-U->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The writer will move the head page forward:
|
||||
|
||||
|
||||
(third writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-U->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
But now that the third writer did change the HEAD flag to UPDATE, it
will convert it back to NORMAL:
|
||||
|
||||
|
||||
(third writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
Then it will move the tail page, and return back to the second writer.
|
||||
|
||||
|
||||
(second writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
The second writer will fail to move the tail page because it was already
|
||||
moved, so it will try again and add its data to the new tail page.
|
||||
It will return to the first writer.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The first writer cannot know atomically if the tail page moved
|
||||
while it updates the HEAD page. It will then update the head page to
|
||||
what it thinks is the new head page.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Since the cmpxchg returns the old value of the pointer, the first writer
|
||||
will see it succeeded in updating the pointer from NORMAL to HEAD.
|
||||
But as we can see, this is not good enough. It must also check to see
|
||||
if the tail page is either where it used to be or on the next page:
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
A B tail page
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
If tail page != A and tail page != B, then it must reset the pointer
|
||||
back to NORMAL. The fact that it only needs to worry about nested
|
||||
writers means that it only needs to check this after setting the HEAD page.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
A B tail page
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Now the writer can update the head page. This is also why the head page must
|
||||
remain in UPDATE and only reset by the outermost writer. This prevents
|
||||
the reader from seeing the incorrect head page.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
A B tail page
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
327
Documentation/trace/tracepoint-analysis.txt
Normal file

Notes on Analysing Behaviour Using Events and Tracepoints
|
||||
|
||||
Documentation written by Mel Gorman
|
||||
PCL information heavily based on email from Ingo Molnar
|
||||
|
||||
1. Introduction
|
||||
===============
|
||||
|
||||
Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
|
||||
creating custom kernel modules to register probe functions using the event
|
||||
tracing infrastructure.
|
||||
|
||||
Simplistically, tracepoints represent important events that can be
|
||||
taken in conjunction with other tracepoints to build a "Big Picture" of
|
||||
what is going on within the system. There are a large number of methods for
|
||||
gathering and interpreting these events. Lacking any current Best Practices,
|
||||
this document describes some of the methods that can be used.
|
||||
|
||||
This document assumes that debugfs is mounted on /sys/kernel/debug and that
|
||||
the appropriate tracing options have been configured into the kernel. It is
|
||||
assumed that the PCL tool tools/perf has been installed and is in your path.
|
||||
|
||||
2. Listing Available Events
|
||||
===========================
|
||||
|
||||
2.1 Standard Utilities
|
||||
----------------------
|
||||
|
||||
All possible events are visible from /sys/kernel/debug/tracing/events. Simply
|
||||
calling
|
||||
|
||||
$ find /sys/kernel/debug/tracing/events -type d
|
||||
|
||||
will give a fair indication of the number of events available.
|
||||
|
||||
2.2 PCL (Performance Counters for Linux)
|
||||
----------------------------------------
|
||||
|
||||
Discovery and enumeration of all counters and events, including tracepoints,
|
||||
are available with the perf tool. Getting a list of available events is a
|
||||
simple case of:
|
||||
|
||||
$ perf list 2>&1 | grep Tracepoint
|
||||
ext4:ext4_free_inode [Tracepoint event]
|
||||
ext4:ext4_request_inode [Tracepoint event]
|
||||
ext4:ext4_allocate_inode [Tracepoint event]
|
||||
ext4:ext4_write_begin [Tracepoint event]
|
||||
ext4:ext4_ordered_write_end [Tracepoint event]
|
||||
[ .... remaining output snipped .... ]
|
||||
|
||||
|
||||
3. Enabling Events
|
||||
==================
|
||||
|
||||
3.1 System-Wide Event Enabling
|
||||
------------------------------
|
||||
|
||||
See Documentation/trace/events.txt for a proper description of how events
|
||||
can be enabled system-wide. A short example of enabling all events related
|
||||
to page allocation would look something like:
|
||||
|
||||
$ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
|
||||
|
||||
3.2 System-Wide Event Enabling with SystemTap
|
||||
---------------------------------------------
|
||||
|
||||
In SystemTap, tracepoints are accessible using the kernel.trace() function
|
||||
call. The following is an example that reports every 5 seconds what processes
|
||||
were allocating the pages.
|
||||
|
||||
global page_allocs
|
||||
|
||||
probe kernel.trace("mm_page_alloc") {
|
||||
page_allocs[execname()]++
|
||||
}
|
||||
|
||||
function print_count() {
|
||||
printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
|
||||
foreach (proc in page_allocs-)
|
||||
printf("%-25d %s\n", page_allocs[proc], proc)
|
||||
printf ("\n")
|
||||
delete page_allocs
|
||||
}
|
||||
|
||||
probe timer.s(5) {
|
||||
print_count()
|
||||
}
|
||||
|
||||
3.3 System-Wide Event Enabling with PCL
|
||||
---------------------------------------
|
||||
|
||||
By specifying the -a switch and analysing sleep, the system-wide events
|
||||
for a duration of time can be examined.
|
||||
|
||||
$ perf stat -a \
|
||||
-e kmem:mm_page_alloc -e kmem:mm_page_free \
|
||||
-e kmem:mm_page_free_batched \
|
||||
sleep 10
|
||||
Performance counter stats for 'sleep 10':
|
||||
|
||||
9630 kmem:mm_page_alloc
|
||||
2143 kmem:mm_page_free
|
||||
7424 kmem:mm_page_free_batched
|
||||
|
||||
10.002577764 seconds time elapsed
|
||||
|
||||
Similarly, one could execute a shell and exit it as desired to get a report
|
||||
at that point.
|
||||
|
||||
3.4 Local Event Enabling
|
||||
------------------------
|
||||
|
||||
Documentation/trace/ftrace.txt describes how to enable events on a per-thread
|
||||
basis using set_ftrace_pid.
|
||||
|
||||
3.5 Local Event Enablement with PCL
|
||||
-----------------------------------
|
||||
|
||||
Events can be activated and tracked for the duration of a process on a local
|
||||
basis using PCL such as follows.
|
||||
|
||||
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
|
||||
-e kmem:mm_page_free_batched ./hackbench 10
|
||||
Time: 0.909
|
||||
|
||||
Performance counter stats for './hackbench 10':
|
||||
|
||||
17803 kmem:mm_page_alloc
|
||||
12398 kmem:mm_page_free
|
||||
4827 kmem:mm_page_free_batched
|
||||
|
||||
0.973913387 seconds time elapsed
|
||||
|
||||
4. Event Filtering
|
||||
==================
|
||||
|
||||
Documentation/trace/ftrace.txt covers in-depth how to filter events in
|
||||
ftrace. Obviously using grep and awk of trace_pipe is an option as well
|
||||
as any script reading trace_pipe.
|
||||
|
||||
5. Analysing Event Variances with PCL
|
||||
=====================================
|
||||
|
||||
Any workload can exhibit variances between runs and it can be important
|
||||
to know what the standard deviation is. By and large, this is left to the
|
||||
performance analyst to do by hand. In the event that the discrete event
|
||||
occurrences are useful to the performance analyst, then perf can be used.
|
||||
|
||||
$ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free \
|
||||
-e kmem:mm_page_free_batched ./hackbench 10
|
||||
Time: 0.890
|
||||
Time: 0.895
|
||||
Time: 0.915
|
||||
Time: 1.001
|
||||
Time: 0.899
|
||||
|
||||
Performance counter stats for './hackbench 10' (5 runs):
|
||||
|
||||
16630 kmem:mm_page_alloc ( +- 3.542% )
|
||||
11486 kmem:mm_page_free ( +- 4.771% )
|
||||
4730 kmem:mm_page_free_batched ( +- 2.325% )
|
||||
|
||||
0.982653002 seconds time elapsed ( +- 1.448% )
|
||||
|
||||
In the event that some higher-level event is required that depends on some
|
||||
aggregation of discrete events, then a script would need to be developed.
|
||||
|
||||
Using --repeat, it is also possible to view how events are fluctuating over
|
||||
time on a system-wide basis using -a and sleep.
|
||||
|
||||
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
|
||||
-e kmem:mm_page_free_batched \
|
||||
-a --repeat 10 \
|
||||
sleep 1
|
||||
Performance counter stats for 'sleep 1' (10 runs):
|
||||
|
||||
1066 kmem:mm_page_alloc ( +- 26.148% )
|
||||
182 kmem:mm_page_free ( +- 5.464% )
|
||||
890 kmem:mm_page_free_batched ( +- 30.079% )
|
||||
|
||||
1.002251757 seconds time elapsed ( +- 0.005% )
|
||||
|
||||
6. Higher-Level Analysis with Helper Scripts
|
||||
============================================
|
||||
|
||||
When events are enabled the events that are triggering can be read from
|
||||
/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
|
||||
options exist as well. By post-processing the output, further information can
|
||||
be gathered on-line as appropriate. Examples of post-processing might include
|
||||
|
||||
o Reading information from /proc for the PID that triggered the event
|
||||
o Deriving a higher-level event from a series of lower-level events.
|
||||
o Calculating latencies between two events
|
||||
|
||||
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
|
||||
script that can read trace_pipe from STDIN or a copy of a trace. When used
|
||||
on-line, it can be interrupted once to generate a report without exiting
|
||||
and twice to exit.
|
||||
|
||||
Simplistically, the script just reads STDIN and counts up events but it
|
||||
also can do more such as
|
||||
|
||||
o Derive high-level events from many low-level events. If a number of pages
|
||||
are freed to the main allocator from the per-CPU lists, it recognises
|
||||
that as one per-CPU drain even though there is no specific tracepoint
|
||||
for that event
|
||||
o It can aggregate based on PID or individual process number
|
||||
o In the event memory is getting externally fragmented, it reports
|
||||
on whether the fragmentation event was severe or moderate.
|
||||
o When receiving an event about a PID, it can record who the parent was so
|
||||
that if large numbers of events are coming from very short-lived
|
||||
processes, the parent process responsible for creating all the helpers
|
||||
can be identified
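
As a concrete illustration of the simplest behaviour described above, a
small C filter along the following lines could have trace_pipe piped into
it and count events per task. It is a sketch only: it uses a fixed-size
table and assumes the task name itself contains no '-':

	#include <stdio.h>
	#include <string.h>

	struct counter { char comm[64]; long hits; };
	static struct counter table[1024];
	static int entries;

	int main(void)
	{
		char line[512], comm[64];
		int i;

		while (fgets(line, sizeof(line), stdin)) {
			if (line[0] == '#')		/* skip header lines */
				continue;
			/* the task name is the first field, up to the "-pid" suffix */
			if (sscanf(line, " %63[^-]-", comm) != 1)
				continue;

			for (i = 0; i < entries; i++)
				if (!strcmp(table[i].comm, comm))
					break;
			if (i == entries) {
				if (entries == 1024)
					continue;	/* table full, drop the event */
				strcpy(table[entries++].comm, comm);
			}
			table[i].hits++;
		}

		for (i = 0; i < entries; i++)
			printf("%-25ld %s\n", table[i].hits, table[i].comm);

		return 0;
	}

It runs until its input is closed, at which point the per-task counts are
printed.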
|
||||
|
||||
7. Lower-Level Analysis with PCL
|
||||
================================
|
||||
|
||||
There may also be a requirement to identify what functions within a program
|
||||
were generating events within the kernel. To begin this sort of analysis, the
|
||||
data must be recorded. At the time of writing, this required root:
|
||||
|
||||
$ perf record -c 1 \
|
||||
-e kmem:mm_page_alloc -e kmem:mm_page_free \
|
||||
-e kmem:mm_page_free_batched \
|
||||
./hackbench 10
|
||||
Time: 0.894
|
||||
[ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
|
||||
|
||||
Note the use of '-c 1' to set the event period to sample. The default sample
|
||||
period is quite high to minimise overhead but the information collected can be
|
||||
very coarse as a result.
|
||||
|
||||
This recording produced a file called perf.data, which can be analysed using
|
||||
perf report.
|
||||
|
||||
$ perf report
|
||||
# Samples: 30922
|
||||
#
|
||||
# Overhead Command Shared Object
|
||||
# ........ ......... ................................
|
||||
#
|
||||
87.27% hackbench [vdso]
|
||||
6.85% hackbench /lib/i686/cmov/libc-2.9.so
|
||||
2.62% hackbench /lib/ld-2.9.so
|
||||
1.52% perf [vdso]
|
||||
1.22% hackbench ./hackbench
|
||||
0.48% hackbench [kernel]
|
||||
0.02% perf /lib/i686/cmov/libc-2.9.so
|
||||
0.01% perf /usr/bin/perf
|
||||
0.01% perf /lib/ld-2.9.so
|
||||
0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
|
||||
#
|
||||
# (For more details, try: perf report --sort comm,dso,symbol)
|
||||
#
|
||||
|
||||
According to this, the vast majority of events were triggered within
the VDSO. With simple binaries, this will often be the case, so let's
|
||||
take a slightly different example. In the course of writing this, it was
|
||||
noticed that X was generating an insane amount of page allocations so let's look
|
||||
at it:
|
||||
|
||||
$ perf record -c 1 -f \
|
||||
-e kmem:mm_page_alloc -e kmem:mm_page_free \
|
||||
-e kmem:mm_page_free_batched \
|
||||
-p `pidof X`
|
||||
|
||||
This was interrupted after a few seconds and
|
||||
|
||||
$ perf report
|
||||
# Samples: 27666
|
||||
#
|
||||
# Overhead Command Shared Object
|
||||
# ........ ....... .......................................
|
||||
#
|
||||
51.95% Xorg [vdso]
|
||||
47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
|
||||
0.09% Xorg /lib/i686/cmov/libc-2.9.so
|
||||
0.01% Xorg [kernel]
|
||||
#
|
||||
# (For more details, try: perf report --sort comm,dso,symbol)
|
||||
#
|
||||
|
||||
So, almost half of the events are occurring in a library. To get an idea which
|
||||
symbol:
|
||||
|
||||
$ perf report --sort comm,dso,symbol
|
||||
# Samples: 27666
|
||||
#
|
||||
# Overhead Command Shared Object Symbol
|
||||
# ........ ....... ....................................... ......
|
||||
#
|
||||
51.95% Xorg [vdso] [.] 0x000000ffffe424
|
||||
47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
|
||||
0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
|
||||
0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
|
||||
0.01% Xorg [kernel] [k] read_hpet
|
||||
0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
|
||||
0.00% Xorg [kernel] [k] ftrace_trace_userstack
|
||||
|
||||
To see where within the function pixmanFillsse2 things are going wrong:
|
||||
|
||||
$ perf annotate pixmanFillsse2
|
||||
[ ... ]
|
||||
0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
|
||||
: }
|
||||
:
|
||||
: extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
|
||||
: _mm_store_si128 (__m128i *__P, __m128i __B) : {
|
||||
: *__P = __B;
|
||||
12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
|
||||
0.00 : 34ef5: ff
|
||||
12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
|
||||
0.00 : 34efd: ff
|
||||
12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
|
||||
0.00 : 34f05: ff
|
||||
12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
|
||||
0.00 : 34f0d: ff
|
||||
12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
|
||||
12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
|
||||
12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
|
||||
12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
|
||||
|
||||
At a glance, it looks like the time is being spent copying pixmaps to
|
||||
the card. Further investigation would be needed to determine why pixmaps
|
||||
are being copied around so much but a starting point would be to take an
|
||||
ancient build of libpixman out of the library path where it was totally
|
||||
forgotten about from months ago!
|
145
Documentation/trace/tracepoints.txt
Normal file

Using the Linux Kernel Tracepoints
|
||||
|
||||
Mathieu Desnoyers
|
||||
|
||||
|
||||
This document introduces Linux Kernel Tracepoints and their use. It
|
||||
provides examples of how to insert tracepoints in the kernel and
|
||||
connect probe functions to them and provides some examples of probe
|
||||
functions.
|
||||
|
||||
|
||||
* Purpose of tracepoints
|
||||
|
||||
A tracepoint placed in code provides a hook to call a function (probe)
|
||||
that you can provide at runtime. A tracepoint can be "on" (a probe is
|
||||
connected to it) or "off" (no probe is attached). When a tracepoint is
|
||||
"off" it has no effect, except for adding a tiny time penalty
|
||||
(checking a condition for a branch) and space penalty (adding a few
|
||||
bytes for the function call at the end of the instrumented function
|
||||
and adding a data structure in a separate section). When a tracepoint
|
||||
is "on", the function you provide is called each time the tracepoint
|
||||
is executed, in the execution context of the caller. When the function
|
||||
provided ends its execution, it returns to the caller (continuing from
|
||||
the tracepoint site).
|
||||
|
||||
You can put tracepoints at important locations in the code. They are
|
||||
lightweight hooks that can pass an arbitrary number of parameters,
|
||||
whose prototypes are described in a tracepoint declaration placed in a
|
||||
header file.
|
||||
|
||||
They can be used for tracing and performance accounting.
|
||||
|
||||
|
||||
* Usage
|
||||
|
||||
Two elements are required for tracepoints:
|
||||
|
||||
- A tracepoint definition, placed in a header file.
|
||||
- The tracepoint statement, in C code.
|
||||
|
||||
In order to use tracepoints, you should include linux/tracepoint.h.
|
||||
|
||||
In include/trace/events/subsys.h :
|
||||
|
||||
#undef TRACE_SYSTEM
|
||||
#define TRACE_SYSTEM subsys
|
||||
|
||||
#if !defined(_TRACE_SUBSYS_H) || defined(TRACE_HEADER_MULTI_READ)
|
||||
#define _TRACE_SUBSYS_H
|
||||
|
||||
#include <linux/tracepoint.h>
|
||||
|
||||
DECLARE_TRACE(subsys_eventname,
|
||||
TP_PROTO(int firstarg, struct task_struct *p),
|
||||
TP_ARGS(firstarg, p));
|
||||
|
||||
#endif /* _TRACE_SUBSYS_H */
|
||||
|
||||
/* This part must be outside protection */
|
||||
#include <trace/define_trace.h>
|
||||
|
||||
In subsys/file.c (where the tracing statement must be added):
|
||||
|
||||
#include <trace/events/subsys.h>
|
||||
|
||||
#define CREATE_TRACE_POINTS
|
||||
DEFINE_TRACE(subsys_eventname);
|
||||
|
||||
void somefct(void)
|
||||
{
|
||||
...
|
||||
trace_subsys_eventname(arg, task);
|
||||
...
|
||||
}
|
||||
|
||||
Where:
|
||||
- subsys_eventname is an identifier unique to your event
|
||||
- subsys is the name of your subsystem.
|
||||
- eventname is the name of the event to trace.
|
||||
|
||||
- TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the
|
||||
function called by this tracepoint.
|
||||
|
||||
- TP_ARGS(firstarg, p) are the parameter names, the same as found in the
|
||||
prototype.
|
||||
|
||||
- if you use the header in multiple source files, #define CREATE_TRACE_POINTS
|
||||
should appear only in one source file.
|
||||
|
||||
Connecting a function (probe) to a tracepoint is done by providing a
|
||||
probe (function to call) for the specific tracepoint through
|
||||
register_trace_subsys_eventname(). Removing a probe is done through
unregister_trace_subsys_eventname().
|
||||
|
||||
tracepoint_synchronize_unregister() must be called before the end of
|
||||
the module exit function to make sure there is no caller left using
|
||||
the probe. This, and the fact that preemption is disabled around the
|
||||
probe call, make sure that probe removal and module unload are safe.
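
As a minimal sketch, a module that attaches a probe to the
subsys_eventname tracepoint declared above could look like the following.
It assumes the data-passing probe signature of recent kernels (the probe
receives the void * that was passed at registration as its first
argument); error handling is omitted:

	#include <linux/module.h>
	#include <linux/sched.h>
	#include <trace/events/subsys.h>

	/* runs in the caller's context, with preemption disabled */
	static void probe_subsys_eventname(void *data, int firstarg,
					   struct task_struct *p)
	{
		pr_info("subsys_eventname: firstarg=%d comm=%s\n",
			firstarg, p->comm);
	}

	static int __init probe_init(void)
	{
		return register_trace_subsys_eventname(probe_subsys_eventname, NULL);
	}

	static void __exit probe_exit(void)
	{
		unregister_trace_subsys_eventname(probe_subsys_eventname, NULL);
		tracepoint_synchronize_unregister();
	}

	module_init(probe_init);
	module_exit(probe_exit);
	MODULE_LICENSE("GPL");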
|
||||
|
||||
The tracepoint mechanism supports inserting multiple instances of the
|
||||
same tracepoint, but a single definition must be made of a given
|
||||
tracepoint name across the whole kernel to make sure no type conflict will
|
||||
occur. Name mangling of the tracepoints is done using the prototypes
|
||||
to make sure typing is correct. Verification of probe type correctness
|
||||
is done at the registration site by the compiler. Tracepoints can be
|
||||
put in inline functions, inlined static functions, and unrolled loops
|
||||
as well as regular functions.
|
||||
|
||||
The naming scheme "subsys_event" is suggested here as a convention
|
||||
intended to limit collisions. Tracepoint names are global to the
|
||||
kernel: they are considered as being the same whether they are in the
|
||||
core kernel image or in modules.
|
||||
|
||||
If the tracepoint has to be used in kernel modules, an
|
||||
EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be
|
||||
used to export the defined tracepoints.
|
||||
|
||||
If you need to do a bit of work for a tracepoint parameter, and
|
||||
that work is only used for the tracepoint, that work can be encapsulated
|
||||
within an if statement with the following:
|
||||
|
||||
if (trace_foo_bar_enabled()) {
|
||||
int i;
|
||||
int tot = 0;
|
||||
|
||||
for (i = 0; i < count; i++)
|
||||
tot += calculate_nuggets();
|
||||
|
||||
trace_foo_bar(tot);
|
||||
}
|
||||
|
||||
All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled()
|
||||
function defined that returns true if the tracepoint is enabled and
|
||||
false otherwise. The trace_<tracepoint>() should always be within the
|
||||
block of the if (trace_<tracepoint>_enabled()) to prevent races between
|
||||
the tracepoint being enabled and the check being seen.
|
||||
|
||||
The advantage of using the trace_<tracepoint>_enabled() is that it uses
|
||||
the static_key of the tracepoint to allow the if statement to be implemented
|
||||
with jump labels and avoid conditional branches.
|
||||
|
||||
Note: The convenience macro TRACE_EVENT provides an alternative way to
|
||||
define tracepoints. Check http://lwn.net/Articles/379903,
|
||||
http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
|
||||
for a series of articles with more details.
|
159
Documentation/trace/uprobetracer.txt
Normal file

Uprobe-tracer: Uprobe-based Event Tracing
|
||||
=========================================
|
||||
|
||||
Documentation written by Srikar Dronamraju
|
||||
|
||||
|
||||
Overview
|
||||
--------
|
||||
Uprobe based trace events are similar to kprobe based trace events.
|
||||
To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y.
|
||||
|
||||
Similar to the kprobe-event tracer, this doesn't need to be activated via
|
||||
current_tracer. Instead, add probe points via
|
||||
/sys/kernel/debug/tracing/uprobe_events, and enable it via
|
||||
/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
|
||||
|
||||
However, unlike the kprobe-event tracer, the uprobe event interface expects the
|
||||
user to calculate the offset of the probepoint in the object.
|
||||
|
||||
Synopsis of uprobe_tracer
|
||||
-------------------------
|
||||
p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
|
||||
r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
|
||||
-:[GRP/]EVENT : Clear uprobe or uretprobe event
|
||||
|
||||
GRP : Group name. If omitted, "uprobes" is the default value.
|
||||
EVENT : Event name. If omitted, the event name is generated based
|
||||
on PATH+OFFSET.
|
||||
PATH : Path to an executable or a library.
|
||||
OFFSET : Offset where the probe is inserted.
|
||||
|
||||
FETCHARGS : Arguments. Each probe can have up to 128 args.
|
||||
%REG : Fetch register REG
|
||||
@ADDR : Fetch memory at ADDR (ADDR should be in userspace)
|
||||
@+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH)
|
||||
$stackN : Fetch Nth entry of stack (N >= 0)
|
||||
$stack : Fetch stack address.
|
||||
$retval : Fetch return value.(*)
|
||||
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
|
||||
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
|
||||
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
|
||||
(u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
|
||||
are supported.
|
||||
|
||||
(*) only for return probe.
|
||||
(**) this is useful for fetching a field of data structures.
|
||||
|
||||
Types
|
||||
-----
|
||||
Several types are supported for fetch-args. Uprobe tracer will access memory
|
||||
by the given type. The prefixes 's' and 'u' mean the types are signed and unsigned
|
||||
respectively. Traced arguments are shown in decimal (signed) or hex (unsigned).
|
||||
String type is a special type, which fetches a "null-terminated" string from
|
||||
user space.
|
||||
Bitfield is another special type, which takes 3 parameters, bit-width, bit-
|
||||
offset, and container-size (usually 32). The syntax is:
|
||||
|
||||
b<bit-width>@<bit-offset>/<container-size>
|
||||
|
||||
|
||||
Event Profiling
|
||||
---------------
|
||||
You can check the total number of probe hits and probe miss-hits via
|
||||
/sys/kernel/debug/tracing/uprobe_profile.
|
||||
The first column is event name, the second is the number of probe hits,
|
||||
the third is the number of probe miss-hits.
|
||||
|
||||
Usage examples
|
||||
--------------
|
||||
* Add a probe as a new uprobe event, write a new definition to uprobe_events
|
||||
as below: (sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash)
|
||||
|
||||
echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
|
||||
|
||||
* Add a probe as a new uretprobe event:
|
||||
|
||||
echo 'r: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
|
||||
|
||||
* Unset registered event:
|
||||
|
||||
echo '-:bash_0x4245c0' >> /sys/kernel/debug/tracing/uprobe_events
|
||||
|
||||
* Print out the events that are registered:
|
||||
|
||||
cat /sys/kernel/debug/tracing/uprobe_events
|
||||
|
||||
* Clear all events:
|
||||
|
||||
echo > /sys/kernel/debug/tracing/uprobe_events
|
||||
|
||||
The following example shows how to dump the instruction pointer and %ax register
|
||||
at the probed text address. Probe zfree function in /bin/zsh:
|
||||
|
||||
# cd /sys/kernel/debug/tracing/
|
||||
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
|
||||
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
|
||||
# objdump -T /bin/zsh | grep -w zfree
|
||||
0000000000446420 g DF .text 0000000000000012 Base zfree
|
||||
|
||||
0x46420 is the offset of zfree in object /bin/zsh that is loaded at
|
||||
0x00400000. Hence the command to uprobe would be:
|
||||
|
||||
# echo 'p:zfree_entry /bin/zsh:0x46420 %ip %ax' > uprobe_events
|
||||
|
||||
And the same for the uretprobe would be:
|
||||
|
||||
# echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> uprobe_events
|
||||
|
||||
Please note: the user has to explicitly calculate the offset of the probe-point
|
||||
in the object. We can see the events that are registered by looking at the
|
||||
uprobe_events file.
|
||||
|
||||
# cat uprobe_events
|
||||
p:uprobes/zfree_entry /bin/zsh:0x00046420 arg1=%ip arg2=%ax
|
||||
r:uprobes/zfree_exit /bin/zsh:0x00046420 arg1=%ip arg2=%ax
|
||||
|
||||
Format of events can be seen by viewing the file events/uprobes/zfree_entry/format
|
||||
|
||||
# cat events/uprobes/zfree_entry/format
|
||||
name: zfree_entry
|
||||
ID: 922
|
||||
format:
|
||||
field:unsigned short common_type; offset:0; size:2; signed:0;
|
||||
field:unsigned char common_flags; offset:2; size:1; signed:0;
|
||||
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
|
||||
field:int common_pid; offset:4; size:4; signed:1;
|
||||
field:int common_padding; offset:8; size:4; signed:1;
|
||||
|
||||
field:unsigned long __probe_ip; offset:12; size:4; signed:0;
|
||||
field:u32 arg1; offset:16; size:4; signed:0;
|
||||
field:u32 arg2; offset:20; size:4; signed:0;
|
||||
|
||||
print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2
|
||||
|
||||
Right after definition, each event is disabled by default. For tracing these
|
||||
events, you need to enable them:
|
||||
|
||||
# echo 1 > events/uprobes/enable
|
||||
|
||||
Let's disable the event after sleeping for some time.
|
||||
|
||||
# sleep 20
|
||||
# echo 0 > events/uprobes/enable
|
||||
|
||||
And you can see the traced information via /sys/kernel/debug/tracing/trace.
|
||||
|
||||
# cat trace
|
||||
# tracer: nop
|
||||
#
|
||||
# TASK-PID CPU# TIMESTAMP FUNCTION
|
||||
# | | | | |
|
||||
zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79
|
||||
zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
|
||||
zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79
|
||||
zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
|
||||
|
||||
The output shows that the uprobe was triggered for PID 24842, with the ip being
0x446420 and the contents of the ax register being 79. The uretprobe was
triggered with the ip at 0x446540 and the counterpart function entry at 0x446420.