<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<HTML><HEAD><TITLE>Man page of BPF classifier and actions in tc</TITLE>
|
|
</HEAD><BODY>
|
|
<H1>BPF classifier and actions in tc</H1>
|
|
Section: Linux (8)<BR>Updated: 18 May 2015<BR><A HREF="#index">Index</A>
|
|
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
|
|
|
|
<A NAME="lbAB"> </A>
|
|
<H2>NAME</H2>
|
|
|
|
BPF - BPF programmable classifier and actions for ingress/egress
|
|
queueing disciplines
|
|
<A NAME="lbAC"> </A>
|
|
<H2>SYNOPSIS</H2>
|
|
|
|
<A NAME="lbAD"> </A>
|
|
<H3>eBPF classifier (filter) or action:</H3>
|
|
|
|
<B>tc filter ... bpf</B>
|
|
|
|
[
|
|
<B>object-file</B>
|
|
|
|
OBJ_FILE ] [
|
|
<B>section</B>
|
|
|
|
CLS_NAME ] [
|
|
<B>export</B>
|
|
|
|
UDS_FILE ] [
|
|
<B>verbose</B>
|
|
|
|
] [
|
|
<B>direct-action</B>
|
|
|
|
|
|
|
<B>da</B>
|
|
|
|
] [
|
|
<B>skip_hw</B>
|
|
|
|
|
|
|
<B>skip_sw</B>
|
|
|
|
] [
|
|
<B>police</B>
|
|
|
|
POLICE_SPEC ] [
|
|
<B>action</B>
|
|
|
|
ACTION_SPEC ] [
|
|
<B>classid</B>
|
|
|
|
CLASSID ]
|
|
<BR>
|
|
|
|
<B>tc action ... bpf</B>
|
|
|
|
[
|
|
<B>object-file</B>
|
|
|
|
OBJ_FILE ] [
|
|
<B>section</B>
|
|
|
|
CLS_NAME ] [
|
|
<B>export</B>
|
|
|
|
UDS_FILE ] [
|
|
<B>verbose</B>
|
|
|
|
]
|
|
<P>
|
|
<A NAME="lbAE"> </A>
|
|
<H3>cBPF classifier (filter) or action:</H3>
|
|
|
|
<B>tc filter ... bpf</B>
|
|
|
|
[
|
|
<B>bytecode-file</B>
|
|
|
|
BPF_FILE |
|
|
<B>bytecode</B>
|
|
|
|
BPF_BYTECODE ] [
|
|
<B>police</B>
|
|
|
|
POLICE_SPEC ] [
|
|
<B>action</B>
|
|
|
|
ACTION_SPEC ] [
|
|
<B>classid</B>
|
|
|
|
CLASSID ]
|
|
<BR>
|
|
|
|
<B>tc action ... bpf</B>
|
|
|
|
[
|
|
<B>bytecode-file</B>
|
|
|
|
BPF_FILE |
|
|
<B>bytecode</B>
|
|
|
|
BPF_BYTECODE ]
|
|
<P>
|
|
<A NAME="lbAF"> </A>
|
|
<H2>DESCRIPTION</H2>
|
|
|
|
<P>
|
|
Extended Berkeley Packet Filter (
|
|
<B>eBPF</B>
|
|
|
|
) and classic Berkeley Packet Filter
|
|
(originally known as BPF, for better distinction referred to as
|
|
<B>cBPF</B>
|
|
|
|
here) are both available as a fully programmable and highly efficient
|
|
classifier and actions. They both offer a minimal instruction set for
|
|
implementing small programs which can safely be loaded into the kernel
|
|
and thus executed in a tiny virtual machine from kernel space. An in-kernel
|
|
verifier guarantees that a specified program always terminates and neither
|
|
crashes nor leaks data from the kernel.
|
|
<P>
|
|
In Linux, it's generally considered that eBPF is the successor of cBPF.
|
|
The kernel internally transforms cBPF expressions into eBPF expressions and
|
|
executes the latter. The resulting program either runs in an interpreter or is
just-in-time compiled (JIT'ed) at setup time to run as
native machine code.
|
|
<P>
|
|
|
|
Currently, the eBPF JIT compiler is available for the following architectures:
|
|
<DL COMPACT>
|
|
<DT id="1">*<DD>
|
|
x86_64 (since Linux 3.18)
|
|
|
|
<DT id="2">*<DD>
|
|
arm64 (since Linux 3.18)
|
|
<DT id="3">*<DD>
|
|
s390 (since Linux 4.1)
|
|
<DT id="4">*<DD>
|
|
ppc64 (since Linux 4.8)
|
|
<DT id="5">*<DD>
|
|
sparc64 (since Linux 4.12)
|
|
<DT id="6">*<DD>
|
|
mips64 (since Linux 4.13)
|
|
<DT id="7">*<DD>
|
|
arm32 (since Linux 4.14)
|
|
<DT id="8">*<DD>
|
|
x86_32 (since Linux 4.18)
|
|
|
|
</DL>
|
|
<P>
|
|
|
|
The following architectures have a cBPF JIT, but have not (yet) switched to eBPF
JIT support:
|
|
<DL COMPACT>
|
|
<DT id="9">*<DD>
|
|
ppc32
|
|
|
|
<DT id="10">*<DD>
|
|
sparc32
|
|
<DT id="11">*<DD>
|
|
mips32
|
|
|
|
</DL>
|
|
<P>
|
|
|
|
eBPF's instruction set follows principles similar to those of the cBPF
instruction set; it is, however, modelled more closely on the underlying
architecture to better mimic native instruction sets, with the aim of
achieving better run-time performance. It is designed to be JIT'ed with
|
|
a one to one mapping, which can also open up the possibility for compilers
|
|
to generate optimized eBPF code through an eBPF backend that performs
|
|
almost as fast as natively compiled code. Given that LLVM provides such
|
|
an eBPF backend, eBPF programs can therefore easily be programmed in a
|
|
subset of the C language. Other than that, eBPF infrastructure also comes
|
|
with a construct called "maps". eBPF maps are key/value stores that are
|
|
shared between multiple eBPF programs, but also between eBPF programs and
|
|
user space applications.
|
|
<P>
|
|
For the traffic control subsystem, classifier and actions that can be
|
|
attached to ingress and egress qdiscs can be written in eBPF or cBPF. The
|
|
advantage over other classifier and actions is that eBPF/cBPF provides the
|
|
generic framework, while users can implement their highly specialized use
|
|
cases efficiently. This means that the classifier or action written that
|
|
way will not suffer from feature bloat, and can therefore execute its task
|
|
highly efficiently. It allows for non-linear classification and even merging
|
|
the action part into the classification. Combined with efficient eBPF map
|
|
data structures, user space can push new policies like classids into the
|
|
kernel without reloading a classifier, or it can gather statistics that
|
|
are pushed into one map and use another one for dynamically load balancing
|
|
traffic based on the determined load, just to provide a few examples.
|
|
<P>
|
|
<A NAME="lbAG"> </A>
|
|
<H2>PARAMETERS</H2>
|
|
|
|
<A NAME="lbAH"> </A>
|
|
<H3>object-file</H3>
|
|
|
|
points to an object file that has an executable and linkable format (ELF)
|
|
and contains eBPF opcodes and eBPF map definitions. The LLVM compiler
|
|
infrastructure with
|
|
<B><A HREF="/cgi-bin/man/man2html?1+clang">clang</A>(1)</B>
|
|
|
|
as a C language front end is one project that supports emitting eBPF object
|
|
files that can be passed to the eBPF classifier (more details in the
|
|
<B>EXAMPLES</B>
|
|
|
|
section). This option is mandatory when an eBPF classifier or action is
|
|
to be loaded.
|
|
<P>
|
|
<A NAME="lbAI"> </A>
|
|
<H3>section</H3>
|
|
|
|
is the name of the ELF section from the object file, where the eBPF
|
|
classifier or action resides. By default the section name for the
|
|
classifier is called "classifier", and for the action "action". Given
|
|
that a single object file can contain multiple classifier and actions,
|
|
the corresponding section name needs to be specified, if it differs
|
|
from the defaults.
|
|
<P>
|
|
<A NAME="lbAJ"> </A>
|
|
<H3>export</H3>
|
|
|
|
points to a Unix domain socket file. In case the eBPF object file also
|
|
contains a section named "maps" with eBPF map specifications, then the
|
|
map file descriptors can be handed off via the Unix domain socket to
|
|
an eBPF "agent" herding all descriptors after tc lifetime. This can be
|
|
some third party application implementing the IPC counterpart for the
|
|
import, that uses them for calling into
|
|
<B><A HREF="/cgi-bin/man/man2html?2+bpf">bpf</A>(2)</B>
|
|
|
|
system call to read out or update eBPF map data from user space, for
|
|
example, for monitoring purposes or to push down new policies.
|
|
<P>
|
|
<A NAME="lbAK"> </A>
|
|
<H3>verbose</H3>
|
|
|
|
if set, dumps the eBPF verifier output even if loading the eBPF
program was successful. By default, the verifier log is emitted
only on error.
|
|
<P>
|
|
<A NAME="lbAL"> </A>
|
|
<H3>direct-action | da</H3>
|
|
|
|
instructs the eBPF classifier not to invoke external TC actions; instead, the
classifier's own return code is interpreted as a TC action return code
(<B>TC_ACT_OK</B>, <B>TC_ACT_SHOT</B> etc.).
|
|
<P>
|
|
<A NAME="lbAM"> </A>
|
|
<H3>skip_hw | skip_sw</H3>
|
|
|
|
hardware offload control flags. By default TC will try to offload
|
|
filters to hardware if possible.
|
|
<B>skip_hw</B>
|
|
|
|
explicitly disables the attempt to offload.
|
|
<B>skip_sw</B>
|
|
|
|
forces the offload and disables running the eBPF program in the kernel.
|
|
If hardware offload is not possible and this flag is set, the kernel will
report an error and the filter will not be installed at all.
|
|
<P>
|
|
<A NAME="lbAN"> </A>
|
|
<H3>police</H3>
|
|
|
|
is an optional parameter for an eBPF/cBPF classifier that specifies a
|
|
policer in
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A>(8)</B>
|
|
|
|
which is attached to the classifier, for example, on an ingress qdisc.
|
|
<P>
|
|
<A NAME="lbAO"> </A>
|
|
<H3>action</H3>
|
|
|
|
is an optional parameter for an eBPF/cBPF classifier that specifies a
|
|
subsequent action in
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A>(8)</B>
|
|
|
|
which is attached to a classifier.
|
|
<P>
|
|
<A NAME="lbAP"> </A>
|
|
<H3>classid</H3>
|
|
|
|
<A NAME="lbAQ"> </A>
|
|
<H3>flowid</H3>
|
|
|
|
provides the default traffic control class identifier for this eBPF/cBPF
|
|
classifier. The default class identifier can also be overwritten by the
|
|
return code of the eBPF/cBPF program. A default return code of
|
|
<B>-1</B>
|
|
|
|
specifies the here provided default class identifier to be used. A return
|
|
code of the eBPF/cBPF program of 0 implies that no match took place, and
|
|
a return code other than these two will override the default classid. This
|
|
allows for efficient, non-linear classification with only a single eBPF/cBPF
|
|
program as opposed to having multiple individual programs for various class
|
|
identifiers which would need to reparse packet contents.
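<P>

The return code rules above can be modelled in a few lines of plain user space C. This is only a sketch for illustration: the function name <B>cls_resolve_classid</B> is hypothetical and not part of tc or the kernel, and it covers only the non-direct-action case described here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a tc or kernel API): models how the return
 * code of a classifier program selects the class. Returns false on a
 * mismatch (return code 0); otherwise stores the resulting classid in
 * *classid and returns true.
 */
bool cls_resolve_classid(int32_t ret, uint32_t default_classid,
                         uint32_t *classid)
{
    if (ret == 0)
        return false;                 /* no match took place */

    if (ret == -1)
        *classid = default_classid;   /* classid/flowid given to tc */
    else
        *classid = (uint32_t)ret;     /* program overrides the default */

    return true;
}
```

With a default classid of 1:1 (0x00010001), a return code of -1 selects 1:1, while a return code of 0x00010002 selects 1:2 instead.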
|
|
<P>
|
|
<A NAME="lbAR"> </A>
|
|
<H3>bytecode</H3>
|
|
|
|
is used for loading cBPF classifiers and actions only. The cBPF bytecode
|
|
is directly passed as a text string in the form of
|
|
<B>'s,c t f k,c t f k,c t f k,...'</B>
|
|
|
|
, where
|
|
<B>s</B>
|
|
|
|
denotes the number of subsequent 4-tuples. One such 4-tuple consists of
|
|
<B>c t f k</B>
|
|
|
|
decimals, where
|
|
<B>c</B>
|
|
|
|
represents the cBPF opcode,
|
|
<B>t</B>
|
|
|
|
the jump true offset target,
|
|
<B>f</B>
|
|
|
|
the jump false offset target and
|
|
<B>k</B>
|
|
|
|
the immediate constant/literal. There are various tools that generate code
|
|
in this loadable format, for example,
|
|
<B>bpf_asm</B>
|
|
|
|
that ships with the Linux kernel source tree under
|
|
<B>tools/net/</B>
|
|
|
|
, so there is no need to write this by hand. The
|
|
<B>bytecode</B>
|
|
|
|
or
|
|
<B>bytecode-file</B>
|
|
|
|
option is mandatory when a cBPF classifier or action is to be loaded.
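<P>

To make the text format concrete, the following user space sketch parses such a string into an instruction array. The names <B>struct cbpf_insn</B> and <B>cbpf_parse</B> are assumptions made for this example; this is not the parser that tc itself uses, and it accepts only the exact layout described above (plus an optional trailing comma).

```c
#include <stdint.h>
#include <stdio.h>

/* One cBPF instruction, mirroring the 'c t f k' 4-tuple. */
struct cbpf_insn {
    uint16_t code;  /* c: opcode */
    uint8_t  jt;    /* t: jump true offset */
    uint8_t  jf;    /* f: jump false offset */
    uint32_t k;     /* k: immediate constant */
};

/*
 * Hypothetical parser for "s,c t f k,c t f k,...". Fills out[] with at
 * most max instructions. Returns the instruction count, or -1 if the
 * string is malformed or s does not match the number of 4-tuples.
 */
int cbpf_parse(const char *str, struct cbpf_insn *out, int max)
{
    unsigned int count, c, t, f, k;
    int used, i;

    if (max <= 0 || sscanf(str, "%u%n", &count, &used) != 1)
        return -1;
    if (count == 0 || count > (unsigned int)max)
        return -1;
    str += used;

    for (i = 0; i < (int)count; i++) {
        if (*str != ',')
            return -1;              /* every 4-tuple follows a comma */
        str++;
        if (sscanf(str, "%u %u %u %u%n", &c, &t, &f, &k, &used) != 4)
            return -1;
        if (c > 0xffff || t > 0xff || f > 0xff)
            return -1;              /* field does not fit its width */
        out[i] = (struct cbpf_insn){ (uint16_t)c, (uint8_t)t,
                                     (uint8_t)f, k };
        str += used;
    }

    if (*str == ',')                /* tolerate one trailing comma */
        str++;
    return *str == '\0' ? (int)count : -1;
}
```

For the string '1,6 0 0 4294967295,' this yields a single instruction with opcode 6 and an immediate of 4294967295.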
|
|
<P>
|
|
<A NAME="lbAS"> </A>
|
|
<H3>bytecode-file</H3>
|
|
|
|
is also used to load a cBPF classifier or action. It's effectively the
|
|
same as
|
|
<B>bytecode</B>
|
|
|
|
except that the cBPF bytecode is not passed directly on the command line, but
|
|
rather resides in a text file.
|
|
<P>
|
|
<A NAME="lbAT"> </A>
|
|
<H2>EXAMPLES</H2>
|
|
|
|
<A NAME="lbAU"> </A>
|
|
<H3>eBPF TOOLING</H3>
|
|
|
|
A full blown example including eBPF agent code can be found inside the
|
|
iproute2 source package under:
|
|
<B>examples/bpf/</B>
|
|
|
|
<P>
|
|
As a prerequisite, the kernel needs to have the eBPF system call, namely
<B><A HREF="/cgi-bin/man/man2html?2+bpf">bpf</A>(2)</B>
, enabled and must ship with the
|
|
<B>cls_bpf</B>
|
|
|
|
and
|
|
<B>act_bpf</B>
|
|
|
|
kernel modules for the traffic control subsystem. To enable cBPF/eBPF JIT
support, depending on which of the two the given architecture supports:
|
|
<P>
|
|
|
|
<B>echo 1 > /proc/sys/net/core/bpf_jit_enable</B>
|
|
|
|
|
|
<P>
|
|
A given restricted C file can be compiled via LLVM as:
|
|
<P>
|
|
|
|
<B>clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o</B>
|
|
|
|
|
|
<P>
|
|
The compiler invocation might still simplify in the future, so for now,
|
|
it's quite handy to alias this construct in one way or another, for
|
|
example:
|
|
|
|
<PRE>
|
|
|
|
__bcc() {
        # Quote "$1" so that file names containing spaces still work.
        clang -O2 -emit-llvm -c "$1" -o - | \
        llc -march=bpf -filetype=obj -o "$(basename "$1" .c).o"
}

alias bcc=__bcc
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
A minimal, stand-alone unit, which matches on all traffic with the
|
|
default classid (return code of -1) looks like:
|
|
<P>
|
|
|
|
<PRE>
|
|
|
|
#include <<A HREF="file:///usr/include/linux/bpf.h">linux/bpf.h</A>>
|
|
|
|
#ifndef __section
|
|
# define __section(x) __attribute__((section(x), used))
|
|
#endif
|
|
|
|
__section("classifier") int cls_main(struct __sk_buff *skb)
|
|
{
|
|
return -1;
|
|
}
|
|
|
|
char __license[] __section("license") = "GPL";
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
More examples can be found further below in subsection
|
|
<B>eBPF PROGRAMMING</B>
|
|
|
|
as focus here will be on tooling.
|
|
<P>
|
|
There can be various other sections, for example, also for actions.
|
|
Thus, an eBPF object file can contain multiple entry points.
A specific entry point, however, must always be specified when
configuring with tc. A license must be part of the restricted C code
|
|
and the license string syntax is the same as with Linux kernel modules.
|
|
The kernel reserves the right to restrict some eBPF helper functions to
GPL-compatible licenses only, and thus may reject a program
from loading into the kernel when such a license mismatch occurs.
|
|
<P>
|
|
The resulting object file from the compilation can be inspected with
|
|
the usual set of tools that also operate on normal object files, for
|
|
example
|
|
<B><A HREF="/cgi-bin/man/man2html?1+objdump">objdump</A>(1)</B>
|
|
|
|
for inspecting ELF section headers:
|
|
<P>
|
|
|
|
<PRE>
|
|
|
|
objdump -h bpf.o
|
|
[...]
|
|
3 classifier 000007f8 0000000000000000 0000000000000000 00000040 2**3
|
|
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
|
|
4 action-mark 00000088 0000000000000000 0000000000000000 00000838 2**3
|
|
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
|
|
5 action-rand 00000098 0000000000000000 0000000000000000 000008c0 2**3
|
|
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
|
|
6 maps 00000030 0000000000000000 0000000000000000 00000958 2**2
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
7 license 00000004 0000000000000000 0000000000000000 00000988 2**0
|
|
CONTENTS, ALLOC, LOAD, DATA
|
|
[...]
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
Adding an eBPF classifier from an object file that contains a classifier
|
|
in the default ELF section is trivial (note that instead of "object-file"
|
|
also shortcuts such as "obj" can be used):
|
|
<P>
|
|
|
|
<B>bcc bpf.c</B>
|
|
|
|
<BR>
|
|
|
|
<B>tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1</B>
|
|
|
|
|
|
<P>
|
|
In case the classifier resides in ELF section "mycls", then that same
|
|
command needs to be invoked as:
|
|
<P>
|
|
|
|
<B>tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1</B>
|
|
|
|
|
|
<P>
|
|
Dumping the classifier configuration will tell the location of the
|
|
classifier, in other words that it's from object file "bpf.o" under
|
|
section "mycls":
|
|
<P>
|
|
|
|
<B>tc filter show dev em1</B>
|
|
|
|
<BR>
|
|
|
|
<B>filter parent 1: protocol all pref 49152 bpf</B>
|
|
|
|
<BR>
|
|
|
|
<B>filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]</B>
|
|
|
|
|
|
<P>
|
|
The same program can also be installed on ingress qdisc side as opposed
|
|
to egress ...
|
|
<P>
|
|
|
|
<B>tc qdisc add dev em1 handle ffff: ingress</B>
|
|
|
|
<BR>
|
|
|
|
<B>tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1</B>
|
|
|
|
|
|
<P>
|
|
... and again dumped from there:
|
|
<P>
|
|
|
|
<B>tc filter show dev em1 parent ffff:</B>
|
|
|
|
<BR>
|
|
|
|
<B>filter protocol all pref 49152 bpf</B>
|
|
|
|
<BR>
|
|
|
|
<B>filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]</B>
|
|
|
|
|
|
<P>
|
|
Attaching a classifier and action on ingress has the restriction that
|
|
it doesn't have an actual underlying queueing discipline. What ingress
|
|
can do is to classify, mangle, redirect or drop packets. When queueing
|
|
is required on ingress side, then ingress must redirect packets to the
|
|
<B>ifb</B>
|
|
|
|
device, otherwise policing can be used. Moreover, ingress can be used to
|
|
have an early drop point of unwanted packets before they hit upper layers
|
|
of the networking stack, perform network accounting with eBPF maps that
|
|
could be shared with egress, or have an early mangle and/or redirection
|
|
point to different networking devices.
|
|
<P>
|
|
Multiple eBPF actions and classifier can be placed into a single
|
|
object file within various sections. In that case, non-default section
|
|
names must be provided, which is the case for both actions in this
|
|
example:
|
|
<P>
|
|
|
|
<B>tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \</B>
|
|
|
|
<BR>
|
|
|
|
|
|
<B>action bpf obj bpf.o sec action-mark \</B>
|
|
|
|
<BR>
|
|
|
|
<B>action bpf obj bpf.o sec action-rand ok</B>
|
|
|
|
|
|
|
|
<P>
|
|
The advantage of this is that the classifier and the two actions can
|
|
then share eBPF maps with each other, if implemented in the programs.
|
|
<P>
|
|
In order to access eBPF maps from user space beyond
|
|
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A>(8)</B>
|
|
|
|
setup lifetime, the ownership can be transferred to an eBPF agent via
|
|
Unix domain sockets. There are two possibilities for implementing this:
|
|
<P>
|
|
<B>1)</B>
|
|
|
|
implementation of an own eBPF agent that takes care of setting up
|
|
the Unix domain socket and implementing the protocol that
|
|
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A>(8)</B>
|
|
|
|
dictates. A code example of this can be found inside the iproute2
|
|
source package under:
|
|
<B>examples/bpf/</B>
|
|
|
|
<P>
|
|
<B>2)</B>
|
|
|
|
use
|
|
<B>tc exec</B>
|
|
|
|
for transferring the eBPF map file descriptors through a Unix domain
|
|
socket, and spawning an application such as
|
|
<B><A HREF="/cgi-bin/man/man2html?1+sh">sh</A>(1)</B>
|
|
|
|
. This approach's advantage is that tc will place the file descriptors
|
|
into the environment and thus make them available just like stdin, stdout,
|
|
stderr file descriptors, meaning, in case user applications run from within
|
|
this fd-owner shell, they can terminate and restart without losing eBPF
|
|
maps file descriptors. Example invocation with the previous classifier and
|
|
action mixture:
|
|
<P>
|
|
|
|
<B>tc exec bpf imp /tmp/bpf</B>
|
|
|
|
<BR>
|
|
|
|
<B>tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \</B>
|
|
|
|
<BR>
|
|
|
|
|
|
<B>action bpf obj bpf.o sec action-mark \</B>
|
|
|
|
<BR>
|
|
|
|
<B>action bpf obj bpf.o sec action-rand ok</B>
|
|
|
|
|
|
|
|
<P>
|
|
Assuming that eBPF maps are shared with classifier and actions, it's
|
|
enough to export them once, for example, from within the classifier
|
|
or action command. tc will set up all eBPF map file descriptors at the
|
|
time when the object file is first parsed.
|
|
<P>
|
|
When a shell has been spawned, the environment will have a couple of
|
|
eBPF related variables. BPF_NUM_MAPS provides the total number of maps
|
|
that have been transferred over the Unix domain socket. BPF_MAP<X>'s
|
|
value is the file descriptor number that can be accessed in eBPF agent
|
|
applications, in other words, it can directly be used as the file
|
|
descriptor value for the
|
|
<B><A HREF="/cgi-bin/man/man2html?2+bpf">bpf</A>(2)</B>
|
|
|
|
system call to retrieve or alter eBPF map values. <X> denotes the
|
|
identifier of the eBPF map. It corresponds to the
|
|
<B>id</B>
|
|
|
|
member of
|
|
<B>struct bpf_elf_map</B>
|
|
|
|
from the tc eBPF map specification.
|
|
<P>
|
|
The environment in this example looks as follows:
|
|
<P>
|
|
|
|
<PRE>
|
|
|
|
sh# env | grep BPF
|
|
BPF_NUM_MAPS=3
|
|
BPF_MAP1=6
|
|
BPF_MAP0=5
|
|
BPF_MAP2=7
|
|
sh# ls -la /proc/self/fd
|
|
[...]
|
|
lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
|
|
lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
|
|
lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
|
|
sh# my_bpf_agent
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
eBPF agents are very useful in that they can prepopulate eBPF maps from
|
|
user space, monitor statistics via maps and based on that feedback, for
|
|
example, rewrite classids in eBPF map values during runtime. Given that eBPF
|
|
agents are implemented as normal applications, they can also dynamically
|
|
receive traffic control policies from external controllers and thus push
|
|
them down into eBPF maps to dynamically adapt to network conditions. Moreover,
|
|
eBPF maps can also be shared with other eBPF program types (e.g. tracing),
|
|
so very powerful combinations can be implemented.
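<P>

As a starting point for such an agent, the sketch below reads the BPF_NUM_MAPS and BPF_MAP variables described earlier from the environment. The function names are assumptions made for this example; the returned descriptor would then be handed to the bpf(2) system call, which is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Number of maps announced in BPF_NUM_MAPS, or 0 if it is unset. */
int bpf_env_num_maps(void)
{
    const char *val = getenv("BPF_NUM_MAPS");

    return val ? atoi(val) : 0;
}

/*
 * File descriptor stored in BPF_MAP<id>, where id is the map identifier
 * from the tc eBPF map specification. Returns -1 if the variable is
 * unset or does not hold a non-negative decimal number.
 */
int bpf_env_map_fd(unsigned int id)
{
    char name[32];
    const char *val;
    char *end;
    long fd;

    snprintf(name, sizeof(name), "BPF_MAP%u", id);
    val = getenv(name);
    if (!val || !*val)
        return -1;

    fd = strtol(val, &end, 10);
    if (*end != '\0' || fd < 0)
        return -1;

    return (int)fd;
}
```

In the environment shown above, looking up map identifier 1 would return file descriptor 6.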
|
|
<P>
|
|
<A NAME="lbAV"> </A>
|
|
<H3>eBPF PROGRAMMING</H3>
|
|
|
|
<P>
|
|
eBPF classifiers and actions are implemented in restricted C syntax
(in the future, additional language frontends could be supported).
|
|
<P>
|
|
The header file
|
|
<B>linux/bpf.h</B>
|
|
|
|
provides eBPF helper functions that can be called from an eBPF program.
|
|
This man page only provides two minimal, stand-alone examples; have a
look at
|
|
<B>examples/bpf</B>
|
|
|
|
from the iproute2 source package for a fully fledged flow dissector
|
|
example to better demonstrate some of the possibilities with eBPF.
|
|
<P>
|
|
Supported 32 bit classifier return codes from the C program and their meanings:
|
|
|
|
<B>0</B>
|
|
|
|
, denotes a mismatch
|
|
<BR>
|
|
|
|
<B>-1</B>
|
|
|
|
, denotes the default classid configured from the command line
|
|
<BR>
|
|
|
|
<B>else</B>
|
|
|
|
, everything else will override the default classid to provide a facility for
|
|
non-linear matching
|
|
|
|
<P>
|
|
Supported 32 bit action return codes from the C program and their meanings (
|
|
<B>linux/pkt_cls.h</B>
|
|
|
|
):
|
|
|
|
<B>TC_ACT_OK (0)</B>
|
|
|
|
, will terminate the packet processing pipeline and allow the packet to
|
|
proceed
|
|
<BR>
|
|
|
|
<B>TC_ACT_SHOT (2)</B>
|
|
|
|
, will terminate the packet processing pipeline and drop the packet
|
|
<BR>
|
|
|
|
<B>TC_ACT_UNSPEC (-1)</B>
|
|
|
|
, will use the default action configured from tc (similar to returning
|
|
<B>-1</B>
|
|
|
|
from a classifier)
|
|
<BR>
|
|
|
|
<B>TC_ACT_PIPE (3)</B>
|
|
|
|
, will iterate to the next action, if available
|
|
<BR>
|
|
|
|
<B>TC_ACT_RECLASSIFY (1)</B>
|
|
|
|
, will terminate the packet processing pipeline and start classification
|
|
from the beginning
|
|
<BR>
|
|
|
|
<B>else</B>
|
|
|
|
, everything else is an unspecified return code
|
|
|
|
<P>
|
|
Both classifier and action return codes are supported in eBPF and cBPF
|
|
programs.
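<P>

For quick reference, the action return codes listed above can be summarised in a small lookup function. This is a user space sketch only: the enum carries local copies of the values from the list, and the function name is hypothetical. Real programs should use the definitions from <B>linux/pkt_cls.h</B>.

```c
/* Local copies of the values listed above, for illustration only. */
enum {
    EX_TC_ACT_UNSPEC     = -1,
    EX_TC_ACT_OK         = 0,
    EX_TC_ACT_RECLASSIFY = 1,
    EX_TC_ACT_SHOT       = 2,
    EX_TC_ACT_PIPE       = 3,
};

/* Hypothetical helper: short description of an action return code. */
const char *tc_act_describe(int code)
{
    switch (code) {
    case EX_TC_ACT_UNSPEC:
        return "use default action";
    case EX_TC_ACT_OK:
        return "pass";
    case EX_TC_ACT_RECLASSIFY:
        return "reclassify";
    case EX_TC_ACT_SHOT:
        return "drop";
    case EX_TC_ACT_PIPE:
        return "next action";
    default:
        return "unspecified";
    }
}
```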
|
|
<P>
|
|
To demonstrate restricted C syntax, a minimal toy classifier example is
|
|
provided, which assumes that egress packets, for instance originating
|
|
from a container, have previously been marked in interval [0, 255]. The
|
|
program keeps statistics on different marks for user space and maps the
|
|
classid to the root qdisc with the marking itself as the minor handle:
|
|
<P>
|
|
|
|
<PRE>
|
|
|
|
#include <<A HREF="file:///usr/include/stdint.h">stdint.h</A>>
|
|
#include <<A HREF="file:///usr/include/asm/types.h">asm/types.h</A>>
|
|
|
|
#include <<A HREF="file:///usr/include/linux/bpf.h">linux/bpf.h</A>>
|
|
#include <<A HREF="file:///usr/include/linux/pkt_sched.h">linux/pkt_sched.h</A>>
|
|
|
|
#include "helpers.h"
|
|
|
|
struct tuple {
|
|
long packets;
|
|
long bytes;
|
|
};
|
|
|
|
#define BPF_MAP_ID_STATS 1 /* agent's map identifier */
|
|
#define BPF_MAX_MARK 256
|
|
|
|
struct bpf_elf_map __section("maps") map_stats = {
|
|
.type = BPF_MAP_TYPE_ARRAY,
|
|
.id = BPF_MAP_ID_STATS,
|
|
.size_key = sizeof(uint32_t),
|
|
.size_value = sizeof(struct tuple),
|
|
.max_elem = BPF_MAX_MARK,
|
|
.pinning = PIN_GLOBAL_NS,
|
|
};
|
|
|
|
static inline void cls_update_stats(const struct __sk_buff *skb,
|
|
uint32_t mark)
|
|
{
|
|
struct tuple *tu;
|
|
|
|
tu = bpf_map_lookup_elem(&map_stats, &mark);
|
|
if (likely(tu)) {
|
|
__sync_fetch_and_add(&tu->packets, 1);
|
|
__sync_fetch_and_add(&tu->bytes, skb->len);
|
|
}
|
|
}
|
|
|
|
__section("cls") int cls_main(struct __sk_buff *skb)
|
|
{
|
|
uint32_t mark = skb->mark;
|
|
|
|
if (unlikely(mark >= BPF_MAX_MARK))
|
|
return 0;
|
|
|
|
cls_update_stats(skb, mark);
|
|
|
|
return TC_H_MAKE(TC_H_ROOT, mark);
|
|
}
|
|
|
|
char __license[] __section("license") = "GPL";
|
|
</PRE>
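<P>

The classid returned by the classifier above is a 32 bit value with the major handle in the upper 16 bits and the minor handle in the lower 16 bits. The user space sketch below repeats that arithmetic with local copies of the macros from <B>linux/pkt_sched.h</B>; the EX_ prefix and the function names are assumptions made for this example.

```c
#include <stdint.h>

/* Local copies mirroring linux/pkt_sched.h, prefixed to avoid clashes. */
#define EX_TC_H_MAJ_MASK 0xFFFF0000U
#define EX_TC_H_MIN_MASK 0x0000FFFFU
#define EX_TC_H_ROOT     0xFFFFFFFFU
#define EX_TC_H_MAKE(maj, min) \
    (((maj) & EX_TC_H_MAJ_MASK) | ((min) & EX_TC_H_MIN_MASK))

/* Classid the toy classifier returns for a given packet mark. */
uint32_t classid_for_mark(uint32_t mark)
{
    return EX_TC_H_MAKE(EX_TC_H_ROOT, mark);
}

/* Major handle: upper 16 bits of the classid. */
uint16_t classid_major(uint32_t classid)
{
    return (uint16_t)(classid >> 16);
}

/* Minor handle: lower 16 bits of the classid. */
uint16_t classid_minor(uint32_t classid)
{
    return (uint16_t)(classid & EX_TC_H_MIN_MASK);
}
```

A packet marked with 5 therefore maps to classid 0xFFFF0005, that is, major handle ffff and minor handle 5.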
|
|
|
|
|
|
<P>
|
|
Another small example is a port redirector which demuxes destination port
|
|
80 into the interval [8080, 8087] steered by RSS, that can then be attached
|
|
to ingress qdisc. The exercise of adding the egress counterpart and IPv6
|
|
support is left to the reader:
|
|
<P>
|
|
|
|
<PRE>
|
|
|
|
#include <<A HREF="file:///usr/include/asm/types.h">asm/types.h</A>>
|
|
#include <<A HREF="file:///usr/include/asm/byteorder.h">asm/byteorder.h</A>>
|
|
|
|
#include <<A HREF="file:///usr/include/linux/bpf.h">linux/bpf.h</A>>
|
|
#include <<A HREF="file:///usr/include/linux/filter.h">linux/filter.h</A>>
|
|
#include <<A HREF="file:///usr/include/linux/in.h">linux/in.h</A>>
|
|
#include <<A HREF="file:///usr/include/linux/if_ether.h">linux/if_ether.h</A>>
|
|
#include <<A HREF="file:///usr/include/linux/ip.h">linux/ip.h</A>>
|
|
#include <<A HREF="file:///usr/include/linux/tcp.h">linux/tcp.h</A>>
|
|
|
|
#include "helpers.h"
|
|
|
|
static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
|
|
__u16 old_port, __u16 new_port)
|
|
{
|
|
bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
|
|
old_port, new_port, sizeof(new_port));
|
|
bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
|
|
&new_port, sizeof(new_port), 0);
|
|
}
|
|
|
|
static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
|
|
{
|
|
__u16 dport, dport_new = 8080, off;
|
|
__u8 ip_proto, ip_vl;
|
|
|
|
ip_proto = load_byte(skb, nh_off +
|
|
offsetof(struct iphdr, protocol));
|
|
if (ip_proto != IPPROTO_TCP)
|
|
return 0;
|
|
|
|
ip_vl = load_byte(skb, nh_off);
|
|
if (likely(ip_vl == 0x45))
|
|
nh_off += sizeof(struct iphdr);
|
|
else
|
|
nh_off += (ip_vl & 0xF) << 2;
|
|
|
|
dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
|
|
if (dport != 80)
|
|
return 0;
|
|
|
|
off = skb->queue_mapping & 7;
|
|
set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
|
|
__cpu_to_be16(dport_new + off));
|
|
return -1;
|
|
}
|
|
|
|
__section("lb") int lb_main(struct __sk_buff *skb)
|
|
{
|
|
int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;
|
|
|
|
if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
|
|
ret = lb_do_ipv4(skb, nh_off);
|
|
|
|
return ret;
|
|
}
|
|
|
|
char __license[] __section("license") = "GPL";
|
|
</PRE>
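<P>

Two small calculations carry the redirector above: the IPv4 header length derived from the first header byte, and the new destination port derived from the receive queue. They are isolated below as plain user space functions for illustration; the function names are hypothetical.

```c
#include <stdint.h>

/*
 * IPv4 header length in bytes from the first header byte (version in
 * the upper nibble, header length in 32 bit words in the lower nibble).
 */
unsigned int ipv4_header_len(uint8_t ip_vl)
{
    return (unsigned int)(ip_vl & 0xF) << 2;
}

/*
 * New destination port in [8080, 8087], selected by the lower three
 * bits of the queue mapping, as in lb_do_ipv4() above.
 */
uint16_t redirect_dport(uint32_t queue_mapping)
{
    return (uint16_t)(8080 + (queue_mapping & 7));
}
```

The common first byte 0x45 gives a 20 byte header, which is why the program treats it as the likely case and adds sizeof(struct iphdr) directly.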
|
|
|
|
|
|
<P>
|
|
The related helper header file
|
|
<B>helpers.h</B>
|
|
|
|
in both examples was:
|
|
<P>
|
|
|
|
<PRE>
|
|
|
|
/* Misc helper macros. */
|
|
#define __section(x) __attribute__((section(x), used))
|
|
#define offsetof(x, y) __builtin_offsetof(x, y)
|
|
#define likely(x) __builtin_expect(!!(x), 1)
|
|
#define unlikely(x) __builtin_expect(!!(x), 0)
|
|
|
|
/* Object pinning settings */
|
|
#define PIN_NONE 0
|
|
#define PIN_OBJECT_NS 1
|
|
#define PIN_GLOBAL_NS 2
|
|
|
|
/* ELF map definition */
|
|
struct bpf_elf_map {
|
|
__u32 type;
|
|
__u32 size_key;
|
|
__u32 size_value;
|
|
__u32 max_elem;
|
|
__u32 flags;
|
|
__u32 id;
|
|
__u32 pinning;
|
|
__u32 inner_id;
|
|
__u32 inner_idx;
|
|
};
|
|
|
|
/* Some used BPF function calls. */
|
|
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
|
|
int len, int flags) =
|
|
(void *) BPF_FUNC_skb_store_bytes;
|
|
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
|
|
int to, int flags) =
|
|
(void *) BPF_FUNC_l4_csum_replace;
|
|
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
|
|
(void *) BPF_FUNC_map_lookup_elem;
|
|
|
|
/* Some used BPF intrinsics. */
|
|
unsigned long long load_byte(void *skb, unsigned long long off)
|
|
asm ("llvm.bpf.load.byte");
|
|
unsigned long long load_half(void *skb, unsigned long long off)
|
|
asm ("llvm.bpf.load.half");
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
As a best practice, we recommend having only a single eBPF classifier loaded
in tc and performing
|
|
<B>all</B>
|
|
|
|
necessary matching and mangling from there instead of a list of individual
|
|
classifier and separate actions. Just a single classifier tailored for a
|
|
given use-case will be most efficient to run.
|
|
<P>
|
|
<A NAME="lbAW"> </A>
|
|
<H3>eBPF DEBUGGING</H3>
|
|
|
|
<P>
|
|
Both tc
|
|
<B>filter</B>
|
|
|
|
and
|
|
<B>action</B>
|
|
|
|
commands for
|
|
<B>bpf</B>
|
|
|
|
support an optional
|
|
<B>verbose</B>
|
|
|
|
parameter that can be used to inspect the eBPF verifier log. It is dumped
|
|
by default in case of an error.
|
|
<P>
|
|
In case the eBPF/cBPF JIT compiler has been enabled, it can also be
|
|
instructed to emit a debug output of the resulting opcode image into
|
|
the kernel log, which can be read via
|
|
<B><A HREF="/cgi-bin/man/man2html?1+dmesg">dmesg</A>(1)</B>
|
|
|
|
:
|
|
<P>
|
|
|
|
<B>echo 2 > /proc/sys/net/core/bpf_jit_enable</B>
|
|
|
|
|
|
<P>
|
|
The Linux kernel source tree ships additionally under
|
|
<B>tools/net/</B>
|
|
|
|
a small helper called
|
|
<B>bpf_jit_disasm</B>
|
|
|
|
that reads out the opcode image dump from the kernel log and dumps the
|
|
resulting disassembly:
|
|
<P>
|
|
|
|
<B>bpf_jit_disasm -o</B>
|
|
|
|
|
|
<P>
|
|
Other than that, the Linux kernel also contains an extensive eBPF/cBPF
|
|
test suite module called
|
|
<B>test_bpf</B>
|
|
|
|
. Upon ...
|
|
<P>
|
|
|
|
<B>modprobe test_bpf</B>
|
|
|
|
|
|
<P>
|
|
... it runs a variety of test cases and dumps the results into
|
|
the kernel log that can be inspected with
|
|
<B><A HREF="/cgi-bin/man/man2html?1+dmesg">dmesg</A>(1)</B>
|
|
|
|
. The results can differ depending on whether the JIT compiler is enabled
|
|
or not. In case of failed test cases, the module will fail to load. In
|
|
such cases, we urge you to file a bug report to the related JIT authors,
|
|
Linux kernel and networking mailing lists.
|
|
<P>
|
|
<A NAME="lbAX"> </A>
|
|
<H3>cBPF</H3>
|
|
|
|
<P>
|
|
Although we generally recommend switching to implementing
|
|
<B>eBPF</B>
|
|
|
|
classifier and actions, for the sake of completeness, a few words on how to
|
|
program in cBPF are in order here.
|
|
<P>
|
|
Likewise, the
|
|
<B>bpf_jit_enable</B>
|
|
|
|
switch can be enabled as mentioned already. Tooling such as
|
|
<B>bpf_jit_disasm</B>
|
|
|
|
works regardless of whether eBPF or cBPF code is being loaded.
|
|
<P>
|
|
Unlike in eBPF, classifier and action are not implemented in restricted C,
|
|
but rather in a minimal assembler-like language or with the help of other
|
|
tooling.
|
|
<P>
|
|
The raw interface with tc takes opcodes directly. For example, the most
|
|
minimal classifier matching on every packet resulting in the default
|
|
classid of 1:1 looks like:
|
|
<P>
|
|
|
|
<B>tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1</B>
|
|
|
|
|
|
<P>
|
|
The first decimal of the bytecode sequence denotes the number of subsequent
|
|
4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of
|
|
<B>c t f k</B>
|
|
|
|
decimals, where
|
|
<B>c</B>
|
|
|
|
represents the cBPF opcode,
|
|
<B>t</B>
|
|
|
|
the jump true offset target,
|
|
<B>f</B>
|
|
|
|
the jump false offset target and
|
|
<B>k</B>
|
|
|
|
the immediate constant/literal. Here, this denotes an unconditional return
|
|
from the program with immediate value of -1.
|
|
<P>
|
|
Thus, for egress classification, Willem de Bruijn implemented a minimal stand-alone
|
|
helper tool under the GNU General Public License version 2 for
|
|
<B><A HREF="/cgi-bin/man/man2html?8+iptables">iptables</A>(8)</B>
|
|
|
|
BPF extension, which abuses the
|
|
<B>libpcap</B>
|
|
|
|
internal classic BPF compiler; his code is adapted here for use with
|
|
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A>(8)</B>
|
|
|
|
:
|
|
<P>
|
|
|
|
<PRE>

#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        struct bpf_program prog;
        struct bpf_insn *ins;
        unsigned int i;
        int ret, dlt = DLT_RAW;

        /* Usage: helper [link-type-name] 'filter expression' */
        if (argc < 2 || argc > 3)
                return 1;
        if (argc == 3) {
                /* Optional first argument, e.g. EN10MB */
                dlt = pcap_datalink_name_to_val(argv[1]);
                if (dlt == -1)
                        return 1;
        }

        /* Compile the expression without a live capture handle */
        ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
                                  1, PCAP_NETMASK_UNKNOWN);
        if (ret)
                return 1;

        /* Emit the instruction count, then one "c t f k" tuple each */
        printf("%u,", prog.bf_len);
        ins = prog.bf_insns;

        for (i = 0; i + 1 < prog.bf_len; ++ins, ++i)
                printf("%u %u %u %u,", ins->code,
                       ins->jt, ins->jf, ins->k);
        /* Last tuple is printed without a trailing comma */
        printf("%u %u %u %u",
               ins->code, ins->jt, ins->jf, ins->k);

        pcap_freecode(&prog);
        return 0;
}
</PRE>
|
|
|
|
|
|
<P>
|
|
Given this small helper (built as
<B>bpftool</B>

in the example below, unrelated to the kernel tool of the same name), any
<B><A HREF="/cgi-bin/man/man2html?8+tcpdump">tcpdump</A>(8)</B>

filter expression can be used as a classifier where a match will
result in the default classid:
|
|
<P>
|
|
|
|
<B>bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn</B>
|
|
|
|
<BR>
|
|
|
|
<B>tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1</B>
|
|
|
|
|
|
<P>
|
|
Basically, such a minimal generator is equivalent to:
|
|
<P>
|
|
|
|
<B>tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\n' ',' > /var/bpf/tcp-syn</B>
|
|
|
|
|
|
<P>
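The conversion performed by the
<B>tr</B>
step can also be sketched in C, for cases where the decimal dump is already held in memory. The function name
<B>ddd_to_tc</B>
is hypothetical and only serves this illustration. It assumes the input has the shape printed by
<B>tcpdump -ddd</B>,
that is, a count line followed by one
<B>c t f k</B>
line per instruction.
<PRE>

#include <stddef.h>

/*
 * Copy in to out, turning every newline into a comma.
 * Returns 0 on success, -1 if out (of size outsz) is too small.
 */
static int ddd_to_tc(const char *in, char *out, size_t outsz)
{
        size_t i;

        if (outsz == 0)
                return -1;
        for (i = 0; in[i] != '\0'; i++) {
                /* Keep one byte free for the terminating NUL. */
                if (i + 1 >= outsz)
                        return -1;
                out[i] = (in[i] == '\n') ? ',' : in[i];
        }
        out[i] = '\0';
        return 0;
}
</PRE>

The result keeps the trailing comma, matching the form of the bytecode example shown earlier.
<P>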
|
|
Since
|
|
<B>libpcap</B>
|
|
|
|
does not support all Linux-specific cBPF extensions in its compiler, the
|
|
Linux kernel also ships under
|
|
<B>tools/net/</B>
|
|
|
|
a minimal BPF assembler called
|
|
<B>bpf_asm</B>
|
|
|
|
for providing full control. For detailed syntax and semantics on implementing
|
|
such programs by hand, see references under
|
|
<B>FURTHER READING</B>
|
|
|
|
.
|
|
<P>
|
|
A trivial toy example in
|
|
<B>bpf_asm</B>
|
|
|
|
for classifying IPv4/TCP packets, saved in a text file called
|
|
<B>foobar</B>
|
|
|
|
:
|
|
<P>
|
|
|
|
<PRE>

      ldh [12]            /* load EtherType */
      jneq #0x800, drop   /* not IPv4 */
      ldb [23]            /* load IPv4 protocol field */
      jneq #6, drop       /* not TCP */
      ret #-1             /* match */
drop: ret #0              /* no match */
</PRE>
|
|
|
|
|
|
<P>
|
|
Similarly, such a classifier can be loaded as:
|
|
<P>
|
|
|
|
<B>bpf_asm foobar > /var/bpf/ipv4-tcp</B>

<BR>

<B>tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/ipv4-tcp flowid 1:1</B>
|
|
|
|
|
|
<P>
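To make the control flow of the toy classifier concrete, the following sketch interprets a hand-encoded copy of it against a packet buffer. Everything here is an assumption made for illustration: the
<B>cbpf_insn</B>
structure, the
<B>cbpf_run</B>
function and the encoded
<B>foobar</B>
array are hypothetical, only the four opcodes used by the example are handled, and the array is what
<B>bpf_asm</B>
is expected to emit rather than verified output. A "jump if not equal" is encoded as a jump-if-equal with the target placed in the false offset.
<PRE>

#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of one cBPF instruction (fields c t f k). */
struct cbpf_insn {
        uint16_t code;
        uint8_t  jt, jf;
        uint32_t k;
};

/* Classic BPF opcode values for the four instructions used below. */
#define OP_LDH_ABS 0x28 /* BPF_LD  | BPF_H   | BPF_ABS */
#define OP_LDB_ABS 0x30 /* BPF_LD  | BPF_B   | BPF_ABS */
#define OP_JEQ_K   0x15 /* BPF_JMP | BPF_JEQ | BPF_K   */
#define OP_RET_K   0x06 /* BPF_RET | BPF_K             */

/* Hand-encoded form of the toy classifier (assumed, not tool output). */
static const struct cbpf_insn foobar[] = {
        { OP_LDH_ABS, 0, 0, 12 },         /* ldh [12]           */
        { OP_JEQ_K,   0, 3, 0x800 },      /* jneq #0x800, drop  */
        { OP_LDB_ABS, 0, 0, 23 },         /* ldb [23]           */
        { OP_JEQ_K,   0, 1, 6 },          /* jneq #6, drop      */
        { OP_RET_K,   0, 0, 0xffffffff }, /* ret #-1            */
        { OP_RET_K,   0, 0, 0 },          /* drop: ret #0       */
};

/* Run prog over pkt; out-of-bounds loads end the program with 0. */
static uint32_t cbpf_run(const struct cbpf_insn *prog, size_t len,
                         const uint8_t *pkt, size_t pkt_len)
{
        uint32_t a = 0; /* accumulator */
        size_t pc;

        for (pc = 0; pc < len; pc++) {
                const struct cbpf_insn *in = &prog[pc];

                switch (in->code) {
                case OP_LDH_ABS:
                        if ((size_t)in->k + 2 > pkt_len)
                                return 0;
                        a = (uint32_t)pkt[in->k] << 8 | pkt[in->k + 1];
                        break;
                case OP_LDB_ABS:
                        if (in->k >= pkt_len)
                                return 0;
                        a = pkt[in->k];
                        break;
                case OP_JEQ_K:
                        /* Offsets are relative to the next instruction. */
                        pc += (a == in->k) ? in->jt : in->jf;
                        break;
                case OP_RET_K:
                        return in->k;
                default:
                        return 0; /* opcode not covered by this sketch */
                }
        }
        return 0;
}
</PRE>

For an Ethernet frame carrying IPv4 and TCP, the sketch returns 0xffffffff (that is, -1), and 0 for anything else, which mirrors the match and no-match outcomes of the classifier.
<P>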
|
|
For BPF classifiers, the Linux kernel additionally provides under
|
|
<B>tools/net/</B>
|
|
|
|
a small BPF debugger called
|
|
<B>bpf_dbg</B>
|
|
|
|
, which can be used to test a classifier against pcap files, single-step
|
|
or add various breakpoints into the classifier program and dump register
|
|
contents during runtime.
|
|
<P>
|
|
Implementing an action in classic BPF is rather limited in the sense that
|
|
packet mangling is not supported. Therefore, it's generally recommended to
|
|
make the switch to eBPF whenever possible.
|
|
<P>
|
|
<A NAME="lbAY"> </A>
|
|
<H2>FURTHER READING</H2>
|
|
|
|
Further and more technical details about the BPF architecture can be found
|
|
in the Linux kernel source tree under
|
|
<B>Documentation/networking/filter.txt</B>
|
|
|
|
.
|
|
<P>
|
|
Further details on eBPF
|
|
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A>(8)</B>
|
|
|
|
examples can be found in the iproute2 source
|
|
tree under
|
|
<B>examples/bpf/</B>
|
|
|
|
.
|
|
<P>
|
|
<A NAME="lbAZ"> </A>
|
|
<H2>SEE ALSO</H2>
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?8+tc">tc</A></B>(8),

<B><A HREF="/cgi-bin/man/man2html?8+tc-ematch">tc-ematch</A></B>(8),

<B><A HREF="/cgi-bin/man/man2html?2+bpf">bpf</A></B>(2),

<B><A HREF="/cgi-bin/man/man2html?4+bpf">bpf</A></B>(4)
|
|
|
|
<P>
|
|
<A NAME="lbBA"> </A>
|
|
<H2>AUTHORS</H2>
|
|
|
|
Manpage written by Daniel Borkmann.
|
|
<P>
|
|
Please report corrections or improvements to the Linux kernel networking
|
|
mailing list:
|
|
<B><<A HREF="mailto:netdev@vger.kernel.org">netdev@vger.kernel.org</A>></B>
|
|
|
|
<P>
|
|
|
|
<HR>
|
|
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="12"><A HREF="#lbAB">NAME</A><DD>
<DT id="13"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DL>
<DT id="14"><A HREF="#lbAD">eBPF classifier (filter) or action:</A><DD>
<DT id="15"><A HREF="#lbAE">cBPF classifier (filter) or action:</A><DD>
</DL>
<DT id="16"><A HREF="#lbAF">DESCRIPTION</A><DD>
<DT id="17"><A HREF="#lbAG">PARAMETERS</A><DD>
<DL>
<DT id="18"><A HREF="#lbAH">object-file</A><DD>
<DT id="19"><A HREF="#lbAI">section</A><DD>
<DT id="20"><A HREF="#lbAJ">export</A><DD>
<DT id="21"><A HREF="#lbAK">verbose</A><DD>
<DT id="22"><A HREF="#lbAL">direct-action | da</A><DD>
<DT id="23"><A HREF="#lbAM">skip_hw | skip_sw</A><DD>
<DT id="24"><A HREF="#lbAN">police</A><DD>
<DT id="25"><A HREF="#lbAO">action</A><DD>
<DT id="26"><A HREF="#lbAP">classid</A><DD>
<DT id="27"><A HREF="#lbAQ">flowid</A><DD>
<DT id="28"><A HREF="#lbAR">bytecode</A><DD>
<DT id="29"><A HREF="#lbAS">bytecode-file</A><DD>
</DL>
<DT id="30"><A HREF="#lbAT">EXAMPLES</A><DD>
<DL>
<DT id="31"><A HREF="#lbAU">eBPF TOOLING</A><DD>
<DT id="32"><A HREF="#lbAV">eBPF PROGRAMMING</A><DD>
<DT id="33"><A HREF="#lbAW">eBPF DEBUGGING</A><DD>
<DT id="34"><A HREF="#lbAX">cBPF</A><DD>
</DL>
<DT id="35"><A HREF="#lbAY">FURTHER READING</A><DD>
<DT id="36"><A HREF="#lbAZ">SEE ALSO</A><DD>
<DT id="37"><A HREF="#lbBA">AUTHORS</A><DD>
</DL>
|
|
<HR>
|
|
This document was created by
|
|
<A HREF="/cgi-bin/man/man2html">man2html</A>,
|
|
using the manual pages.<BR>
|
|
Time: 00:06:17 GMT, March 31, 2021
|
|
</BODY>
|
|
</HTML>
|