Kary Jin [Tue, 6 Nov 2018 04:48:02 +0000 (12:48 +0800)]
rt-patches: update patch#68 for hardlockup detector
After applying the patch enabling hardlockup detector
at https://git-master.nvidia.com/r/1934166, there is
a conflict when applying patch#68, like this:
patching file include/linux/printk.h
patching file kernel/printk/printk.c
patching file kernel/watchdog_hld.c
Hunk #1 FAILED at 19.
Hunk #2 succeeded at 198 (offset 94 lines).
Hunk #3 succeeded at 222 (offset 94 lines).
1 out of 3 hunks FAILED -- saving rejects to file
kernel/watchdog_hld.c.rej
make[1]: ***
[/home/buildbrain/Kary-kernel-only/tree/kernel-build/
make/Makefile.kernel:257:
/home/buildbrain/Kary-kernel-only/tree/out/
embedded_aarch64-arm64-tegra_gnu_linux_defconfig-v4m-rt_patches/
kernel/src-rt] Error 1
So this patch updates patch#68 to fix the conflict.
Colin Cross [Fri, 12 Oct 2018 07:33:59 +0000 (15:33 +0800)]
ANDROID: hardlockup: detect hard lockups without NMIs using secondary cpus
Emulate NMIs on systems where they are not available by using timer
interrupts on other cpus. Each cpu will use its softlockup hrtimer
to check that the next cpu is processing hrtimer interrupts by
verifying that a counter is increasing.
This patch is useful on systems where the hardlockup detector is not
available due to a lack of NMIs, for example most ARM SoCs.
Without this patch any cpu stuck with interrupts disabled can
cause a hardware watchdog reset with no debugging information,
but with this patch the kernel can detect the lockup and panic,
which can result in useful debugging info.
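The cross-CPU check described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: a fixed CPU count, ordinary arrays standing in for per-cpu variables, and the hypothetical names timer_tick() and next_cpu_is_stuck(), none of which are taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4 /* assumption: small fixed CPU count for illustration */

/* Incremented by each CPU from its own (softlockup) timer callback. */
static unsigned long hrtimer_interrupts[NR_CPUS_SKETCH];
/* Counter value observed for each CPU at the previous cross-check. */
static unsigned long hrtimer_interrupts_saved[NR_CPUS_SKETCH];

/* Hypothetical: what a CPU does on every timer interrupt it handles. */
void timer_tick(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    hrtimer_interrupts[cpu]++;
}

/*
 * Hypothetical: called by `cpu` from its own timer to watch its neighbour.
 * Returns true when the next CPU's counter has not advanced since the
 * previous check, i.e. that CPU is not processing timer interrupts.
 */
bool next_cpu_is_stuck(int cpu)
{
    int next;
    unsigned long now;
    bool stuck;

    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    next = (cpu + 1) % NR_CPUS_SKETCH;
    now = hrtimer_interrupts[next];
    stuck = (now == hrtimer_interrupts_saved[next]);
    hrtimer_interrupts_saved[next] = now;
    return stuck;
}
```

In this sketch a CPU that keeps calling timer_tick() is never reported, while one that stops ticking is reported at the neighbour's next check; a real detector would panic or dump state at that point.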
Change-Id: I83d6837cafcc6d6e7a70352f5a4d09c0ede1d8a4 Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Kary Jin <karyj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/1929802
(cherry picked from commit c039614dcce22309387769378d722b4c37bd352d)
Reviewed-on: https://git-master.nvidia.com/r/1934166
GVS: Gerrit_Virtual_Submit Reviewed-by: Daniel Fu <danifu@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Change-Id: I2b1739e068afbaf5eb39950516072bff8345ebfe Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-on: https://git-master.nvidia.com/r/1941034 Reviewed-by: Robert Huang (SW-TEGRA) <robhuang@nvidia.com> Tested-by: Robert Huang (SW-TEGRA) <robhuang@nvidia.com> Reviewed-by: Eric Zhang (SW-TEGRA) <ericz@nvidia.com> Tested-by: Eric Zhang (SW-TEGRA) <ericz@nvidia.com>
GVS: Gerrit_Virtual_Submit Reviewed-by: Daniel Fu <danifu@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
This patch enables KPTI - Kernel Page Table Isolation on Android. This
feature is required for all devices shipping with Android P, regardless
of the susceptibility of the device to speculative side channel attacks.
Naveen Kumar S [Mon, 22 Oct 2018 07:03:53 +0000 (12:33 +0530)]
video: fbdev: enable accurate pixel clk on android
This change enables the Accurate Pixel Clock feature on Android. This
feature helps in programming the exact pixel clock rate specified
in the standard CEA/VESA modes or in the detailed timings provided
by the sink. The original pixel clock rate is stored and used during
modeset without using the value that was converted between kHz
(specified in EDID or standard mode arrays) and picoseconds (needed
by fbdev), which helps avoid the precision loss caused by this
conversion.
Frank Chueh [Fri, 19 Oct 2018 10:49:44 +0000 (18:49 +0800)]
init: modify mount process on diag
When the system-as-root feature is enabled, the kernel
uses the ramdisk in the system partition, so the Diag
ramdisk (factory_ramdisk) is not mounted.
This change modifies the kernel mount process to use the original
ramdisk when compiling the Diag kernel.
Martin Gao [Sat, 6 Oct 2018 21:53:25 +0000 (14:53 -0700)]
drivers: allow dvfs to set external vmin to dfll
- added field named external_floor_output in tegra_dfll struct, and API to
allow dvfs to set Vmin in CLDVFS mode
- added field named joint_rail_with_dfll in dvfs_rail struct, so
external floor is only enforced on rail connected with dfll
- this feature should only be enabled on merge rail platforms, and it's
controlled by DT chosen "nvidia,tegra-joint_xpu_rail". Also, cc3/cc4
need to be disabled
Angus Liu [Fri, 12 Oct 2018 18:42:36 +0000 (11:42 -0700)]
arm64: configs: tegra_android: Added IO Scheduler
Enabled Deadline IO scheduler in kernel 4.9
- Deadline IO scheduler is only used by foster_e_hdd
(set via foster_e_hdd.rc in userspace)
- Prior to Kernel 4.9, the Deadline scheduler was enabled;
when tegra21_defconfig was merged into tegra_android_defconfig,
the deadline scheduler was left out.
- This patch restores the deadline scheduler config to its K3.10 state
AVerMedia AVerTV Volar Hybrid Q (H837) TV tuner
did not have a decoder module attached.
Add a decoder module in order to create
media controller entity correctly.
usbtuner: fix dvb functionality after v4l2 operation
Cameraserver calls v4l2_open to check the device type as
soon as a v4l2 device is registered, and then v4l2_close
if the device isn't a camera.
For cx231xx tuners, the device needs to stay in DIGITAL_MODE
for dvb_init, however, in v4l2_open the mode is switched to
ANALOG_MODE, which fails dvb_init. This patch fails v4l2_open
if it's called before dvb_init is done for cx231xx tuners.
For em28xx tuners, v4l2_close resets the usb interface alternate
to 0, but it is supposed to be 1 for the dvb function to work after
dvb_init. This patch skips the usb interface reset after dvb_init
for em28xx tuners.
media: cx231xx: bring back PWR_CTL_EN modification
Bring back PWR_CTL_EN modification for AVerMedia H837 TV tuner.
This change modifies upstream commit
https://github.com/torvalds/linux/commit/082417d10fafe7be835d143ade7114b5ce26cb50
which removed I2C port 3 switching by modifying PWR_CTL_EN directly.
The AVerMedia tuner requires the switch to work correctly.
Alex Waterman [Wed, 24 Oct 2018 08:22:28 +0000 (16:22 +0800)]
tegra21: emc: Only poll single rank for PD
Only poll the single active rank for power-down status when checking if
the DRAM has left auto power-down state and there's only a single rank
of DRAM.
Bibek Basu [Wed, 10 Oct 2018 10:03:58 +0000 (15:33 +0530)]
arm64: configs: enable IRQ_TIME_ACCOUNTING
Enable CONFIG_IRQ_TIME_ACCOUNTING for Android defconfig. This is
needed by load balancing so that the time servicing IRQ can be
considered when calculating rt average.
The issue happens because in probe(), the DMA controller is registered with the DT DMA
helpers and the registration is not freed during driver remove. Loading the driver
again then results in a memory abort and a kernel panic.
This patch frees up the resources with an of_dma_controller_free() call in driver
remove.
The following configs are enabled to support CAN:
CONFIG_CAN
CONFIG_CAN_VCAN
CONFIG_CAN_SLCAN
CONFIG_CAN_PEAK_PCIEFD
CONFIG_MTTCAN
CONFIG_TEGRA_HV_SECCAN
However, not all platforms would need CAN support, especially both
MTTCAN and TEGRA_HV_SECCAN depend on ARCH_TEGRA_18x_SOC, so these
could be just skipped for older SoCs, e.g. ARCH_TEGRA_210_SOC.
This patch changes all CAN configs to =m to build them as LKMs.
Naveen Kumar S [Mon, 27 Aug 2018 06:42:17 +0000 (12:12 +0530)]
video: fbdev: add support for storing pixclk in hz
fbdev framework stores pixelclock in pico seconds. Since monitors
advertize pixclock in hz, we need to convert it from Hz to pico
seconds when creating mode database. During modeset, it again needs
to be converted back to Hz. These conversions cause precision loss,
because of which the final pixclock value does not match the original
value advertized by the sink.
This issue is prominently seen in case of non-CEA modes.
Most of the monitors handle this slight variation in pixclock without
complaining. But, there are a few monitors that fail when the exact
pixclock is not programmed. Hence, interoperability is affected.
Avoiding the pixclock conversion helps in avoiding the precision loss
issue. But, the idea of updating the core kernel fbdev driver to use
pixel clock in hz instead of pico seconds was rejected because it might
cause ABI compatibility issues.
The finalized solution is to store the original pixclock in hz along
with the pixclock in pico seconds, and use the hz pixclock during
modeset instead of the converted one.
Implementation Details:
fb_var_screeninfo is the structure used by userspace clients to specify
their chosen mode timings to kernel. To avoid breaking the current fbdev
ABI, we avoided making any changes to this structure.
A new variable, pixclock_hz, has been added to the existing fb_videomode
structure. It is used to store the original pixclock value in hz during
mode database creation. As part of modeset, this value is used instead of
the one that went through conversion between hz and pico seconds. This
in turn helps in setting the accurate pixel clock advertized by sinks.
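The precision loss and the fix can be illustrated with a small userspace sketch. The two conversion helpers use the same integer arithmetic as fbdev's KHZ2PICOS() and PICOS2KHZ() macros; the struct is a reduced, hypothetical stand-in for fb_videomode, and the field type and the helper names make_mode() and modeset_rate_hz() are assumptions, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/* Same integer division as fbdev's KHZ2PICOS(): 10^9 / kHz. */
uint32_t khz_to_picos(uint32_t khz)
{
    assert(khz != 0);
    return UINT32_C(1000000000) / khz;
}

/* Same integer division as fbdev's PICOS2KHZ(): 10^9 / ps. */
uint32_t picos_to_khz(uint32_t ps)
{
    assert(ps != 0);
    return UINT32_C(1000000000) / ps;
}

/* Hypothetical, reduced stand-in for struct fb_videomode. */
struct videomode_sketch {
    uint32_t pixclock;    /* picoseconds, as fbdev stores it */
    uint32_t pixclock_hz; /* original rate in Hz, 0 if unknown (assumed type) */
};

/* Mode database creation: keep the original Hz value next to the ps value. */
struct videomode_sketch make_mode(uint32_t pixclock_hz)
{
    struct videomode_sketch m;

    m.pixclock = khz_to_picos(pixclock_hz / 1000);
    m.pixclock_hz = pixclock_hz;
    return m;
}

/* Modeset: prefer the stored Hz value over the lossy ps -> kHz round trip. */
uint32_t modeset_rate_hz(const struct videomode_sketch *m)
{
    if (m->pixclock_hz)
        return m->pixclock_hz;
    return picos_to_khz(m->pixclock) * 1000;
}
```

For a 154 MHz mode, the picosecond value truncates to 6493 and converts back to 154012 kHz, a 12 kHz error, whereas the stored Hz value is returned unchanged. A 148.5 MHz mode happens to survive the round trip, so the loss depends on the rate.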
New defconfig option introduced:
To make it easy to enable/disable this feature, the whole implementation
has been wrapped under a new defconfig option, CONFIG_FB_MODE_PIXCLOCK_HZ.
Any updates to existing data structures have also been isolated under this
defconfig option.
Change-Id: I9d3083953e62fad256fde7ca2010187ba9411b27 Signed-off-by: Om <omp@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/1923851 Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Commit aee0c9e837c2 ("sched/cputime: Fix ksoftirqd cputime accounting
regression") moved the calls to u64_stats_update_{begin,end} from
irqtime_account_irq into irqtime_account_delta. This was missed in
commit 03fcc2fe7130 ("Merge 4.9.135 into android-4.9"), which only
removed the u64_stats_update_begin call in irqtime_account_irq, because
of a conflict with commit 3a73c96a286f ("ANDROID: sched/walt: Accounting
for number of irqs pending on each core").
Since the two code blocks above and below this statement were gated by
CONFIG_SCHED_WALT, combine them into one at the same time.
Arnd Bergmann [Fri, 30 Jun 2017 15:46:16 +0000 (17:46 +0200)]
usb: typec: include linux/device.h in ucsi.h
The new driver causes a build failure in some configurations:
In file included from /git/arm-soc/drivers/usb/typec/ucsi/trace.h:9:0,
from /git/arm-soc/drivers/usb/typec/ucsi/trace.c:2:
drivers/usb/typec/ucsi/ucsi.h:331:39: error: 'struct device' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
This includes the required header file.
Fixes: c1b0bc2dabfa ("usb: typec: Add support for UCSI interface") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug 2284925
Change-Id: I0c3625c9021bfa9425691e47d9d8f788697d0a10
(cherry picked from commit 86be7f7b2d940ddc18143061e77989b017d93bf8) Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/1928285
(cherry picked from commit 8a6fe2878be58c8c25300297bfe652b958b296e9)
Reviewed-on: https://git-master.nvidia.com/r/1928614
GVS: Gerrit_Virtual_Submit Reviewed-by: Vinayak Pane <vpane@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Changes in 4.9.135
media: af9035: prevent buffer overflow on write
batman-adv: Fix segfault when writing to throughput_override
batman-adv: Fix segfault when writing to sysfs elp_interval
batman-adv: Prevent duplicated nc_node entry
batman-adv: Prevent duplicated softif_vlan entry
batman-adv: Prevent duplicated global TT entry
batman-adv: Prevent duplicated tvlv handler
batman-adv: fix backbone_gw refcount on queue_work() failure
batman-adv: fix hardif_neigh refcount on queue_work() failure
clocksource/drivers/ti-32k: Add CLOCK_SOURCE_SUSPEND_NONSTOP flag for non-am43 SoCs
scsi: ibmvscsis: Fix a stringop-overflow warning
scsi: ibmvscsis: Ensure partition name is properly NUL terminated
Input: atakbd - fix Atari keymap
Input: atakbd - fix Atari CapsLock behaviour
ravb: do not write 1 to reserved bits
drm: mali-dp: Call drm_crtc_vblank_reset on device init
scsi: sd: don't crash the host on invalid commands
net/mlx4: Use cpumask_available for eq->affinity_mask
powerpc/tm: Fix userspace r13 corruption
powerpc/tm: Avoid possible userspace r1 corruption on reclaim
iommu/amd: Return devid as alias for ACPI HID devices
mremap: properly flush TLB before releasing the page
mm: Preserve _PAGE_DEVMAP across mprotect() calls
netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info
ARC: build: Get rid of toolchain check
ARC: build: Don't set CROSS_COMPILE in arch's Makefile
HID: quirks: fix support for Apple Magic Keyboards
usb: gadget: serial: fix oops when data rx'd after close
sched/cputime: Convert kcpustat to nsecs
macintosh/rack-meter: Convert cputime64_t use to u64
sched/cputime: Increment kcpustat directly on irqtime account
sched/cputime: Fix ksoftirqd cputime accounting regression
ext4: avoid running out of journal credits when appending to an inline file
HV: properly delay KVP packets when negotiation is in progress
Linux 4.9.135
The host may send multiple negotiation packets
(due to timeout) before the KVP user-mode daemon
is connected.
We need to defer processing those packets
until the daemon is negotiated and connected.
It's okay for guest to respond
to all negotiation packets.
In addition, the host may send multiple staged
KVP requests as soon as negotiation is done.
We need to properly process those packets using one
tasklet for exclusive access to ring buffer.
This patch is based on the work of
Nick Meier <Nick.Meier@microsoft.com>.
The above is the original changelog of a3ade8cc474d ("HV: properly delay KVP packets when negotiation is in progress").
Here I re-worked the original patch because the mainline version
can't work for the linux-4.4.y branch, on which channel->callback_event
doesn't exist yet. In the mainline, channel->callback_event was added by: 631e63a9f346 ("vmbus: change to per channel tasklet"). Here we don't want
to backport it to v4.4, as it requires extra supporting changes and fixes,
which are unnecessary as to the KVP bug we're trying to resolve.
NOTE: before this patch is used, we should cherry-pick the other related
3 patches from the mainline first:
The background of this backport request is that: recently Wang Jian reported
some KVP issues: https://github.com/LIS/lis-next/issues/593:
e.g. the /var/lib/hyperv/.kvp_pool_* files can not be updated, and sometimes
if the hv_kvp_daemon doesn't timely start, the host may not be able to query
the VM's IP address via KVP.
Reported-by: Wang Jian <jianjian.wang1@gmail.com> Tested-by: Wang Jian <jianjian.wang1@gmail.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use a separate journal transaction if it turns out that we need to
convert an inline file to use a data block. Otherwise we could end
up failing due to not having journal credits.
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to subtract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would subtract its own CPU time every time
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.
But this behaviour got broken by:
a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account")
... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().
This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.
ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().
To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.
The irqtime is accounted in nsecs and stored in
cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the
accumulated amount reaches a new jiffy, this one gets accounted to the
kcpustat.
This was necessary when kcpustat was stored in cputime_t, which could at
worst have jiffies granularity. But now kcpustat is stored in nsecs
so this whole discretization game with temporary irqtime storage has
become unnecessary.
We can now directly account the irqtime to the kcpustat.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.
Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
[colona: minor conflict as 527b0a76f41d ("sched/cpuacct: Avoid %lld seq_printf
warning") is missing from v4.9] Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the gadget serial device has no associated TTY, do not pass any
received data into the TTY layer for processing; simply drop it instead.
This prevents the TTY layer from calling back into the gadget serial
driver, which will then crash in e.g. gs_write_room() due to lack of
gadget serial device to TTY association (i.e. a NULL pointer dereference).
Signed-off-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Natanael Copa [Thu, 18 Oct 2018 15:04:17 +0000 (17:04 +0200)]
HID: quirks: fix support for Apple Magic Keyboards
Commit b6cc0ba2cbf4 (HID: add support for Apple Magic Keyboards)
backported support for the Magic Keyboard over Bluetooth, but did not
add the BT_VENDOR_ID_APPLE to hid_have_special_driver[] so the hid-apple
driver is never loaded and Fn key does not work at all.
Adding BT_VENDOR_ID_APPLE to hid_have_special_driver[] is not needed
after commit e04a0442d33b (HID: core: remove the absolute need of
hid_have_special_driver[]), so 4.16 kernels and newer do not need it.
Fixes: b6cc0ba2cbf4 (HID: add support for Apple Magic Keyboards)
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=99881 Signed-off-by: Natanael Copa <ncopa@alpinelinux.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's not much sense in doing that because if user or
his build-system didn't set CROSS_COMPILE we still may
very well make incorrect guess.
But as it turned out setting CROSS_COMPILE is not as harmless
as one may think: with recent changes that implemented automatic
discovery of __host__ gcc features unconditional setup of
CROSS_COMPILE leads to failures on execution of "make xxx_defconfig"
with absent cross-compiler, for more info see [1].
Setting CROSS_COMPILE also gets in the way if we only want to build
.dtb's (again with absent cross-compiler which is not really needed
for building .dtb's), see [2].
Note, we had to change LIBGCC assignment type from ":=" to "="
so that it is resolved on its usage, otherwise if it is resolved
at declaration time with missing CROSS_COMPILE we're getting this
error message from host GCC:
| gcc: error: unrecognized command line option -mmedium-calls
| gcc: error: unrecognized command line option -mno-sdata
This check is very naive: we simply test if GCC invoked without
"-mcpu=XXX" has ARC700 define set. In that case we think that GCC
was built with "--with-cpu=arc700" and has libgcc built for ARC700.
Otherwise if ARC700 is not defined we think that everything was built
for ARCv2.
But in reality our life is much more interesting.
1. Regardless of GCC configuration (i.e. what we pass in "--with-cpu")
it may generate code for any ARC core.
2. libgcc might be built with explicitly specified "--mcpu=YYY"
That's exactly what happens in case of multilibbed toolchains:
- GCC is configured with default settings
- All the libs built for many different CPU flavors
I.e. that check gets in the way of usage of multilibbed
toolchains. And even non-multilibbed toolchains are affected.
OpenEmbedded also builds GCC without "--with-cpu" because
each and every target component later is compiled with explicitly
set "-mcpu=ZZZ".
Commit 4440a2ab3b9f ("netfilter: synproxy: Check oom when adding synproxy
and seqadj ct extensions") wanted to drop the packet when it fails to add
seqadj ext due to no memory by checking if nfct_seqadj_ext_add returns
NULL.
But nfct_seqadj_ext_add also returns NULL when the seqadj ext
already exists in a nf_conn. This causes the userspace protocol to stop
working when both dnat and snat are configured.
In a router, when both dnat and snat are added, nf_nat_setup_info will be
called twice. The packet can be dropped the 2nd time, for DNAT, because
the seqadj ext was already added the 1st time, for SNAT.
This patch is to fix it by checking for seqadj ext existence before adding
it, so that the packet will not be dropped if seqadj ext already exists.
Note that, as Florian mentioned, in the long term we should review the ext_add()
behaviour; it would be better to return a pointer to the existing ext instead.
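A minimal userspace model of the check-before-add logic described above. The types and helpers here (conn_sketch, ext_find(), ext_add(), nat_setup()) are hypothetical stand-ins, not the netfilter API; the only property carried over from the text is that the add helper returns NULL both on failure and when the extension is already present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct seqadj_ext_sketch { int unused; };

/* Hypothetical stand-in for a connection with an optional extension. */
struct conn_sketch {
    struct seqadj_ext_sketch ext;
    bool has_ext;
};

enum verdict_sketch { VERDICT_ACCEPT, VERDICT_DROP };

/* Returns the extension if the connection already has one, else NULL. */
struct seqadj_ext_sketch *ext_find(struct conn_sketch *ct)
{
    assert(ct != NULL);
    return ct->has_ext ? &ct->ext : NULL;
}

/*
 * Mimics the behaviour described in the text: NULL is returned when the
 * extension already exists, so NULL alone cannot be read as "out of memory".
 */
struct seqadj_ext_sketch *ext_add(struct conn_sketch *ct)
{
    assert(ct != NULL);
    if (ct->has_ext)
        return NULL;
    ct->has_ext = true;
    return &ct->ext;
}

/* Called once per NAT type; only a failed add on a missing ext drops. */
enum verdict_sketch nat_setup(struct conn_sketch *ct)
{
    if (!ext_find(ct) && !ext_add(ct))
        return VERDICT_DROP;
    return VERDICT_ACCEPT;
}
```

Calling nat_setup() twice on the same connection, as happens when both SNAT and DNAT are configured, accepts both times in this model because the second call finds the existing extension and never reaches ext_add().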
Fixes: 4440a2ab3b9f ("netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions") Reported-by: Li Shuang <shuali@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
when mprotect(2) gets used on DAX mappings. Also there is a wide variety
of other failures that can result from the missing _PAGE_DEVMAP flag
when the area gets used by get_user_pages() later.
Fix the problem by including _PAGE_DEVMAP in a set of flags that get
preserved by mprotect(2).
Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP") Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64") Cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jann Horn points out that our TLB flushing was subtly wrong for the
mremap() case. What makes mremap() special is that we don't follow the
usual "add page to list of pages to be freed, then flush tlb, and then
free pages". No, mremap() obviously just _moves_ the page from one page
table location to another.
That matters, because mremap() thus doesn't directly control the
lifetime of the moved page with a freelist: instead, the lifetime of the
page is controlled by the page table locking, that serializes access to
the entry.
As a result, we need to flush the TLB not just before releasing the lock
for the source location (to avoid any concurrent accesses to the entry),
but also before we release the destination page table lock (to avoid the
TLB being flushed after somebody else has already done something to that
page).
This also makes the whole "need_flush" logic unnecessary, since we now
always end up flushing the TLB for every valid entry.
ACPI HID devices do not actually have an alias for
them in the IVRS. But dev_data->alias is still used
for indexing into the IOMMU device table for devices
being handled by the IOMMU. So for ACPI HID devices,
we simply return the corresponding devid as an alias,
as parsed from IVRS table.
Currently we store the userspace r1 to PACATMSCRATCH before finally
saving it to the thread struct.
In theory an exception could be taken here (like a machine check or
SLB miss) that could write PACATMSCRATCH and hence corrupt the
userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
others do.
We've never actually seen this happen but it's theoretically
possible. Either way, the code is fragile as it is.
This patch saves r1 to the kernel stack (which can't fault) before we
turn MSR[RI] back on. PACATMSCRATCH is still used but only with
MSR[RI] off. We then copy r1 from the kernel stack to the thread
struct once we have MSR[RI] back on.
Suggested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we treclaim we store the userspace checkpointed r13 to a scratch
SPR and then later save the scratch SPR to the user thread struct.
Unfortunately, this doesn't work as accessing the user thread struct
can take an SLB fault and the SLB fault handler will write the same
scratch SPRG that now contains the userspace r13.
To fix this, we store r13 to the kernel stack (which can't fault)
before we access the user thread struct.
Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
as a random userspace segfault with r13 looking like a kernel address.
Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clang warns that the address of a pointer will always evaluate to true
in a boolean context:
drivers/net/ethernet/mellanox/mlx4/eq.c:243:11: warning: address of
array 'eq->affinity_mask' will always evaluate to 'true'
[-Wpointer-bool-conversion]
if (!eq->affinity_mask || cpumask_empty(eq->affinity_mask))
~~~~~^~~~~~~~~~~~~
1 warning generated.
Use cpumask_available, introduced in commit f7e30f01a9e2 ("cpumask: Add
helper cpumask_available()"), which does the proper checking and avoids
this warning.
Link: https://github.com/ClangBuiltLinux/linux/issues/86 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When sd_init_command() gets a command with an unknown req_op() it crashes the
system via BUG().
This makes debugging the actual reason for the broken request cmd_flags pretty
hard as the system is down before it's able to write out debugging data on the
serial console or the trace buffer.
Change the BUG() to a WARN_ON() and return BLKPREP_KILL to fail gracefully and
return an I/O error to the producer of the request.
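The shape of that change can be sketched in userspace as follows. The enums and the request struct are hypothetical stand-ins (PREP_OK and PREP_KILL play the role of BLKPREP_OK and BLKPREP_KILL), and a message on stderr stands in for WARN_ON(); this is not the sd driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-ins for the block layer's prep return codes. */
enum prep_result { PREP_OK, PREP_KILL };

/* Hypothetical subset of request operations the driver knows about. */
enum req_op_sketch { OP_READ, OP_WRITE, OP_FLUSH };

struct request_sketch { int op; };

enum prep_result init_command(const struct request_sketch *rq)
{
    assert(rq != NULL);

    switch (rq->op) {
    case OP_READ:
    case OP_WRITE:
    case OP_FLUSH:
        return PREP_OK;
    default:
        /* Warn and fail this one request instead of halting the system. */
        fprintf(stderr, "warning: unknown request op %d\n", rq->op);
        return PREP_KILL;
    }
}
```

The caller sees a failed request and can keep running, which leaves the warning available on the console or in a trace buffer for later debugging.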
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, if userspace calls drm_wait_vblank before the crtc is
activated the crtc vblank_enable hook is called, which in case of
malidp driver triggers some warnings. This happens because on
device init we don't inform the drm core about the vblank state
by calling drm_crtc_vblank_on/off/reset which together with
drm_vblank_get have some magic that prevents calling drm_vblank_enable
when crtc is off.
Signed-off-by: Alexandru Gheorghe <alexandru-cosmin.gheorghe@arm.com> Acked-by: Liviu Dudau <liviu.dudau@arm.com> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
EtherAVB hardware requires 0 to be written to status register bits in
order to clear them, however, care must be taken not to:
1. Clear other bits, by writing zero to them
2. Write one to reserved bits
This patch corrects the ravb driver with respect to the second point above.
This is done by defining reserved bit masks for the affected registers and,
after auditing the code, ensuring that all sites that may write a one to a
reserved bit are suitably masked.
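As a worked example of the masking rule, assume a hypothetical 32-bit status register whose upper 16 bits are reserved. The bit names and the reserved mask below are invented for illustration and are not the real EtherAVB register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout, for illustration only. */
#define STAT_EVENT_A  UINT32_C(0x00000001)
#define STAT_EVENT_B  UINT32_C(0x00000002)
#define STAT_RESERVED UINT32_C(0xFFFF0000)

/*
 * Value to write in order to clear `bits_to_clear` on a write-0-to-clear
 * register: 0 in the bits being cleared, 0 in every reserved bit, and 1
 * in all other bits so that they are left untouched.
 */
uint32_t clear_value(uint32_t bits_to_clear)
{
    assert((bits_to_clear & STAT_RESERVED) == 0);
    return ~(bits_to_clear | STAT_RESERVED);
}
```

With this layout, clearing STAT_EVENT_A writes 0x0000FFFE: bit 0 is zero, the reserved half is zero, and STAT_EVENT_B is written as one so it is not cleared by accident. A plain ~STAT_EVENT_A would instead write ones to all reserved bits.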
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com> Signed-off-by: Simon Horman <horms+renesas@verge.net.au> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While reviewing another part of the code, Kees noticed that the strncpy of the
partition name might not always be NUL terminated. Switch to using strscpy
which does this safely.
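The difference matters because strncpy() leaves the destination unterminated whenever the source is at least as long as the buffer. The helper below is a userspace stand-in that only imitates the always-terminate property of strscpy; it is not the kernel function, its name and its -1 return on truncation are assumptions, and unlike strscpy it requires a NUL-terminated source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical bounded copy: always NUL-terminates `dst` when size > 0.
 * Returns the number of characters copied (excluding the NUL), or -1 if
 * the source had to be truncated or the buffer has no room at all.
 */
long copy_terminated(char *dst, const char *src, size_t size)
{
    size_t len;

    assert(dst != NULL && src != NULL);
    if (size == 0)
        return -1;

    len = strlen(src);
    if (len >= size) {
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1); /* includes the terminating NUL */
    return (long)len;
}
```

Copying "partition" into a 4-byte buffer with this helper yields "par" plus a terminator and reports truncation, whereas strncpy() with the same arguments would fill all four bytes with "part" and leave no terminator.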
Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The hardif_neigh refcounter is to be decreased by the queued work and
currently is never decreased if the queue_work() call fails.
Fix by checking the queue_work() return value and decrease refcount
if necessary.
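The refcount handling can be modelled in userspace like this. All names are hypothetical; queue_work_sketch() only copies the one relevant property of queue_work(), namely that it returns false when the work item was already pending and therefore will not run an extra time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with a reference count and one work item. */
struct neigh_sketch {
    int refcount;
    bool work_pending;
};

/* Like queue_work(): false means the work was already queued. */
bool queue_work_sketch(struct neigh_sketch *n)
{
    if (n->work_pending)
        return false;
    n->work_pending = true;
    return true;
}

/* Take a reference for the worker, and give it back if queueing failed. */
void schedule_neigh_work(struct neigh_sketch *n)
{
    assert(n != NULL && n->refcount > 0);
    n->refcount++;
    if (!queue_work_sketch(n))
        n->refcount--; /* no additional worker run will drop this reference */
}

/* The queued work: does its job, then drops the reference it was handed. */
void neigh_worker(struct neigh_sketch *n)
{
    assert(n != NULL);
    n->work_pending = false;
    n->refcount--;
}
```

Without the decrement in the failure branch, every redundant scheduling attempt would leak one reference and the object could never be freed.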
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The backbone_gw refcounter is to be decreased by the queued work and
currently is never decreased if the queue_work() call fails.
Fix by checking the queue_work() return value and decrease refcount
if necessary.
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function batadv_tvlv_handler_register is responsible for adding new
tvlv_handler to the handler_list. It first checks whether the entry
already is in the list or not. If it is, then the creation of a new entry
is aborted.
But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.
The check and the manipulation of the list must therefore be in the same
locked code section.
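
A minimal sketch of the fixed pattern, assuming a simple singly linked list keyed by an integer. All names here (struct entry, struct keyed_list, add_unique()) are hypothetical, and a C11 atomic_flag spinlock stands in for the kernel's list lock; this is not the batman-adv code.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical list entry keyed by an integer. */
struct entry {
	int key;
	struct entry *next;
};

struct keyed_list {
	atomic_flag lock;	/* stand-in for the kernel spinlock */
	struct entry *head;
};

static void list_lock(struct keyed_list *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void list_unlock(struct keyed_list *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/*
 * Returns 1 if a new entry was added, 0 if the key already existed,
 * -1 on allocation failure. The lookup and the insertion share one
 * critical section, so no other context can add the same key between
 * the check and the list manipulation.
 */
static int add_unique(struct keyed_list *l, int key)
{
	struct entry *e;
	int ret = 0;

	list_lock(l);
	for (e = l->head; e; e = e->next)
		if (e->key == key)
			goto out;

	/* Kernel code would need a non-sleeping allocation here. */
	e = malloc(sizeof(*e));
	if (!e) {
		ret = -1;
		goto out;
	}
	e->key = key;
	e->next = l->head;
	l->head = e;
	ret = 1;
out:
	list_unlock(l);
	return ret;
}
```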
The function batadv_tt_global_orig_entry_add is responsible for adding new
tt_orig_list_entry to the orig_list. It first checks whether the entry
already is in the list or not. If it is, then the creation of a new entry
is aborted.
But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.
The check and the manipulation of the list must therefore be in the same
locked code section.
Fixes: d657e621a0f5 ("batman-adv: add reference counting for type batadv_tt_orig_list_entry") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function batadv_softif_vlan_get is responsible for adding new
softif_vlan to the softif_vlan_list. It first checks whether the entry
already is in the list or not. If it is, then the creation of a new entry
is aborted.
But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.
The check and the manipulation of the list must therefore be in the same
locked code section.
Fixes: 5d2c05b21337 ("batman-adv: add per VLAN interface attribute framework") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function batadv_nc_get_nc_node is responsible for adding new nc_nodes
to the in_coding_list and out_coding_list. It first checks whether the
entry already is in the list or not. If it is, then the creation of a new
entry is aborted.
But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.
The check and the manipulation of the list must therefore be in the same
locked code section.
Fixes: d56b1705e28c ("batman-adv: network coding - detect coding nodes and remove these after timeout") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The per hardif sysfs file "batman_adv/elp_interval" is using the generic
functions to store/show uint values. The helper __batadv_store_uint_attr
requires the softif net_device as parameter to print the resulting change
as info text when the user writes to this file. It uses the helper
function batadv_info to add it at the same time to the kernel ring buffer
and to the batman-adv debug log (when CONFIG_BATMAN_ADV_DEBUG is enabled).
The function batadv_info requires as first parameter the batman-adv softif
net_device. This parameter is then used to find the private buffer which
contains the debug log for this batman-adv interface. But
batadv_store_throughput_override used as first argument the slave
net_device. This slave device doesn't have the batadv_priv private data
which is accessed by batadv_info.
Writing to this file with CONFIG_BATMAN_ADV_DEBUG enabled can either lead
to a segfault or to memory corruption.
Fixes: 0744ff8fa8fa ("batman-adv: Add hard_iface specific sysfs wrapper macros for UINT") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The per hardif sysfs file "batman_adv/throughput_override" prints the
resulting change as info text when the user writes to this file. It uses
the helper function batadv_info to add it at the same time to the kernel
ring buffer and to the batman-adv debug log (when CONFIG_BATMAN_ADV_DEBUG
is enabled).
The function batadv_info requires as first parameter the batman-adv softif
net_device. This parameter is then used to find the private buffer which
contains the debug log for this batman-adv interface. But
batadv_store_throughput_override used as first argument the slave
net_device. This slave device doesn't have the batadv_priv private data
which is accessed by batadv_info.
Writing to this file with CONFIG_BATMAN_ADV_DEBUG enabled can either lead
to a segfault or to memory corruption.
Fixes: 0b5ecc6811bd ("batman-adv: add throughput override attribute to hard_ifaces") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When fewer than 3 bytes are written to the device, memcpy is called with
a negative size, which leads to a buffer overflow and a kernel panic. This
patch adds a length check and returns -EOPNOTSUPP instead.
Fixes bugzilla issue 64871
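
A minimal sketch of such a guard, assuming the write carries a 3-byte header followed by a payload. copy_payload(), HDR_LEN and the destination-size check are hypothetical and only illustrate why the length must be validated before it is used in a subtraction.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define HDR_LEN 3	/* assumed header size, from the "3 bytes" above */

/*
 * Hypothetical handler: copies the payload that follows the header.
 * Returns the payload length, or -EOPNOTSUPP if the write is shorter
 * than the header (or the payload does not fit), so len - HDR_LEN can
 * never wrap around to a huge unsigned value.
 */
static long copy_payload(unsigned char *dst, size_t dst_size,
			 const unsigned char *msg, size_t len)
{
	if (len < HDR_LEN)
		return -EOPNOTSUPP;
	if (len - HDR_LEN > dst_size)
		return -EOPNOTSUPP;

	memcpy(dst, msg + HDR_LEN, len - HDR_LEN);
	return (long)(len - HDR_LEN);
}
```

In this sketch a write of exactly 3 bytes is accepted and yields an empty payload.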
[mchehab+samsung@kernel.org: fix a merge conflict and changed the
condition to match the patch's comment, e.g. len == 3 could
also be valid] Signed-off-by: Jozef Balga <jozef.balga@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
blk-mq: flush pending hctx->run_work before freeing hctx
During resume, the block driver starts the queues and adds hctx->run_work
to the workqueue to process the pending requests in the hardware queue. In
the same sequence, the block driver frees the hardware queues whose mapped
CPU is offline. If hctx->run_work runs afterwards, the kernel crashes because
hctx has already been freed. To fix the issue, flush hctx->run_work before
freeing hctx.
blk-mq: remap queues when adding/removing hardware queues
blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.
David Zeuthen [Tue, 24 Jan 2017 18:17:01 +0000 (13:17 -0500)]
ANDROID: AVB error handler to invalidate vbmeta partition.
If androidboot.vbmeta.invalidate_on_error is 'yes' and
androidboot.vbmeta.device is set and points to a device with vbmeta
magic, this header will be overwritten upon an irrecoverable dm-verity
error. The side-effect of this is that the slot will fail to verify on
next reboot, effectively triggering the boot loader to fall back to
another slot. This works both if the vbmeta struct is at the start of a
partition and if there's an AVB footer at the end.
This code is based on drivers/md/dm-verity-chromeos.c from ChromiumOS.
Ken Chang [Tue, 2 Oct 2018 05:34:51 +0000 (13:34 +0800)]
memory: tegra: wait correct DLL state for dll update
At the end of the DLL disable/enable sequence, the driver needs to wait for
the update of bit CFG_DLL_EN of the EMC_CFG_DIG_DLL shadowed register.
Currently the enable sequence also waits for the DISABLED state, so
wait_for_update() spends a long time waiting and eventually bails
out due to a timeout.
This patch fixes the above issue.
Change-Id: Ied2e8be1babd62f703495682b0d17b422ab17cb6 Signed-off-by: Ken Chang <kenc@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/1917223
(cherry picked from commit dd1c10f635df58419edf5183c512cfc5d7714605) Signed-off-by: Joseph Lo <josephl@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/1928629
GVS: Gerrit_Virtual_Submit Reviewed-by: Aleksandr Frid <afrid@nvidia.com> Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com> Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
Stephen Warren [Thu, 11 Oct 2018 20:39:04 +0000 (14:39 -0600)]
memory: tegra: avoid divide-by-zero during boot on TX1
gcc 7.3 detects a code-path that can perform a divide-by-zero and replaces
this with an explicit "brk #0x3e8" (i.e. "brk #1000") instruction, which
causes a crash if the code is executed. This happens in the EMC driver
when Jetson TX1 is booting:
Unexpected kernel BRK exception at EL1
Unhandled debug exception: ptrace BRK handler (0xf20003e8) at 0x0000000000000000
This occurs because update_clock_tree_delay() always calculates "cval" in
various code-paths, but has only calculated the inputs to cval's
calculation in certain cases, hence causing divide-by-zero to occur in
other cases. This patch skips the calculation of cval except in the cases
where the input values have been calculated. This is acceptable since the
value of cval is only used in those same cases.
Changes in 4.9.134
ASoC: wm8804: Add ACPI support
ASoC: sigmadsp: safeload should not have lower byte limit
selftests/efivarfs: add required kernel configs
selftests: memory-hotplug: add required configs
mfd: omap-usb-host: Fix dts probe of children
scsi: iscsi: target: Don't use stack buffer for scatterlist
scsi: qla2xxx: Fix an endian bug in fcpcmd_is_corrupted()
sound: enable interrupt after dma buffer initialization
stmmac: fix valid numbers of unicast filter entries
net: macb: disable scatter-gather for macb on sama5d3
ARM: dts: at91: add new compatibility string for macb on sama5d3
x86/kvm/lapic: always disable MMIO interface in x2APIC mode
drm/amdgpu: Fix SDMA HQD destroy error on gfx_v7
ext4: Fix error code in ext4_xattr_set_entry()
mm/vmstat.c: fix outdated vmstat_text
mach64: detect the dot clock divider correctly on sparc
perf script python: Fix export-to-postgresql.py occasional failure
i2c: i2c-scmi: fix for i2c_smbus_write_block_data
xhci: Don't print a warning when setting link state for disabled ports
bnxt_en: Fix TX timeout during netpoll.
bonding: avoid possible dead-lock
ip6_tunnel: be careful when accessing the inner header
ip_tunnel: be careful when accessing the inner header
ipv4: fix use-after-free in ip_cmsg_recv_dstaddr()
ipv6: take rcu lock in rawv6_send_hdrinc()
net: dsa: bcm_sf2: Call setup during switch resume
net: hns: fix for unmapping problem when SMMU is on
net: ipv4: update fnhe_pmtu when first hop's MTU changes
net/ipv6: Display all addresses in output of /proc/net/if_inet6
netlabel: check for IPV4MASK in addrinfo_get
net/usb: cancel pending work when unbinding smsc75xx
qlcnic: fix Tx descriptor corruption on 82xx devices
qmi_wwan: Added support for Gemalto's Cinterion ALASxx WWAN interface
team: Forbid enslaving team device to itself
net: dsa: bcm_sf2: Fix unbind ordering
net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
net: systemport: Fix wake-up interrupt race during resume
rtnl: limit IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES to 4096
tcp/dccp: fix lockdep issue when SYN is backlogged
inet: make sure to grab rcu_read_lock before using ireq->ireq_opt
inet: frags: change inet_frags_init_net() return value
inet: frags: add a pointer to struct netns_frags
inet: frags: refactor ipfrag_init()
inet: frags: refactor ipv6_frag_init()
inet: frags: refactor lowpan_net_frag_init()
ipv6: export ip6 fragments sysctl to unprivileged users
rhashtable: add schedule points
inet: frags: use rhashtables for reassembly units
inet: frags: remove some helpers
inet: frags: get rif of inet_frag_evicting()
inet: frags: remove inet_frag_maybe_warn_overflow()
inet: frags: break the 2GB limit for frags storage
inet: frags: do not clone skb in ip_expire()
ipv6: frags: rewrite ip6_expire_frag_queue()
rhashtable: reorganize struct rhashtable layout
inet: frags: reorganize struct netns_frags
inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
inet: frags: fix ip6frag_low_thresh boundary
ip: discard IPv4 datagrams with overlapping segments.
net: speed up skb_rbtree_purge()
net: modify skb_rbtree_purge to return the truesize of all purged skbs.
ipv6: defrag: drop non-last frags smaller than min mtu
net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
net: add rb_to_skb() and other rb tree helpers
ip: use rb trees for IP frag queue.
ip: add helpers to process in-order fragments faster.
ip: process in-order fragments efficiently
ip: frags: fix crash in ip_do_fragment()
ipv4: frags: precedence bug in ip_expire()
Linux 4.9.134
We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.
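
The precedence issue can be reproduced in isolation. In this sketch FLAG_MORE is a hypothetical flag bit chosen for illustration; it is not the flag tested in ip_expire().

```c
#include <stdbool.h>

#define FLAG_MORE 0x2000u	/* hypothetical flag bit */

/*
 * Without parentheses this parses as (!flags) & FLAG_MORE. !flags is
 * either 0 or 1, so for a mask that does not include bit 0 the result
 * is always false.
 */
static bool flag_clear_buggy(unsigned int flags)
{
	return !flags & FLAG_MORE;
}

/* With parentheses: true exactly when the FLAG_MORE bit is not set. */
static bool flag_clear_fixed(unsigned int flags)
{
	return !(flags & FLAG_MORE);
}
```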
Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A kernel crash occurs when a defragmented packet is fragmented
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. But skb->sk and
skb->ip_defrag_offset are members of the same union, so
frag->sk is not NULL.
Hence the crash occurs in the skb->sk check in ip_do_fragment() when
a defragmented packet is fragmented.
v2:
- clear skb->sk at reassembly routine. (Eric Dumazet)
Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Peter Oskolkov [Wed, 10 Oct 2018 19:30:15 +0000 (12:30 -0700)]
ip: process in-order fragments efficiently
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.
On some workloads, UDP stream performance is substantially improved:
Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Peter Oskolkov [Wed, 10 Oct 2018 19:30:14 +0000 (12:30 -0700)]
ip: add helpers to process in-order fragments faster.
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.
The new logic (fully implemented in the second patch) is as follows:
* Nodes in the rb-tree will now contain not single fragments, but lists
of consecutive fragments ("runs").
* At each point in time, the current "active" run at the tail is
maintained/tracked. Fragments that arrive in-order, adjacent
to the previous tail fragment, are added to this tail run without
triggering the re-balancing of the rb-tree.
* If a fragment arrives out of order with the offset _before_ the tail run,
it is inserted into the rb-tree as a single fragment.
* If a fragment arrives after the current tail fragment (with a gap),
it starts a new "tail" run, and is inserted into the rb-tree
at the end as the head of the new run.
skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).
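
The three cases listed above can be modelled as a small decision function. This is only a sketch: struct tail_run, enum frag_action and classify_fragment() are hypothetical names, the run is modelled as a byte range [start, end), and the FRAG_OVERLAP case (an offset inside the tail run) is not described in this message and is included only so that the function is total.

```c
/* Hypothetical model of the current tail run: byte offsets [start, end). */
struct tail_run {
	unsigned int start;
	unsigned int end;
};

enum frag_action {
	FRAG_APPEND_TO_TAIL,	/* in order: extend the tail run, no rebalancing */
	FRAG_INSERT_SINGLE,	/* before the tail run: insert as a single fragment */
	FRAG_NEW_TAIL_RUN,	/* after a gap: becomes the head of a new tail run */
	FRAG_OVERLAP		/* inside the tail run: not covered by the text */
};

static enum frag_action classify_fragment(const struct tail_run *tail,
					  unsigned int offset)
{
	if (offset == tail->end)
		return FRAG_APPEND_TO_TAIL;
	if (offset > tail->end)
		return FRAG_NEW_TAIL_RUN;
	if (offset < tail->start)
		return FRAG_INSERT_SINGLE;
	return FRAG_OVERLAP;
}
```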
Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.
Tested:
- a follow-up patch contains a rather comprehensive ip defrag
self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
netstat --statistics
Ip:
282078937 total packets received
0 forwarded
0 incoming packets discarded
946760 incoming packets delivered
18743456 requests sent out
101 fragments dropped after timeout
282077129 reassemblies required
944952 packets reassembled ok
262734239 packet reassembles failed
(The numbers/stats above are somewhat better re:
reassemblies vs a kernel without this patchset. More
comprehensive performance testing TBD).
Reported-by: Jann Horn <jannh@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Wed, 10 Oct 2018 19:30:12 +0000 (12:30 -0700)]
net: add rb_to_skb() and other rb tree helpers
Generalize private netem_rb_to_skb()
TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric Dumazet [Wed, 10 Oct 2018 19:30:11 +0000 (12:30 -0700)]
net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
After working on IP defragmentation lately, I found that some large
packets defeat the CHECKSUM_COMPLETE optimization because the NIC adds
zero padding to the last (small) fragment.
While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.
We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.
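
The arithmetic behind this can be sketched with a plain 16-bit one's-complement sum. The helpers below (csum16(), csum16_sub(), csum16_after_trim()) are hypothetical and are not the kernel's checksum API; the sketch also assumes that both the kept length and the total length are even, so that no byte-order rotation is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t csum16_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* 16-bit one's-complement sum over len bytes (len assumed even). */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
		sum = csum16_fold(sum);	/* keep the accumulator small */
	}
	return (uint16_t)sum;
}

/* One's-complement subtraction: a - b is a + ~b with end-around carry. */
static uint16_t csum16_sub(uint16_t a, uint16_t b)
{
	return csum16_fold((uint32_t)a + (uint16_t)~b);
}

/*
 * Checksum of the first keep bytes, derived from the checksum of the
 * whole buffer by summing only the trimmed tail and subtracting it.
 */
static uint16_t csum16_after_trim(uint16_t full, const uint8_t *buf,
				  size_t keep, size_t len)
{
	return csum16_sub(full, csum16(buf + keep, len - keep));
}
```

Only len - keep bytes are read again, instead of the keep bytes a full revalidation would touch. Note that one's-complement arithmetic has two representations of zero (0x0000 and 0xffff), so a real comparison would have to treat them as equal.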
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 88078d98d1bb085d72af8437707279e203524fa5) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Florian Westphal [Wed, 10 Oct 2018 19:30:10 +0000 (12:30 -0700)]
ipv6: defrag: drop non-last frags smaller than min mtu
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).
v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.
Cc: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>