next/master boot bisection: next-20190215 on beaglebone-black


Guillaume Tucker
 

On 01/03/2019 00:55, Dan Williams wrote:
On Thu, Feb 28, 2019 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Tue, 26 Feb 2019 16:04:04 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

On Tue, Feb 26, 2019 at 4:00 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Fri, 15 Feb 2019 18:51:51 +0000 Mark Brown <broonie@kernel.org> wrote:

On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" <bot@kernelci.org> wrote:
Details: https://kernelci.org/boot/id/5c666ea959b514b017fe6017
Plain log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
HTML log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
Thanks.
But what actually went wrong? Kernel doesn't boot?
The linked logs show the kernel dying early in boot before the console
comes up so yeah. There should be kernel output at the bottom of the
logs.
I assume Dan is distracted - I'll keep this patchset on hold until we
can get to the bottom of this.
Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
bot@kernelci.org is not a person; it's a send-only account for
automated reports. So no, it doesn't read emails.

I guess the tricky point here is that the authors of the commits
found by bisections may not always have the hardware needed to
reproduce the problem. So it needs to be dealt with on a
case-by-case basis: sometimes they do have the hardware,
sometimes someone else on the list or on CC does, and sometimes
it's better for the people who have access to the test lab which
ran the KernelCI test to deal with it.

This case seems to fall into the last category. As I have access
to the Collabora lab, I can do some quick checks to confirm
whether the proposed patch does fix the issue. I hadn't realised
that someone was waiting for this to happen, especially as the
BeagleBone Black is a very common platform. Sorry about that,
I'll take a look today.

It may be a nice feature to be able to give access to the
KernelCI test infrastructure to anyone who wants to debug an
issue reported by KernelCI or verify a fix, so they won't need to
have the hardware locally. Something to think about for the
future.

Is it possible to determine whether this regression is still present in
current linux-next?
I'll try to re-apply the patch that caused the issue, then see if
the suggested change fixes it. As far as the current linux-next
master branch is concerned, KernelCI boot tests are passing fine
on that platform.

Guillaume


Andrew Morton <akpm@...>
 

On Tue, 26 Feb 2019 16:04:04 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

On Tue, Feb 26, 2019 at 4:00 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Fri, 15 Feb 2019 18:51:51 +0000 Mark Brown <broonie@kernel.org> wrote:

On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" <bot@kernelci.org> wrote:
Details: https://kernelci.org/boot/id/5c666ea959b514b017fe6017
Plain log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
HTML log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
Thanks.
But what actually went wrong? Kernel doesn't boot?
The linked logs show the kernel dying early in boot before the console
comes up so yeah. There should be kernel output at the bottom of the
logs.
I assume Dan is distracted - I'll keep this patchset on hold until we
can get to the bottom of this.
Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..

Is it possible to determine whether this regression is still present in
current linux-next?

I assume you're not willing to entertain a "depends
NOT_THIS_ARM_BOARD" hack in the meantime?
We'd probably never be able to remove it. And we don't know whether
other systems might be affected.


Dan Williams <dan.j.williams@...>
 

On Thu, Feb 28, 2019 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Tue, 26 Feb 2019 16:04:04 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

On Tue, Feb 26, 2019 at 4:00 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Fri, 15 Feb 2019 18:51:51 +0000 Mark Brown <broonie@kernel.org> wrote:

On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" <bot@kernelci.org> wrote:
Details: https://kernelci.org/boot/id/5c666ea959b514b017fe6017
Plain log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
HTML log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
Thanks.
But what actually went wrong? Kernel doesn't boot?
The linked logs show the kernel dying early in boot before the console
comes up so yeah. There should be kernel output at the bottom of the
logs.
I assume Dan is distracted - I'll keep this patchset on hold until we
can get to the bottom of this.
Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
Thanks, yes. The logs don't give much to go on, so I can only iterate
on this as fast as I can drum up feedback.


Is it possible to determine whether this regression is still present in
current linux-next?

I assume you're not willing to entertain a "depends
NOT_THIS_ARM_BOARD" hack in the meantime?
We'd probably never be able to remove it. And we don't know whether
other systems might be affected.
Right, and agree. I was just grasping at straws because I know of
users that want to take advantage of this and was lamenting the
upcoming apology tour saying, "sorry, maybe v5.2". I had always
expected that platforms outside of x86-servers would need to do their
own validation / evaluation before recommending this, and the
regression concern is why it defaulted to disabled... but boot
regressions are boot regressions.


Mike Rapoport <rppt@...>
 

On Fri, Mar 01, 2019 at 09:25:24AM +0100, Guillaume Tucker wrote:
On 01/03/2019 00:55, Dan Williams wrote:
On Thu, Feb 28, 2019 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Tue, 26 Feb 2019 16:04:04 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

On Tue, Feb 26, 2019 at 4:00 PM Andrew Morton <akpm@linux-foundation.org> wrote:

On Fri, 15 Feb 2019 18:51:51 +0000 Mark Brown <broonie@kernel.org> wrote:

On Fri, Feb 15, 2019 at 10:43:25AM -0800, Andrew Morton wrote:
On Fri, 15 Feb 2019 10:20:10 -0800 (PST) "kernelci.org bot" <bot@kernelci.org> wrote:
Details: https://kernelci.org/boot/id/5c666ea959b514b017fe6017
Plain log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.txt
HTML log: https://storage.kernelci.org//next/master/next-20190215/arm/multi_v7_defconfig+CONFIG_SMP=n/gcc-7/lab-collabora/boot-am335x-boneblack.html
Thanks.
But what actually went wrong? Kernel doesn't boot?
The linked logs show the kernel dying early in boot before the console
comes up so yeah. There should be kernel output at the bottom of the
logs.
I assume Dan is distracted - I'll keep this patchset on hold until we
can get to the bottom of this.
Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
bot@kernelci.org is not a person; it's a send-only account for
automated reports. So no, it doesn't read emails.

I guess the tricky point here is that the authors of the commits
found by bisections may not always have the hardware needed to
reproduce the problem. So it needs to be dealt with on a
case-by-case basis: sometimes they do have the hardware,
sometimes someone else on the list or on CC does, and sometimes
it's better for the people who have access to the test lab which
ran the KernelCI test to deal with it.

This case seems to fall into the last category. As I have access
to the Collabora lab, I can do some quick checks to confirm
whether the proposed patch does fix the issue. I hadn't realised
that someone was waiting for this to happen, especially as the
BeagleBone Black is a very common platform. Sorry about that,
I'll take a look today.

It may be a nice feature to be able to give access to the
KernelCI test infrastructure to anyone who wants to debug an
issue reported by KernelCI or verify a fix, so they won't need to
have the hardware locally. Something to think about for the
future.
Another thing to consider is adding "earlyprintk debug" to the kernel
command line for the boot tests.

Is it possible to determine whether this regression is still present in
current linux-next?
I'll try to re-apply the patch that caused the issue, then see if
the suggested change fixes it. As far as the current linux-next
master branch is concerned, KernelCI boot tests are passing fine
on that platform.

Guillaume
--
Sincerely yours,
Mike.


Mark Brown
 

On Fri, Mar 01, 2019 at 12:40:11PM +0200, Mike Rapoport wrote:

Another thing to consider is adding "earlyprintk debug" to the kernel
command line for the boot tests.
We probably don't want to do that on all the tests since it does
occasionally change timing enough to "fix" things but doing a final boot
with the failing commit and earlyprintk turned on is definitely a good
idea.


Guillaume Tucker
 

On 01/03/2019 20:41, Andrew Morton wrote:
On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker <guillaume.tucker@collabora.com> wrote:

Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
bot@kernelci.org is not a person; it's a send-only account for
automated reports. So no, it doesn't read emails.

I guess the tricky point here is that the authors of the commits
found by bisections may not always have the hardware needed to
reproduce the problem. So it needs to be dealt with on a
case-by-case basis: sometimes they do have the hardware,
sometimes someone else on the list or on CC does, and sometimes
it's better for the people who have access to the test lab which
ran the KernelCI test to deal with it.

This case seems to fall into the last category. As I have access
to the Collabora lab, I can do some quick checks to confirm
whether the proposed patch does fix the issue. I hadn't realised
that someone was waiting for this to happen, especially as the
BeagleBone Black is a very common platform. Sorry about that,
I'll take a look today.

It may be a nice feature to be able to give access to the
KernelCI test infrastructure to anyone who wants to debug an
issue reported by KernelCI or verify a fix, so they won't need to
have the hardware locally. Something to think about for the
future.
Thanks, that all sounds good.

Is it possible to determine whether this regression is still present in
current linux-next?
I'll try to re-apply the patch that caused the issue, then see if
the suggested change fixes it. As far as the current linux-next
master branch is concerned, KernelCI boot tests are passing fine
on that platform.
They would, because I dropped
mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably
now have shuffling disabled.

Is it possible to add the below to linux-next and try again?
I've actually already done that, and essentially the issue can
still be reproduced by applying that patch. See this branch:

https://gitlab.collabora.com/gtucker/linux/commits/next-20190301-beaglebone-black-debug

next-20190301 boots fine but the head fails, using
multi_v7_defconfig + SMP=n in both cases and
SHUFFLE_PAGE_ALLOCATOR=y enabled in the 2nd case as a result
of the change in the default value.

The change suggested by Michal Hocko on Feb 15th has now been
applied in linux-next as part of the commit below but, as
explained above, it does not actually resolve the boot failure:

98cf198ee8ce mm: move buddy list manipulations into helpers

I can send more details on Monday and do a bit of debugging to
help narrow down the problem. Please let me know if
there's anything in particular that would seem to be worth
trying.

Or I can re-add this to linux-next. Where should we go to determine
the results of such a change? There are a heck of a lot of results on
https://kernelci.org/boot/ and entering "beaglebone-black" doesn't get
me anything.
The BeagleBone Black board was offline for a few days in our
lab, which probably explains why you're not getting many
results from the web interface. Hopefully we'll see passing
boot results in linux-next tomorrow now that the board is back
on track.

It's quite easy for me to submit test jobs with kernels I've
built myself instead of going through the full linux-next and
KernelCI loop. So that's the best way to try things out, then
when a fix has been found it can be applied in linux-next on
top of the mm/shuffle change to verify it in KernelCI.

Guillaume

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm/shuffle: default enable all shuffling

Per Andrew's request arrange for all memory allocation shuffling code to
be enabled by default.

The page_alloc.shuffle command line parameter can still be used to disable
shuffling at boot, but the kernel will default enable the shuffling if the
command line option is not specified.

Link: http://lkml.kernel.org/r/154943713572.3858443.11206307988382889377.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

init/Kconfig | 4 ++--
mm/shuffle.c | 4 ++--
mm/shuffle.h | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)

--- a/init/Kconfig~mm-shuffle-default-enable-all-shuffling
+++ a/init/Kconfig
@@ -1709,7 +1709,7 @@ config SLAB_MERGE_DEFAULT
 	  command line.
 
 config SLAB_FREELIST_RANDOM
-	default n
+	default y
 	depends on SLAB || SLUB
 	bool "SLAB freelist randomization"
 	help
@@ -1728,7 +1728,7 @@ config SLAB_FREELIST_HARDENED
 
 config SHUFFLE_PAGE_ALLOCATOR
 	bool "Page allocator randomization"
-	default SLAB_FREELIST_RANDOM && ACPI_NUMA
+	default y
 	help
 	  Randomization of the page allocator improves the average
 	  utilization of a direct-mapped memory-side-cache. See section
--- a/mm/shuffle.c~mm-shuffle-default-enable-all-shuffling
+++ a/mm/shuffle.c
@@ -9,8 +9,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
-DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
-static unsigned long shuffle_state __ro_after_init;
+DEFINE_STATIC_KEY_TRUE(page_alloc_shuffle_key);
+static unsigned long shuffle_state __ro_after_init = 1 << SHUFFLE_ENABLE;
 
 /*
  * Depending on the architecture, module parameter parsing may run
--- a/mm/shuffle.h~mm-shuffle-default-enable-all-shuffling
+++ a/mm/shuffle.h
@@ -19,7 +19,7 @@ enum mm_shuffle_ctl {
 #define SHUFFLE_ORDER (MAX_ORDER-1)
 
 #ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+DECLARE_STATIC_KEY_TRUE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
 static inline void shuffle_free_memory(pg_data_t *pgdat)
_
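
The effect of the patch above on the default state can be modelled in user space. What follows is only a rough sketch, not the kernel implementation: SHUFFLE_ENABLE, shuffle_state and enum mm_shuffle_ctl are names taken from the patch, while SHUFFLE_FORCE_DISABLE, the plain boolean standing in for the static key, and every toy_* helper are assumptions made for illustration. It shows the enable bit being set before any parameter parsing, and a disable request from the command line still winning.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model only; not kernel code. SHUFFLE_ENABLE and shuffle_state
 * mirror names in the patch; SHUFFLE_FORCE_DISABLE and the toy_* helpers
 * are assumptions made for this illustration.
 */
enum mm_shuffle_ctl {
	SHUFFLE_ENABLE,
	SHUFFLE_FORCE_DISABLE,
};

/* With the patch applied, the enable bit is set before any parsing runs. */
static unsigned long shuffle_state = 1UL << SHUFFLE_ENABLE;

/* Stand-in for the static key: true means the shuffle code path is live. */
static bool toy_shuffle_key = true;

/* Apply a control request, letting a forced disable win over the default. */
static void toy_page_alloc_shuffle(enum mm_shuffle_ctl ctl)
{
	if (ctl == SHUFFLE_FORCE_DISABLE)
		shuffle_state |= 1UL << SHUFFLE_FORCE_DISABLE;

	if (shuffle_state & (1UL << SHUFFLE_FORCE_DISABLE)) {
		shuffle_state &= ~(1UL << SHUFFLE_ENABLE);
		toy_shuffle_key = false;
	} else if (ctl == SHUFFLE_ENABLE) {
		shuffle_state |= 1UL << SHUFFLE_ENABLE;
		toy_shuffle_key = true;
	}
}

/* Model one boot, with or without a "disable shuffling" request. */
static bool toy_boot(bool cmdline_disables_shuffle)
{
	/* Reset to the patched defaults: enabled unless told otherwise. */
	shuffle_state = 1UL << SHUFFLE_ENABLE;
	toy_shuffle_key = true;

	if (cmdline_disables_shuffle)
		toy_page_alloc_shuffle(SHUFFLE_FORCE_DISABLE);

	/* The key and the enable bit must always agree. */
	assert(toy_shuffle_key == !!(shuffle_state & (1UL << SHUFFLE_ENABLE)));
	return toy_shuffle_key;
}
```

Calling toy_boot(false) models a boot with no command line option, where shuffling stays on; toy_boot(true) models a boot where the option disables it. This matches why the failing test needed no extra parameters once the default changed.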


Andrew Morton <akpm@...>
 

On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker <guillaume.tucker@collabora.com> wrote:

Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
bot@kernelci.org is not a person; it's a send-only account for
automated reports. So no, it doesn't read emails.

I guess the tricky point here is that the authors of the commits
found by bisections may not always have the hardware needed to
reproduce the problem. So it needs to be dealt with on a
case-by-case basis: sometimes they do have the hardware,
sometimes someone else on the list or on CC does, and sometimes
it's better for the people who have access to the test lab which
ran the KernelCI test to deal with it.

This case seems to fall into the last category. As I have access
to the Collabora lab, I can do some quick checks to confirm
whether the proposed patch does fix the issue. I hadn't realised
that someone was waiting for this to happen, especially as the
BeagleBone Black is a very common platform. Sorry about that,
I'll take a look today.

It may be a nice feature to be able to give access to the
KernelCI test infrastructure to anyone who wants to debug an
issue reported by KernelCI or verify a fix, so they won't need to
have the hardware locally. Something to think about for the
future.
Thanks, that all sounds good.

Is it possible to determine whether this regression is still present in
current linux-next?
I'll try to re-apply the patch that caused the issue, then see if
the suggested change fixes it. As far as the current linux-next
master branch is concerned, KernelCI boot tests are passing fine
on that platform.
They would, because I dropped
mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably
now have shuffling disabled.

Is it possible to add the below to linux-next and try again?

Or I can re-add this to linux-next. Where should we go to determine
the results of such a change? There are a heck of a lot of results on
https://kernelci.org/boot/ and entering "beaglebone-black" doesn't get
me anything.

Thanks.



From: Dan Williams <dan.j.williams@intel.com>
Subject: mm/shuffle: default enable all shuffling

Per Andrew's request arrange for all memory allocation shuffling code to
be enabled by default.

The page_alloc.shuffle command line parameter can still be used to disable
shuffling at boot, but the kernel will default enable the shuffling if the
command line option is not specified.

Link: http://lkml.kernel.org/r/154943713572.3858443.11206307988382889377.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

init/Kconfig | 4 ++--
mm/shuffle.c | 4 ++--
mm/shuffle.h | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)

--- a/init/Kconfig~mm-shuffle-default-enable-all-shuffling
+++ a/init/Kconfig
@@ -1709,7 +1709,7 @@ config SLAB_MERGE_DEFAULT
 	  command line.
 
 config SLAB_FREELIST_RANDOM
-	default n
+	default y
 	depends on SLAB || SLUB
 	bool "SLAB freelist randomization"
 	help
@@ -1728,7 +1728,7 @@ config SLAB_FREELIST_HARDENED
 
 config SHUFFLE_PAGE_ALLOCATOR
 	bool "Page allocator randomization"
-	default SLAB_FREELIST_RANDOM && ACPI_NUMA
+	default y
 	help
 	  Randomization of the page allocator improves the average
 	  utilization of a direct-mapped memory-side-cache. See section
--- a/mm/shuffle.c~mm-shuffle-default-enable-all-shuffling
+++ a/mm/shuffle.c
@@ -9,8 +9,8 @@
 #include "internal.h"
 #include "shuffle.h"
 
-DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
-static unsigned long shuffle_state __ro_after_init;
+DEFINE_STATIC_KEY_TRUE(page_alloc_shuffle_key);
+static unsigned long shuffle_state __ro_after_init = 1 << SHUFFLE_ENABLE;
 
 /*
  * Depending on the architecture, module parameter parsing may run
--- a/mm/shuffle.h~mm-shuffle-default-enable-all-shuffling
+++ a/mm/shuffle.h
@@ -19,7 +19,7 @@ enum mm_shuffle_ctl {
 #define SHUFFLE_ORDER (MAX_ORDER-1)
 
 #ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
+DECLARE_STATIC_KEY_TRUE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
 static inline void shuffle_free_memory(pg_data_t *pgdat)
_


Dan Williams <dan.j.williams@...>
 

On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

On 01/03/2019 20:41, Andrew Morton wrote:
On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker <guillaume.tucker@collabora.com> wrote:

Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
bot@kernelci.org is not a person; it's a send-only account for
automated reports. So no, it doesn't read emails.

I guess the tricky point here is that the authors of the commits
found by bisections may not always have the hardware needed to
reproduce the problem. So it needs to be dealt with on a
case-by-case basis: sometimes they do have the hardware,
sometimes someone else on the list or on CC does, and sometimes
it's better for the people who have access to the test lab which
ran the KernelCI test to deal with it.

This case seems to fall into the last category. As I have access
to the Collabora lab, I can do some quick checks to confirm
whether the proposed patch does fix the issue. I hadn't realised
that someone was waiting for this to happen, especially as the
BeagleBone Black is a very common platform. Sorry about that,
I'll take a look today.

It may be a nice feature to be able to give access to the
KernelCI test infrastructure to anyone who wants to debug an
issue reported by KernelCI or verify a fix, so they won't need to
have the hardware locally. Something to think about for the
future.
Thanks, that all sounds good.

Is it possible to determine whether this regression is still present in
current linux-next?
I'll try to re-apply the patch that caused the issue, then see if
the suggested change fixes it. As far as the current linux-next
master branch is concerned, KernelCI boot tests are passing fine
on that platform.
They would, because I dropped
mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably
now have shuffling disabled.

Is it possible to add the below to linux-next and try again?
I've actually already done that, and essentially the issue can
still be reproduced by applying that patch. See this branch:

https://gitlab.collabora.com/gtucker/linux/commits/next-20190301-beaglebone-black-debug

next-20190301 boots fine but the head fails, using
multi_v7_defconfig + SMP=n in both cases and
SHUFFLE_PAGE_ALLOCATOR=y enabled in the 2nd case as a result
of the change in the default value.

The change suggested by Michal Hocko on Feb 15th has now been
applied in linux-next, it's part of this commit but as
explained above it does not actually resolve the boot failure:

98cf198ee8ce mm: move buddy list manipulations into helpers

I can send more details on Monday and do a bit of debugging to
help narrow down the problem. Please let me know if
there's anything in particular that would seem to be worth
trying.
Thanks for taking a look!

Some questions when you get a chance:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?

Do any of the QEMU machine types [1] approximate this board? I.e. so I
might be able to independently debug.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.

Thanks for the help!

[1]: https://wiki.qemu.org/Documentation/Platforms/ARM


Guillaume Tucker
 

On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

On 01/03/2019 20:41, Andrew Morton wrote:
On Fri, 1 Mar 2019 09:25:24 +0100 Guillaume Tucker <guillaume.tucker@collabora.com> wrote:

Michal had asked if the free space accounting fix up addressed this
boot regression? I was awaiting word on that.
hm, does bot@kernelci.org actually read emails? Let's try info@ as well..
bot@kernelci.org is not a person; it's a send-only account for
automated reports. So no, it doesn't read emails.

I guess the tricky point here is that the authors of the commits
found by bisections may not always have the hardware needed to
reproduce the problem. So it needs to be dealt with on a
case-by-case basis: sometimes they do have the hardware,
sometimes someone else on the list or on CC does, and sometimes
it's better for the people who have access to the test lab which
ran the KernelCI test to deal with it.

This case seems to fall into the last category. As I have access
to the Collabora lab, I can do some quick checks to confirm
whether the proposed patch does fix the issue. I hadn't realised
that someone was waiting for this to happen, especially as the
BeagleBone Black is a very common platform. Sorry about that,
I'll take a look today.

It may be a nice feature to be able to give access to the
KernelCI test infrastructure to anyone who wants to debug an
issue reported by KernelCI or verify a fix, so they won't need to
have the hardware locally. Something to think about for the
future.
Thanks, that all sounds good.

Is it possible to determine whether this regression is still present in
current linux-next?
I'll try to re-apply the patch that caused the issue, then see if
the suggested change fixes it. As far as the current linux-next
master branch is concerned, KernelCI boot tests are passing fine
on that platform.
They would, because I dropped
mm-shuffle-default-enable-all-shuffling.patch, so your tests presumably
now have shuffling disabled.

Is it possible to add the below to linux-next and try again?
I've actually already done that, and essentially the issue can
still be reproduced by applying that patch. See this branch:

https://gitlab.collabora.com/gtucker/linux/commits/next-20190301-beaglebone-black-debug

next-20190301 boots fine but the head fails, using
multi_v7_defconfig + SMP=n in both cases and
SHUFFLE_PAGE_ALLOCATOR=y enabled in the 2nd case as a result
of the change in the default value.

The change suggested by Michal Hocko on Feb 15th has now been
applied in linux-next, it's part of this commit but as
explained above it does not actually resolve the boot failure:

98cf198ee8ce mm: move buddy list manipulations into helpers

I can send more details on Monday and do a bit of debugging to
help narrow down the problem. Please let me know if
there's anything in particular that would seem to be worth
trying.
Thanks for taking a look!

Some questions when you get a chance:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also when
enabling "debug" on the kernel command line, the issue goes
away (exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Do any of the QEMU machine types [1] approximate this board? I.e. so I
might be able to independently debug.
Unfortunately there doesn't appear to be any QEMU machine
emulating the TI AM335x SoC or the BeagleBone Black board.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that doesn't cover every ARM board out there, but
it's fairly broad coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.

Thanks,
Guillaume

Thanks for the help!

[1]: https://wiki.qemu.org/Documentation/Platforms/ARM


Mike Rapoport <rppt@...>
 

On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also when
enabling "debug" on the kernel command line, the issue goes
away (exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that's not all the ARM boards that exist out there, but
it's a fairly large coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.
ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
failure has something to do with it...

Guillaume, can you try this patch:

diff --git a/mm/shuffle.c b/mm/shuffle.c
index 3ce1248..4a04aac 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -58,7 +58,8 @@ module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
  * For two pages to be swapped in the shuffle, they must be free (on a
  * 'free_area' lru), have the same order, and have the same migratetype.
  */
-static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order,
+						  struct zone *z)
 {
 	struct page *page;
 
@@ -80,6 +81,9 @@ static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
 	if (!PageBuddy(page))
 		return NULL;
 
+	if (!memmap_valid_within(pfn, page, z))
+		return NULL;
+
 	/*
 	 * ...is the page on the same list as the page we will
 	 * shuffle it with?
@@ -123,7 +127,7 @@ void __meminit __shuffle_zone(struct zone *z)
 		 * page_j randomly selected in the span @zone_start_pfn to
 		 * @spanned_pages.
 		 */
-		page_i = shuffle_valid_page(i, order);
+		page_i = shuffle_valid_page(i, order, z);
 		if (!page_i)
 			continue;
 
@@ -137,7 +141,7 @@ void __meminit __shuffle_zone(struct zone *z)
 			j = z->zone_start_pfn +
 				ALIGN_DOWN(get_random_long() % z->spanned_pages,
 						order_pages);
-			page_j = shuffle_valid_page(j, order);
+			page_j = shuffle_valid_page(j, order, z);
 			if (page_j && page_j != page_i)
 				break;
 		}


--
Sincerely yours,
Mike.


Guillaume Tucker
 

On 06/03/2019 14:05, Mike Rapoport wrote:
On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also when
enabling "debug" on the kernel command line, the issue goes
away (exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that's not all the ARM boards that exist out there, but
it's a fairly large coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.
Here's the result. There were a couple of failures, but some were
due to infrastructure errors (nyan-big), and I'm not sure what
the problem was with the meson boards:

https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indicator that the shuffle config is causing
any issue on any other platform than the BeagleBone Black.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.
ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
failure has something to do with it...

Guillaume, can you try this patch:
Sure, it doesn't seem to be fixing the problem though:

https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic really is about. I'll see if I can spend a bit more
time on it this week, unless there's any BeagleBone expert
available to help or if someone has another fix to try out.

Guillaume

diff --git a/mm/shuffle.c b/mm/shuffle.c
index 3ce1248..4a04aac 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -58,7 +58,8 @@ module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
* For two pages to be swapped in the shuffle, they must be free (on a
* 'free_area' lru), have the same order, and have the same migratetype.
*/
-static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order,
+ struct zone *z)
{
struct page *page;

@@ -80,6 +81,9 @@ static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
if (!PageBuddy(page))
return NULL;

+ if (!memmap_valid_within(pfn, page, z))
+ return NULL;
+
/*
* ...is the page on the same list as the page we will
* shuffle it with?
@@ -123,7 +127,7 @@ void __meminit __shuffle_zone(struct zone *z)
* page_j randomly selected in the span @zone_start_pfn to
* @spanned_pages.
*/
- page_i = shuffle_valid_page(i, order);
+ page_i = shuffle_valid_page(i, order, z);
if (!page_i)
continue;

@@ -137,7 +141,7 @@ void __meminit __shuffle_zone(struct zone *z)
j = z->zone_start_pfn +
ALIGN_DOWN(get_random_long() % z->spanned_pages,
order_pages);
- page_j = shuffle_valid_page(j, order);
+ page_j = shuffle_valid_page(j, order, z);
if (page_j && page_j != page_i)
break;
}


Dan Williams <dan.j.williams@...>
 

On Thu, Mar 7, 2019 at 1:17 AM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

On 06/03/2019 14:05, Mike Rapoport wrote:
On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also when
enabling "debug" on the kernel command line, the issue goes
away (exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that's not all the ARM boards that exist out there, but
it's a fairly large coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.
Here's the result. There were a couple of failures, but some were
due to infrastructure errors (nyan-big), and I'm not sure what
the problem was with the meson boards:

https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indicator that the shuffle config is causing
any issue on any other platform than the BeagleBone Black.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.
ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
failure has something to do with it...

Guillaume, can you try this patch:
Mike, I appreciate the help!


Sure, it doesn't seem to be fixing the problem though:

https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic really is about. I'll see if I can spend a bit more
time on it this week, unless there's any BeagleBone expert
available to help or if someone has another fix to try out.
Thanks for the help Guillaume!

I went ahead and acquired one of these boards to see if I can
debug this locally.


Kees Cook <keescook@...>
 

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:

On Thu, Mar 7, 2019 at 1:17 AM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

On 06/03/2019 14:05, Mike Rapoport wrote:
On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also when
enabling "debug" on the kernel command line, the issue goes
away (exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that's not all the ARM boards that exist out there, but
it's a fairly large coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.
Here's the result. There were a couple of failures, but some were
due to infrastructure errors (nyan-big), and I'm not sure what
the problem was with the meson boards:

https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indicator that the shuffle config is causing
any issue on any other platform than the BeagleBone Black.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.
ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
failure has something to do with it...

Guillaume, can you try this patch:
Mike, I appreciate the help!


Sure, it doesn't seem to be fixing the problem though:

https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic really is about. I'll see if I can spend a bit more
time on it this week, unless there's any BeagleBone expert
available to help or if someone has another fix to try out.
Thanks for the help Guillaume!

I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?

Thanks!

--
Kees Cook


Guenter Roeck
 

On Thu, Apr 11, 2019 at 9:19 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:

On Thu, Mar 7, 2019 at 1:17 AM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

On 06/03/2019 14:05, Mike Rapoport wrote:
On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also when
enabling "debug" on the kernel command line, the issue goes
away (exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that's not all the ARM boards that exist out there, but
it's a fairly large coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.
Here's the result. There were a couple of failures, but some were
due to infrastructure errors (nyan-big), and I'm not sure what
the problem was with the meson boards:

https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indicator that the shuffle config is causing
any issue on any other platform than the BeagleBone Black.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.
ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
failure has something to do with it...

Guillaume, can you try this patch:
Mike, I appreciate the help!


Sure, it doesn't seem to be fixing the problem though:

https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic really is about. I'll see if I can spend a bit more
time on it this week, unless there's any BeagleBone expert
available to help or if someone has another fix to try out.
Thanks for the help Guillaume!

I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?
Can someone send me a pointer to the series in question? I would like
to run it through my testbed.

Thanks,
Guenter

Thanks!

--
Kees Cook



Kees Cook <keescook@...>
 

On Thu, Apr 11, 2019 at 9:42 AM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 9:19 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:
I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?
Can someone send me a pointer to the series in question? I would like
to run it through my testbed.
It's already in -mm and linux-next ("mm: shuffle initial free memory
to improve memory-side-cache utilization") but it gets enabled with
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y (which was briefly made the default in
-mm, which triggered problems on ARM, and was then reverted).

--
Kees Cook


Guenter Roeck
 

On Thu, Apr 11, 2019 at 10:35 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Apr 11, 2019 at 9:42 AM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 9:19 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:
I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?
Can someone send me a pointer to the series in question? I would like
to run it through my testbed.
It's already in -mm and linux-next ("mm: shuffle initial free memory
to improve memory-side-cache utilization") but it gets enabled with
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y (which was briefly made the default in
-mm, which triggered problems on ARM, and was then reverted).
Boot tests report

Qemu test results:
total: 345 pass: 345 fail: 0

This is on top of next-20190410 with CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
and the known crashes fixed.

$ git log --oneline next-20190410..
3367c36ce744 Set SHUFFLE_PAGE_ALLOCATOR=y for testing.
d2aee8b3cd5d Revert "crypto: scompress - Use per-CPU struct instead
multiple variables"
4bc9f5bc9a84 Fix: rhashtable: use bit_spin_locks to protect hash bucket.

Boot tests on arm are:

Building arm:versatilepb:versatile_defconfig:aeabi:pci:scsi:mem128:versatile-pb:rootfs
... running ........ passed
Building arm:versatilepb:versatile_defconfig:aeabi:pci:mem128:versatile-pb:initrd
... running ........ passed
Building arm:versatileab:versatile_defconfig:mem128:versatile-ab:initrd
... running ........ passed
Building arm:imx25-pdk:imx_v4_v5_defconfig:nonand:mem128:imx25-pdk:initrd
... running ........ passed
Building arm:kzm:imx_v6_v7_defconfig:nodrm:mem128:initrd ... running
.......... passed
Building arm:mcimx6ul-evk:imx_v6_v7_defconfig:nodrm:mem256:imx6ul-14x14-evk:initrd
... running .......... passed
Building arm:mcimx6ul-evk:imx_v6_v7_defconfig:nodrm:sd:mem256:imx6ul-14x14-evk:rootfs
... running .......... passed
Building arm:vexpress-a9:multi_v7_defconfig:nolocktests:mem128:vexpress-v2p-ca9:initrd
... running ........ passed
Building arm:vexpress-a9:multi_v7_defconfig:nolocktests:sd:mem128:vexpress-v2p-ca9:rootfs
... running ........ passed
Building arm:vexpress-a9:multi_v7_defconfig:nolocktests:virtio-blk:mem128:vexpress-v2p-ca9:rootfs
... running ........ passed
Building arm:vexpress-a15:multi_v7_defconfig:nolocktests:sd:mem128:vexpress-v2p-ca15-tc1:rootfs
... running ........ passed
Building arm:vexpress-a15-a7:multi_v7_defconfig:nolocktests:sd:mem256:vexpress-v2p-ca15_a7:rootfs
... running ........ passed
Building arm:beagle:multi_v7_defconfig:sd:mem256:omap3-beagle:rootfs
... running ............ passed
Building arm:beaglexm:multi_v7_defconfig:sd:mem512:omap3-beagle-xm:rootfs
... running ........... passed
Building arm:overo:multi_v7_defconfig:sd:mem256:omap3-overo-tobi:rootfs
... running ........... passed
Building arm:midway:multi_v7_defconfig:mem2G:ecx-2000:initrd ...
running .......... passed
Building arm:sabrelite:multi_v7_defconfig:mem256:imx6dl-sabrelite:initrd
... running ............ passed
Building arm:mcimx7d-sabre:multi_v7_defconfig:mem256:imx7d-sdb:initrd
... running .......... passed
Building arm:xilinx-zynq-a9:multi_v7_defconfig:mem128:zynq-zc702:initrd
... running ............ passed
Building arm:xilinx-zynq-a9:multi_v7_defconfig:sd:mem128:zynq-zc702:rootfs
... running ............ passed
Building arm:xilinx-zynq-a9:multi_v7_defconfig:sd:mem128:zynq-zc706:rootfs
... running ............ passed
Building arm:xilinx-zynq-a9:multi_v7_defconfig:sd:mem128:zynq-zed:rootfs
... running ........... passed
Building arm:cubieboard:multi_v7_defconfig:mem128:sun4i-a10-cubieboard:initrd
... running ........... passed
Building arm:raspi2:multi_v7_defconfig:bcm2836-rpi-2-b:initrd ...
running .......... passed
Building arm:raspi2:multi_v7_defconfig:sd:bcm2836-rpi-2-b:rootfs ...
running .......... passed
Building arm:virt:multi_v7_defconfig:virtio-blk:mem512:rootfs ...
running ......... passed
Building arm:smdkc210:exynos_defconfig:cpuidle:nocrypto:mem128:exynos4210-smdkv310:initrd
... running ......... passed
Building arm:realview-pb-a8:realview_defconfig:realview_pb:mem512:arm-realview-pba8:initrd
... running ........ passed
Building arm:realview-pbx-a9:realview_defconfig:realview_pb:arm-realview-pbx-a9:initrd
... running ........ passed
Building arm:realview-eb:realview_defconfig:realview_eb:mem512:arm-realview-eb:initrd
... running ........ passed
Building arm:realview-eb-mpcore:realview_defconfig:realview_eb:mem512:arm-realview-eb-11mp-ctrevb:initrd
... running ......... passed
Building arm:akita:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:borzoi:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:mainstone:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:spitz:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:terrier:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:tosa:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:z2:pxa_defconfig:nofdt:nodebug:notests:novirt:nousb:noscsi:initrd
... running ..... passed
Building arm:collie:collie_defconfig:aeabi:notests:initrd ... running
..... passed
Building arm:integratorcp:integrator_defconfig:mem128:integratorcp:initrd
... running ....... passed
Building arm:palmetto-bmc:aspeed_g4_defconfig:aspeed-bmc-opp-palmetto:initrd
... running ................. passed
Building arm:witherspoon-bmc:aspeed_g5_defconfig:notests:aspeed-bmc-opp-witherspoon:initrd
... running ........... passed
Building arm:ast2500-evb:aspeed_g5_defconfig:notests:aspeed-ast2500-evb:initrd
... running ................ passed
Building arm:romulus-bmc:aspeed_g5_defconfig:notests:aspeed-bmc-opp-romulus:initrd
... running ......................... passed
Building arm:mps2-an385:mps2_defconfig:mps2-an385:initrd ... running
...... passed

Guenter


Guenter Roeck
 

On Thu, Apr 11, 2019 at 1:22 PM Dan Williams <dan.j.williams@intel.com> wrote:

On Thu, Apr 11, 2019 at 1:08 PM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 10:35 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Apr 11, 2019 at 9:42 AM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 9:19 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:
I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?
Can someone send me a pointer to the series in question? I would like
to run it through my testbed.
It's already in -mm and linux-next ("mm: shuffle initial free memory
to improve memory-side-cache utilization") but it gets enabled with
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y (which was briefly made the default in
-mm, which triggered problems on ARM, and was then reverted).
Boot tests report

Qemu test results:
total: 345 pass: 345 fail: 0

This is on top of next-20190410 with CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
and the known crashes fixed.
In addition to CONFIG_SHUFFLE_PAGE_ALLOCATOR=y you also need the
kernel command line option "page_alloc.shuffle=1"

...so I doubt you are running with shuffling enabled. Another way to
double check is:

cat /sys/module/page_alloc/parameters/shuffle
Yes, you are right. Because, with it enabled, I see:

Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1
console=ttyAMA0,115200 page_alloc.shuffle=1
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303
page_alloc_shuffle+0x12c/0x1ac
static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used
before call to jump_label_init()
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted
5.1.0-rc4-next-20190410-00003-g3367c36ce744 #1
Hardware name: ARM Integrator/CP (Device Tree)
[<c0011c68>] (unwind_backtrace) from [<c000ec48>] (show_stack+0x10/0x18)
[<c000ec48>] (show_stack) from [<c07e9710>] (dump_stack+0x18/0x24)
[<c07e9710>] (dump_stack) from [<c001bb1c>] (__warn+0xe0/0x108)
[<c001bb1c>] (__warn) from [<c001bb88>] (warn_slowpath_fmt+0x44/0x6c)
[<c001bb88>] (warn_slowpath_fmt) from [<c0b0c4a8>]
(page_alloc_shuffle+0x12c/0x1ac)
[<c0b0c4a8>] (page_alloc_shuffle) from [<c0b0c550>] (shuffle_store+0x28/0x48)
[<c0b0c550>] (shuffle_store) from [<c003e6a0>] (parse_args+0x1f4/0x350)
[<c003e6a0>] (parse_args) from [<c0ac3c00>] (start_kernel+0x1c0/0x488)
[<c0ac3c00>] (start_kernel) from [<00000000>] ( (null))

I'll re-run the test, but I suspect it will drown in warnings.

Guenter


Dan Williams <dan.j.williams@...>
 

On Thu, Apr 11, 2019 at 1:08 PM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 10:35 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Apr 11, 2019 at 9:42 AM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 9:19 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:
I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?
Can someone send me a pointer to the series in question? I would like
to run it through my testbed.
It's already in -mm and linux-next ("mm: shuffle initial free memory
to improve memory-side-cache utilization") but it gets enabled with
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y (which was briefly made the default in
-mm, which triggered problems on ARM, and was then reverted).
Boot tests report

Qemu test results:
total: 345 pass: 345 fail: 0

This is on top of next-20190410 with CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
and the known crashes fixed.
In addition to CONFIG_SHUFFLE_PAGE_ALLOCATOR=y you also need the
kernel command line option "page_alloc.shuffle=1"

...so I doubt you are running with shuffling enabled. Another way to
double check is:

cat /sys/module/page_alloc/parameters/shuffle


Mike Rapoport <rppt@...>
 

On Thu, Apr 11, 2019 at 01:08:15PM -0700, Guenter Roeck wrote:
On Thu, Apr 11, 2019 at 10:35 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Apr 11, 2019 at 9:42 AM Guenter Roeck <groeck@google.com> wrote:

On Thu, Apr 11, 2019 at 9:19 AM Kees Cook <keescook@chromium.org> wrote:

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:
I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?
Can someone send me a pointer to the series in question? I would like
to run it through my testbed.
It's already in -mm and linux-next ("mm: shuffle initial free memory
to improve memory-side-cache utilization") but it gets enabled with
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y (which was briefly made the default in
-mm, which triggered problems on ARM, and was then reverted).
Boot tests report

Qemu test results:
total: 345 pass: 345 fail: 0

This is on top of next-20190410 with CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
and the known crashes fixed.

$ git log --oneline next-20190410..
3367c36ce744 Set SHUFFLE_PAGE_ALLOCATOR=y for testing.
d2aee8b3cd5d Revert "crypto: scompress - Use per-CPU struct instead
multiple variables"
4bc9f5bc9a84 Fix: rhashtable: use bit_spin_locks to protect hash bucket.

Boot tests on arm are:

Building arm:versatilepb:versatile_defconfig:aeabi:pci:scsi:mem128:versatile-pb:rootfs
... running ........ passed
Building arm:versatilepb:versatile_defconfig:aeabi:pci:mem128:versatile-pb:initrd
... running ........ passed
...

Building arm:witherspoon-bmc:aspeed_g5_defconfig:notests:aspeed-bmc-opp-witherspoon:initrd
... running ........... passed
Building arm:ast2500-evb:aspeed_g5_defconfig:notests:aspeed-ast2500-evb:initrd
... running ................ passed
Building arm:romulus-bmc:aspeed_g5_defconfig:notests:aspeed-bmc-opp-romulus:initrd
... running ......................... passed
Building arm:mps2-an385:mps2_defconfig:mps2-an385:initrd ... running
...... passed
The issue was with an omap2 board and, AFAIK, qemu does not simulate those.

--
Sincerely yours,
Mike.


Dan Williams <dan.j.williams@...>
 

On Thu, Apr 11, 2019 at 1:54 PM Guenter Roeck <groeck@google.com> wrote:
[..]
Boot tests report

Qemu test results:
total: 345 pass: 345 fail: 0

This is on top of next-20190410 with CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
and the known crashes fixed.
In addition to CONFIG_SHUFFLE_PAGE_ALLOCATOR=y you also need the
kernel command line option "page_alloc.shuffle=1"

...so I doubt you are running with shuffling enabled. Another way to
double check is:

cat /sys/module/page_alloc/parameters/shuffle
Yes, you are right. Because, with it enabled, I see:

Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1
console=ttyAMA0,115200 page_alloc.shuffle=1
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303
page_alloc_shuffle+0x12c/0x1ac
static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used
before call to jump_label_init()
This looks to be specific to ARM never having had to deal with
DEFINE_STATIC_KEY_TRUE in the past.

I am able to avoid this warning by simply not enabling JUMP_LABEL
support in my build.

Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted
5.1.0-rc4-next-20190410-00003-g3367c36ce744 #1
Hardware name: ARM Integrator/CP (Device Tree)
[<c0011c68>] (unwind_backtrace) from [<c000ec48>] (show_stack+0x10/0x18)
[<c000ec48>] (show_stack) from [<c07e9710>] (dump_stack+0x18/0x24)
[<c07e9710>] (dump_stack) from [<c001bb1c>] (__warn+0xe0/0x108)
[<c001bb1c>] (__warn) from [<c001bb88>] (warn_slowpath_fmt+0x44/0x6c)
[<c001bb88>] (warn_slowpath_fmt) from [<c0b0c4a8>]
(page_alloc_shuffle+0x12c/0x1ac)
[<c0b0c4a8>] (page_alloc_shuffle) from [<c0b0c550>] (shuffle_store+0x28/0x48)
[<c0b0c550>] (shuffle_store) from [<c003e6a0>] (parse_args+0x1f4/0x350)
[<c003e6a0>] (parse_args) from [<c0ac3c00>] (start_kernel+0x1c0/0x488)
[<c0ac3c00>] (start_kernel) from [<00000000>] ( (null))

I'll re-run the test, but I suspect it will drown in warnings.
I slogged through getting a Beagle Bone Black up and running with a
Yocto build and it is not failing. I have tried applying the patches on
top of v5.1-rc5 as well as re-testing the next-20190215 label, with no
reproduction. The shuffle appears to avoid anything sensitive by
default; below are the shuffle actions that were taken, relative to
iomem. Can someone with a failure reproduction please send me more
details about their configuration? It would also help to get a failing
boot log with the pr_debug() statements in mm/shuffle.c enabled to see
if the failure is correlated with any unexpected shuffle actions.

80000000-9fffffff : System RAM
80008000-809fffff : Kernel code
80b00000-812be523 : Kernel data
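
To compare a failing boot log against the iomem ranges above, something like the following could flag any swap that touches a page outside System RAM or inside the kernel image. It is a sketch under assumptions, not part of the patch set: it assumes the values printed by __shuffle_zone are page frame numbers, that pages are 4 KiB, and it only checks the first page of each swapped block, not its full extent. The function names are made up for illustration.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages

SWAP_RE = re.compile(r"__shuffle_zone: swap: (0x[0-9a-f]+) -> (0x[0-9a-f]+)")


def parse_swaps(log: str):
    """Extract (source_pfn, target_pfn) pairs from pr_debug() output."""
    return [(int(a, 16), int(b, 16)) for a, b in SWAP_RE.findall(log)]


def pfn_range(start: int, end: int):
    """Convert an inclusive physical address range to an inclusive pfn range."""
    return start >> PAGE_SHIFT, end >> PAGE_SHIFT


def suspicious_swaps(swaps, ram, reserved):
    """Return swaps whose first page is outside RAM or inside a reserved range."""
    ram_lo, ram_hi = pfn_range(*ram)
    reserved_pfns = [pfn_range(*r) for r in reserved]
    bad = []
    for src, dst in swaps:
        for pfn in (src, dst):
            outside_ram = not (ram_lo <= pfn <= ram_hi)
            in_reserved = any(lo <= pfn <= hi for lo, hi in reserved_pfns)
            if outside_ram or in_reserved:
                bad.append((src, dst))
                break  # report each swap at most once
    return bad


# Ranges copied from the iomem listing above.
RAM = (0x80000000, 0x9FFFFFFF)
RESERVED = [(0x80008000, 0x809FFFFF), (0x80B00000, 0x812BE523)]

SAMPLE = """
[ 0.086469] __shuffle_zone: swap: 0x81800 -> 0x99800
[ 0.086558] __shuffle_zone: swap: 0x82000 -> 0x88800
"""

if __name__ == "__main__":
    print(suspicious_swaps(parse_swaps(SAMPLE), RAM, RESERVED))  # prints []
```

Under those assumptions, the two sample lines both start above the end of kernel data and inside System RAM, so nothing is flagged. A log from a failing board would be the interesting input.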

[ 0.086469] __shuffle_zone: swap: 0x81800 -> 0x99800
[ 0.086558] __shuffle_zone: swap: 0x82000 -> 0x88800
[ 0.086575] __shuffle_zone: swap: 0x82800 -> 0x89800
[ 0.086591] __shuffle_zone: swap: 0x83000 -> 0x89000
[ 0.086606] __shuffle_zone: swap: 0x83800 -> 0x8a800
[ 0.086621] __shuffle_zone: swap: 0x84000 -> 0x93800
[ 0.086636] __shuffle_zone: swap: 0x84800 -> 0x83000
[ 0.086651] __shuffle_zone: swap: 0x85000 -> 0x8f000
[ 0.086666] __shuffle_zone: swap: 0x85800 -> 0x88000
[ 0.086689] __shuffle_zone: swap: 0x86000 -> 0x84000
[ 0.086704] __shuffle_zone: swap: 0x86800 -> 0x8c800
[ 0.086719] __shuffle_zone: swap: 0x87000 -> 0x93000
[ 0.086735] __shuffle_zone: swap: 0x87800 -> 0x94000
[ 0.086751] __shuffle_zone: swap: 0x88000 -> 0x90800
[ 0.086766] __shuffle_zone: swap: 0x88800 -> 0x9d000
[ 0.086781] __shuffle_zone: swap: 0x89000 -> 0x82800
[ 0.086796] __shuffle_zone: swap: 0x89800 -> 0x95800
[ 0.086811] __shuffle_zone: swap: 0x8a000 -> 0x98000
[ 0.086826] __shuffle_zone: swap: 0x8a800 -> 0x89000
[ 0.086842] __shuffle_zone: swap: 0x8b000 -> 0x81800
[ 0.086857] __shuffle_zone: swap: 0x8b800 -> 0x88800
[ 0.086872] __shuffle_zone: swap: 0x8c000 -> 0x8a000
[ 0.086891] __shuffle_zone: swap: 0x8c800 -> 0x84800
[ 0.086906] __shuffle_zone: swap: 0x8d000 -> 0x95000
[ 0.086921] __shuffle_zone: swap: 0x8d800 -> 0x8d000
[ 0.086935] __shuffle_zone: swap: 0x8e000 -> 0x8e800
[ 0.086950] __shuffle_zone: swap: 0x8e800 -> 0x99000
[ 0.086964] __shuffle_zone: swap: 0x8f000 -> 0x8d000
[ 0.086979] __shuffle_zone: swap: 0x90000 -> 0x91000
[ 0.086994] __shuffle_zone: swap: 0x90800 -> 0x83000
[ 0.087009] __shuffle_zone: swap: 0x91000 -> 0x91800
[ 0.087025] __shuffle_zone: swap: 0x91800 -> 0x8d800
[ 0.087040] __shuffle_zone: swap: 0x92000 -> 0x86800
[ 0.087054] __shuffle_zone: swap: 0x92800 -> 0x92000
[ 0.087070] __shuffle_zone: swap: 0x93000 -> 0x91000
[ 0.087088] __shuffle_zone: swap: 0x93800 -> 0x85000
[ 0.087103] __shuffle_zone: swap: 0x94000 -> 0x8b800
[ 0.087117] __shuffle_zone: swap: 0x94800 -> 0x96000
[ 0.087132] __shuffle_zone: swap: 0x95000 -> 0x91000
[ 0.087147] __shuffle_zone: swap: 0x95800 -> 0x8e000
[ 0.087161] __shuffle_zone: swap: 0x96000 -> 0x95800
[ 0.087179] __shuffle_zone: swap: 0x96800 -> 0x8c800
[ 0.087193] __shuffle_zone: swap: 0x97000 -> 0x89000
[ 0.087208] __shuffle_zone: swap: 0x97800 -> 0x85000
[ 0.087224] __shuffle_zone: swap: 0x98000 -> 0x85000
[ 0.087239] __shuffle_zone: swap: 0x98800 -> 0x93000
[ 0.087255] __shuffle_zone: swap: 0x99000 -> 0x94800
[ 0.087269] __shuffle_zone: swap: 0x99800 -> 0x94000