Re: next/master boot bisection: next-20190215 on beaglebone-black


Kees Cook <keescook@...>
 

On Thu, Mar 7, 2019 at 7:43 AM Dan Williams <dan.j.williams@intel.com> wrote:

On Thu, Mar 7, 2019 at 1:17 AM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

On 06/03/2019 14:05, Mike Rapoport wrote:
On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
On 01/03/2019 23:23, Dan Williams wrote:
On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:

Is there an early-printk facility that can be turned on to see how far
we get in the boot?
Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
earlyprintk in the command line. Here's the result, with the
commit cherry picked on top of next-20190304:

https://lava.collabora.co.uk/scheduler/job/1526326

[ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
[ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
[ 1.404203] pgd = (ptrval)
[ 1.406971] [77bb4003] *pgd=00000000
[ 1.410650] Internal error: Oops: 5 [#1] ARM
[...]
[ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
[ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)

It's always failing at that point in the code. Also, when
enabling "debug" on the kernel command line, the issue goes
away (with the exact same binaries, etc.):

https://lava.collabora.co.uk/scheduler/job/1526327

For the record, here's the branch I've been using:

https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug

The board otherwise boots fine with next-20190304 (SMP=n), and
also with the patch applied but the shuffle configs set to n.

Were there any boot *successes* on ARM with shuffling enabled? I.e.
clues about what's different about the specific memory setup for
beagle-bone-black.
Looking at the KernelCI results from next-20190215, it looks like
only the BeagleBone Black with SMP=n failed to boot:

https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/

Of course that's not all the ARM boards that exist out there, but
it's a fairly large coverage already.

As the kernel panic always seems to originate in ti-sysc.c,
there's a chance it's only visible on that platform... I'm doing
a KernelCI run now with my test branch to double check that,
it'll take a few hours so I'll send an update later if I get
anything useful out of it.
Here's the result. There were a couple of failures, but some were
due to infrastructure errors (nyan-big), and I'm not sure what
the problem was with the meson boards:

https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indicator that the shuffle config is causing
any issue on any other platform than the BeagleBone Black.

In the meantime, I'm happy to try out other things with more
debug configs turned on or any potential fixes someone might
have.
ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
failure has something to do with it...

Guillaume, can you try this patch:
Mike, I appreciate the help!


Sure, it doesn't seem to be fixing the problem though:

https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic really is about. I'll see if I can spend a bit more
time on it this week, unless there's any BeagleBone expert
available to help or if someone has another fix to try out.
Thanks for the help Guillaume!

I went ahead and acquired one of these boards to see if I can
debug this locally.
Hi! Any progress on this? Might it be possible to unblock this series
for v5.2 by adding a temporary "not on ARM" flag?

Thanks!

--
Kees Cook
