Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libdrgn: add support for remote debugging via OpenOCD #338

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

pcc
Copy link
Contributor

@pcc pcc commented Jul 8, 2023

OpenOCD may be used to establish a JTAG or SWD connection to a remote
debugging target. This patch adds the necessary support to drgn for
debugging a remote target using OpenOCD.

A new memory backend is added that issues memory transactions using
OpenOCD's Tcl interface. With remote debugging using direct physical
memory accesses there is no need for the kernel to have core dump
support. Instead, we use the debugging information in the vmlinux
file, together with information available in the physical memory of
the target. A new arch-specific hook is added for figuring out the
information in vmcoreinfo using the vmlinux file and the physical load
address of the kernel, which must be supplied by the user.

With this patch, drgn can also act as a debugger for microcontroller
firmware and other bare-metal programs that run without an MMU, using
the same OpenOCD memory backend used for kernel debugging.

(Since this has dependencies on my other PRs, I included their commits in this PR.)

pcc added 5 commits July 7, 2023 21:44
The current C string reading implementation is inefficient, especially for
low bandwidth remote targets, as it needs to do a separate segment read
(including a fresh page table lookup) for each character read. A more
efficient approach is to retain the page table between character reads,
only discarding it when we hit the null terminator.

Implement this approach by allowing segments to also specify a C string
reading callback. The callback for page tables will preserve the page
table iterator while reading characters.

Signed-off-by: Peter Collingbourne <[email protected]>
Currently, when drgn is used to debug a running program, we assume it to
be running on the local machine. However, with remote debugging, this will
no longer be the case. To accommodate remote debugging, introduce a flag
DRGN_PROGRAM_IS_LOCAL, and use it to decide whether to use /sys/module.

Signed-off-by: Peter Collingbourne <[email protected]>
…evel

The current page table walker will on average read around half of the
entire page table for each level. This is inefficient, especially when
debugging a remote target which may have a low bandwidth connection to
the debugger. Address this by only reading one PTE per level.

I've only done the aarch64 page table walker because that's all that I
needed, but in principle the other page table walkers could work in a
similar way.

Signed-off-by: Peter Collingbourne <[email protected]>
In tag-based KASAN modes, TCR_EL1.TBI1 is enabled, which causes the
top 8 bits of virtual addresses to be ignored for address translation
purposes. Do the same when reading from memory. There is no harm in doing
so unconditionally, as the architecture does not support >56 bit VA sizes.

Signed-off-by: Peter Collingbourne <[email protected]>
OpenOCD may be used to establish a JTAG or SWD connection to a remote
debugging target. This patch adds the necessary support to drgn for
debugging a remote target using OpenOCD.

A new memory backend is added that issues memory transactions using
OpenOCD's Tcl interface. With remote debugging using direct physical
memory accesses there is no need for the kernel to have core dump
support. Instead, we use the debugging information in the vmlinux
file, together with information available in the physical memory of
the target. A new arch-specific hook is added for figuring out the
information in vmcoreinfo using the vmlinux file and the physical load
address of the kernel, which must be supplied by the user.

With this patch, drgn can also act as a debugger for microcontroller
firmware and other bare-metal programs that run without an MMU, using
the same OpenOCD memory backend used for kernel debugging.

Signed-off-by: Peter Collingbourne <[email protected]>
@pcc
Copy link
Contributor Author

pcc commented Jul 11, 2023

Looks like some of the RPM builders failed, but I didn't have much luck finding the actual build failure amongst the various log files under the Details link. Am I missing anything?

@osandov
Copy link
Owner

osandov commented Jul 11, 2023

The logs are showing "No matching package to install: 'libkdumpfile-devel'", which is also happening on the main branch. So something appears to be busted with Fedora ELN, unrelated to your change.

@pcc
Copy link
Contributor Author

pcc commented Oct 2, 2023

Hi @osandov, do you think you will have a chance to look at this (or any of my other PRs) soon? I'm still actively using them for my kernel development work. I tried rebasing them onto the current main branch and ran into some merge conflicts, so for now I'm sticking with an older version of drgn. It looks like #312 and #315 still apply cleanly; if those go in I can try to rebase the rest.

@osandov
Copy link
Owner

osandov commented Oct 2, 2023

Thanks for the ping, I'll make some time to look at your PRs this week.

@brenns10 since you've been looking at recovering the vmcoreinfo from core dumps lacking virtual address information, would you mind taking a look at that part of the top commit of this PR and sharing how that compares to what you've been doing?

@brenns10
Copy link
Contributor

brenns10 commented Oct 2, 2023

Yeah definitely, I'll take a look -- probably tomorrow morning.

Copy link
Contributor

@brenns10 brenns10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the top commit as mentioned, though I took a brief look at the whole series to get the context as best I could. I'd be happy to go through more code and have more of a reviewing angle -- here I was discussing more architecture rather than nitty gritty code review.

I think the approach here for physical page table base addresses is much better than my hacks for handling missing data. I think it would be best if we could try to avoid fiddling with the vmcoreinfo structure and lying about being a Linux kernel, but that could be something we work on down the line.

Incidentally @pcc what sort of test setup are you using? I have a Raspberry Pi 4B that I've (somewhat) used with OpenOCD and some bare metal programs before. I might try to give it a go on that, once I dust everything off :)

Comment on lines +470 to +474
static struct drgn_error *
linux_kernel_init_vmcoreinfo_from_phys_aarch64(struct drgn_program *prog,
Elf *vmlinux, GElf_Ehdr *ehdr,
uint64_t text_pa)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever way of getting the important information into Drgn without needing to have a more architectural shift to understand "bare-metal" targets explicitly. But the downside is that faking vmcoreinfo data is a time-honored tradition in Linux core dump debugging which has been the source of several bugs (e.g. makedumpfile and azure-linux-utils both do it, badly). That said, I don't have a specific issue I'm worried about in this case -- especially given that this hook is only used by the aarch64 OpenOCD backend (for now).

I guess the thing that would make me more comfortable is not tying this directly to the "vmcoreinfo" concept, which is Linux-specific. If instead we could have a way of representing that this is a bare metal target (i.e. one where we need to handle all virtual address translations ourselves, and need to bootstrap those translations with physical address information), then we can separate this from the linux kernel / vmcoreinfo specific code.

That way, in the future we could still have a target which is both bare-metal and Linux kernel, where we can find the vmcoreinfo via the equivalent of prog["vmcoreinfo_data"] so we can get at the rest of that data without clobbering the page table data which is set here.

That said, I don't know if it's worth blocking on these concerns or if we should try to update the architecture as it becomes an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be worth trying to remove dependencies on Linux-isms such as vmcoreinfo as drgn starts to become useful for debugging bare-metal programs other than the Linux kernel. I personally don't think it needs to be done immediately though.

Comment on lines +207 to +208
print("error: --openocd-tap required when using OpenOCD", file=sys.stderr)
sys.exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny suggestion, Python allows you to shorten this:

              sys.exit("error: --openocd-tap required when using OpenOCD")

Non-int args are printed to stderr and an error code of 1 is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, I'll take care of that when rebasing.

@@ -408,7 +409,7 @@ linux_kernel_pgtable_iterator_next_aarch64(struct drgn_program *prog,
int level;
uint16_t num_entries = it->last_level_num_entries;
uint64_t table = it->it.pgtable;
bool table_physical = false;
bool table_physical = it->it.pgtable_phys;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't delved deep into the pgtable code, so I was not aware that table_physical was here for pretty much all of them and that it would be that easy to properly support a physical page table base.

I had a hack where I fooled drgn into using a physical swapper_pg_dir address. But your way is the proper way, I like it a lot!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer term I was thinking that we might want to try removing the pgtable_phys flag and have all page table iterator accesses go via physical memory. That should simplify the code a bit. This seems like it should also work with core dumps because the physical address can be calculated from the program headers in the core dump.

static struct drgn_error *
linux_kernel_init_vmcoreinfo_from_phys_aarch64(struct drgn_program *prog,
Elf *vmlinux, GElf_Ehdr *ehdr,
uint64_t text_pa)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having text_pa makes sense for OpenOCD targets. But for other "bare metal" cases (e.g. linux core dumps without proper virtual address ranges provided), it may be that you can compute text_pa by some other data in the vmcoreinfo note. I don't know if there's one function signature that could cover all the necessary data that an arch-specific hook would need...

We may find ourselves needing to take a page out of crash's book, which allows you to use -m key=value to specify arch-specific key-value data to help it make sense of the target. This would be require more work, but it would give architectures the flexibility to handle all sorts of interesting cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there's one function signature that could cover all the necessary data that an arch-specific hook would need...

To me it seems that there should be different APIs for "starting from a vmcoreinfo note" and "starting from a physical address". In each case the information discovery mechanism is very different. For example, initialization from PA doesn't even require the kernel to be configured with CONFIG_CRASH_CORE.

We may find ourselves needing to take a page out of crash's book, which allows you to use -m key=value to specify arch-specific key-value data to help it make sense of the target. This would be require more work, but it would give architectures the flexibility to handle all sorts of interesting cases.

Possibly. I suppose that another option would be to get any information we need added to the vmlinux file in some form, and then we can read it from there. The page size in the arm64 kernel header is one example. You might also have noticed changes in the code to tolerate missing information (i.e. changes to linux_kernel.c to tolerate missing nr_section_roots).

@pcc
Copy link
Contributor Author

pcc commented Oct 5, 2023

Incidentally @pcc what sort of test setup are you using? I have a Raspberry Pi 4B that I've (somewhat) used with OpenOCD and some bare metal programs before. I might try to give it a go on that, once I dust everything off :)

Unfortunately my target is proprietary. I was thinking about getting a Raspberry Pi or something so that I can verify that this all works on publicly available hardware, but if you want to give it a go I'd be happy to help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants