libdrgn: add support for remote debugging via OpenOCD #338

pcc · 2023-07-08T04:48:30Z

OpenOCD may be used to establish a JTAG or SWD connection to a remote
debugging target. This patch adds the necessary support to drgn for
debugging a remote target using OpenOCD.

A new memory backend is added that issues memory transactions using
OpenOCD's Tcl interface. With remote debugging using direct physical
memory accesses there is no need for the kernel to have core dump
support. Instead, we use the debugging information in the vmlinux
file, together with information available in the physical memory of
the target. A new arch-specific hook is added for figuring out the
information in vmcoreinfo using the vmlinux file and the physical load
address of the kernel, which must be supplied by the user.

With this patch, drgn can also act as a debugger for microcontroller
firmware and other bare-metal programs that run without an MMU, using
the same OpenOCD memory backend used for kernel debugging.

(Since this has dependencies on my other PRs, I included their commits in this PR.)

The current C string reading implementation is inefficient, especially for low bandwidth remote targets, as it needs to do a separate segment read (including a fresh page table lookup) for each character read. A more efficient approach is to retain the page table between character reads, only discarding it when we hit the null terminator. Implement this approach by allowing segments to also specify a C string reading callback. The callback for page tables will preserve the page table iterator while reading characters. Signed-off-by: Peter Collingbourne <[email protected]>

Currently, when drgn is used to debug a running program, we assume it to be running on the local machine. However, with remote debugging, this will no longer be the case. To accommodate remote debugging, introduce a flag DRGN_PROGRAM_IS_LOCAL, and use it to decide whether to use /sys/module. Signed-off-by: Peter Collingbourne <[email protected]>

…evel The current page table walker will on average read around half of the entire page table for each level. This is inefficient, especially when debugging a remote target which may have a low bandwidth connection to the debugger. Address this by only reading one PTE per level. I've only done the aarch64 page table walker because that's all that I needed, but in principle the other page table walkers could work in a similar way. Signed-off-by: Peter Collingbourne <[email protected]>

In tag-based KASAN modes, TCR_EL1.TBI1 is enabled, which causes the top 8 bits of virtual addresses to be ignored for address translation purposes. Do the same when reading from memory. There is no harm in doing so unconditionally, as the architecture does not support >56 bit VA sizes. Signed-off-by: Peter Collingbourne <[email protected]>

OpenOCD may be used to establish a JTAG or SWD connection to a remote debugging target. This patch adds the necessary support to drgn for debugging a remote target using OpenOCD. A new memory backend is added that issues memory transactions using OpenOCD's Tcl interface. With remote debugging using direct physical memory accesses there is no need for the kernel to have core dump support. Instead, we use the debugging information in the vmlinux file, together with information available in the physical memory of the target. A new arch-specific hook is added for figuring out the information in vmcoreinfo using the vmlinux file and the physical load address of the kernel, which must be supplied by the user. With this patch, drgn can also act as a debugger for microcontroller firmware and other bare-metal programs that run without an MMU, using the same OpenOCD memory backend used for kernel debugging. Signed-off-by: Peter Collingbourne <[email protected]>

pcc · 2023-07-11T01:29:45Z

Looks like some of the RPM builders failed, but I didn't have much luck finding the actual build failure amongst the various log files under the Details link. Am I missing anything?

osandov · 2023-07-11T02:06:20Z

The logs are showing "No matching package to install: 'libkdumpfile-devel'", which is also happening on the main branch. So something appears to be busted with Fedora ELN, unrelated to your change.

pcc · 2023-10-02T19:54:46Z

Hi @osandov, do you think you will have a chance to look at this (or any of my other PRs) soon? I'm still actively using them for my kernel development work. I tried rebasing them onto the current main branch and ran into some merge conflicts, so for now I'm sticking with an older version of drgn. It looks like #312 and #315 still apply cleanly; if those go in I can try to rebase the rest.

osandov · 2023-10-02T21:55:07Z

Thanks for the ping, I'll make some time to look at your PRs this week.

@brenns10 since you've been looking at recovering the vmcoreinfo from core dumps lacking virtual address information, would you mind taking a look at that part of the top commit of this PR and sharing how that compares to what you've been doing?

brenns10 · 2023-10-02T22:38:37Z

Yeah definitely, I'll take a look -- probably tomorrow morning.

brenns10

I only reviewed the top commit as mentioned, though I took a brief look at the whole series to get the context as best I could. I'd be happy to go through more code and have more of a reviewing angle -- here I was discussing more architecture rather than nitty gritty code review.

I think the approach here for physical page table base addresses is much better than my hacks for handling missing data. I think it would be best if we could try to avoid fiddling with the vmcoreinfo structure and lying about being a Linux kernel, but that could be something we work on down the line.

Incidentally @pcc what sort of test setup are you using? I have a Raspberry Pi 4B that I've (somewhat) used with OpenOCD and some bare metal programs before. I might try to give it a go on that, once I dust everything off :)

brenns10 · 2023-10-03T16:52:38Z

libdrgn/arch_aarch64.c

+static struct drgn_error *
+linux_kernel_init_vmcoreinfo_from_phys_aarch64(struct drgn_program *prog,
+					       Elf *vmlinux, GElf_Ehdr *ehdr,
+					       uint64_t text_pa)
+{


This is a clever way of getting the important information into Drgn without needing to have a more architectural shift to understand "bare-metal" targets explicitly. But the downside is that faking vmcoreinfo data is a time-honored tradition in Linux core dump debugging which has been the source of several bugs (e.g. makedumpfile and azure-linux-utils both do it, badly). That said, I don't have a specific issue I'm worried about in this case -- especially given that this hook is only used by the aarch64 OpenOCD backend (for now).

I guess the thing that would make me more comfortable is not tying this directly to the "vmcoreinfo" concept, which is Linux-specific. If instead we could have a way of representing that this is a bare metal target (i.e. one where we need to handle all virtual address translations ourselves, and need to bootstrap those translations with physical address information), then we can separate this from the linux kernel / vmcoreinfo specific code.

That way, in the future we could still have a target which is both bare-metal and Linux kernel, where we can find the vmcoreinfo via the equivalent of prog["vmcoreinfo_data"] so we can get at the rest of that data without clobbering the page table data which is set here.

That said, I don't know if it's worth blocking on these concerns or if we should try to update the architecture as it becomes an issue.

Yes, I think it would be worth trying to remove dependencies on Linux-isms such as vmcoreinfo as drgn starts to become useful for debugging bare-metal programs other than the Linux kernel. I personally don't think it needs to be done immediately though.

brenns10 · 2023-10-03T17:04:44Z

drgn/cli.py

+              print("error: --openocd-tap required when using OpenOCD", file=sys.stderr)
+              sys.exit(1)


Tiny suggestion, Python allows you to shorten this:

sys.exit("error: --openocd-tap required when using OpenOCD")

Non-int args are printed to stderr and an error code of 1 is returned.

Ack, I'll take care of that when rebasing.

brenns10 · 2023-10-03T17:20:54Z

libdrgn/arch_aarch64.c

@@ -408,7 +409,7 @@ linux_kernel_pgtable_iterator_next_aarch64(struct drgn_program *prog,
 	int level;
 	uint16_t num_entries = it->last_level_num_entries;
 	uint64_t table = it->it.pgtable;
-	bool table_physical = false;
+	bool table_physical = it->it.pgtable_phys;


I haven't delved deep into the pgtable code, so I was not aware that table_physical was here for pretty much all of them and that it would be that easy to properly support a physical page table base.

I had a hack where I fooled drgn into using a physical swapper_pg_dir address. But your way is the proper way, I like it a lot!

Longer term I was thinking that we might want to try removing the pgtable_phys flag and have all page table iterator accesses go via physical memory. That should simplify the code a bit. This seems like it should also work with core dumps because the physical address can be calculated from the program headers in the core dump.

brenns10 · 2023-10-03T17:45:22Z

libdrgn/arch_aarch64.c

+static struct drgn_error *
+linux_kernel_init_vmcoreinfo_from_phys_aarch64(struct drgn_program *prog,
+					       Elf *vmlinux, GElf_Ehdr *ehdr,
+					       uint64_t text_pa)


Having text_pa makes sense for OpenOCD targets. But for other "bare metal" cases (e.g. linux core dumps without proper virtual address ranges provided), it may be that you can compute text_pa by some other data in the vmcoreinfo note. I don't know if there's one function signature that could cover all the necessary data that an arch-specific hook would need...

We may find ourselves needing to take a page out of crash's book, which allows you to use -m key=value to specify arch-specific key-value data to help it make sense of the target. This would be require more work, but it would give architectures the flexibility to handle all sorts of interesting cases.

I don't know if there's one function signature that could cover all the necessary data that an arch-specific hook would need...

To me it seems that there should be different APIs for "starting from a vmcoreinfo note" and "starting from a physical address". In each case the information discovery mechanism is very different. For example, initialization from PA doesn't even require the kernel to be configured with CONFIG_CRASH_CORE.

We may find ourselves needing to take a page out of crash's book, which allows you to use -m key=value to specify arch-specific key-value data to help it make sense of the target. This would be require more work, but it would give architectures the flexibility to handle all sorts of interesting cases.

Possibly. I suppose that another option would be to get any information we need added to the vmlinux file in some form, and then we can read it from there. The page size in the arm64 kernel header is one example. You might also have noticed changes in the code to tolerate missing information (i.e. changes to linux_kernel.c to tolerate missing nr_section_roots).

pcc · 2023-10-05T02:44:27Z

Incidentally @pcc what sort of test setup are you using? I have a Raspberry Pi 4B that I've (somewhat) used with OpenOCD and some bare metal programs before. I might try to give it a go on that, once I dust everything off :)

Unfortunately my target is proprietary. I was thinking about getting a Raspberry Pi or something so that I can verify that this all works on publicly available hardware, but if you want to give it a go I'd be happy to help.

pcc added 5 commits July 7, 2023 21:44

brenns10 reviewed Oct 3, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libdrgn: add support for remote debugging via OpenOCD #338

libdrgn: add support for remote debugging via OpenOCD #338

pcc commented Jul 8, 2023

pcc commented Jul 11, 2023

osandov commented Jul 11, 2023

pcc commented Oct 2, 2023

osandov commented Oct 2, 2023

brenns10 commented Oct 2, 2023

brenns10 left a comment

brenns10 Oct 3, 2023

pcc Oct 5, 2023

brenns10 Oct 3, 2023

pcc Oct 5, 2023

brenns10 Oct 3, 2023

pcc Oct 5, 2023

brenns10 Oct 3, 2023

pcc Oct 5, 2023

pcc commented Oct 5, 2023

		print("error: --openocd-tap required when using OpenOCD", file=sys.stderr)
		sys.exit(1)

libdrgn: add support for remote debugging via OpenOCD #338

Are you sure you want to change the base?

libdrgn: add support for remote debugging via OpenOCD #338

Conversation

pcc commented Jul 8, 2023

pcc commented Jul 11, 2023

osandov commented Jul 11, 2023

pcc commented Oct 2, 2023

osandov commented Oct 2, 2023

brenns10 commented Oct 2, 2023

brenns10 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pcc commented Oct 5, 2023