Experience in running Firejail inside a Docker container? #1210

Closed
zimmski opened this issue Apr 10, 2017 · 3 comments
Labels
information_old (Deprecated; use "doc-todo" or "needinfo" instead) Information was/is required

Comments

@zimmski

zimmski commented Apr 10, 2017

We are running arbitrary user programs and obviously need to isolate them in a sandbox. That is why I am extremely interested in Firejail: running single-shot processes in QEMU seems like a good idea but is a handling and performance nightmare. I am opening this issue because I am looking for suggestions on whether Firejail is the right fit for our problems or whether something else would be better suited (I am not looking for another self-made solution), and to report some problems I encountered. I have only had a peek at some of Firejail's source code and the technologies it uses, so bear with me. If someone else is doing something similar, I would be extremely happy to read about how they are doing it.

We are trying to solve the following problems:

  • We are running some services using Docker (controlled by Kubernetes), some of these services need to execute arbitrary user programs. We really do not control and should not even care about what is inside these programs, but we can at least define that they should not do something fishy.
  • Each program must run in isolation. If a program tries to do something bad, we want that program to be instantly killed and we want to get notified somehow.
  • Isolation involves
    • Process level - The program should think it is the only program in the world.
    • File level - The program should have its own FS (and limits of the FS) but we need to access the FS data after the program died
    • Resource level - Limit CPU, Memory, IO and network
    • Basically, we do not want the program to use anything outside its isolation

I would like to see this solved inside the service Docker container, and not inside QEMU or yet another Docker container. The reason is that the FS of the service already has all the FS data, so an OverlayFS would IMHO perform quite well. It would also reduce the workflow to calling a process instead of creating an image, setting up a container/VM for it, running it, and recycling the whole thing. Additionally, it would replace the really bad connection we have between QEMU and Docker/Kubernetes with simple process calls in the same container.

It seems (or does it?) that Firejail can solve everything we need, so I tried it out. However, I am having a hard time setting this up. I encountered the following problems (if I should open a separate issue for each of them, just mumble the word):

a.) --quiet is definitely not quiet. The warnings seem OK but I think they should be hidden behind another argument if --quiet is used. Also, there is definitely non-warning output when an overlay is used.
b.) Since the Firejail process runs inside Docker, --force was needed. Why is this restriction needed at all, if it can simply be overridden?
c.) We need "ptrace" working inside Firejail, so --allow-debuggers was needed. The argument cannot be used with kernel 4.4 because of a serious bug that was fixed in 4.8; see https://lwn.net/Articles/690685/ and "seccomp reordered after ptrace" in https://outflux.net/blog/archives/2016/10/04/security-things-in-linux-v4-8/. We are using Ubuntu 16.04 and thankfully there is a kernel upgrade available using apt install --install-recommends linux-generic-hwe-16.04. This could be better documented on the Firejail side.
d.) The Firejail home for the user is read using getpwuid which forces one to make that particular directory writable. This should be changed so that the environment variable "HOME" is used (I hacked that in and can send it upstream if you like), or should be settable using an argument.
e.) The Firejail home for the user needs to have the right UID and GID. Why? Isn't it enough to have the user write and read to it?
f.) The Firejail binary saves its runtime data to "/run/firejail" with UID=root and GID=root. It would be neat if this would be documented somewhere and if this was a compile-time option.
g.) Is it possible to define a file to collect every seccomp (and other) violation?
h.) I only found --overlay-clean to clean up overlays but what if I want to clean the overlay of a specific process?
i.) I tried to get firejail running without marking the Docker container "privileged" and failed. Is it even possible, and why not?
j.) I haven't looked into this yet but why isn't it possible to use the --private* arguments with --overlay? I want to have an overlay over certain directories and files, but do a clean FS for the rest of the FS, and then look at changes after the process has died. Does anybody have a solution for this?
k.) We are using Vagrant with NFS to develop our product, and overlay does not play with NFS. Has anybody had any luck with BindFS to trick OverlayFS into working?
l.) I will iterate on my feature branch, hopefully at the end of next week, and try to get my tiny changes upstream. Is there an official configuration for a formatter?

I am sure that I forgot some things but if anyone can help with a few of these problems I would be really happy.

@Ferroin
Contributor

Ferroin commented Apr 10, 2017

I'm only going to comment on the stuff that I can reasonably answer here:

a.) --quiet is definitely not quiet. The warnings seem OK but I think they should be hidden behind another argument if --quiet is used. Also, there is definitely non-warning output when an overlay is used.

I agree, this should definitely be fixed (or perhaps a --silent option could be added to force no output, although TBH, for your usage, you may want to log somewhere so you can figure out why something got killed).

b.) Since the Firejail process runs inside Docker, --force was needed. Why is this restriction needed at all, if it can simply be overridden?

Docker is (or at least, should be) using a decent percentage of what Firejail does already (with the exception of the Seccomp-BPF stuff), because it needs to use namespaces to provide its own isolation. Nesting namespaces created by different tools can be tricky to get correct (and is prone to random failures), and I believe that's why Firejail is complaining.

c.) We need "ptrace" working inside Firejail, so --allow-debuggers was needed. The argument cannot be used with kernel 4.4 because of a serious bug that was fixed in 4.8; see https://lwn.net/Articles/690685/ and "seccomp reordered after ptrace" in https://outflux.net/blog/archives/2016/10/04/security-things-in-linux-v4-8/. We are using Ubuntu 16.04 and thankfully there is a kernel upgrade available using apt install --install-recommends linux-generic-hwe-16.04. This could be better documented on the Firejail side.

Agreed, this should be better documented.
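
As a rough illustration of the kernel requirement discussed in (c), a wrapper could check the kernel release string before passing --allow-debuggers. This is only a sketch: the helper names are made up for this example, and comparing version numbers is a heuristic, since distribution kernels may carry backported fixes.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.4.0-72-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % (release,))
    return int(match.group(1)), int(match.group(2))


def seccomp_runs_after_ptrace(release):
    """Return True for kernels >= 4.8, where seccomp was reordered after ptrace."""
    return parse_kernel_release(release) >= (4, 8)
```

On a live system the release string can be obtained with platform.release() from the standard library and passed to seccomp_runs_after_ptrace().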

d.) The Firejail home for the user is read using getpwuid which forces one to make that particular directory writable. This should be changed so that the environment variable "HOME" is used (I hacked that in and can send it upstream if you like), or should be settable using an argument.

Ideally this should be a switch. The current behavior is significantly safer than using the value of $HOME (the user has control over the value of $HOME, but not over what /etc/passwd says their home directory is), but is, as you found out, not always viable.
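
To make the trade-off concrete, here is a minimal sketch of such a switch. It is not Firejail's code; resolve_home and its trust_env parameter are hypothetical names used only for illustration. By default it uses the passwd database (the equivalent of getpwuid), and only consults $HOME when explicitly told to.

```python
import os
import pwd


def resolve_home(uid, environ=None, trust_env=False):
    """Pick a home directory for uid.

    trust_env=False: use the passwd database, which the user cannot change.
    trust_env=True:  prefer $HOME, which the user fully controls, and fall
                     back to the passwd entry if $HOME is unset or empty.
    """
    if environ is None:
        environ = os.environ
    if trust_env:
        home = environ.get("HOME")
        if home:
            return home
    return pwd.getpwuid(uid).pw_dir
```

With trust_env=True, a caller who sets HOME to an arbitrary path gets that path back unchanged, which is exactly the risk described above.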

e.) The Firejail home for the user needs to have the right UID and GID. Why? Isn't it enough to have the user write and read to it?

I'm not 100% certain about this, but I will comment that there are quite a few applications (not even just graphical ones) which expect to at least have read access to most of $HOME, and quite often want (but don't need) write access to somewhere under there.

i.) I tried to get firejail running without marking the Docker container "privileged" and failed. Is it even possible, and why not?

Probably not without at least a patched version of Docker, and possibly a patched kernel. The privileged setting in Docker controls access to various POSIX capabilities (CAP_SYS_ADMIN is the big one, but I'm pretty sure it at least includes CAP_NET_ADMIN and CAP_NET_RAW too). Firejail actually needs at least some of the stuff that Docker blocks to be able to set up namespaces and Seccomp-BPF properly.
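
One way to see which of these capabilities a process inside the container actually holds is to decode the CapEff line of /proc/self/status. The sketch below assumes the standard Linux capability numbering (CAP_NET_ADMIN = 12, CAP_NET_RAW = 13, CAP_SYS_ADMIN = 21); the function names are made up for this example, and it says nothing about which capabilities Firejail itself requires.

```python
# Bit positions from the Linux capability numbering (linux/capability.h).
CAPABILITY_BITS = {
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def parse_cap_eff(status_text):
    """Return the effective capability mask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def held_capabilities(mask):
    """Map each capability of interest to whether its bit is set in mask."""
    return {name: bool((mask >> bit) & 1) for name, bit in CAPABILITY_BITS.items()}


def current_capabilities(path="/proc/self/status"):
    """Decode the capabilities of the calling process (Linux only)."""
    with open(path) as handle:
        return held_capabilities(parse_cap_eff(handle.read()))
```

Calling current_capabilities() inside the container, once with and once without the privileged setting, shows the difference directly.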

k.) We are using Vagrant with NFS to develop our product, and overlay does not play with NFS. Has anybody had any luck with BindFS to trick OverlayFS into working?

While I have not tried this myself, based on my (limited) knowledge of the Linux VFS layer, I doubt it will work. As far as I understand it, a bind mount is kind of like a hardlink: you can change the properties of that particular link, but there are certain things that can't be changed. The operations that OverlayFS needs, and that the kernel's NFS client lacks implementations for, are among those things. You may have slightly more luck hacking something together with FUSE (or possibly using a userspace NFS client), but I doubt you'll be able to get OverlayFS to work on NFS without patching the kernel.

@Ferroin
Contributor

Ferroin commented Apr 10, 2017

Oh, also, sorry about double posting, but I just noticed your comment about requiring resource level isolation for this. In short, to do that, you'll need something else working together with firejail, as firejail doesn't (currently) do any resource-level isolation.

@netblue30
Owner

netblue30 commented Apr 12, 2017

We are running arbitrary user programs and obviously need to isolate them in a sandbox.

The current security technology in the Linux kernel is access-control technology: netfilter, SELinux (mandatory access controls), PID namespaces (no access to the system PID namespace), etc. It works fine if the bad guy is outside and tries to bring in his exploit code and take over the system.

If you bring untrusted user code in your system yourself, it will be more difficult for the kernel to deal with it. In this situation I would say your best approach would be:

  • run a real virtual machine (Xen, KVM, VirtualBox)
  • build the VM using a solid distro such as Debian stable/oldstable, CentOS, or Gentoo hardened (no, you don't want anything new and flashy at this point)
  • maybe install a Grsecurity kernel in the VM
  • run your program directly in a very conservative Firejail setting - seccomp and namespaces are your best friends.
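
What a "very conservative" setting looks like depends on the workload. As a sketch under that caveat, the helper below assembles a firejail command line from options documented in the firejail man page (--seccomp, --caps.drop=all, --nonewprivs, --net=none, --private); the function name and the choice of options are assumptions for this example, not a recommendation from the maintainers. It deliberately never adds --allow-debuggers.

```python
def build_firejail_command(program_argv, private_dir=None):
    """Return an argv list that runs program_argv under a restrictive Firejail.

    private_dir: directory to use as the sandboxed home (--private=DIR);
                 if None, Firejail creates a temporary home (--private).
    """
    if not program_argv:
        raise ValueError("program_argv must not be empty")
    command = [
        "firejail",
        "--quiet",
        "--seccomp",        # default seccomp filter
        "--caps.drop=all",  # drop all capabilities
        "--nonewprivs",     # set NO_NEW_PRIVS for the sandboxed process
        "--net=none",       # no network access
    ]
    if private_dir is None:
        command.append("--private")
    else:
        command.append("--private=" + private_dir)
    command.append("--")    # end of Firejail options
    command.extend(program_argv)
    return command
```

The resulting list can be handed to subprocess.run() as-is; keeping the construction separate from the execution makes it easy to log exactly what was run for each user program.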

Things to stay away from:

  • ptrace syscall - this is the most problematic syscall ever invented. In kernels prior to 4.8, ptrace allows the bad guys to bypass seccomp. It is also used to attach exploits to running programs. Also, make sure you don't have gdb and strace installed in the VM.
  • OverlayFS - very new kernel technology, bugs are still coming in.
  • FUSE
  • Docker - huge attack surface
  • NFS - I don't think anybody ever tested Firejail on NFS filesystems, who knows what's working and what's not working!

Good luck!

@netblue30 netblue30 added the information_old (Deprecated; use "doc-todo" or "needinfo" instead) Information was/is required label Apr 12, 2017
3 participants