Linux packages not working for centos (aarch64 / arm64v8) #4270

Closed
noly opened this issue Nov 2, 2021 · 35 comments · Fixed by #5020, fluent/fluent-bit-packaging#29 or #5027

@noly

noly commented Nov 2, 2021

Cloned from #4007

Bug Report

Describe the bug
I'm getting an error while running Fluent Bit on CentOS 8 on ARM (aarch64).

To Reproduce

Expected behavior
Fluent-bit works.

Screenshots
Execution error:

/opt/td-agent-bit/bin/td-agent-bit -i cpu -t my_cpu -o stdout -m '*'
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
FATAL: error reading `/proc/sys/crypto/fips_enabled' in libgcrypt: Cannot allocate memory
<jemalloc>: Unsupported system page size
Aborted (core dumped)

Screen Shot 2021-11-02 at 19 37 50

Service start up error:

$ service td-agent-bit status

td-agent-bit.service - TD Agent Bit
   Loaded: loaded (/usr/lib/systemd/system/td-agent-bit.service; disabled; vendor preset: disabled)
   Active: failed (Result: core-dump) since Tue 2021-11-02 18:21:44 UTC; 3s ago
  Process: 10989 ExecStart=/opt/td-agent-bit/bin/td-agent-bit -c /etc/td-agent-bit/td-agent-bit.conf (code=dumped, signal=ABRT)
 Main PID: 10989 (code=dumped, signal=ABRT)

Nov 02 18:21:44: td-agent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Nov 02 18:21:44: td-agent-bit.service: Scheduled restart job, restart counter is at 5.
Nov 02 18:21:44: Stopped TD Agent Bit.
Nov 02 18:21:44: td-agent-bit.service: Start request repeated too quickly.
Nov 02 18:21:44: td-agent-bit.service: Failed with result 'core-dump'.
Nov 02 18:21:44: Failed to start TD Agent Bit.

Your Environment
Version used: fluent-bit 1.8.9
Configuration: default/standard
Environment: AWS Linux Centos 8

@edsiper
Member

edsiper commented Nov 2, 2021

Are you able to reproduce the issue with CentOS 7? (We don't ship packages for CentOS 8.)

@noly
Author

noly commented Nov 3, 2021

Hi @edsiper. Yes, I was able to reproduce the issue on the AWS AMI for CentOS 7.9.2009 aarch64.

Screen Shot 2021-11-03 at 10 39 13

Even though you don't ship packages for CentOS 8, it's a supported platform.

@github-actions
Contributor

github-actions bot commented Dec 4, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 4, 2021
@noly
Author

noly commented Dec 7, 2021

Commenting to avoid stale

@github-actions github-actions bot removed the Stale label Dec 8, 2021
@patrick-stephens patrick-stephens self-assigned this Dec 20, 2021
@patrick-stephens
Contributor

I'll pick this up and try to include it in the staging test updates.

@patrick-stephens
Contributor

@noly to keep you in the loop, the staging test workflow is now in place, so I'll start extending the tests to see if I can replicate this, although probably after the holidays now.

@ANBUZHIDAO

I also hit this problem when using the arm64 image in k8s.
Please see jemalloc/jemalloc#467.

I think this is because jemalloc is compiled with a 4k page size, so if your arm64 machine's page size is 64k it will not work.
getconf PAGESIZE reports the page size.

I changed the Dockerfile to not use jemalloc and built a new image myself, and it then worked on my arm64 machine, whose page size is 65536.
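
As a rough way to check a host against the explanation above, the sketch below compares the system page size with the page size a jemalloc build assumes. The function name page_size_supported and the value 12 (2^12 = 4096, the 4k build discussed here) are illustrative assumptions. It relies on jemalloc's documented requirement that the allocator page size be at least as large as the system page size.

```shell
#!/bin/sh
# Succeed when an allocator built for 2^lg_page bytes can run on a system
# whose page size is runtime_page_size bytes (allocator page >= system page).
page_size_supported() {
    lg_page="$1"
    runtime_page_size="$2"
    allocator_page_size=$((1 << lg_page))
    [ "$runtime_page_size" -le "$allocator_page_size" ]
}

# Example: a build that assumes 4k pages (lg_page=12), checked on this host.
if command -v getconf >/dev/null 2>&1; then
    if page_size_supported 12 "$(getconf PAGESIZE)"; then
        echo "a 4k-page jemalloc build should start on this host"
    else
        echo "a 4k-page jemalloc build is expected to fail on this host"
    fi
fi
```

On a host reporting 65536, the second message is printed, which matches the "Unsupported system page size" failure reported in this issue.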

@patrick-stephens
Contributor

I'll admit I'm not an expert, but looking at that issue it appears there is no general solution to it - you have to compile for your page size. Do you have a suggestion on what to do for Fluent Bit containers, @ANBUZHIDAO, or was this just for the information of people who hit the issue? We can/should document it as well.

@patrick-stephens
Contributor

I've not forgotten... @noly did you see the page size comment above? Is it relevant to your setup?

@noly
Author

noly commented Feb 1, 2022

Hi @patrick-stephens, it is relevant indeed. Fedora distros use 64k as the page size:

Screen Shot 2022-02-01 at 15 18 22

Are you planning to release a fixed arm64 binary with the correct page size?

@patrick-stephens
Contributor

If that is the default for Fedora we probably should - do you know if it is always the default, and whether that also holds for the downstream Red Hat distros? I'd rather not have to make another distro-specific artefact, so it would be good if we can share them.

The packages are all compiled via containerised builds currently: https://github.com/fluent/fluent-bit/blob/master/packaging/distros/centos/Dockerfile

My other concern is testing: verifying this is tricky in CI, I think, but I do need to ramp up the ARM tests, specifically in this area.

A PR would really help if you get a chance so ping me if you do and I'll push for it.

@icereed

icereed commented Mar 7, 2022

@ANBUZHIDAO, I would also be very interested in your ARM build.

I'm facing the same error using the Docker image in an Oracle Linux VM with Oracle Kubernetes Engine using ARM instances.
Inside any container the page size is reported to be 65536:

bash-5.1$ getconf PAGESIZE
65536

I would be fine to have a binary without jemalloc for the time being.

@cosmo0920
Contributor

cosmo0920 commented Mar 7, 2022

We should use a jemalloc configure option to specify a 65536-byte page size on CentOS: --with-lg-page=16 (2^16 = 65536)
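
To make the arithmetic behind that option concrete, here is a small sketch that derives the --with-lg-page value from a page size. The helper name lg_page_for is hypothetical, and it assumes the input is a power of two (other values are rounded down).

```shell
#!/bin/sh
# Print the base-2 logarithm of a power-of-two page size, i.e. the number
# passed to jemalloc's --with-lg-page configure option.
lg_page_for() {
    size="$1"
    lg=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        lg=$((lg + 1))
    done
    echo "$lg"
}

lg_page_for 4096    # prints 12
lg_page_for 65536   # prints 16
```

On a target machine, lg_page_for "$(getconf PAGESIZE)" would give the value matching that host.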

@patrick-stephens
Contributor

Seems a good shout so if anyone can submit a PR just ping me if it needs approval.

@ANBUZHIDAO

I just changed -DFLB_JEMALLOC=On to -DFLB_JEMALLOC=Off in dockerfiles/Dockerfile.
Maybe this gives worse performance, but it works and it's enough for me.
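
A minimal sketch of that substitution, for anyone who wants to script the workaround. The helper name disable_jemalloc and the sample cmake line are hypothetical; the real Dockerfile contents are not shown in this thread, so check the actual line before applying anything like this.

```shell
#!/bin/sh
# Rewrite the jemalloc switch from On to Off in text read from stdin.
disable_jemalloc() {
    sed 's/-DFLB_JEMALLOC=On/-DFLB_JEMALLOC=Off/g'
}

# Hypothetical input line, used only to show the effect of the filter.
printf '%s\n' 'cmake -DFLB_JEMALLOC=On ..' | disable_jemalloc
```

Running the filter over a copy of the Dockerfile and reviewing the diff first avoids silently changing unrelated lines.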

@patrick-stephens
Contributor

So it does sound like, for CentOS 7 ARM derivatives, the default page size determined during the build is incorrect. This may be down to using QEMU on Ubuntu hosts, as the build is run with GitHub Actions.

pypa/manylinux#735

We need to ensure therefore it is built with the right settings on the CentOS containers.

@patrick-stephens
Contributor

patrick-stephens commented Mar 8, 2022

@noly just to keep you in the loop, we've identified a fix for this but I also want to add some verification to prevent it happening again so may take a bit of effort to sort.

The current smoke tests for packaging run in containers, which unfortunately won't detect this problem:

$ docker run --rm -it --platform=linux/arm64 centos:7 getconf PAGESIZE
4096

The same seems to be true for Vagrant, presumably because that would also be reliant on QEMU.

I'm going to have to auto-provision an ARM instance for the target OS in CI and then use that - we wanted to extend the testing anyway so an opportunity to do so for this issue now.

@noly
Author

noly commented Mar 9, 2022

@patrick-stephens sounds great!! Thanks for taking care of this issue.

@patrick-stephens
Contributor

patrick-stephens commented Mar 9, 2022

@noly Any chance you can just test the change to confirm on master? CI testing is still WIP so if you have an actual target you can run on that would be ace to confirm.

I'm just pushing a change to allow me to build a specific target for master as that will mean you can just download the package then as built by the official CI, i.e. on an Ubuntu AMD64 host: #5026

Otherwise if you have a similar host and can set up QEMU with docker on it then we can build the same target manually:

./packaging/build.sh -v master -b master -d centos/7.arm64v8

@patrick-stephens
Contributor

@noly (and anyone else) can you test the packages from here (once completed) on your target to confirm?
https://github.com/fluent/fluent-bit/actions/runs/1956875452

You can now control the jemalloc configuration via the FLB_JEMALLOC_OPTIONS CMake variable. It defaults to --with-lg-quantum=3, so make sure to include that in any override.
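
As an illustration of such an override, the sketch below composes a value that keeps the default and adds a page-size option, then prints a configure command instead of running it. The helper name jemalloc_options and the trailing `..` source directory are assumptions, and how the build quotes or splits the value may differ between versions.

```shell
#!/bin/sh
# Compose an FLB_JEMALLOC_OPTIONS value that keeps the default
# (--with-lg-quantum=3) and adds a page-size override.
jemalloc_options() {
    lg_page="$1"
    echo "--with-lg-page=${lg_page} --with-lg-quantum=3"
}

# Print (do not run) a configure step using that value; ".." assumes an
# out-of-tree build directory, which is not stated in this thread.
printf 'cmake -DFLB_JEMALLOC=On -DFLB_JEMALLOC_OPTIONS="%s" ..\n' "$(jemalloc_options 16)"
```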

@patrick-stephens
Contributor

Looks like it is not splitting the arguments so I'll fix that:

CONFIG : '--with-lg-page=16 --with-lg-quantum=3'

Apologies for that.
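
The same class of problem can be shown in plain shell, as an analogy only (this is not the CMake change made for Fluent Bit): a quoted string reaches the callee as one argument, while an unquoted expansion is split on whitespace. The helper count_args is hypothetical.

```shell
#!/bin/sh
# Print how many arguments the function received.
count_args() {
    echo "$#"
}

opts='--with-lg-page=16 --with-lg-quantum=3'

count_args "$opts"   # prints 1: the whole string is a single argument
# Unquoted on purpose so the shell splits on whitespace.
count_args $opts     # prints 2: one argument per option
```

A configure script that receives the single-argument form sees one unknown option rather than two valid ones, which matches the CONFIG line quoted above.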

@patrick-stephens
Contributor

One good thing this picked up was that the default fluent-bit package was being built without jemalloc, so we fixed two bugs for the price of one, @noly!

@patrick-stephens
Contributor

@noly try this one https://github.com/fluent/fluent-bit/actions/runs/1961696121

There should be a package to download based on master for the ARM 64 target

@noly
Author

noly commented Mar 10, 2022

On my way!!

@noly
Author

noly commented Mar 10, 2022

Did a quick test and it worked!! 👯

Screen Shot 2022-03-10 at 11 54 02

Amazing work everyone!! and Kudos to you @patrick-stephens!!!

@noly
Author

noly commented Mar 11, 2022

@patrick-stephens do you know when this change will be available? Will it be in v1.9.0?

@patrick-stephens
Contributor

Yes, the staging builds from 1.9 should also have it if you want to test the latest RC: https://github.com/fluent/fluent-bit/actions/runs/1961119944

@patrick-stephens
Contributor

I've updated the discussion on RCs as well with the details: #4954

@patrick-stephens
Contributor

The next 1.8 release should also have it, basically the next release of either.

@loxiaopohaiol

@patrick-stephens I'm using the latest images on an arm64 machine and have the same problem:
image

@patrick-stephens
Contributor

Container images? They're debian based.

@loxiaopohaiol

> Container images? They're debian based.

I use the fluent/fluent-bit:latest Docker image to run Fluent Bit, but it produces the same error log. My machine is arm64 and the OS is YHKYLIN-OS (based on Linux).

@patrick-stephens
Contributor

The container images are a Debian distro, so my guess is that your distro uses a different page size. Can you confirm the page size your distro uses (I've never heard of it)?

@loxiaopohaiol

loxiaopohaiol commented Aug 24, 2022

DFLB_JEMALLOC

The system page size looks to be 65536:
image

@lusson-luo

> The container images are a Debian distro so sounds like your distro uses a different page size is my guess. Can you confirm the page size your distro uses (I've never heard of it)?

+1
