ARM64/PyTorch Fatal Python error: Illegal instruction #744

Closed
2 tasks done
Mindstan opened this issue Feb 5, 2023 · 12 comments
Labels
ARM bug Something isn't working

Comments

@Mindstan

Mindstan commented Feb 5, 2023

πŸ› Bug Report

  • πŸ“ I've Included a ZIP file containing my librephotos log files: podman_logs.zip
  • ❌ I have looked for similar issues (including closed ones)

πŸ“ Description of issue:

When starting a scan from Nextcloud, the workers crash with the error Fatal Python error: Illegal instruction, which comes from PyTorch.

The backend is deployed using the docker-compose.yml and Podman (4.3), on a Raspberry Pi 4 8 GB v1.

This is a duplicate of #406, but the bug is still present in both the latest and dev Docker images. It was working back in November 2022, but I haven't been able to make it work since.

All other functionality (anything that does not involve scanning new images) appears to work as usual.

How can we reproduce it:

Start a scan on an ARM64 device running in Docker/Podman.
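For anyone who wants to check for the crash without starting a full scan, here is a minimal Python sketch that runs a snippet in a child interpreter and reports whether it was killed by SIGILL. The helper name `probe` and the `TORCH_SNIPPET` string are made up for illustration, and it is an assumption that importing torch and running one small operation hits the same code path as a scan; the thread only reports the crash during scanning.

```python
import signal
import subprocess
import sys

# Assumption: a torch import plus one small matrix product exercises the failing path.
TORCH_SNIPPET = "import torch; print(torch.ones(2, 2) @ torch.ones(2, 2))"


def probe(snippet, timeout=120):
    """Run `snippet` in a child interpreter and classify how it exited."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode == 0:
        return "ok"
    # On POSIX, a negative return code means the child was killed by that signal.
    if result.returncode == -signal.SIGILL:
        return "illegal instruction"
    return f"failed (exit code {result.returncode})"
```

Calling `probe(TORCH_SNIPPET)` inside the backend container keeps the parent process alive even if the child dies, so the result can be logged or compared across image tags.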

Please provide additional information:

  • Operating system: Raspberry Pi OS 64-bit (Debian 11, up to date), kernel 5.15.84-v8+.

  • Architecture (x86 or ARM): ARM64 (armv8)

  • Librephotos version: 2023-01-30T11:15:53.089771527Z

  • Librephotos installation method (Docker, Kubernetes, .deb, etc.): Docker

    • If Docker or Kubernetes, provide docker-compose image tag: both latest and dev
  • How is your picture library mounted (Local file system (Type), NFS, SMB, etc.): Local filesystem as Docker volume

Below is the slightly modified docker-compose.yml I'm using:

version: "3.8"

networks:
  proxy:
    external: true

services:
  proxy:
    image: docker://reallibrephotos/librephotos-proxy:${tag}
    container_name: librephotos-proxy
    restart: always
    volumes:
      - ${scanDirectory}:/data
      - ${data}/protected_media:/protected_media
    ports: # - ${httpPort}:80
    depends_on:
      - backend
      - frontend
    networks:
      - proxy
    labels:
      traefik.http.routers.photos.entrypoints: websecure
      traefik.http.routers.photos.rule: Host(`photos.domain`)
      traefik.http.services.photos.loadbalancer.server.port: 80
      traefik.enable: true

  db:
    image: docker://postgres:13
    container_name: librephotos-db
    restart: always
    environment:
      - POSTGRES_USER=${dbUser}
      - POSTGRES_PASSWORD=${dbPass}
      - POSTGRES_DB=${dbName}
    volumes:
      - ${data}/db:/var/lib/postgresql/data
    command: postgres -c fsync=off -c synchronous_commit=off -c full_page_writes=off -c random_page_cost=1.0
    #Checking health of Postgres db
    healthcheck:
      test: psql -U ${dbUser} -d ${dbName} -c "SELECT 1;"
      interval: 5s
      timeout: 5s
      retries: 5

  frontend:
    image: docker://reallibrephotos/librephotos-frontend:${tag}
    container_name: librephotos-frontend
    restart: always
    depends_on:
      - backend

  backend:
    image: docker://reallibrephotos/librephotos:${tag}
    container_name: librephotos-backend
    restart: always
    volumes:
      - ${scanDirectory}:/data
      - ${data}/protected_media:/protected_media
      - ${data}/logs:/logs
      - ${data}/cache:/root/.cache

    environment:
      - SECRET_KEY=${shhhhKey}
      - BACKEND_HOST=backend
      - ADMIN_EMAIL=${adminEmail}
      - ADMIN_USERNAME=${userName}
      - ADMIN_PASSWORD=${userPass}
      - DB_BACKEND=postgresql
      - DB_NAME=${dbName}
      - DB_USER=${dbUser}
      - DB_PASS=${dbPass}
      - DB_HOST=${dbHost}
      - DB_PORT=5432
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - MAPBOX_API_KEY=${mapApiKey}
      - WEB_CONCURRENCY=${gunniWorkers}
      - SKIP_PATTERNS=${skipPatterns}
      - ALLOW_UPLOAD=${allowUpload}
      - DEBUG=0
      - HEAVYWEIGHT_PROCESS=${HEAVYWEIGHT_PROCESS}

    # Wait for Postgres
    depends_on:
      db:
        condition: service_healthy

  redis:
    image: docker://redis:6
    container_name: librephotos-redis
    restart: always
@Mindstan Mindstan added the bug Something isn't working label Feb 5, 2023
@sickelap
Contributor

sickelap commented Feb 5, 2023

I tested scanning Nextcloud photos on an aarch64 machine, and it worked fine for me.

FYI, I am using a Rock 4SE (4 GB) with an NVMe drive. In general, it is almost the same as an RPi. I run Docker, not Podman. My docker-compose.yml is not modified. I rebuilt all images yesterday from source, including base and dependencies.

@SlimyKitten

I have the same problem, even after pulling the latest Docker files as described on the website (https://docs.librephotos.com/1/standard_install/). Attached is the log for the backend container, which is the one throwing the error. For me this happens when a scan of the filesystem is started.
_backend_logs.txt

@Mindstan
Author

Mindstan commented Feb 6, 2023

I've read on other projects that PyTorch might be compiled for Armv8.2-A, which has a few more instructions than the Armv8.0-A implemented by the Raspberry Pi 4's Cortex-A72 CPU.
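To see what the CPU actually advertises, a small Python sketch like the following can parse the Features line of /proc/cpuinfo. The three flags in `NEWER_FEATURES` are only an illustrative selection of post-Armv8.0 extensions as Linux names them; which instruction PyTorch actually trips over is not established in this thread, and the function names are hypothetical.

```python
# Optional features, as named on the "Features" line of /proc/cpuinfo on arm64 Linux.
# This selection is an illustrative assumption, not a list of what PyTorch requires.
NEWER_FEATURES = {
    "atomics": "LSE atomics (Armv8.1)",
    "asimdhp": "half-precision SIMD (Armv8.2)",
    "asimddp": "SIMD dot product (Armv8.2 and later)",
}


def cpu_features(cpuinfo_text):
    """Return the set of feature flags from the first 'Features' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text):
    """Return the entries of NEWER_FEATURES that the CPU does not advertise."""
    present = cpu_features(cpuinfo_text)
    return {name: desc for name, desc in NEWER_FEATURES.items() if name not in present}
```

On the device, `missing_features(open("/proc/cpuinfo").read())` lists the extensions the CPU lacks, which can then be compared against the target a given wheel was built for.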

What has changed since November is the release of PyTorch 1.13, and this is the version currently shipped with the Docker image:

$ podman exec -it librephotos-backend pip show torch
Name: torch
Version: 1.13.0

There is an open issue on the PyTorch repository: pytorch/pytorch#90535.
I added the environment variable OPENBLAS_CORETYPE=ARMV8.

With it set, I now get this error, which is probably unrelated:

[2023-02-06 13:21:34 +0000] [140] [CRITICAL] WORKER TIMEOUT (pid:143)
Exception ignored from cffi callback <function _log_handler_callback at 0x7f74b83ac0>:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pyvips/__init__.py", line 132, in _log_handler_callback
    @ffi.def_extern()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1

@SlimyKitten

Downgrading to torch v1.12.0 from a shell inside the container also seems to work, so that might be a workaround in the meantime.

@derneuere
Member

Probably the same issue as in #738

It could be that this dependency in this line here changed: https://github.com/LibrePhotos/librephotos-docker/blob/00d97785992b8417d82489614d187e38b481288c/backend/base/Dockerfile#L61

We need a custom PyTorch dependency, because Raspberry Pis do not support the whole ARM instruction set. We solved this issue by using this third-party PyTorch dependency: https://github.com/KumaTea/pytorch-aarch64

It would be great if one of you could figure out which version works correctly, as my Raspberry Pi died a while ago.

@SlimyKitten

I can help with this; however, I am not very experienced in developing Docker and Python projects. So if you have specific instructions, I would be happy to follow them :)

@Mindstan
Author

Mindstan commented Feb 6, 2023

I've built PyTorch wheels for multiple versions on a Jetson Xavier; sadly, those devices use Armv8.2, so I can't reuse the wheels (building PyTorch with CUDA bindings took 1.5 hours on an AGX Xavier, which has 8 cores and 32 GB of RAM).

It's not hard to build the wheels: you need to install recent versions of the build dependencies (such as CMake and Ninja). The PyTorch build script is highly parameterized, so you also need to know which acceleration/compute libraries you want to use and install them as system dependencies. The instructions to compile for Jetson devices can be found here (in theory, you don't need to apply the patches, since they are for CUDA).

@Mindstan
Author

I confirm that my second issue is unrelated. If I raise the Gunicorn worker timeout from 30s (the default) to 300s in entrypoint.sh, the workers crash around 10 minutes later. For now I have disabled the timeout by setting it to 0 (for people looking for the solution: you shouldn't do this in production), and the scan could finish.
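For reference, the same setting can also live in a Gunicorn config file, which is plain Python. This is only a sketch: the image in this thread sets the timeout in entrypoint.sh, and the environment variable name `LP_GUNICORN_TIMEOUT` and the helper `read_timeout` are hypothetical. Gunicorn reads the module-level `timeout` value, where 0 disables the worker timeout entirely.

```python
# gunicorn.conf.py (sketch)
import os


def read_timeout(env, default=30):
    """Return the worker timeout in seconds from a hypothetical env variable.

    Falls back to `default` when the variable is unset, not an integer, or negative.
    """
    raw = env.get("LP_GUNICORN_TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 0 else default


# Gunicorn picks up this module-level name; 0 means workers are never timed out.
timeout = read_timeout(os.environ)
```

Keeping the value behind an environment variable would let a long scan run with a larger timeout without editing the image, while the default stays at Gunicorn's usual 30 seconds.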

The workers did successfully process all pictures without errors, so the fix for the initial problem is to add OPENBLAS_CORETYPE=ARMV8 in the environment variables.
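If editing the compose file is not an option, the variable can in principle be set from Python instead, as long as that happens before anything that loads OpenBLAS is imported. A minimal sketch, assuming the torch build in the image links against OpenBLAS; the function name `apply_openblas_workaround` is made up for illustration:

```python
import os
import platform


def apply_openblas_workaround(env, machine):
    """Set OPENBLAS_CORETYPE=ARMV8 on 64-bit ARM unless a value is already present.

    Returns True if the variable was added.
    """
    if machine not in ("aarch64", "arm64"):
        return False
    if "OPENBLAS_CORETYPE" in env:
        return False
    env["OPENBLAS_CORETYPE"] = "ARMV8"
    return True


# Must run before anything that loads OpenBLAS (for example NumPy or PyTorch) is
# imported, because the library reads the variable when it is initialised.
apply_openblas_workaround(os.environ, platform.machine())
```

An explicit value from the compose file still wins, since the helper never overwrites an existing setting.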

@derneuere derneuere added the ARM label Mar 5, 2023
@derneuere
Member

I changed a couple of things in the base image and in the entrypoint file. It would be great if you could test whether it works again.

@derneuere derneuere mentioned this issue Mar 24, 2023
@derneuere
Member

Should work now in general, but the timeout issue still persists. Would be great if somebody else creates a separate issue for the timeout issue.

@Mindstan
Author

The current dev image doesn't run on my Raspberry Pi, again with an Illegal instruction error, this time coming from PyTorch 2.0.0: pytorch/pytorch#97226
I will try to downgrade to 1.13.1 and see what happens.

@Mindstan
Author

After the downgrade, I don't see any errors.
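To summarise what has been reported here, the sketch below reads the installed torch version from package metadata (so torch itself is never imported and cannot crash the check) and looks it up in a table of outcomes from this thread. The table only repeats what commenters reported on their Raspberry Pi 4 setups and is not independently verified; the names `REPORTS`, `installed_version` and `thread_report` are made up for illustration.

```python
from importlib import metadata

# Outcomes on a Raspberry Pi 4 as reported in this thread; not independently verified.
REPORTS = {
    "1.12.0": "worked after a manual downgrade",
    "1.13.0": "illegal instruction; worked with OPENBLAS_CORETYPE=ARMV8",
    "1.13.1": "worked after downgrading from 2.0.0",
    "2.0.0": "illegal instruction",
}


def installed_version(package="torch"):
    """Return the installed version string without importing the package, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def thread_report(version):
    """Look up this thread's report for a version, ignoring suffixes like '+cpu'."""
    if version is None:
        return "not installed"
    base = version.split("+", 1)[0]
    return REPORTS.get(base, "no report in this thread")
```

Running `thread_report(installed_version())` inside the backend container gives a quick hint about whether the shipped version is one that was reported to crash.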

4 participants