ARM64/PyTorch Fatal Python error: Illegal instruction #744

Mindstan · 2023-02-05T16:33:02Z

🐛 Bug Report

📁 I've Included a ZIP file containing my librephotos log files: podman_logs.zip
❌ I have looked for similar issues (including closed ones)

📝 Description of issue:

When I start a scan from Nextcloud, the workers crash with the error "Fatal Python error: Illegal instruction", which comes from PyTorch.

The backend is deployed using the docker-compose.yml and Podman (4.3), on a Raspberry Pi 4 8 GB v1.

This is a duplicate of #406, but the bug is still present in both the latest and dev Docker images. It was working back in November 2022, but I haven't been able to make it work since.

Every other feature (anything that does not involve scanning new images) appears to work as usual.

🔁 How can we reproduce it:

Start a scan on an ARM64 device running in Docker/Podman.

Please provide additional information:

💻 Operating system: Raspberry Pi OS 64-bit (Debian 11, up to date), kernel 5.15.84-v8+.
⚙ Architecture (x86 or ARM): ARM64 (armv8)
🔢 Librephotos version: 2023-01-30T11:15:53.089771527Z
📸 Librephotos installation method (Docker, Kubernetes, .deb, etc.): Docker
- 🐋 If Docker or Kubernetes, provide docker-compose image tag: both latest and dev
📁 How is your picture library mounted (Local file system (Type), NFS, SMB, etc.): Local filesystem as Docker volume

Below is the slightly modified docker-compose.yml I'm using:

version: "3.8"

networks:
  proxy:
    external: true

services:
  proxy:
    image: docker://reallibrephotos/librephotos-proxy:${tag}
    container_name: librephotos-proxy
    restart: always
    volumes:
      - ${scanDirectory}:/data
      - ${data}/protected_media:/protected_media
    ports: # - ${httpPort}:80
    depends_on:
      - backend
      - frontend
    networks:
      - proxy
    labels:
      traefik.http.routers.photos.entrypoints: websecure
      traefik.http.routers.photos.rule: Host(`photos.domain`)
      traefik.http.services.photos.loadbalancer.server.port: 80
      traefik.enable: true

  db:
    image: docker://postgres:13
    container_name: librephotos-db
    restart: always
    environment:
      - POSTGRES_USER=${dbUser}
      - POSTGRES_PASSWORD=${dbPass}
      - POSTGRES_DB=${dbName}
    volumes:
      - ${data}/db:/var/lib/postgresql/data
    command: postgres -c fsync=off -c synchronous_commit=off -c full_page_writes=off -c random_page_cost=1.0
    #Checking health of Postgres db
    healthcheck:
      test: psql -U ${dbUser} -d ${dbName} -c "SELECT 1;"
      interval: 5s
      timeout: 5s
      retries: 5

  frontend:
    image: docker://reallibrephotos/librephotos-frontend:${tag}
    container_name: librephotos-frontend
    restart: always
    depends_on:
      - backend

  backend:
    image: docker://reallibrephotos/librephotos:${tag}
    container_name: librephotos-backend
    restart: always
    volumes:
      - ${scanDirectory}:/data
      - ${data}/protected_media:/protected_media
      - ${data}/logs:/logs
      - ${data}/cache:/root/.cache

    environment:
      - SECRET_KEY=${shhhhKey}
      - BACKEND_HOST=backend
      - ADMIN_EMAIL=${adminEmail}
      - ADMIN_USERNAME=${userName}
      - ADMIN_PASSWORD=${userPass}
      - DB_BACKEND=postgresql
      - DB_NAME=${dbName}
      - DB_USER=${dbUser}
      - DB_PASS=${dbPass}
      - DB_HOST=${dbHost}
      - DB_PORT=5432
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - MAPBOX_API_KEY=${mapApiKey}
      - WEB_CONCURRENCY=${gunniWorkers}
      - SKIP_PATTERNS=${skipPatterns}
      - ALLOW_UPLOAD=${allowUpload}
      - DEBUG=0
      - HEAVYWEIGHT_PROCESS=${HEAVYWEIGHT_PROCESS}

    # Wait for Postgres
    depends_on:
      db:
        condition: service_healthy

  redis:
    image: docker://redis:6
    container_name: librephotos-redis
    restart: always

sickelap · 2023-02-05T17:49:39Z

I tested scanning Nextcloud photos on an aarch64 machine, and it worked fine for me.

FYI, I am using a Rock 4SE (4 GB) with an NVMe drive. In general, it is almost the same as an RPi. I run Docker, not Podman. My docker-compose.yml is not modified. I rebuilt all images yesterday from source, including base and dependencies.

SlimyKitten · 2023-02-06T12:51:36Z

I have the same problem, even after pulling the latest Docker files as described on the website (https://docs.librephotos.com/1/standard_install/). Attached is the log for the backend container, which is throwing the error. For me this happens when a scan of the filesystem is started.
_backend_logs.txt

Mindstan · 2023-02-06T13:25:08Z

I've read on other projects that PyTorch might be compiled for ARMv8.2, which has a few more instructions than the ARMv8.0-A baseline implemented by the Raspberry Pi 4's Cortex-A72 CPU.

What has changed since November is the release of PyTorch 1.13, and this is the version currently shipped with the Docker image:
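One way to see which newer extensions a board lacks is to read the "Features" line of /proc/cpuinfo. The sketch below is only an illustration: the helper names and the three-flag list are assumptions chosen for this example, not a complete list and not something PyTorch itself checks.

```python
from pathlib import Path

# Linux arm64 flag names for a few extensions newer than baseline ARMv8.0-A.
# Illustrative subset only (assumption for this example), not a complete list.
NEWER_ARM_FLAGS = {
    "atomics": "ARMv8.1 LSE atomics",
    "asimdhp": "ARMv8.2 half-precision SIMD",
    "asimddp": "ARMv8.2 dot product",
}


def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            return set(value.split())
    return set()  # no 'Features' line, e.g. on a non-ARM machine


def missing_newer_flags(cpuinfo_text):
    """Return the sorted names from NEWER_ARM_FLAGS that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return sorted(name for name in NEWER_ARM_FLAGS if name not in flags)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        for name in missing_newer_flags(path.read_text()):
            print(f"missing: {name} ({NEWER_ARM_FLAGS[name]})")
```

A binary built to use any of the flags reported as missing would be expected to fail with an illegal instruction on that CPU.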

$ podman exec -it librephotos-backend pip show torch
Name: torch
Version: 1.13.0
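
The same check can be sketched in Python without importing torch, so none of its native code runs. `installed_version` is a hypothetical helper name; it relies only on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Reads the package metadata only; torch itself is never imported.
print("torch:", installed_version("torch"))
```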

There is an open issue on the PyTorch repository: pytorch/pytorch#90535.
I added the environment variable OPENBLAS_CORETYPE=ARMV8.

With it set, I now get this issue, which is probably unrelated:

[2023-02-06 13:21:34 +0000] [140] [CRITICAL] WORKER TIMEOUT (pid:143)
Exception ignored from cffi callback <function _log_handler_callback at 0x7f74b83ac0>:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pyvips/__init__.py", line 132, in _log_handler_callback
    @ffi.def_extern()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1

SlimyKitten · 2023-02-06T14:31:38Z

Downgrading to torch v1.12.0 by exec-ing into the container also seems to work, so that might be a workaround in the meantime.

derneuere · 2023-02-06T15:47:16Z

Probably the same issue as in #738

It could be that this dependency in this line here changed: https://github.com/LibrePhotos/librephotos-docker/blob/00d97785992b8417d82489614d187e38b481288c/backend/base/Dockerfile#L61

We need a custom PyTorch dependency, because Raspberry Pis do not support the whole ARM instruction set. We solved this issue by using this third-party PyTorch build: https://github.com/KumaTea/pytorch-aarch64

It would be great if one of you could figure out which version works correctly, as my Raspberry Pi died a while ago.

SlimyKitten · 2023-02-06T16:39:21Z

I can help with this; however, I am not very experienced in the development of Docker and Python projects. So if you have specific instructions, I would be happy to follow them :)

Mindstan · 2023-02-06T18:57:14Z

I've built wheels for multiple PyTorch versions on a Jetson Xavier. Sadly, it uses ARMv8.2, so I can't reuse the wheels (it took 1.5 hours to build PyTorch with CUDA bindings on the AGX Xavier, which has 8 cores and 32 GB of RAM).

It's not hard to build the wheels, but you need to install recent versions of the compilation dependencies (like CMake and Ninja). The PyTorch build script is highly parametrized, so you also need to know which acceleration/compute libraries you want to work with and install them as system dependencies. The instructions to compile for Jetson devices can be found here (theoretically, you don't need to apply the patches since they are for CUDA).

Mindstan · 2023-02-11T15:06:07Z

I confirm that my second issue is unrelated. If I raise the Gunicorn worker timeout from 30s (the default value) to 300s in entrypoint.sh, the workers crash around 10 minutes later. For now I have deactivated the timeout by setting it to 0 (for people looking for the solution: you shouldn't do this in production), and the scan could finish.
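
Gunicorn also accepts its settings from a Python config file passed with `-c`, which is an alternative to editing entrypoint.sh. The sketch below is an assumption about how that could look, not what the LibrePhotos image does; `GUNICORN_TIMEOUT` is a hypothetical variable name invented for this example.

```python
# gunicorn.conf.py (hypothetical file, not shipped with LibrePhotos)
import os

# Worker timeout in seconds. Gunicorn's default is 30, and 0 disables the
# timeout entirely. GUNICORN_TIMEOUT is a made-up variable for this sketch.
timeout = int(os.environ.get("GUNICORN_TIMEOUT", "30"))
```

With such a file, the timeout could be changed from the container environment without rebuilding the image.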

The workers did successfully process all pictures without errors, so the fix for the initial problem is to add OPENBLAS_CORETYPE=ARMV8 in the environment variables.
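
In the thread the variable was set through the container environment. For a Python process started some other way (an assumption, not a case the thread describes), a minimal sketch of the same idea follows. `pin_openblas_coretype` is a hypothetical helper name.

```python
import os


def pin_openblas_coretype(coretype="ARMV8"):
    """Set OPENBLAS_CORETYPE unless the environment already defines it.

    Returns the value that is in effect afterwards.
    """
    return os.environ.setdefault("OPENBLAS_CORETYPE", coretype)


# Call this before the first import of torch or numpy: OpenBLAS is expected
# to read the variable when the library is loaded, so setting it later
# would have no effect on an already loaded library.
pin_openblas_coretype()
```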

derneuere · 2023-03-23T08:42:01Z

I changed a couple of things in the base image and in the entrypoint file. It would be great if you could test whether it works again.

derneuere · 2023-03-24T09:34:18Z

Should work now in general, but the timeout issue still persists. Would be great if somebody else creates a separate issue for the timeout issue.

Mindstan · 2023-03-25T13:21:08Z

The current dev image doesn't run on my Raspberry Pi, again with an Illegal instruction error, this time coming from PyTorch 2.0.0: pytorch/pytorch#97226
I will try to downgrade to 1.13.1 and see what happens.

Mindstan · 2023-03-25T13:57:06Z

After the downgrade, I don't see any errors.

Mindstan added the bug label Feb 5, 2023

derneuere added the ARM label Mar 5, 2023

derneuere mentioned this issue Mar 24, 2023

can't scan #738

Closed

derneuere closed this as completed Mar 24, 2023
