[bug] Signals don't get propagated to the training script #632

shiftan · 2020-09-27T08:04:55Z

Checklist

[V] I've prepended issue tag with type of change: [bug]
[ ] (If applicable) I've attached the script to reproduce the bug
[V] (If applicable) I've documented below the DLC image/dockerfile this relates to
[ ] (If applicable) I've documented below the tests I've run on the DLC image
[V] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

Concise Description:
Images (I've tested the PyTorch one) aren't built as suggested here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html.
Specifically, the entrypoint is set to a bash script instead of directly to the Python code, so the Python code isn't running as PID 1 and signals don't get propagated to it.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3

Current behavior:
A handler registered for SIGTERM with "signal.signal(sigName, handler)" in a training script never gets called, e.g. when setting max_run to 60 and waiting long enough.
Also, running "ps -elf" via subprocess.run("ps -elf", shell=True) from a training script shows the following:
4 S root 1 0 0 80 0 - 4941 - 07:57 ? 00:00:00 bash -m start_with_right_hostname.sh train
4 S root 15 1 2 80 0 - 56741 - 07:57 ? 00:00:00 /opt/conda/bin/python /opt/conda/bin/train
4 S root 26 15 0 80 0 - 7630 - 07:57 ? 00:00:00 /opt/conda/bin/python shell_launcher.py --SSM_CMD_LINE ps -elf

0 S root 27 26 0 80 0 - 1641 - 07:57 ? 00:00:00 /bin/sh -c ps -elf
0 R root 28 27 0 80 0 - 9041 - 07:57 ? 00:00:00 ps -elf
As you can see, the Python process isn't PID 1.

Expected behavior:
Signals should be propagated and the Python script should run as PID 1 (unless signals can be propagated some other way).
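
To illustrate what "propagated some other way" could look like: if the script cannot be PID 1, the process sitting above it can relay the signals it receives. The following is only a minimal sketch of that idea, not what the DLC image or its entrypoint script actually does. `run_and_forward` is a hypothetical helper name, and the set of signals simply mirrors the reproduction script in this issue.

```python
import signal
import subprocess
import sys


def run_and_forward(cmd, signals=(signal.SIGTERM, signal.SIGINT, signal.SIGHUP)):
    """Start cmd as a child process and relay the given signals to it.

    Hypothetical helper for illustration. Returns the child's exit code.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Relay the signal to the child instead of acting on it here.
        child.send_signal(signum)

    # Install the relay and remember whatever handlers were there before.
    previous = {sig: signal.signal(sig, forward) for sig in signals}
    try:
        # wait() is resumed automatically after the handler returns.
        return child.wait()
    finally:
        # Put the earlier handlers back (None means "not set from Python").
        for sig, old in previous.items():
            if old is not None:
                signal.signal(sig, old)
```

A parent process could call, for example, `run_and_forward([sys.executable, "sig_test.py"])` so that a SIGTERM sent to the parent reaches the handler registered in the script.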

Additional context:
The script below, used as a training script, shows the problem when the job runs with a small max_run value or is simply stopped from the console.

`
import signal
import sys
import time

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Waiting for a signal...")
while True:
    time.sleep(1)
`

SergTogul · 2020-10-20T00:29:34Z

Good evening.
Thank you for your question. So here it is:

Are you using a BYO container or a pre-built PyTorch container?

BYO container: you should write your own Dockerfile and follow the instructions here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html. With this, you can define the command that runs when the container starts.
Pre-built PyTorch container: your training script ONLY needs to define the PyTorch-related training steps, for example: https://github.com/aws/sagemaker-python-sdk/blob/master/tests/data/pytorch_mnist/mnist.py, because by default the pre-built container runs this (https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L112) instead.

As for the script you noted: it is used in the PyTorch training container to change the hostname to algo-1, algo-2, ... instead of the AWS one, by changing the behavior of socket.gethostname (https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L80), so that NCCL and MPI know the hosts when there are several of them. With a single host, the hostname will be algo-1.

shiftan · 2020-10-20T07:16:12Z

I'm using a pre-built PyTorch image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
As I understand it, registering signal handlers is the way to learn that the instance is about to be stopped, e.g. to save state beforehand.
@SergTogul Are you saying this isn't supported with the pre-built images? Is there any other way?
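
For context, this is roughly the pattern I have in mind for saving state. It is a minimal sketch and only helps if the signal actually reaches the script. `StopRequested`, `train`, and `checkpoint_path` are hypothetical names, and the JSON "checkpoint" stands in for a real model save.

```python
import json
import signal


class StopRequested:
    """Remembers that a termination signal arrived so the loop can exit cleanly."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGINT)):
        self.signum = None
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only record the signal; the training loop decides when to stop.
        self.signum = signum

    def __bool__(self):
        return self.signum is not None


def train(num_steps, checkpoint_path, stop, step_fn=lambda step: None):
    """Run up to num_steps steps; write a checkpoint and return early on stop."""
    for step in range(num_steps):
        if stop:
            # Placeholder for a real checkpoint (model weights, optimizer state).
            with open(checkpoint_path, "w") as f:
                json.dump({"next_step": step, "signal": int(stop.signum)}, f)
            return step
        step_fn(step)
    return num_steps
```

Setting a flag in the handler and checking it between steps keeps the checkpoint write out of the handler itself, so a step is never interrupted halfway.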

I've just tested it again with the code below and max_run=60, and got the following output:


Invoking script with the following command:

/opt/conda/bin/python sig_test.py


Listing processes
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  80   0 -  4941 -      07:09 ?        00:00:00 bash -m start_with_right_hostname.sh train
4 S root        14     1  1  80   0 - 56767 -      07:09 ?        00:00:00 /opt/conda/bin/python /opt/conda/bin/train
0 S root        25    14  0  80   0 -  6869 -      07:10 ?        00:00:00 /opt/conda/bin/python sig_test.py
0 S root        26    25  0  80   0 -  1641 -      07:10 ?        00:00:00 /bin/sh -c ps -elf
0 R root        27    26  0  80   0 -  9041 -      07:10 ?        00:00:00 ps -elf
Waiting for a signal...

2020-10-20 07:13:55 Stopping - Stopping the training job
2020-10-20 07:16:16 Uploading - Uploading generated training model
2020-10-20 07:16:16 MaxRuntimeExceeded - Training job runtime exceeded MaxRuntimeInSeconds provided
Training seconds: 440
Billable seconds: 440

The python code:

import signal
import sys
import time
import subprocess

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Listing processes")
subprocess.run("ps -elf", shell=True)
print("Waiting for a signal...")
while True:
    time.sleep(1)
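
As a sanity check outside the container, the handler logic can be exercised by sending SIGTERM straight to the process. This is a sketch under assumptions: `CHILD` is a trimmed copy of the script above, `handler_runs_on_direct_sigterm` is a hypothetical helper, and it says nothing about how SageMaker delivers signals. It only checks that the handler runs once the signal arrives.

```python
import signal
import subprocess
import sys

# Trimmed copy of the script above, with flushed prints so output is not lost.
CHILD = """
import signal, sys, time
def handler(signum, frame):
    print("Signal handler called with signal", signum, flush=True)
    sys.exit(0)
signal.signal(signal.SIGTERM, handler)
print("Waiting for a signal...", flush=True)
while True:
    time.sleep(1)
"""


def handler_runs_on_direct_sigterm(timeout=10):
    """Start the child, send it SIGTERM directly, return (exit code, output)."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    # The first line is printed only after the handler has been registered.
    ready_line = proc.stdout.readline()
    proc.send_signal(signal.SIGTERM)
    # read() returns once the child closes stdout, i.e. when it exits.
    remaining = proc.stdout.read()
    returncode = proc.wait(timeout=timeout)
    return returncode, ready_line + remaining
```

If this reports exit code 0 and the handler's message locally while the same script stays silent in the training job, the difference is in how the signal is delivered, not in the handler.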

theo-rogers · 2024-08-26T19:55:03Z

Is there any update on this? I am running into the same issue.

saimidu added the bug label on Sep 28, 2020

SergTogul removed the bug label on Oct 20, 2020

arjkesh added the bug and pending research labels and removed the bug label on Jul 13, 2021
