[bug] Signals don't get propagated to the training script #632
Comments
Good evening.
I'm using a pre-built PyTorch image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3. I've just tested it again with the code below and max_run=60, and got this output:
The Python code:
Is there any update on this? I am running into the same issue.
Concise Description:
The images (I've tested the PyTorch one) aren't built as suggested here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html.
Specifically, the entrypoint is set to a bash script instead of directly to the Python code, so the Python code isn't running as PID 1 and signals don't get propagated to it.
DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
Current behavior:
A handler registered for SIGTERM with "signal.signal(sigName, handler)" in a training script doesn't get called, e.g. when setting max_run to 60 and waiting long enough.
Also, running "ps -elf" via subprocess.run("ps -elf", shell=True) from a training script shows the following:
4 S root 1 0 0 80 0 - 4941 - 07:57 ? 00:00:00 bash -m start_with_right_hostname.sh train
4 S root 15 1 2 80 0 - 56741 - 07:57 ? 00:00:00 /opt/conda/bin/python /opt/conda/bin/train
4 S root 26 15 0 80 0 - 7630 - 07:57 ? 00:00:00 /opt/conda/bin/python shell_launcher.py --SSM_CMD_LINE ps -elf
0 S root 27 26 0 80 0 - 1641 - 07:57 ? 00:00:00 /bin/sh -c ps -elf
0 R root 28 27 0 80 0 - 9041 - 07:57 ? 00:00:00 ps -elf
As you can see, the Python process isn't PID 1; PID 1 is the bash script.
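The same check can also be made from inside the training script without shelling out to ps, by asking the interpreter for its own process ID. A minimal sketch using only the standard library; the helper name `describe_process` is made up for illustration:

```python
import os


def describe_process():
    """Return this process's PID, its parent's PID, and whether it is PID 1."""
    pid = os.getpid()
    return {"pid": pid, "ppid": os.getppid(), "is_pid_1": pid == 1}


if __name__ == "__main__":
    # Inside a container whose entrypoint is a wrapper script,
    # is_pid_1 is expected to be False for the training process.
    print(describe_process())
```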
Expected behavior:
Signals get propagated and the Python script runs as PID 1 (unless signals can be propagated some other way).
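One way signals could be propagated without the script being PID 1 is for the parent process to relay them to its child. The sketch below illustrates that idea in Python. It is not how the DLC entrypoint works, `run_and_forward` is a hypothetical helper name, and it assumes a POSIX system (SIGHUP does not exist on Windows).

```python
import signal
import subprocess
import sys


def run_and_forward(cmd, signals=(signal.SIGTERM, signal.SIGINT, signal.SIGHUP)):
    """Start cmd as a child process, relay the given signals to it,
    and return the child's exit code."""
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Relay the signal to the child instead of handling it here.
        child.send_signal(signum)

    # Install the relay handler, remembering what was there before.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        return child.wait()
    finally:
        # Restore the original handlers once the child has exited.
        for s, old_handler in previous.items():
            signal.signal(s, old_handler)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python wrapper.py python train.py
    sys.exit(run_and_forward(sys.argv[1:]))
```

With a wrapper like this as the parent, a SIGTERM delivered to the wrapper would reach the handler registered in the training script. A signal that arrives before the handlers are installed is not relayed, so this is only a sketch of the idea.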
Additional context:
The script below, used as a training script, shows the problem when run with a small value of max_run, or when the training job is stopped from the console.
```python
import signal
import sys
import time


def handler(signum, frame):
    # Report which signal arrived, then exit cleanly.
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)


# Register the same handler for every signal of interest.
for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Waiting for a signal...")
while True:
    time.sleep(1)
```