[bug] Signals don't get propagated to the training script #632

Open
2 tasks done
shiftan opened this issue Sep 27, 2020 · 3 comments
Labels
pending research Pending research...

Comments

@shiftan

shiftan commented Sep 27, 2020

Checklist

Concise Description:
Images (I've tested the PyTorch one) aren't built as suggested here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html.
Specifically, the entrypoint is set to a bash script instead of directly to the Python code, so the Python code doesn't run as PID 1 and signals aren't propagated to it.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3

Current behavior:
A SIGTERM handler registered with "signal.signal(sigName, handler)" in a training script is never called, e.g. when max_run is set to 60 and the job runs past that limit.
Also, running "ps -elf" from the training script via subprocess.run("ps -elf", shell=True) shows the following:
4 S root 1 0 0 80 0 - 4941 - 07:57 ? 00:00:00 bash -m start_with_right_hostname.sh train
4 S root 15 1 2 80 0 - 56741 - 07:57 ? 00:00:00 /opt/conda/bin/python /opt/conda/bin/train
4 S root 26 15 0 80 0 - 7630 - 07:57 ? 00:00:00 /opt/conda/bin/python shell_launcher.py --SSM_CMD_LINE ps -elf
0 S root 27 26 0 80 0 - 1641 - 07:57 ? 00:00:00 /bin/sh -c ps -elf
0 R root 28 27 0 80 0 - 9041 - 07:57 ? 00:00:00 ps -elf
As you can see, the Python script isn't PID 1.

Expected behavior:
Signals get propagated and the Python script runs as PID 1 (unless signals can be propagated some other way).
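One way signals could be propagated without the script being PID 1 is for the process in front of it to relay them. The following is a minimal sketch of that idea in Python. It is not what the DLC images or the SageMaker toolkit actually do, and `run_and_forward` is a hypothetical helper name used only for illustration.

```python
import signal
import subprocess

# Hypothetical helper (not part of any SageMaker or DLC API): start a child
# process and relay selected signals to it, so a handler registered in the
# child still fires when the parent is the one that receives the signal.
def run_and_forward(cmd, signals=(signal.SIGTERM, signal.SIGINT, signal.SIGHUP)):
    child = None

    def forward(signum, frame):
        # Relay only while the child exists and has not exited yet.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the relays before starting the child so no signal is missed.
    previous = {sig: signal.signal(sig, forward) for sig in signals}
    try:
        child = subprocess.Popen(cmd)
        return child.wait()  # the child's exit code
    finally:
        # Restore whatever handlers were in place before.
        for sig, old_handler in previous.items():
            signal.signal(sig, old_handler)
```

Called as `run_and_forward(["python", "sig_test.py"])` from the process that receives SIGTERM, this would hand the signal on to the script. It must be called from the main thread, because Python only allows signal handlers to be installed there.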

Additional context:
The training script below shows the problem when run with a small value of max_run, or when the training job is stopped from the console.

`
import signal
import sys
import time

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Waiting for a signal...")
while True:
    time.sleep(1)
`

@saimidu saimidu added the bug Something isn't working label Sep 28, 2020
@SergTogul
Contributor

Good evening.
Thank you for your question. Here is my response:

  1. Are you using a BYO container or a pre-built PyTorch container?
  2. The script you noted is used in the PyTorch training container to set the hostname to algo-1, algo-2, ... instead of the AWS default, so that NCCL and MPI know when there are multiple hosts. It changes the behavior of socket.gethostname (https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L80). In the single-host case, the hostname will be algo-1.

@SergTogul SergTogul removed the bug Something isn't working label Oct 20, 2020
@shiftan
Author

shiftan commented Oct 20, 2020

I'm using a pre-built PyTorch image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
As I understand it, registering signal handlers is the way to learn that the instance is about to be stopped, e.g. to save state beforehand.
@SergTogul Are you saying this isn't supported with the pre-built images? Is there any other way?

I've just tested it again with the code below and max_run=60, and got this output:


Invoking script with the following command:

/opt/conda/bin/python sig_test.py


Listing processes
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  80   0 -  4941 -      07:09 ?        00:00:00 bash -m start_with_right_hostname.sh train
4 S root        14     1  1  80   0 - 56767 -      07:09 ?        00:00:00 /opt/conda/bin/python /opt/conda/bin/train
0 S root        25    14  0  80   0 -  6869 -      07:10 ?        00:00:00 /opt/conda/bin/python sig_test.py
0 S root        26    25  0  80   0 -  1641 -      07:10 ?        00:00:00 /bin/sh -c ps -elf
0 R root        27    26  0  80   0 -  9041 -      07:10 ?        00:00:00 ps -elf
Waiting for a signal...

2020-10-20 07:13:55 Stopping - Stopping the training job
2020-10-20 07:16:16 Uploading - Uploading generated training model
2020-10-20 07:16:16 MaxRuntimeExceeded - Training job runtime exceeded MaxRuntimeInSeconds provided
Training seconds: 440
Billable seconds: 440

The python code:

import signal
import sys
import time
import subprocess

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Listing processes")
subprocess.run("ps -elf", shell=True)
print("Waiting for a signal...")
while True:
    time.sleep(1)
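For context on the "save the state before" use case, here is a rough sketch of what a training loop might do once SIGTERM does reach it: set a flag in the handler, finish the current step, then write a checkpoint. The `train` function, its `on_step` hook, and the JSON checkpoint format are all assumptions made for illustration, not anything from the SageMaker toolkit. It only helps if the signal is actually delivered to the script, which is the problem reported here.

```python
import json
import os
import signal

# Illustrative only: `train`, `on_step` and the JSON checkpoint layout are
# hypothetical names and choices, not part of any SageMaker API.
def train(total_steps, checkpoint_path, on_step=None):
    stop = {"requested": False}

    def request_stop(signum, frame):
        # Keep the handler tiny: just record that a stop was requested.
        stop["requested"] = True

    previous = signal.signal(signal.SIGTERM, request_stop)
    state = {"step": 0}
    try:
        for step in range(1, total_steps + 1):
            if on_step is not None:
                on_step(step)  # stand-in for one unit of real training work
            state["step"] = step
            if stop["requested"]:
                break  # leave the loop at a step boundary
    finally:
        signal.signal(signal.SIGTERM, previous)

    # Write to a temporary file, then rename, so a kill during the write
    # cannot leave a truncated checkpoint behind.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)
    return state
```

Setting a flag rather than calling sys.exit(0) inside the handler lets the loop stop at a step boundary, so the saved state is consistent.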

@arjkesh arjkesh added bug Something isn't working pending research Pending research... and removed bug Something isn't working labels Jul 13, 2021
@theo-rogers

Is there any update on this? I am running into the same issue.
