stop responding in some condition #744

Closed

henry42 opened this issue Sep 27, 2014 · 36 comments

Comments

@henry42

henry42 commented Sep 27, 2014

I use pm2 to manage my node server; pm2 runs as root but the node server runs as nobody. I've found that sometimes the server stops responding for no apparent reason, yet pm2 list shows it as online and everything looks fine. Even if I delete or restart the process, every request keeps pending indefinitely (all requests except the static files).

The only way to solve this is to kill pm2 and restart it.

It doesn't happen often, but it does happen, and I'm trying to reproduce it. Has anyone run into the same situation and can offer a suggestion?

pm2 v0.10.7
node v0.10.31

@henry42
Author

henry42 commented Sep 27, 2014

Maybe it's just because of cluster mode in node 0.10?

@Unitech
Owner

Unitech commented Sep 29, 2014

Does this happen in fork mode?

@atomantic

I think I'm having the same issue with PM2, but I'm running everything as the same non-root user.

Here's my process.json:

{
  "apps" : [{
    "env"                : {
        "APP_ENV"        : "QA",
        "SERVICE_ENV"    : "STAGE",
        "API_VERSION"    : "v2"
    },
    "error_file"         : "/tmp/api-err.log",
    "exec_interpreter"   : "node",
    "exec_mode"          : "custer_mode",
    "instances"          : "max",
    "log_date_format"    : "YYYY-MM-DD HH:mm Z",
    "max_memory_restart" : "160",
    "merge_logs"         : "true",
    "name"               : "did-shim",
    "out_file"           : "/tmp/child-out.log",
    "pid_file"           : "/tmp/child.pid",
    "script"             : "app.js"
  }]
}
pm2 start process.json
┌──────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 18821 │ online │         0 │ 92s    │ 69.250 MB   │ unactivated │
│ did-shim │ 1  │ cluster │ 18823 │ online │         0 │ 92s    │ 66.078 MB   │ unactivated │
└──────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘
pm2 gracefulReload all;

[PM2] [WARNING] did-shim app can't be reloaded - restarting it
[PM2] Process did-shim successfully restarted
[PM2] Process did-shim successfully restarted
[PM2] [WARNING] did-shim app can't be reloaded - restarting it
[PM2] Process did-shim successfully restarted
[PM2] Process did-shim successfully restarted
┌──────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 18790 │ online │         2 │ 0s     │ 26.816 MB   │ unactivated │
│ did-shim │ 1  │ cluster │ 18796 │ online │         2 │ 0s     │ 20.941 MB   │ unactivated │
└──────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘

Note: I don't expect this app to have any open connections, so I have this in my app:

process.on('message', function(msg) {
  if (msg == 'shutdown') {
    // immediately exit
    process.exit(0);
  }
});

So, there are two problems:

  1. I'm not sure why reload isn't working.
  2. After the restart, my app is unresponsive: any request to the API times out (long polling) with no error or access logs.

The only way to fix the issue is

pm2 kill; pm2 start process.json

I get this same behavior if I try pm2 restart process.json, pm2 reload process.json, or pm2 gracefulReload all

node -v
v0.10.29
pm2 -v
0.10.8

@williansabiao

I had the same problem. I downgraded pm2 to version 0.10.4 and it works.

My problem happens when I start a second process: when I do, the first process stops working.

I ran it in debug mode and nothing different showed up.

@raitucarp

node 0.10.32 with fork mode has the same issue here:
when the app hits an error, it won't restart automatically. What is the solution?

@atomantic

I don't have a resolution yet. It looks like cluster_mode reload (and gracefulReload) are non-functional. I've only been able to do pm2 kill; pm2 start process.json, which means no zero-downtime deployment :( -- watch experiences the same hang in cluster_mode, so that doesn't work either.

@Unitech
Owner

Unitech commented Oct 6, 2014

Please give this branch a try:

$ npm install git://github.com/Unitech/pm2#hotfix -g

It includes fixes for the reload function. Let us know if it resolves these problems.

@atomantic

npm install git://github.com/Unitech/pm2#hotfix -g;
pm2 kill; pm2 start process.json

I checked the site to make sure it's running (working fine).

pm2 gracefulReload all
[PM2] [WARNING] did-shim app can't be reloaded - restarting it
[PM2] restartProcessId process id 0
┌──────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 29130 │ online │         1 │ 0s     │ 17.465 MB   │ unactivated │
└──────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘

Now any request to the API hangs or times out.

The only resolution is pm2 kill; pm2 start process.json.

node -v
v0.10.30

pm2 -v
0.11.0-beta2

@atomantic

I just realized something... in order to gracefulReload, do I need to set instances to (number of CPUs) - 1 to allow a new instance to spin up, or will PM2 allow more instances than CPUs for the purpose of gracefulReload?

@atomantic

OK, I fixed my custer_mode -> cluster_mode config snafu, but now, even though reload succeeds, the app still hangs.

@Unitech
Owner

Unitech commented Oct 7, 2014

Can you use 0.11.14?
On Oct 7, 2014 2:37 AM, "Adam Eivy" [email protected] wrote:

OK, I fixed my custer_mode -> cluster_mode config snafu, but now, even
though reload succeeds, I still get the app hang.


Reply to this email directly or view it on GitHub
#744 (comment).

@atomantic

I think I discovered the problem... I tailed my child-out log and noticed this:

[cloud-user@n7-z01-0a2a2412 ~/did_groups_api] tail -f /tmp/child-out.log& pm2 gracefulReload all
[1] 4340

2014-10-06 17:47 -07:00: Server running at: http://n7-z01-0a2a2412:80 version:  1.5.3
2014-10-06 17:47 -07:00: Hapi Server stopped
[PM2] Process did-shim succesfully reloaded
┌──────────┬────┬─────────┬──────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID  │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼──────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 4344 │ online │         1 │ 1s     │ 45.340 MB   │ unactivated │
└──────────┴────┴─────────┴──────┴────────┴───────────┴────────┴─────────────┴─────────────┘
 Use `pm2 desc[ribe] <id>` to get more details

in my hapi.js app, I had this:

process.on('message', function(msg) {
    if (msg == 'shutdown') {
        server.stop({}, function () {
            console.log('Hapi Server stopped');
            process.exit(0);
        });
    }
});

but it appears the shutdown message is being handled after the graceful reload -- so the hapi server ends up stopping the newly reloaded instance.

I removed my process.on event listener and now gracefulReload works :)

node v0.11.14
pm2 0.11.0-beta2

Note: this does not fix it for
node v0.10.32
pm2 0.11.0-beta2

or for
node v0.10.32
pm2 0.10.8

@jshkurti
Contributor

jshkurti commented Oct 7, 2014

I tried to replicate the issue with this sample code:

var Hapi = require('hapi');

var server = new Hapi.Server('localhost', 8000);

server.route({
  method: 'GET',
  path: '/',
  handler: function (request, reply) {
    reply('hello world');
  }
});

server.start();

process.on('message', function(msg) {
  if (msg == 'shutdown') {
    console.log('Hapi Server received shutdown event, waiting for close');

    var timer = setTimeout(function () {
      console.log('Hapi Server killed anyway after timeout');
      process.exit(1);
    }, 1000);

    server.stop({}, function () {
      clearTimeout(timer);
      console.log('Hapi Server successfully stopped');
      process.exit(0);
    });
  }
});

But it didn't happen.
gracefulReload worked just fine every time, and the server stayed up and responded to new requests.

I tried with Node 0.10.30/0.11.13/0.11.14.

Can you please try one last time with this pm2 version:
npm install git://github.com/Unitech/pm2#development -g

@atomantic

Now it's hanging....

I'm running this repo:
https://github.com/atomantic/hapi_playground

with:
node 0.10.32 (I'm hoping to get this working with node stable)
pm2 0.11.0-beta2

⇒  pm2 start process.json
Starting PM2 daemon...
Process launched
┌─────────────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬───────────┐
│ App name        │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │  watching │
├─────────────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 95682 │ online │         0 │ 0s     │ 24.465 MB   │ activated │
└─────────────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴───────────┘

then

⇒  curl localhost:3666
{"data":"Hello Hapi"}%

then when I try gracefulReload, pm2 just hangs:

⇒  pm2 gracefulReload all
^C% (after waiting 5 minutes)

After aborting the gracefulReload attempt:

⇒  pm2 -v
Starting PM2 daemon...

"Starting PM2 daemon" sounds like it wasn't running, but that was all the output the command produced. Running it again, I now get the version:

⇒  pm2 -v
Starting PM2 daemon...
0.11.0-beta2

If I try to run pm2 kill after aborting the gracefulReload, I get this:

⇒  pm2 kill
Starting PM2 daemon...
(and infinite hang)

@jshkurti
Contributor

jshkurti commented Oct 7, 2014

What does it print in ~/.pm2/pm2.log ?

@atomantic

OK, odd. I just switched back to pm2 stable (0.10.8) with node 0.10.32, and now gracefulReload is working with that app.

In another app, I'm tailing the logs and I see this with gracefulReload (multiple instances):

App name:did-shim id:7 online
_old_2 : id unknown
Trace: [Error: _old_2 : id unknown]
    at Object.God.logAndGenerateError (/usr/local/lib/node_modules/pm2/lib/God/Methods.js:27:15)
    at /usr/local/lib/node_modules/pm2/lib/God/ActionMethods.js:269:30
    at Object.God.stopProcessId (/usr/local/lib/node_modules/pm2/lib/God/ActionMethods.js:170:14)
    at Object.God.deleteProcessId (/usr/local/lib/node_modules/pm2/lib/God/ActionMethods.js:268:9)
    at null._onTimeout (/usr/local/lib/node_modules/pm2/lib/God/Reload.js:93:15)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
Worker NaN is already disconnected
Worker NaN is already disconnected
Starting execution sequence in -cluster mode- for app name:did-shim id:5

and then what follows is a constant, never-ending stream of this:

Starting execution sequence in -cluster mode- for app name:did-shim id:1
App name:did-shim id:1 online
App name:did-shim id:2 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:2
App name:did-shim id:2 online
App name:did-shim id:4 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:4
App name:did-shim id:4 online
App name:did-shim id:3 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:3
App name:did-shim id:6 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:6
App name:did-shim id:3 online
App name:did-shim id:6 online
App name:did-shim id:0 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
App name:did-shim id:1 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:1
App name:did-shim id:1 online
App name:did-shim id:2 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:2
App name:did-shim id:2 online
App name:did-shim id:4 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:4
App name:did-shim id:4 online
App name:did-shim id:3 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:3
App name:did-shim id:6 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:6
App name:did-shim id:3 online
App name:did-shim id:6 online
App name:did-shim id:0 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
App name:did-shim id:1 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:1
App name:did-shim id:1 online
App name:did-shim id:2 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:2
App name:did-shim id:2 online
App name:did-shim id:4 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:4
App name:did-shim id:4 online
App name:did-shim id:3 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:3
App name:did-shim id:6 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:6
App name:did-shim id:3 online
App name:did-shim id:6 online
App name:did-shim id:0 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
App name:did-shim id:1 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:1
App name:did-shim id:1 online
App name:did-shim id:2 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:2
App name:did-shim id:2 online
App name:did-shim id:4 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:4
App name:did-shim id:4 online
App name:did-shim id:3 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:3
App name:did-shim id:6 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:6
App name:did-shim id:3 online
App name:did-shim id:6 online
App name:did-shim id:0 exited
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
App name:did-shim id:1 exited

It appears it's caught in a reload loop.

@atomantic

I just switched my hapi_playground repo to use max instances and I'm not seeing a problem with gracefulReload. I'm currently investigating the delta between that app and the one I'm having problems with. I'll post an update when I have a repro on hapi_playground.

@jshkurti
Contributor

jshkurti commented Oct 7, 2014

Ok thanks, we're investigating this issue as well.

@atomantic

Argh. Heisenbug.
Now it's working fine in the app I was having trouble with. I haven't changed node or pm2 versions, just done several pm2 start process.json + pm2 gracefulReload all + pm2 kill attempts.

Once it even gave me that infinite loop of online + exited logs when I just did a pm2 start process.json... I'll pop onto some of my servers and see if I can grab a repro + logs from there.

@atomantic

So, it appears to be working fine on OSX with port 46100, but not on CentOS 6.5, running on port 80.

Note: I've enabled port 80 for node via:

sudo setcap 'cap_net_bind_service=+ep' `which node`

I'm tailing child-out.log, api-error.log and ~/.pm2/pm2.log and this is the flow:

⇒  pm2 start process.json 
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
Process launched
┌──────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 18074 │ online │         0 │ 0s     │ 16.887 MB   │ unactivated │
└──────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘
 Use `pm2 desc[ribe] <id>` to get more details
2014-10-07 10:11 -07:00: Server running at: http://n7-z01-0a2a2412:80 version:  1.5.5
⇒  pm2 gracefulReload all
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
2014-10-07 10:11 -07:00: Server running at: http://n7-z01-0a2a2412:80 version:  1.5.5
-reload- New worker listening
2014-10-07 10:11 -07:00: Hapi Server received shutdown event, waiting for close
2014-10-07 10:11 -07:00: Hapi Server killed anyway after timeout
-reload- Old worker disconnected
App name:did-shim id:_old_0 exited
Process with pid 18074 killed
[PM2] Process did-shim succesfully reloaded
┌──────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 18098 │ online │         1 │ 2s     │ 49.266 MB   │ unactivated │
└──────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘
 Use `pm2 desc[ribe] <id>` to get more details

The logs aren't reporting anything odd-looking, but the app isn't listening on port 80.
Oddly, any subsequent pm2 gracefulReload all attempts also appear to succeed, but they take longer.

⇒  pm2 gracefulReload all
Starting execution sequence in -cluster mode- for app name:did-shim id:0
App name:did-shim id:0 online
2014-10-07 10:17 -07:00: Server running at: http://n7-z01-0a2a2412:80 version:  1.5.5
-reload- New worker listening

hangs here for about 45 seconds, then:

App name:did-shim id:_old_0 exited
Process with pid 18140 killed
[PM2] Process did-shim succesfully reloaded
┌──────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├──────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ did-shim │ 0  │ cluster │ 18164 │ online │         4 │ 11s    │ 59.000 MB   │ unactivated │
└──────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘
 Use `pm2 desc[ribe] <id>` to get more details

Note though that the following is missing from future reload attempts:

2014-10-07 10:11 -07:00: Hapi Server received shutdown event, waiting for close
2014-10-07 10:11 -07:00: Hapi Server killed anyway after timeout
-reload- Old worker disconnected

@atomantic

It's not port 80 that's the problem. I switched my remote CentOS node to use :46100 and got the same problem.

@atomantic

OK. I think I found the problem...

I tried deploying my hapi_playground app to my CentOS node and changed the log file paths to /tmp/... (as they are in my other app).
I also enabled "watch": true (I'm experiencing the same problem with watch as with gracefulReload).

{
  "apps" : [{
    "error_file"         : "/tmp/api-err.log",
    "exec_interpreter"   : "node",
    "exec_mode"          : "cluster_mode",
    "instances"          : "max",
    "log_date_format"    : "YYYY-MM-DD HH:mm Z",
    "max_memory_restart" : "160",
    "merge_logs"         : "true",
    "name"               : "hapi_playground",
    "out_file"           : "/tmp/child-out.log",
    "pid_file"           : "/tmp/child.pid",
    "script"             : "index.js",
    "watch"              : true
  }]
}

If I launch it with the log files in /tmp and edit a file, I get the broken app, not listening on the port.
If I change the log paths back to the app directory, it goes into an infinite reload, so I updated the process.json again to put the log files in ./log and ignore them via ignoreWatch:

{
  "apps" : [{
    "error_file"         : "./log/api-err.log",
    "exec_interpreter"   : "node",
    "exec_mode"          : "custer_mode",
    "instances"          : "1",
    "log_date_format"    : "YYYY-MM-DD HH:mm Z",
    "max_memory_restart" : "160",
    "merge_logs"         : "true",
    "name"               : "hapi_playground",
    "out_file"           : "./log/child-out.log",
    "pid_file"           : "./log/child.pid",
    "script"             : "index.js",
    "ignoreWatch"        : ["[\\/\\\\]\\./", "log"],
    "watch"              : true
  }]
}

And now it works!

By the way, I'm not sure what "[/\]./" matches. I just saw it as an example in the docs. What does that translate to?

@atomantic

Dang. I just realized I had copied a process.json config lacking "cluster_mode" again. But at least it works in "custer_mode" :(

Still an issue....

The funny thing is that even with the misspelling, it shows up in pm2 list with mode as cluster.

@atomantic

Trying to repro it again today, watch: true + editing index.js safely reloaded the app once, but follow-up file changes cause this:

pm2 list
┌─────────────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬───────────┐
│ App name        │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │  watching │
├─────────────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 17635 │ online │         6 │ 10h    │ 26.496 MB   │ activated │
└─────────────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴───────────┘
 Use `pm2 desc[ribe] <id>` to get more details
[cloud-user@n7-z01-0a2a2412 ~/hapi_playground] vim index.js
[cloud-user@n7-z01-0a2a2412 ~/hapi_playground] pm2 list
Starting PM2 daemon...
┌──────────┬────┬──────┬─────┬────────┬───────────┬────────┬────────┬──────────┐
│ App name │ id │ mode │ PID │ status │ restarted │ uptime │ memory │ watching │
└──────────┴────┴──────┴─────┴────────┴───────────┴────────┴────────┴──────────┘

The process just vanished....

This is with the current stable pm2 version. I'm going to try the development version again (though I was having trouble with that one last night too).

@atomantic

Well, with pm2 -v 0.11.0, it doesn't even work once:

⇒  npm install git://github.com/Unitech/pm2#development -g
⇒  pm2 -v
0.11.0
⇒  pm2 start process.json
Starting PM2 daemon...
Process launched
┌─────────────────┬────┬─────────┬──────┬────────┬───────────┬────────┬─────────────┬───────────┐
│ App name        │ id │ mode    │ PID  │ status │ restarted │ uptime │      memory │  watching │
├─────────────────┼────┼─────────┼──────┼────────┼───────────┼────────┼─────────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 2499 │ online │         0 │ 0s     │ 24.285 MB   │ activated │
└─────────────────┴────┴─────────┴──────┴────────┴───────────┴────────┴─────────────┴───────────┘
 Use `pm2 desc[ribe] <id>` to get more details

Now it's running, and curl is good:

⇒  curl -m 5 localhost:46000
{"data":"Hello Hapi 8"}

Edit the app:

⇒ sed -i 's/Hapi 7/Hapi 8/' index.js
⇒ pm2 list
┌─────────────────┬────┬─────────┬──────┬────────┬───────────┬────────┬─────────────┬───────────┐
│ App name        │ id │ mode    │ PID  │ status │ restarted │ uptime │      memory │  watching │
├─────────────────┼────┼─────────┼──────┼────────┼───────────┼────────┼─────────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 2533 │ online │         1 │ 2s     │ 28.078 MB   │ activated │
└─────────────────┴────┴─────────┴──────┴────────┴───────────┴────────┴─────────────┴───────────┘
 Use `pm2 desc[ribe] <id>` to get more details
⇒ curl -m 5 localhost:46000
curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received

~/.pm2/pm2.log seems healthy:

Process with pid 2670 killed
Starting execution sequence in -cluster mode- for app name:hapi_playground id:0
App name:hapi_playground id:0 online
App name:hapi_playground id:0 exited
Process with pid 2733 killed
Starting execution sequence in -cluster mode- for app name:hapi_playground id:0
App name:hapi_playground id:0 online

But the app is dead.

@atomantic

I got this to work briefly with these versions and a minimal process.json file:

⇒ pm2 -v
0.11.0
⇒ node -v
v0.11.14
⇒ pm2 start process.json
Starting PM2 daemon...
Process launched
┌─────────────────┬────┬─────────┬──────┬────────┬───────────┬────────┬─────────────┬───────────┐
│ App name        │ id │ mode    │ PID  │ status │ restarted │ uptime │      memory │  watching │
├─────────────────┼────┼─────────┼──────┼────────┼───────────┼────────┼─────────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 3321 │ online │         0 │ 0s     │ 24.246 MB   │ activated │
└─────────────────┴────┴─────────┴──────┴────────┴───────────┴────────┴─────────────┴───────────┘
 Use `pm2 desc[ribe] <id>` to get more details
⇒ curl -m 5 localhost:46000
{"data":"Hello Hapi 12"}
⇒ sed -i 's/Hapi 12/Hapi 13/' index.js
⇒ curl -m 5 localhost:46000
{"data":"Hello Hapi 13"}
⇒ curl -m 5 localhost:46000
{"data":"Hello Hapi 13"}
⇒ curl -m 5 localhost:46000
{"data":"Hello Hapi 13"}
⇒ sed -i 's/Hapi 13/Hapi 14/' index.js
⇒ curl -m 5 localhost:46000
{"data":"Hello Hapi 14"}

But then I brought my process.json file back to full and pm2 0.11.0 started going nuts:

⇒  pm2 start process.json
Starting PM2 daemon...
Process launched
┌─────────────────┬────┬─────────┬──────┬────────┬───────────┬────────┬─────────────┬───────────┐
│ App name        │ id │ mode    │ PID  │ status │ restarted │ uptime │      memory │  watching │
├─────────────────┼────┼─────────┼──────┼────────┼───────────┼────────┼─────────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 3611 │ online │         0 │ 0s     │ 26.387 MB   │ activated │
└─────────────────┴────┴─────────┴──────┴────────┴───────────┴────────┴─────────────┴───────────┘
 Use `pm2 desc[ribe] <id>` to get more details

⇒  curl -m 5 localhost:46000
{"data":"Hello Hapi"}

⇒  sed -i 's/Hapi/Hapi 1/' index.js
⇒  curl -m 5 localhost:46000
curl: (7) couldn't connect to host

⇒  pm2 list
┌─────────────────┬────┬─────────┬──────┬─────────┬───────────┬────────┬────────┬───────────┐
│ App name        │ id │ mode    │ PID  │ status  │ restarted │ uptime │ memory │  watching │
├─────────────────┼────┼─────────┼──────┼─────────┼───────────┼────────┼────────┼───────────┤
│ hapi_playground │ 0  │ cluster │ 3869 │ errored │        17 │ 0      │ 0 B    │ activated │
└─────────────────┴────┴─────────┴──────┴─────────┴───────────┴────────┴────────┴───────────┘
 Use `pm2 desc[ribe] <id>` to get more details
⇒  tail -f ~/.pm2/pm2.log
Starting execution sequence in -cluster mode- for app name:hapi_playground id:0
App name:hapi_playground id:0 online
App name:hapi_playground id:0 exited
Starting execution sequence in -cluster mode- for app name:hapi_playground id:0
App name:hapi_playground id:0 online
App name:hapi_playground id:0 exited
Starting execution sequence in -cluster mode- for app name:hapi_playground id:0
App name:hapi_playground id:0 online
App name:hapi_playground id:0 exited
Script /home/cloud-user/hapi_playground/index.js had too many unstable restarts (15). Stopped. "errored"
:q
^C
⇒  pm2 kill; pm2 start process.json
Starting PM2 daemon...
Starting PM2 daemon...
⇒  pm2 list
Starting PM2 daemon...
^C
⇒  pm2 kill
Starting PM2 daemon...
⇒  ps -Af | grep pm2
501       1969     1  0 09:40 ?        00:00:00 pm2: hapi_playground
501       2112     1  0 09:43 ?        00:00:00 pm2: hapi_playground
501       4100  1780  0 10:14 pts/1    00:00:00 grep pm2
501      16418     1  0 Oct08 ?        00:00:00 pm2: hapi_playground
⇒  kill -9 1969
⇒  kill -9 2112
⇒  kill -9 16418
⇒  pm2 kill
Starting PM2 daemon...

and there it hangs :(

~/.pm2/pm2.log

BUS system [READY] on port /home/cloud-user/.pm2/pub.sock
Error: ENOENT, unlink '/home/cloud-user/.pm2/pub.sock'
    at Error (native)
[[[[ PM2/God daemon launched ]]]]
RPC interface [READY] on port /home/cloud-user/.pm2/rpc.sock
[[[[ PM2/God daemon launched ]]]]
BUS system [READY] on port /home/cloud-user/.pm2/pub.sock
RPC interface [READY] on port /home/cloud-user/.pm2/rpc.sock
Error: ENOENT, unlink '/home/cloud-user/.pm2/rpc.sock'
    at Error (native)

This is my full process.json

{
  "apps" : [{
    "error_file"         : "./log/api-err.log",
    "exec_interpreter"   : "node",
    "node_args"        : "--harmony",
    "exec_mode"          : "cluster_mode",
    "instances"          : "max",
    "log_date_format"    : "YYYY-MM-DD HH:mm Z",
    "max_memory_restart" : "160",
    "merge_logs"         : true,
    "name"               : "hapi_playground",
    "out_file"           : "./log/child-out.log",
    "pid_file"           : "./log/child.pid",
    "script"             : "index.js",
    "ignoreWatch"        : ["[\\/\\\\]\\./", "log"],
    "watch"              : true
  }]
}

If I kill the pm2 daemon proc and then manually rm ~/.pm2/pub.sock, I can get pm2 to start again, but then it freezes up again.

Going back to stable pm2...

@jshkurti
Contributor

jshkurti commented Oct 9, 2014

Thanks for all the feedback, man; we will do our best to resolve this :)

@raitucarp

Thanks @jshkurti. For me, pm2 is the best deployment process manager, and I use it in production. I hope the contributors can fix this issue. Thanks, all.

@jshkurti
Contributor

@atomantic

We just published pm2 v0.11.0, which should resolve this issue.
Can you guys give it a try? :)

npm install git://github.com/Unitech/pm2#master -g

@atomantic

I'm working from this project: https://github.com/atomantic/hapi_playground#testing-pm2-issue-744

on CentOS 6.5.

The first time I tried it, it worked. Then I did pm2 kill, removed my hapi_playground app, re-downloaded and ran it... and then it started failing... I seem to be getting intermittent results: pm2 gets into a bad state, and once it's failing, all I can do is pm2 kill and try again (which sometimes works and sometimes doesn't).

Here are my full repro steps:

# setup
npm install git://github.com/Unitech/pm2#master -g
node -v # should be 0.10.32
pm2 -v # should be 0.11.0
tail -f ~/.pm2/pm2.log&

# app
git clone git://github.com/atomantic/hapi_playground.git
cd hapi_playground
npm install
pm2 start process.json

# test
curl -m 5 http://localhost:46100
sed -i 's/Hello Hapi/Hello Hapi 1/' index.js
curl -m 5 http://localhost:46100
sed -i 's/Hello Hapi 1/Hello Hapi 2/' index.js
curl -m 5 http://localhost:46100

So far, all good...

But now, let's wipe it and try again...

pm2 kill;
cd ../;
rm -rf hapi_playground;
git clone git://github.com/atomantic/hapi_playground.git
cd hapi_playground
npm install
pm2 start process.json
curl -m 5 http://localhost:46100
sed -i 's/Hello Hapi/Hello Hapi 1/' index.js
curl -m 5 http://localhost:46100

This time the curl fails:

curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received

If it doesn't fail, do pm2 kill; pm2 start process.json and try some more. pm2 gets into a bad state sporadically.

@atomantic

I just pushed an update to hapi_playground with it running on :46100 instead of 80 to make it a simpler repro. I still get the same behavior on 46100.

@Unitech
Owner

Unitech commented Oct 11, 2014

We managed to reproduce the bug and confirm that there is a regression between 0.10.8 and 0.11.0. It looks like a port is not being closed on stop in cluster mode.

We are going to develop a test suite that verifies connectivity, to avoid this kind of regression.
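
As a rough sketch of such a connectivity check (illustrative only, based on the repro steps above and assuming the hapi_playground app on port 46100):

# hypothetical connectivity check: start, hit the port, reload, hit the port again
pm2 start process.json
sleep 1
curl -sf -m 5 http://localhost:46100 || echo "FAIL: app not responding after start"
pm2 gracefulReload all
sleep 1
curl -sf -m 5 http://localhost:46100 || echo "FAIL: app not responding after reload"
pm2 kill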

Thanks once more for your feedback; we'll get back to you once it's fixed.

@Unitech
Owner

Unitech commented Oct 11, 2014

OK, so after deep investigation with PM2 0.10.8 / 0.11.0 and Node 0.10.32 / 0.11.14 on CentOS 7 / Ubuntu, making curl requests and running lsof to see how connections behave, we conclude that this doesn't come from PM2 but from Node.js 0.10.x.

With Node.js 0.11.14 everything runs fine.

As stated in the Node.js documentation, the cluster module is marked experimental in 0.10.x and unstable in 0.11.x (but it works very well).

I made this test file from your bash commands: https://gist.github.com/Unitech/32ebb78f4ce079bcab97
Here is an asciinema recording (it's the weekend, so I added a bit of fun :p): https://asciinema.org/a/12903

Here are some results of our analysis: https://gist.github.com/Unitech/35c5e7a21e363099f3c6
As you can see, ports are still left open with Node 0.10 once a stop is done:

[root@pm2-test-centos examples]# pm2 stop all
[PM2] Stopping all
[PM2] stopProcessId process id 0
┌──────────┬────┬─────────┬──────┬─────────┬───────────┬────────┬────────┬─────────────┐
│ App name │ id │ mode    │ PID  │ status  │ restarted │ uptime │ memory │    watching │
├──────────┼────┼─────────┼──────┼─────────┼───────────┼────────┼────────┼─────────────┤
│ child    │ 0  │ cluster │ 9874 │ stopped │         0 │ 0      │ 0 B    │ unactivated │
└──────────┴────┴─────────┴──────┴─────────┴───────────┴────────┴────────┴─────────────┘
 Use `pm2 desc[ribe] <id>` to get more details

[root@pm2-test-centos examples]# lsof -i | grep "pm2"
pm2:    9872 root   13u  IPv4 419571940      0t0  TCP *:irdmi (LISTEN)

But with Node 0.11.14 everything is fine:

[root@pm2-test-centos examples]# pm2 stop all
[PM2] Stopping all
[PM2] stopProcessId process id 0
┌──────────┬────┬─────────┬───────┬─────────┬───────────┬────────┬────────┬─────────────┐
│ App name │ id │ mode    │ PID   │ status  │ restarted │ uptime │ memory │    watching │
├──────────┼────┼─────────┼───────┼─────────┼───────────┼────────┼────────┼─────────────┤
│ child    │ 0  │ cluster │ 10189 │ stopped │         0 │ 0      │ 0 B    │ unactivated │
└──────────┴────┴─────────┴───────┴─────────┴───────────┴────────┴────────┴─────────────┘
 Use `pm2 desc[ribe] <id>` to get more details

[root@pm2-test-centos examples]# lsof -i | grep "pm2"
[root@pm2-test-centos examples]# 

So yes, this is fixed! Use 0.11.14 :) or use fork mode if you are stuck on 0.10.32.

PS: you should ignore the node_modules folder while watching; it consumes too much CPU.
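
For example, something along these lines in process.json (a sketch only; keep your other settings and adjust the ignoreWatch entries to your layout):

{
  "apps" : [{
    "name"        : "hapi_playground",
    "script"      : "index.js",
    "exec_mode"   : "cluster_mode",
    "watch"       : true,
    "ignoreWatch" : ["[\\/\\\\]\\./", "log", "node_modules"]
  }]
}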

Have a good week-end!

@atomantic

Thanks for the follow-up. So, in fork mode, I can't do zero-downtime deployments, right?

How will watch behave in fork_mode? Will it still bring a new proc up before killing the rest?

And thanks for the tip on ignoring node_modules with ignoreWatch. Can you translate this ignoreWatch pattern from the docs for me: "[/\]./" -- is that ignoring all subfolders of the current directory?

Thanks!

@soyuka
Collaborator

soyuka commented Oct 11, 2014

It skips dotfiles only.
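
A quick illustration of what that pattern does (a sketch only; the file paths below are made up):

// The docs appear to show the regex literal /[\/\\]\./ -- a path separator
// followed by a dot -- which matches dotfiles and dot-directories anywhere in a path.
var dotPattern = /[\/\\]\./;

console.log(dotPattern.test('/home/app/.git/HEAD'));  // true  -> ignored (dot-directory)
console.log(dotPattern.test('/home/app/.env'));       // true  -> ignored (dotfile)
console.log(dotPattern.test('/home/app/index.js'));   // false -> watched
console.log(dotPattern.test('log/child-out.log'));    // false -> watched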

Watch should behave the same way with or without cluster mode.

With fork_mode there are indeed no zero-downtime deployments (or hot reload).

@hefangshi

Is node v0.11 stable enough to be used in production? And why did pm2 0.8.x work fine with node v0.10?

I found that before pm2 0.10.0 everything was fine, and with pm2 0.10.x the reload was broken.

Also, in 0.10.x the App name was wrong:

┌───────────────────────────────────┬────┬─────────┬───────┬────────┬───────────┬────────┬─────────────┬─────────────┐
│ App name                          │ id │ mode    │ PID   │ status │ restarted │ uptime │      memory │    watching │
├───────────────────────────────────┼────┼─────────┼───────┼────────┼───────────┼────────┼─────────────┼─────────────┤
│ function bold() { [native code] } │ 0  │ cluster │ 14444 │ online │         0 │ 2m     │ 31.449 MB   │ unactivated │
│ function bold() { [native code] } │ 1  │ cluster │ 14446 │ online │         0 │ 2m     │ 31.453 MB   │ unactivated │
│ function bold() { [native code] } │ 2  │ cluster │ 14452 │ online │         0 │ 2m     │ 31.473 MB   │ unactivated │
│ function bold() { [native code] } │ 3  │ cluster │ 14458 │ online │         0 │ 2m     │ 31.449 MB   │ unactivated │
└───────────────────────────────────┴────┴─────────┴───────┴────────┴───────────┴────────┴─────────────┴─────────────┘
