Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't use string ("eof") as a SCALAR ref while "strict refs" in use at .../Gearman/ResponseParser.pm line 175. #50

Open
esabol opened this issue Jul 14, 2024 · 7 comments

Comments

@esabol
Copy link
Contributor

esabol commented Jul 14, 2024

Hi, Alexei!

Job server failure: Can't use string ("eof") as a SCALAR ref while "strict refs" in use at .../Gearman/ResponseParser.pm line 175.

I saw this error twice recently when I was testing adding a large number of tasks to a task set, each with "on_complete" callbacks. The jobs when processed by workers also "set_status" to update a completion percentage. It happens very rarely, and I think I've only ever seen this error with jobs that "set_status".

Gearman::Util::_read_sock can return the string "eof", of course, but I don't see how $err in Gearman::ResponseParser::parse_sock() can be anything but a reference to a SCALAR? I'm not seeing the bug! Do you?

But just to prevent crashes, I've changed Gearman/ResponseParser.pm line 175 to this:

        $self->on_error("read_error: " . (ref($err) eq 'SCALAR' ? ${$err} : $err));
@esabol
Copy link
Contributor Author

esabol commented Jul 14, 2024

Hmmm... After making the above change, I've eliminated the crashes, but I think $tasks->wait hangs indefinitely now when this error is encountered. 😞

@p-alik
Copy link
Owner

p-alik commented Jul 15, 2024

Hi Ed, it's a while I've read Perl code. I suppose the problem occurs if Gearman::Util::_read_sock ends with return (0, "eof") unless $rv; because sysread returns a valid length of 0 and subsequently Gearman::Util::read_res_packet ends in

elsif ($ok == 0) {
            return $err->($err_code);
}

Do I see that correctly that the condition of elsif should be enhanced $err_code ne "eof" by?

elsif ($ok == 0 && $err_code ne "eof") {
            return $err->($err_code);
}

@esabol
Copy link
Contributor Author

esabol commented Jul 15, 2024

Thanks for your reply, Alexei. I realize you've moved on from Perl programming, so I appreciate you taking the time.

Let's say we change

elsif ($ok == 0) {
            return $err->($err_code);
}

to

elsif ($ok == 0 && $err_code ne "eof") {
            return $err->($err_code);
}

as you suggested. What happens in the case of $ok == 0 && $err_code eq "eof"? It just keeps going?

Without testing, my thinking is we need to get out of the loop? Maybe with a "next" or "last"?

elsif ($ok == 0) {
            next if $err_code eq 'eof'; # or last?
            return $err->($err_code);
}

What do you think?

@esabol
Copy link
Contributor Author

esabol commented Jul 16, 2024

Getting back to Gearman/ResponseParser.pm line 175, I've done various tests, and I'm convinced it should just be:

        $self->on_error("read_error: $err");

$err will never be a reference to anything there.

Whether or not 'eof' should be considered an error condition and how 'eof' should be handled are separate issues.

I've tested all of the following prospective changes in Gearman/Util.pm:

elsif ($ok == 0 && $err_code ne "eof") {
            return $err->($err_code);
}

This definitely doesn't work. Every task hangs.

Next, I tried this:

elsif ($ok == 0) {
            next if $err_code eq "eof";
            return $err->($err_code);
}

This resulted in an infinite loop when "eof" is encountered.

This seems to work well, but I'm not sure it's actually an improvement:

elsif ($ok == 0) {
            last if $err_code eq "eof";
            return $err->($err_code);
}

I still get the occasional hang (about one in 500 tasks). I turned on DEBUG and added a bunch of additional DEBUG statements to Gearman::Util::read_res_packet() and _read_sock(). I'm not seeing any "eof" (or other) error codes when these hangs occur, so I'm not sure what's going on. I'm not sure where it's hanging? It doesn't seem to be in Gearman::Util::read_res_packet(). I think it's elsewhere. I'll keep digging. It's probably due to a gearmand bug, I'm guessing.

@p-alik
Copy link
Owner

p-alik commented Jul 16, 2024

Hi Ed,

What happens in the case of $ok == 0 && $err_code eq "eof"? It just keeps going?
Without testing, my thinking is we need to get out of the loop? Maybe with a "next" or "last"?
I'm not quite sure about loop interruption. I guess in "eof" case Gearman::Util::read_res_packet should proceed til Gearman/Util.pm#L196-L208.

I'm not sure where it's hanging?
Something changed indeed. I've seen two different error patterns:

  • for Perl < 5.38
unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]
unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]
  • for Perl 5.38
t/15-priority.t ................. ok
  ERROR 2024-07-16 11:58:51.000000 [     3 ] realloc(char, count: 1844674407370[9](https://github.com/p-alik/perl-Gearman/actions/runs/9956350145/job/27506063338#step:11:10)551615 size: 1) -> libgearman-server/io.cc:835

But independently from the Perl version the tests are hanging with gearman-job-server amd64 1.1.19.1

@esabol
Copy link
Contributor Author

esabol commented Jul 17, 2024

I've seen two different error patterns:

Do both of these happen all the time? Or is it just a rare random failure?

I'm just seeing the rare random hang. Since yesterday, my testing would seem to indicate that this is due to a Perl worker getting pegged near 100% CPU. So it's not the client where the hang is happening, apparently. The client is just waiting for the job result as it should.

But independently from the Perl version the tests are hanging with gearman-job-server amd64 1.1.19.1

Well, 1.1.21 is the latest version of gearmand. There have been a lot of commits since 1.1.19.1, but I'm not sure what might be the cause there.

@esabol
Copy link
Contributor Author

esabol commented Jul 17, 2024

Are you using localhost in your GitHub Actions CI? Some of those errors I'm seeing in the logs of your GitHub Actions CI might be the fault of the GitHub Actions runner, actually. This might explain why you're seeing new failures. GitHub Actions has been a thorn in my side ever since we had to stop using Travis CI. Things fail in GitHub Actions that don't fail on any real system, especially networking things. For example, the GitHub Actions runners don't support IPv6 when running inside a Docker container, but the /etc/hosts on most Linux distros assign ::1 the name localhost. This causes problems for any networking code that uses localhost. For gearmand, I have to hack /etc/hosts to remove localhost from the definition of::1 in order to get the gearmand CI to pass the tests. It was so annoying to figure that out! You might need to do something similar if you're running gearmand inside your GitHub Actions CI. Here's what I used:

           case "${GITHUB_ACTIONS_CONTAINER}" in
              ubuntu*)
                 apt-get -o update && DEBIAN_FRONTEND=noninteractive apt-get -o Acquire::Retries=3 -y install iproute2
                 echo "==="
                 echo "Before: /etc/hosts"
                 cat /etc/hosts
                 echo "==="
                 echo "Removing localhost name from ::1 entry in /etc/hosts..."
                 sed 's/^::1\s\s*localhost\s\(.*\)/::1 \1/' /etc/hosts > /tmp/hosts.temp
                 cp /tmp/hosts.temp /etc/hosts
                 rm /tmp/hosts.temp
                 echo "After: /etc/hosts"
                 cat /etc/hosts
                 echo "==="
                 ip addr
                 echo "==="
                 echo "'hostname -i' shows '$(hostname -i)'"
                 echo "'hostname -I' shows '$(hostname -I)'"
                 echo "'hostname -s' shows '$(hostname -s)'"
                 echo "'hostname -f' shows '$(hostname -f)'"
                 ;;
              *)
                 ;;
           esac

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants