bulk requests stuck on STARTED state #7744

Open
elenamplanas opened this issue Feb 5, 2025 · 4 comments

@elenamplanas

Running dCache version 9.2.25 with Enstore.

When a recall fails because of a problem reading the tape (not because of a missing file, a checksum error, etc.), the bulk request remains in STARTED status, with the file's target in RUNNING state and without any "rh" request on any pool.

Example:

[dccore12] (local) admin > \s bulk request ls 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
ID | ARRIVED | MODIFIED | OWNER | STATUS | UID
3182252 | 2025/02/04-10:40:14 | 2025/02/04-10:40:14 | 31101:1399 | STARTED | 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
[dccore12] (local) admin > \s bulk request info 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
2eda9e78-fbaa-4f5a-b285-787dc7f29bec:
status: STARTED
arrived at: 2025-02-04 10:40:14.781
started at: 2025-02-04 10:40:14.793
last modified at: 2025-02-04 10:40:14.793
target prefix: /
targets:
CREATED | STARTED | COMPLETED | STATE | TARGET
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | ? | RUNNING | /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.808 | FAILED | /pnfs/pic.es/data/cms/store/test/loadtest/source/T1_ES_PIC_Tape/urandom.270MB.file0000 -- (ERROR: diskCacheV111.util.CacheException : File not on tape.)

[dccore12] (local) admin > \sn pnfsidof /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
0000AFE37C9682A641AE99358012553B0CE8

$ echo "\s dc* rh ls"| ssh -p 22224 dccore.pic.es|grep 0000AFE37C9682A641AE99358012553B0CE8

In this example, the rh process does not appear after running bulk request reset. However, the first time we faced the problem, a new rh for the stuck file was launched after the reset.
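
The condition shown in the example (a target that stays RUNNING with "?" as its completion time) can be picked out of the bulk request info output automatically. The following is only a minimal sketch: it assumes the pipe-separated target table layout shown above, and the function name, the sample constant, and the age threshold are hypothetical, not part of dCache.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the target table above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Target rows copied from the example output in this issue.
SAMPLE_INFO = """\
CREATED | STARTED | COMPLETED | STATE | TARGET
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | ? | RUNNING | /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.808 | FAILED | /pnfs/pic.es/data/cms/store/test/loadtest/source/T1_ES_PIC_Tape/urandom.270MB.file0000 -- (ERROR: diskCacheV111.util.CacheException : File not on tape.)
"""


def find_stuck_targets(info_output, now, min_age=timedelta(hours=1)):
    """Return (target, started_at) pairs for RUNNING targets that have no
    completion time and were started at least `min_age` before `now`."""
    stuck = []
    for line in info_output.splitlines():
        # Split into at most five columns so the TARGET column stays whole.
        fields = [field.strip() for field in line.split("|", 4)]
        if len(fields) != 5:
            continue  # not a row of the target table
        _created, started, completed, state, target = fields
        if state != "RUNNING" or completed != "?":
            continue  # header row, or a target that already finished
        started_at = datetime.strptime(started, TIMESTAMP_FORMAT)
        if now - started_at >= min_age:
            stuck.append((target, started_at))
    return stuck


if __name__ == "__main__":
    for target, started_at in find_stuck_targets(SAMPLE_INFO, datetime.now()):
        print(f"possibly stuck since {started_at}: {target}")
```

Each path reported this way could then be checked by hand with pnfsidof and rh ls, as in the example above.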

Don't hesitate to ask for any further information you need.

Cheers,
Elena

@DmitryLitvintsev
Member

Hi Elena.

Make sure you set:

rc onerror fail
rc set max retries 3

(The 3 here simply stands for "a small number"; you do not want this value to be large.)

@DmitryLitvintsev
Member

As for the current request, I suggest cancelling it via the bulk admin API.

@elenamplanas
Author

Hi Dmitry,
The parameters you suggested relate to the recall processes in PoolManager, but the stuck requests entered through bulk and have no entries in PoolManager. As I understand it, they are managed directly by the bulk service, which sends the requests to the pool and bypasses PoolManager. Or maybe I'm wrong?

@DmitryLitvintsev
Member

> Hi Dmitry, the parameters you suggested relate to the recall processes in PoolManager, but the stuck requests entered through bulk and have no entries in PoolManager. They are managed directly by the bulk service, which sends the requests to the pool and bypasses PoolManager. Or maybe I'm wrong?

This is how it works:

bulk -> PinManager -> PoolManager -> pool 

All staging requests in dCache are handled by PoolManager, including those submitted through bulk.
