Malformed or missing NCT ID causes clinicaltrials_download to throw error #12
Comments
That error occurs when the gather_results function can't find a results field that it expects, such as a category list or measurement list. Can you say specifically which trial is causing the error? Is it NCT03004287? If so, that trial has no results posted, and gather_results should not be called.
I don't think your tframe is the problem here. It appears that the zip file is being extracted correctly and the XML is being parsed, but a field that the gather_results function expects is missing from the XML list. Can you send me a specific trial or list of trials that throw this error? I can't reproduce it exactly, but I've seen something similar (and I thought I fixed it). There may be a small set of trials that don't follow the precise format I expect.
Here's a CSV dump of the tframe I'm passing to this command: missing.NCTs.csv.zip I tried each of the first five nct_ids individually, and all threw the error. Let me know if you need additional information.
I think I finally resolved this issue. If the NCT ID isn't valid or doesn't otherwise produce a download result, the xml_list variable ends with a run of NAs. The parse function can't open or read a file named NA, so the solution is to omit NAs from the xml_list. I've added that solution to clinicaltrials_download.r: Old: New: I am submitting a pull request for this fix.
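The NA-omission step described above can be sketched as follows. This is a minimal illustration, not the code from the pull request: the file names are placeholders, and the vector stands in for whatever clinicaltrials_download actually builds.

```r
# Placeholder xml_list: two real file names followed by NAs, as happens
# when fewer files were downloaded than NCT IDs were submitted.
xml_list <- c("NCT00000001.xml", "NCT00000002.xml", NA, NA)

# Keep only the entries that name an actual file.
xml_list <- xml_list[!is.na(xml_list)]

length(xml_list)  # 2
```

Logical indexing is used here instead of na.omit() only because it returns a plain character vector without the extra na.action attribute; either approach drops the NA entries.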
Fix for Issue sachsmc#12: gather_results expected complete values for baseline/measure_list in the XML file, which aren't necessarily present. Trapped missing value. Fixed Issue sachsmc#14: In clinicaltrials_download, the number of downloaded files can be smaller than the number of NCT IDs submitted in a tframe (e.g., where bad NCT IDs are introduced from an external source). The expected file list was assumed to be the same length as the list of submitted NCT IDs in tframe. Omitted missing values.
This error:
was corrected in my workflow with the above commit. Unfortunately, the results test still failed, although it worked with a different query. I am wondering whether a new study in the result set made pull requests #15 and #16 fail as well. I changed the state name in the query from California to Arkansas and it performed correctly. This little bug is hard to find and squash.
I've identified the XML file that fails in the testing. Here's some code to identify where it fails:
In gather_results, category_list is expected to be a recursive object, not atomic.
I looked at the XML file, and this is the XML source of the measure that causes the error:
target <- lank$category_list is generated from the items in the class_list; however, in this particular measure, the expected class_list tag is not present. The solution will likely be to test for the presence of the class_list tag; if it doesn't exist, the return value for the function will simply be the fillout object, rather than cbind(fillout, target).
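A rough sketch of that guard, assuming a parsed measure is a named list. The function combine_measure and the shape of the measure are hypothetical stand-ins for illustration; only fillout, target, and class_list are names taken from the discussion, and this is not the actual gather_results code.

```r
# Hypothetical stand-in for the step inside gather_results that
# combines the fixed columns (fillout) with the per-class columns (target).
combine_measure <- function(fillout, measure) {
  # If the measure has no class_list tag, return the fixed columns alone.
  if (!("class_list" %in% names(measure))) {
    return(fillout)
  }
  target <- data.frame(class = measure$class_list$class,
                       stringsAsFactors = FALSE)
  cbind(fillout, target)
}

fillout <- data.frame(nct_id = "NCT00000001", title = "Example measure",
                      stringsAsFactors = FALSE)
with_classes <- list(class_list = list(class = "Class A"))
without_classes <- list(title = "Example measure")

ncol(combine_measure(fillout, with_classes))     # 3
ncol(combine_measure(fillout, without_classes))  # 2
```

One consequence of this choice is that rows for measures without a class_list have fewer columns, so any code that later stacks the results would need to tolerate the missing columns.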
Test for existence of class_list and return appropriate values. Fix for issue sachsmc#12.
I'm comparing a list from our local clinical trial database to CT.gov. Some entries in this list didn't get downloaded using a search, so I'm trying to download them separately. Apparently, some of these locally-entered NCT IDs are malformed or simply don't exist. Running the clinicaltrials_download function against a list containing malformed or missing NCT IDs throws the following error:
The file 'NA' does not exist in the referenced directory. The missing tmp folder + NA is a result of how the xml_list is built.
The NCT ID that threw the error appears to be 'NCT11122334'. The GET command used to download the NCT id is of the following format:
The result of that command is empty:
In comparison, a valid NCT ID yields the following result:
So, xml_list uses the list.files function, and it expects a list of XML files that is either count or tcount long; however, if the result of any of the downloads is empty, there won't be an associated XML file, and the expected length of the list will be too long -- hence the NA values.
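The mechanism can be reproduced in isolation. The sketch below assumes the list is built by indexing the list.files result up to the number of requested IDs; the variable requested and the file names are made up for the example, and the real function may construct xml_list differently.

```r
# Create a temporary directory holding only two XML files.
tmp_dir <- tempfile("xml_demo_")
dir.create(tmp_dir)
file.create(file.path(tmp_dir, c("NCT00000001.xml", "NCT00000002.xml")))

# Three IDs were requested, but one download came back empty.
requested <- 3

# Indexing past the end of a character vector pads the result with NA.
xml_list <- list.files(tmp_dir, pattern = "\\.xml$")[1:requested]

is.na(xml_list)  # FALSE FALSE  TRUE
```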
It appears the clinicaltrials_download function needs to check to see if the result exists before attempting to unzip the result as an XML file. Also, the expected length of the xml_list needs to be changed to just use the full list of XML files unzipped by the function.
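One way to check that a result exists before unzipping is to test the downloaded file on disk. The helper has_content below is hypothetical, and it assumes the download has already been written to a local path; the actual change may instead inspect the response object returned by GET.

```r
# Hypothetical helper: TRUE only if the file exists and is non-empty.
has_content <- function(path) {
  file.exists(path) && isTRUE(file.info(path)$size > 0)
}

# An empty file stands in for a download that returned nothing.
empty_zip <- tempfile(fileext = ".zip")
file.create(empty_zip)

# A non-empty file stands in for a successful download.
real_file <- tempfile(fileext = ".txt")
writeLines("placeholder", real_file)

has_content(empty_zip)   # FALSE
has_content(real_file)   # TRUE
has_content(tempfile())  # FALSE, the path does not exist
```

Only paths that pass this check would then be handed to unzip(), so the XML file list reflects what was actually downloaded.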
I've made the necessary changes.
Unfortunately, I'm now seeing the following error from the gather_results function, which I've been unable to debug:
Any suggestions on the gather_results function would be appreciated.