Malformed or missing NCT ID causes clinicaltrials_download to throw error #12

Closed
titaniumtroop opened this issue Jan 12, 2017 · 8 comments


@titaniumtroop
Contributor

I'm comparing a list from our local clinical trial database to CT.gov. Some entries in this list didn't get downloaded using a search, so I'm trying to download them separately. Apparently, some of these locally-entered NCT IDs are malformed or simply don't exist. Running the clinicaltrials_download function against a list containing malformed or missing NCT IDs throws the following error:

Error: XML content does not seem to be XML: '/var/folders/3h/nmq5tj2s0ms2138012_9kwcc0000gp/T//Rtmp1fnKcv/NA'
5. (if (isHTML) warning else stop)(e)
4. XML::xmlParse(file) at parse_study_xml.R#22
3. FUN(X[[i]], ...)
2. lapply(xml_list, parse_study_xml, include_textblocks) at clinicaltrials_download.R#131
1. clinicaltrials_download(tframe = missing.NCTs, count = NULL, include_results = TRUE)

No file named 'NA' exists in the referenced directory. The path (temp folder plus 'NA') is an artifact of how xml_list is built.

The NCT ID that threw the error appears to be 'NCT11122334'. The GET command used to download the NCT id is of the following format:

result <- httr::GET("http://clinicaltrials.gov/ct2/results?id=NCT11122334&resultsxml=true", httr::write_disk('~/Desktop/out.txt'))

The result of that command is empty:

result
Response [https://clinicaltrials.gov/ct2/results?id=NCT11122334&resultsxml=true]
  Date: 2017-01-12 16:48
  Status: 200
  Content-Type: application/zip;charset=UTF-8
<EMPTY BODY>

In comparison, a valid NCT ID yields the following result:

result
Response [https://clinicaltrials.gov/ct2/results?id=NCT03004287&resultsxml=true]
  Date: 2017-01-12 16:51
  Status: 200
  Content-Type: application/zip;charset=UTF-8
  Size: 4.72 kB
<ON DISK>  /Users/XXXXXXXX/Desktop/out.existing.txt

So, xml_list is built with the list.files function, and the code expects it to contain either count or tcount XML files. If any download comes back empty, no XML file is produced for it, so the expected length exceeds the number of files actually on disk, and the surplus positions become NA values.
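The NA entries can be reproduced in base R without any download. This is only a minimal sketch with a made-up directory and made-up file names: indexing a vector past its end pads the result with NA, and paste() turns each NA into the string "NA".

```r
# Hypothetical values: two files were unzipped, but three NCT IDs were requested.
tmpdir <- "/tmp/example"
unzipped <- c("NCT00000001.xml", "NCT00000002.xml")
expected <- 3

# Indexing past the end of a vector pads the result with NA.
picked <- unzipped[1:expected]

# paste() converts NA to the string "NA", producing a path to a file named NA.
xml_list <- paste(tmpdir, picked, sep = "/")
print(xml_list)
```

The third element is "/tmp/example/NA", which matches the shape of the path in the error message above.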

It appears the clinicaltrials_download function needs to check whether the download result is non-empty before attempting to unzip it. Also, xml_list should be built from the full list of XML files the function actually unzipped, rather than from an expected length.
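One way to express that kind of check, independent of httr, is to look at the size of the file on disk. The sketch below assumes the download has been written to a known path; has_payload is a hypothetical helper, not part of rclinicaltrials, and a non-empty file is of course not guaranteed to be a valid zip archive.

```r
# Hypothetical helper: TRUE only when the downloaded file exists on disk
# and contains at least one byte.
has_payload <- function(path) {
  file.exists(path) && isTRUE(file.size(path) > 0)
}

empty_zip <- tempfile(fileext = ".zip")
file.create(empty_zip)                      # zero-byte file, like the empty response
nonempty <- tempfile(fileext = ".zip")
writeBin(as.raw(c(0x50, 0x4b)), nonempty)   # two placeholder bytes, not a real archive

has_payload(empty_zip)   # FALSE
has_payload(nonempty)    # TRUE
has_payload(tempfile())  # FALSE: the file was never created
```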

I've made the necessary changes.

      # OLD LINE utils::unzip(tmpzip, exdir = tmpdir)

      # If the NCT ID is not found, the content length will be 0
      # Blank zip file will cause an error, so don't unzip it
      if (length(result$content) > 0) {
        utils::unzip(tmpzip, exdir = tmpdir)
      }
      Sys.sleep(0.1) # sleep 0.1 sec as requested by Crawl-delay parameter in http://www.clinicaltrials.gov/robots.txt
    }

    # get files list

    # OLD LINE xml_list <- paste(tmpdir, list.files(path = tmpdir, pattern = "xml$")[1:min(tcount, count)], sep = "/")
    # full.names = TRUE returns complete paths, and an empty vector when no XML files exist
    xml_list <- list.files(path = tmpdir, pattern = "xml$", full.names = TRUE)

Unfortunately, I'm now seeing the following error from the gather_results function, which I've been unable to debug:

Error in lank$category_list : $ operator is invalid for atomic vectors
9. FUN(X[[i]], ...)
8. lapply(kids, FUN, ...)
7. xmlApply.XMLInternalNode(measures, function(node) { lank <- XML::xmlSApply(node, function(n) { if (XML::xmlName(n) == "category_list") { do.call(plyr::rbind.fill, XML::xmlApply(n, function(n0) { ...
6. XML::xmlApply(measures, function(node) { lank <- XML::xmlSApply(node, function(n) { if (XML::xmlName(n) == "category_list") { do.call(plyr::rbind.fill, XML::xmlApply(n, function(n0) { ...
5. do.call(plyr::rbind.fill, XML::xmlApply(measures, function(node) { lank <- XML::xmlSApply(node, function(n) { if (XML::xmlName(n) == "category_list") { do.call(plyr::rbind.fill, XML::xmlApply(n, function(n0) { ... at gather_results.R#58
4. gather_results(XML::xmlParse(file)) at clinicaltrials_download.R#139
3. FUN(X[[i]], ...)
2. lapply(xml_list, function(file) gather_results(XML::xmlParse(file))) at clinicaltrials_download.R#139
1. clinicaltrials_download(tframe = missing.CLARA.NCTs, count = NULL, include_results = TRUE)

Any suggestions on the gather_results function would be appreciated.

@sachsmc
Owner

sachsmc commented Jan 16, 2017

That error occurs when the gather_results function can't find a field for the results that it is expecting, like a category list or measurement list.

Can you say specifically which trial is causing the error? Is it NCT03004287? If so, then that trial has no results posted, and the gather_results function should not be called.

@titaniumtroop
Contributor Author

I'd say a majority of the studies in the original search (~940 studies) didn't have results. That's one of the reasons I was downloading the study information. To illustrate, this is one of the plots I made with the data:
[image: study_results_reported]

The error seems to be independent of the study being downloaded. It may relate to how I'm generating the tframe used by gather_results. I did a full join of the search results with the list of nct_id values from our internal system, and then dropped every row that had a study title. So, the data frame contains only the nct_id; the remaining variables are NA.

Do you know whether the gather_results function expects other valid variables from the tframe? If so, that could be the reason for the error; what do you suggest for creating an appropriate tframe from the list of nct_ids I need for the clinicaltrials_download function?

If not, I'll keep working on the debug. Thanks for the help!

@sachsmc
Owner

sachsmc commented Jan 23, 2017

I don't think your tframe is the problem here. It appears that the zip file is being extracted correctly, and the xml being parsed, but then a field that the gather_results function expects is missing from the xml list.

Can you send me a specific trial or list of trials that throw this error? I can't reproduce it exactly but I've seen something similar (and I thought I fixed it). It may be the case that there is a small set of trials that doesn't follow the precise format that I expect.

@titaniumtroop
Contributor Author

Here's a CSV dump of the tframe I'm passing to this command:

missing.NCTs.csv.zip
download.missing.NCTs <- clinicaltrials_download(tframe = missing.NCTs, count = NULL, include_results = TRUE)

I tried each of the first five nct_ids individually, and all threw the error.

Let me know if you need additional information.

@titaniumtroop
Contributor Author

titaniumtroop commented Sep 19, 2017

I think I finally resolved this issue. If the NCT ID isn't valid or doesn't otherwise produce a download result, the xml_list variable ends with a run of NAs, and the parse function can't open a file named NA. So, the solution is to omit NAs from xml_list. I've added that solution to clinicaltrials_download.R:

Old:
xml_list <- unzipped.files[1:min(tcount, count)]

New:
xml_list <- stats::na.omit(unzipped.files[1:min(tcount, count)])

I am submitting a pull request for this fix.
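To show what that change does to the vector, here is a small base R sketch with made-up file names. It also shows utils::head() as one possible alternative that never indexes past the end; that alternative is an illustration only, not what the commit uses.

```r
# Hypothetical file names; two files unzipped, three requested.
unzipped.files <- c("/tmp/a.xml", "/tmp/b.xml")
tcount <- 3
count <- NULL  # min() ignores NULL, so min(tcount, count) is 3

with_na <- unzipped.files[1:min(tcount, count)]
xml_list <- stats::na.omit(with_na)

# na.omit() keeps the values but attaches bookkeeping attributes;
# as.vector() strips them if a plain character vector is wanted.
plain <- as.vector(xml_list)

# utils::head() is an alternative that never indexes past the end.
capped <- utils::head(unzipped.files, min(tcount, count))
```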

titaniumtroop added a commit to titaniumtroop/rclinicaltrials that referenced this issue Sep 19, 2017
Fix for Issue sachsmc#12:
gather_results expected complete values for baseline/measure_list in
XML file, which isn’t necessarily there. Trapped missing value.

Fixed Issue sachsmc#14: In clinicaltrials_download, number of downloaded files
can be shorter than the list of NCT numbers submitted in a tframe
(e.g., where bad NCT IDs are introduced from an external source). The
expected number of files was a list the same length as the submitted
NCT IDs in tframe. Omitted missing values.
@titaniumtroop
Contributor Author

This error:

Error in lank$category_list : $ operator is invalid for atomic vectors

was corrected in my workflow with the above commit. Unfortunately, the package's results test still failed, although it passed with a different query: when I changed the state name in the query from California to Arkansas, it ran correctly. I am wondering whether a new study in the result set made pull requests #15 and #16 fail as well. This little bug is hard to find and squash.

@titaniumtroop
Contributor Author

I've identified the XML file that fails in the testing. Here's some code to identify where it fails:

library(rclinicaltrials)
nores <- clinicaltrials_download(query = 'heart disease AND stroke AND California', count = 5)

clinicaltrials_download(tframe = nores$study_info[1,], include_results = TRUE)
# Error in lank$category_list : $ operator is invalid for atomic vectors

nores$study_info[1,]$nct_id
# NCT02326649

lank
# title                                                    population 
# "Race and Ethnicity Not Collected" "Race and Ethnicity were not collected from any participant." 
# units                                                         param 
# "Participants"                                       "Count of Participants" 

class(lank)
# [1] "character"

In gather_results, category_list is expected to be a recursive object, not atomic.
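The failure can be reproduced in base R without the XML package. In this sketch, lank is rebuilt by hand as a named character vector from the values printed above; nothing here is the package's own code.

```r
# Values copied from the printed `lank`, built by hand as a named character vector.
lank <- c(
  title = "Race and Ethnicity Not Collected",
  population = "Race and Ethnicity were not collected from any participant.",
  units = "Participants",
  param = "Count of Participants"
)

is.atomic(lank)     # TRUE: `$` is not defined for this object
is.recursive(lank)  # FALSE

# `$` on an atomic vector raises an error; capture its message instead of stopping.
msg <- tryCatch(lank$category_list, error = function(e) conditionMessage(e))
print(msg)

# Checking names() is safe for both atomic vectors and lists.
"category_list" %in% names(lank)  # FALSE
```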

@titaniumtroop
Contributor Author

I looked at the XML file, and this is the XML source of the measure that causes the error:

        <measure>
          <title>Race and Ethnicity Not Collected</title>
          <population>Race and Ethnicity were not collected from any participant.</population>
          <units>Participants</units>
          <param>Count of Participants</param>
        </measure>

target <- lank$category_list is generated from the items in the class_list; however, in this particular measure, the expected class_list tag is not present. The solution will likely be to test for the presence of the class_list tag; if it doesn't exist, the return value for the function will simply be the fillout object, rather than cbind(fillout, target).
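A rough sketch of that guard in base R follows. combine_measure is a hypothetical helper, not the code in gather_results, and the example data are invented; it also tests the parsed object for a category_list element rather than testing the XML for a class_list tag, which is an assumption made to keep the example free of the XML package.

```r
# Hypothetical helper: combine the per-measure columns with the category
# table only when that table exists.
combine_measure <- function(fillout, lank) {
  has_categories <- is.list(lank) && !is.null(lank[["category_list"]])
  if (has_categories) cbind(fillout, lank[["category_list"]]) else fillout
}

fillout <- data.frame(title = "Example measure", units = "Participants",
                      stringsAsFactors = FALSE)

# Measure without categories: lank is an atomic character vector.
no_cat <- c(title = "Example measure", units = "Participants")
res_plain <- combine_measure(fillout, no_cat)

# Measure with categories: lank is a list holding a data frame.
with_cat <- list(title = "Example measure",
                 category_list = data.frame(category = "Female", value = "10",
                                            stringsAsFactors = FALSE))
res_full <- combine_measure(fillout, with_cat)
```

Because is.list() is checked first, the function never applies a list accessor to an atomic vector, so a measure like the one above returns fillout unchanged.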

titaniumtroop added a commit to titaniumtroop/rclinicaltrials that referenced this issue Sep 20, 2017
Test for existence of class_list and return appropriate values. Fix for
issue sachsmc#12.