Malformed or missing NCT ID causes clinicaltrials_download to throw error #12

titaniumtroop · 2017-01-12T18:41:50Z

I'm comparing a list from our local clinical trial database to CT.gov. Some entries in this list didn't get downloaded using a search, so I'm trying to download them separately. Apparently, some of these locally-entered NCT IDs are malformed or simply don't exist. Running the clinicaltrials_download function against a list containing malformed or missing NCT IDs throws the following error:

Error: XML content does not seem to be XML: '/var/folders/3h/nmq5tj2s0ms2138012_9kwcc0000gp/T//Rtmp1fnKcv/NA'
5. (if (isHTML) warning else stop)(e)
4. XML::xmlParse(file) at parse_study_xml.R#22
3. FUN(X[[i]], ...)
2. lapply(xml_list, parse_study_xml, include_textblocks) at clinicaltrials_download.R#131
1. clinicaltrials_download(tframe = missing.NCTs, count = NULL, include_results = TRUE)

The file 'NA' does not exist in the referenced directory. The path (the temp folder plus 'NA') is a result of how xml_list is built.

The NCT ID that threw the error appears to be 'NCT11122334'. The GET command used to download an NCT ID has the following format:

result <- httr::GET("http://clinicaltrials.gov/ct2/results?id=NCT11122334&resultsxml=true", httr::write_disk('~/Desktop/out.txt'))

The result of that command is empty:

result
Response [https://clinicaltrials.gov/ct2/results?id=NCT11122334&resultsxml=true]
  Date: 2017-01-12 16:48
  Status: 200
  Content-Type: application/zip;charset=UTF-8
<EMPTY BODY>

In comparison, a valid NCT ID yields the following result:

result
Response [https://clinicaltrials.gov/ct2/results?id=NCT03004287&resultsxml=true]
  Date: 2017-01-12 16:51
  Status: 200
  Content-Type: application/zip;charset=UTF-8
  Size: 4.72 kB
<ON DISK>  /Users/XXXXXXXX/Desktop/out.existing.txt

So, xml_list is built with the list.files function and is expected to contain either count or tcount XML files. However, if the result of any download is empty, no XML file is written for it, so fewer files exist than expected. Indexing past the end of the file list pads it with NA values, which is where the 'NA' path in the error comes from.
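
The NA entries can be reproduced in isolation. The following is a minimal sketch with made-up file names and a made-up temporary directory (they are assumptions, not the package's actual values): indexing a character vector past its end yields NA, and paste() then turns that NA into the literal path component "NA".

```r
# Hypothetical stand-ins for the files that were actually unzipped and
# for the number of NCT IDs that were requested.
tmpdir <- "/tmp/example"
unzipped <- c("NCT00000001.xml", "NCT00000002.xml")
expected <- 3

# Indexing past the end of a vector pads the result with NA.
picked <- unzipped[1:expected]

# paste() coerces NA to the string "NA", producing a path to a file
# that does not exist.
xml_list <- paste(tmpdir, picked, sep = "/")
print(xml_list)
```

The last element of xml_list is a path ending in "NA", which matches the file name in the error message.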

It appears the clinicaltrials_download function needs to check to see if the result exists before attempting to unzip the result as an XML file. Also, the expected length of the xml_list needs to be changed to just use the full list of XML files unzipped by the function.

I've made the necessary changes.

      # OLD LINE utils::unzip(tmpzip, exdir = tmpdir)

      # If the NCT ID is not found, the content length will be 0
      # Blank zip file will cause an error, so don't unzip it
      if (length(result$content) > 0) {
        utils::unzip(tmpzip, exdir = tmpdir)
      }
      Sys.sleep(0.1) # sleep 0.1 sec as requested by Crawl-delay parameter in http://www.clinicaltrials.gov/robots.txt
    }

    # get files list

    # OLD LINE xml_list <- paste(tmpdir, list.files(path = tmpdir, pattern = "xml$")[1:min(tcount, count)], sep = "/")
    xml_list <- paste(tmpdir, list.files(path = tmpdir, pattern = "xml$")[], sep = "/")

Unfortunately, I'm now seeing the following error from the gather_results function, which I've been unable to debug:

Error in lank$category_list : $ operator is invalid for atomic vectors
9. FUN(X[[i]], ...)
8. lapply(kids, FUN, ...)
7. xmlApply.XMLInternalNode(measures, function(node) { lank <- XML::xmlSApply(node, function(n) { if (XML::xmlName(n) == "category_list") { do.call(plyr::rbind.fill, XML::xmlApply(n, function(n0) { ...
6. XML::xmlApply(measures, function(node) { lank <- XML::xmlSApply(node, function(n) { if (XML::xmlName(n) == "category_list") { do.call(plyr::rbind.fill, XML::xmlApply(n, function(n0) { ...
5. do.call(plyr::rbind.fill, XML::xmlApply(measures, function(node) { lank <- XML::xmlSApply(node, function(n) { if (XML::xmlName(n) == "category_list") { do.call(plyr::rbind.fill, XML::xmlApply(n, function(n0) { ... at gather_results.R#58
4. gather_results(XML::xmlParse(file)) at clinicaltrials_download.R#139
3. FUN(X[[i]], ...)
2. lapply(xml_list, function(file) gather_results(XML::xmlParse(file))) at clinicaltrials_download.R#139
1. clinicaltrials_download(tframe = missing.CLARA.NCTs, count = NULL, include_results = TRUE)

Any suggestions on the gather_results function would be appreciated.


sachsmc · 2017-01-16T14:47:47Z

That error occurs when the gather_results function can't find a field for the results that it is expecting, like a category list or measurement list.

Can you say specifically which trial is causing the error? Is it NCT03004287? If so, then that trial has no results posted, and the gather_results function should not be called.

titaniumtroop · 2017-01-17T15:09:01Z

I'd say a majority of the studies in the original search (~940 studies) didn't have results. That's one of the reasons I was downloading the study information. To illustrate, this is one of the plots I made with the data:

The error seems to be independent of the study being downloaded. It may relate to how I'm generating the tframe used by gather_results. I did a full join of the search results with the list of nct_id from our internal system, and then culled anything with a study title from that list. So, the dataframe contains the nct_id and the remaining variables are NA.

Do you know whether the gather_results function expects other valid variables from the tframe? If so, that could be the reason for the error; what do you suggest for creating an appropriate tframe from the list of nct_ids I need for the clinicaltrials_download function?

If not, I'll keep working on the debug. Thanks for the help!

sachsmc · 2017-01-23T15:04:08Z

I don't think your tframe is the problem here. It appears that the zip file is being extracted correctly, and the xml being parsed, but then a field that the gather_results function expects is missing from the xml list.

Can you send me a specific trial or list of trials that throw this error? I can't reproduce it exactly but I've seen something similar (and I thought I fixed it). It may be the case that there is a small set of trials that doesn't follow the precise format that I expect.

titaniumtroop · 2017-01-25T15:28:38Z

Here's a CSV dump of the tframe I'm passing to this command:

missing.NCTs.csv.zip
download.missing.NCTs <- clinicaltrials_download(tframe = missing.NCTs, count = NULL, include_results = TRUE)

I tried each of the first five nct_ids individually, and all threw the error.

Let me know if you need additional information.

titaniumtroop · 2017-09-19T19:55:02Z

I think I finally resolved this issue. If the NCT ID isn't valid or doesn't otherwise produce a download result, the xml_list variable has a bunch of NAs at the end. The parse function can't open/read the file named NA. So, the solution is to omit NAs from the xml_list. I've added that solution to clinicaltrials_download.r:

Old:
xml_list <- unzipped.files[1:min(tcount, count)]

New:
xml_list <- stats::na.omit(unzipped.files[1:min(tcount, count)])
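
As a minimal sketch of what this change does (the file paths and the requested count below are made up for illustration), stats::na.omit drops the NA padding and records the dropped positions in an "na.action" attribute. Plain logical indexing is shown as an alternative that returns a bare character vector.

```r
# Hypothetical values: two files were unzipped, but four IDs were requested.
unzipped.files <- c("/tmp/example/NCT00000001.xml",
                    "/tmp/example/NCT00000002.xml")
requested <- 4

# Over-indexing pads the vector with NA, as in the original bug.
padded <- unzipped.files[1:requested]

# The fix from the thread: drop the NA entries before parsing.
xml_list <- stats::na.omit(padded)

# Alternative without the extra "na.action" attribute.
xml_list_plain <- padded[!is.na(padded)]

print(length(xml_list))
print(xml_list_plain)
```

Either form leaves only paths that point at real files, so the parser is never handed a file named NA.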

I am submitting a pull request for this fix.

Fix for Issue sachsmc#12: gather_results expected complete values for baseline/measure_list in XML file, which isn’t necessarily there. Trapped missing value. Fixed Issue sachsmc#14: In clinicaltrials_download, number of downloaded files can be shorter than the list of NCT numbers submitted in a tframe (e.g., where bad NCT IDs are introduced from an external source). The expected number of files was a list the same length as the submitted NCT IDs in tframe. Omitted missing values.

titaniumtroop · 2017-09-19T21:33:09Z

This error:

Error in lank$category_list : $ operator is invalid for atomic vectors

was corrected in my workflow with the above commit. Unfortunately, the results test still failed with its original query, although it passed with a different one: when I changed the state name in the query from California to Arkansas, it ran correctly. I am wondering whether a new study in the result set made pull requests #15 and #16 fail as well. This little bug is hard to find and squash.

titaniumtroop · 2017-09-20T14:53:36Z

I've identified the XML file that fails in the testing. Here's some code to identify where it fails:

library(rclinicaltrials)
nores <- clinicaltrials_download(query = 'heart disease AND stroke AND California', count = 5)

clinicaltrials_download(tframe = nores$study_info[1,], include_results = TRUE)
# Error in lank$category_list : $ operator is invalid for atomic vectors

nores$study_info[1,]$nct_id
# NCT02326649

lank
# title                                                    population 
# "Race and Ethnicity Not Collected" "Race and Ethnicity were not collected from any participant." 
# units                                                         param 
# "Participants"                                       "Count of Participants" 

class(lank)
# [1] "character"

gather_results expects lank to be a recursive object (a list) so that lank$category_list can be extracted. Here lank is an atomic character vector, so the $ operator fails.
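
The failure mode can be shown with base R alone. This is a rough sketch that uses sapply() as a stand-in for the sapply-style simplification in the trace above; the field values are illustrative assumptions. When every child is a single string the result collapses to a named character vector, and $ is not defined for atomic vectors.

```r
# Every field is a single string, so sapply() simplifies the result
# to a named character vector (atomic).
fields <- list(title = "Race and Ethnicity Not Collected",
               units = "Participants")
lank_atomic <- sapply(fields, identity)

# A nested child with a different length prevents simplification,
# so the result stays a list (recursive).
fields_nested <- c(fields, list(category_list = data.frame(
  category = "Female", value = "10", stringsAsFactors = FALSE)))
lank_list <- sapply(fields_nested, identity)

# `$` works on the list but raises an error on the atomic vector.
msg <- tryCatch(lank_atomic$category_list,
                error = function(e) conditionMessage(e))
print(msg)
print(class(lank_list$category_list))
```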

titaniumtroop · 2017-09-20T16:02:20Z

I looked at the XML file, and this is the XML source of the measure that causes the error:

        <measure>
          <title>Race and Ethnicity Not Collected</title>
          <population>Race and Ethnicity were not collected from any participant.</population>
          <units>Participants</units>
          <param>Count of Participants</param>
        </measure>

target <- lank$category_list is generated from the items in the class_list; however, in this particular measure, the expected class_list tag is not present. The solution will likely be to test for the presence of the class_list tag; if it doesn't exist, the return value for the function will simply be the fillout object, rather than cbind(fillout, target).
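
One way such a guard could look is sketched below. This is not the code from the commit: combine_measure is a hypothetical helper, and it checks the already-parsed lank object rather than the XML tag itself. Only the names fillout, lank, and category_list come from the thread.

```r
# Hypothetical helper: return fillout alone when lank carries no
# category_list, otherwise bind the category columns onto it.
combine_measure <- function(fillout, lank) {
  has_categories <- is.list(lank) && !is.null(lank[["category_list"]])
  if (!has_categories) {
    return(fillout)
  }
  cbind(fillout, lank[["category_list"]])
}

# Measure without categories: lank is an atomic character vector.
fillout_plain <- data.frame(title = "Race and Ethnicity Not Collected",
                            units = "Participants",
                            stringsAsFactors = FALSE)
lank_atomic <- c(title = "Race and Ethnicity Not Collected",
                 units = "Participants")
res_plain <- combine_measure(fillout_plain, lank_atomic)

# Measure with categories: lank is a list holding a data frame.
fillout_cat <- data.frame(title = "Gender", stringsAsFactors = FALSE)
lank_list <- list(title = "Gender",
                  category_list = data.frame(category = "Female",
                                             value = "10",
                                             stringsAsFactors = FALSE))
res_cat <- combine_measure(fillout_cat, lank_list)
print(names(res_cat))
```

Checking is.list() first avoids applying list-style extraction to an atomic vector, which is the situation that triggered the original error.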

Test for existence of class_list and return appropriate values. Fix for issue sachsmc#12.

titaniumtroop mentioned this issue Sep 19, 2017

Bugfixes for Issues 12, 14 #17

Merged

titaniumtroop added a commit to titaniumtroop/rclinicaltrials that referenced this issue Sep 20, 2017

Fix for "Error in lank$category_list"

1c9019d

Test for existence of class_list and return appropriate values. Fix for issue sachsmc#12.

titaniumtroop closed this as completed Jul 31, 2018
