cwe119_cgd.txt and cwe399_cgd.txt with right format #8

hungryfoolou · 2018-08-02T04:09:06Z

I found some mistakes in cwe119_cgd.txt and cwe399_cgd.txt. Could you provide both files in the correct format?

VulDeePecker · 2018-08-03T11:36:13Z

Can you explain the details of the problem?

hungryfoolou · 2018-08-03T11:58:02Z

For example, the second line of cwe119_cgd.txt is:
ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )
It seems to be missing a left parenthesis, so I suspect the file is malformed.

VulDeePecker · 2018-08-05T03:58:32Z

Thanks. We use the commercial product Checkmarx to extract program slices, which are then assembled into code gadgets. The problem you see in some code gadgets is caused by Checkmarx's imperfect output. For example, "ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )" comes from the following code, where Checkmarx extracts only line 440 and misses lines 439 and 441.

439 if( unzGetCurrentFileInfo( file, p_fileInfo, psz_fileName,
440 ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )
441 != UNZ_OK )
442 {
443 msg_Warn( p_this, "can't get info about file in zip" );
444 return VLC_EGENERIC;
445 }
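
To find gadget lines affected by this kind of partial extraction, one could count parentheses per line. The following is a minimal sketch, not part of the VulDeePecker tooling; the helper names `paren_balance` and `flag_unbalanced` are hypothetical, and the check is naive (it also counts parentheses inside string literals and comments).

```python
def paren_balance(text: str) -> int:
    """Return the number of '(' minus the number of ')' in text."""
    return text.count("(") - text.count(")")


def flag_unbalanced(gadget_lines):
    """Return (index, line) pairs for lines whose parentheses do not balance."""
    return [
        (i, line)
        for i, line in enumerate(gadget_lines)
        if paren_balance(line) != 0
    ]


# Lines 439-441 of the snippet above, without the line numbers.
full_statement = [
    "if( unzGetCurrentFileInfo( file, p_fileInfo, psz_fileName,",
    "ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )",
    "!= UNZ_OK )",
]

# What ended up in the code gadget: only line 440.
extracted_slice = [full_statement[1]]

print(flag_unbalanced(extracted_slice))          # line 440 alone is flagged
print(paren_balance(" ".join(full_statement)))   # the complete statement balances
```

A flagged line does not necessarily mean the gadget is unusable. It only marks places where a statement spanning several source lines was extracted partially.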

hungryfoolou · 2018-08-06T07:41:38Z

Thank you. I have another question: is each token of a code gadget converted to its own vector, or are all tokens of a gadget converted to one single vector?
In Section III-E (Step III: Transforming code gadgets into vectors) of your paper "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection", the sentence **This tool is based on the idea of distributed representation, which maps a token to an integer that is then converted to a fixed-length vector [43]** seems to mean that each token is converted to a vector.
But the next sentence, **Since code gadgets may have different numbers of tokens, the corresponding vectors may have different lengths**, seems to mean that all tokens of a code gadget are converted to one single vector.
Could you explain which of the two is the case?

dengelt · 2018-09-10T08:51:30Z

@hungryfoolou Each token is converted to a vector, so for each code gadget you get a vector of vectors.

The size of each token-vector is fixed (these come from the word2vec tool), while the number of token-vectors in a gadget naturally depends on how many tokens the gadget contains. The resulting sequences are padded or truncated to a common length, as described in the paper.
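
A minimal sketch of the "vector of vectors" idea, with padding and truncation. Everything here is an assumption for illustration: `EMBED_DIM`, `MAX_TOKENS`, and `toy_embedding` are hypothetical stand-ins (a real pipeline would look up trained word2vec vectors), and padding or cutting at the end is a simplification rather than the exact rule from the paper.

```python
EMBED_DIM = 4    # assumed token-vector length (not taken from the paper)
MAX_TOKENS = 6   # assumed fixed number of token-vectors per gadget


def toy_embedding(token, dim=EMBED_DIM):
    """Deterministic stand-in for a trained word2vec lookup."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (k + 1)) % 97) / 97.0 for k in range(dim)]


def vectorize_gadget(tokens, max_tokens=MAX_TOKENS, dim=EMBED_DIM):
    """Map each token to a fixed-length vector, then truncate or zero-pad
    the sequence so every gadget yields max_tokens vectors."""
    vectors = [toy_embedding(tok, dim) for tok in tokens[:max_tokens]]
    vectors += [[0.0] * dim for _ in range(max_tokens - len(vectors))]
    return vectors


long_gadget = "ZIP_FILENAME_LEN , NULL , 0 , NULL , 0 )".split()  # 10 tokens
short_gadget = ["NULL", ")"]                                      # 2 tokens

for gadget in (long_gadget, short_gadget):
    matrix = vectorize_gadget(gadget)
    print(len(gadget), "tokens ->", len(matrix), "x", len(matrix[0]))
```

Both gadgets end up with the same shape (`MAX_TOKENS` rows of `EMBED_DIM` values), which is what a fixed-size neural network input needs.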

hungryfoolou · 2018-09-11T01:42:24Z

@dengelt I really appreciate your help, thanks.

jumormt · 2018-10-23T02:09:34Z

> @hungryfoolou each token is converted to a vector, which means for each code gadget you will get a vector of vectors.
>
> The size of each token-vector is fixed (these come from the word2vec tool), while the number of token-vectors in each gadget naturally depends on how many tokens are in each gadget. These vectors are padded or truncated as described in the paper.

What is the length of each token-vector? As the paper says, it maps a token to an integer that is then converted to a fixed-length vector. Does that mean each token-vector is represented by its index in the token list?

dengelt · 2018-10-23T02:44:04Z

> What is the length of each token-vector?

I don't think this is specified in the paper, or at least I couldn't find it. It would be interesting if @VulDeePecker could tell us which value they used.

> Does it mean that each token-vector is represented by its index in the token list?

If you have a list of token-vectors and you know the index of each token, then you can convert a token to its vector representation that way.
But in the word2vec implementation they reference, you can map token strings directly to their vector representations, since this is implemented as a dictionary in Python.
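
The two lookup styles can be sketched as follows. The vocabulary and the numbers in `embedding_matrix` are made up for illustration and are not real word2vec output; the function names are hypothetical as well.

```python
# Toy vocabulary and embedding matrix (values are invented, not trained).
vocab = ["NULL", "0", ",", ")"]
embedding_matrix = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

# Style 1: token -> integer index -> row of the embedding matrix.
token_to_index = {tok: i for i, tok in enumerate(vocab)}


def lookup_by_index(token):
    return embedding_matrix[token_to_index[token]]


# Style 2: token string -> vector directly, via a dictionary.
token_to_vector = {tok: embedding_matrix[i] for i, tok in enumerate(vocab)}


def lookup_by_token(token):
    return token_to_vector[token]


for tok in vocab:
    # Both styles return the same vector; the integer is only an intermediate key.
    print(tok, lookup_by_index(tok), lookup_by_token(tok))
```

In both cases the vector itself carries the representation. The index only identifies which row to fetch.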

jumormt · 2018-10-23T02:57:14Z

> > What is the length of each token-vector?
>
> I don't think this is specified in the paper, or at least I couldn't find it. It would be interesting if @VulDeePecker could tell us which value they used.
>
> > Does it mean that each token-vector is represented by its index in the token list?
>
> If you have a list of token-vectors and you know the index of each token, then you can convert your token to its vector representation in this way.
> But in the word2vec implementation they reference you can directly map your token strings to their vector representations, as this is implemented as a dictionary in python.

I totally agree with you. It would be interesting if @VulDeePecker could tell us which value they used. If only the index were used, there would be no need for word2vec. But I guess the index idea is close to the answer, because I cannot see another way to map a token to an integer, as mentioned in Section III-E, Step III.2 (encoding the symbolic representations into vectors).

yanyinzhao · 2019-06-07T18:16:18Z

> Thanks. We take advantage of the commercial product Checkmarx to extract the program slices which are then assembled into code gadgets. The problem you mentioned in some code gadgets is caused by the imperfect result of Checkmarx. For example, "ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )" is from the following code which Checkmarx can only extract line 440, but cannot extract line 439 and 441.
>
> 439 if( unzGetCurrentFileInfo( file, p_fileInfo, psz_fileName,
> 440 ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )
> 441 != UNZ_OK )
> 442 {
> 443 msg_Warn( p_this, "can't get info about file in zip" );
> 444 return VLC_EGENERIC;
> 445 }

Because of Checkmarx's imperfect output, some of the extracted code gadgets are malformed. @VulDeePecker, did you use these malformed code gadgets (as provided in cwe119_cgd.txt) to generate the final results? If so, can the final results be trusted?
