cwe119_cgd.txt and cwe399_cgd.txt with right format #8

Open
hungryfoolou opened this issue Aug 2, 2018 · 10 comments

@hungryfoolou

I found some mistakes in the files cwe119_cgd.txt and cwe399_cgd.txt. Could you provide these two files in the correct format?

@VulDeePecker
Collaborator

Can you explain the details of the problem?

@hungryfoolou
Author

hungryfoolou commented Aug 3, 2018

For example, the second line of cwe119_cgd.txt is:
ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )
It seems to be missing a left parenthesis, so I guess the file might be wrong.

@VulDeePecker
Collaborator

Thanks. We use the commercial product Checkmarx to extract the program slices, which are then assembled into code gadgets. The problem you mentioned in some code gadgets is caused by the imperfect results of Checkmarx. For example, "ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )" comes from the following code, from which Checkmarx extracts only line 440 and misses lines 439 and 441.

439 if( unzGetCurrentFileInfo( file, p_fileInfo, psz_fileName,
440 ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )
441 != UNZ_OK )
442 {
443 msg_Warn( p_this, "can't get info about file in zip" );
444 return VLC_EGENERIC;
445 }
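
Partial statements of this kind can be spotted mechanically. The following is a minimal sketch, not part of VulDeePecker or Checkmarx, that flags gadget lines whose round brackets do not balance. The helper names `paren_balance` and `flag_partial_statements` are hypothetical, and the check is deliberately naive: it assumes one statement fragment per line and does not skip brackets inside string literals or comments.

```python
def paren_balance(line):
    """Return (unmatched_open, unmatched_close) for round brackets in one line."""
    open_count = 0
    unmatched_close = 0
    for ch in line:
        if ch == "(":
            open_count += 1
        elif ch == ")":
            if open_count > 0:
                open_count -= 1      # closes an earlier "(" on the same line
            else:
                unmatched_close += 1  # ")" with no "(" before it on this line
    return open_count, unmatched_close


def flag_partial_statements(gadget_lines):
    """Return the lines whose round brackets are not balanced."""
    return [line for line in gadget_lines if paren_balance(line) != (0, 0)]


# Two lines taken from the example discussed in this thread.
gadget = [
    "ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )",
    'msg_Warn( p_this, "can\'t get info about file in zip" );',
]
print(flag_partial_statements(gadget))
```

A flagged line only indicates that the slice stops in the middle of a statement; it does not say whether the gadget is still usable for training.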

@hungryfoolou
Author

Thank you. I would like to ask a question: is each token converted to its own vector, or are all tokens of a code gadget converted to a single vector?
In Section III-E (Step III: Transforming code gadgets into vectors) of your paper "VulDeePecker: A Deep Learning-Based System for Vulnerability Detection", the sentence **This tool is based on the idea of distributed representation, which maps a token to an integer that is then converted to a fixed-length vector [43]** seems to mean that each token is converted to a vector.
But the next sentence, **Since code gadgets may have different numbers of tokens, the corresponding vectors may have different lengths**, seems to mean that all tokens of a code gadget are converted to a single vector.
Could you explain which of the two is the case?

@dengelt

dengelt commented Sep 10, 2018

@hungryfoolou each token is converted to a vector, which means for each code gadget you will get a vector of vectors.

The size of each token-vector is fixed (these come from the word2vec tool), while the number of token-vectors in each gadget naturally depends on how many tokens the gadget contains. The resulting sequences of token-vectors are padded or truncated to a common length, as described in the paper.
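
To make the "vector of vectors" shape concrete, here is a minimal sketch of padding or truncating a gadget's token-vectors to a fixed number of tokens. It is an illustration, not the authors' code: the function name `pad_or_truncate`, the parameters `max_tokens` and `dim`, and the choice to pad and truncate at the end are all assumptions. The paper should be consulted for where padding and truncation are actually applied.

```python
def pad_or_truncate(token_vectors, max_tokens, dim):
    """Return exactly max_tokens vectors of length dim.

    Longer sequences are cut at the end; shorter ones are filled with
    zero vectors at the end (an assumption made for this sketch).
    """
    fixed = [list(vec) for vec in token_vectors[:max_tokens]]
    missing = max_tokens - len(fixed)
    fixed.extend([0.0] * dim for _ in range(missing))
    return fixed


# Toy values: a gadget with 3 tokens, each mapped to a 2-dimensional vector.
short_gadget = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
long_gadget = [[float(i), float(i)] for i in range(7)]

print(len(pad_or_truncate(short_gadget, max_tokens=5, dim=2)))
print(len(pad_or_truncate(long_gadget, max_tokens=5, dim=2)))
```

After this step every gadget has the same shape, `max_tokens` rows by `dim` columns, which is what a fixed-size network input requires.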

@hungryfoolou
Author

@dengelt I highly appreciate your help, thanks.

@jumormt

jumormt commented Oct 23, 2018

> @hungryfoolou each token is converted to a vector, which means for each code gadget you will get a vector of vectors.
>
> The size of each token-vector is fixed (these come from the word2vec tool), while the number of token-vectors in each gadget naturally depends on how many tokens the gadget contains. The resulting sequences of token-vectors are padded or truncated to a common length, as described in the paper.

What is the length of each token-vector? The paper says that it maps a token to an integer that is then converted to a fixed-length vector. Does that mean that each token-vector is represented by its index in the token list?

@dengelt

dengelt commented Oct 23, 2018

> What is the length of each token-vector?

I don't think this is specified in the paper, or at least I couldn't find it. It would be interesting if @VulDeePecker could tell us which value they used.

> Does it mean that each token-vector is represented by its index in the token list?

If you have a list of token-vectors and you know the index of each token, then you can convert your token to its vector representation in this way.
But in the word2vec implementation they reference, you can map your token strings directly to their vector representations, as this is implemented as a dictionary in Python.
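
The two lookup routes can be sketched with plain Python containers. This is an illustration only: the vocabulary, the two-dimensional vectors, and the names `vocab`, `embedding_table`, and `token_to_vector` are made up for the example and do not come from the paper or from any word2vec library.

```python
# Toy vocabulary: token -> integer index (the "token to integer" step).
tokens = ["if", "(", "x", ")"]
vocab = {tok: i for i, tok in enumerate(tokens)}

# Toy embedding table: row i holds the fixed-length vector of token i.
# Real vectors would be learned by word2vec; these values are invented.
embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Direct route: token string -> vector, skipping the explicit index.
token_to_vector = {tok: embedding_table[i] for tok, i in vocab.items()}


def encode_via_index(gadget_tokens):
    """Token -> index -> vector."""
    return [embedding_table[vocab[tok]] for tok in gadget_tokens]


def encode_via_dict(gadget_tokens):
    """Token -> vector through a dictionary."""
    return [token_to_vector[tok] for tok in gadget_tokens]


gadget_tokens = ["x", "(", ")"]
print(encode_via_index(gadget_tokens) == encode_via_dict(gadget_tokens))
```

Both routes give the same sequence of token-vectors. The integer index only selects a row; the row itself, which word2vec learns, is what represents the token.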

@jumormt

jumormt commented Oct 23, 2018

> > What is the length of each token-vector?
>
> I don't think this is specified in the paper, or at least I couldn't find it. It would be interesting if @VulDeePecker could tell us which value they used.
>
> > Does it mean that each token-vector is represented by its index in the token list?
>
> If you have a list of token-vectors and you know the index of each token, then you can convert your token to its vector representation in this way.
> But in the word2vec implementation they reference, you can map your token strings directly to their vector representations, as this is implemented as a dictionary in Python.

I totally agree with you. It would be interesting if @VulDeePecker could tell us which value they used. If only an index were used, there would be no need for word2vec. But I guess the index is close to the answer, because I cannot see another way to map a token to an integer, as mentioned in Section III-E, Step III.2 (encoding the symbolic representations into vectors).

@yanyinzhao

> Thanks. We use the commercial product Checkmarx to extract the program slices, which are then assembled into code gadgets. The problem you mentioned in some code gadgets is caused by the imperfect results of Checkmarx. For example, "ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )" comes from the following code, from which Checkmarx extracts only line 440 and misses lines 439 and 441.
>
> 439 if( unzGetCurrentFileInfo( file, p_fileInfo, psz_fileName,
> 440 ZIP_FILENAME_LEN, NULL, 0, NULL, 0 )
> 441 != UNZ_OK )
> 442 {
> 443 msg_Warn( p_this, "can't get info about file in zip" );
> 444 return VLC_EGENERIC;
> 445 }

Because of the imperfect results of Checkmarx, the extracted code gadgets are malformed. @VulDeePecker, may I ask whether you used these malformed code gadgets (as provided in cwe119_cgd.txt) to generate the final results? If so, can the final results be trusted?
