Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System calls classification #25

Open
KahinaBELAIDI opened this issue Feb 20, 2019 · 0 comments
Open

System calls classification #25

KahinaBELAIDI opened this issue Feb 20, 2019 · 0 comments

Comments

@KahinaBELAIDI
Copy link

KahinaBELAIDI commented Feb 20, 2019

Hello, I'm trying to do task for system calls classification. The code bellow is inspired from a text classification project. My system calls are represented as sequences of integers between 1 and 340. The error I got is: valueError: input arrays should have the same number of samples as target arrays. Find 1 input samples and 0 target samples.
I don't know how to resolve it !!

`

       df = pd.read_csv("data.txt") 
       df_test = pd.read_csv("validation.txt")
      #split arrays into train and test data (cross validation)
        train_text, test_text, train_y, test_y = train_test_split(df,df,test_size = 0.2)
    MAX_NB_WORDS = 5700
   # get the raw text data
    texts_train = train_text.astype(str)
    texts_test = test_text.astype(str)
    # finally, vectorize the text samples into a 2D integer tensor
   tokenizer = Tokenizer(nb_words=MAX_NB_WORDS, char_level=False)
   tokenizer.fit_on_texts(texts_train)
   sequences = tokenizer.texts_to_sequences(texts_train)
   sequences_test = tokenizer.texts_to_sequences(texts_test)

  word_index = tokenizer.word_index
  type(tokenizer.word_index), len(tokenizer.word_index)
  index_to_word = dict((i, w) for w, i in tokenizer.word_index.items()) 
 " ".join([index_to_word[i] for i in sequences[0]])
  seq_lens = [len(s) for s in sequences]

  MAX_SEQUENCE_LENGTH = 100
 # pad sequences with 0s
 x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH) 
 x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
 #print('Shape of data train:', x_train.shape)  #cela a donnée (1,100)
 #print('Shape of data test tensor:', x_test.shape)
 y_train = train_y
 y_test = test_y
 print('Shape of label tensor:', y_train.shape)
 EMBEDDING_DIM = 32
 N_CLASSES = 2

y_train = keras.utils.to_categorical( y_train , N_CLASSES )
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='float32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True)
embedded_sequences = embedding_layer(sequence_input)

average = GlobalAveragePooling1D()(embedded_sequences)
 predictions = Dense(N_CLASSES, activation='softmax')(average)

 model = Model(sequence_input, predictions)
 model.compile(loss='categorical_crossentropy',
          optimizer='adam', metrics=['acc'])
 model.fit(x_train, y_train, validation_split=0.1,
      nb_epoch=10, batch_size=1)
  output_test = model.predict(x_test)
  print("test auc:", roc_auc_score(y_test,output_test[:,1]))

`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant