Extract individual (natural-language) words from source code

PASimmons/code-words

Code Words

Get a handle on unfamiliar code by extracting and visualising the natural language programmers used when writing it.

Board Game Example

An example generated from a multiplayer board game written in Java.

Usage

<language>-code <source-file-or-directory>* | code-to-words -k <keyword-file> ... -s <stop-word-file> ... | wordcloud -o <output-file>.png

E.g.

java-code project/src/ | code-to-words -k java-keywords -s cargo-cult-java-stop-words | wordcloud -o project.png

Keyword files and stop-word files must contain a single word per line.
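A word list can be sanity-checked before use. The function below is a minimal sketch, not part of this project: `check_word_list` is a hypothetical helper name, and it assumes blank lines are harmless, which the tools themselves may or may not tolerate.

```shell
# check_word_list (hypothetical helper): succeed only if no line of the
# file "$1" holds more than one whitespace-separated word.
# Assumption: blank lines are tolerated; offending lines are reported.
check_word_list() {
    awk 'NF > 1 { printf "%s:%d: more than one word: %s\n", FILENAME, NR, $0; bad = 1 }
         END { exit bad + 0 }' "$1"
}
```

For example, `check_word_list my-stop-words && echo ok` would print `ok` only for a file with at most one word on each line (`my-stop-words` is a placeholder file name).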

Words listed in keyword files are filtered out after identifiers have been extracted from the source code but before any further processing.

Words listed in stop-word files are filtered out after the identifiers have been split into separate words at underscores or camel-case boundaries and normalised to lowercase.
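The order of the two filters can be illustrated with a rough sketch. This is not the code-to-words implementation: `split_identifiers` and `drop_listed` are hypothetical names, the input is assumed to be one identifier per line, keyword matching is assumed to be an exact whole-line match, and only lowercase-to-uppercase transitions are treated as camel-case boundaries. It relies on GNU sed for `-E`.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names and matching rules are assumptions.

# Split identifiers (one per line on stdin) into lowercase words, one per line.
split_identifiers() {
    sed -E 's/([a-z0-9])([A-Z])/\1 \2/g' |  # mark camel-case boundaries with a space
        tr '_ ' '\n\n' |                     # break at underscores and marked boundaries
        tr '[:upper:]' '[:lower:]' |         # normalise to lowercase
        grep -v '^$'                         # drop empty lines left by repeated separators
}

# Drop every input line that exactly matches a line of the word-list file "$1".
drop_listed() {
    grep -v -x -F -f "$1"
}

keywords=$(mktemp)
stop_words=$(mktemp)
printf '%s\n' public void return > "$keywords"
printf '%s\n' get set > "$stop_words"

# Keywords are removed from whole identifiers first; stop words are removed
# only after splitting and lowercasing.
printf '%s\n' public void getPlayerName return player_name MAX_SCORE |
    drop_listed "$keywords" |
    split_identifiers |
    drop_listed "$stop_words"
# prints, one word per line: player name player name max score

rm -f "$keywords" "$stop_words"
```

Because the stop-word filter runs after splitting, `get` is removed from `getPlayerName` even though the full identifier is not in any list.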

The wordcloud command has the following options:

  • -o output-file: output file name (image type is determined from the extension)
  • -s widthxheight: size (width and height) of the output image

Languages supported

  • C: c-code
    • c-keywords: most C keywords
    • c-primitive-type-keywords: ignores basic C types (int, char, etc.)
  • C++: c++-code
    • c++-keywords: most C++ keywords
    • c-primitive-type-keywords: ignores basic C types (int, char, etc.)
  • Haskell: haskell-code
    • haskell-keywords
  • HTML: html-text
    • No stop-word file is provided. Stop-word files for various natural languages can be found on the web.
  • Java: java-code
    • java-keywords: most keywords
    • java-primitive-type-keywords: ignores primitive types
    • cargo-cult-java-stop-words: ignores get, set, bean etc. Use with the -s flag.
  • JavaScript: javascript-code
    • javascript-keywords: ignores keywords and reserved words (from ECMA-262 Edition 3)
    • java-primitive-type-keywords: ignores primitive types
    • nodejs-globals-keywords: ignores node.js globals
  • Python: python-code
    • python-keywords: most keywords
  • Ruby: ruby-code
    • ruby-keywords
  • Scala: scala-code
    • scala-keywords
  • PHP: php-code
    • php-keywords: shows some keywords that may be the result of poor programming practice.
    • php-strict-keywords: ignores all keywords
  • Smalltalk: smalltalk-code
    • smalltalk-keywords: ignores keywords

Examples

Example visualisations of various applications are in the examples/ directory.

Dependencies

To extract text from source code:

  • Bash
  • GNU sed
  • Grep
  • Awk

To extract text from HTML:

  • w3m

To visualise the results:

  • Java 1.6

It should work on any desktop Linux. It does not yet work on macOS unless you install the GNU command-line tools.

To compile the Java wordcloud generator:

  • JDK 1.6
  • GNU Make
