NLP

NLP is a Datagrok package for natural language processing. The package provides integration with AWS Translate, a neural machine translation service, and extends Datagrok with info panels for text files.

Natural Language Processing, or NLP for short, is a branch of artificial intelligence that builds a bridge between computers and human languages. This field has many applications, including:

Text extraction

It all starts with extracting text. This is a building block for other, more complex tasks. Due to the high demand, it is essential to support as many popular text file formats as possible. The platform comes with a built-in file browser for easy file management. The package extends it by processing text from pdf, doc, docx, odt, and other text formats.

Extract text from PDF

Language identification

Determining the language of a document is an important preprocessing step for many language-related tasks. Automatic language detection may be part of applications that perform machine translation or semantic analysis. Datagrok's language identification is powered by Google's Compact Language Detector v3 (CLD3) and supports over 100 languages. As with text extraction, this functionality is used in the Translation info panel.

Neural machine translation

The package adds a new info panel for text files. It uses the AWS Translate service, which supports over 70 languages.

To translate a text, open the file browser and select one of the demo files (see the texts folder). Alternatively, open your personal folder and drag and drop your own file into the platform. From then on, whenever you click the file, the context panel on the right offers to translate it.

Translate text files

The source language is identified automatically, but you can always change it manually. The default target language is English, so be sure to choose another option if the original text is already in English.
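The behavior described above (a detected source language, English as the default target, and the need to pick another target for English texts) can be sketched in Python as follows. This is a minimal illustration, not the package's actual code: the `translate` helper and its parameter names are hypothetical, and the client is passed in so the function can work with any object exposing the `translate_text` method of the boto3 Translate client.

```python
DEFAULT_TARGET = "en"  # default target language, as stated in the text above


def translate(client, text, source_lang, target_lang=DEFAULT_TARGET):
    """Translate `text` using an AWS Translate client.

    `translate` is a hypothetical helper name used for illustration only.
    `client` is expected to behave like boto3.client("translate").
    """
    if source_lang == target_lang:
        # Translating a text into its own language is pointless;
        # the caller has to choose a different target language.
        raise ValueError("source and target languages are the same")
    response = client.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    return response["TranslatedText"]
```

With boto3 installed and AWS credentials configured, a call might look like `translate(boto3.client("translate"), "Bonjour", "fr")`.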

Text statistics

Texts are increasingly analyzed for readability. Readability scores take into account parameters such as the average number of words per sentence, the average number of syllables per word, and the percentage of long words.

The Text Statistics info panel calculates two common formulas:

Calculate text statistics
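To illustrate how a readability score combines words per sentence and syllables per word, here is a minimal sketch of one well-known formula, Flesch reading ease. It is chosen purely as an example: which formulas the panel computes is not assumed here, and the syllable counter is a rough vowel-group heuristic rather than a dictionary-based method. The function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Short sentences built from short words score high, while long sentences full of polysyllabic words score low.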

Search

The package provides tools for finding similar texts.

Open a table and select a cell in a text column. If the column's quality is not already set to Text, set it in the column properties:

  • Right-click the column and select Column Properties... A dialog opens.
  • Click + under Tags and add the quality tag with the value Text. The column's tooltip now shows quality: Text.

Add quality text

Select any cell in the column and expand the Similar pane in the Context Panel. It lists the column's entries that are similar to the selected text. Search results are separated by a line, and common words are shown in bold:

Similar panel
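The idea of ranking texts by the words they share with the selected cell can be sketched as follows. This is an assumption-laden illustration: the similarity measure the package uses is not stated here, so the sketch substitutes plain Jaccard similarity over word sets, and `tokenize` and `find_similar` are hypothetical names.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def find_similar(query, texts, top_n=3):
    """Rank `texts` by Jaccard similarity to `query`.

    Returns (score, text, common_words) tuples, best match first.
    Texts that share no words with the query are left out.
    """
    query_words = tokenize(query)
    results = []
    for text in texts:
        words = tokenize(text)
        union = query_words | words
        common = query_words & words
        score = len(common) / len(union) if union else 0.0
        if score > 0:
            results.append((score, text, sorted(common)))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_n]
```

The `common_words` element of each result corresponds to the words a UI could highlight in bold.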

Explore the obtained search results in the Similar panel:

  • Click to navigate directly to the grid cell containing the text of interest
  • Right-click to add a word to filters

Navigate and filters

Videos

User Meeting 9: Natural Language Processing

Developer notes

The package demonstrates two ways of developing info panels for Datagrok: with panel scripts and with JavaScript panel functions.

To write a panel script in any of the languages supported by the platform, add the panel tag and specify the conditions under which the panel is shown (in the condition header parameter):

# name: language detection
# language: python
# input: file file {semType: text} [a text to analyze]
# output: string language {semType: lang} [detected language]
# tags: nlp, panel
# condition: file.isFile && file.size < 1e6 && supportedExt(file.name)

The scripts folder contains more examples of such panel scripts, which are written in Python and work specifically on text files.
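The condition in the header above restricts the panel to regular files under a size limit with a supported extension. The same check can be sketched in plain Python for clarity. This is an illustration only: `supported_ext`, `panel_applies`, and the extension set are hypothetical, and the set lists just the formats named earlier in this document, so the package's real list may differ.

```python
import os

# Hypothetical extension set, limited to the formats named in this document.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx", ".odt"}
MAX_SIZE_BYTES = 1e6  # size limit taken from the condition header


def supported_ext(filename):
    """Return True if the file name ends with a supported extension."""
    _, ext = os.path.splitext(filename)
    return ext.lower() in SUPPORTED_EXTENSIONS


def panel_applies(path):
    """Mirror the condition: a regular file, under the size limit, supported type."""
    return (
        os.path.isfile(path)
        and os.path.getsize(path) < MAX_SIZE_BYTES
        and supported_ext(path)
    )
```

Checking the cheap conditions first means the extension test only runs for files that exist and are small enough to process.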

A different approach is used to add an info panel from a JavaScript file. The panel function should be properly annotated to return a widget. A simplified example is shown below:

//name: Translation
//tags: panel, widgets
//input: file textfile
//output: widget result
//condition: isTextFile(textfile)
export function translationPanel(textfile) {
    return new DG.Widget(ui.divText("Lost in Translation"));
}

Refer to src/package.js to see the panel's complete code.

See also: