NLP is a Datagrok package for natural language processing. The package provides integration with AWS Translate, a neural machine translation service, and extends Datagrok with info panels for text files.
Natural Language Processing, or NLP for short, is a branch of artificial intelligence that builds a bridge between computers and human languages. This field has many applications, including:
- language identification
- machine translation
- sentiment analysis
- text summarization
- topic modeling
- entity extraction
It all starts with extracting text. This is a building block for other, more
complex tasks. Due to the high demand, it is essential to support as many
popular text file formats as possible. The platform comes with a built-in
file browser for easy file management. The package extends it by processing
text from pdf, doc, docx, odt, and other text formats.
Determining the language of a document is an important preprocessing step for many language-related tasks. Automatic language detection may be part of applications that perform machine translation or semantic analysis. Datagrok's language identification is powered by Google's Compact Language Detector v3 (CLD3) and supports over 100 languages. As with text extraction, this functionality is used in the Translation info panel.
The package creates a new info panel for text files. It uses the AWS Translate service, which supports over 70 languages.
To translate a text, navigate to the file browser and select one of the demo files (see the texts
folder). Alternatively, open your personal folder and drag and drop your file onto the platform. Now, whenever you click
on the file, the context panel on the right suggests translating it.
The language is identified automatically, but you can always change it manually. The default target language is English, so be sure to choose another option if the original text is in English.
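The translation step described above can be sketched in Python. This is only an illustration, not the package's own code (the package itself is written in JavaScript). It assumes a client object that exposes the `translate_text` call of the AWS SDK for Python (boto3); the helper name `translate` and its parameters are hypothetical.

```python
def translate(client, text, source_lang, target_lang="en"):
    """Translate text with an AWS Translate client supplied by the caller.

    `client` is assumed to behave like boto3.client("translate").
    English is the default target, mirroring the panel's default.
    """
    if source_lang == target_lang:
        # Same situation the panel warns about: an English text with the
        # default English target needs a different target language.
        raise ValueError("source and target languages are the same; choose another target")
    response = client.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )
    return response["TranslatedText"]


# Usage (requires the boto3 package and AWS credentials; not run here):
#   import boto3
#   client = boto3.client("translate")
#   print(translate(client, "Bonjour le monde", "fr"))
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object when no AWS credentials are available.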
Texts are increasingly analyzed for readability. Readability scores take into account parameters such as the average number of words per sentence, the average number of syllables per word, and the percentage of long words.
The Text Statistics info panel calculates two common formulas:
- Flesch reading-ease test for English
- LIX formula for other languages
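As a rough illustration of the two formulas, here is a minimal Python sketch. It uses the standard published definitions (Flesch reading ease: 206.835 - 1.015 × words per sentence - 84.6 × syllables per word; LIX: words per sentence + 100 × share of words longer than six letters). The sentence splitting and syllable counting are simplified heuristics chosen for this example, and all function names are hypothetical; the package's own implementation may differ.

```python
import re


def split_sentences(text):
    # Naive split on ., ! and ?; real tokenizers handle abbreviations better.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text):
    # Runs of letters only (digits and underscores are excluded).
    return re.findall(r"[^\W\d_]+", text)


def count_syllables(word):
    # Heuristic for English: count groups of consecutive vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    sentences, words = split_sentences(text), split_words(text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))


def lix(text):
    sentences, words = split_sentences(text), split_words(text)
    if not sentences or not words:
        return None
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)
```

Higher Flesch values indicate easier text, while higher LIX values indicate harder text, so the two scores are not directly comparable.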
The package provides tools for finding similar texts.
Open a table and select a cell in a text column. If the Text quality is not yet set, specify it in the properties of the selected column:
- Right-click the column and select Column Properties... A dialog opens
- Press + in Tags and add the quality tag with the value Text. The column tooltip now contains quality: Text
Select any cell of the column and expand Similar in the Context Panel. You will get a set of similar elements of the column. Search results are separated with a line, and common words are in bold.
Explore the search results in the Similar panel:
- Click to navigate directly to the grid cell containing the text of interest
- Right-click to add a word to filters
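The idea of ranking texts by shared words and marking those words in bold can be sketched in Python as follows. The source does not say how the package measures similarity, so the Jaccard word overlap used here is an assumption made for illustration, and every function name is hypothetical.

```python
import re

WORD = r"[^\W\d_]+"  # runs of letters


def words_of(text):
    # Lowercased set of distinct words in the text.
    return set(re.findall(WORD, text.lower()))


def jaccard(a, b):
    # Share of distinct words that the two texts have in common.
    wa, wb = words_of(a), words_of(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def most_similar(query, texts, top=3):
    # Score every other text against the query, best matches first.
    scored = [(jaccard(query, t), t) for t in texts if t != query]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]


def highlight_common(query, text):
    # Wrap words shared with the query in ** ** (markdown-style bold).
    common = words_of(query)

    def mark(match):
        word = match.group(0)
        return "**" + word + "**" if word.lower() in common else word

    return re.sub(WORD, mark, text)
```

For example, `highlight_common("red apple", "An apple a day")` marks only the shared word `apple`.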
User Meeting 9: Natural Language Processing
The package demonstrates two ways of developing info panels for Datagrok: with panel scripts and with JavaScript panel functions.
To write a panel script in any of the languages supported by the platform, add the panel
tag and specify the conditions under which the panel is shown (in the condition
header parameter):
# name: language detection
# language: python
# input: file file {semtype: text} [a text to analyze]
# output: string language {semtype: lang} [detected language]
# tags: nlp, panel
# condition: file.isfile && file.size < 1e6 && supportedext(file.name)
The scripts folder contains more examples of such panel scripts, which are written in Python and work specifically on text files.
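To make the logic of such a condition concrete, the check can be written out as a small Python sketch. This is an illustration only: on the platform the condition is evaluated from the script header, not by this code. The extension list is an assumption built from the formats named earlier (pdf, doc, docx, odt), with `.txt` added as a guess, and the function names are hypothetical.

```python
import os

# Assumed list of supported extensions; the real list may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx", ".odt", ".txt"}
MAX_SIZE_BYTES = 1e6  # same limit as in the condition above


def supported_ext(filename):
    # Compare the extension case-insensitively.
    return os.path.splitext(filename)[1].lower() in SUPPORTED_EXTENSIONS


def should_show_panel(path):
    # Mirrors the three parts of the condition: is a file, small enough,
    # and has a supported extension.
    return (os.path.isfile(path)
            and os.path.getsize(path) < MAX_SIZE_BYTES
            and supported_ext(path))
```

Keeping the size limit in the condition means the panel is never offered for files that would be too slow to process.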
A different approach is used to add an info panel from a JavaScript file. The panel function should be properly annotated to return a widget. A simplified example is shown below:
//name: Translation
//tags: panel, widgets
//input: file textfile
//output: widget result
//condition: isTextFile(textfile)
export function translationPanel(textfile) {
  return new DG.Widget(ui.divText("Lost in Translation"));
}
Refer to src/package.js to see the panel's complete code.
See also: