Amazon Comprehend is a service that uses natural language processing (NLP) to extract insights from the content of documents. Comprehend takes UTF-8 text as input and recognizes entities, key phrases, the dominant language, and sentiment.
In this lab, we will take the text previously extracted with Textract and apply Comprehend to it in order to retrieve entities.
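As a quick illustration of what Comprehend offers beyond entity detection, a couple of boto3 calls are enough to extract key phrases and sentiment from a string. This is just a sketch; the sample text below is our own and not part of the lab:

import boto3

comprehend = boto3.client('comprehend')
text = "Neil A. Armstrong was the first person to walk on the Moon."

# Two of the insights Comprehend can extract from UTF-8 text
print(comprehend.detect_key_phrases(Text=text, LanguageCode='en')['KeyPhrases'])
print(comprehend.detect_sentiment(Text=text, LanguageCode='en')['Sentiment'])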
In step 6, we call the DetectEntities API from Comprehend in the Lambda function we created (documentTextract-xyz).
The function needs permissions to invoke Comprehend. Let's update the role that was automatically created along with the function. Click the documentTextract-xyz function, open the Configuration tab, choose Permissions, and click textract-index-stack-LambdaExecutionRole-xyz:
In the new window, click Attach policies, search for ComprehendReadOnly, check it, and click Attach policy:
Back on the Lambda function screen, refresh the page; you should now see Amazon Comprehend in the Permissions tab. Our Lambda function is now able to call the Comprehend APIs:
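If you prefer to script this instead of using the console, the same managed policy can be attached with a few lines of boto3. This is only a sketch: the role name below is the example one from the stack, and yours will have a different suffix:

import boto3

iam = boto3.client('iam')

# Attach the ComprehendReadOnly managed policy to the Lambda execution role
iam.attach_role_policy(
    RoleName='textract-index-stack-LambdaExecutionRole-xyz',
    PolicyArn='arn:aws:iam::aws:policy/ComprehendReadOnly'
)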
In the source code of your Lambda function (index.py), add the following line after import boto3:
comprehend = boto3.client('comprehend')
And add the following at the end of the handler function:
# DetectEntities accepts at most 5,000 bytes of UTF-8 text,
# so keep only the beginning of the page
text = page[:5000]

# Detect the language of the document
languages = comprehend.detect_dominant_language(
    Text=text
)
# Keep the language with the highest confidence score
dominant_language = sorted(languages['Languages'], key=lambda k: k['Score'], reverse=True)[0]['LanguageCode']

# Fall back to English if the detected language is not supported by DetectEntities
if dominant_language not in ['en', 'es', 'fr', 'de', 'it', 'pt']:
    # optional: call Amazon Translate to get the text in English
    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.translate_text
    dominant_language = "en"

# Extract the entities from the text
detected_entities = comprehend.detect_entities(
    Text=text,
    LanguageCode=dominant_language
)

# Keep only a few entity types, with a confidence score above 0.9
selected_entity_types = ["ORGANIZATION", "PERSON", "LOCATION", "DATE"]
selected_entities = [x for x in detected_entities['Entities'] if x['Score'] > 0.9 and x['Type'] in selected_entity_types]
print(selected_entities)
Click Deploy. A few details to notice here:
- Before calling the DetectEntities API, we need to be aware of a few limits of the service:
  - First, the text must contain fewer than 5,000 bytes of UTF-8 encoded characters. That's why we truncate the text to 5,000 characters on the first line.
  - Second, the text must be in a supported language, for example German ("de"), English ("en") or Spanish ("es"). Comprehend provides an API to detect the dominant language of a document. We could combine this with another service, Amazon Translate, to translate the text into a supported language when needed (see the sketch after this list).
- We then call the DetectEntities API to retrieve the entities in the document. The API returns a list of entities: the text, its position (offsets), a confidence score, and a type (PERSON, LOCATION, ORGANIZATION, DATE, ...). In our code, we keep only a subset of the types, and only the entities with a score higher than 0.9 out of 1. You can remove that filter if you want to see everything Comprehend returns.
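Here is a minimal sketch of that optional Translate fallback, combined with a byte-accurate truncation. The helper name prepare_text_for_comprehend and its logic are ours, not part of the lab code:

import boto3

comprehend = boto3.client('comprehend')
translate = boto3.client('translate')

SUPPORTED_LANGUAGES = ['en', 'es', 'fr', 'de', 'it', 'pt']

def prepare_text_for_comprehend(page):
    # Truncate on 5,000 UTF-8 *bytes* rather than characters; errors='ignore'
    # drops a multi-byte character that would be cut in half at the boundary
    text = page.encode('utf-8')[:5000].decode('utf-8', errors='ignore')

    languages = comprehend.detect_dominant_language(Text=text)
    dominant = max(languages['Languages'], key=lambda k: k['Score'])['LanguageCode']

    if dominant not in SUPPORTED_LANGUAGES:
        # 'auto' lets Translate detect the source language itself
        result = translate.translate_text(
            Text=text,
            SourceLanguageCode='auto',
            TargetLanguageCode='en'
        )
        # The translation may be longer than the original, so truncate again
        text = result['TranslatedText'].encode('utf-8')[:5000].decode('utf-8', errors='ignore')
        dominant = 'en'

    return text, dominant

Note that the Lambda execution role would also need permission to call translate:TranslateText for this fallback to work.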
Proceed as in the previous lab: upload a document to the workshop-textract-xyz S3 bucket and open the CloudWatch logs. Observe the result:
[{'Score': 0.9834662079811096, 'Type': 'PERSON', 'Text': 'Neil A. Armstrong', 'BeginOffset': 140, 'EndOffset': 156},
 {'Score': 0.9590600728988647, 'Type': 'LOCATION', 'Text': 'Houston, Texas', 'BeginOffset': 4530, 'EndOffset': 4544},
 ...]
In Lab 3, we will index both the content text and the entities in Elasticsearch.