Amazon Comprehend Developer Guide

Amazon Comprehend: Developer Guide
Copyright © 2022 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

What is Amazon Comprehend? ... 1
  Amazon Comprehend insights ... 1
  Amazon Comprehend Custom ... 2
  Document clustering (topic modeling) ... 2
  Examples ... 2
  Benefits ... 2
  Amazon Comprehend pricing ... 3
  Are you a first-time user of Amazon Comprehend? ... 3
How it works ... 4
  Insights ... 4
    Entities ... 5
    Events ... 6
    Key phrases ... 11
    Dominant language ... 12
    Sentiment ... 16
    Targeted sentiment ... 17
    Syntax analysis ... 28
  Amazon Comprehend Custom ... 31
  Topic modeling ... 31
  Document processing modes ... 33
    Single-document processing ... 34
    Multiple document synchronous processing ... 34
    Asynchronous batch processing ... 36
Supported languages ... 37
  Supported languages ... 37
  Languages supported by Amazon Comprehend features ... 38
Setting up ... 39
  Sign up for an AWS account ... 39
  Create an administrative user ... 39
  Set up the AWS CLI ... 40
  Grant programmatic access ... 40
Getting started ... 42
Using the console ... 43
  Real-time analysis ... 43
    Entities ... 44
    Key phrases ... 44
    Language ... 45
    Personally identifiable information (PII) ... 46
    Sentiment ... 48
    Targeted sentiment ... 49
    Syntax ... 50
  Analysis jobs (console) ... 51
Using the API ... 54
  Real-time analysis (API) ... 54
    Detecting the dominant language ... 54
    Detecting named entities ... 57
    Detecting key phrases ... 59
    Determining sentiment ... 61
    Real-time analysis for targeted sentiment ... 64
    Detecting syntax ... 65
    Real-time batch APIs ... 69
  Async analysis jobs (API) ... 73
    Amazon Comprehend insights ... 74
    Targeted sentiment ... 78
    Event detection ... 79
    Topic modeling ... 82
Personally identifiable information (PII) ... 89
  Detecting PII entities ... 89
    Locate PII entities ... 89
    Redact PII entities ... 90
    PII universal entity types ... 90
    Country-specific PII entity types ... 92
  Labeling PII entities ... 94
  Real-time analysis (Console) ... 94
    Offsets ... 46
    Labels ... 47
  Async analysis jobs (Console) ... 96
  Real-time analysis (API) ... 97
    Locating PII real-time entities (API) ... 98
    Labeling PII real-time entities (API) ... 98
  Async analysis jobs (API) ... 99
    Locating PII entities ... 99
    Redacting PII entities ... 103
Document processing ... 106
  Inputs for real-time analysis ... 106
    Plain text documents ... 106
    Semi-structured documents ... 106
    Image files and scanned PDF files ... 107
    Amazon Textract output ... 107
    Maximum document sizes ... 107
    Errors in semi-structured documents ... 107
  Inputs for async analysis ... 108
    Plain text documents ... 108
    Semi-structured documents ... 109
    Image files and scanned PDF files ... 109
    Amazon Textract output JSON files ... 109
  Setting text extraction options ... 110
  Best practices for images ... 110
Custom classification ... 112
  Preparing training data ... 112
    Multi-class mode ... 113
    Multi-label mode ... 114
  Training classification models ... 117
    Train custom classifiers (console) ... 117
    Train custom classifiers (API) ... 119
    Test the training data ... 122
    Metrics ... 122
  Running real-time analysis ... 128
    Real-time analysis (console) ... 128
    Real-time analysis (API) ... 130
    Outputs for real-time analysis ... 134
  Running async analysis jobs ... 135
    Input file formats ... 135
    Analysis jobs (console) ... 136
    Analysis jobs (API) ... 137
    Outputs for analysis jobs ... 140
Custom entity recognition ... 144
  Preparing the training data ... 144
    When to use annotations vs entity lists ... 145
    Entity lists ... 146
    Annotations ... 147
  Training custom recognizers ... 157
    Train custom recognizers (console) ... 158
    Train custom recognizers (API) ... 162
    Metrics ... 164
  Running real-time analysis ... 167
    Real-time analysis (console) ... 167
    Real-time analysis (API) ... 169
    Outputs for real-time analysis ... 170
  Running async analysis jobs ... 175
    Analysis jobs (console) ... 175
    Analysis jobs (API) ... 176
    Outputs for analysis jobs ... 180
Managing custom models ... 184
  Model versioning with Amazon Comprehend ... 184
  Copying custom models between AWS accounts ... 186
    Sharing a custom model ... 187
    Importing a custom model ... 193
Managing endpoints ... 198
  Endpoints overview ... 198
  Using endpoints ... 199
  Monitoring endpoints ... 199
  Updating endpoints ... 202
  Using Trusted Advisor ... 203
    Amazon Comprehend underutilized endpoints ... 203
    Amazon Comprehend endpoint access risk ... 204
  Deleting endpoints ... 205
  Auto scaling with endpoints ... 206
    Target tracking ... 206
    Scheduled scaling ... 209
Tagging ... 212
  Tagging a new resource ... 212
  Viewing, editing, and deleting tags ... 213
Security ... 215
  Data protection ... 215
    KMS encryption in Amazon Comprehend ... 216
    Cross-service confused deputy prevention ... 218
    Using a Virtual Private Cloud (VPC) ... 220
    VPC endpoints (AWS PrivateLink) ... 223
  Identity and Access Management ... 225
    Audience ... 225
    Authenticating with identities ... 225
    Managing access using policies ... 228
    How Amazon Comprehend works with IAM ... 229
    Identity-based policy examples ... 234
    AWS managed policies ... 242
    Troubleshooting ... 245
  Logging Amazon Comprehend API calls with AWS CloudTrail ... 246
    Amazon Comprehend information in CloudTrail ... 246
    Examples: Amazon Comprehend log file entries ... 248
  Compliance validation ... 255
  Resilience ... 255
  Infrastructure security ... 256
Guidelines and quotas ... 257
  Supported Regions ... 257
  General guidelines and quotas ... 257
    Synchronous operations ... 257
    Throttling for single transactions ... 258
    Multiple document operations ... 258
    Concurrent active asynchronous jobs ... 258
    Asynchronous jobs ... 258
  Insights ... 259
    Targeted sentiment ... 259
    Language detection ... 259
    Events ... 259
  Topic modeling ... 260
  Custom analysis ... 260
    Inputs for custom analysis ... 260
    Document classification ... 262
    Entity recognition ... 263
Tutorials ... 265
  Analyzing insights from reviews ... 265
    Prerequisites ... 266
    Step 1: Adding documents to Amazon S3 ... 268
    Step 2: (CLI only) creating an IAM role ... 271
    Step 3: Running analysis jobs ... 273
    Step 4: Preparing the output ... 276
    Step 5: Visualizing the output ... 284
  Using S3 object Lambda access points for PII ... 288
    Controlling access to documents with PII ... 289
    Redacting PII from documents ... 290
  Analyzing text with OpenSearch ... 292
API reference ... 293
Document history ... 294
AWS glossary ... 302

What is Amazon Comprehend?

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.

You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition.
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input.

Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature. For more information, see Languages supported in Amazon Comprehend (p. 37). Amazon Comprehend's Dominant language (p. 12) capability can examine documents and determine the dominant language for a far wider selection of languages.

Topics
• Amazon Comprehend insights (p. 1)
• Amazon Comprehend Custom (p. 2)
• Document clustering (topic modeling) (p. 2)
• Examples (p. 2)
• Benefits (p. 2)
• Amazon Comprehend pricing (p. 3)
• Are you a first-time user of Amazon Comprehend? (p. 3)

Amazon Comprehend insights

Amazon Comprehend uses a pre-trained model to examine and analyze a document or set of documents to gather insights about it. This model is continuously trained on a large body of text so that there is no need for you to provide training data.

Amazon Comprehend analyzes the following types of insights:
• Entities – References to the names of people, places, items, and locations contained in a document.
• Key phrases – Phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score.
• Personally Identifiable Information (PII) – Personal data that can identify an individual, such as an address, bank account number, or phone number.
• Language – The dominant language of a document.
• Sentiment – The dominant sentiment of a document, which can be positive, neutral, negative, or mixed.
• Targeted sentiment – The sentiments associated with specific entities in a document. The sentiment for each entity occurrence can be positive, negative, neutral, or mixed.
• Syntax – The parts of speech for each word in the document.
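As an illustration of two of the insight types listed above, the following minimal sketch detects the dominant language of a document and then passes that language code to entity detection. It assumes the AWS SDK for Python (boto3) and configured credentials. The helper names make_client and detect_language_and_entities and the Region us-east-1 are choices made for this example, not part of Amazon Comprehend.

```python
def make_client(region_name="us-east-1"):
    """Create a Comprehend client. The Region is an assumption; use your own."""
    import boto3  # imported lazily so the helper below works with any client object
    return boto3.client("comprehend", region_name=region_name)


def detect_language_and_entities(client, text):
    """Return (language_code, [(entity_type, entity_text), ...]) for one document."""
    # DetectDominantLanguage returns candidate languages with confidence scores.
    languages = client.detect_dominant_language(Text=text)["Languages"]
    language_code = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]

    # DetectEntities requires a language code. Entity detection supports fewer
    # languages than language detection, so this call can fail for a detected
    # language that the entities feature does not support.
    response = client.detect_entities(Text=text, LanguageCode=language_code)
    return language_code, [(e["Type"], e["Text"]) for e in response["Entities"]]
```

To try it, call detect_language_and_entities(make_client(), "your document text"). Taking the client as a parameter is a design choice for this sketch so that the function can be exercised with a stub object in tests.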
For more information, see Insights (p. 4).

Amazon Comprehend Custom

You can customize Amazon Comprehend for your specific requirements without the skillset required to build machine learning-based NLP solutions. Using automatic machine learning, or AutoML, Amazon Comprehend Custom builds customized NLP models on your behalf, using data you already have.

Custom classification – Create custom classification models (classifiers) to organize your documents into your own categories.

Custom entity recognition – Create custom entity recognition models (recognizers) that can analyze text for your specific terms and noun-based phrases.

For more information, see Amazon Comprehend Custom (p. 31).

Document clustering (topic modeling)

You can also use Amazon Comprehend to examine a corpus of documents to organize them based on similar keywords within them. Document clustering (topic modeling) is useful to organize a large corpus of documents into topics or clusters that are similar based on word frequency.

For more information, see Topic modeling (p. 31).

Examples

The following examples show how you might use the Amazon Comprehend operations in your applications.

Example 1: Find documents about a subject

Find the documents about a particular subject using Amazon Comprehend topic modeling. Scan a set of documents to determine the topics discussed, and to find the documents associated with each topic. You can specify the number of topics that Amazon Comprehend should return from the document set.

Example 2: Find out how customers feel about your products

If your company publishes a catalog, let Amazon Comprehend tell you what customers think of your products. Send each customer comment to the DetectSentiment operation and it will tell you whether customers feel positive, negative, neutral, or mixed about a product.
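Example 2 might look like the following minimal sketch, which sends each comment to DetectSentiment and counts the results. It assumes the AWS SDK for Python (boto3), where a client can be created with boto3.client("comprehend"). The function name tally_sentiment and the default language code "en" are assumptions made for this illustration.

```python
from collections import Counter


def tally_sentiment(client, comments, language_code="en"):
    """Count how many comments fall into each sentiment category."""
    counts = Counter()
    for comment in comments:
        # DetectSentiment analyzes one document per call and returns the
        # dominant sentiment: POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
        response = client.detect_sentiment(Text=comment, LanguageCode=language_code)
        counts[response["Sentiment"]] += 1
    return counts
```

This sketch makes one request per comment, which suits small workloads. For large sets of comments, an asynchronous analysis job is likely a better fit.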
Example 3: Discover what matters to your customers Use Amazon Comprehend topic modeling to discover the topics that your customers are talking about on your forums and message boards, then use entity detection to determine the people, places, and things that they associate with the topic. Finally, use sentiment analysis to determine how your customers feel about a topic. Benefits Some of the benefits of using Amazon Comprehend include: 2 Amazon Comprehend Developer Guide Amazon Comprehend pricing • Integrate powerful natural language processing into your apps – Amazon Comprehend removes the complexity of building text analysis capabilities into your applications by making powerful and accurate natural language processing available with a simple API. You don't need textual analysis expertise to take advantage of the insights that Amazon Comprehend produces. • Deep learning based natural language processing – Amazon Comprehend uses deep learning technology to accurately analyze text. Our models are constantly trained with new data across multiple domains to improve accuracy. • Scalable natural language processing – Amazon Comprehend enables you to analyze millions of documents so that you can discover the insights that they contain. • Integrated with other AWS services – Amazon Comprehend is designed to work seamlessly with other AWS services like Amazon S3, AWS KMS, and AWS Lambda. Store your documents in Amazon S3, or analyze real-time data with Kinesis Data Firehose. Support for AWS Identity and Access Management (IAM) makes it easy to securely control access to Amazon Comprehend operations. Using IAM, you can create and manage AWS users and groups to grant the appropriate access to your developers and end users. • Encryption of output results and volume data – Amazon S3 already enables you to encrypt your input documents, and Amazon Comprehend extends this even farther. 
By using your own KMS key, you can encrypt not only the output results of your job, but also the data on the storage volume attached to the compute instance that processes the analysis job. The result is significantly enhanced security.
• Low cost – With Amazon Comprehend, there are no minimum fees or upfront commitments. You pay for the documents that you analyze and the custom models that you train.

Amazon Comprehend pricing

With Amazon Comprehend, you pay only for the resources that you use. If you are a new AWS customer, you can get started with Amazon Comprehend for free. For more information, see AWS free usage tier.

There is a usage charge for running real-time or asynchronous analysis jobs. You pay to train custom models, and you pay for custom model management. For real-time requests using custom models, you pay for the endpoint from the time that you start your endpoint until you delete the endpoint. For the rates and additional detailed information, see http://aws.amazon.com/comprehend/pricing.

Are you a first-time user of Amazon Comprehend?

If you are a first-time user of Amazon Comprehend, we recommend that you read the following sections in order:

1. How it works (p. 4) – This section introduces Amazon Comprehend concepts.
2. Setting up (p. 39) – In this section, you create an account and set up the AWS CLI.
3. Getting started with Amazon Comprehend (p. 42) – In this section, you run an Amazon Comprehend analysis job.
4. Tutorial: Analyzing insights from customer reviews with Amazon Comprehend (p. 265) – In this section, you perform sentiment and entities analysis and visualize the results.
5. Amazon Comprehend API Reference – Reference documentation for Amazon Comprehend operations.

AWS provides the following resources for learning about the Amazon Comprehend service:

• The AWS Machine Learning Blog includes useful articles about Amazon Comprehend.
• Amazon Comprehend Resources provides useful videos and tutorials about Amazon Comprehend.
How it works

Amazon Comprehend uses a pre-trained model to gather insights about a document or a set of documents. This model is continuously trained on a large body of text so that there is no need for you to provide training data.

You can use Amazon Comprehend to build your own custom models for custom classification and custom entity recognition.

Amazon Comprehend provides topic modeling using a built-in model. Topic modeling examines a corpus of documents and organizes the documents based on similar keywords within them.

Amazon Comprehend provides synchronous and asynchronous document processing modes. Use synchronous mode for processing one document or a batch of up to 25 documents. Use an asynchronous job to process a large number of documents.

Amazon Comprehend works with AWS Key Management Service (AWS KMS) to provide enhanced encryption for your data. For more information, see KMS encryption in Amazon Comprehend (p. 216).

Key concepts

• Insights (p. 4)
• Amazon Comprehend Custom (p. 31)
• Topic modeling (p. 31)
• Document processing modes (p. 33)

Insights

Amazon Comprehend can analyze a document or set of documents to gather insights about it. Some of the insights that Amazon Comprehend develops about a document include:

• Entities (p. 5) – Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document.
• Events (p. 6) – Amazon Comprehend detects specific types of events and related details.
• Key phrases (p. 11) – Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score.
• Personally identifiable information (PII) (p. 89) – Amazon Comprehend analyzes documents to detect personal data that identify an individual, such as an address, bank account number, or phone number.
• Dominant language (p. 12) – Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can identify 100 languages.
• Sentiment (p. 16) – Amazon Comprehend determines the dominant sentiment of a document. Sentiment can be positive, neutral, negative, or mixed.
• Targeted Sentiment (p. 17) – Amazon Comprehend determines the sentiment of specific entities mentioned in a document. The sentiment of each mention can be positive, neutral, negative, or mixed.
• Syntax analysis (p. 28) – Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun.
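To make the syntax analysis insight in the list above concrete, here is a minimal Python sketch that turns a DetectSyntax response into (word, tag) pairs for the sentence "It is raining today in Seattle." It assumes the AWS SDK for Python (boto3) call `detect_syntax`, whose response carries a `SyntaxTokens` list with a `PartOfSpeech` tag per token. The helper `parts_of_speech` and the `StubSyntaxClient` class are hypothetical, and the stub's tags are hand-written for illustration rather than captured service output.

```python
def parts_of_speech(client, text, language_code="en"):
    """Return (word, part-of-speech tag) pairs for `text`.

    `client` is any object exposing a boto3-style `detect_syntax` method.
    """
    response = client.detect_syntax(Text=text, LanguageCode=language_code)
    return [
        (token["Text"], token["PartOfSpeech"]["Tag"])
        for token in response["SyntaxTokens"]
    ]


class StubSyntaxClient:
    """Hypothetical offline stand-in for the real client (illustration only)."""

    # Hand-written tags for the example sentence; not real service output.
    _TAGS = {
        "It": "PRON",
        "is": "AUX",
        "raining": "VERB",
        "today": "NOUN",
        "in": "ADP",
        "Seattle": "PROPN",
    }

    def detect_syntax(self, Text, LanguageCode):
        words = Text.rstrip(".").split()
        return {
            "SyntaxTokens": [
                {"Text": word, "PartOfSpeech": {"Tag": self._TAGS[word]}}
                for word in words
            ]
        }


if __name__ == "__main__":
    # With credentials configured, the stub could be replaced by:
    #   import boto3
    #   client = boto3.client("comprehend")
    sentence = "It is raining today in Seattle."
    for word, tag in parts_of_speech(StubSyntaxClient(), sentence):
        print(word, tag)
```

The other insights in the list are exposed through similarly shaped SDK operations that take text plus a language code; check the API reference for each operation's exact response fields before relying on them.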