
SageMaker Ground Truth

The Ground Truth problem

To train a machine learning model using a supervised approach, you need a large, high-quality, labeled dataset. "Ground truth" is the ideal, provably correct answer to a specific question: it is the expected result, and it is used in statistical models to prove or disprove research hypotheses. "Ground truthing" refers to the process of gathering this proper, objective (provable) data.

SageMaker Ground Truth helps you build high-quality training datasets for your machine learning models. With Ground Truth, you can use workers from either Amazon Mechanical Turk, a vendor company that you choose, or an internal, private workforce along with machine learning to enable you to create a labeled dataset. You can use the labeled dataset output from Ground Truth to train your own models. You can also use the output as a training dataset for an Amazon SageMaker model.

Workers

A workforce is the group of workers that you have selected to label your dataset. You can choose either the Amazon Mechanical Turk workforce, a vendor-managed workforce, or you can create your own private workforce to label or review your dataset. Whichever workforce type you choose, Amazon SageMaker takes care of sending tasks to workers.

When you use a private workforce, you also create work teams: groups of workers from your workforce that are assigned to specific jobs, either Amazon SageMaker Ground Truth labeling jobs or Amazon Augmented AI human review tasks. You can have multiple work teams and can assign one or more work teams to each job.
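
In this lab we create the work team through the console, but for reference, the same thing can be done programmatically with the boto3 `create_workteam` API. The sketch below only builds the request parameters; the team name, Cognito user pool, user group, and client ID are all placeholders you would replace with your own values.

```python
# Sketch: request parameters for sagemaker.create_workteam() (all IDs are placeholders).
def build_workteam_request(team_name, user_pool, user_group, client_id):
    """Build the parameter dict for creating a private work team."""
    return {
        "WorkteamName": team_name,
        "MemberDefinitions": [
            {
                "CognitoMemberDefinition": {
                    "UserPool": user_pool,    # Cognito user pool backing the private workforce
                    "UserGroup": user_group,  # the group containing your labelers
                    "ClientId": client_id,
                }
            }
        ],
        "Description": "Doctors labeling mammography views",
    }

params = build_workteam_request(
    "mammography-labelers", "us-east-1_EXAMPLE", "doctors", "exampleclientid123"
)
# To actually create the team (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").create_workteam(**params)
```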

Step 1: Creating a private workforce

The first step is to define who is going to classify our images. We have a team of doctors able to properly classify and label our mammography pictures, so we will create a private team and add our doctors to it.

Log in to the AWS Console:

Image

Search for SageMaker and open Ground Truth:

Image

Open Labeling workforces and select Private:

Image

Click on Create private team, give your team a cool name, add some people, give an organization name, an email for contact, and click Create private team.

When inviting new workers, don't forget to add your own e-mail address. You will need to receive the invitation e-mail in order to continue with the lab.

Image

You should see a successful green message like the one below.

Image

After a few minutes, each member of the list will receive an e-mail like this:

Image

Follow the steps in the e-mail to change your password. This will be your first login screen:

Image

Step 2: Defining a labeling job

Now that we have a workforce, we need to upload the images we need to classify to Amazon S3. After that, we will create a labeling job and assign it to a private team.

Ground Truth expects the data to be classified to be stored in S3, so let's upload it.

Part 1: Uploading the images

There is a zip file that can be downloaded from here. Download it locally and unzip it.

Go to the Amazon S3 Console and open the bucket whose name begins with mammography-workshop-files:

Image

Upload the mammography images you want to label, i.e., the ones you unzipped in the previous step:

Image

Record this bucket's name for the next part.
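
If you prefer scripting the upload instead of using the console, here is a minimal boto3 sketch. The bucket name, local folder, and file names below are placeholders for illustration.

```python
# Sketch: plan which image files go to S3, then upload with boto3 (hypothetical names).
def plan_uploads(filenames, bucket, prefix=""):
    """Map each image filename to the (bucket, key) pair it will be uploaded to."""
    return [
        (bucket, prefix + name)
        for name in sorted(filenames)
        if name.lower().endswith((".png", ".jpg", ".jpeg"))  # skip non-image files
    ]

plan = plan_uploads(["resize_RIGHT_CC_A_0290_1.jpg", "notes.txt"], "my-example-bucket")

# To actually upload (requires AWS credentials):
# import boto3, os
# s3 = boto3.client("s3")
# for bucket, key in plan:
#     s3.upload_file(os.path.join("./unzipped-images", key), bucket, key)
```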

Part 2: Creating a job

Open the SageMaker Console, go to Ground Truth and then Labeling jobs, then click on Create labeling job:

Image

Define a name for your job, something very cool like mammography-views.

A labeling job requires a manifest.json file containing the location of your images.

This file can be automatically generated by AWS. To do so, click on Create manifest file, paste the S3 path of the folder where we uploaded the mammography files in the previous step, click Create, and then Use this manifest.
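
Behind the scenes, the manifest is a JSON Lines file with one `source-ref` entry per image. If you ever need to build one yourself, a minimal sketch (bucket and file names are hypothetical):

```python
import json

def manifest_lines(bucket, keys):
    """One JSON object per image, as Ground Truth expects in an input manifest."""
    return [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]

lines = manifest_lines("my-example-bucket", ["resize_RIGHT_CC_A_0290_1.jpg"])
# Write it out with, e.g.:
# with open("dataset.manifest", "w") as f:
#     f.write("\n".join(lines))
```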

Image

Define a place to save the output results. In our case, we chose s3://<<bucket>>/output

In the IAM Role drop-down, choose Create a new role:

Image

If you've done everything correctly, you should see something like this:

Image

Scroll down to the Task type section and select Image Classification:

Image

Click Next.

In Workers, select Private, and find our team in Private Teams:

Image

Scroll down to reach Image classification labeling tool.

Enter a brief description of the task. For instance: "How would you classify this image?"

In the left description field, provide some guidance for the workers. For instance: "Classify the mammography images as Right-CC, Left-CC, Right-MLO, or Left-MLO. If the image is not a mammography, classify it as Not a Mammography".

Finally, create the 5 labels that your workers will use to classify the images:

  • Right-CC
  • Left-CC
  • Right-MLO
  • Left-MLO
  • Not a mammography
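
The order of this list matters: as the Results section shows, the output manifest stores each image's label as the integer index of its position in this list. A small sketch of that mapping:

```python
# Label order defines the integer class IDs written to the output manifest.
LABELS = ["Right-CC", "Left-CC", "Right-MLO", "Left-MLO", "Not a mammography"]

def class_name(index):
    """Translate an integer class ID from the output manifest back to its label."""
    return LABELS[index]
```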

When finished, you should see something like this:

Image

Finally, create the job by clicking the Create button:

Image

The In Progress state indicates that your job is active and waiting for your labeling team. So, let's continue!

Step 3: Classifying the images

It is time to classify!

Go back to the page you received by e-mail when you created the Labeling Workforce.

It might take a few minutes for the labeling job to appear. Keep refreshing the page until it does.

Image

Click on Start working. Images will be presented and you need to choose the correct classification for each one.
Click on a label and then on Submit, repeating until you've finished all the images.

Image

When you finish, a message will be displayed:

Image

The job status will change to Complete in the SageMaker Ground Truth console:

Image

Results: The output JSON

The output folder was set to s3://mammography-workshop-files-<<region>>-<<account id>>/output/. After the classification, new folders were created there, and inside <<job_name>>/manifests/output/ there is a file named output.manifest. Download it and let's analyze its lines.

{"source-ref":"s3://mammography-workshop-files-<<region>>-<<account-id>>/resize_00006585_009.png","mammography-views":4,"mammography-views-metadata":{"confidence":0.9,"job-name":"labeling-job/mammography-views","class-name":"Not a mammography","human-annotated":"yes","creation-date":"2020-02-01T14:47:00.585263","type":"groundtruth/image-classification"}}

{"source-ref":"s3://mammography-workshop-files-<<region>>-<<account-id>>/resize_RIGHT_CC_A_0290_1.jpg","mammography-views":0,"mammography-views-metadata":{"confidence":0.75,"job-name":"labeling-job/mammography-views","class-name":"Right-CC","human-annotated":"yes","creation-date":"2020-02-01T14:48:04.189444","type":"groundtruth/image-classification"}}

{"source-ref":"s3://mammography-workshop-files-<<region>>-<<account-id>>/resize_LEFT_CC_A_0296_1.jpg","mammography-views":1,"mammography-views-metadata":{"confidence":0.75,"job-name":"labeling-job/mammography-views","class-name":"Left-CC","human-annotated":"yes","creation-date":"2020-02-01T14:48:04.189469","type":"groundtruth/image-classification"}}

{"source-ref":"s3://mammography-workshop-files-<<region>>-<<account-id>>/resize_RIGHT_MLO_D_4501_1.jpg","mammography-views":2,"mammography-views-metadata":{"confidence":0.9,"job-name":"labeling-job/mammography-views","class-name":"Right-MLO","human-annotated":"yes","creation-date":"2020-02-01T14:45:57.449584","type":"groundtruth/image-classification"}}

{"source-ref":"s3://mammography-workshop-files-<<region>>-<<account-id>>/resize_LEFT_MLO_A_0431_1.jpg","mammography-views":3,"mammography-views-metadata":{"confidence":0.75,"job-name":"labeling-job/mammography-views","class-name":"Left-MLO","human-annotated":"yes","creation-date":"2020-02-01T14:48:04.189483","type":"groundtruth/image-classification"}}

Each line refers to one classified image: it shows the image's location in S3, its label (both as an integer class ID and as a class name), and other useful information such as a confidence score and the creation date.
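
Since each line is a standalone JSON object, the file is easy to process in Python. For instance, this sketch tallies how many images received each class name (the sample line below is taken from the output above, with a shortened bucket name):

```python
import json
from collections import Counter

def tally_labels(manifest_text, attribute="mammography-views"):
    """Count how many images received each class name in an output manifest."""
    counts = Counter()
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        counts[entry[f"{attribute}-metadata"]["class-name"]] += 1
    return counts

sample = '{"source-ref":"s3://bucket/resize_RIGHT_CC_A_0290_1.jpg","mammography-views":0,"mammography-views-metadata":{"confidence":0.75,"job-name":"labeling-job/mammography-views","class-name":"Right-CC","human-annotated":"yes","creation-date":"2020-02-01T14:48:04.189444","type":"groundtruth/image-classification"}}'
counts = tally_labels(sample)
```

The `attribute` parameter matches the labeling job name, since Ground Truth uses it as the key prefix in each line.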

Final considerations

The generated file can be used directly in SageMaker, but you can also consume it in Python notebooks: it is a JSON Lines file, so each line can be parsed as a standalone JSON object.

Since some of us are not real doctors, we may have misclassified some images, and using this file could harm the accuracy of our model. So we will not use this manifest during our training, but we will explain how to use it later on in this lab.

Let's go back to the lab by clicking here