Amazon SageMaker Introduction - Try Machine Learning with Built-in Algorithms

Takahiro Iwasa

I held a study meeting to introduce Amazon SageMaker, titled “Amazon SageMaker Introduction - Try Machine Learning with Built-in Algorithms”.

I would like to share the content of the presentation, in the hope that it helps you get to know SageMaker. You can find the example code used in this post in my GitHub repository.

Prerequisites

Participants are expected to:

  • Be interested in learning about Amazon SageMaker.
  • Have basic knowledge of machine learning and AWS.

Goals in this post:

  • Provide an overview of machine learning and SageMaker ecosystem.
  • Demonstrate supervised machine learning using a SageMaker built-in algorithm.

Machine Learning Overview

Supervised Learning

Supervised learning is often used for classification and regression tasks. It requires labeled training data, and labeling the data may impose an additional burden.

Unsupervised Learning

Unsupervised learning is often used for tasks such as clustering and dimensionality reduction. It requires only training data, without labels. In most cases, this method does not provide a clear explanation for its inference results.

Reinforcement Learning

Reinforcement learning is not purely a statistical analysis method, as it may also incorporate psychological approaches. It needs training data and rewards. Rewards, or weights, are assigned based on the algorithm’s prediction results, allowing the algorithm to prioritize actions that lead to greater rewards.

Deep Learning

Deep learning is a machine learning method that has gained significant attention in recent years. It uses multiple layers of neural networks to achieve high levels of accuracy, which can often surpass human prediction results.

Deep learning can often select features automatically. Despite producing accurate predictions, it may not provide a clear explanation for its inference results, and training deep learning models often requires significant computing resources.

Neural Network

One indicator of neural network complexity is the total number of weights (connections between layers). In the example below, the complexity is 24.

4 (Features) * 3 (Hidden layer 1) = 12
3 (Hidden layer 1) * 3 (Hidden layer 2) = 9
3 (Hidden layer 2) * 1 (Output) = 3
12 + 9 + 3 = 24

Value                  Description
y                      Result
x                      Features
w                      Weights
h1, h2                 Hidden layers
h1[0], h1[1], … h2[2]  Hidden units
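
As a quick illustration (not part of the original presentation), the calculation above can be reproduced in Python by multiplying the sizes of adjacent layers; the layer sizes below are assumptions matching the example.

# Count the connections (weights) between adjacent, fully connected layers
layer_sizes = [4, 3, 3, 1]  # features, hidden layer 1, hidden layer 2, output

complexity = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(complexity)  # 12 + 9 + 3 = 24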

Features

Rows are called samples or data points, and columns are called features. Feature engineering, which includes extracting, converting, and scaling features, is important for improving inference accuracy. Preprocessing removes the unnecessary information that raw data often contains. Various methods are used for feature engineering, such as one-hot encoding, rescaling, and principal component analysis (PCA), as sketched below.
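
The following is a minimal sketch (not from the original presentation) of these techniques using pandas and scikit-learn; the column names and values are made up for illustration.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with one categorical and two numeric features
df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red'],
    'width': [1.0, 2.5, 4.0, 1.5],
    'height': [10.0, 20.0, 30.0, 12.0],
})

# One-hot encoding of the categorical column
encoded = pd.get_dummies(df, columns=['color'])

# Rescaling all columns to zero mean and unit variance
scaled = StandardScaler().fit_transform(encoded)

# PCA for dimensionality reduction, keeping two principal components
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)  # (4, 2)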

Model Evaluation

Method (k-Fold Cross Validation)

The following diagram describes k-Fold Cross Validation. Green is training data and blue is test data.

Each fold uses a different combination of training and test data, and an accuracy score can be calculated for each fold. The average accuracy across all folds can be taken as the model's overall accuracy. If the accuracy varies heavily between folds, the model may not generalize well, which could indicate that it is overfitting to specific datasets.
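
A minimal sketch (not from the original presentation) using scikit-learn is shown below; a k-NN classifier and the iris dataset are assumed.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross validation of a k-NN classifier on the iris dataset
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)

print(scores)         # accuracy of each fold
print(scores.mean())  # average accuracy across folds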

Indicators (Confusion Matrix)

The following table is an example of binary classification.

                     Prediction
                     Positive              Negative
Result   Positive    True Positive (TP)    False Negative (FN)
         Negative    False Positive (FP)   True Negative (TN)

The following describes each indicator.

Accuracy
  Expression: (TP + TN) / (TP + TN + FP + FN)
  Description: Ratio of correct predictions.
  Problem: If a model predicts all samples as negative and 99 of 100 samples are actually negative, the accuracy is still 99%.

Precision
  Expression: TP / (TP + FP)
  Description: Ratio of positive predictions that are actually positive.
  Problem: Precision alone is not optimal when false negatives are important, such as in cancer diagnosis.

Recall
  Expression: TP / (TP + FN)
  Description: Ratio of actual positives that are predicted as positive.
  Problem: Recall alone is not optimal when false positives are important, for example when a thorough cancer examination is very expensive. If all samples are predicted as positive, recall is 100%.

F-score
  Expression: 2 * (Precision * Recall) / (Precision + Recall)
  Description: The harmonic mean of precision and recall.
  Problem: When all samples are true negatives, precision and recall are both zero and the expression causes a division-by-zero error.
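
As a minimal sketch (not from the original presentation), these indicators can be computed with scikit-learn; the label arrays below are made up for illustration.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted labels

print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # 2 * (Precision * Recall) / (Precision + Recall)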

Example of Machine Learning Workflow

A general workflow consists of generating example data, training a model, and deploying the model.

Learning Materials

SageMaker Introduction

SageMaker is a managed service for machine learning. AWS offers many services related to the machine learning workflow, and the SageMaker ecosystem has been advancing rapidly. SageMaker provides 17 built-in algorithms as of December 2021. Models trained outside SageMaker can also be used with AWS-provided containers, an approach called BYOM (Bring Your Own Model). Major frameworks such as TensorFlow and PyTorch are supported.

Workflow and Ecosystem

The following diagram describes AWS services related to machine learning workflow. For more information on each service, please refer to the bottom of this post.

Inference Endpoint

There are several endpoint types.

SageMaker Studio Setup

You can choose a SageMaker environment from the following options. This post uses SageMaker Studio.

Onboard to SageMaker Domain

To use SageMaker Studio, you must complete the onboarding process.

Choose Quick Setup. If you are creating a domain for a production environment, Custom Setup is recommended.

Choose any VPC.

Launch SageMaker Studio from the menu of the automatically created user.

SageMaker Studio will be launched.

AWS Design

Quick Setup

Custom Setup

Custom Setup allows network traffic to be confined within the AWS internal network by using VPC endpoints.

Playing with Built-in Algorithms

This post uses SageMaker’s built-in k-NN (k-Nearest Neighbor) algorithm with a well-known Iris dataset.

First, click the Notebook button at the center of the SageMaker Studio IDE.

Preparing Dataset

Create a cell with the following content and specify your own values for <YOUR_S3_BUCKET> and <YOUR_SAGEMAKER_ROLE>.

%env S3_DATASET_BUCKET=<YOUR_S3_BUCKET>
%env S3_DATASET_TRAIN=knn/input/iris_train.csv
%env S3_DATASET_TEST=knn/input/iris_test.csv
%env S3_TRAIN_OUTPUT=knn/output
%env SAGEMAKER_ROLE=<YOUR_SAGEMAKER_ROLE>

Create a cell with the following content. The “Python3 Data Science” image has these packages installed by default.

import os
import random
import string

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display
from sagemaker import image_uris
from sagemaker.deserializers import JSONDeserializer
from sagemaker.estimator import Estimator, Predictor
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn.model_selection import train_test_split

Define the following constants and variables used throughout the notebook.

# Define constants
CSV_PATH = './tmp/iris.csv'
S3_DATASET_BUCKET = os.getenv('S3_DATASET_BUCKET')
S3_DATASET_TRAIN = os.getenv('S3_DATASET_TRAIN')
S3_DATASET_TEST = os.getenv('S3_DATASET_TEST')
S3_TRAIN_OUTPUT = os.getenv('S3_TRAIN_OUTPUT')
SAGEMAKER_ROLE = os.getenv('SAGEMAKER_ROLE')
ESTIMATOR_INSTANCE_COUNT = 1
ESTIMATOR_INSTANCE_TYPE = 'ml.m5.large'
PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
PREDICTOR_ENDPOINT_NAME = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')

# Define variables
bucket = boto3.resource('s3').Bucket(S3_DATASET_BUCKET)
train_df = None
test_df = None
train_object_path = None
test_object_path = None
knn = None
predictor = None

Download the iris dataset from the AWS official amazon-sagemaker-examples repository. For more information about the iris dataset, please visit the official page.

# Download a sample csv
!mkdir -p tmp
!curl -o "$(pwd)/tmp/iris.csv" -L https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/hyperparameter_tuning/r_bring_your_own/iris.csv

After downloading, write the following code to load the CSV. SageMaker treats the first column of a CSV as the target label or class, so Species must be moved to the first column. The target labels must also be converted to integers.

def load_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Move the last label column to the first
    # See https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#cdf-csv-format
    df = df[['Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']]
    # Convert target string to int
    df['Species'] = df['Species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
    return df

You can create a scatter plot diagram with the following code.

def plot(df: pd.DataFrame) -> None:
    pd.plotting.scatter_matrix(df, figsize=(15, 15), c=df['Species'])
    plt.show()

In the scatter matrix, the x-axes are Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width from left to right, and the y-axes are Petal.Width, Petal.Length, Sepal.Width, Sepal.Length, and Species from bottom to top. Reading the plot, the species can likely be predicted from the features because the data points are clearly separated.

Write the following code to upload the transformed CSV to the S3 bucket.

def upload_csv_to_s3(df: pd.DataFrame, object_path: str) -> str:
    filename = ''.join([random.choice(string.digits + string.ascii_lowercase) for i in range(10)])
    path = os.path.abspath(os.path.join('./tmp', filename))
    df.to_csv(path, header=False, index=False)
    # Change content-type because the default is binary/octet-stream
    bucket.upload_file(path, object_path, ExtraArgs={'ContentType': 'text/csv'})
    return f's3://{bucket.name}/{object_path}'

Finally, you can complete the above steps with the following code.

if __name__ == '__main__':
    # Prepare data
    df = load_csv(CSV_PATH)
    display(df)
    plot(df)
    train_df, test_df = train_test_split(df, shuffle=True, random_state=0)  # type: (pd.DataFrame, pd.DataFrame)

    train_object_path = upload_csv_to_s3(train_df, S3_DATASET_TRAIN)
    test_object_path = upload_csv_to_s3(test_df, S3_DATASET_TEST)

Training

Write the following code to configure an estimator.

The image_uri is the URI of the AWS-provided k-NN training container in ECR.

def get_estimator(**hyperparams) -> Estimator:
    estimator = Estimator(
        image_uri=image_uris.retrieve('knn', boto3.Session().region_name),  # AWS-provided container in ECR
        role=SAGEMAKER_ROLE,
        instance_count=ESTIMATOR_INSTANCE_COUNT,
        instance_type=ESTIMATOR_INSTANCE_TYPE,
        input_mode='Pipe',
        output_path=f's3://{S3_DATASET_BUCKET}/{S3_TRAIN_OUTPUT}',
        sagemaker_session=sagemaker.Session(),
    )
    hyperparams.update({'predictor_type': 'classifier'})
    estimator.set_hyperparameters(**hyperparams)
    return estimator

You can start training with the following code.

def train(estimator: Estimator, train_object_path: str, test_object_path: str) -> None:
    # Specify content-type because the default is application/x-recordio-protobuf
    train_input = TrainingInput(train_object_path, content_type='text/csv', input_mode='Pipe')
    test_input = TrainingInput(test_object_path, content_type='text/csv', input_mode='Pipe')
    estimator.fit({'train': train_input, 'test': test_input})


if __name__ == '__main__':
    knn = get_estimator(k=1, sample_size=1000)
    train(knn, train_object_path, test_object_path)

The training channel name for built-in algorithms is fixed to train. If the training job is also given a test channel, your ML model will be evaluated against the test data as well.

You can use Pipe mode by specifying Pipe for the input_mode argument of TrainingInput, which streams data directly from the S3 bucket to the SageMaker instance.

Visit the official documentation for information on k-NN Hyperparameters.

After running the cell, you can see a training log like the following.

2022-01-08 13:38:34 Starting - Starting the training job...
2022-01-08 13:38:57 Starting - Launching requested ML instancesProfilerReport-1641649113: InProgress
......
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('accuracy', 0.9736842105263158)
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('macro_f_1.000', 0.97170347)

Inference

Write the following code to deploy the inference endpoint. The serializer and deserializer specify the request and response formats you want to use.

def deploy(estimator: Estimator) -> Predictor:
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=PREDICTOR_INSTANCE_TYPE,
        serializer=CSVSerializer(),
        deserializer=JSONDeserializer(),
        endpoint_name=PREDICTOR_ENDPOINT_NAME
    )

You can display an inference result as a table.

def validate(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
    rows = []

    for i, data in test_df.iterrows():
        predict = predictor.predict(
            pd.DataFrame([data.drop('Species')]).to_csv(header=False, index=False),
            initial_args={'ContentType': 'text/csv'}
        )
        predicted_label = predict['predictions'][0]['predicted_label']

        row = data.tolist()
        row.append(predicted_label)
        row.append(data['Species'] == predicted_label)
        rows.extend([row])

    return pd.DataFrame(rows, columns=('Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Prediction', 'Result'))

The test data, with Prediction and Result columns appended, will be displayed as a table.

Run inference with the following code.

if __name__ == '__main__':
    predictor = deploy(knn)
    predictions = validate(predictor, test_df)
    display(predictions)

Deleting Resources

Delete the resources with the following code. The real-time inference endpoint will continue to incur costs until it is deleted.

def delete_model(predictor: Predictor) -> None:
    try:
        predictor.delete_model()
        print('Deleted the model')
    except BaseException as e:
        print(e)


def delete_endpoint(predictor: Predictor) -> None:
    try:
        predictor.delete_endpoint(delete_endpoint_config=True)
        print(f'Deleted {predictor.endpoint_name}')
    except BaseException as e:
        print(e)


if __name__ == '__main__':
    delete_model(predictor)
    delete_endpoint(predictor)

Conclusion

SageMaker provides developers with many features for building machine learning systems, enabling us to focus more on the important parts of our applications.

I hope you will find this post useful.

References for SageMaker Ecosystem

IDE

CI/CD

Jupyter Notebook

Preprocessing

AutoML

Endpoint

Others

Takahiro Iwasa

Software Developer at KAKEHASHI Inc.
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Now, building a new prescription data collection platform at KAKEHASHI Inc. Japan AWS Top Engineers 2020-2023.