Understanding Amazon SageMaker: Built-in Algorithms for Machine Learning

Takahiro Iwasa

Jan 24, 2022


Machine Learning SageMaker


Introduction

I held a study meeting titled “Amazon SageMaker Introduction - Try Machine Learning with Built-in Algorithms”.

In this blog post, I aim to share the presentation content, offering insights into Amazon SageMaker and its powerful capabilities. You can access the example code used in this post from my GitHub repository.

Prerequisites

Target Readers

This post is intended for readers who:

Are interested in Amazon SageMaker.
Have a basic understanding of machine learning and AWS.

Goals

Provide an overview of machine learning and the SageMaker ecosystem.
Demonstrate supervised machine learning using SageMaker’s built-in algorithms.

Machine Learning Overview

Supervised Learning

Supervised learning is commonly used for classification and regression tasks. It relies on labeled training data. However, the need for labeled data can increase preparation efforts.

Unsupervised Learning

Unsupervised learning focuses on tasks like clustering and dimensionality reduction. It uses unlabeled data, but because there are no labels to check against, the results are often harder to interpret.

Reinforcement Learning

Reinforcement learning draws on statistical methods and ideas from psychology: the learner comes to prefer actions that yield higher rewards. Training requires a reward mechanism and appropriate datasets.

Deep Learning

Deep learning uses multi-layered neural networks and can achieve remarkable accuracy, in some tasks surpassing human performance. While feature selection is largely automated, the interpretability of results may be limited. It also demands substantial computational resources.

Neural Networks

The complexity of a neural network can be measured by the number of weights (connections between units). Below is an example where the complexity is calculated as 24:

4 (Features) × 3 (Hidden Layer 1) = 12
3 (Hidden Layer 1) × 3 (Hidden Layer 2) = 9
3 (Hidden Layer 2) × 1 (Output) = 3
Total: 12 + 9 + 3 = 24
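
The same count can be sketched in a few lines of Python. The helper name count_weights is hypothetical, and like the calculation above it ignores bias terms.

```python
def count_weights(layer_sizes):
    """Count the weights between consecutive fully connected layers (biases ignored)."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 4 features -> 3 hidden units -> 3 hidden units -> 1 output
total = count_weights([4, 3, 3, 1])
print(total)  # 24
```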

Neural Network

Value	Description
y	Output (Result)
x	Feature
w	Weight
h1, h2	Hidden Layers
h1[0], h1[1], … h2[2]	Hidden Units

Features

Features represent the attributes of the data. Rows are referred to as samples or data points, while columns are features. Feature engineering, including scaling, encoding, and preprocessing, enhances accuracy.
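
As a rough illustration of two common feature engineering steps, the sketch below standardizes a numeric column and encodes string labels as integers. The function names are hypothetical, and libraries such as scikit-learn provide more complete versions of both steps.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale a feature column to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def encode_labels(labels):
    """Map each distinct label to an integer, in order of first appearance."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return [mapping[label] for label in labels], mapping


scaled = standardize([1.0, 2.0, 3.0])
encoded, mapping = encode_labels(['setosa', 'virginica', 'setosa'])
print(scaled)
print(encoded, mapping)
```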

Model Evaluation

k-Fold Cross Validation

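This method splits the data into k subsets (folds). Each fold is used once for testing while the remaining folds are used for training, which gives a more reliable estimate of how well the model generalizes and makes overfitting easier to detect.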

k-Fold Cross Validation
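
The splitting scheme can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it uses contiguous folds without shuffling, whereas in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError('k must be between 2 and n_samples')
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print('train:', train, 'test:', test)
```

Every sample lands in the test set exactly once, so the k scores can be averaged into a single estimate.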

Confusion Matrix

The confusion matrix provides metrics such as accuracy, precision, recall, and f-score, helping evaluate classification models comprehensively.

		Prediction
		Positive	Negative
Result	Positive	True Positive (TP)	False Negative (FN)
Result	Negative	False Positive (FP)	True Negative (TN)

Name	Expression	Description	Problem
Accuracy	(TP + TN) / (TP + TN + FP + FN)	Ratio of correct predictions	If there are 100 samples with 99 being negative and the model predicts all samples as negative, the accuracy is 99% even though no positive sample is detected.
Precision	TP / (TP + FP)	Ratio of positive predictions that are actually positive	Precision alone is not sufficient when false negatives are costly, such as in cancer diagnosis.
Recall	TP / (TP + FN)	Ratio of actually positive samples that are predicted as positive	Recall alone is not sufficient when false positives are costly, such as when a thorough cancer examination is very expensive. When all samples are predicted as positive, recall is 100%.
f-score	2 * (Precision * Recall) / (Precision + Recall)	The harmonic mean of precision and recall	When there are no true positives (for example, all samples are true negatives), the calculation fails with a division by zero.
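
The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses a hypothetical function name and returns None where a denominator would be zero, which is one possible way to handle the division-by-zero case.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and f-score from confusion-matrix counts.

    Ratios with a zero denominator are returned as None instead of raising.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f_score': f_score}


# 100 samples, 99 of them negative, and a model that predicts everything as negative
all_negative = classification_metrics(tp=0, fp=0, fn=1, tn=99)
print(all_negative)
```

In this case accuracy is 0.99 while recall is 0.0, which shows why accuracy alone can be misleading on imbalanced data.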

SageMaker Overview

Amazon SageMaker is a managed service for machine learning that streamlines the ML lifecycle. SageMaker supports 17 built-in algorithms as of Dec 2021, along with BYOM (Bring Your Own Model) capabilities.

Workflow and Ecosystem

The following diagram illustrates AWS services associated with the machine learning workflow. These services cover every stage, from data preparation to model deployment and monitoring.

SageMaker Ecosystem Part 1

SageMaker Ecosystem Part 2

Inference Endpoints

Amazon SageMaker offers several endpoint types to serve machine learning models based on specific requirements:

SageMaker Hosting Services: Provides persistent endpoints that remain active, similar to EC2 instances, ensuring minimal latency for real-time inference.
SageMaker Serverless Endpoints (Preview): A cost-efficient option with endpoints that incur charges only during use. They incur cold start latency when a request arrives after an idle period.
Asynchronous Inference: Ideal for batch processing or infrequent requests. SageMaker auto-scales the instance count to zero when there are no active requests, significantly reducing costs.

Each endpoint type caters to different use cases, enabling flexibility in deploying models efficiently.

SageMaker Studio

Amazon SageMaker offers several environments to suit different user requirements. This post primarily focuses on SageMaker Studio, a fully integrated development environment for machine learning. Below are the available environment options:

SageMaker Studio / RStudio on SageMaker:
- A comprehensive machine learning IDE.
- Includes tools like SageMaker JumpStart, ideal for learning and quick experimentation.
SageMaker Notebook Instances:
- Standalone Jupyter Notebook instances.
- Best suited for temporary or ad-hoc use by a small number of data scientists.
Local Environment + AWS SDK + SageMaker SDK:
- Allows developers to interact with SageMaker using their local setup.
- Useful for small-scale development requiring flexibility.
SageMaker Studio Lab:
- Free access to some SageMaker Studio features without needing an AWS account.
- Ideal for beginners exploring machine learning concepts.

Each option provides unique benefits, allowing users to choose the most suitable environment based on their project needs.

Onboard to SageMaker Domain

To use SageMaker Studio, you need to complete the onboarding process.

Step 1: Choose Setup Type

For a quick start, select Quick Setup.
For production environments requiring advanced configurations, choose Custom Setup.

Step 2: Select a VPC

During the onboarding process, choose the VPC that fits your setup.

Step 3: Launch SageMaker Studio

Use the menu of an automatically created user to launch SageMaker Studio.

Step 4: Access SageMaker Studio

Once launched, the SageMaker Studio interface will appear.

AWS Design

Quick Setup

The Quick Setup is an easy way to get started. Learn more in the official documentation: Quick Setup Documentation

Custom Setup

The Custom Setup option is designed for production environments, allowing network traffic to remain within AWS’s internal network using VPC Endpoints. Learn more: Custom Setup Documentation

Practical Example: Using SageMaker’s Built-in Algorithms

In this example, we use SageMaker’s k-NN algorithm with the popular Iris dataset.

To begin, click the Notebook button in the center of SageMaker Studio.

Preparing Dataset

Start by setting up your environment variables. Replace <YOUR_S3_BUCKET> and <YOUR_SAGEMAKER_ROLE> with your specific values.

%env S3_DATASET_BUCKET=<YOUR_S3_BUCKET>
%env S3_DATASET_TRAIN=knn/input/iris_train.csv
%env S3_DATASET_TEST=knn/input/iris_test.csv
%env S3_TRAIN_OUTPUT=knn/output
%env SAGEMAKER_ROLE=<YOUR_SAGEMAKER_ROLE>

Next, create a cell with the following Python imports. The Python3 Data Science instance comes pre-installed with these libraries.

import os
import random
import string

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display
from sagemaker import image_uris
from sagemaker.deserializers import JSONDeserializer
from sagemaker.estimator import Estimator, Predictor
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn.model_selection import train_test_split

Define constants and variables.

# Define constants
CSV_PATH = './tmp/iris.csv'
S3_DATASET_BUCKET = os.getenv('S3_DATASET_BUCKET')
S3_DATASET_TRAIN = os.getenv('S3_DATASET_TRAIN')
S3_DATASET_TEST = os.getenv('S3_DATASET_TEST')
S3_TRAIN_OUTPUT = os.getenv('S3_TRAIN_OUTPUT')
SAGEMAKER_ROLE = os.getenv('SAGEMAKER_ROLE')
ESTIMATOR_INSTANCE_COUNT = 1
ESTIMATOR_INSTANCE_TYPE = 'ml.m5.large'
PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
PREDICTOR_ENDPOINT_NAME = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')

# Define variables
bucket = boto3.resource('s3').Bucket(S3_DATASET_BUCKET)
train_df = None
test_df = None
train_object_path = None
test_object_path = None
knn = None
predictor = None

Download the Iris dataset from AWS’s SageMaker Examples repository. This dataset is widely used for demonstrating classification models due to its simplicity and well-defined structure. For additional details about the Iris dataset, including its attributes and applications, please refer to the official page.

!mkdir -p tmp
!curl -o "$(pwd)/tmp/iris.csv" -L https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/hyperparameter_tuning/r_bring_your_own/iris.csv

After downloading, use the following code to load and preprocess the CSV.

SageMaker requires the first column in the CSV to be the target label or class, so the Species column must be moved to the first position.
The target labels (species names) must also be converted to integers for compatibility.

Refer to the SageMaker documentation for more details: CSV Format Requirements

def load_csv(path: str) -> pd.DataFrame:
    # Load the CSV into a Pandas DataFrame
    df = pd.read_csv(path)
    # Move the label column ('Species') to the first position.
    # Copy so the assignment below works on an independent DataFrame.
    df = df[['Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']].copy()
    # Convert target labels ('Species') to integers
    df['Species'] = df['Species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
    return df

This function processes the Iris dataset to meet SageMaker’s requirements, preparing it for training with built-in algorithms.
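
Before uploading, it can help to confirm that the exported file really follows this layout. The sketch below is a quick local sanity check using only the standard library; the function name check_training_csv is hypothetical and it is not an official SageMaker validator.

```python
import csv
import io


def check_training_csv(text, n_features):
    """Return True if every row is '<integer label>,<feature>,...' with no header."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != n_features + 1:
            return False
        try:
            int(row[0])  # the label must parse as an integer
            for value in row[1:]:
                float(value)  # every feature must be numeric
        except ValueError:
            return False
    return True


good = '0,5.1,3.5,1.4,0.2\n2,6.3,3.3,6.0,2.5\n'
bad = 'Species,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width\nsetosa,5.1,3.5,1.4,0.2\n'
print(check_training_csv(good, n_features=4))  # True
print(check_training_csv(bad, n_features=4))   # False
```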

Visualize Dataset

Create a scatter plot of the dataset to understand its features and distributions.

def plot(df: pd.DataFrame) -> None:
    pd.plotting.scatter_matrix(df, figsize=(15, 15), c=df['Species'])
    plt.show()

The scatter plot generated provides the following axis mappings:

X-axis: Represents Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width from left to right.
Y-axis: Represents Petal.Width, Petal.Length, Sepal.Width, Sepal.Length, and Species from bottom to top.

By observing the plot, it is evident that species can likely be predicted from the features, as the data points are well-classified into distinct groups.

Upload Dataset to S3

Upload the preprocessed dataset to S3:

def upload_csv_to_s3(df: pd.DataFrame, object_path: str) -> str:
    filename = ''.join([random.choice(string.digits + string.ascii_lowercase) for i in range(10)])
    path = os.path.abspath(os.path.join('./tmp', filename))
    df.to_csv(path, header=False, index=False)
    # Change content-type because the default is binary/octet-stream
    bucket.upload_file(path, object_path, ExtraArgs={'ContentType': 'text/csv'})
    return f's3://{bucket.name}/{object_path}'

Execute Preprocessing Steps

if __name__ == '__main__':
    df = load_csv(CSV_PATH)
    display(df)
    plot(df)
    train_df, test_df = train_test_split(df, shuffle=True, random_state=0)
    train_object_path = upload_csv_to_s3(train_df, S3_DATASET_TRAIN)
    test_object_path = upload_csv_to_s3(test_df, S3_DATASET_TEST)

Training

Configure the k-NN estimator and start the training process:

def get_estimator(**hyperparams) -> Estimator:
    estimator = Estimator(
        image_uri=image_uris.retrieve('knn', boto3.Session().region_name),
        role=SAGEMAKER_ROLE,
        instance_count=ESTIMATOR_INSTANCE_COUNT,
        instance_type=ESTIMATOR_INSTANCE_TYPE,
        input_mode='Pipe',
        output_path=f's3://{S3_DATASET_BUCKET}/{S3_TRAIN_OUTPUT}',
        sagemaker_session=sagemaker.Session(),
    )
    hyperparams.update({'predictor_type': 'classifier'})
    estimator.set_hyperparameters(**hyperparams)
    return estimator


def train(estimator: Estimator, train_object_path: str, test_object_path: str) -> None:
    train_input = TrainingInput(train_object_path, content_type='text/csv', input_mode='Pipe')
    test_input = TrainingInput(test_object_path, content_type='text/csv', input_mode='Pipe')
    estimator.fit({'train': train_input, 'test': test_input})


if __name__ == '__main__':
    knn = get_estimator(k=1, sample_size=1000)
    train(knn, train_object_path, test_object_path)

ECR Container URI

The image_uri argument of Estimator specifies the ECR container URI of the k-NN training algorithm provided by AWS. You can retrieve the appropriate URI for your region with image_uris.retrieve(), as get_estimator() does.

For detailed information about the container URIs for built-in algorithms, refer to the official documentation: Amazon SageMaker Built-in Algorithm ECR URIs.

Channel Names

The training channel name for built-in algorithms in Amazon SageMaker is fixed to train; it is the key of the dictionary passed to estimator.fit(). If you also include a test channel when creating the training job, the trained model is automatically evaluated on the test data after training.

Using Pipe Mode

To stream data more efficiently, you can enable Pipe mode by setting the input_mode parameter to "Pipe" in the Estimator and TrainingInput definitions. Pipe mode streams data directly from S3 to the training instance, so the job can start without first downloading the entire dataset and needs less local storage.

k-NN Hyperparameters

The k-NN algorithm includes several configurable hyperparameters. For complete details on their usage and effects, consult the official documentation: k-NN Hyperparameters.
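
To make the role of k concrete, here is a minimal pure-Python sketch of k-nearest-neighbor classification. It is only an illustration of the idea and not how SageMaker implements the algorithm; the function name and the four sample rows are illustrative.

```python
import math
from collections import Counter


def knn_predict(train_rows, query, k=1):
    """Classify `query` by majority vote among its k nearest training rows.

    Each training row is (label, features), mirroring the label-first CSV layout.
    """
    by_distance = sorted(train_rows, key=lambda row: math.dist(row[1], query))
    votes = Counter(label for label, _ in by_distance[:k])
    return votes.most_common(1)[0][0]


# Illustrative rows in (label, features) form
train_rows = [
    (0, (5.1, 3.5, 1.4, 0.2)),
    (0, (4.9, 3.0, 1.4, 0.2)),
    (1, (7.0, 3.2, 4.7, 1.4)),
    (2, (6.3, 3.3, 6.0, 2.5)),
]
print(knn_predict(train_rows, (5.0, 3.4, 1.5, 0.2), k=1))  # 0
```

With k=1, as in the training call above, the prediction is simply the label of the single closest training sample.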

Training Log

After starting the training job, you will observe logs similar to the following:

2022-01-08 13:38:34 Starting - Starting the training job...
2022-01-08 13:38:57 Starting - Launching requested ML instancesProfilerReport-1641649113: InProgress
......
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('accuracy', 0.9736842105263158)
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('macro_f_1.000', 0.97170347)

The log provides metrics like accuracy and macro F1 score, offering insights into the model’s performance on the test dataset.
Leveraging both the train and test channels during training ensures the built-in algorithm evaluates the model’s generalization ability automatically.
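
As a sanity check on the logged accuracy: assuming the dataset has 150 rows and train_test_split used its default 25% test share, the test set holds 38 rows, and the logged value is consistent with 37 correct predictions out of 38. These counts are inferred from the numbers and are not printed by the log itself.

```python
import math

n_rows = 150                       # assumed size of the Iris dataset
n_test = math.ceil(n_rows * 0.25)  # default test share, rounded up
n_train = n_rows - n_test
correct = 37                       # inferred from the logged accuracy

print(n_train, n_test)             # 112 38
print(correct / n_test)
```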

Inference

Deploy the trained model to an endpoint and validate predictions. The serializer and deserializer in SageMaker are used to specify the formats for input and output data when interacting with a deployed inference endpoint.

def deploy(estimator: Estimator) -> Predictor:
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=PREDICTOR_INSTANCE_TYPE,
        serializer=CSVSerializer(),
        deserializer=JSONDeserializer(),
        endpoint_name=PREDICTOR_ENDPOINT_NAME,
    )


def validate(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, data in test_df.iterrows():
        predict = predictor.predict(
            pd.DataFrame([data.drop('Species')]).to_csv(header=False, index=False),
            initial_args={'ContentType': 'text/csv'},
        )
        predicted_label = predict['predictions'][0]['predicted_label']
        row = data.tolist()
        row.append(predicted_label)
        row.append(data['Species'] == predicted_label)
        rows.append(row)
    return pd.DataFrame(rows, columns=('Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Prediction', 'Result'))


if __name__ == '__main__':
    predictor = deploy(knn)
    predictions = validate(predictor, test_df)
    display(predictions)

The inference results will include the Prediction and Result columns, displayed in a tabular format. Here’s how the data is organized:

Prediction: The predicted label for each sample provided to the model.
Result: A boolean value (True or False) indicating whether the prediction matches the actual label.
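
The response handling in validate() can be tried offline. The sketch below uses a hand-written response in the same shape that validate() reads ('predictions' containing 'predicted_label' entries); the sample values and helper names are hypothetical.

```python
def extract_labels(response):
    """Pull predicted labels out of a response shaped like the one read in validate()."""
    return [item['predicted_label'] for item in response['predictions']]


def accuracy(actual, predicted):
    """Share of rows where the prediction equals the actual label."""
    matches = [a == p for a, p in zip(actual, predicted)]
    return sum(matches) / len(matches)


# Hypothetical response for three rows; real values come from the endpoint.
sample_response = {'predictions': [
    {'predicted_label': 0.0},
    {'predicted_label': 2.0},
    {'predicted_label': 1.0},
]}
labels = extract_labels(sample_response)
print(labels)
print(accuracy([0, 2, 2], labels))
```

Applied to the real predictions DataFrame, the same ratio is the mean of the Result column.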

Clean-Up

Release the resources to avoid unnecessary costs:

def delete_model(predictor: Predictor) -> None:
    predictor.delete_model()


def delete_endpoint(predictor: Predictor) -> None:
    predictor.delete_endpoint(delete_endpoint_config=True)


if __name__ == '__main__':
    delete_model(predictor)
    delete_endpoint(predictor)

Conclusion

By using SageMaker’s built-in algorithms like k-NN, you can simplify the ML workflow, from training to deployment. SageMaker empowers developers to focus on building effective models while handling the underlying infrastructure seamlessly.

Happy Coding! 🚀

Appendix: Setting Serializer and Deserializer in Predictor

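When deploying a model or attaching to an existing endpoint, you can define the serializer and deserializer to control the input/output format:

from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer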


predictor = Predictor(
    endpoint_name="my-endpoint",
    serializer=CSVSerializer(),  # Input in CSV format
    deserializer=JSONDeserializer()  # Output in JSON format
)

These settings ensure seamless communication with your SageMaker inference endpoint, making it easier to send requests and process responses.

Serializer

The serializer converts input data (e.g., Python objects) into the format expected by the SageMaker endpoint.

For example:

CSVSerializer: Converts data into CSV format.
JSONSerializer: Converts data into JSON format.
NumpySerializer: Converts NumPy arrays into binary format.

Deserializer

The deserializer interprets the output data from the SageMaker endpoint and converts it back into a usable Python object.

For example:

JSONDeserializer: Converts JSON responses into Python dictionaries or lists.
BytesDeserializer: Returns raw bytes.
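
Conceptually, a serializer and a deserializer are just a pair of conversions around the HTTP request. The sketch below mimics that round trip with the standard library only; it is a stand-in to show the idea, not the SageMaker SDK's internal implementation, and the function names and sample response are hypothetical.

```python
import csv
import io
import json


def to_csv_payload(rows):
    """Turn feature rows into a CSV request body (stand-in for a serializer)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(rows)
    return buffer.getvalue()


def from_json_body(body):
    """Turn a JSON response body into Python objects (stand-in for a deserializer)."""
    return json.loads(body)


payload = to_csv_payload([[5.1, 3.5, 1.4, 0.2]])
result = from_json_body(b'{"predictions": [{"predicted_label": 0.0}]}')
print(repr(payload))
print(result['predictions'][0]['predicted_label'])
```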