Getting Started with Amazon SageMaker: Using Built-in Algorithms

Takahiro Iwasa

This example uses SageMaker Studio, a fully integrated development environment for machine learning, to show how to use the SageMaker built-in k-nearest neighbors (k-NN) algorithm with the popular Iris dataset.

Onboarding to SageMaker

To use SageMaker Studio, you first need to complete the onboarding process.

Choose a setup type, either Quick Setup or Custom Setup.

Select a VPC.

From the menu of the automatically created user, launch SageMaker Studio.

Once launched, SageMaker Studio appears.

Click the Notebook button in the center of the SageMaker Studio screen.

Preprocessing

Preparing Dataset

Start by setting up your environment variables.

Replace the following with the actual values:

  • <YOUR_S3_BUCKET>
  • <YOUR_SAGEMAKER_ROLE>

Run the following magic commands in a notebook cell:
%env S3_DATASET_BUCKET=<YOUR_S3_BUCKET>
%env S3_DATASET_TRAIN=knn/input/iris_train.csv
%env S3_DATASET_TEST=knn/input/iris_test.csv
%env S3_TRAIN_OUTPUT=knn/output
%env SAGEMAKER_ROLE=<YOUR_SAGEMAKER_ROLE>
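
Because leftover placeholders are easy to overlook, a quick check in the next cell can catch them before any AWS call is made. The following is only a minimal sketch: the helper name find_unset_settings is hypothetical and is not part of SageMaker or of the code used in the rest of this example.

```python
import os

# Names of the environment variables set by the %env cell above.
REQUIRED_SETTINGS = (
    "S3_DATASET_BUCKET",
    "S3_DATASET_TRAIN",
    "S3_DATASET_TEST",
    "S3_TRAIN_OUTPUT",
    "SAGEMAKER_ROLE",
)


def find_unset_settings(environ=os.environ):
    """Return the names of settings that are missing or still hold a <PLACEHOLDER>."""
    problems = []
    for name in REQUIRED_SETTINGS:
        value = environ.get(name, "")
        # An empty value or a leftover "<...>" placeholder both count as unset.
        if not value or (value.startswith("<") and value.endswith(">")):
            problems.append(name)
    return problems
```

Calling find_unset_settings() after the %env cell should return an empty list once every value has been filled in.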

Next, create a cell with the following Python imports. The Python 3 (Data Science) kernel comes pre-installed with these libraries.

import os
import random
import string
import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display
from sagemaker import image_uris
from sagemaker.deserializers import JSONDeserializer
from sagemaker.estimator import Estimator, Predictor
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn.model_selection import train_test_split

Define constants and variables.

# Define constants
CSV_PATH = './tmp/iris.csv'
S3_DATASET_BUCKET = os.getenv('S3_DATASET_BUCKET')
S3_DATASET_TRAIN = os.getenv('S3_DATASET_TRAIN')
S3_DATASET_TEST = os.getenv('S3_DATASET_TEST')
S3_TRAIN_OUTPUT = os.getenv('S3_TRAIN_OUTPUT')
SAGEMAKER_ROLE = os.getenv('SAGEMAKER_ROLE')
ESTIMATOR_INSTANCE_COUNT = 1
ESTIMATOR_INSTANCE_TYPE = 'ml.m5.large'
PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
PREDICTOR_ENDPOINT_NAME = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')
# Define variables
bucket = boto3.resource('s3').Bucket(S3_DATASET_BUCKET)
train_df = None
test_df = None
train_object_path = None
test_object_path = None
knn = None
predictor = None
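
The endpoint name is derived from the instance type with the dots replaced by hyphens. The sketch below shows the resulting value and checks it against the naming rule of the CreateEndpoint API as I understand it (alphanumeric characters and hyphens, at most 63 characters); treat the pattern as an assumption and confirm it in the API reference.

```python
import re

PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
endpoint_name = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')
print(endpoint_name)  # sagemaker-knn-ml-t2-medium

# Assumed naming rule: alphanumeric characters separated by optional hyphens.
ENDPOINT_NAME_PATTERN = re.compile(r'[a-zA-Z0-9](-*[a-zA-Z0-9])*')
is_valid_name = (
    len(endpoint_name) <= 63
    and ENDPOINT_NAME_PATTERN.fullmatch(endpoint_name) is not None
)
print(is_valid_name)
```

Without the replace call the name would contain dots, which the pattern above does not allow.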

Download the Iris dataset from AWS’s SageMaker Examples repository. This dataset is widely used for demonstrating classification models due to its simplicity and well-defined structure. For additional details about the Iris dataset, including its attributes and applications, please refer to the official page.

!mkdir -p tmp
!curl -o "$(pwd)/tmp/iris.csv" -L https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/hyperparameter_tuning/r_bring_your_own/iris.csv

After downloading, use the following code to load and preprocess the CSV. This function processes the Iris dataset to meet SageMaker’s requirements, preparing it for training with built-in algorithms.

Refer to the SageMaker documentation for more details.

def load_csv(path: str) -> pd.DataFrame:
    # Load the CSV into a Pandas DataFrame
    df = pd.read_csv(path)
    # Move the label column ('Species') to the first position
    df = df[['Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']]
    # Convert target labels ('Species') to integers
    df['Species'] = df['Species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
    return df

🔥 Caution

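  • SageMaker built-in algorithms expect CSV training data to have the target label in the first column and no header row, so the Species column must be moved to the first position (the header is dropped later by writing the file with header=False).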
  • The target labels (species names) must also be converted to integers for compatibility.
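
To make the expected layout concrete, here is a small standard-library sketch that turns two illustrative rows into the label-first, header-free CSV described above. It does not replace load_csv; the row values and the FEATURES and LABELS names are chosen for illustration only.

```python
import csv
import io

LABELS = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
FEATURES = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']

# Two illustrative rows in which the label is not the first field.
raw_rows = [
    {'Sepal.Length': 5.1, 'Sepal.Width': 3.5, 'Petal.Length': 1.4, 'Petal.Width': 0.2, 'Species': 'setosa'},
    {'Sepal.Length': 6.3, 'Sepal.Width': 3.3, 'Petal.Length': 6.0, 'Petal.Width': 2.5, 'Species': 'virginica'},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for row in raw_rows:
    # Integer label first, then the features, and no header row.
    writer.writerow([LABELS[row['Species']]] + [row[name] for name in FEATURES])

training_csv = buffer.getvalue()
print(training_csv)
```

Each output line starts with the integer class, followed by the four measurements.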

Visualizing Dataset

Create a scatter matrix of the dataset, colored by species, to see how the features are distributed and how they relate to one another.

def plot(df: pd.DataFrame) -> None:
    pd.plotting.scatter_matrix(df, figsize=(15, 15), c=df['Species'])
    plt.show()

The scatter plot generated provides the following axis mappings:

  • X-axis
    • Represents Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width from left to right.
  • Y-axis
    • Represents Petal.Width, Petal.Length, Sepal.Width, Sepal.Length, and Species from bottom to top.

The plot suggests that the species can be predicted from the features, because the data points of each species form largely distinct clusters.

Uploading Dataset to S3

Upload the preprocessed dataset to S3:

def upload_csv_to_s3(df: pd.DataFrame, object_path: str) -> str:
    filename = ''.join([random.choice(string.digits + string.ascii_lowercase) for i in range(10)])
    path = os.path.abspath(os.path.join('./tmp', filename))
    df.to_csv(path, header=False, index=False)
    # Change content-type because the default is binary/octet-stream
    bucket.upload_file(path, object_path, ExtraArgs={'ContentType': 'text/csv'})
    return f's3://{bucket.name}/{object_path}'

Executing Preprocessing

To complete the preprocessing above, run the following:

if __name__ == '__main__':
    df = load_csv(CSV_PATH)
    display(df)
    plot(df)
    train_df, test_df = train_test_split(df, shuffle=True, random_state=0)
    train_object_path = upload_csv_to_s3(train_df, S3_DATASET_TRAIN)
    test_object_path = upload_csv_to_s3(test_df, S3_DATASET_TEST)

Training

Configure the k-NN estimator and start the training process:

def get_estimator(**hyperparams) -> Estimator:
    estimator = Estimator(
        image_uri=image_uris.retrieve('knn', boto3.Session().region_name),
        role=SAGEMAKER_ROLE,
        instance_count=ESTIMATOR_INSTANCE_COUNT,
        instance_type=ESTIMATOR_INSTANCE_TYPE,
        input_mode='Pipe',
        output_path=f's3://{S3_DATASET_BUCKET}/{S3_TRAIN_OUTPUT}',
        sagemaker_session=sagemaker.Session(),
    )
    hyperparams.update({'predictor_type': 'classifier'})
    estimator.set_hyperparameters(**hyperparams)
    return estimator


def train(estimator: Estimator, train_object_path: str, test_object_path: str) -> None:
    train_input = TrainingInput(train_object_path, content_type='text/csv', input_mode='Pipe')
    test_input = TrainingInput(test_object_path, content_type='text/csv', input_mode='Pipe')
    estimator.fit({'train': train_input, 'test': test_input})


if __name__ == '__main__':
    knn = get_estimator(k=1, sample_size=1000)
    train(knn, train_object_path, test_object_path)

After starting the training job, you will observe logs similar to the following:

2022-01-08 13:38:34 Starting - Starting the training job...
2022-01-08 13:38:57 Starting - Launching requested ML instancesProfilerReport-1641649113: InProgress
......
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('accuracy', 0.9736842105263158)
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('macro_f_1.000', 0.97170347)
  • The log reports metrics such as accuracy and the macro F1 score, which show how the model performs on the test dataset.
  • Because both the train and test channels were supplied, the built-in algorithm evaluates the trained model on the held-out test data automatically.
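
The logged accuracy is consistent with 37 correct predictions out of 38 test samples. The arithmetic below is a sketch under two assumptions: the dataset has the standard 150 rows, and train_test_split used scikit-learn's default 25% test fraction, since no test_size was passed. The count of 37 is inferred from the log rather than reported by it.

```python
import math

TOTAL_ROWS = 150      # assumption: the standard Iris dataset size
TEST_FRACTION = 0.25  # scikit-learn default when neither test_size nor train_size is set

# scikit-learn rounds the test split up and gives the remainder to training.
test_rows = math.ceil(TOTAL_ROWS * TEST_FRACTION)
train_rows = TOTAL_ROWS - test_rows

correct = 37  # hypothetical count that reproduces the logged accuracy
accuracy = correct / test_rows
print(train_rows, test_rows, accuracy)
```

With so few test samples, a single misclassification moves the accuracy by about 2.6 percentage points, so small differences between runs should not be over-interpreted.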

ECR URI

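The image_uri argument passed to the Estimator specifies the Amazon ECR URI of the k-NN training container provided by AWS. In this example it is looked up with image_uris.retrieve for the region of the current boto3 session.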

For detailed information about the container URIs for built-in algorithms, refer to the official documentation.

Channel Names

Built-in algorithms in SageMaker read their training data from a channel whose name is fixed to train. In this example the channel names are the keys of the dictionary passed to estimator.fit. If you also include a test channel when creating the training job, the model is automatically evaluated on the test data after training.

Using Pipe Mode

To stream training data instead of downloading it first, you can enable Pipe mode by setting the input_mode parameter to "Pipe". This example sets it on both the Estimator and each TrainingInput. Pipe mode streams data directly from S3 to the training instance, which shortens startup time and reduces the disk space needed compared with downloading the entire dataset.
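
As a rough analogy for the difference between streaming and downloading, the sketch below consumes CSV rows one at a time from a stream instead of loading the whole file into memory. It is not how SageMaker implements Pipe mode internally; the stream_rows helper and the sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, List


def stream_rows(lines: Iterable[str]) -> Iterator[List[float]]:
    """Yield one parsed CSV row at a time instead of materializing the whole file."""
    for record in csv.reader(lines):
        if record:  # skip blank lines
            yield [float(value) for value in record]


# An in-memory stream stands in for the data source.
source = io.StringIO("0,5.1,3.5,1.4,0.2\n1,7.0,3.2,4.7,1.4\n2,6.3,3.3,6.0,2.5\n")

label_counts = {}
for row in stream_rows(source):
    label = int(row[0])  # the label is the first column
    label_counts[label] = label_counts.get(label, 0) + 1
print(label_counts)
```

Only one row is held at a time, which is the property that makes streaming attractive for datasets much larger than Iris.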

k-NN Hyperparameters

The k-NN algorithm includes several configurable hyperparameters. For complete details on their usage and effects, consult the official documentation.
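
To give an intuition for what the k hyperparameter controls, here is a minimal pure-Python k-NN classifier. It is an illustrative sketch only and not the SageMaker implementation; the knn_predict function and the six training samples are made up for this example.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label)


def knn_predict(train: List[Sample], query: Sequence[float], k: int) -> int:
    """Return the majority label among the k training samples closest to query."""
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Illustrative samples: (sepal length, sepal width, petal length, petal width), label
train_samples: List[Sample] = [
    ((5.1, 3.5, 1.4, 0.2), 0),
    ((4.9, 3.0, 1.4, 0.2), 0),
    ((7.0, 3.2, 4.7, 1.4), 1),
    ((6.4, 3.2, 4.5, 1.5), 1),
    ((6.3, 3.3, 6.0, 2.5), 2),
    ((5.8, 2.7, 5.1, 1.9), 2),
]

print(knn_predict(train_samples, (5.0, 3.4, 1.5, 0.2), k=1))
print(knn_predict(train_samples, (6.0, 3.0, 4.8, 1.6), k=3))
```

With k=1, as used in this example's training job, the prediction is simply the label of the single nearest training sample; larger values of k take a majority vote among more neighbors.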

Inference

Deploy the trained model to an endpoint and validate predictions.

The serializer and deserializer in SageMaker are used to specify the formats for input and output data when interacting with a deployed inference endpoint.

  • Serializer
    • CSVSerializer: Converts data into CSV format.
    • JSONSerializer: Converts data into JSON format.
    • NumpySerializer: Converts NumPy arrays into binary format.
  • Deserializer
    • JSONDeserializer: Converts JSON responses into Python dictionaries or lists.
    • BytesDeserializer: Returns raw bytes.
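
What this serializer and deserializer pair does can be sketched with the standard library alone. The helpers below are hypothetical stand-ins rather than the SDK classes, and the sample response body is made up; its shape mirrors the predictions and predicted_label keys that the validation code in this section reads.

```python
import csv
import io
import json
from typing import Sequence


def to_csv_payload(features: Sequence[float]) -> str:
    """Serialize one feature vector as a single CSV line (no header, no label)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerow(features)
    return buffer.getvalue()


def parse_predicted_label(response_body: str) -> float:
    """Extract the first predicted label from a JSON response body."""
    return json.loads(response_body)['predictions'][0]['predicted_label']


payload = to_csv_payload([5.1, 3.5, 1.4, 0.2])

# Hypothetical response body, for illustration only.
sample_response = '{"predictions": [{"predicted_label": 0.0}]}'
label = parse_predicted_label(sample_response)
print(repr(payload), label)
```

Note that the request carries only the four features; the label column is dropped before the row is sent.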
def deploy(estimator: Estimator) -> Predictor:
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=PREDICTOR_INSTANCE_TYPE,
        serializer=CSVSerializer(),
        deserializer=JSONDeserializer(),
        endpoint_name=PREDICTOR_ENDPOINT_NAME,
    )


def validate(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, data in test_df.iterrows():
        predict = predictor.predict(
            pd.DataFrame([data.drop('Species')]).to_csv(header=False, index=False),
            initial_args={'ContentType': 'text/csv'},
        )
        predicted_label = predict['predictions'][0]['predicted_label']
        row = data.tolist()
        row.append(predicted_label)
        row.append(data['Species'] == predicted_label)
        rows.append(row)
    return pd.DataFrame(rows, columns=('Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Prediction', 'Result'))


if __name__ == '__main__':
    predictor = deploy(knn)
    predictions = validate(predictor, test_df)
    display(predictions)

The inference results will include the Prediction and Result columns.

  • Prediction: The predicted label for each sample provided to the model.
  • Result: A boolean value (True or False) indicating whether the prediction matches the actual label.

Cleaning Up

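Delete the model, the endpoint, and the endpoint configuration created during this example with the following code. The dataset and training output stored in S3 are not removed by this code.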

def delete_model(predictor: Predictor) -> None:
    predictor.delete_model()


def delete_endpoint(predictor: Predictor) -> None:
    predictor.delete_endpoint(delete_endpoint_config=True)


if __name__ == '__main__':
    delete_model(predictor)
    delete_endpoint(predictor)

Takahiro Iwasa

Software Developer
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Japan AWS Top Engineers 2020-2023.