Amazon SageMaker Introduction - Try Machine Learning with Built-in Algorithms
I held a study meeting to introduce Amazon SageMaker, “Amazon SageMaker Introduction - Try Machine Learning with Built-in Algorithms”.
I would like to share the content of the presentation in the hope that it helps you get to know SageMaker. The example code used in this post is available in my GitHub repository.
Prerequisites
Participants are expected to:
- Be interested in learning about Amazon SageMaker.
- Have basic knowledge of machine learning and AWS.
Goals in this post:
- Provide an overview of machine learning and the SageMaker ecosystem.
- Demonstrate supervised machine learning using a SageMaker built-in algorithm.
Machine Learning Overview
Supervised Learning
Supervised learning is often used for classification and regression tasks. It needs training data together with labels. Labeling the data may require additional effort.
Unsupervised Learning
Unsupervised learning is often used for tasks such as clustering and dimensionality reduction. It needs only training data, without labels. In most cases, this method does not provide a clear explanation for inference results.
Reinforcement Learning
Reinforcement learning is not purely a statistical analysis method, as it may also incorporate psychological approaches. It needs training data and rewards. Rewards, or weights, are assigned based on the algorithm’s prediction results, allowing the algorithm to prioritize actions that lead to greater rewards.
Deep Learning
Deep learning is a machine learning method that has gained significant attention in recent years. It uses multiple layers of neural networks to achieve high levels of accuracy, which can often surpass human prediction results.
Deep learning can often learn useful features automatically, which reduces the need for manual feature selection. Despite producing accurate predictions, deep learning may not provide a clear explanation for inference results. Training deep learning models often requires significant computing resources.
Neural Network
One indicator of a neural network's complexity is the number of weights.
In the example below, the complexity is 24.
4 (Features) * 3 (Hidden layer 1) = 12
3 (Hidden layer 1) * 3 (Hidden layer 2) = 9
3 (Hidden layer 2) * 1 (Output) = 3
12 + 9 + 3 = 24
Value | Description |
---|---|
y | Result |
x | Features |
w | Weights |
h1, h2 | Hidden layers |
h1[0], h1[1], … h2[2] | Hidden units |
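The calculation above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a fully connected network and ignores bias terms; the helper name count_weights is my own and not part of any library.

```python
def count_weights(layer_sizes):
    """Count the weights of a fully connected network, ignoring biases."""
    # Each pair of adjacent layers contributes (inputs * outputs) weights
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 4 features -> hidden layer 1 (3 units) -> hidden layer 2 (3 units) -> 1 output
layers = [4, 3, 3, 1]
total = count_weights(layers)
print(total)  # 12 + 9 + 3 = 24
```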
Features
Rows are called samples or data points. Columns are called features. Feature engineering is important to improve inference accuracy. It includes extraction, conversion or scaling of features. Preprocessing removes unnecessary information which raw data often contains. There are various methods used for feature engineering, such as one-hot encoding, rescaling, and principal component analysis (PCA).
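To make two of these methods concrete, here is a small pure-Python sketch of one-hot encoding and min-max rescaling. The feature values are made up for illustration, and in practice you would more likely use a library such as pandas or scikit-learn.

```python
def one_hot(values):
    """Encode categorical values as one-hot vectors (categories in sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max_rescale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


colors = ['red', 'green', 'red', 'blue']  # hypothetical categorical feature
lengths = [2.0, 4.0, 6.0, 10.0]           # hypothetical numeric feature

encoded = one_hot(colors)          # columns are blue, green, red
scaled = min_max_rescale(lengths)
print(encoded)
print(scaled)
```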
Model Evaluation
Method (k-Fold Cross Validation)
The following diagram describes k-Fold Cross Validation. Green is training data and blue is test data.
Each set has a different combination of training and test data, and the accuracy of each set can be calculated. The average accuracy over all sets can be considered the model's overall accuracy. If the accuracy varies heavily between sets, the model may not generalize well, which suggests that it is overfitting to specific subsets of the data.
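The splitting described above can be sketched as follows. This is a simplified illustration without shuffling, and the per-fold accuracies are hypothetical numbers rather than measured results.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print('train:', train, 'test:', test)

# Hypothetical accuracy of a model evaluated on each fold
fold_accuracies = [0.9, 1.0, 0.8, 1.0, 0.9]
average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(average_accuracy)
```

Every sample is used for testing exactly once, so the average reflects the whole dataset rather than a single split.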
Indicators (Confusion Matrix)
The following table is an example of binary classification.
 | Predicted Positive | Predicted Negative |
---|---|---|
Actual Positive | True Positive (TP) | False Negative (FN) |
Actual Negative | False Positive (FP) | True Negative (TN) |
The following describes each indicator.
Name | Expression | Description | Problem |
---|---|---|---|
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Ratio of correct predictions among all samples | Misleading on imbalanced data: if there are 100 samples of which 99 are negative and the model predicts every sample as negative, the accuracy is still 99%. |
Precision | TP / (TP + FP) | Ratio of positive predictions that are actually positive | Precision alone is not sufficient when false negatives are costly, such as in cancer diagnosis. |
Recall | TP / (TP + FN) | Ratio of actually positive samples that are predicted as positive | Recall alone is not sufficient when false positives are costly, such as when a thorough cancer examination is very expensive. If all samples are predicted as positive, recall is 100%. |
f-score | 2 * (Precision * Recall / (Precision + Recall)) | Harmonic mean of precision and recall | When all samples are true negatives, precision and recall are undefined and a division by zero occurs. |
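The indicators in the table can be computed directly from the four counts of the confusion matrix. The following sketch uses a function name of my own choosing and returns None where a value is undefined instead of raising a division-by-zero error; the second set of counts is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and f-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision and recall are undefined when their denominators are zero
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f_score = None
    else:
        f_score = 2 * (precision * recall / (precision + recall))
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f_score': f_score}


# 100 samples, 99 of them negative, and every sample predicted as negative
skewed = classification_metrics(tp=0, tn=99, fp=0, fn=1)
print(skewed)

# Hypothetical counts for a more balanced case
balanced = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(balanced)
```

The first case shows why accuracy alone is misleading: the accuracy is 99% although the model never finds a positive sample.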
Example of Machine Learning Workflow
A general workflow includes Generate example data, Train a model, and Deploy the model.
Learning Materials
- Introduction to Machine Learning with Python
- [JAPANESE] 総務省 ICTスキル総合習得プログラム コース3(データ分析) 3-5 人工知能と機械学習
- Amazon Machine Learning – Machine Learning Concepts
SageMaker Introduction
SageMaker is a managed service for machine learning. AWS offers many services related to the machine learning workflow, and the SageMaker ecosystem has been rapidly advancing. SageMaker provides 17 built-in algorithms as of Dec. 2021. Models trained outside SageMaker can be used with AWS-provided containers, called BYOM - Bring Your Own Model. Major frameworks such as TensorFlow or PyTorch are supported.
Workflow and Ecosystem
The following diagram describes AWS services related to machine learning workflow. For more information on each service, please refer to the bottom of this post.
Inference Endpoint
There are several endpoint types.
- SageMaker Hosting Services provide endpoints that run continuously, like EC2 instances.
- SageMaker Serverless Endpoints (Preview) provide endpoints at lower cost if you can tolerate cold starts.
- Asynchronous Inference lets SageMaker scale the instance count down to zero through auto-scaling when there are no requests to process.
SageMaker Studio Setup
You can choose a SageMaker environment from the following options. This post uses SageMaker Studio.
- SageMaker Studio/RStudio on SageMaker
  - Machine learning IDE
  - For learning purposes, SageMaker JumpStart is also available.
- SageMaker Notebook Instances
  - Jupyter Notebook instance
  - This is useful when a few data scientists need to temporarily use SageMaker.
- Local environment + AWS SDK + SageMaker SDK
  - This is useful when a few developers need to temporarily use SageMaker.
- SageMaker Studio Lab
  - You can try out some of the features of SageMaker Studio for free without an AWS account.
Onboard to SageMaker Domain
To use SageMaker Studio, you must first complete the onboarding process.
Choose Quick Setup. If you are creating a domain for a production environment, Custom Setup is recommended.
Choose any VPC.
Launch SageMaker Studio from the menu of the automatically created user.
SageMaker Studio will be launched.
AWS Design
Quick Setup
Custom Setup
Custom Setup allows network traffic to be confined within the AWS internal network by using VPC endpoints.
Playing with Built-in Algorithms
This post uses SageMaker’s built-in k-NN (k-Nearest Neighbors) algorithm with the well-known Iris dataset.
First, click the Notebook button at the center of the SageMaker Studio IDE.
Preparing Dataset
Create a cell with the following content, replacing <YOUR_S3_BUCKET> and <YOUR_SAGEMAKER_ROLE> with your own values.
%env S3_DATASET_BUCKET=<YOUR_S3_BUCKET>
%env S3_DATASET_TRAIN=knn/input/iris_train.csv
%env S3_DATASET_TEST=knn/input/iris_test.csv
%env S3_TRAIN_OUTPUT=knn/output
%env SAGEMAKER_ROLE=<YOUR_SAGEMAKER_ROLE>
Create a cell with the following content. The “Python3 Data Science” environment has these packages preinstalled.
import os
import random
import string
import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display
from sagemaker import image_uris
from sagemaker.deserializers import JSONDeserializer
from sagemaker.estimator import Estimator, Predictor
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn.model_selection import train_test_split
Define the following constants and variables, which are used throughout the notebook.
# Define constants
CSV_PATH = './tmp/iris.csv'
S3_DATASET_BUCKET = os.getenv('S3_DATASET_BUCKET')
S3_DATASET_TRAIN = os.getenv('S3_DATASET_TRAIN')
S3_DATASET_TEST = os.getenv('S3_DATASET_TEST')
S3_TRAIN_OUTPUT = os.getenv('S3_TRAIN_OUTPUT')
SAGEMAKER_ROLE = os.getenv('SAGEMAKER_ROLE')
ESTIMATOR_INSTANCE_COUNT = 1
ESTIMATOR_INSTANCE_TYPE = 'ml.m5.large'
PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
PREDICTOR_ENDPOINT_NAME = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')
# Define variables
bucket = boto3.resource('s3').Bucket(S3_DATASET_BUCKET)
train_df = None
test_df = None
train_object_path = None
test_object_path = None
knn = None
predictor = None
Download the Iris dataset from the official AWS amazon-sagemaker-examples repository.
For more information about the Iris dataset, please visit the official page.
# Download a sample csv
!mkdir -p tmp
!curl -o "$(pwd)/tmp/iris.csv" -L https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/hyperparameter_tuning/r_bring_your_own/iris.csv
After downloading, write the following code to load the CSV.
SageMaker treats the first column in the CSV as the target label or class, so Species must be moved to the first column.
The target labels must also be converted to int.
def load_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Move the last label column to the first
    # See https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#cdf-csv-format
    df = df[['Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']]
    # Convert target string to int
    df['Species'] = df['Species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
    return df
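To show what the resulting training file looks like, here is a small standard-library sketch that applies the same two transformations (label first, label as int) to two sample rows. The rows are included only for illustration, and the header is dropped because the uploaded file is written without one.

```python
import csv
import io

# Same mapping as in load_csv
LABELS = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# Two illustrative rows in the layout of the downloaded CSV (label in the last column)
raw = io.StringIO(
    'Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n'
    '5.1,3.5,1.4,0.2,setosa\n'
    '7.0,3.2,4.7,1.4,versicolor\n'
)

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
for row in csv.DictReader(raw):
    # The integer label goes first, followed by the four features; no header row
    writer.writerow([
        LABELS[row['Species']],
        row['Sepal.Length'], row['Sepal.Width'],
        row['Petal.Length'], row['Petal.Width'],
    ])

training_csv = out.getvalue()
print(training_csv)
```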
You can create a scatter plot diagram with the following code.
def plot(df: pd.DataFrame) -> None:
    pd.plotting.scatter_matrix(df, figsize=(15, 15), c=df['Species'])
    plt.show()
From left to right, the X-axis shows Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
From bottom to top, the Y-axis shows Petal.Width, Petal.Length, Sepal.Width, Sepal.Length, and Species.
The plot suggests that the species can be predicted from the features, because the data points of each species form distinct groups.
Write the following code to upload the transformed CSV to the S3 bucket.
def upload_csv_to_s3(df: pd.DataFrame, object_path: str) -> str:
    # Write the DataFrame to a temporary file with a random name
    filename = ''.join([random.choice(string.digits + string.ascii_lowercase) for i in range(10)])
    path = os.path.abspath(os.path.join('./tmp', filename))
    df.to_csv(path, header=False, index=False)
    # Change content-type because the default is binary/octet-stream
    bucket.upload_file(path, object_path, ExtraArgs={'ContentType': 'text/csv'})
    return f's3://{bucket.name}/{object_path}'
Finally, you can complete the above steps with the following code.
if __name__ == '__main__':
    # Prepare data
    df = load_csv(CSV_PATH)
    display(df)
    plot(df)
    train_df, test_df = train_test_split(df, shuffle=True, random_state=0)  # type: (pd.DataFrame, pd.DataFrame)
    train_object_path = upload_csv_to_s3(train_df, S3_DATASET_TRAIN)
    test_object_path = upload_csv_to_s3(test_df, S3_DATASET_TEST)
Training
Write the following code to configure an estimator.
The image_uri is the ECR URI of the k-NN training container provided by AWS.
def get_estimator(**hyperparams) -> Estimator:
    estimator = Estimator(
        image_uri=image_uris.retrieve('knn', boto3.Session().region_name),  # AWS provided container in ECR
        role=SAGEMAKER_ROLE,
        instance_count=ESTIMATOR_INSTANCE_COUNT,
        instance_type=ESTIMATOR_INSTANCE_TYPE,
        input_mode='Pipe',
        output_path=f's3://{S3_DATASET_BUCKET}/{S3_TRAIN_OUTPUT}',
        sagemaker_session=sagemaker.Session(),
    )
    hyperparams.update({'predictor_type': 'classifier'})
    estimator.set_hyperparameters(**hyperparams)
    return estimator
You can start training with the following code.
def train(estimator: Estimator, train_object_path: str, test_object_path: str) -> None:
    # Specify content-type because the default is application/x-recordio-protobuf
    train_input = TrainingInput(train_object_path, content_type='text/csv', input_mode='Pipe')
    test_input = TrainingInput(test_object_path, content_type='text/csv', input_mode='Pipe')
    estimator.fit({'train': train_input, 'test': test_input})

if __name__ == '__main__':
    knn = get_estimator(k=1, sample_size=1000)
    train(knn, train_object_path, test_object_path)
The channel name for training data in built-in algorithms is fixed to train.
If the training job is created with a test channel, your ML model will also be tested.
You can use Pipe mode by specifying Pipe for the input_mode argument of TrainingInput, which streams data directly from the S3 bucket into the SageMaker instance.
Visit the official documentation for information on k-NN Hyperparameters.
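To give an intuition for the k hyperparameter, here is a minimal pure-Python sketch of k-NN classification. It assumes Euclidean distance and a simple majority vote, and it is not how SageMaker implements the algorithm internally; the training rows and the query are illustrative values.

```python
import math
from collections import Counter


def knn_predict(train_rows, query, k=1):
    """Return the majority label among the k training rows nearest to the query."""
    # Sort the training rows by Euclidean distance to the query and keep the k nearest
    nearest = sorted(train_rows, key=lambda row: math.dist(row[1], query))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


# (label, features) pairs in the same column order as the training CSV
train_rows = [
    (0, (5.1, 3.5, 1.4, 0.2)),
    (0, (4.9, 3.0, 1.4, 0.2)),
    (1, (7.0, 3.2, 4.7, 1.4)),
    (1, (6.4, 3.2, 4.5, 1.5)),
    (2, (6.3, 3.3, 6.0, 2.5)),
    (2, (5.8, 2.7, 5.1, 1.9)),
]

query = (6.0, 3.0, 4.8, 1.8)
label_k1 = knn_predict(train_rows, query, k=1)
label_k3 = knn_predict(train_rows, query, k=3)
print(label_k1, label_k3)
```

With this query the single nearest neighbor has label 2, while two of the three nearest neighbors have label 1, so the choice of k changes the prediction.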
After running, you can see the training log like the following.
2022-01-08 13:38:34 Starting - Starting the training job...
2022-01-08 13:38:57 Starting - Launching requested ML instancesProfilerReport-1641649113: InProgress
......
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('accuracy', 0.9736842105263158)
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('macro_f_1.000', 0.97170347)
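The reported accuracy can be cross-checked against the size of the test set. The sketch below assumes that the Iris CSV contains 150 rows and relies on train_test_split holding out 25% of the data by default, rounded up, when no test size is given.

```python
import math

n_samples = 150        # assumption: the Iris CSV contains 150 rows
test_fraction = 0.25   # default test size of train_test_split when none is specified

n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test

reported_accuracy = 0.9736842105263158  # value taken from the training log
correct = round(reported_accuracy * n_test)

print(n_train, n_test, correct)
print(correct / n_test)
```

Under these assumptions, the accuracy in the log corresponds to 37 correct predictions out of 38 test samples.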
Inference
Write the following code to deploy the inference endpoint.
The serializer and deserializer specify the input and output formats you want to use.
def deploy(estimator: Estimator) -> Predictor:
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=PREDICTOR_INSTANCE_TYPE,
        serializer=CSVSerializer(),
        deserializer=JSONDeserializer(),
        endpoint_name=PREDICTOR_ENDPOINT_NAME
    )
You can display an inference result as a table.
def validate(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for i, data in test_df.iterrows():
        # Send the features of a single sample (without the label) as one CSV line
        predict = predictor.predict(
            pd.DataFrame([data.drop('Species')]).to_csv(header=False, index=False),
            initial_args={'ContentType': 'text/csv'}
        )
        predicted_label = predict['predictions'][0]['predicted_label']
        # Append the prediction and whether it matches the actual label
        row = data.tolist()
        row.append(predicted_label)
        row.append(data['Species'] == predicted_label)
        rows.append(row)
    return pd.DataFrame(rows, columns=('Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Prediction', 'Result'))
The CSV data with the Prediction and Result columns will be displayed as a table.
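The way validate reads the prediction can be tried without a deployed endpoint. The sample payload below is an assumption that mirrors the keys accessed in the function above, not a captured response; with JSONDeserializer, the predictor returns the already parsed structure.

```python
import json

# Assumed response body for a single sample, mirroring the keys used in validate()
sample_response = '{"predictions": [{"predicted_label": 1.0}]}'

parsed = json.loads(sample_response)
predicted_label = parsed['predictions'][0]['predicted_label']

actual_label = 1  # hypothetical value of the Species column for this sample
is_correct = (actual_label == predicted_label)
print(predicted_label, is_correct)
```

The label comes back as a float, but the comparison with the integer label still works because 1 == 1.0 in Python.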
Run inference with the following code.
if __name__ == '__main__':
    predictor = deploy(knn)
    predictions = validate(predictor, test_df)
    display(predictions)
Deleting Resources
Delete the resources with the following code. The real-time inference endpoint will continue to incur costs until it is deleted.
def delete_model(predictor: Predictor) -> None:
    try:
        predictor.delete_model()
        print('Deleted a model')
    except Exception as e:
        # Print the error and continue so that the endpoint is still deleted
        print(e)

def delete_endpoint(predictor: Predictor) -> None:
    try:
        predictor.delete_endpoint(delete_endpoint_config=True)
        print(f'Deleted {predictor.endpoint_name}')
    except Exception as e:
        print(e)

if __name__ == '__main__':
    delete_model(predictor)
    delete_endpoint(predictor)
Conclusion
SageMaker provides developers with many features to build machine learning systems, enabling us to focus more on important parts of applications.
I hope you will find this post useful.
References for SageMaker Ecosystem
IDE
CI/CD
Jupyter Notebook
Preprocessing
- SageMaker Data Wrangler
- SageMaker Processing
- SageMaker Ground Truth
- SageMaker Ground Truth Plus
- SageMaker Feature Store