Getting Started with Amazon SageMaker: Using Built-in Algorithms

This example explains how to use the SageMaker built-in k-Nearest Neighbors (k-NN) algorithm on the popular Iris dataset, working inside SageMaker Studio, a fully integrated development environment for machine learning.
Onboarding to SageMaker
To use SageMaker Studio, you first need to complete the onboarding process:

- Choose a setup type: Quick Setup or Custom Setup.
- Select a VPC.
- Launch SageMaker Studio from the menu of the automatically created user.

Once launched, SageMaker Studio will appear. Click the `Notebook` button in the center of SageMaker Studio.
Preprocessing
Preparing Dataset
Start by setting up your environment variables. Replace `<YOUR_S3_BUCKET>` and `<YOUR_SAGEMAKER_ROLE>` with your actual values:

```
%env S3_DATASET_BUCKET=<YOUR_S3_BUCKET>
%env S3_DATASET_TRAIN=knn/input/iris_train.csv
%env S3_DATASET_TEST=knn/input/iris_test.csv
%env S3_TRAIN_OUTPUT=knn/output
%env SAGEMAKER_ROLE=<YOUR_SAGEMAKER_ROLE>
```
Next, create a cell with the following Python imports. These libraries come pre-installed on the Python 3 (Data Science) image.
```python
import os
import random
import string

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display
from sagemaker import image_uris
from sagemaker.deserializers import JSONDeserializer
from sagemaker.estimator import Estimator, Predictor
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn.model_selection import train_test_split
```
Define constants and variables.
```python
# Define constants
CSV_PATH = './tmp/iris.csv'
S3_DATASET_BUCKET = os.getenv('S3_DATASET_BUCKET')
S3_DATASET_TRAIN = os.getenv('S3_DATASET_TRAIN')
S3_DATASET_TEST = os.getenv('S3_DATASET_TEST')
S3_TRAIN_OUTPUT = os.getenv('S3_TRAIN_OUTPUT')
SAGEMAKER_ROLE = os.getenv('SAGEMAKER_ROLE')
ESTIMATOR_INSTANCE_COUNT = 1
ESTIMATOR_INSTANCE_TYPE = 'ml.m5.large'
PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
PREDICTOR_ENDPOINT_NAME = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')

# Define variables
bucket = boto3.resource('s3').Bucket(S3_DATASET_BUCKET)
train_df = None
test_df = None
train_object_path = None
test_object_path = None
knn = None
predictor = None
```
Download the Iris dataset from AWS’s SageMaker Examples repository. This dataset is widely used for demonstrating classification models due to its simplicity and well-defined structure. For additional details about the Iris dataset, including its attributes and applications, please refer to the official page.
```
!mkdir -p tmp
!curl -o "$(pwd)/tmp/iris.csv" -L https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/hyperparameter_tuning/r_bring_your_own/iris.csv
```
After downloading, use the following code to load and preprocess the CSV. This function processes the Iris dataset to meet SageMaker’s requirements, preparing it for training with built-in algorithms.
Refer to the SageMaker documentation for more details.
```python
def load_csv(path: str) -> pd.DataFrame:
    # Load the CSV into a Pandas DataFrame
    df = pd.read_csv(path)
    # Move the label column ('Species') to the first position
    df = df[['Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']]
    # Convert target labels ('Species') to integers
    df['Species'] = df['Species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
    return df
```
- SageMaker requires the first column in the CSV to be the target label or class, so the `Species` column must be moved to the first position.
- The target labels (species names) must also be converted to integers for compatibility.
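As a quick illustrative check (not part of the original flow) that the loaded DataFrame satisfies both requirements:

```python
df = load_csv(CSV_PATH)
# The label must be the first column, and its values must be integers.
assert df.columns[0] == 'Species'
assert pd.api.types.is_integer_dtype(df['Species'])
print(df.head())
```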
Visualizing Dataset
Create a scatter plot of the dataset to understand its features and distributions.
```python
def plot(df: pd.DataFrame) -> None:
    pd.plotting.scatter_matrix(df, figsize=(15, 15), c=df['Species'])
    plt.show()
```
The generated scatter matrix uses the following axis mappings:

- X-axis: `Species`, `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width` from left to right.
- Y-axis: `Petal.Width`, `Petal.Length`, `Sepal.Width`, `Sepal.Length`, and `Species` from bottom to top.
The plot shows that the data points form distinct, well-separated groups, so the species can likely be predicted from the features.
Uploading Dataset to S3
Upload the preprocessed dataset to S3:
```python
def upload_csv_to_s3(df: pd.DataFrame, object_path: str) -> str:
    filename = ''.join([random.choice(string.digits + string.ascii_lowercase) for i in range(10)])
    path = os.path.abspath(os.path.join('./tmp', filename))
    df.to_csv(path, header=False, index=False)
    # Change the content type because the default is binary/octet-stream
    bucket.upload_file(path, object_path, ExtraArgs={'ContentType': 'text/csv'})
    return f's3://{bucket.name}/{object_path}'
```
Executing Preprocessing
To complete the preprocessing above, run the following:
```python
if __name__ == '__main__':
    df = load_csv(CSV_PATH)
    display(df)
    plot(df)
    train_df, test_df = train_test_split(df, shuffle=True, random_state=0)
    train_object_path = upload_csv_to_s3(train_df, S3_DATASET_TRAIN)
    test_object_path = upload_csv_to_s3(test_df, S3_DATASET_TEST)
```
Training
Configure the k-NN estimator and start the training process:
```python
def get_estimator(**hyperparams) -> Estimator:
    estimator = Estimator(
        image_uri=image_uris.retrieve('knn', boto3.Session().region_name),
        role=SAGEMAKER_ROLE,
        instance_count=ESTIMATOR_INSTANCE_COUNT,
        instance_type=ESTIMATOR_INSTANCE_TYPE,
        input_mode='Pipe',
        output_path=f's3://{S3_DATASET_BUCKET}/{S3_TRAIN_OUTPUT}',
        sagemaker_session=sagemaker.Session(),
    )
    hyperparams.update({'predictor_type': 'classifier'})
    estimator.set_hyperparameters(**hyperparams)
    return estimator
```
```python
def train(estimator: Estimator, train_object_path: str, test_object_path: str) -> None:
    train_input = TrainingInput(train_object_path, content_type='text/csv', input_mode='Pipe')
    test_input = TrainingInput(test_object_path, content_type='text/csv', input_mode='Pipe')
    estimator.fit({'train': train_input, 'test': test_input})
```
```python
if __name__ == '__main__':
    knn = get_estimator(k=1, sample_size=1000)
    train(knn, train_object_path, test_object_path)
```
After starting the training job, you will observe logs similar to the following:
```
2022-01-08 13:38:34 Starting - Starting the training job...
2022-01-08 13:38:57 Starting - Launching requested ML instances
ProfilerReport-1641649113: InProgress
......
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('accuracy', 0.9736842105263158)
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('macro_f_1.000', 0.97170347)
```
- The log provides metrics like accuracy and macro F1 score, offering insight into the model's performance on the test dataset (a sketch for retrieving these metrics programmatically follows this list).
- Leveraging both the `train` and `test` channels during training ensures the built-in algorithm evaluates the model's generalization ability automatically.
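If you prefer to read these test metrics programmatically after the job finishes, here is a minimal sketch using the SDK's `TrainingJobAnalytics`. The metric name `test:accuracy` is an assumption inferred from the log output above, and the sketch assumes `knn.latest_training_job.name` resolves to the finished job's name; verify both against your training job.

```python
from sagemaker.analytics import TrainingJobAnalytics

# Fetch metrics emitted by the estimator's most recent training job.
# 'test:accuracy' is an assumed metric name; adjust it if your job
# reports a different one.
analytics = TrainingJobAnalytics(
    training_job_name=knn.latest_training_job.name,
    metric_names=['test:accuracy'],
)
display(analytics.dataframe())
```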
ECR URI
The `image_uri` argument passed to the `Estimator` specifies the ECR container URI of the k-NN training algorithm provided by AWS. For detailed information about the container URIs for built-in algorithms, refer to the official documentation.
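For example, a quick way to see which container URI will be used in your current region (the output is region-specific):

```python
import boto3
from sagemaker import image_uris

# Resolve the ECR image URI of the built-in k-NN algorithm for the
# region the notebook is running in.
region = boto3.Session().region_name
print(image_uris.retrieve('knn', region))
```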
Channel Names
For SageMaker built-in algorithms, the training channel name is fixed to `train`. If you also include a `test` channel when creating the training job, your ML model will automatically be evaluated on the test data after training.
Using Pipe Mode
To enhance data streaming efficiency, you can enable Pipe mode by setting the `input_mode` parameter to `'Pipe'` on the `Estimator` and in each `TrainingInput` definition. Pipe mode streams data directly from S3 to the SageMaker instance, reducing the latency and memory requirements associated with downloading the entire dataset.
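For comparison, here is a minimal sketch of the default File mode (shown only to contrast with Pipe mode), which copies the full dataset from S3 to the training instance's disk before training starts:

```python
# File mode (the default): the dataset is downloaded to local disk on the
# training instance before the algorithm starts reading it.
file_mode_input = TrainingInput(
    train_object_path,
    content_type='text/csv',
    input_mode='File',
)
```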
k-NN Hyperparameters
The k-NN algorithm includes several configurable hyperparameters. For complete details on their usage and effects, consult the official documentation.
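As an illustrative, hedged sketch only, a few commonly tuned k-NN hyperparameters might be passed to the `get_estimator` helper like this; the values are arbitrary, and the parameter names should be verified against the official hyperparameter reference:

```python
# Illustrative values only; consult the k-NN hyperparameter reference
# before relying on these names or ranges.
knn_tuned = get_estimator(
    k=5,                              # number of neighbors used for voting
    sample_size=1000,                 # number of points sampled to build the index
    index_metric='COSINE',            # distance metric (assumption: 'L2' is the default)
    dimension_reduction_type='sign',  # optional random-projection reduction
    dimension_reduction_target=3,     # target dimension after reduction
)
```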
Inference
Deploy the trained model to an endpoint and validate predictions.
The `serializer` and `deserializer` in SageMaker specify the formats of the input and output data exchanged with a deployed inference endpoint.

- Serializers:
  - `CSVSerializer`: converts data into CSV format.
  - `JSONSerializer`: converts data into JSON format.
  - `NumpySerializer`: converts NumPy arrays into binary format.
- Deserializers:
  - `JSONDeserializer`: converts JSON responses into Python dictionaries or lists.
  - `BytesDeserializer`: returns raw bytes.
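A minimal local sketch of what these classes do (no endpoint required; the sample feature values and the fake response body are illustrative):

```python
import io

from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer

# CSVSerializer turns a list of rows into the CSV payload sent to the endpoint.
payload = CSVSerializer().serialize([[5.1, 3.5, 1.4, 0.2]])
print(payload)  # -> 5.1,3.5,1.4,0.2

# JSONDeserializer parses a JSON response body into Python objects.
fake_body = io.BytesIO(b'{"predictions": [{"predicted_label": 0.0}]}')
print(JSONDeserializer().deserialize(fake_body, 'application/json'))
```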
```python
def deploy(estimator: Estimator) -> Predictor:
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=PREDICTOR_INSTANCE_TYPE,
        serializer=CSVSerializer(),
        deserializer=JSONDeserializer(),
        endpoint_name=PREDICTOR_ENDPOINT_NAME,
    )
```
```python
def validate(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, data in test_df.iterrows():
        predict = predictor.predict(
            pd.DataFrame([data.drop('Species')]).to_csv(header=False, index=False),
            initial_args={'ContentType': 'text/csv'},
        )
        predicted_label = predict['predictions'][0]['predicted_label']
        row = data.tolist()
        row.append(predicted_label)
        row.append(data['Species'] == predicted_label)
        rows.append(row)
    return pd.DataFrame(rows, columns=('Species', 'Sepal.Length', 'Sepal.Width',
                                       'Petal.Length', 'Petal.Width', 'Prediction', 'Result'))
```
```python
if __name__ == '__main__':
    predictor = deploy(knn)
    predictions = validate(predictor, test_df)
    display(predictions)
```
The inference results include the `Prediction` and `Result` columns:

- `Prediction`: the predicted label for each sample provided to the model.
- `Result`: a boolean value (`True` or `False`) indicating whether the prediction matches the actual label.
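The `validate` function above issues one request per test row, which is easy to follow but slow. Here is a hedged alternative sketch that batches all rows into a single CSV payload, assuming the endpoint returns one `predictions` entry per input row (consistent with the per-row responses above):

```python
def validate_batch(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
    # Send every test row in one CSV payload instead of one request per row.
    features = test_df.drop(columns='Species')
    response = predictor.predict(
        features.to_csv(header=False, index=False),
        initial_args={'ContentType': 'text/csv'},
    )
    labels = [p['predicted_label'] for p in response['predictions']]
    result = test_df.copy()
    result['Prediction'] = labels
    result['Result'] = result['Species'] == result['Prediction']
    return result
```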
Cleaning Up
Clean up the AWS resources provisioned during this example (the model and the endpoint) with the following code:
```python
def delete_model(predictor: Predictor) -> None:
    predictor.delete_model()


def delete_endpoint(predictor: Predictor) -> None:
    predictor.delete_endpoint(delete_endpoint_config=True)


if __name__ == '__main__':
    delete_model(predictor)
    delete_endpoint(predictor)
```
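Note that the code above deletes only the model and the endpoint; the dataset objects uploaded to S3 remain. A small hedged sketch for removing them as well:

```python
# Remove the dataset objects uploaded earlier in this example. The key names
# come from the environment variables defined at the top.
for key in (S3_DATASET_TRAIN, S3_DATASET_TEST):
    bucket.Object(key).delete()
```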