How to Implement Full-Text Search for a Movie Dataset Using AWS CloudSearch

Takahiro Iwasa

This note describes how to implement full-text search for a movies dataset using AWS CloudSearch.

Setting Up CloudSearch Domain

Create a CloudSearch domain with the following command:

aws cloudsearch create-domain \
--domain-name searching-movies-data

According to the official documentation, creating a CloudSearch domain takes about 10 minutes.

You can verify the domain creation status by running:

aws cloudsearch describe-domains \
--domain-names searching-movies-data

When the output shows Processing: false, the domain and its endpoints have been created and are ready to use.

{
  "DomainStatusList": [
    {
      "DomainId": "123456789012/searching-movies-data",
      "DomainName": "searching-movies-data",
      "ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
      "Created": true,
      "Deleted": false,
      "DocService": {
        "Endpoint": "doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
      },
      "SearchService": {
        "Endpoint": "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
      },
      "RequiresIndexDocuments": false,
      "Processing": false,
      "SearchInstanceType": "search.small",
      "SearchPartitionCount": 1,
      "SearchInstanceCount": 1,
      "Limits": {
        "MaximumReplicationCount": 5,
        "MaximumPartitionCount": 10
      }
    }
  ]
}
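Because domain creation takes several minutes, it can be convenient to poll the status instead of re-running the command by hand. The following is a minimal sketch, not part of the original walkthrough: the function name wait_for_domain, the polling interval, and the retry limit are all assumptions, and it only checks the Processing flag described above.

```shell
# Poll describe-domains until the domain reports Processing == false.
# Arguments: domain name, seconds between polls, maximum number of polls.
# wait_for_domain is a hypothetical helper name, not an AWS CLI command.
wait_for_domain() {
  domain="$1"
  interval="${2:-30}"
  max_tries="${3:-40}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # --query extracts the single flag; JSON output prints it as true or false.
    processing="$(aws cloudsearch describe-domains \
      --domain-names "$domain" \
      --query 'DomainStatusList[0].Processing' \
      --output json)"
    if [ "$processing" = "false" ]; then
      echo "Domain $domain is ready."
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for $domain." >&2
  return 1
}

# Usage (assumes AWS credentials and a default region are configured):
# wait_for_domain searching-movies-data 30 40
```

The function returns a non-zero status on timeout, so it can gate later steps in a script.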

To enhance security, update the domain’s access policy to allow access only from your IP address:

aws cloudsearch update-service-access-policies \
--domain-name searching-movies-data \
--access-policies '
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["cloudsearch:*"],
      "Condition": {"IpAddress": {"aws:SourceIp": "xxx.xxx.xxx.xxx/32"}}
    }
  ]
}'
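If you prefer not to paste your IP address into the policy by hand, the policy document can be generated from a variable. This is only a sketch under assumptions: build_access_policy is a hypothetical helper name, and the usage comment looks up the public IP through https://checkip.amazonaws.com, which the original article does not mention.

```shell
# Print an access policy that allows all CloudSearch actions from one CIDR block.
# build_access_policy is a hypothetical helper name used for illustration.
build_access_policy() {
  cidr="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["cloudsearch:*"],
      "Condition": {"IpAddress": {"aws:SourceIp": "$cidr"}}
    }
  ]
}
EOF
}

# Usage (assumption: checkip.amazonaws.com returns the caller's public IP as plain text):
# my_ip="$(curl -s https://checkip.amazonaws.com)"
# aws cloudsearch update-service-access-policies \
#   --domain-name searching-movies-data \
#   --access-policies "$(build_access_policy "${my_ip}/32")"
```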

Configuring Index Fields

Define the index fields based on the dataset’s structure. This example uses The Movies Dataset from Kaggle (CC0: Public Domain). The following commands define one index field for each column of the dataset:

aws cloudsearch define-index-field \
--domain-name searching-movies-data --name adult --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name belongs_to_collection --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name budget --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name genres --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name homepage --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name id --type int
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name imdb_id --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name original_language --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name original_title --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name overview --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name popularity --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name poster_path --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name production_companies --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name production_countries --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name release_date --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name revenue --type int
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name runtime --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name spoken_languages --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name status --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name tagline --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name title --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name video --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name vote_average --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name vote_count --type int
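The 24 commands above differ only in the field name and type, so they can also be generated from a list. The sketch below is one possible way to do that, not the author's method; print_index_field_commands is a hypothetical helper, and it only prints the commands (a dry run) so that they can be reviewed before being executed.

```shell
# Print one define-index-field command per "name:type" pair (dry run only).
# print_index_field_commands is a hypothetical helper name.
print_index_field_commands() {
  domain="$1"
  while IFS=: read -r name type; do
    [ -n "$name" ] || continue
    printf 'aws cloudsearch define-index-field --domain-name %s --name %s --type %s\n' \
      "$domain" "$name" "$type"
  done <<'EOF'
adult:text
belongs_to_collection:text
budget:double
genres:text
homepage:text
id:int
imdb_id:text
original_language:text
original_title:text
overview:text
popularity:double
poster_path:text
production_companies:text
production_countries:text
release_date:text
revenue:int
runtime:double
spoken_languages:text
status:text
tagline:text
title:text
video:text
vote_average:double
vote_count:int
EOF
}

# Usage: review the output first, then pipe it to a shell to run the commands.
# print_index_field_commands searching-movies-data
# print_index_field_commands searching-movies-data | sh
```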

Once the fields are defined, tell the search domain to start indexing its documents:

aws cloudsearch index-documents \
--domain-name searching-movies-data

Indexing Dataset

Download the dataset from Kaggle and prepare a sample file containing the first 1,000 lines (the header line plus the records that follow):

head -n 1000 movies_metadata.csv > sample.csv

The CloudSearch console can index data directly from a CSV file, whereas the AWS CLI’s aws cloudsearchdomain upload-documents command accepts only JSON or XML. This example therefore uploads the sample file through the console.

Navigate to Actions > Upload documents.

Choose your CSV file and click Next.

Review the detected fields and click Upload documents.

Once the upload completes, you can confirm that 998 records (excluding the header) were indexed successfully.
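For comparison, uploading through the AWS CLI requires a JSON (or XML) document batch rather than a CSV file. The sketch below writes a one-document batch and shows the upload command in a comment. It is an illustration only: write_sample_batch is a hypothetical helper, and the document ID and field values are made up for the example rather than taken from the console upload above.

```shell
# Write a one-document batch in the CloudSearch JSON batch format.
# The document ID and field values below are illustrative only.
write_sample_batch() {
  out="$1"
  cat > "$out" <<'EOF'
[
  {
    "type": "add",
    "id": "movie_862",
    "fields": {
      "id": 862,
      "title": "Toy Story",
      "vote_average": 7.7
    }
  }
]
EOF
}

# Usage (the endpoint is the DocService endpoint returned by describe-domains):
# write_sample_batch batch.json
# aws cloudsearchdomain upload-documents \
#   --endpoint-url "https://doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com" \
#   --content-type application/json \
#   --documents batch.json
```

Each entry in the batch names an operation ("add" or "delete"), a document ID, and, for "add", the field values to index.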

Running Queries

Searching for Text

To search for movies with the keyword house in the title and overview fields:

curl --location \
-g \
--request GET \
'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q=house&q.options={fields:["title","overview"]}&return=title,overview' | jq .

The response will include matching results, such as:

{
  "status": {
    "rid": "8fDJv8swsgEK1DyD",
    "time-ms": 1
  },
  "hits": {
    "found": 26,
    "start": 0,
    "hit": [
      {
        "id": "local_file_466",
        "fields": {
          "overview": "Hip Hop duo Kid & Play return...",
          "title": "House Party 3"
        }
      },
      ...
    ]
  }
}

For more information, refer to the official CloudSearch documentation on searching for text.
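The curl command above passes brackets, braces, and quotes in the URL as-is and relies on -g to stop curl from interpreting them. An alternative, sketched here under assumptions, is to let curl URL-encode each parameter with --get and --data-urlencode. The function name search_text is hypothetical, and the endpoint argument is the SearchService host name from describe-domains.

```shell
# Run a simple-parser text search restricted to the title and overview fields.
# Arguments: search endpoint host name, query string.
# search_text is a hypothetical helper name.
search_text() {
  endpoint="$1"
  query="$2"
  # --get appends every --data-urlencode pair to the URL as an encoded query string.
  curl --silent --get "https://${endpoint}/2013-01-01/search" \
    --data-urlencode "q=${query}" \
    --data-urlencode 'q.options={fields:["title","overview"]}' \
    --data-urlencode 'return=title,overview'
}

# Usage:
# search_text search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com house | jq .
```

Encoding the parameters this way also makes queries that contain spaces or other reserved characters safe to pass.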

Searching for Numbers

To run a numeric search, such as finding movies with a vote_average of exactly 5.0, use the structured query parser:

curl --location --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:5.0&return=title,overview' | jq .

The response will include matching results, such as:

{
  "status": {
    "rid": "w+Xgv8swwQEK1DyD",
    "time-ms": 0
  },
  "hits": {
    "found": 35,
    "start": 0,
    "hit": [
      {
        "id": "local_file_144",
        "fields": {
          "overview": "Far from home...",
          "title": "The Amazing Panda Adventure"
        }
      },
      ...
    ]
  }
}

For more information, refer to the official CloudSearch documentation on searching for numbers.

Searching for Ranges

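To find movies with a vote_average of 7.0 or higher (the square bracket makes the lower bound inclusive, and the curly brace leaves the upper bound open):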

curl --location \
-g \
--request GET \
'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:[7.0,}&return=title,overview' | jq .

The response will include matching results, such as:

{
  "status": {
    "rid": "vJ/vv8swyAEK1DyD",
    "time-ms": 2
  },
  "hits": {
    "found": 254,
    "start": 0,
    "hit": [
      {
        "id": "local_file_1",
        "fields": {
          "overview": "Led by Woody, Andy's toys...",
          "title": "Toy Story"
        }
      },
      ...
    ]
  }
}

For more information, refer to the official CloudSearch documentation on searching for a range of values.
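In the structured query syntax, a square bracket marks an inclusive bound, a curly brace marks an exclusive bound, and an omitted bound is written with a curly brace. The small helper below sketches how such range expressions could be assembled. The name range_query and its argument order are assumptions made for illustration.

```shell
# Build a structured-parser range expression for a numeric field.
# Arguments: field, lower bound (may be empty), upper bound (may be empty),
# and "inclusive" (default) or "exclusive" for the bounds that are present.
# range_query is a hypothetical helper name.
range_query() {
  field="$1"
  lower="$2"
  upper="$3"
  mode="${4:-inclusive}"
  if [ "$mode" = "inclusive" ]; then
    open='['
    close=']'
  else
    open='{'
    close='}'
  fi
  # An omitted bound is always written with a curly brace.
  [ -n "$lower" ] || open='{'
  [ -n "$upper" ] || close='}'
  printf '%s:%s%s,%s%s\n' "$field" "$open" "$lower" "$upper" "$close"
}

# Usage:
# range_query vote_average 7.0 ""            # vote_average:[7.0,}
# range_query vote_average 7.0 "" exclusive  # vote_average:{7.0,}
# range_query vote_average 5.0 7.0           # vote_average:[5.0,7.0]
```

The first usage line reproduces the expression used in the query above, and the second is the strictly-greater-than variant.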

Cleaning Up

Clean up all the AWS resources provisioned during this example with the following command:

aws cloudsearch delete-domain \
--domain-name searching-movies-data

Takahiro Iwasa

Software Developer
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Japan AWS Top Engineers 2020-2023.