How to Implement Full-Text Search for a Movie Dataset Using AWS CloudSearch

This note describes how to implement full-text search for a movies dataset using AWS CloudSearch.
Setting Up CloudSearch Domain
Create a CloudSearch domain with the following command:
aws cloudsearch create-domain \
  --domain-name searching-movies-data
According to the official documentation, creating a CloudSearch domain typically takes about 10 minutes.
You can verify the domain creation status by running:
aws cloudsearch describe-domains \
  --domain-names searching-movies-data
When the command output shows "Processing": false, the domain and its endpoints are fully created and ready to use.
{
  "DomainStatusList": [
    {
      "DomainId": "123456789012/searching-movies-data",
      "DomainName": "searching-movies-data",
      "ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
      "Created": true,
      "Deleted": false,
      "DocService": {
        "Endpoint": "doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
      },
      "SearchService": {
        "Endpoint": "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
      },
      "RequiresIndexDocuments": false,
      "Processing": false,
      "SearchInstanceType": "search.small",
      "SearchPartitionCount": 1,
      "SearchInstanceCount": 1,
      "Limits": {
        "MaximumReplicationCount": 5,
        "MaximumPartitionCount": 10
      }
    }
  ]
}
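Rather than rerunning the status command by hand, the check can be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function names is_ready and wait_for_domain, the 40-attempt limit, and the 30-second delay are illustrative choices, not values from the CloudSearch documentation. It relies on the CLI's global --query and --output options to extract the Processing flag.

```shell
# Hypothetical helper: succeeds when the Processing value means "finished".
# The comparison ignores case because text output may print "False".
is_ready() {
  [ "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" = "false" ]
}

# Hypothetical helper: polls describe-domains until Processing is false.
# Arguments: domain name, max attempts (default 40), delay in seconds (default 30).
wait_for_domain() {
  domain="$1"
  attempts="${2:-40}"
  delay="${3:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    processing=$(aws cloudsearch describe-domains \
      --domain-names "$domain" \
      --query 'DomainStatusList[0].Processing' \
      --output text)
    if is_ready "$processing"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A call such as wait_for_domain searching-movies-data && echo "domain is ready" would block until the domain settles or the attempts run out.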
To enhance security, update the domain’s access policy to allow access only from your IP address:
aws cloudsearch update-service-access-policies \
  --domain-name searching-movies-data \
  --access-policies '
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["cloudsearch:*"],
        "Condition": {"IpAddress": {"aws:SourceIp": "xxx.xxx.xxx.xxx/32"}}
      }
    ]
  }'
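To avoid editing the policy JSON by hand each time your address changes, the policy can be generated from a parameter. This is a sketch only: build_access_policy and restrict_domain_to_ip are hypothetical helper names, and 203.0.113.10 in the usage note is a documentation-only placeholder address.

```shell
# Hypothetical helper: prints an access policy that allows one IPv4 address.
# Argument: the address without a prefix length, e.g. 203.0.113.10.
build_access_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["cloudsearch:*"],
      "Condition": {"IpAddress": {"aws:SourceIp": "$1/32"}}
    }
  ]
}
EOF
}

# Hypothetical wrapper: applies the generated policy to a domain.
# Arguments: domain name, IPv4 address.
restrict_domain_to_ip() {
  aws cloudsearch update-service-access-policies \
    --domain-name "$1" \
    --access-policies "$(build_access_policy "$2")"
}
```

For example, restrict_domain_to_ip searching-movies-data 203.0.113.10 would apply the same policy as the command above with that address filled in.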
Configuring Index Fields
Define the index fields based on the dataset’s structure. This example uses The Movies Dataset from Kaggle (CC0: Public Domain). The following commands define one index field per column:
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name adult --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name belongs_to_collection --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name budget --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name genres --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name homepage --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name id --type int
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name imdb_id --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name original_language --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name original_title --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name overview --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name popularity --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name poster_path --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name production_companies --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name production_countries --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name release_date --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name revenue --type int
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name runtime --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name spoken_languages --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name status --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name tagline --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name title --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name video --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name vote_average --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name vote_count --type int
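Because the commands above differ only in field name and type, they could also be driven from a list. The sketch below assumes a "name:type" pair notation and a helper called define_fields, both invented here for illustration. The FIELDS variable holds only a subset of the fields.

```shell
# Field definitions as "name:type" pairs; a subset of the commands above.
FIELDS="title:text overview:text vote_average:double vote_count:int"

# Hypothetical helper: defines every "name:type" pair on the given domain.
# Arguments: domain name, then one or more name:type pairs.
define_fields() {
  domain="$1"
  shift
  for pair in "$@"; do
    name="${pair%%:*}"   # text before the first colon
    type="${pair#*:}"    # text after the first colon
    aws cloudsearch define-index-field \
      --domain-name "$domain" --name "$name" --type "$type" || return 1
  done
}
```

Calling define_fields searching-movies-data $FIELDS (unquoted, so the list splits into separate pairs) would issue one define-index-field call per pair and stop at the first failure.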
Once the fields are defined, tell the search domain to start indexing its documents:
aws cloudsearch index-documents \
  --domain-name searching-movies-data
Indexing Dataset
Download the dataset from Kaggle and prepare a sample file from the first 1,000 lines (the header plus the rows that follow):
head -1000 movies_metadata.csv > sample.csv
CloudSearch can index data directly from CSV files through the console, whereas the AWS CLI’s aws cloudsearchdomain upload-documents command accepts only JSON or XML.
Navigate to Actions > Upload documents.
Choose your CSV file and click Next.
Review the detected fields and click Upload documents.
Once the process completes, you can confirm that 998 records (excluding headers) are successfully indexed.
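For the CLI route mentioned earlier, the data has to be in CloudSearch's JSON (or XML) batch format rather than CSV. The sketch below shows the shape of a JSON batch and the upload call. The document ID, the field values, and the helper names write_sample_batch and upload_batch are made up for illustration, and converting the real CSV into this shape is left out.

```shell
# Hypothetical helper: writes a one-document batch in CloudSearch's JSON format.
# The document ID and field values are placeholders, not dataset rows.
write_sample_batch() {
  cat > "$1" <<'EOF'
[
  {
    "type": "add",
    "id": "example_movie_1",
    "fields": {
      "title": "An Example Title",
      "overview": "A placeholder overview used only to show the batch shape.",
      "vote_average": 7.5
    }
  }
]
EOF
}

# Hypothetical wrapper: uploads a JSON batch to the domain's document endpoint.
# Arguments: document endpoint host name, path to the batch file.
upload_batch() {
  aws cloudsearchdomain upload-documents \
    --endpoint-url "https://$1" \
    --content-type application/json \
    --documents "$2"
}
```

The first argument to upload_batch would be the DocService endpoint reported by describe-domains.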
Running Queries
Searching for Text
To search for movies with the keyword house in the title and overview fields:
curl --location \
  -g \
  --request GET \
  'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q=house&q.options={fields:["title","overview"]}&return=title,overview' | jq .
The response will include matching results, such as:
{
  "status": {
    "rid": "8fDJv8swsgEK1DyD",
    "time-ms": 1
  },
  "hits": {
    "found": 26,
    "start": 0,
    "hit": [
      {
        "id": "local_file_466",
        "fields": {
          "overview": "Hip Hop duo Kid & Play return...",
          "title": "House Party 3"
        }
      },
      ...
    ]
  }
}
For more information, refer to the official documentation.
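When the search term contains spaces or other reserved characters, it can be easier to let curl percent-encode each parameter than to build the query string by hand. The following sketch uses curl's --get and --data-urlencode options for that. The function name search_text is hypothetical, and the host name passed to it is assumed to be the SearchService endpoint of your domain.

```shell
# Hypothetical helper: full-text search over the title and overview fields.
# Arguments: search endpoint host name, search term.
search_text() {
  curl --silent --show-error --get \
    --data-urlencode "q=$2" \
    --data-urlencode 'q.options={fields:["title","overview"]}' \
    --data-urlencode 'return=title,overview' \
    "https://$1/2013-01-01/search"
}
```

For instance, search_text "$SEARCH_ENDPOINT" "haunted house" | jq . would send the two-word term safely encoded, where SEARCH_ENDPOINT is a variable you set yourself.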
Searching for Numbers
Numeric searches use the structured query parser. For example, to find movies with a vote_average of 5.0:
curl --location --request GET \
  'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:5.0&return=title,overview' | jq .
The response will include matching results, such as:
{
  "status": {
    "rid": "w+Xgv8swwQEK1DyD",
    "time-ms": 0
  },
  "hits": {
    "found": 35,
    "start": 0,
    "hit": [
      {
        "id": "local_file_144",
        "fields": {
          "overview": "Far from home...",
          "title": "The Amazing Panda Adventure"
        }
      },
      ...
    ]
  }
}
For more information, refer to the official documentation.
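The same structured query can also be sent through the AWS CLI instead of curl. This is a sketch under the assumption that your CLI credentials are permitted to query the domain. The wrapper name search_structured is hypothetical, and the CLI spells the q parameter as --search-query because --query is already a global CLI option.

```shell
# Hypothetical helper: structured search through the AWS CLI instead of curl.
# Arguments: search endpoint host name, structured query expression.
search_structured() {
  aws cloudsearchdomain search \
    --endpoint-url "https://$1" \
    --query-parser structured \
    --search-query "$2" \
    --return title,overview
}
```

A call such as search_structured "$SEARCH_ENDPOINT" 'vote_average:5.0' would correspond to the curl request above, with SEARCH_ENDPOINT set to your search endpoint host name.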
Searching for Ranges
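To find movies with a vote_average of 7.0 or higher (the square bracket in [7.0,} makes the lower bound inclusive, and the curly brace leaves the upper bound open):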
curl --location \
  -g \
  --request GET \
  'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:[7.0,}&return=title,overview' | jq .
The response will include matching results, such as:
{
  "status": {
    "rid": "vJ/vv8swyAEK1DyD",
    "time-ms": 2
  },
  "hits": {
    "found": 254,
    "start": 0,
    "hit": [
      {
        "id": "local_file_1",
        "fields": {
          "overview": "Led by Woody, Andy's toys...",
          "title": "Toy Story"
        }
      },
      ...
    ]
  }
}
For more information, refer to the official documentation.
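The bracket characters in a range are easy to mix up, so it may help to wrap them in small named helpers. The functions below are hypothetical conveniences that only build the expression string. In the structured syntax a square bracket includes the bound, while a curly brace excludes it or leaves that side open.

```shell
# Hypothetical helpers that build structured-query range expressions.
# Arguments: field name, then the bound or bounds.
range_at_least()     { printf '%s:[%s,}' "$1" "$2"; }         # field >= value
range_greater_than() { printf '%s:{%s,}' "$1" "$2"; }         # field >  value
range_between()      { printf '%s:[%s,%s]' "$1" "$2" "$3"; }  # low <= field <= high
```

For example, range_at_least vote_average 7.0 produces the expression used in the request above, while range_greater_than vote_average 7.0 would exclude movies rated exactly 7.0.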
Cleaning Up
Clean up all the AWS resources provisioned during this example with the following command:
aws cloudsearch delete-domain \
  --domain-name searching-movies-data