Thursday, April 14, 2016
Linux BASH Shell Script IMDB Movie Page Download and Rating Extraction from Titles
bash, html regex, imdb wget, linux, lookahead, lookbehind, ls read, perl grep, read write two files at once, replace space with dash, sed, shell script, store string in array, tags regular expression, wget save as
How the Code Works:
The code below is really messy, but I am too lazy to fix it. It does provide examples of a lot of different commands. First, run the script using:
bash imdbMovieSearchRatingExtract.sh
Make sure to cd into the script's directory when running it from the terminal. It is also advisable to keep the script in a folder of its own. After running the command, enter a search term (e.g. Batman).
The user first enters a movie name to search for. Using that as the search string, the script runs wget on an IMDB query link and saves the resulting page.
wget -O "$searchText-search.html" "http://www.imdb.com/find?q=$searchText"
If the search term was Batman the saved file name will be Batman-search.html .
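A search term that contains spaces (for example, Batman Begins) would put a raw space into the query URL. Here is a minimal sketch of one way around that, assuming it is enough to swap spaces for `+` in the query string. `build_search_url` is a hypothetical helper and is not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the IMDB search URL for a search term.
# Assumption: only spaces need escaping; other special characters are left as-is.
build_search_url() {
    local searchText=$1
    # ${var// /+} replaces every space with a plus sign
    printf 'http://www.imdb.com/find?q=%s\n' "${searchText// /+}"
}

build_search_url "Batman Begins"
```

The result could then be passed to wget in place of the hand-built URL.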
Next, sed extracts the portion of the saved HTML file between the Titles heading and the findMoreMatches marker and saves that partial content in a text file called partialContentFile.txt.
sed -e '/Titles<\/h3>/,/findMoreMatches/!d' "$searchText-search.html" > "partialContentFile.txt"
This file essentially contains the block of tags for the relevant movie titles.
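To see what the sed range does, here is a small sketch that runs the same expression on a made-up page. The sample markup is an assumption for illustration and the real IMDB HTML will differ. `extract_titles_block` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed command used in the script.
extract_titles_block() {
    # Delete (!d) every line that is NOT inside the range that starts at
    # "Titles</h3>" and ends at the next line containing "findMoreMatches"
    sed -e '/Titles<\/h3>/,/findMoreMatches/!d' "$1"
}

sample=$(mktemp)
# Made-up stand-in for a saved search page
cat > "$sample" <<'EOF'
<h3>Names</h3>
<h3 class="findSectionHeader">Titles</h3>
<a href="/title/tt0096895/">Batman</a> (1989)
<div class="findMoreMatches">
<h3>Keywords</h3>
EOF

extract_titles_block "$sample"
rm -f "$sample"
```

Only the three lines from the Titles heading through the findMoreMatches line are printed. Both boundary lines are included in the output.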
The base URL is https://imdb.com; appending a title path to it points to a specific movie page.
grep -E -w -o "\/title\/[a-zA-Z0-9]+\/" "partialContentFile.txt" > "$writeLinksFileName"
This extracts all movie title links without the base URL, for example /title/tt0096895/. There is one problem: the anchor tag and the img tag of each result both contain the same link, so the regular expression returns every link twice. This can be fixed by using only the even or only the odd lines.
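As a sketch of the "odd lines only" idea, the duplicates could be dropped from the links file with awk before downloading. This assumes every link appears exactly twice in a row. `keep_first_of_each_pair` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep lines 1, 3, 5, ... of a file.
# Assumption: the file holds strictly alternating duplicate pairs.
keep_first_of_each_pair() {
    # NR is awk's 1-based line number; a true condition prints the line
    awk 'NR % 2 == 1' "$1"
}

links=$(mktemp)
printf '%s\n' /title/tt0096895/ /title/tt0096895/ > "$links"
keep_first_of_each_pair "$links"
rm -f "$links"
```

If the pairing is not guaranteed, `awk '!seen[$0]++'` removes repeated lines while keeping the original order, without relying on line positions.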
A Perl-compatible regex (grep -P) with positive lookbehind and lookahead then gets the movie names from the partialContentFile.txt file.
grep -P -o "(?<=>)([a-zA-Z0-9&: _-]+)(?=<\/a>[\(\) a-zA-Z0-9 _-]*\([0-9]+\))" "partialContentFile.txt" > "movieNames.txt"
The same technique also gets the movie years into a separate text file:
grep -P -o "(?<=<\/a> )(\([0-9]+\))(?= )" "partialContentFile.txt" > "movieYears.txt"
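The two patterns can be tried on a single sample line to see what the lookbehind and lookahead keep and discard. The line below is a made-up approximation of a search result row, not real IMDB markup. `extract_names` and `extract_years` are hypothetical wrappers. The sketch assumes GNU grep built with `-P` support.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the two grep -P patterns; both read stdin.
extract_names() {
    # Match text preceded by ">" and followed by "</a> ... (year)";
    # the lookarounds are checked but not included in the output
    grep -P -o '(?<=>)([a-zA-Z0-9&: _-]+)(?=<\/a>[\(\) a-zA-Z0-9 _-]*\([0-9]+\))'
}

extract_years() {
    # Match "(digits)" preceded by "</a> " and followed by a space
    grep -P -o '(?<=<\/a> )(\([0-9]+\))(?= )'
}

# Made-up sample row
sample='<td class="result_text"> <a href="/title/tt0096895/">Batman</a> (1989) </td>'
printf '%s\n' "$sample" | extract_names
printf '%s\n' "$sample" | extract_years
```

The patterns are in single quotes here so the shell passes the backslashes to grep unchanged.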
First it creates a new, empty file to store the movie names and years. Then it reads the two files at once and combines their contents (movie names and years) into a single file.
> "movieNameYear.txt"
while read -r -u3 movieName; read -r -u4 movieYear; do
    echo "$movieName" "$movieYear" >> "movieNameYear.txt"
done 3<movieNames.txt 4<movieYears.txt
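For comparison, the same line-by-line merge can be sketched with `paste`, which is a different technique from the two-descriptor `read` loop in the script. `join_names_years` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge two files line by line.
join_names_years() {
    # paste reads both files in lockstep; -d ' ' joins each pair with one space
    paste -d ' ' "$1" "$2"
}

names=$(mktemp)
years=$(mktemp)
printf '%s\n' "Batman" "Batman Begins" > "$names"
printf '%s\n' "(1989)" "(2005)" > "$years"
join_names_years "$names" "$years"
rm -f "$names" "$years"
```

One difference to keep in mind: `paste` keeps going until the longer file ends, while the `read` loop stops as soon as the years file runs out.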
Next it reads the movie name and year file line by line, replaces all the spaces with underscore characters, and stores the results in an array called movieNameYear_array.
j=0
while read line
do
    repline=$line
    # Replace file name spaces with underscore
    fixedline=${repline// /_}
    movieNameYear_array[j]=$fixedline
    #echo ${movieNameYear_array[j]}
    j=$(( j + 1 ))
done < "movieNameYear.txt"
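A shorter way to fill the array is sketched here, assuming bash 4 or newer for `mapfile`. `load_names` is a hypothetical name; the array name matches the one used in the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: load a file into movieNameYear_array, one line per
# element, then replace spaces with underscores in every element.
# Assumption: bash 4+ (mapfile is not available in older versions).
load_names() {
    mapfile -t movieNameYear_array < "$1"
    # ${array[@]// /_} applies the substitution to each element
    movieNameYear_array=("${movieNameYear_array[@]// /_}")
}

list=$(mktemp)
printf '%s\n' "Batman (1989)" "Batman Begins (2005)" > "$list"
load_names "$list"
printf '%s\n' "${movieNameYear_array[@]}"
rm -f "$list"
```

The element count is then available as `${#movieNameYear_array[@]}`, which plays the role of the counter `j` in the loop version.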
Now it runs wget on each movie title link to download the page. Because each link appears twice in the links file, it downloads only every other line (the first of each duplicate pair). The movie title pages are saved under the name and year stored in the array, and they are downloaded into a separate folder.
moviefoldername=movies
# -p avoids an error when the folder already exists from an earlier run
mkdir -p "$moviefoldername"
i=0
k=0
while read -r line
do
    temp=$(( i % 2 ))
    # Temporary fix when file name or file year was not extracted correctly:
    # stop once all j extracted names have been used
    if [ "$j" -eq "$k" ]; then
        break
    fi
    if [ "$temp" -eq 0 ]; then
        # Each result page is downloaded here; the ratings are extracted from the saved pages later
        wget -O "$moviefoldername/${movieNameYear_array[k]}" "http://www.imdb.com$line"
        k=$(( k + 1 ))
    fi
    i=$(( i + 1 ))
done < "$writeLinksFileName"
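To check which link gets paired with which file name without hitting the network, the loop can be sketched as a dry run that prints the pairs instead of calling wget. `plan_downloads` is a hypothetical helper, and the second title ID and name below are placeholders.

```shell
#!/usr/bin/env bash
# Dry run: print "<output file> <url>" pairs instead of calling wget.
# plan_downloads is a hypothetical helper, not part of the original script.
plan_downloads() {
    local linksFile=$1
    shift
    local names=("$@")   # remaining arguments: one file name per movie
    local i=0 k=0 line
    while read -r line; do
        # Stop once every extracted name has been paired with a link
        if [ "$k" -ge "${#names[@]}" ]; then
            break
        fi
        # i = 0, 2, 4, ... are the first lines of each duplicate pair
        if [ $(( i % 2 )) -eq 0 ]; then
            printf '%s http://www.imdb.com%s\n' "movies/${names[k]}" "$line"
            k=$(( k + 1 ))
        fi
        i=$(( i + 1 ))
    done < "$linksFile"
}

links=$(mktemp)
# Sample links file; the second ID is a placeholder
printf '%s\n' /title/tt0096895/ /title/tt0096895/ /title/tt0000000/ /title/tt0000000/ > "$links"
plan_downloads "$links" "Batman_(1989)" "Placeholder_(2000)"
rm -f "$links"
```

If the names and links ever get out of step, the mismatch shows up in this listing before anything is downloaded.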
It runs ls on the movie titles directory to get each of the files, then uses a grep Perl regex with positive lookahead and lookbehind to get the movie ratings. The ratings are shown in the console along with the movie titles.
for fileName in `ls $moviefoldername/`
do
    #echo "$fileName"
    # Sample rating tag block
    #<span itemprop="ratingValue">6.4</span></strong>
    echo "Rating of: $fileName"
    grep -P -o "(?<=<span itemprop=\"ratingValue\">)([0-9][.]?[0-9]?)(?=<\/span><\/strong>)" "$moviefoldername/$fileName"
    echo "==================="
done
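The rating step can be tried offline on a file that contains only the sample rating tag from the comment above. This sketch swaps the `ls` call for a shell glob, which copes better with unusual file names. `print_ratings` is a hypothetical name, and GNU grep with `-P` support is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the rating found in every file of a directory.
print_ratings() {
    local dir=$1 filePath
    # A glob instead of `ls`: each match arrives as one intact path
    for filePath in "$dir"/*; do
        [ -f "$filePath" ] || continue
        # ${filePath##*/} strips the directory part, leaving the file name
        echo "Rating of: ${filePath##*/}"
        grep -P -o '(?<=<span itemprop="ratingValue">)([0-9][.]?[0-9]?)(?=</span></strong>)' "$filePath"
        echo "==================="
    done
}

demoDir=$(mktemp -d)
# Stand-in for a downloaded page: just the sample rating tag block
printf '%s\n' '<strong><span itemprop="ratingValue">6.4</span></strong>' > "$demoDir/Batman_(1989)"
print_ratings "$demoDir"
rm -rf "$demoDir"
```

A file with no matching tag simply prints its name and the separator with no rating line in between.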
Code IMDB Movie Search and Rating Extraction: