Thursday, April 14, 2016

Linux BASH Shell Script IMDB Movie Page Download and Rating Extraction from Titles


How the Code Works:

The code below is really messy, but I am too lazy to fix it. It does provide examples of a lot of different commands.
First, run the script using:
bash imdbMovieSearchRatingExtract.sh

Make sure to cd into the script's directory before running it from the terminal. It is also advisable to keep the script in a folder of its own, since it creates several files. After running the command, enter a search term (e.g. Batman).
The script uses the search term as the query string, runs wget against the IMDb search URL, and saves the resulting page.

wget -O "$searchText-search.html" "http://www.imdb.com/find?q=$searchText"

If the search term was Batman the saved file name will be Batman-search.html .
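
A search term containing characters such as & would be read as part of the query syntax rather than as part of the title, so it may be worth percent-encoding the term before building the URL. The snippet below is only a sketch of that idea: urlencode is a hypothetical helper that is not part of the original script, and it assumes plain ASCII input.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the original script): percent-encode a string
# so it can be placed in a URL query. Assumes ASCII input.
urlencode() {
    local input="$1" output="" char i
    local LC_ALL=C
    for (( i = 0; i < ${#input}; i++ )); do
        char="${input:i:1}"
        case "$char" in
            # Unreserved characters are copied as they are
            [a-zA-Z0-9.~_-]) output+="$char" ;;
            # Everything else becomes %XX (hex value of the character)
            *) printf -v char '%%%02X' "'$char"; output+="$char" ;;
        esac
    done
    printf '%s\n' "$output"
}

searchText="Batman Begins"
searchUrl="http://www.imdb.com/find?q=$(urlencode "$searchText")"
echo "$searchUrl"
```

The resulting searchUrl could then be passed to wget in place of the raw string, while the unencoded $searchText is still used for the saved file name.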

Next, sed extracts the portion of the saved HTML file that lies between the Titles heading and the findMoreMatches marker, and writes that partial content to a text file called partialContentFile.txt .

sed -e '/Titles<\/h3>/,/findMoreMatches/!d' "$searchText-search.html" > "partialContentFile.txt"

This file essentially contains just the block of tags for the matching movie titles.
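
To see what the sed range does without hitting IMDB, it can be tried on a small made-up file. The HTML below is invented for illustration and is much simpler than a real search page; only the sed command is taken from the script.

```shell
#!/usr/bin/env bash

# Made-up sample page: a Names section, a Titles section, and a Keywords section
sampleFile=$(mktemp)
cat > "$sampleFile" <<'EOF'
<h3>Names</h3>
<a href="/name/nm0000001/">Someone</a>
<h3 class="findSectionHeader">Titles</h3>
<a href="/title/tt0096895/">Batman</a> (1989)
<div class="findMoreMatches">more</div>
<h3>Keywords</h3>
EOF

# Same sed expression as the script: delete (d) every line that is NOT (!)
# inside the range starting at Titles</h3> and ending at findMoreMatches
extracted=$(sed -e '/Titles<\/h3>/,/findMoreMatches/!d' "$sampleFile")
echo "$extracted"

rm -f "$sampleFile"
```

Only the three lines from the Titles heading through the findMoreMatches line are printed; the Names and Keywords lines are dropped.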
The base URL is http://www.imdb.com , and appending a title path to it gives the address of a specific movie page.

grep -E -w -o "\/title\/[a-zA-Z0-9]+\/" "partialContentFile.txt" > $writeLinksFileName

This extracts every movie title link without the base URL, for example /title/tt0096895/ . There is one problem: the anchor tag and the img tag of each result both contain the same link, so the regular expression returns two copies of the link for each movie title. This can be fixed by using only the even or only the odd lines.
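
Here is a small sketch of two ways to drop the duplicate links, run on made-up data. Neither awk command appears in the original script (which skips lines with a counter later on); they are just alternatives that give the same result when every link appears exactly twice in a row.

```shell
#!/usr/bin/env bash

# Made-up links file where every title path appears twice, as described above
linksFile=$(mktemp)
printf '%s\n' \
    /title/tt0096895/ /title/tt0096895/ \
    /title/tt0372784/ /title/tt0372784/ > "$linksFile"

# Option 1: keep only the odd-numbered lines (the first of each pair)
oddOnly=$(awk 'NR % 2 == 1' "$linksFile")

# Option 2: keep the first occurrence of each distinct link, in order
uniqueLinks=$(awk '!seen[$0]++' "$linksFile")

echo "$oddOnly"
echo "$uniqueLinks"

rm -f "$linksFile"
```

Option 1 relies on the strict pairing, while option 2 still works if a link happens to appear once or three times.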
Next, a Perl-compatible regex with a positive lookbehind and a positive lookahead pulls the movie names out of the partialContentFile.txt file.

grep -P -o "(?<=>)([a-zA-Z0-9&: _-]+)(?=<\/a>[\(\) a-zA-Z0-9 _-]*\([0-9]+\))" "partialContentFile.txt" > "movieNames.txt" 

Using the same technique, the movie years are written to a separate text file:

grep -P -o "(?<=<\/a> )(\([0-9]+\))(?= )" "partialContentFile.txt" > "movieYears.txt" 
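
The two lookaround patterns are easier to follow on a single line of input. The sample line below is made up to resemble a search result row, and the sketch assumes GNU grep built with -P (PCRE) support; the two regular expressions are copied from the commands above.

```shell
#!/usr/bin/env bash

# Made-up result row: link text "Batman" followed by the year in parentheses
sampleLine='<td class="result_text"> <a href="/title/tt0096895/" >Batman</a> (1989) </td>'

# Name: text that follows a ">" and is followed by "</a>" and a "(year)"
movieName=$(printf '%s\n' "$sampleLine" |
    grep -P -o "(?<=>)([a-zA-Z0-9&: _-]+)(?=<\/a>[\(\) a-zA-Z0-9 _-]*\([0-9]+\))")

# Year: "(digits)" that follows "</a> " and is followed by a space
movieYear=$(printf '%s\n' "$sampleLine" |
    grep -P -o "(?<=<\/a> )(\([0-9]+\))(?= )")

echo "$movieName $movieYear"
```

The lookbehind and lookahead parts only constrain where a match may sit; with -o, grep prints just the part between them, so the tags themselves never end up in the output.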

First the script creates an empty file to store the movie names and years. Then it reads the two files in step, one line from each per iteration, and combines their content (movie names and years) into that single file.

> "movieNameYear.txt"

while read -r -u3 movieName; read -r -u4 movieYear;
do 
 echo "$movieName" "$movieYear" >> "movieNameYear.txt"
done 3<movieNames.txt 4<movieYears.txt
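
For comparison, the same line-by-line merge can be sketched with paste, which is not what the script uses. The file contents below are made up, and the files live in a temporary directory so nothing in the working folder is touched.

```shell
#!/usr/bin/env bash

# Made-up name and year files in a scratch directory
workDir=$(mktemp -d)
printf '%s\n' "Batman" "Batman Begins" > "$workDir/movieNames.txt"
printf '%s\n' "(1989)" "(2005)" > "$workDir/movieYears.txt"

# paste joins line N of each file, separated here by a single space
paste -d ' ' "$workDir/movieNames.txt" "$workDir/movieYears.txt" \
    > "$workDir/movieNameYear.txt"

combined=$(cat "$workDir/movieNameYear.txt")
echo "$combined"

rm -rf "$workDir"
```

One difference to keep in mind: the while loop stops as soon as the years file runs out, whereas paste keeps going and leaves the missing side empty when the two files have different lengths.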

Next it reads the movie name and year file line by line, replaces every space with an underscore character, and stores each result in an array called movieNameYear_array .

j=0

while read line
do
 repline=$line
 
 # Replace file name spaces with underscore
 fixedline=${repline// /_} 

 movieNameYear_array[j]=$fixedline
 #echo ${movieNameYear_array[j]}
 j=$(( j + 1 ))
done < "movieNameYear.txt"
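
A shorter way to build the same array, shown here only as a sketch, is mapfile followed by a pattern substitution applied to every element. This assumes bash 4 or newer (mapfile is not available in older versions) and uses made-up input.

```shell
#!/usr/bin/env bash

# Made-up combined file with one "name (year)" entry per line
namesFile=$(mktemp)
printf '%s\n' "Batman (1989)" "Batman Begins (2005)" > "$namesFile"

# Read every line into the array, dropping the trailing newlines (-t)
mapfile -t movieNameYear_array < "$namesFile"

# Replace all spaces with underscores in each element
movieNameYear_array=("${movieNameYear_array[@]// /_}")

# j ends up as the number of entries, like the counter in the loop above
j=${#movieNameYear_array[@]}

printf '%s\n' "${movieNameYear_array[@]}"
rm -f "$namesFile"
```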

Now it runs wget on each movie title link to download the page. Because every link appears twice, it only downloads on every other line of the links file (the lines where the zero-based counter i is even). Each movie title page is saved under the name and year stored in the array, and all of the pages go into a separate folder.

moviefoldername=movies
mkdir -p "$moviefoldername"

i=0
k=0

while read line
do
 temp=$(( $i % 2 )) 
 
 # Temporary fix when file name or file year was not extracted correctly 
 if [ $j -eq $k ]; then
  break
 fi
 
 if [ $temp -eq 0 ]; then
  
  # Each of the resultant files are downloaded here, Now read and perform rating extraction from it
  wget -O "$moviefoldername/${movieNameYear_array[k]}" "http://www.imdb.com$line"
  k=$(( k + 1 ))

 fi

 i=$(( i + 1 ))

done < $writeLinksFileName

It then runs ls on the movie titles directory to get each of the files, and uses grep with a Perl-compatible regex (positive lookbehind and lookahead) to get the movie rating from each one. The rating is shown in the console along with the movie title.

for fileName in `ls $moviefoldername/`
do
 #echo "$fileName"
 
 # Sample rating tag block
 #<span itemprop="ratingValue">6.4</span></strong>

 echo "Rating of: $fileName" 
 grep -P -o "(?<=<span itemprop=\"ratingValue\">)([0-9][.]?[0-9]?)(?=<\/span><\/strong>)" "$moviefoldername/$fileName" 
 echo "===================" 

done
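
The rating extraction can be tried offline as well. The sketch below writes one made-up page into a temporary folder, using the sample rating tag from the comment above, and loops over the folder with a glob instead of ls, which is a swapped-in alternative that also copes with file names containing spaces. It assumes GNU grep with -P support.

```shell
#!/usr/bin/env bash

# Made-up downloaded page in a scratch folder, named like the script's files
demoDir=$(mktemp -d)
printf '%s\n' '<strong><span itemprop="ratingValue">6.4</span></strong>' \
    > "$demoDir/Batman_(1989)"

for filePath in "$demoDir"/*; do
    # Skip anything that is not a regular file (also covers an empty folder)
    [ -f "$filePath" ] || continue

    # Strip the directory part to get the bare file name
    fileName=${filePath##*/}

    # Same regex as the script: digits between the ratingValue span tags
    rating=$(grep -P -o "(?<=<span itemprop=\"ratingValue\">)([0-9][.]?[0-9]?)(?=<\/span><\/strong>)" "$filePath")

    echo "Rating of: $fileName"
    echo "${rating:-not found}"
    echo "==================="
done

rm -rf "$demoDir"
```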


Code IMDB Movie Search and Rating Extraction:

