Movie Recommender System

2 July 2025 - 7 July 2025

1. Introduction: What is a Recommender System?

A recommender system is a system that suggests items to users based on their preferences and past behavior. This project focuses on building a content-based recommender system.

2. Types of Recommender Systems

Content-Based Filtering: Recommends items similar to those a user has liked in the past, based on item attributes.
Collaborative Filtering: Recommends items based on the preferences of similar users.
Hybrid Systems: Combine aspects of both content-based and collaborative filtering.

3. Project Flow

Our project follows a structured approach:

Data Acquisition: Obtain and prepare the necessary datasets.
Data Preprocessing: Clean and transform the data for model training.
Model Building: Develop the recommendation engine.
Website Development: Integrate the model into a user-friendly website.

4. Data Acquisition and Overview

We used the TMDB 5000 Movie Dataset from Kaggle, which consists of two datasets:

movies dataset: Contains information about movies.
credits dataset: Contains cast and crew details for movies.

The movies dataset has 4803 rows and 20 columns, while the credits dataset has 4803 rows and 4 columns.

5. Data Preprocessing

Our data preprocessing involved several steps to prepare the datasets for building a content-based recommender system:

5.1 Merging Datasets

First, we merged the movies and credits datasets on their shared title column, producing a single DataFrame that holds all columns from both.
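A minimal sketch of such a merge in plain Python, assuming each dataset has been loaded as a list of row dictionaries. The helper name merge_on_title and the sample rows are made up for illustration; the project itself may well rely on a DataFrame library's merge instead.

```python
def merge_on_title(movies, credits):
    """Inner-join two lists of row dicts on their shared 'title' key."""
    # Index the credits rows by title; one title may map to several rows.
    credits_by_title = {}
    for row in credits:
        credits_by_title.setdefault(row["title"], []).append(row)

    merged = []
    for movie in movies:
        for credit in credits_by_title.get(movie["title"], []):
            # Combine the columns of both rows into a single record.
            merged.append({**movie, **credit})
    return merged


# Hypothetical sample rows, not taken from the real dataset.
movies = [{"id": 1, "title": "Avatar", "overview": "A marine on Pandora."}]
credits = [{"movie_id": 1, "title": "Avatar", "cast": "[]", "crew": "[]"}]

merged = merge_on_title(movies, credits)
```

Because the join key is the title rather than a unique id, a title that occurs more than once in both inputs yields one merged row per matching pair, so the merged row count can exceed the row count of either input.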

5.2 Feature Selection

For a content-based recommender, we identified and retained only the most relevant columns, dropping those that did not contribute to content-based recommendations. The selected columns are: genres, id, keywords, title, overview, cast, and crew.

Columns dropped include: budget, homepage, original_language, original_title (dropped in favor of title, to avoid issues with foreign-language titles), popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, vote_average, and vote_count.

5.3 Handling Missing Values

We found three movies with missing overview information. Since the overview is crucial for content-based recommendations, these three rows were dropped. No duplicate rows were found in the dataset.
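The two checks described above can be sketched as follows, again assuming rows are plain dictionaries. The helper names and the three sample rows are hypothetical and only illustrate the idea.

```python
def drop_missing_overview(rows):
    """Keep only rows whose 'overview' is a non-empty string."""
    return [
        row for row in rows
        if isinstance(row.get("overview"), str) and row["overview"].strip()
    ]


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        # A sorted tuple of (column, value) pairs is hashable, unlike a dict.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)
    return duplicates


# Hypothetical sample rows; the second one has no overview.
rows = [
    {"title": "Movie A", "overview": "A heist goes wrong."},
    {"title": "Movie B", "overview": None},
    {"title": "Movie C", "overview": "Two friends travel."},
]

cleaned = drop_missing_overview(rows)
duplicate_count = count_duplicate_rows(cleaned)
```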

5.4 Data Transformation and Formatting

Several columns, specifically genres, keywords, cast, and crew, stored each value as a stringified list of dictionaries (e.g., '[{"id": 28, "name": "Action"}, ...]'). We applied custom functions to extract the relevant names (all genre and keyword names, the top 3 cast members, and the director's name, taken from the crew entry whose job is 'Director') and convert them into clean lists of strings.
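One way to write such extraction functions is sketched below. It assumes the strings are valid JSON, as in the example above, and that every entry has a "name" key; the function names and the sample strings are illustrative rather than the project's actual code.

```python
import json


def extract_names(raw, limit=None):
    """Return the 'name' of each entry in a JSON-encoded list of dicts."""
    names = [entry["name"] for entry in json.loads(raw)]
    # An optional limit keeps only the first few names (e.g. top 3 cast).
    return names if limit is None else names[:limit]


def extract_directors(raw_crew):
    """Return the names of crew members whose job is 'Director'."""
    return [
        member["name"]
        for member in json.loads(raw_crew)
        if member.get("job") == "Director"
    ]


# Illustrative strings in the same shape as the dataset columns.
genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
cast = (
    '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, '
    '{"name": "Sigourney Weaver"}, {"name": "Stephen Lang"}]'
)
crew = (
    '[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
    '{"job": "Director", "name": "James Cameron"}]'
)

genre_names = extract_names(genres)
top_cast = extract_names(cast, limit=3)
directors = extract_directors(crew)
```

Returning the director as a one-element list, rather than a bare string, keeps every column in the same list-of-strings shape.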

5.5 Text Cleaning

To ensure accurate recommendations, we performed the following text cleaning steps:

Converted the overview string column into a list of words to facilitate concatenation with other lists.
Removed spaces between words in the genres, keywords, cast, and crew columns (e.g., "Sam Worthington" became "SamWorthington"). This prevents the recommender from treating "Sam" and "Worthington" as separate tokens, which would otherwise make two movies look similar merely because unrelated people share a common first name.
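Both cleaning steps are short enough to sketch directly. The helper name collapse_spaces and the sample overview sentence are made up for illustration.

```python
def collapse_spaces(names):
    """Remove inner spaces so each multi-word name becomes one token."""
    return [name.replace(" ", "") for name in names]


# Splitting on whitespace turns the overview string into a list of words.
overview_words = "A paraplegic Marine is sent to Pandora".split()

# Multi-word names become single tokens.
cast = collapse_spaces(["Sam Worthington", "Zoe Saldana"])
```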

5.6 Creating the 'Tags' Feature

A new feature called tags was created by concatenating the processed overview, genres, keywords, cast, and crew lists for each movie. This tags column represents the comprehensive textual content of each movie. Finally, the list of words in the tags column was converted back into a single string.

After these preprocessing steps, our new DataFrame consists of movie_id, title, and tags columns.
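As a rough sketch, building the tags string for one movie amounts to a list concatenation followed by a join. The function name build_tags and the sample tokens are hypothetical.

```python
def build_tags(overview, genres, keywords, cast, crew):
    """Concatenate the five token lists and join them into one string."""
    tokens = overview + genres + keywords + cast + crew
    return " ".join(tokens)


# Hypothetical, already-cleaned token lists for a single movie.
tags = build_tags(
    overview=["A", "marine", "on", "Pandora"],
    genres=["Action", "ScienceFiction"],
    keywords=["spacewar"],
    cast=["SamWorthington"],
    crew=["JamesCameron"],
)
```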

6. Text Vectorization

6.1 What is Text Vectorization?

Text vectorization is the process of converting text into numerical form (vectors) that machine learning models can understand and analyze. Since machines do not directly understand words or sentences, this translation is essential for any NLP task, including sentiment analysis, spam detection, or text classification. In our case, it allows us to quantify the similarity between movie texts.

6.2 Applying Vectorization

To find the similarity between movies based on their textual content (tags), we need to convert these text tags into numerical vectors. We will use the Bag of Words (BoW) technique for this purpose.

Stemming: Before vectorization, stemming will be applied to the words in our tags feature. Stemming reduces words to a common root by stripping suffixes (e.g., "running" and "runs" both become "run"), which shrinks the vocabulary and lets different inflections of the same word count as a match when similarity is computed.
Vocabulary Selection: We will identify the 5000 most common words across all movie tags (after stemming). This vocabulary will form the dimensions of our vectors. The number of common words can be adjusted based on performance.
Vector Creation: For each movie, a vector will be created where each dimension corresponds to a word in our 5000-word vocabulary. The value in each dimension will represent the frequency of that word in the movie's tags. Stopwords (common words like "the," "a," "is") will not be considered during vectorization.

This process will result in 4806 vectors (one for each movie), each with 5000 numerical dimensions.
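The Bag of Words steps above can be illustrated with a small pure-Python sketch. It assumes the tags are already stemmed, uses a tiny made-up stopword set, and lowercases the text (an assumption, not something stated above); a library vectorizer would normally replace all of this in practice.

```python
from collections import Counter

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "is", "in", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


def build_vocabulary(documents, max_features=5000):
    """Return the max_features most frequent words, sorted alphabetically."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    # Sorting gives every vector the same, stable column order.
    return sorted(word for word, _ in counts.most_common(max_features))


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical tags for three movies.
tags = ["action hero saves the city", "hero falls in love", "love in the city"]
vocabulary = build_vocabulary(tags, max_features=5000)
vectors = [vectorize(text, vocabulary) for text in tags]
```

With only six distinct non-stopword tokens, the vocabulary here has six dimensions; on the full dataset the max_features cap is what limits it to 5000.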

6.3 Measuring Similarity

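In a high-dimensional space, Euclidean distance can be misleading: it is sensitive to vector magnitude (for example, how many words a movie's tags contain), and distances between points become less discriminative as the number of dimensions grows. Therefore, to measure the similarity between movie vectors, we will use Cosine Similarity, which measures the cosine of the angle between two vectors and ignores their lengths.

A smaller angle (closer to 0 degrees, cosine closer to 1) indicates higher similarity between the movies.
A larger angle (closer to 90 degrees, cosine closer to 0) indicates lower similarity.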

We will calculate the cosine similarity between every movie and every other movie, resulting in a 4806x4806 similarity matrix. This matrix will be used by our recommendation function.
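For two vectors u and v, cosine similarity is the dot product of u and v divided by the product of their lengths. A minimal pure-Python sketch of that formula and of the all-pairs matrix follows; the function names and the three toy vectors are illustrative, and treating an all-zero vector as having similarity 0 is a convention chosen here, not something taken from the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Compute the similarity of every vector with every other vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first two point in the same direction, the third is orthogonal.
vectors = [[1, 1, 0], [2, 2, 0], [0, 0, 3]]
matrix = similarity_matrix(vectors)
```

The first two toy vectors differ in length but not in direction, so their similarity is 1 (up to floating-point rounding), which is exactly the length-invariance that motivates cosine similarity over Euclidean distance.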

6.4 Recommendation Function

A function will be developed to recommend five movies based on a given movie. When a user selects a movie, this function will:

Find the index of the selected movie in the dataset.
Look up that movie's row in the pre-computed similarity matrix, which holds its cosine similarity to every movie.
Identify the five movies with the highest cosine similarity (shortest cosine distance) to the selected movie, excluding the movie itself.
Return these five recommended movies.
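A sketch of such a function is shown below, assuming the titles are held in a list whose order matches the rows of the similarity matrix. The name recommend, its parameters, and the four-movie sample matrix are made up for illustration.

```python
def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to the given title."""
    # Raises ValueError if the title is not in the dataset.
    index = titles.index(title)
    scores = similarity[index]

    # Rank all movie indices by descending similarity to the selected movie.
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)

    # Skip the selected movie itself, then keep the best top_n.
    picks = [i for i in ranked if i != index][:top_n]
    return [titles[i] for i in picks]


# Hypothetical titles and similarity scores.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

suggestions = recommend("Movie A", titles, similarity, top_n=2)
```

Excluding the selected movie by index, rather than by dropping the first ranked entry, stays correct even when another movie has an identical vector and ties at similarity 1.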

7. Website Development

The entire recommender system will be integrated into a user-friendly website using Streamlit. Additionally, movie posters will be fetched from the TMDB API to enhance the visual appeal of the recommendations on the website.
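For the poster lookup, a hedged sketch of the URL handling is given below. It assumes the TMDB v3 movie-details endpoint and the w500 image size, and the helper names are hypothetical; the actual HTTP request, the API key handling, and the Streamlit layout are left out.

```python
# Assumed image size; TMDB serves posters at several widths.
IMAGE_BASE_URL = "https://image.tmdb.org/t/p/w500"


def movie_details_url(movie_id, api_key):
    """Build the URL of the movie-details request for one movie id."""
    return f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"


def poster_url(details):
    """Turn a movie-details response (parsed JSON) into a full poster URL."""
    path = details.get("poster_path")
    # Some movies have no poster; return None so the caller can fall back.
    return IMAGE_BASE_URL + path if path else None


# Hypothetical response fragment; a real response has many more fields.
sample_details = {"id": 12345, "poster_path": "/example.jpg"}
sample_poster = poster_url(sample_details)
```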

8. Learning Outcomes

By completing this project, I gained hands-on experience and a deeper understanding of:

Data Acquisition & Integration: Learned how to work with real-world datasets from platforms like Kaggle and combine multiple sources (movies and credits data) into a unified structure.
Data Cleaning & Feature Engineering: Practiced handling missing values and transforming complex columns into usable formats to create meaningful features like tags.
Text Processing & Vectorization: Gained insight into the Bag of Words (BoW) model, stemming, and stopword removal to convert text into numerical vectors.
Similarity Measurement: Applied cosine similarity to evaluate how closely related two movies are based on their vectorized features.
Recommendation Systems: Built a custom content-based recommendation engine that suggests similar movies based on selected movie attributes.
User Interface Design (Streamlit): Designed a clean, interactive web interface that delivers movie recommendations along with poster visuals for a better user experience.
Project Structuring & Documentation: Strengthened my ability to organize machine learning projects effectively, write readable code, and document the workflow clearly.

9. Project Links

GitHub Repository - https://github.com/Aryanupadhyay23/Movie-Recommender-System-Project

LIVE LINK - https://huggingface.co/spaces/Aryan2301/Movie-Recommender-System

© 2025 Aryan Upadhyay