Real Estate Capstone Project

10 July 2025 - 6 August 2025

ABOUT THE PROJECT

Making Property Search Smarter, Faster, and More Interactive

Most real estate websites provide static listings, leaving users without the deeper insights needed to make confident decisions. This project was built to change that — transforming property exploration into an interactive, data-driven, and personalized experience.

The platform features a simple sidebar navigation and brings together three core modules designed to empower buyers, sellers, and investors:

Price Prediction Module

A property valuation tool where users can enter details such as:

Property type (flat/house)
Sector
Bedrooms, bathrooms, balconies
Property age, built-up area, furnishing type, and more

The system predicts an estimated price range instantly using a trained machine learning pipeline, helping users assess market value before making decisions.

Analytics Module

A complete market analysis dashboard with 10 interactive visualizations, giving users a clear view of trends and patterns, including:

Spatial Analysis – Map-based view of prices across sectors.
Price Distribution Across Sectors – Compare average prices by area.
Price vs. Square Foot Analysis – Explore pricing trends by property size.
BHK Distribution Pie Charts – See how many rooms properties have in each area.
Feature Word Cloud – Discover the most common property features.
Price Comparisons by BHK – Box plots for side-by-side comparisons.
Price Distribution by Property Type – Distplots and KDE charts for houses vs. flats.
Heatmaps – Average price patterns by sector and BHK.
Furnishing Type Breakdown – Pie charts for furnishing categories.
Average Price per Sqft by Sector – Horizontal bar charts ranking sectors.

With these visual insights, users can quickly understand market behavior and spot investment opportunities.

Society Recommender System

A recommendation engine that allows users to:

Select a society in their preferred location
Adjust filters such as search radius and importance of amenities, features, and proximity
Receive a list of similar societies with distance information

This feature helps users discover comparable properties they might have overlooked, making exploration smarter and more efficient.

Project Links

GitHub -> https://github.com/Aryanupadhyay23/Real-Estate-Capstone-Project

Live Link -> https://huggingface.co/spaces/Aryan2301/Real_Estate_Capstone_Project

Why It Matters

The real estate market is complex, and finding the right property can be overwhelming. By combining advanced analytics, predictive modeling, and personalized recommendations, this project helps users:

Make Smarter Decisions – Back choices with data, not guesswork.
Save Time – Quickly find properties that match needs.
Gain Deeper Insights – See market trends in a visual, easy-to-understand way.
Enjoy Personalization – Get recommendations tailored to preferences.

This is more than a property search tool — it’s a step toward redefining the digital real estate experience, turning raw data into actionable insights for confident decision-making.

Project Roadmap

Our project development follows a structured roadmap with the following key phases:

Project Planning & Roadmap Definition – Defining the project scope, objectives, and development timeline.
Data Gathering – Collecting relevant real estate datasets from multiple sources.
Data Preprocessing – Cleaning, transforming, and preparing raw data for analysis and model training.
Exploratory Data Analysis (EDA) – Analyzing datasets to uncover trends, relationships, and patterns.
Feature Engineering – Creating new features from existing data to improve model performance.
Price Prediction Module Development – Building and integrating the machine learning model to provide accurate property price predictions.
Analytics Module Development – Implementing spatial analysis, price distribution, price vs. square foot analysis, number of rooms pie charts, and top feature word clouds.
Recommender System Development – Creating a personalized society recommendation engine based on user-selected preferences and location.
Deployment – Launching the application with a user-friendly interface for public access.

Data Gathering

The dataset for this project was obtained through web scraping from the real estate portal 99acres, focusing on property listings in Gurgaon. The scraping process extracted structured information from individual listing pages and compiled it into three distinct datasets, each representing a different property category:

Flats Dataset – Contains details of approximately 3,018 residential flats, including essential property specifications, pricing information, location details, and available amenities.
Independent Houses Dataset – Comprises around 1,045 listings of independent houses, covering essential property characteristics, location, pricing, and other relevant details.
Apartment Societies Dataset – Includes about 248 apartment societies, containing society-level information such as average property prices, amenities, and ratings.

These datasets serve as the foundation for subsequent data preprocessing, exploratory data analysis (EDA), and predictive modeling.

Data Preprocessing

The preprocessing stage was executed in two levels: Level 1 – Dataset-specific cleaning and merging, and Level 2 – Feature engineering and refinement.

Level 1 – Dataset Cleaning & Merging

Initially, the Flats and Independent Houses datasets were cleaned separately before being merged into a single dataset.

Flats Dataset:
- Checked for duplicates (none found) and handled missing values appropriately.
- Removed irrelevant columns such as property links and unique IDs.
- Standardized textual fields (e.g., society names to lowercase, removal of symbols).
- Converted price values into a consistent unit (crores) and removed listings with “price on request”.
- Cleaned and standardized price per square foot values into numeric format.
- Standardized numeric property details such as bedrooms, bathrooms, and balconies.
- Normalized additional room details and handled missing values.
- Extracted floor numbers from mixed text formats, handling special cases like “ground” and “basement”.
- Filled missing categorical values (e.g., facing direction) with placeholders.
- Derived a new area feature using price and price per square foot.
- Added a property type label (“flat”).
Independent Houses Dataset:
- Removed duplicate records and dropped irrelevant columns.
- Standardized column naming and formatting to align with the flats dataset.
- Applied the same cleaning process for price, price per square foot, bedrooms, bathrooms, balconies, additional rooms, floor details, and facing direction.
- Derived the area feature and added a property type label (“house”).
Merging:
- Combined the cleaned flats and independent houses datasets.
- The resulting dataset contained 3,961 records and 20 features.
- Data was randomly shuffled to avoid ordering bias.
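
The price standardization and derived-area steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the raw string formats ("1.25 Crore", "85 Lac", "Price on Request") and the function names price_to_crores and derive_area_sqft are assumptions made for the example.

```python
import re

LAKHS_PER_CRORE = 100  # 1 crore = 100 lakh


def price_to_crores(raw):
    """Convert a raw price string to crores; return None when no price is listed."""
    text = raw.strip().lower()
    match = re.search(r"(\d+(?:\.\d+)?)\s*(crore|cr|lakh|lac)", text)
    if match is None:  # e.g. "price on request" (assumed wording)
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value if unit in ("crore", "cr") else value / LAKHS_PER_CRORE


def derive_area_sqft(price_crores, price_per_sqft):
    """Area implied by the total price (crores) and the price per sq. ft. (rupees)."""
    return price_crores * 1e7 / price_per_sqft  # 1 crore = 10,000,000 rupees
```

Listings for which price_to_crores returns None correspond to the "price on request" rows that were removed.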

Level 2 – Feature Engineering & Refinement

Created a new feature, sector, extracted from property names and standardized to lowercase.
Manually filled missing or incomplete sector information using Google location references.
Removed localities (sectors) with fewer than three listings to ensure statistical relevance.
Dropped irrelevant columns such as property name, address, description, and rating to reduce noise and improve efficiency.
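
A minimal pandas sketch of the rare-sector filter described above. The function name drop_rare_sectors and the toy data are hypothetical; only the "fewer than three listings" threshold comes from the project.

```python
import pandas as pd


def drop_rare_sectors(df, min_listings=3):
    """Keep only rows whose sector appears at least `min_listings` times."""
    counts = df["sector"].value_counts()
    frequent = counts[counts >= min_listings].index
    return df[df["sector"].isin(frequent)].reset_index(drop=True)


# Toy example: only "sector 1" has three or more listings.
listings = pd.DataFrame(
    {"sector": ["sector 1"] * 3 + ["sector 2"] * 2 + ["sector 3"]}
)
filtered = drop_rare_sectors(listings)
```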

Feature Engineering

To enhance model performance and improve the quality of insights, multiple new features were engineered from the raw dataset.

1. Area Features

Properties listed three possible area types: super built-up area, built-up area, and carpet area.
Independent houses also included plot area.
Extracted values from areaWithType into three separate columns: super_built_up_area, built_up_area, and carpet_area.
Converted all values into square feet for consistency.
For houses, plot area was mapped to built_up_area.
If a property had missing types (e.g., only built-up area available), the other columns remained blank.
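
The area extraction can be sketched with a regular expression per area type. The exact text layout of areaWithType is not shown here, so the format handled below ("Super Built up area 1800 sq.ft. Carpet area: 1200 sq.ft."), the unit spellings, and the helper name extract_areas are all assumptions. The sketch maps plot area to built_up_area only when no built-up area is listed, which is one possible reading of the rule above.

```python
import re

# Conversion factors to square feet (standard unit conversions).
SQFT_PER_UNIT = {"sq.ft.": 1.0, "sq.yd.": 9.0, "sq.m.": 10.7639}

# Label patterns are assumed spellings, not confirmed portal text.
AREA_LABELS = {
    "super_built_up_area": r"super built[\s-]*up area",
    "built_up_area": r"(?<!super )built[\s-]*up area",
    "carpet_area": r"carpet area",
    "plot_area": r"plot area",
}
VALUE = r"\s*:?\s*(\d+(?:\.\d+)?)\s*(sq\.ft\.|sq\.yd\.|sq\.m\.)"


def extract_areas(area_with_type):
    """Return each area type in square feet, or None when it is not listed."""
    text = area_with_type.lower()
    areas = {}
    for column, label in AREA_LABELS.items():
        match = re.search(label + VALUE, text)
        areas[column] = (
            float(match.group(1)) * SQFT_PER_UNIT[match.group(2)] if match else None
        )
    # Houses: plot area stands in for built-up area when none is given.
    plot_area = areas.pop("plot_area")
    if areas["built_up_area"] is None:
        areas["built_up_area"] = plot_area
    return areas
```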

2. Additional Room Features

Identified five types of additional rooms: servant room, pooja room, store room, study room, and others.
Created binary indicator columns for each room type (1 = present, 0 = not present).
Properties without additional rooms had all columns set to 0.
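
As a small illustration of the binary indicator columns, the sketch below assumes the raw additionalRoom value is a comma-separated string (an assumption about the raw format); room_indicators is a hypothetical helper name.

```python
ROOM_TYPES = ["servant room", "pooja room", "store room", "study room", "others"]


def room_indicators(additional_room):
    """Map a raw additional-room string to 0/1 flags, one per room type."""
    present = set()
    if additional_room:  # None or "" means no additional rooms
        present = {part.strip().lower() for part in additional_room.split(",")}
    return {room.replace(" ", "_"): int(room in present) for room in ROOM_TYPES}
```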

3. Age & Possession Categorization

Converted agePossession values into categorical labels:
- New Property – 0–1 years old or possession within 6 months.
- Relatively New – 1–5 years old.
- Moderately Old – 5–10 years old.
- Old Property – 10+ years old.
- Under Construction – Future possession dates or explicitly marked under construction.
- Undefined – Missing or unclassified cases.
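
The mapping above could be implemented along these lines. The raw strings matched here ("0 to 1 Year Old", "Within 6 months", "Dec 2025", and so on) are assumed examples of what agePossession might contain, and the sketch simplifies by treating any value that contains a year as a future possession date.

```python
import re


def categorize_age_possession(value):
    """Map a raw agePossession value to one of the six categories."""
    if value is None:
        return "Undefined"
    text = value.strip().lower()
    if "0 to 1 year" in text or "within 6 months" in text:
        return "New Property"
    if "1 to 5 year" in text:
        return "Relatively New"
    if "5 to 10 year" in text:
        return "Moderately Old"
    if "10+ year" in text:
        return "Old Property"
    # Simplification: any four-digit year is read as a possession date.
    if "under construction" in text or re.search(r"\b(19|20)\d{2}\b", text):
        return "Under Construction"
    return "Undefined"
```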

4. Furnishing Type

Original furnishDetails contained 18 unique furnishing items.
Instead of creating 18 binary columns, grouped into three categories:
Furnished, Semi-Furnished, and Unfurnished.
Used K-Means clustering on the furnishing items to assign these categories:
- Cluster 0 → Unfurnished
- Cluster 1 → Semi-Furnished
- Cluster 2 → Furnished
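
A hedged sketch of the clustering step, assuming each property is represented by a row of furnishing item counts. K-Means cluster IDs are arbitrary, so instead of hard-coding 0/1/2 this sketch ranks clusters by the total item count of their centroids; that ranking, the toy data, and the name assign_furnishing_types are assumptions rather than the project's method.

```python
import numpy as np
from sklearn.cluster import KMeans

CATEGORIES = ["Unfurnished", "Semi-Furnished", "Furnished"]


def assign_furnishing_types(item_counts, random_state=0):
    """Cluster properties by furnishing item counts and name the clusters."""
    X = np.asarray(item_counts, dtype=float)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X)
    # Order clusters from fewest to most items so the names are stable.
    order = np.argsort(kmeans.cluster_centers_.sum(axis=1))
    cluster_to_name = {int(c): CATEGORIES[rank] for rank, c in enumerate(order)}
    return [cluster_to_name[int(label)] for label in kmeans.labels_]


# Toy data: columns could be counts of fans, ACs, and wardrobes (hypothetical).
counts = [[0, 0, 0], [0, 0, 0], [2, 1, 0], [3, 1, 1], [8, 4, 5], [9, 5, 6]]
furnishing_types = assign_furnishing_types(counts)
```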

5. Luxury Score

The features column listed various amenities (~130 unique items) but had 635 missing values.
Missing values were partially filled using data from the apartment dataset (TopFacilities), matching on property/society name.
After this step, 481 entries were still missing. Instead of clustering the amenities, which proved unreliable, a numerical “Luxury Score” was created:
- Assigned weights to each amenity (e.g., swimming pool = high score, lift = low score).
- Summed weights for each property to generate a continuous luxury score.
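
The scoring idea reduces to a weighted sum. The weights below are invented for illustration only (the actual weight table is not reproduced here), and luxury_score is a hypothetical helper name.

```python
# Hypothetical weights: higher means a more premium amenity.
AMENITY_WEIGHTS = {
    "swimming pool": 8,
    "club house": 6,
    "security personnel": 4,
    "lift": 2,
}


def luxury_score(features, weights=AMENITY_WEIGHTS, default_weight=0):
    """Sum the weights of a property's amenities; unknown amenities add the default."""
    return sum(weights.get(item.strip().lower(), default_weight) for item in features)
```

For example, a property listing a swimming pool and a lift would score 8 + 2 = 10 under these illustrative weights.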

6. Dropped Irrelevant Columns

Removed columns no longer needed after feature engineering:

nearbyLocations, furnishDetails, features, features_list, additionalRoom.

Outcome: The dataset now includes well-structured, machine-learning-ready features, improving both predictive accuracy and insight generation for analytics.

EDA (Exploratory Data Analysis)

Before diving into modeling, we performed a thorough Exploratory Data Analysis to understand the underlying patterns, anomalies, and characteristics of the dataset. The initial step involved data cleaning, where we identified and removed all duplicate rows to ensure the integrity of our analysis.

1. Univariate Analysis

We began our exploration with univariate analysis, focusing on individual features to understand their distribution.

property_type -

To understand the composition of property listings, we analyzed the property_type column. We used a bar chart to visualize the count of each property type.

Observations:

The bar chart reveals a significant imbalance in the types of properties listed:

Flats: This category dominates the dataset, accounting for approximately 75% of all listings.
Houses: Independent houses make up the remaining 25% of the listings.

Insight: This 3:1 ratio of flats to houses is a critical finding. It indicates that our dataset is heavily skewed towards flats. This imbalance must be considered during the modeling phase, as it could lead to a model that predicts prices more accurately for flats than for houses.

society -

The society column is a complex feature with high cardinality, containing 676 unique values: the label 'independent', which covers 486 properties, plus 675 different apartment societies across which the remaining listings are spread.

The distribution of listings within these societies is highly skewed:

Concentrated Data: A small number of societies dominate the dataset. The top 75 societies (just 11% of the total) contain 50% of all apartment listings.
Long Tail: In contrast, the remaining 50% of listings are scattered across over 600 other societies. Notably, 308 of these societies (almost half) have only a single listing each.

To identify the most dominant players, we visualized the top 10 societies by listing count.

Observations: The chart shows that societies like "Tulip Violet" and "SS The Leaf" are the most significant contributors to the dataset.

Insight: The high cardinality and severe skew of the society feature make it unsuitable for direct use in machine learning models. Feature engineering, such as grouping societies into tiers based on their size or frequency, will be essential to leverage the information contained within this variable.

sector -

The sector column provides crucial geographical context for the property listings. Our analysis shows there are 104 unique sectors in the dataset.

The distribution of properties across these sectors is more balanced compared to the society feature:

A majority of the sectors (60 out of 104) have a moderate number of listings, ranging from 10 to 49 properties each.
A significant number of sectors (25) have a high volume of listings (50-100 properties).
Notably, every sector has at least two listings in the dataset, meaning there isn't a long tail of single-listing sectors, which makes this feature more robust.

To identify the key real estate hotspots, we plotted the top 10 sectors by property count.

Observations: The bar chart clearly indicates that "Sohna Road" is the most prominent location, with substantially more listings than any other area. Other active real estate hubs include Sector 85, Sector 102, and Sector 92.

Insight: The sector feature is a strong and reliable indicator of location. Its relatively balanced distribution makes it a powerful feature for our model, likely capturing significant price variations across different geographical areas. It can be used more directly for modeling than the highly-skewed society feature.

2. Multivariate Analysis

We then moved to multivariate analysis to explore the relationships between different features. First, we investigated how the property_type impacts the price.

Property Type vs. Price

To compare the prices of flats and houses, we created a bar chart showing the average price for each category and a box plot to visualize their price distributions.

The bar chart gives a clear picture of the average prices:

On average, houses are significantly more expensive than flats in this dataset. The average price for a house is approximately ₹4 crore.
The average price for a flat is much lower, at around ₹1.5 crore.

While the bar chart shows averages, the box plot reveals the spread and variability in prices for each category.

The box plot confirms that houses are more expensive. More importantly, it shows that houses have a much wider price range and greater variability. The prices for flats are clustered in a much tighter, lower range.
Both categories have several high-value outliers, but the outliers for houses reach significantly higher prices (upwards of ₹30 crore) compared to flats.

Insight: The analysis clearly establishes that property type is a powerful predictor of price. Houses are not only more expensive on average, but their prices are also more spread out. This feature will be essential for building an accurate price prediction model.

Property Type vs. Built-up Area

Next, we analyzed the relationship between the property_type and the built_up_area to see how the physical size of properties differs.

First, let's look at the average built-up area for flats and houses.

The bar chart shows that on average, houses have a slightly larger built-up area than flats, but the difference in their average sizes is not very large.

However, the box plot below gives a more detailed story about the variability of property sizes.

This chart reveals that while the median areas are somewhat close, houses have a significantly wider distribution of sizes.
The range of built_up_area for houses is much larger, with some properties reaching over 12,000 sq. ft., indicating high variability. Flats, in contrast, are generally clustered in a more consistent and smaller size range.

Insight: Previously, we saw that houses are, on average, roughly 2.5 times as expensive as flats. However, this analysis shows they are only slightly larger in terms of average built-up area. This is a crucial finding. It suggests that the higher price of houses is not just due to more indoor space. Factors like the value of the land the house is built on, the number of floors, privacy, and plot size likely play a much larger role in determining the final price of a house compared to a flat.

Outlier Treatment

Price Column -

The price variable showed a right-skewed distribution, with most properties priced at the lower end and a few very high values. A boxplot confirmed the presence of outliers, with about 425 records lying beyond the IQR-based whiskers. These outliers ranged from 5.46 crores to 31.50 crores (mean: 9.23 crores).

Some represent genuine premium properties, while others may be due to data entry errors. At this stage, the outliers are retained, with final handling to be decided after analyzing other features to avoid prematurely removing valuable data.
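
For reference, boxplot-style outlier detection can be sketched as below. It assumes the conventional 1.5 × IQR fences; the multiplier is the usual boxplot default, not a value stated in this report, and the toy prices are made up.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy prices in crores: only the last value lies far beyond the upper fence.
prices = [0.5, 0.9, 1.0, 1.2, 1.5, 1.8, 2.0, 31.5]
outliers = iqr_outlier_mask(prices)
```

Flagging rather than dropping, as done here, matches the decision to retain price outliers until the other features have been reviewed.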

price_per_sqft -

The initial distribution plot of price_per_sqft showed a heavily right-skewed pattern with numerous extreme outliers. The boxplot confirmed 354 points outside the IQR-based whiskers. On inspection, we found a unit inconsistency: for listings with an area below 1,000, the area was recorded in square yards instead of square feet. We corrected this by multiplying those areas by 9 (1 sq. yd. = 9 sq. ft.) and recalculating price_per_sqft accordingly.

After this fix, the distribution became more evenly spread and realistic. We then removed 13 outliers where price_per_sqft exceeded 50,000, resulting in a cleaner boxplot with all remaining values at or below 50,000.
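
A sketch of the unit correction and the 50,000 cap. Which rows count as "suspect" is an assumption here (rows already flagged as price_per_sqft outliers), since converting every small property would be wrong; the column names follow the report, while the function name and toy data are hypothetical.

```python
import pandas as pd

PRICE_PER_SQFT_CAP = 50_000
SQFT_PER_SQYD = 9  # 1 square yard = 9 square feet


def fix_units_and_cap(df, suspect_rows):
    """Convert suspect small areas from sq. yd. to sq. ft., then apply the cap."""
    df = df.copy()
    rows = suspect_rows & (df["area"] < 1000)
    df.loc[rows, "area"] = df.loc[rows, "area"] * SQFT_PER_SQYD
    # price is in crores, so convert to rupees before dividing by area.
    df.loc[rows, "price_per_sqft"] = df.loc[rows, "price"] * 1e7 / df.loc[rows, "area"]
    return df[df["price_per_sqft"] <= PRICE_PER_SQFT_CAP].reset_index(drop=True)


listings = pd.DataFrame({
    "price": [9.0, 1.0, 12.0],                     # crores
    "area": [300.0, 900.0, 1500.0],                # first row is really sq. yd.
    "price_per_sqft": [300000.0, 11111.0, 80000.0],
})
suspects = listings["price_per_sqft"] > PRICE_PER_SQFT_CAP
cleaned = fix_units_and_cap(listings, suspects)
```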

Area -

Initial Findings:
The area column was highly right-skewed with extreme outliers (up to 875,000 sqft). Mean (~2,971 sqft) was far from the median (~1,759 sqft), making it misleading.
Step 1 – Removing > 100,000 sqft:
Eliminated unrealistic values, reducing the extreme tail.
Step 2 – Fixing > 10,000 sqft:
Dropped incorrect entries and corrected a few based on related features.
Final Stats:
Mean: 1,946 sqft | Median: 1,746 sqft | Max: 11,000 sqft | Std Dev dropped from 23,208 to 1,197.
Impact:
Distribution is now more realistic, skewness reduced, and models will no longer be biased toward rare huge properties.

Cleaning extreme area outliers turned an unrealistic dataset into a reliable foundation for modeling.

Missing Value Imputation

Missing values were handled through a combination of statistical relationships, domain knowledge, and iterative filling methods to ensure data completeness without compromising accuracy.

1. Built-Up Area

Found strong linear relationships:
- built_up_area ↔ super_built_up_area (super_built_up_area ≈ 1.105 × built_up_area)
- built_up_area ↔ carpet_area (carpet_area ≈ 0.9 × built_up_area)
No property had all three area values missing.
Filled missing built_up_area values using:
1. If super_built_up_area & carpet_area available: average of (super_built_up_area / 1.105) and (carpet_area / 0.9)
2. If only super_built_up_area available: (super_built_up_area / 1.105)
3. If only carpet_area available: (carpet_area / 0.9)
Anomaly Correction:
- Identified properties with built_up_area < 2000 sqft but price > 2.5 crore.
- In these cases, built_up_area did not match the actual calculated area (price / price_per_sqft).
- Corrected values by replacing built_up_area with the accurate area value.
Dropped redundant columns: area, areaWithType, super_built_up_area, carpet_area, area_room_ratio.
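
The three fill rules for built_up_area can be expressed compactly with a row-wise mean that skips missing values. The ratios 1.105 and 0.9 come from the analysis above; the function name and the use of a NaN-skipping mean are choices made for this sketch.

```python
import pandas as pd

SUPER_TO_BUILT_UP = 1.105   # super_built_up_area / built_up_area
CARPET_TO_BUILT_UP = 0.9    # carpet_area / built_up_area


def impute_built_up_area(df):
    """Fill missing built_up_area from super built-up and/or carpet area."""
    df = df.copy()
    from_super = df["super_built_up_area"] / SUPER_TO_BUILT_UP
    from_carpet = df["carpet_area"] / CARPET_TO_BUILT_UP
    # mean(axis=1) ignores NaN: both present gives the average,
    # one present gives that single estimate.
    estimate = pd.concat([from_super, from_carpet], axis=1).mean(axis=1)
    df["built_up_area"] = df["built_up_area"].fillna(estimate)
    return df
```

Rows that already have a built_up_area are left unchanged, so the estimate only affects the missing entries.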

2. Floor Number (floorNum)

Missing Values: 17 total, mostly in houses.
Median floor value for houses = 2.0.
Filled all missing values with 2.

3. Facing Direction (facing)

Missing Values: 1010 (~28% of dataset).
Due to high missing rate, this column was dropped.

4. Age & Possession (agePossession)

Undefined Values: 290 treated as missing.
Filled iteratively:
1. Matched with properties having same property_type and sector.
2. If still missing, filled using only sector.
3. Remaining filled using only property_type.
Result: 0 missing values in agePossession.
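
The iterative fill can be sketched as three passes that go from the most specific grouping to the least specific one. The report does not state which value is taken from the matching properties, so using the most frequent category (the mode) is an assumption, as are the helper names.

```python
import pandas as pd


def fill_by_group_mode(df, column, group_cols):
    """Fill missing values in `column` with the most frequent value of each group."""
    known = df.dropna(subset=[column])
    modes = known.groupby(group_cols)[column].agg(lambda s: s.mode().iloc[0])
    modes = modes.rename("_mode").reset_index()
    # A left merge keeps the original row order, so values align positionally.
    merged = df[group_cols].merge(modes, on=group_cols, how="left")
    fallback = pd.Series(merged["_mode"].to_numpy(), index=df.index)
    return df[column].fillna(fallback)


def fill_age_possession(df):
    """Treat 'Undefined' as missing, then fill in three passes."""
    df = df.copy()
    df["agePossession"] = df["agePossession"].mask(df["agePossession"] == "Undefined")
    for group_cols in (["property_type", "sector"], ["sector"], ["property_type"]):
        df["agePossession"] = fill_by_group_mode(df, "agePossession", group_cols)
    return df
```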

Outcome: All essential numerical and categorical features are now complete and consistent, with anomalies corrected and unnecessary columns removed.

Feature Selection & Feature Engineering

To ensure the predictive model used only relevant and user-provided information, a combination of domain-driven feature engineering and data-driven feature selection techniques was applied.

1. Domain-Driven Feature Decisions

Removed society: Not included in the model since users will select a sector, not a specific society.
Removed price_per_sqft: Not a user-input variable, hence excluded from prediction.
Luxury Score to Category:
- Converted numerical luxury_score into categorical luxury_category:
  - 0–50 → Low luxury
  - 50–150 → Medium luxury
  - 150–175 → High luxury
- Dropped luxury_score.
Floor Number to Category:
- Converted floorNum into floor_category:
  - 0–2 → Low rise
  - 3–10 → Medium rise
  - 11–51 → High rise
- Dropped floorNum.
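
The two binning rules can be written as small functions. The ranges above share their boundary values (50 and 150), so this sketch assumes each boundary belongs to the higher category; that tie-breaking choice, the short label strings, and the function names are assumptions.

```python
def categorize_luxury(score):
    """Bin a numeric luxury score into Low / Medium / High."""
    if score < 50:
        return "Low"
    if score < 150:
        return "Medium"
    return "High"


def categorize_floor(floor_num):
    """Bin a floor number into Low rise / Medium rise / High rise."""
    if floor_num <= 2:
        return "Low rise"
    if floor_num <= 10:
        return "Medium rise"
    return "High rise"
```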

2. Data Preparation for Feature Selection

Applied Ordinal Encoding to categorical features so feature selection techniques could operate on numerical values.
Split dataset into X (independent features) and y (target: price).

3. Feature Selection Techniques Applied

Eight different techniques were used to measure feature importance:

Correlation Analysis – Identified features with strong relationships to price.
Random Forest Feature Importance
Gradient Boosting Feature Importance
Permutation Importance
Lasso Regression – Regularization-based selection.
Recursive Feature Elimination (RFE)
Linear Regression Weights
SHAP Values – Model-agnostic interpretability method.

4. Aggregating Results

Collected feature importance scores from all eight methods.
Normalized scores and computed the mean importance for each feature.
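
One way to combine scores from methods with different scales is sketched below. Normalizing each method by the sum of its absolute scores is an assumption (the exact normalization is not specified above), and the two-method table is toy data rather than project results.

```python
import pandas as pd


def aggregate_importances(scores):
    """Average per-method importances after scaling each method to sum to 1.

    `scores` has one row per feature and one column per selection method.
    """
    magnitudes = scores.abs()                      # coefficients may be negative
    normalized = magnitudes / magnitudes.sum(axis=0)
    return normalized.mean(axis=1).sort_values(ascending=False)


# Toy scores from two hypothetical methods.
scores = pd.DataFrame(
    {"random_forest": [0.6, 0.3, 0.1], "lasso": [-2.0, 1.0, 1.0]},
    index=["built_up_area", "sector", "study_room"],
)
ranking = aggregate_importances(scores)
```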

5. Dropping Low-Impact Features

study_room, pooja_room, and others consistently scored low across all methods.
Verified impact by:
- Training a Random Forest with all features → R² = 0.8193
- Removing the three features and retraining → R² = 0.8196
Since there was no significant change, these features were dropped.

Outcome: Final post–feature-selection dataset contains only highly relevant features, improving model interpretability and reducing unnecessary complexity.

Model Selection and Building

This project focuses on predicting real estate prices, providing market analytics, and recommending similar societies based on location, pricing, and facilities. Using the Post Feature Selection V2 dataset, we implemented advanced preprocessing, experimented with multiple encoding strategies, and built a Streamlit web application integrating all three modules for an interactive end-user experience.

Tech Stack:

Python, Pandas, NumPy, Scikit-learn, Streamlit, Matplotlib, Seaborn, Plotly, GeoPandas, TF-IDF Vectorizer, Cosine Similarity.

Key Results:

Price Prediction Model → R² = 0.903, MAE = ₹44 Lakh.
±22% variation range for predicted prices.
10 interactive analytics visualizations covering location, pricing, property type, and features.
Society Recommender enabling personalized recommendations with adjustable weights for location, price, and facilities.

Module 1 – Price Predictor

Used the Post Feature Selection V2 dataset containing property attributes such as property type, sector, bedrooms, bathrooms, balconies, property age, built-up area, servant/store room, furnishing type, luxury category, and floor category.
Applied log1p transformation on price to reduce skewness, reversing it during prediction with np.expm1.
Built a preprocessing–modeling pipeline:
- StandardScaler → Numerical columns (bedRoom, bathroom, built_up_area, servant room, store room).
- OrdinalEncoder → General categorical columns (property_type, balcony, furnishing_type, luxury_category, floor_category).
- OneHotEncoder (drop='first', sparse_output=False) → sector, agePossession.
Final model → Random Forest Regressor (n_estimators = 500) giving:
- R² = 0.903
- MAE = ₹44 Lakh
Integrated into a Streamlit app where users can input property details and get a predicted price range (±22% variation).
Note: Target encoding and PCA were tested during experimentation but not used in the final deployed pipeline — One-Hot + Ordinal Encoding mix was chosen for robustness in production.

Module 2 – Analytics

Developed an interactive Streamlit dashboard with 10 visualizations for market insights:
1. Sector-Level GeoMap – Price per sqft & built-up area by location.
2. Features WordCloud – Most frequent listing features.
3. Area vs Price Scatter Plot – Price trends by property size & BHK.
4. Bedroom Distribution Pie Chart – Share of properties by BHK.
5. Price Distribution by BHK (Box Plot) – Spread & outliers.
6. Price Distribution Comparison (Distplot) – Houses vs flats.
7. KDE Plot – Smooth price density for property types.
8. Heatmap – Average price by sector & BHK.
9. Furnishing Type Distribution – Share of furnishing categories.
10. Avg Price per Sqft by Sector (Bar Chart) – Sector rankings by price per sqft.

Module 3 – Society Recommender

Built a content-based recommender system for finding similar societies.
Considered three aspects for recommendation: Location Advantages, Price Details, and Top Facilities.
Created three separate recommenders using TF-IDF vectorization and cosine similarity for each aspect.
Combined recommendations either equally or with user-defined weights for flexibility.
Integrated into a Streamlit app where users can adjust importance sliders for location, price, and facilities to get tailored suggestions.
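
A minimal sketch of the weighted combination, assuming each aspect is available as one text string per society. The society names, texts, and function names below are placeholders for illustration, not data or code from the project.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def text_similarity(documents):
    """Pairwise cosine similarity between TF-IDF vectors of the documents."""
    tfidf = TfidfVectorizer().fit_transform(documents)
    return cosine_similarity(tfidf)


def recommend(society, names, similarity_by_aspect, weights, top_n=3):
    """Rank other societies by the weighted average of per-aspect similarities."""
    total = sum(weights.values())
    combined = sum(weights[a] * similarity_by_aspect[a] for a in weights) / total
    idx = names.index(society)
    ranked = np.argsort(-combined[idx])  # highest similarity first
    return [names[i] for i in ranked if i != idx][:top_n]


# Placeholder societies and descriptions.
names = ["Society A", "Society B", "Society C"]
similarity_by_aspect = {
    "location": text_similarity(
        ["metro station school", "metro station hospital", "highway airport"]),
    "price": text_similarity(
        ["bhk premium range", "bhk premium range", "bhk budget range"]),
    "facilities": text_similarity(
        ["swimming pool gym club house", "swimming pool gym park", "lift parking"]),
}
weights = {"location": 1.0, "price": 1.0, "facilities": 1.0}  # slider values
recommendations = recommend("Society A", names, similarity_by_aspect, weights, top_n=2)
```

Equal weights reproduce the "combined equally" option; changing a weight corresponds to moving one of the importance sliders.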

Learning Outcomes

Through the development of the Real Estate Capstone Project, the following outcomes were achieved:

End-to-End Real Estate Data Pipeline Development
- Successfully designed and implemented a complete workflow covering data collection (web scraping), preprocessing, exploratory analysis, feature engineering, modeling, and deployment for real-world property data from Gurgaon.
Advanced Data Cleaning & Preprocessing Techniques
- Applied multi-stage cleaning to merge heterogeneous datasets, standardize formats, fix unit inconsistencies, handle outliers, and impute missing values using statistical relationships and domain knowledge.
Domain-Specific Feature Engineering
- Engineered impactful real estate–specific features such as luxury category, floor category, and price per sqft normalization, enabling more accurate predictions and insightful analytics.
Comprehensive Exploratory Data Analysis (EDA)
- Conducted univariate, bivariate, and multivariate analysis to uncover key insights on property type distribution, sector-level market trends, size-price relationships, and factors influencing property valuation.
Interactive Analytics Dashboard Creation
- Built a Streamlit-based analytics module featuring 10 interactive visualizations (sector-level geomaps, BHK analysis, price distribution charts, heatmaps, word clouds) to make market trends easy to interpret for users.
Price Prediction Model Development
- Trained and optimized a machine learning regression pipeline (Random Forest, Gradient Boosting, etc.) to deliver instant property price estimates, balancing high R² scores with low MAE for accuracy and reliability.
Personalized Recommendation System Implementation
- Designed a Society Recommender Engine allowing users to filter by location, amenities, and proximity, delivering intelligent suggestions for similar housing societies.
Model Interpretability & Explainability
- Utilized feature importance and SHAP-based analysis to explain how location, property type, size, and amenities influence predicted prices, building trust in the system’s outputs.
User-Centric Application Deployment
- Deployed the project on Hugging Face Spaces with an intuitive sidebar navigation, ensuring smooth integration of all modules for real-time, user-friendly interaction.
Business & Market Insight Generation
- Delivered actionable insights for buyers, sellers, and investors to identify undervalued properties, high-demand sectors, and optimal investment opportunities through data-driven decision-making.