Predicting Home Values: A Full Data Science Project Walkthrough // SALAMEH

In this blog post, I've taken you on a comprehensive, step-by-step journey through a real-world data science project: predicting home values in West Roxbury, Boston. I started by defining the core problem and outlining our industry, stakeholders, goals, and the benefits of achieving them. From there, I walked through the crucial stages of data collection, showing how I loaded and prepared the WestRoxbury.csv file. You've seen my approach to data cleaning, where I tackled missing values and standardized column names to ensure our dataset was pristine. Then, we dove deep into data exploration (EDA), visualizing relationships and uncovering insights before moving on to feature engineering to prepare our data for modeling.

1. Problem Definition

Example: Predict the value of homes in West Roxbury, Boston, MA
Industry: Real Estate
Stakeholder: Zillow
Goal: To become the dominant platform for checking house prices
Benefits: If the goal is achieved, more realtors will pay Zillow for advertising and hence become the dominant online advertising venue for realtors.

2. Data Collection

Usually, we find this data in databases, or we have to web scrape to get the data, or there might be some API that can have the data we want. We found the city housing data that are used to estimate property values for tax assessment.

In this file WestRoxbury.csv we have the data.

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

let's talk a little bit about the libraries we imported

pandas: to manipulate the csv file and work with tables
numpy: to do math operations ig
matplotlib and seaborn: to visualize data on plots and charts idk why we are using two libraries
sklearn: we are just importing train_test_split, its purpose is to just split the data to use some of it for testing and the remaining for the training

python

housing_df = pd.read_csv("WestRoxbury.csv")

we used the pandas library to load the data from the csv file

Note: df in the name of the variable stands for "data frame"

3. Data Cleaning

python

print(housing_df)

CODE

      TOTAL VALUE    TAX  LOT SQFT   YR BUILT  GROSS AREA   LIVING AREA  \
0            344.2  4330       9965      1880         2436         1352   
1            412.6  5190       6590      1945         3108         1976   
2            330.1  4152       7500      1890         2294         1371   
3            498.6  6272      13773      1957         5032         2608   
4            331.5  4170       5000      1910         2370         1438   
...            ...   ...        ...       ...          ...          ...   
5797         404.8  5092       6762      1938         2594         1714   
5798         407.9  5131       9408      1950         2414         1333   
5799         406.5  5113       7198      1987         2480         1674   
5800         308.7  3883       6890      1946         2000         1000   
5801         447.6  5630       7406      1950         2510         1600   

      FLOORS   ROOMS  BEDROOMS   FULL BATH  HALF BATH  KITCHEN  FIREPLACE  \
0         2.0      6        NaN          1          1        1          0   
1         2.0     10        NaN          2          1        1          0   
2         2.0      8        NaN          1          1        1          0   
3         1.0      9        5.0          1          1        1          1   
4         2.0      7        3.0          2          0        1          0   
...       ...    ...        ...        ...        ...      ...        ...   
5797      2.0      9        3.0          2          1        1          1   
5798      2.0      6        3.0          1          1        1          1   
5799      2.0      7        3.0          1          1        1          1   
5800      1.0      5        2.0          1          0        1          0   
5801      2.0      7        3.0          1          1        1          1   

     REMODEL  
0        NaN  
1     Recent  
2        NaN  
3        NaN  
4        NaN  
...      ...  
5797  Recent  
5798     NaN  
5799     NaN  
5800     NaN  
5801     NaN  

[5802 rows x 14 columns]

we can print the data in the terminal, but there are better ways to do this 🧏

python

print(housing_df.shape)

CODE

(5802, 14)

this prints the (rows_num , coloumn_num)

python

print(housing_df.columns)

CODE

Index(['TOTAL VALUE ', 'TAX', 'LOT SQFT ', 'YR BUILT', 'GROSS AREA ',
       'LIVING AREA', 'FLOORS ', 'ROOMS', 'BEDROOMS ', 'FULL BATH',
       'HALF BATH', 'KITCHEN', 'FIREPLACE', 'REMODEL'],
      dtype='object')

this prints the names of the columns to see what the data we are dealing with is

python

housing_df = housing_df.rename(columns={'BEDROOMS ':'BEDROOMS'})

we have to remove the space from the name of the column. In this case, we renamed it but without the space. We can just use the strip method and it does it automatically. I don't know why we are doing this.

python

housing_df.columns = [col_name.strip().replace(" ", "_") for col_name in housing_df.columns]

ok now we loop over each column and change to remove the space from the beginning and the end of each thing with the strip method and replacing each space, I mean the ones between the words, with underscore _ with the replace method. We removed the space and replaced it with the _ to get access to the . functions in the future.

We can also write this line in this syntax to be more readable:

python

new_columns = []
for col_name in housing_df.columns:
    cleaned_name = col_name.strip().replace(' ', '_')
    new_columns.append(cleaned_name)
housing_df.columns = new_columns

python

housing_df.head()

	TOTAL_VALUE	TAX	LOT_SQFT	YR_BUILT	GROSS_AREA	LIVING_AREA	FLOORS	ROOMS	BEDROOMS	FULL_BATH	HALF_BATH	KITCHEN	FIREPLACE	REMODEL
0	344.2	4330	9965	1880	2436	1352	2.0	6	NaN	1	1	1	0	NaN
1	412.6	5190	6590	1945	3108	1976	2.0	10	NaN	2	1	1	0	Recent
2	330.1	4152	7500	1890	2294	1371	2.0	8	NaN	1	1	1	0	NaN
3	498.6	6272	13773	1957	5032	2608	1.0	9	5.0	1	1	1	1	NaN
4	331.5	4170	5000	1910	2370	1438	2.0	7	3.0	2	0	1	0	NaN

The head method will show the first 5 rows in the df

python

housing_df.loc[0:9]

	TOTAL_VALUE	TAX	LOT_SQFT	YR_BUILT	GROSS_AREA	LIVING_AREA	FLOORS	ROOMS	BEDROOMS	FULL_BATH	HALF_BATH	KITCHEN	FIREPLACE	REMODEL
0	344.2	4330	9965	1880	2436	1352	2.0	6	NaN	1	1	1	0	NaN
1	412.6	5190	6590	1945	3108	1976	2.0	10	NaN	2	1	1	0	Recent
2	330.1	4152	7500	1890	2294	1371	2.0	8	NaN	1	1	1	0	NaN
3	498.6	6272	13773	1957	5032	2608	1.0	9	5.0	1	1	1	1	NaN
4	331.5	4170	5000	1910	2370	1438	2.0	7	3.0	2	0	1	0	NaN
5	337.4	4244	5142	1950	2124	1060	1.0	6	3.0	1	0	1	1	Old
6	359.4	4521	5000	1954	3220	1916	2.0	7	3.0	1	1	1	0	NaN
7	320.4	4030	10000	1950	2208	1200	1.0	6	3.0	1	0	1	0	NaN
8	333.5	4195	6835	1958	2582	1092	1.0	5	3.0	1	0	1	1	Recent
9	409.4	5150	5093	1900	4818	2992	2.0	8	4.0	2	0	1	0	NaN

The loc method takes a range and it shows the data from the range you selected.

python

housing_df.iloc[0:9,1]

CODE

0    4330
1    5190
2    4152
3    6272
4    4170
5    4244
6    4521
7    4030
8    4195
Name: TAX, dtype: int64

The iloc method is the same as the loc, but the second parameter is the index of the column.

python

housing_df.loc[0:9,"ROOMS"]

CODE

0     6
1    10
2     8
3     9
4     7
5     6
6     7
7     6
8     5
9     8
Name: ROOMS, dtype: int64

if we add the second parameter, we get the data from the column we want

python

housing_df["REMODEL"] = housing_df["REMODEL"].fillna("None")

the fillna method we used to replace any NA values with something we want, in this case, None

python

housing_df.ROOMS.dtype

CODE

dtype('int64')

here we are getting the data type of the column, and thats why we renamed the columns to have _ and not space to access the functions like now

python

housing_df.dtypes

CODE

TOTAL_VALUE    float64
TAX              int64
LOT_SQFT         int64
YR_BUILT         int64
GROSS_AREA       int64
LIVING_AREA      int64
FLOORS         float64
ROOMS            int64
BEDROOMS       float64
FULL_BATH        int64
HALF_BATH        int64
KITCHEN          int64
FIREPLACE        int64
REMODEL         object
dtype: object

we can use this to see if the data is categorical or not, like with the REMODEL column, which is categorical

python

housing_df.loc[0:9,"REMODEL"]

CODE

0      None
1    Recent
2      None
3      None
4      None
5       Old
6      None
7      None
8    Recent
9      None
Name: REMODEL, dtype: object

python

housing_df.REMODEL = housing_df.REMODEL.astype("category")
housing_df.REMODEL.cat.categories

CODE

Index(['None', 'Old', 'Recent'], dtype='object')

the value of the REMODEL column is categorical, so we converted it to a categorical type and we can see the categories with the cat.categories method, the .cat is a method that is used to work with categorical data

python

na_counts = housing_df.isna().sum()
print(na_counts)

CODE

TOTAL_VALUE    0
TAX            0
LOT_SQFT       0
YR_BUILT       0
GROSS_AREA     0
LIVING_AREA    0
FLOORS         0
ROOMS          0
BEDROOMS       3
FULL_BATH      0
HALF_BATH      0
KITCHEN        0
FIREPLACE      0
REMODEL        0
dtype: int64

now we can see the number of NA values in each column, as we can see there are NA values in the BEDROOMS column, lets fix that

python

housing_df.BEDROOMS.count()

CODE

np.int64(5799)

the .count method counts the number of non-NA values in a column

python

medianBedrooms = housing_df.BEDROOMS.median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(medianBedrooms)

we found the median of the BEDROOMS column and we filled the NA values with the median,there are better ways to do this, but this is one of them

4. Data Exploration (EDA)

we preform expletory data analysis (EDA) to understand the data and find patterns and relationships between the data.
from what I've understood we just have to visualize the data, and find relations

python

pd.pivot_table(housing_df,index=["REMODEL"],values=["TOTAL_VALUE"],aggfunc=[np.mean,np.std])

	mean	std
	TOTAL_VALUE	TOTAL_VALUE
REMODEL
None	380.189140	93.463511
Old	405.088528	101.993764
Recent	446.518951	105.710596

the pivot_table is a pandas method that takes four parameters

the dataframe we want to pivot in this case housing_df
the index we want to pivot on in this case REMODEL, which is a categorical column
the values we want to pivot on in this case TOTAL_VALUE, which is a numerical column, we can also pivot on more than one column by providing the column names in the list
the aggregation function we want to use in this case mean and std, which is the standard deviation

python

housing_df.describe()

	TOTAL_VALUE	TAX	LOT_SQFT	YR_BUILT	GROSS_AREA	LIVING_AREA	FLOORS	ROOMS	BEDROOMS	FULL_BATH	HALF_BATH	KITCHEN	FIREPLACE
count	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.000000	5802.00000	5802.000000
mean	392.685715	4939.485867	6278.083764	1936.744916	2924.842123	1657.065322	1.683730	6.994829	3.229748	1.296794	0.613926	1.01534	0.739917
std	99.177414	1247.649118	2669.707974	35.989910	883.984726	540.456726	0.444884	1.437657	0.846497	0.522040	0.533839	0.12291	0.565108
min	105.000000	1320.000000	997.000000	0.000000	821.000000	504.000000	1.000000	3.000000	1.000000	1.000000	0.000000	1.00000	0.000000
25%	325.125000	4089.500000	4772.000000	1920.000000	2347.000000	1308.000000	1.000000	6.000000	3.000000	1.000000	0.000000	1.00000	0.000000
50%	375.900000	4728.000000	5683.000000	1935.000000	2700.000000	1548.500000	2.000000	7.000000	3.000000	1.000000	1.000000	1.00000	1.000000
75%	438.775000	5519.500000	7022.250000	1955.000000	3239.000000	1873.750000	2.000000	8.000000	4.000000	2.000000	1.000000	1.00000	1.000000
max	1217.800000	15319.000000	46411.000000	2011.000000	8154.000000	5289.000000	3.000000	14.000000	9.000000	5.000000	3.000000	2.00000	4.000000

the describe method is a pandas method that returns a summary of the data in the data frame, which includes the count, mean, standard deviation, minimum, maximum, and quartiles

python

ax = housing_df.TOTAL_VALUE.hist()
ax.set_xlabel("Total Value")
ax.set_ylabel("Count")
plt.show()

the hist method is used for the columns of the data frame, it returns a histogram of the data, and the set_xlabel and set_ylabel methods are used to set the labels of the x and y axes, and the plt.show method is used to show the plot

python

housing_df.plot.scatter(x="ROOMS",y="TOTAL_VALUE")

python

sns.pairplot(housing_df, x_vars=["ROOMS","FLOORS"], y_vars="TOTAL_VALUE", hue="REMODEL",height=5, aspect=.8, kind="reg")

python

sns.scatterplot(x="LOT_SQFT",y="TOTAL_VALUE",size="FULL_BATH",data=housing_df)

python

sns.scatterplot(x="LOT_SQFT",y="TOTAL_VALUE",size="FULL_BATH",hue="REMODEL",data=housing_df)

python

ax = housing_df.boxplot(column="TOTAL_VALUE",by="ROOMS", figsize=(10,10),grid=True)
ax.set_ylabel("Total Value")
plt.suptitle('')
plt.title('')

here is Youtube video explaining the above plot

5. Feature Engineering

Create and transform variables (features) that enhance the predictive power of the model. This may involve scaling, encoding categorical variables, or generating new features based on domain knowledge.

python

housing_df = pd.get_dummies(housing_df,drop_first=True,prefix_sep="_")
housing_df.REMODEL_Old = housing_df.REMODEL_Old.astype("int")
housing_df.REMODEL_Recent = housing_df.REMODEL_Recent.astype("int")
housing_df.head()

	TOTAL_VALUE	TAX	LOT_SQFT	YR_BUILT	GROSS_AREA	LIVING_AREA	FLOORS	ROOMS	BEDROOMS	FULL_BATH	HALF_BATH	KITCHEN	FIREPLACE	REMODEL_Recent
0	344.2	4330	9965	1880	2436	1352	2.0	6	3.0	1	1	1	0	0
1	412.6	5190	6590	1945	3108	1976	2.0	10	3.0	2	1	1	0	1
2	330.1	4152	7500	1890	2294	1371	2.0	8	3.0	1	1	1	0	0
3	498.6	6272	13773	1957	5032	2608	1.0	9	5.0	1	1	1	1	0
4	331.5	4170	5000	1910	2370	1438	2.0	7	3.0	2	0	1	0	0

any categorical value have to be converted to a binary value, cuz in the regression model i have to have numbers not names and categorys.

we can just take two of the columns and if both of them are zero, we know that the third value is One

6. Modelling

Select suitable algorithms and build predictive or prescriptive models. This involves training machine learning or statistical models using the prepared data, then tuning the models to improve accuracy and reliability.

In this step, dataset is split into:

Training set (used to build/train multiple models)
Validation set (used to tune parameters, compare models and select the best model)
Testing set (held back until final model evaluation to assess the performance of the chosen model).

pay attention to overfitting

python

excludeColumns = ('TOTAL_VALUE', "TAX")
predectors = [col for col in housing_df.columns if col not in excludeColumns]
X = housing_df[predectors]
y = housing_df["TOTAL_VALUE"]

train_X,valid_X,train_y,valid_y = train_test_split(X,y,test_size=0.7,random_state=1)

It excludes the TOTAL_VALUE and TAX columns from the dataset.
It assigns the remaining columns to X (features) and TOTAL_VALUE to y (target variable).
It splits X and y into training and validation sets using train_test_split, with 70% of the data going into the training set and 30% into the validation set. The split is done randomly, but the randomness is reproducible due to the fixed random_state.

python

model = LinearRegression()
model.fit(train_X,train_y)

train_predictions = model.predict(train_X)
valid_predictions = model.predict(valid_X)

7. Evaluation

Assess model performance using appropriate metrics (e.g., accuracy, precision, recall, etc.).

python

train_results = pd.DataFrame({"TOTAL_VALUE":train_y,"predicted":train_predictions, "residuals":train_y-train_predictions})
print(train_results.head())

CODE

      TOTAL_VALUE   predicted   residuals
3772        517.5  416.830508  100.669492
3335        325.5  338.979750  -13.479750
2662        419.7  392.672343   27.027657
3546        372.7  364.048194    8.651806
4825        373.8  371.904935    1.895065

This code creates a pandas DataFrame train_results that compares the actual TOTAL_VALUE (train_y) with the predicted values (train_predictions) from a linear regression model. It also calculates the residuals (differences between actual and predicted values). The head() function then prints the first few rows of this DataFrame.

python

all_residuals = train_y-train_predictions
ax = pd.DataFrame({"residuals":all_residuals}).hist(bins=25)
plt.tight_layout()
plt.show()

This code calculates the residuals (differences between actual and predicted values) of a linear regression model and visualizes them as a histogram with 25 bins.

We visualize the residuals to check the assumptions of linear regression, specifically:

Normality: A histogram of residuals can help verify if the errors are normally distributed, which is an assumption of linear regression.
Constant variance: A uniform spread of residuals in the histogram indicates constant variance, another assumption of linear regression.
No patterns: A random, scattered pattern in the histogram suggests that the model is capturing the underlying relationships in the data.

By examining the residual histogram, we can identify potential issues with the model, such as non-normality, heteroscedasticity, or non-linearity.

python

r2= model.score(train_X,train_y)

This code calculates the R-squared (R²) value of the linear regression model on the training data. R² measures the goodness of fit of the model, with higher values indicating better fit (up to a maximum of 1).

but as we said earlier we have to pay attention to overfitting

python

valid_residuals = valid_y-valid_predictions
ax = pd.DataFrame({"residuals":valid_residuals}).hist(bins=25)
plt.tight_layout()
plt.show()

8. Interpretation and Communication

Translate results into actionable insights and communicate findings to stakeholders. This includes explaining model predictions, highlighting key features, and addressing how the findings support the original business problem.

9. Deployment

Implement the model in a real-world environment where it can be used to generate actionable results. This step involves creating a Minimum Viable Product (MVP) by deploying a simplified, essential version of the model to test its effectiveness in solving the problem. During deployment, you may integrate the model with existing systems, develop an API, or set up a dashboard. The MVP allows you to gather initial feedback, monitor
performance, and assess its value to users, providing a foundation for further refinement and scaling based on real-world data and feedback.

python

new_data = pd.DataFrame({
    'LOT_SQFT': [4200, 6444, 5035],
    'YR_BUILT': [1960, 1940, 1925],
    'GROSS_AREA': [2670, 2886, 3264],
    'LIVING_AREA': [1710, 1474, 1523],
    'FLOORS': [2.0, 1.5, 1.9],
    'ROOMS': [10, 6, 6],
    'BEDROOMS': [4, 3, 2],
    'FULL_BATH': [1, 1, 1],
    'HALF_BATH': [1, 1, 0],
    'KITCHEN': [1, 1, 1],
    'FIREPLACE': [1, 1, 0],
    'REMODEL_Old': [0, 0, 0],
    'REMODEL_Recent': [0, 0, 1],
})

print(new_data)

print('Predictions: ', model.predict(new_data))

CODE

   LOT_SQFT  YR_BUILT  GROSS_AREA  LIVING_AREA  FLOORS  ROOMS  BEDROOMS  \
0      4200      1960        2670         1710     2.0     10         4   
1      6444      1940        2886         1474     1.5      6         3   
2      5035      1925        3264         1523     1.9      6         2   

   FULL_BATH  HALF_BATH  KITCHEN  FIREPLACE  REMODEL_Old  REMODEL_Recent  
0          1          1        1          1            0               0  
1          1          1        1          1            0               0  
2          1          0        1          0            0               1  
Predictions:  [384.30511506 377.59257723 385.8679671 ]

and that's it we predicted the values

10. Monitoring and Maintenance

Continuously monitor the model’s performance to ensure it remains accurate over time. Regular updates may be needed as new data becomes available or if the business problem evolves

Conclusion

As we conclude this project, I hope I've clearly illustrated the complete data science lifecycle, from our initial problem definition all the way to model deployment. We've meticulously cleaned and explored our raw data, transforming it into meaningful features, and then built and rigorously evaluated our linear regression model. The successful prediction of new home values at the end, as shown with our new_data DataFrame, truly showcases the tangible outcome of this systematic process. Remember, data science isn't a one-and-done deal; continuous monitoring and maintenance are absolutely key to ensuring our models remain accurate and impactful in a dynamic real-world environment.