Machine learning has been used for years to offer image recognition, spam detection, natural speech comprehension, product recommendations, and medical diagnoses. Today, machine learning algorithms can help us enhance cybersecurity, ensure public safety, and improve medical outcomes. Machine learning systems can also make customer service better and automobiles safer.
When I started experimenting with machine learning, I wanted to come up with an application that would solve a real-world problem but would not be too complicated to implement. I also wanted to practice working with regression algorithms. So I started looking for a problem worth solving. Here’s what I came up with.
If you’re going to sell a house, you need to know what price tag to put on it. And a computer algorithm can give you an accurate estimate!
In this article, I’ll show you how I wrote a regression algorithm to predict home prices.
Regression in a nutshell
Put simply, regression is a machine learning tool that helps you make predictions by learning, from existing statistical data, the relationships between your target parameter and a set of other parameters. In our case, a house’s price depends on parameters such as the number of bedrooms, living area, and location. If we apply machine learning to these parameters, we can calculate house valuations in a given geographical area.
The idea of regression is pretty simple: given enough data, you can observe the relationship between your target parameter (the output) and other parameters (the input), and then apply this relationship function to real observed data.
To show you how a regression algorithm works, we’ll take into account only one parameter – a home’s living area – to predict price. It’s reasonable to suppose that there is a linear relationship between area and price. And as we remember from high school, a linear relationship is represented by a linear equation:
y = k0 + k1*x
In our case, y equals price and x equals area. Predicting the price of a home is as simple as solving the equation (where k0 and k1 are constant coefficients):
price = k0 + k1 * area
We can calculate these coefficients (k0 and k1) using regression. Let’s assume we have 1000 known house prices in a given area. Using a learning technique, we can find a set of coefficient values. Once found, we can plug in different area values to predict the resulting price.
[In this graph, y is price and x is living area. Black dots are our observations. Moving lines show what happens when k0 and k1 change.]
But there is always a deviation, or difference between a predicted value and an actual value. For any given combination of k0 and k1, we can calculate the total deviation by summing the deviations across all 1000 observations.
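To make the idea of total deviation concrete, here is a minimal sketch. It assumes the deviation is measured as a sum of squared differences (a common choice, though nothing above fixes one), and the areas, prices, and candidate coefficients are made-up illustration values.

```python
import numpy as np

# Hypothetical observations: living area (m^2) and price, for illustration only
areas = np.array([80.0, 100.0, 120.0, 150.0])
prices = np.array([200000.0, 240000.0, 290000.0, 350000.0])


def total_deviation(k0, k1, x, y):
    """Sum of squared differences between predicted and actual values."""
    predicted = k0 + k1 * x
    return float(np.sum((y - predicted) ** 2))


# Two candidate lines: a flat one and one that follows the points more closely
dev_flat = total_deviation(270000.0, 0.0, areas, prices)
dev_sloped = total_deviation(30000.0, 2150.0, areas, prices)
print(dev_flat, dev_sloped)
```

The sloped line gives a much smaller total deviation than the flat one, which is exactly the signal used to prefer one pair of coefficients over another.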
Regression searches for the values of k0 and k1 that minimize the total deviation; this is the idea of regression in a nutshell.
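In practice nobody tries coefficient values one by one: for a straight line the minimizing pair can be computed directly. The sketch below assumes ordinary least squares (squared deviations) and uses NumPy's `np.polyfit`; the data points and the `predict_price` helper are hypothetical.

```python
import numpy as np

# Hypothetical observations, for illustration only
areas = np.array([80.0, 100.0, 120.0, 150.0, 170.0])
prices = np.array([200000.0, 240000.0, 290000.0, 350000.0, 400000.0])

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
k1, k0 = np.polyfit(areas, prices, deg=1)


def predict_price(area):
    """Apply the fitted line: price = k0 + k1 * area."""
    return k0 + k1 * area


def total_deviation(b0, b1):
    """Sum of squared residuals for a candidate line."""
    return float(np.sum((prices - (b0 + b1 * areas)) ** 2))


best = total_deviation(k0, k1)
# Nudging either coefficient away from the fit should not reduce the deviation
print(best <= total_deviation(k0 + 1000.0, k1))
print(best <= total_deviation(k0, k1 + 10.0))
print(predict_price(110.0))
```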
But in real life, there are other challenges you need to deal with. House prices obviously depend on multiple parameters, and there is no clear linear relationship between all of these parameters.
Now I’m going to tell you how I used regression algorithms to predict house prices for my pet project.
How to use regression algorithms in machine learning
1. Gather data
The first step for any kind of machine learning analysis is gathering the data – which must be valid. If you can’t guarantee the validity of your data, then there’s no point analyzing it. You need to pay attention to the source you take your data from.
For my purposes, I’ve relied on a database from one of the largest real estate portals in the Netherlands. Since the real estate market in the Netherlands is strictly regulated, I didn’t have to check its validity.
Initially I had to do some data mining because the required data were available in various formats across multiple sources. In most real-world projects you’ll have to do some data mining as well. We won’t discuss data mining here, however, as it’s not really relevant to our topic.
For the purpose of this article, let’s imagine that all the real estate data I found were in the format shown below.
2. Analyze data
Once you’ve gathered data it’s time to analyze it. After parsing the data I got the following records:

{ "has_garden" : 1, "year" : 1980, "lng" : 4.640685, "has_garage" : 0, "changes_count" : "2", "area" : 127, "bedrooms_count" : "4", "com_time" : 57, "price" : 305000, "energy_label" : 1, "lat" : 52.30177, "rooms_count" : "5", "life_quality" : 6, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 4.656503, "has_garage" : 0, "changes_count" : "3", "area" : 106, "bedrooms_count" : "3", "com_time" : 64, "price" : 275000, "energy_label" : 3, "lat" : 52.35456, "rooms_count" : "4", "life_quality" : 5, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 4.585596, "has_garage" : 0, "changes_count" : "3", "area" : 106, "bedrooms_count" : "3", "com_time" : 74, "price" : 244000, "energy_label" : 3, "lat" : 52.29309, "rooms_count" : "4", "life_quality" : 6, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 4.665817, "has_garage" : 0, "changes_count" : "2", "area" : 102, "bedrooms_count" : "4", "com_time" : 77, "price" : 199900, "energy_label" : 3, "lat" : 52.14919, "rooms_count" : "5", "life_quality" : 6, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 4.620336, "has_garage" : 0, "changes_count" : "1", "area" : 171, "bedrooms_count" : "3", "com_time" : 79, "price" : 319000, "energy_label" : 1, "lat" : 52.27822, "rooms_count" : "4", "life_quality" : 6, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 5.062804, "has_garage" : 1, "changes_count" : "1", "area" : 139, "bedrooms_count" : "5", "com_time" : 38, "price" : 265000, "energy_label" : 5, "lat" : 52.32992, "rooms_count" : "6", "life_quality" : 7, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 5.154957, "has_garage" : 1, "changes_count" : "2", "area" : 129, "bedrooms_count" : "4", "com_time" : 57, "price" : 309500, "energy_label" : 1, "lat" : 52.35634, "rooms_count" : "5", "life_quality" : 6, "house_type" : 1, "is_leasehold" : 0 }

{ "has_garden" : 1, "year" : 1980, "lng" : 4.622486, "has_garage" : 0, "changes_count" : "1", "area" : 125, "bedrooms_count" : "4", "com_time" : 76, "price" : 289000, "energy_label" : 1, "lat" : 52.2818, "rooms_count" : "5", "life_quality" : 6, "house_type" : 1, "is_leasehold" : 0 }
Here’s what the fields mean:

has_garden – does the property have a garden? 1 – true, 0 – false

year – year of construction

lat, lng – house location coordinates

area – total living area

has_garage – does the property have a garage? 1 – true, 0 – false

bedrooms_count – number of bedrooms

rooms_count – total number of rooms

energy_label – energy efficiency label (assigned for each house in the Netherlands)

life_quality – life quality mark calculated for each district by local authorities

house_type – property type (1 – house, 0 – apartment)

com_time – commuting time to Amsterdam center

changes_count – number of transfers needed to reach Amsterdam center by public transport
I worked on the assumption that these are all measurable data that affect a home’s price. Of course, there may be more parameters that matter as well, such as house condition and location. But these parameters are more subjective and almost impossible to measure, so I ignored them.
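The sample records store some counts as strings ("4" rather than 4), so they need converting before any numeric analysis. Here is a minimal sketch of that step, assuming one JSON record per line and using pandas; the two records are trimmed to a few fields for brevity, and the `string_fields` list is simply read off the sample.

```python
import json

import pandas as pd

# Two sample records, trimmed to a few fields for brevity
lines = [
    '{"area": 127, "bedrooms_count": "4", "rooms_count": "5", '
    '"changes_count": "2", "price": 305000}',
    '{"area": 106, "bedrooms_count": "3", "rooms_count": "4", '
    '"changes_count": "3", "price": 275000}',
]

full_dataset = pd.DataFrame([json.loads(line) for line in lines])

# These fields arrive as strings in the sample records
string_fields = ["bedrooms_count", "rooms_count", "changes_count"]
for field in string_fields:
    full_dataset[field] = pd.to_numeric(full_dataset[field])

print(full_dataset.dtypes)
```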
3. Check the correlation between parameters
Now you need to check for strong correlations among the given parameters. If two parameters are strongly correlated, remove one of them. In my dataset there were no strong correlations among values.
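One way to run such a check is with the pandas correlation matrix. The sketch below is an illustration rather than the code I used: the helper name `strongly_correlated_pairs` and the 0.9 threshold are assumptions, and the data are synthetic, with one column deliberately derived from another so that a pair gets flagged.

```python
import numpy as np
import pandas as pd


def strongly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for pairs whose |Pearson r| exceeds threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Synthetic example: rooms_count is deliberately built as bedrooms_count + 1
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 6, size=200)
demo = pd.DataFrame({
    "bedrooms_count": bedrooms,
    "rooms_count": bedrooms + 1,
    "area": rng.normal(120, 30, size=200),
})
print(strongly_correlated_pairs(demo))
```

If a pair shows up, keeping just one of the two columns is usually enough, since the second adds almost no new information.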
4. Remove outliers from the dataset
Outliers are observation points that are distant from other observations. For example, in my data there was one house with an area of 50 square meters for a price of $500K. Such houses may exist on the market for various reasons, but they are not statistically meaningful. I want to make a price estimate based on the market average, and so I won’t take such outliers into account.
Many regression methods are sensitive to outliers, which may significantly affect the results, so it’s best to remove them from the dataset. To detect the outliers I used the following function:
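from sklearn import svm

def get_outliners(dataset, outliers_fraction=0.25):
    # nu is an upper bound on the fraction of training points treated as outliers
    clf = svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, kernel="rbf", gamma=0.1)
    clf.fit(dataset)
    # predict() returns 1 for inliers and -1 for outliers
    result = clf.predict(dataset)
    return result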
This will return -1 for outliers and 1 for non-outliers. Then you can do something like this:
training_dataset = full_dataset[get_outliners(full_dataset[analytics_fields_with_price], 0.15)==1]
After that you will have non-outlier observations only. Now it’s time to start regression analysis.
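To see the filtering in action, here is a small self-contained run. The function is repeated so the block works by itself, and the data are synthetic points on a unit scale, which is an assumption: an RBF kernel with a fixed gamma depends on distances between points, so real features measured in very different units may need rescaling first.

```python
import numpy as np
from sklearn import svm


def get_outliners(dataset, outliers_fraction=0.25):
    """Same function as above, repeated so this block runs on its own."""
    clf = svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, kernel="rbf", gamma=0.1)
    clf.fit(dataset)
    return clf.predict(dataset)


# Synthetic 2-D data on a unit scale: a dense cluster plus a few distant points
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(200, 2))
distant = rng.uniform(6.0, 8.0, size=(5, 2))
data = np.vstack([cluster, distant])

labels = get_outliners(data, 0.15)
kept = data[labels == 1]

print("rows before:", len(data))
print("rows kept:  ", len(kept))
print("flagged:    ", int(np.sum(labels == -1)))
```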
5. Choose a regression algorithm
There’s more than one way to do regression analysis. What we’re looking for is the best prediction accuracy given our data. But how can we check accuracy? A common way is to calculate the so-called r^2 score (the coefficient of determination), which compares the squared differences between actual and predicted values with the spread of the actual values around their mean. A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean.
It’s important to remember that if we use the same dataset for both learning and checking accuracy, we won’t notice when our model overfits. An overfitted model shows excellent accuracy on the given dataset but fails when given new data.
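The r^2 score is commonly defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares the result with scikit-learn's `r2_score`; the `r_squared` helper is hypothetical and the predicted prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Actual prices from the sample records; predictions are made-up values
actual = [305000, 275000, 244000, 199900]
predicted = [298000, 280000, 250000, 210000]

manual = r_squared(actual, predicted)
print(manual, r2_score(actual, predicted))
```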
A common approach to solve this problem is to split the original dataset into two parts and then use one for learning and another for testing. This way we will simulate new data for our learning model, and if there is an overfit, we can spot it.
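Here is a minimal sketch of how a held-out set exposes overfitting. It uses synthetic data and an unconstrained `DecisionTreeRegressor` as a stand-in model that overfits easily; neither is part of my actual project, and the 20% test share is just an example value.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: price depends linearly on area, plus noise
rng = np.random.default_rng(42)
area = rng.uniform(50, 200, size=500)
price = 50000 + 2000 * area + rng.normal(0, 30000, size=500)
X = area.reshape(-1, 1)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize the training rows, noise included
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(train_r2, test_r2)
```

A training score well above the test score is the warning sign: the model has learned the noise in the training rows rather than the underlying relationship.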
We can split our dataset using a proportion of 80/20. We’ll use 80% for training and the remaining 20% for testing. Let’s take a look at this piece of code:
# code for algorithm quality estimation
from sklearn import svm
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.metrics import r2_score
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# prepare dataset
# ....

# split dataset: 80% for training, 20% for testing
Xtrn, Xtest, Ytrn, Ytest = train_test_split(
    training_dataset[analytics_fields], training_dataset[['price']], test_size=0.2)

# model = RandomForestRegressor(n_estimators=150, max_features='sqrt', n_jobs=1)  # random forest
models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=100, max_features='sqrt'),
    KNeighborsRegressor(n_neighbors=6),
    SVR(kernel='linear'),
    LogisticRegression(),  # note: a classifier; it treats each price as a class label
]

rows = []
for model in models:
    # get model name
    m = str(model)
    name = m[:m.index('(')]
    # fit model on training dataset
    model.fit(Xtrn, Ytrn['price'])
    # predict prices for test dataset and calculate r^2
    r2 = r2_score(Ytest['price'], model.predict(Xtest))
    # write obtained data
    rows.append({'Model': name, 'R2_Price': r2})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
TestModels = pd.DataFrame(rows)
TestModels.set_index('Model', inplace=True)

fig, axes = plt.subplots(ncols=1, figsize=(10, 4))
TestModels.R2_Price.plot(ax=axes, kind='bar', title='R2_Price')
plt.show()
As a result, I got the following graph:
As you can see, the RandomForest regressor showed the best accuracy, so I decided to use this algorithm for production.
Price prediction in production works pretty much the same as in our test code except there’s no need to calculate r^2 and switch models anymore.
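A production-style run might look roughly like the sketch below. It is an illustration under stated assumptions: the training data are synthetic stand-ins, only three of the fields are used, fitting on all cleaned rows is my assumption about what "no split" means here, and the new listing's values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the cleaned training data (not the real dataset)
rng = np.random.default_rng(1)
n = 300
training_dataset = pd.DataFrame({
    "area": rng.uniform(60, 200, size=n),
    "bedrooms_count": rng.integers(1, 6, size=n),
    "com_time": rng.uniform(20, 90, size=n),
})
training_dataset["price"] = (
    40000
    + 1800 * training_dataset["area"]
    + 10000 * training_dataset["bedrooms_count"]
    - 500 * training_dataset["com_time"]
    + rng.normal(0, 15000, size=n)
)

analytics_fields = ["area", "bedrooms_count", "com_time"]

# Assumption: with no test split needed, fit once on all cleaned rows
model = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(training_dataset[analytics_fields], training_dataset["price"])

# Predict the price of a new listing (hypothetical values)
new_house = pd.DataFrame([{"area": 127, "bedrooms_count": 4, "com_time": 57}])
predicted_price = float(model.predict(new_house[analytics_fields])[0])
print(round(predicted_price))
```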
At this point, we can offer fair price predictions. We can compare the actual price of a house with our predicted price and observe the deviation.
Take a look at my house price prediction project. I’ve sorted the houses I analyzed from undervalued to overvalued. This tool only ranks homes in the Netherlands.
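The ranking itself can be sketched in a few lines of pandas. This assumes the deviation is expressed relative to the predicted price, which is one reasonable choice rather than necessarily the one used in the project; the listings and predictions are hypothetical.

```python
import pandas as pd

# Hypothetical listings with asking prices and model predictions
listings = pd.DataFrame({
    "listing": ["A", "B", "C", "D"],
    "price": [305000, 275000, 244000, 199900],
    "predicted_price": [290000, 300000, 240000, 230000],
})

# Relative deviation: negative means the asking price is below the prediction
listings["deviation"] = (
    (listings["price"] - listings["predicted_price"]) / listings["predicted_price"]
)

# Most undervalued first, most overvalued last
ranked = listings.sort_values("deviation").reset_index(drop=True)
print(ranked[["listing", "deviation"]])
```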
Naturally, my house price prediction algorithm isn’t 100% accurate. But for me it’s acceptable because my initial goal was to create some kind of custom ranking to reduce the amount of manual work for people who consider buying or selling a house and want to compare prices on the market.