Berlin Price Predictor

A basic multivariate linear regression model that predicts house and apartment sale prices in Berlin.

How I Built This

Caveat: this is my first ever attempt at building a linear regression model end to end. It's basic, but a good learning exercise!

Gather training data

I collected the training data with a simple scrape of Immobilien Scout, which yielded 252 samples. I selected two features, the number of rooms and the floor space in square metres, and used the listing price as the label.


import numpy as np

x_train = np.array([
    [
        4,
        152
    ],
    [
        3,
        66
    ],
    # ...
])


import numpy as np

y_train = np.array([
    1550000,
    499000,
    # ...
])

Run gradient descent to calculate model weights

I then ran batch gradient descent on the data. It took some tuning of the iteration count, learning rate and initial parameters to get it to converge.


import numpy as np

# Training data
from x import x_train

# Training Labels
from y import y_train

# Functions
from functions import compute_cost, compute_gradient, gradient_descent

# initialize parameters
initial_w = np.array([0.75376741, 0.00039133535])
initial_b = 0

# some gradient descent settings
iterations = 20000
alpha = 1.0e-8

# run gradient descent
w_final, b_final, J_hist = gradient_descent(
    x_train, y_train, initial_w, initial_b, compute_cost, compute_gradient, alpha, iterations)


import copy
import numpy as np
import math

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking 
    num_iters gradient steps with learning rate alpha

    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters  
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent

    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
      J_history (list) : Cost J recorded at each iteration (first 100000 only)
    """

    # A list to store the cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in

    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w, b)

        # Update parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(X, y, w, b))

        # Print the cost 10 times at even intervals, or at every iteration if num_iters < 10
        # if i % math.ceil(num_iters / 10) == 0:
        #     print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}   ")

    return w, b, J_history  # return final w,b and J history for graphing


def compute_cost(X, y, w, b):
    """
    compute cost
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter

    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b  # (n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2  # scalar
    cost = cost / (2 * m)  # scalar
    return cost


def compute_gradient(X, y, w, b):
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter

    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
    m, n = X.shape  # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_db, dj_dw
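
The two nested loops in compute_gradient can also be written as matrix operations. The following is a minimal sketch, not the code used to produce the results in this README; the name compute_gradient_vectorized is hypothetical.

```python
import numpy as np


def compute_gradient_vectorized(X, y, w, b):
    """Same gradient as compute_gradient, without Python loops (hypothetical helper)."""
    m = X.shape[0]
    err = X @ w + b - y      # (m,) prediction error for every example
    dj_dw = X.T @ err / m    # (n,) gradient of the cost w.r.t. the weights w
    dj_db = err.sum() / m    # scalar gradient of the cost w.r.t. the bias b
    return dj_db, dj_dw
```

It returns values in the same order as compute_gradient, so it could be passed to gradient_descent as gradient_function.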

Review Results

The results are in! These are the parameters the final model uses.

          
# bias and weights that were found
b,w found by gradient descent: 119.32,[ 201.47428015 6772.49617494]
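
To make those numbers concrete, here is a small sketch that plugs them into the linear model for the first sample row shown in x_train (4 rooms, 152 square metres). The helper name predict_price is hypothetical, and the parameters are copied from the output above.

```python
import numpy as np

# Parameters copied from the gradient descent output above
w = np.array([201.47428015, 6772.49617494])  # [per room, per square metre]
b = 119.32


def predict_price(rooms, square_metres):
    """Linear model prediction: w . x + b (hypothetical helper)."""
    return np.dot(w, np.array([rooms, square_metres])) + b


price = predict_price(4, 152)
print(f"Predicted price: {price:,.0f}")
```

For this input the floor-space term (152 x 6772.50) is far larger than the rooms term (4 x 201.47).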

Evaluate Model

I used the mean error percentage (the mean absolute percentage error) to evaluate the model's accuracy: on average, how far off were my estimated house prices when I ran the model on the existing training data?

          
# Mean Error Percentage
predictions = x_train.dot(w_final) + b_final
percentage_errors = 100 * np.abs((predictions - y_train) / y_train)
mean_error_percentage = np.mean(percentage_errors)
print(f"Mean Error Percentage: {mean_error_percentage:0.2f}%")


# mean error percentage is still quite high! I tried feature scaling 
# but it didn't help much, I think the data is just too noisy
Mean Error Percentage: 20.90%

Plot Data

I thought plotting the data might help me visualise how the model could be improved, so I plotted:

Each feature against the label
The cost function
The model's predictions compared with using the mean price as the prediction
A 2D contour plot

Optimize

From the results and plots above I could see that the model is less accurate for the top ~20% of house prices. I tried weighting the cost (1.5x for the bottom 80% of data points), but it didn't improve the model.
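
A weighted cost of that kind could look like the sketch below. The exact weighting code is not shown in this README, so the function name, the use of the 80th percentile of y as the cut-off, and the division by 2m (as in compute_cost) are all assumptions.

```python
import numpy as np


def compute_weighted_cost(X, y, w, b, low_weight=1.5, percentile=80):
    """Squared-error cost with extra weight on cheaper listings (hypothetical)."""
    m = X.shape[0]
    cutoff = np.percentile(y, percentile)             # price at the given percentile
    weights = np.where(y <= cutoff, low_weight, 1.0)  # e.g. 1.5x for the bottom 80%
    err = X @ w + b - y                               # (m,) prediction errors
    return np.sum(weights * err ** 2) / (2 * m)
```

For gradient descent to minimise this cost, the gradient function would need to apply the same weights to err.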

I also tried feature scaling, without success: it didn't produce better predictions than the original model.
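
For reference, one common form of feature scaling is z-score normalisation. The sketch below assumes that method (the text above does not say which one was tried), and the name zscore_normalize is hypothetical.

```python
import numpy as np


def zscore_normalize(X):
    """Scale each feature column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)      # per-feature mean
    sigma = X.std(axis=0)    # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma


# The two sample rows shown in x_train: [rooms, square metres]
X_sample = np.array([[4.0, 152.0], [3.0, 66.0]])
X_norm, mu, sigma = zscore_normalize(X_sample)
```

Any new input has to be scaled with the same mu and sigma before it is passed to the model.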

In hindsight, I could have collected more features to experiment with and explore how they relate to price. From the data collected, floor space appears to be far more influential on listing price than the number of rooms.