## Metadata

- Author: Jose A. Rodriguez-Serrano
- Full Title: Prototype-based models for real estate valuation
- URL: https://readwise.io/reader/document_raw_content/144751153

## Highlights

- One step further in sophistication, machine learning (ML) techniques, such as random forests [13] or neural networks [14], have attracted the attention of researchers in real estate valuation [6, 15]. Their advantages include their capability to deal with non-linear relationships, which normally translates to superior performance. Importantly for real estate research, some models like decision trees, random forests or neural networks can accept location variables such as longitude and latitude without any further processing [3]. (View Highlight)
- However, there is a prevalent consensus that while these systems excel in prediction capabilities, they often fall short in providing valuable insights due to their inherent lack of interpretability (View Highlight)
- some works in real estate valuation have successfully combined machine learning models with explainable AI (XAI) techniques [12, 16]. These techniques complement machine learning models by extracting metrics such as relevance scores for each variable. Still, an issue is that these techniques are applied ex post and remain detached from the model itself. (View Highlight)
- Furthermore, one additional limitation of both OLS and ML models is that they ignore a fundamental cue in valuation theory: direct comparison, perhaps the most common and intuitive way in which humans appraise real estate. In direct comparison, “the value of the property being appraised […] is assumed to relate closely to the selling prices of selling properties within the same market area”, as put by [1], who name this concept the comparables method. The same idea has also been dubbed direct capital comparison (DCC) by [17]. In order to unify the terminology, in the following we refer to this concept as direct comparison. (View Highlight)
- Yet it can be clearly argued that OLS and, more broadly, machine learning approaches, do not implement explicitly the idea of direct comparison, because these models establish direct mappings from input variables to prices. In other words, most of these predictive models do not involve similarity computations between the input property and the nearby or similar properties. (View Highlight)
- There are notable exceptions to the previous statement, such as K-Nearest Neighbors (KNN) [18] and its variants. These are designed to offer predictions grounded in such comparative analyses and have exhibited good performance in the domain of real estate price prediction [19, 20]. One technical drawback of KNN is that it is a non-parametric density estimation method; as such it cannot be adjusted to minimize a loss metric. The only mechanism to increase the quality of the estimation is by increasing K, which renders the method inefficient. (View Highlight)
- it is clear that none of the discussed approaches simultaneously complies with all of these characteristics: (i) capability to model complex and non-linear relationships, (ii) capability to optimize a predictive objective function, (iii) some degree of explainability, and (iv) explicit implementation of the notion of direct comparison. (View Highlight)
- The method is based on a methodology known as “prototype-based learning” [21] which, to the author’s knowledge, has not been previously applied for real estate valuation. (View Highlight)
- Prototype-based learning is a predictive model that: • can learn non-linear relations between the dependent and input variables, • is a supervised learning method that can be trained to minimize any given loss function (unlike KNN), • includes explainability as part of the model, and • explicitly implements the notion of direct comparison: the model predicts the price of a property by contrasting the property of interest to a set of (automatically inferred) reference properties. (View Highlight)
- The model works by minimizing a prediction error, based on comparing the variables of a property with the distributions of a set of references, and then optimizing these references. Therefore, as a by-product, the model produces a segmentation of the data into groups (or areas) with differentiated characteristics, similar to sub-market models [22]. (View Highlight)
- We study prototype-based models using a public dataset of real estate prices in Seattle. Our experiments conclude that a) prototype-based learning successfully distills the prototypes as theoretically expected, b) the method outperforms other strategies for finding prototypes, c) the prediction error observed is lower compared to other machine learning models, and d) the model can incorporate location-based information in an effective way. (View Highlight)
- The term prototype-based models [21] encompasses a family of approaches rather than a singular methodology, and within this family we focus on the subset of supervised learning approaches. These make predictions by leveraging latent “prototypes”, which refer to a reduced set of data samples which summarize most of the patterns occurring in the whole dataset. Prototypes can correspond to actual data points, or can consist of calculated variables and not strictly correspond to any real observed datum. Additionally, when the space is segmented based on proximity to prototypes, these models facilitate the identification of distinct regions or clusters within the data space, effectively acting as “cluster-aware supervised learning algorithms” (View Highlight)
- The proposed model requires a dataset of real estate properties with N samples, each with D independent or (in the following) input variables xnd (thus n ∈ {1, … ,N} (View Highlight)
- and d ∈ {1, … ,D}) describing characteristics of the real estate properties and a dependent or output variable yn indicating the observed price. By price we mean either the direct price, or any transformation of the price, such as the price normalized by a constant or non-constant factor (e.g. price per square foot), or the logarithm of the price, just to mention some commonly used options. Our model is compatible with all these definitions. (View Highlight)
- The main design principle of our model is that it should implement the notion of direct comparison: if we observe a new property with input variables x, we should be able to infer its price by comparing it to a number of “references”, denoted as prototypes, and interpolating from the price of similar references. (View Highlight)
- We define a set of M prototypes, which represent a set of properties with similar input variables and with an attributed price. Formally, each reference consists of: 1. A prototype distribution, which models the probability distribution fm(x) of properties within group m. In the simplest case, we take a multivariate Normal distribution $f_m(x) = \mathcal{N}(x \mid p_m, s_m)$ (1), where pm = (pm1, … , pmD) specifies a vector of means and sm refers to the vector of standard deviations (assuming diagonal covariance). Because pm represents the mode of fm, it indicates the typical values of the input variables for each prototype. 2. A prototype value vm, which models a price attributed to that group of properties. To compare a property x to each of the M references, a principled method [40] is determining the posterior probability that fm has generated the sample xn, i.e. (View Highlight)
- This comparison should yield, ideally, a small set of properties with non-negligible values of γm(x) which represent the most similar references. Then the predicted value is a weighted mean of the references, where the weighting coefficient is precisely the value of γ (View Highlight)
- As we will verify empirically later, it is common that most values of γm(x) are negligible and only a few terms contribute to the sum, and therefore Eq. 3 can be understood as an “interpolation” of the attributed values of the most similar references. (View Highlight)
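
The prediction step described in the highlights above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: Eqs. 2 and 3 are not reproduced in these highlights, so the sketch assumes that γm(x) is the normalized prototype density with equal prior weight per prototype, and that the prediction is the γ-weighted mean of the values vm. The function names and the toy numbers are made up for the example.

```python
import numpy as np

def log_gaussian_diag(x, p, s):
    """Log-density of diagonal Normals N(x | p_m, s_m) for every prototype.

    x: (D,) input; p: (M, D) means; s: (M, D) standard deviations.
    Returns an array of shape (M,).
    """
    z = (x - p) / s
    return (-0.5 * np.sum(z**2, axis=1)
            - np.sum(np.log(s), axis=1)
            - 0.5 * p.shape[1] * np.log(2.0 * np.pi))

def responsibilities(x, p, s):
    """gamma_m(x), assuming equal prior weight for each prototype."""
    lam = log_gaussian_diag(x, p, s)
    lam = lam - lam.max()          # shift for numerical stability
    w = np.exp(lam)
    return w / w.sum()

def predict(x, p, s, v):
    """Weighted mean of the prototype values v_m with weights gamma_m(x)."""
    return float(responsibilities(x, p, s) @ v)

# Toy example (made-up numbers): three prototypes in a 2-D input space.
p = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
s = np.full_like(p, 0.5)
v = np.array([0.2, 0.8, 0.5])
x = np.array([0.1, 0.0])

gamma = responsibilities(x, p, s)
y_hat = predict(x, p, s, v)
```

In this toy setting almost all of the weight falls on the first prototype, so the prediction stays close to its value of 0.2, which is the "interpolation between the most similar references" behaviour the highlight describes.
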
- Note that the proposed model is completely determined by pmd and vm (and also smd in case these are optimized). We jointly denote these parameters as θ. Their optimal values will be learned from the data using the following loss function: $C = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + \eta R(\theta)$ (4). The first term is a mean squared error (MSE), which measures the fitting error of the model with respect to the data, and R is a regularization term, common in ML tasks to penalize model complexity or to impose reasonable restrictions (View Highlight)
- $R(\theta) = \frac{1}{M} \sum_{m} \min_{n} \lVert p_m - x_n \rVert$ (5). This term averages the distance between each prototype and its closest data point, similar to [33], therefore adding a penalty if prototypes pm are far from real properties x (here, ∥ · ∥ stands for the Euclidean norm). (View Highlight)
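
As a rough illustration of Eqs. 4 and 5, the loss can be evaluated with NumPy as follows. The function names and the toy arrays are hypothetical; the sketch only evaluates the objective for given predictions and does not perform the gradient-based optimization used in the paper.

```python
import numpy as np

def regularizer(p, X):
    """R(theta): mean distance from each prototype to its nearest data point."""
    # Pairwise Euclidean distances between prototypes and samples, shape (M, N).
    d = np.linalg.norm(p[:, None, :] - X[None, :, :], axis=2)
    return d.min(axis=1).mean()

def loss(y, y_hat, p, X, eta):
    """C = MSE(y, y_hat) + eta * R(theta)."""
    mse = np.mean((y - y_hat) ** 2)
    return mse + eta * regularizer(p, X)

# Toy data (made-up numbers).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
p = np.array([[0.0, 0.0], [3.0, 4.0]])   # the second prototype is far from the data
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 2.0])

r = regularizer(p, X)
c = loss(y, y_hat, p, X, eta=0.1)
```

Here the first prototype coincides with a data point and contributes nothing to R, while the second is penalized by its distance to the nearest sample; with η = 0 the objective reduces to the plain MSE.
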
- The parameters pmd, vm of the model have a natural interpretation. Specifically, as pmd is the mean of the distribution fm, it can be interpreted as the variables of a representative property, and vm as its representative price. Therefore, not only will the model be able to predict property prices, it also produces a list of M prototypical properties and suggests values for their expected variables and price. (View Highlight)
- To illustrate the model capabilities, for now we select a simple example on a controlled setting (experiments on real data are reported in the next section). Fig. 1 shows a scatterplot corresponding to a small dataset with two input variables. The locations of the circles on the horizontal and vertical axes represent, respectively, the input variables xn1 and xn2, and the color indicates the value of the output variable y (yellow for y = 0.0 and blue for y = 1.0) (View Highlight)
- In this case, we train the prototype-based model with M = 20, and using a loss function with η = 0. In this setting, the parameters can be directly translated to the plot: pm1 and pm2 represent the location of the prototype means on the horizontal and vertical axes. The color indicates the prototype values vm (black for vm = 0 and white for vm = 1). The color scale is continuous, although perhaps that is not perceivable as most prototype values are close to vm = 0 or vm = 1. (View Highlight)
- By visual inspection, it is clear in this example that the locations pmd of the prototypes span the different “clusters” in the data and the values vm match the values of yn in each area. As a conclusion, we have quantitative evidence in a controlled example that the obtained prototypes can be interpreted as representative property characteristics and their price. (View Highlight)
- To shed light on how the model implements the intuition of direct comparison, consider the prediction for a given sample x. To apply Eq. 3, first we need to determine the values of γ in Eq. 2. Continuing with our example, in the case of two variables, using a bivariate Normal distribution for fm, we have: (View Highlight)
- Define λm ≡ log fm(x); then Eq. 2 becomes $\gamma_m(x) = \frac{e^{\lambda_m}}{\sum_k e^{\lambda_k}}$ (7), sometimes known as the softmax function. As we can verify from Eq. 6, the smaller the Euclidean distance from (x1, x2) to (pm1, pm2), the higher the value of γm and the more the value vm contributes to the prediction. Crucially, this is precisely the expected behavior in a comparison-based model. (View Highlight)
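
The link between the softmax form and the Euclidean distance can be made explicit with a short derivation. It assumes, as in the experiments reported later in these highlights, a single constant σ shared by all prototypes and all dimensions; with prototype-specific scales the normalization terms no longer cancel.

```latex
% Log-density of an isotropic Normal with mean p_m and standard deviation \sigma:
\lambda_m = \log f_m(x)
          = -\frac{\lVert x - p_m \rVert^2}{2\sigma^2} - D \log \sigma - \frac{D}{2}\log(2\pi)

% The last two terms do not depend on m, so they cancel in the softmax:
\gamma_m(x) = \frac{e^{\lambda_m}}{\sum_k e^{\lambda_k}}
            = \frac{\exp\!\left(-\lVert x - p_m \rVert^2 / (2\sigma^2)\right)}
                   {\sum_k \exp\!\left(-\lVert x - p_k \rVert^2 / (2\sigma^2)\right)}
```

Under this assumption γm(x) depends on x only through its squared distances to the prototype means, and σ controls how sharply the weight concentrates on the nearest prototypes.
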
- The prototype distribution fm in Eq. 1 takes as input a property x. However, we note that our model is compatible with approaches that first compute a so-called embedding h = g(x), where g denotes the result of a neural network. (View Highlight)
- In such a case, the parameters of the neural network g can be pre-trained, or jointly optimized during the optimization of the prototype-based model. Technically, one needs to replace fm(x) by fm(h), and this turns the loss function in Eq. 4 into an implicit function of g and its weights. This change is straightforward to apply in the aforementioned automatic differentiation tools. (View Highlight)
- We use data from a record of properties in King County (Seattle), which includes property prices and other variables collected from May 2014 to May 2015. The dataset contains information about 21,613 properties. Table 2 summarizes the variables available in the dataset that will be relevant for the analyses below, together with some typical statistics. Fig. 2 displays a map of the location of the properties, where the color indicates the price per square foot. (View Highlight)
- The first experiment will evaluate the model considering exclusively location and price per square foot. There are several reasons for this choice. First, it is widely acknowledged that “location is the primary factor in determining market prices” [42] and that this primacy arises from its multifaceted influence on factors such as proximity to economic opportunities, appeal of certain living areas or transportation accessibility. (View Highlight)
- Therefore, we simulate a model that predicts real estate price solely based on location. Second, when the input variables are location, the prototypes pm of our model can be directly interpreted as a list of typical locations, and the values as typical prices in these locations. While deliberately simplistic, this is still very relevant information for practitioners in real estate or for those seeking to obtain a geographical segmentation of the market and sub-market analyses [22]. (View Highlight)
- To perform model training and evaluation, an initial step involves a linear partition of the dataset into training and testing subsets (with a training/test distribution of 60%/40%), structured in accordance with chronological order (given by the date column). This approach emulates real-world conditions where available information is time-bound, and the goal is to predict property values beyond a specific temporal point. This method effectively mitigates the potential leakage of temporal information into the predictive models. Notably, within the training subset, 20% of the data is reserved for parameter validation, ensuring that model parameters are optimized effectively. (View Highlight)
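
A chronological 60%/40% split with a validation hold-out might look like the sketch below. The highlight does not say how the 20% validation subset is chosen inside the training period; taking the most recent 20% of the training rows is an assumption made here, and the function name and example dates are hypothetical.

```python
import numpy as np

def chronological_split(dates, test_frac=0.4, val_frac=0.2):
    """Return (train, validation, test) index arrays ordered by date.

    The latest `test_frac` of the rows form the test set. Of the remaining
    rows, the latest `val_frac` are held out for validation (an assumption;
    the source does not specify how the validation rows are selected).
    """
    order = np.argsort(dates, kind="stable")
    n = len(order)
    n_trainval = n - int(round(n * test_frac))
    n_val = int(round(n_trainval * val_frac))
    train_idx = order[: n_trainval - n_val]
    val_idx = order[n_trainval - n_val : n_trainval]
    test_idx = order[n_trainval:]
    return train_idx, val_idx, test_idx

# Ten made-up sale dates in shuffled order.
days = np.arange("2014-05-01", "2014-05-11", dtype="datetime64[D]")
dates = days[np.random.default_rng(0).permutation(10)]

train_idx, val_idx, test_idx = chronological_split(dates)
```

Every training date precedes every validation date, and every validation date precedes every test date, which is what prevents future information from leaking into model fitting or hyperparameter selection.
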
- For the analysis, we take latitude and longitude as the input variables, and construct an output variable price per square foot as the quotient of the available columns price and square feet. In the data pre-processing stage, each input variable is standardized (to zero mean and unit variance). The output variable y (price per square foot) is scaled by a factor of 1/400, with the objective of rescaling the values of y to fall within the approximate range of (0.0, 1.0), which makes numerical optimization more favorable. (View Highlight)
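
The pre-processing described above amounts to a few array operations. In this sketch the standardization statistics are estimated on the training rows only and then reused for other subsets; that is a common convention and an assumption here, since the highlight does not state which rows were used. The coordinates and prices are made-up values.

```python
import numpy as np

def fit_standardizer(X_train):
    """Column-wise mean and standard deviation of the training inputs."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X, mean, std):
    """Rescale each input column to zero mean and unit variance."""
    return (X - mean) / std

def scaled_price_per_sqft(price, sqft, scale=400.0):
    """Output variable: price per square foot, divided by 400."""
    return (price / sqft) / scale

# Made-up latitude/longitude pairs and one made-up sale.
X_train = np.array([[47.5, -122.3], [47.6, -122.2], [47.7, -122.1]])
mean, std = fit_standardizer(X_train)
Z_train = standardize(X_train, mean, std)

y = scaled_price_per_sqft(np.array([400000.0]), np.array([2000.0]))
```

A property sold at 200 dollars per square foot maps to y = 0.5, so multiplying predictions by 400 recovers the original scale when reporting errors.
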
- To build our model, we use M = 50 prototypes. The model is implemented and optimized using the keras library, with an Adam optimizer with learning rate lr, batch size b and number of epochs E. To these hyperparameters, we add the regularization coefficient η of Eq. 4 and also use a constant preset σ for the standard deviations in Eq. 1. All these hyperparameters are systematically determined using a standard grid search strategy on the validation set. We highlight that we use a constant value of σ for both coordinates and for all prototypes, so that the computation of importances γ of Eq. 2 is explicitly a function of the Euclidean distance between each property and the prototype mean. (View Highlight)
- Our model is subjected to comparative evaluation against two alternatives. First, we conduct a comparison with a nearest neighbor algorithm using randomly selected neighbors and M=50 (in order to match the number of prototypes). This comparison offers insights into the efficacy of the prototypes in guiding the predictive process. Second, we compare with an alternative way of discovering prototypes, wherein we use a clustering algorithm. Clustering stands as an alternative for identifying groups of properties with similar characteristics and valuations. Therefore, we apply a K-Means clustering algorithm [40] with K = 50 and take its means as the prototypes pm, and the average price per square foot in each cluster as the vm. For all models, we measure the quality of fit with the root mean squared error (RMSE), in the original y-scale (before normalization). To mitigate the impact of sources of variability such as random seeds used in parameter initialization, we repeat the experiment a total of 10 times, and report the RMSE averaged over the 10 runs. (View Highlight)
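
The clustering baseline and the error metric can be sketched as follows. This is an illustrative stand-in rather than the setup used in the paper: it uses a bare-bones Lloyd's iteration instead of a library K-Means, the function names are hypothetical, and the data are made up. The cluster means play the role of pm and the per-cluster average of y plays the role of vm.

```python
import numpy as np

def kmeans_prototypes(X, y, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means; returns (cluster means, mean y per cluster)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:      # keep the old centre if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment with the converged centres.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    values = np.array([y[labels == j].mean() if np.any(labels == j) else np.nan
                       for j in range(k)])
    return centers, values

def rmse_original_scale(y_scaled, y_hat_scaled, scale=400.0):
    """RMSE after undoing the 1/400 scaling of the output variable."""
    return float(np.sqrt(np.mean((scale * (y_scaled - y_hat_scaled)) ** 2)))

# Two well-separated groups of made-up properties.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y = np.array([0.2, 0.4, 0.8, 1.0])
centers, values = kmeans_prototypes(X, y, k=2)

err = rmse_original_scale(np.array([0.5, 0.5]), np.array([0.25, 0.75]))
```

Unlike the prototype-based model, this baseline places the prototypes without looking at the prediction error, which is the contrast the comparison is designed to expose.
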
- Now we turn to discussing the interpretation of the prototypes pm obtained by the model. Because our input variables are longitude and latitude, the model parameters pm and vm can be interpreted directly as the prototype locations and prototype values. Fig. 3 displays the locations pm overlaid on the original data. (View Highlight)
- Through our analysis, it becomes evident that the prototypes pm exhibit a wide distribution that effectively spans the different regions of the entire dataset. Furthermore, as depicted in Figure 4.3, a closer examination is directed towards a specific prototype, shedding light on properties in close proximity to it or, in other words, all the properties that can be succinctly encapsulated or “summarized” by the respective prototype. (View Highlight)
- We verify by visual inspection that the values vm and the nearby prices are consistent. Therefore, we qualitatively verify that the behavior of the model mimics the process of valuation-based comparison that is similar to the human intuition behind pricing. We now study the effect of the number of prototypes. We repeat the previous protocol using different numbers of prototypes and compare the three different methods with M = 10, 20, 50, 100. Results are reported graphically in Fig. 5. (View Highlight)
- Apart from latitude and longitude, we consider number of bathrooms, square feet of living room, square feet of lot, number of floors, waterfront, view, square feet above, square feet of basement, square feet of living room in 2015, square feet of lot in 2015. Thus, the new model uses 12 input variables in total. (View Highlight)
- The experimental details are the same as in the last section, with one exception: because now we use several variables of different types, it is reasonable to expect that different variables need to be weighted differently for the importance computation of Eq. 2. (View Highlight)
- For completeness, we will report the results using a constant hyperparameter σ for all prototypes (equivalently to the last section), and also by explicitly learning the σmd’s as part of the optimization process. (View Highlight)
- In this case, we found the random neighbor initialization of the KNN method yields a higher variance between the 10 runs. Therefore, we also report the minimum and maximum RMSE across the 10 runs to make clear there is no overlap in the error metrics. (View Highlight)
- There are some noteworthy observations when comparing to the results of the previous section, which leveraged location variables only. First, the RMSE of the prototype-based method (using constant σ) has improved (68.1 in Table 4 vs. 76.6 in Table 3). This underscores the positive impact of adding more variables in the prototype-based model. This impact is even larger when one considers the improvement brought by learning the parameters σm (down to 62.0). (View Highlight)
- We can still interpret the means pm as the prototype modes – with the clarification that now these do not only specify the typical locations, but also the typical values of each variable considered. As an illustration, Fig. 6 displays five scatterplots, covering the 12 input variables, and also the projections of the prototypes in the corresponding subspace. Second, when inspecting the results for the other 2 methods, we observe an opposite trend: in both cases the RMSE has not improved but increased as more variables are integrated (for k-means, RMSE increases to 90.9 using the 12 input variables, up from a value of 81.7 with the model of the previous section; for KNN the increase is from 110 to 115). (View Highlight)
- A possible explanation for this phenomenon is as follows. When we used latitude and longitude as the only input variables, the proximity between samples (both KNN and k-Means are based on Euclidean distance) was a good proxy for the similarity in price, as argued previously. However, when additional variables are introduced, the Euclidean distance calculation between two data points becomes a composite measure, encompassing multiple components, some of which may not have a direct correlation with price differences. In essence, this expanded feature space introduces dimensions that are not necessarily indicative of variations in property prices. Consequently, similarity between variables does not translate to an effective similarity in price. The reason why our prototype-based model exhibits the opposite behavior can be attributed to the model’s ability to “learn” and adapt the optimal locations of the prototypes (and, especially, the scales of each dimension). By doing so, it effectively homes in on the specific variables and combinations thereof that are most relevant for accurate price estimation. (View Highlight)
- Another research question we could ask is “do the model predictions ˆy also approximate well the distribution of observed test prices y?” To that end, Fig. 7 displays the histogram of both test and estimated prices. For the case of M = 50 (the setting used in this section), indicated on the left, we see the distribution of estimated prices substantially overlaps that of real prices, albeit putting more mass in the center of the distribution and less in the extremes. However, this phenomenon is not surprising and, to some extent, expected, since prediction models can be interpreted as “smoothing the data” (and consequently reducing extreme observations). To verify this, if we increase M, as we use more prototypes we should expect less smoothing, which is clearly visible in the plot on the right for M = 100. (View Highlight)
- Now we compare the prototype-based model with regression trees, random forests and neural networks, all of which were discussed in Section 2. First, we discuss the comparison settings. For all methods, we still use identical train/validation/test splits as in the case of the prototype-based model. We select the best hyperparameters by evaluating the RMSE of the trained models in the validation set. Then, we report the RMSE of the best hyperparameters on the test set. Again, we repeat this process 10 times and report the average RMSE, in order to mitigate factors such as randomness in parameter initialization. (View Highlight)
- Finally, we qualitatively evaluate the extension of the model as a neural network introduced in Section 3.4. To that end, we use the network trained in the last section, remove the output layer and add a 2-unit layer. We denote the intermediate output of the previous layer as h. We transform the properties xn to embeddings hn by computing a forward pass and use these embeddings hn to train the prototype-based model. Because hn has two dimensions, we can visualize the prototype distribution in this space. The result is in Fig. 8. We visualize the prototype modes (red) on top of a sample of properties in the projected space h. While it may not be possible to assign an interpretable meaning to each projection component, we identify some structure, e.g. top left is lower prices, bottom right higher prices. Also, by inspecting prototypes we can still assign a meaning to some of them, as indicated in the figure, based on the value (such as “expensive apartments with no basement and good views”). This experiment is to demonstrate qualitatively the potential extensions, but it suggests interesting directions for application and research. (View Highlight)
- when using spatial variables only (latitude and longitude) the prototypes spread well across all the different physical regions, as depicted in Figs. 3 and 4.3. Inspection of the prototype values also indicates a reasonable segmentation of the space and seemingly well-approximated price estimates. (View Highlight)
- on top of this qualitative examination, the quantitative evaluation confirms that the prototypes and values obtained by the model, in different settings, are appropriate. The strongest argument for this is that the predictive errors obtained by the prototype-based model are smaller than those of (i) alternative ways of constructing the prototypes, as summarized in Table 3, and (ii) alternative machine learning methods, as summarized in Table 5. (View Highlight)
- For the model using the location variables only, R2 = 0.56, for the model using all the variables but constant σ, R2 = 0.64, and for the model using all the variables, R2 = 0.73. By inspecting the RSS values, the second model reduces the variance by about 20% compared to the first, and the third model reduces the variance by about 34% compared to the first. We see a big part of the variance is captured by the location variables, but adding more variables produces a significant reduction. According to this analysis, location is far from being the only factor that is significant. (View Highlight)
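
The quoted variance reductions can be checked against the RMSE values reported earlier in these highlights (76.6, 68.1 and 62.0). The check assumes that the RSS values the author inspected come from those same test-set errors, so that RSS is proportional to RMSE squared; this link is an inference, not something the highlights state.

```python
# RMSE values quoted in the highlights for the three prototype-based models.
rmse_location_only = 76.6      # latitude/longitude only
rmse_all_const_sigma = 68.1    # 12 variables, constant sigma
rmse_all_learned_sigma = 62.0  # 12 variables, learned sigma_m

def rss_reduction(rmse_new, rmse_ref):
    """Relative RSS reduction, using RSS proportional to RMSE**2 on the same test set."""
    return 1.0 - (rmse_new / rmse_ref) ** 2

red_second = rss_reduction(rmse_all_const_sigma, rmse_location_only)
red_third = rss_reduction(rmse_all_learned_sigma, rmse_location_only)

print(f"second vs first: {red_second:.1%}")
print(f"third vs first:  {red_third:.1%}")
```

Under that assumption the reductions come out at roughly 21% and 34%, in line with the "about 20%" and "about 34%" figures in the highlight.
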
- these ratios also show an apparent limitation of the model, as the fraction of variance explained by the best model is R2 = 0.73, indicating that there is a substantial portion of the variance not accounted for by the model, which can be attributed to unobserved variables or inherent irreducible noise within the dataset. This is not surprising given the complexity in the real estate market and is comparable to other studies [3]. Some studies have indeed tried to cover this gap by capturing variables such as the quality of the view from the window [12] using computer vision techniques. (View Highlight)
- In any case, this limitation points to the need for caution when utilizing the model for individual sample predictions, as it may not adequately capture the intricacies of a specific data point. However, it is essential to recognize the utility of the model when used for group predictions or to summarize a real estate dataset with a reduced set of prototypes. (View Highlight)
- Another aspect of the model is that it treats location in a general-purpose way. Usually, location is difficult to model. Some previous studies added variables to the model such as the distance to different employment centers [11]. Those studies were focused on inference and precisely studied the effects of proximity to those key locations on the price. However, when the goal is price prediction with a general-purpose methodology that is valid across different markets, it will be cumbersome to specify the key locations for each new study. The prototype-based model will capture the price changes in different areas regardless of the reason, without having to define specific variables. (View Highlight)
- Future lines of research include using more complex forms of the prototypes and values. In the present study, the prototypes are Normal distributions and the values are constant, and we could explore other distributions or non-constant values. Perhaps if we could perform variable selection within the model we would obtain an even higher degree of explainability. Finally, we also acknowledge that there could be some use cases where researchers or practitioners are limited to the use of linear models (View Highlight)
- In such a situation, we could investigate if we could use the prototype-based model as a “feature engineering” method, where for example the obtained values of γ (Eq. 2) are used as additional location-related variables in the linear models. (View Highlight)
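
One way this feature-engineering idea could look in practice is sketched below. It is purely illustrative of a direction the author leaves as future work: the prototype locations, σ, the extra covariate and the synthetic prices are all made up, the weights assume a constant σ, and the linear model is fitted with ordinary least squares via `np.linalg.lstsq`.

```python
import numpy as np

def gamma_features(X, p, sigma):
    """Softmax weights of each sample over the prototypes (constant sigma)."""
    d2 = ((X[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)   # shape (N, M)
    lam = -d2 / (2.0 * sigma**2)
    lam -= lam.max(axis=1, keepdims=True)                     # numerical stability
    w = np.exp(lam)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_loc = rng.normal(size=(200, 2))                 # standardized coordinates (synthetic)
p = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])
G = gamma_features(X_loc, p, sigma=0.5)

other = rng.normal(size=(200, 1))                 # some non-location covariate
v_true = np.array([0.2, 0.4, 0.6, 0.8, 0.5])
y = 2.0 * other[:, 0] + G @ v_true                # synthetic, noise-free prices

# Each row of G sums to 1, so the gamma columns already act as an intercept;
# adding a separate constant column would make the design matrix collinear.
design = np.hstack([other, G])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

The fitted coefficients on the γ columns can then be read as location-specific price levels inside an otherwise linear model.
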