In the age of the internet, there is no dearth of information on any topic. But organizing that information in a structured manner and making it consumable is still a challenge. Knowledge Graphs (KGs) are one tool for storing information in a more structured way, capturing the relationships across data points and making the data easy to consume, both for humans and machines.
We experience the benefits of a KG during internet searches, when structured information gathered from a variety of sources is summarized by the search engine. For example, a keyword search for "New York" returns a variety of results related to land area, weather, demographics, and news reports, all of which are linked and structured on the same page. The results are similar even when you search for "NYC". This is a nice example of how a KG can visually represent the relationship between a query and indexed data from different sources, making the data easily available for a variety of tasks and consistent across related searches.
At Zillow, in addition to an abundance of user engagement data, we work with large amounts of home-related data in the form of listing images, listing descriptions, home attributes, neighborhood information, points of interest (POI) datasets, etc. We also work with several in-domain knowledge banks, such as curated blogs, real estate attribute definitions, and annotation guides that are relevant to the business. In order to create seamless experiences for our users and help them find their next home, it's critical that the business consume and structure this data correctly. Figure 1 is an illustrative knowledge graph representing the relationships among a small subset of possible home attributes.
Figure 1. A small sample of a real estate KG consisting of concepts for different home features, connected to each other through Parent/Child (i.e., Hypernym/Hyponym) and Synonym relationships.
The diagram in Figure 2 shows the functional view of a KG enriched by data from a variety of sources and supporting various use cases and product experiences. The Content Understanding Platform is Zillow's internal platform built to tap into the variety of structured and unstructured data sources available, extract information, and store it in a KG with the help of Human in the Loop (HITL) validation (i.e., a human validating and correcting model predictions). The platform is crucial in aggregating and normalizing information and acts as a bridge for KG creation and updates. It hosts a variety of image and text models and can make near real-time predictions for supported use cases. The data present in the KG powers a variety of applications, as shown below. We will now briefly touch upon how the KG helps to power these applications.
Figure 2. Applications of a Knowledge Graph (KG), together with data sources and models informing the KG.
Keyword search: The KG plays a critical role in understanding user intent during keyword search, normalizing query keywords to canonical concepts that can be used to consistently retrieve relevant listings, which are indexed with the same canonical concepts. On the indexing side, the KG enables standard indexing of listings, normalizing them in terms of the canonical concepts. A keyword search that leverages a knowledge graph in this way is referred to as a concept search:
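The query-side half of concept search can be sketched as a lookup from surface phrases to canonical KG concepts. The table below is hand-built and purely illustrative; in the real system the resolution happens against KG nodes, not a hard-coded dictionary.

```python
# Illustrative phrase -> canonical-concept table (not Zillow's actual data).
CANONICAL = {
    "pool": "pool",
    "swimming pool": "pool",
    "swimmingpool": "pool",
    "nyc": "new york",
    "new york": "new york",
}

def normalize_query(query: str) -> list[str]:
    """Map raw query text to the canonical KG concepts it mentions."""
    text = query.lower().strip()
    concepts = []
    for phrase, concept in CANONICAL.items():
        if phrase in text and concept not in concepts:
            concepts.append(concept)
    return concepts
```

Both the query and the listing index are normalized through the same table, which is what makes retrieval consistent for "NYC" and "New York".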
Images: Images attached to listings are also a valuable source of information, because we can extract details about a house and its surroundings that may not be clearly conveyed in the description. A typical extraction runs image models to classify the scene of an image, assess image quality, and check image-attribute alignment. (Read this post and this post to get a general understanding of Zillow's image processing techniques.)
Historical user queries: Zillow apps now also support natural language queries as part of the search box experience (e.g., "homes near me with 2 beds, 2 baths, and a fireplace"). The Zillow website also supports keyword searches along with standard filters. We also review external data such as SEO queries coming from search engines like Google. These are great sources of information that help us understand user preferences, important real estate-related attributes, and the different ways of expressing them. We extensively review these data sources and ingest them as part of our Knowledge Graph.
Defining Ontology: The next step in creating a KG typically involves normalizing the various data sources to standard ontologies and storing them in a standard format for easy consumption and inference. In simple terms, an ontology refers to the standard entities, classes, and relationships defined in a KG, along with guidelines on how to interact with them. Here are some sample node types and relation types defined for the real estate domain:
Figure 5: Example of ontology nodes for Home Concepts and Base Forms, along with a set of relationship edges.
In Figure 5, we show two different node types used to represent information:
Home Concepts: Concepts such as pool, architecture style, and amenities that are present in or related to a home. They can also be related to other home concepts through parent, child, or synonym relationships. For example, the concept "pool" is synonymous with "swimming pool" and is a parent of the concept "heated pool". Such metadata helps us better understand the various concepts and use them according to the use case.
Base forms: These are any entities we observe across our datasets that we want to include in our knowledge graph. They are the basic building blocks of home concepts and are aggregated into a unique concept on an as-needed basis.
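The two node types and their edges can be sketched as simple records. The class and field names here are illustrative, not Zillow's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BaseForm:
    text: str  # surface form observed in the data, e.g. "swimmingpool"

@dataclass
class HomeConcept:
    name: str
    synonyms: list = field(default_factory=list)   # SYNONYM edges
    parents: list = field(default_factory=list)    # PARENT (hypernym) edges
    children: list = field(default_factory=list)   # CHILD (hyponym) edges

# The "pool" example from the text: synonymous with "swimming pool",
# parent of "heated pool".
pool = HomeConcept("pool", synonyms=["swimming pool"], children=["heated pool"])
heated_pool = HomeConcept("heated pool", parents=["pool"])
```

Storing the relationship in both directions (children on the parent, parents on the child) makes traversal cheap in either direction, at the cost of keeping the two lists consistent.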
Normalization and entity disambiguation: One critical aspect of ingesting data across various sources is the different forms in which the same entity can appear. For example, the home concept "pool" can appear as "pool", "swimming pool", "swimmingpool", or "has_pool: True" across various text data sources. We know that they all refer to the same standard concept of pool, so they need to be normalized and stored correctly in the Knowledge Graph. There are a variety of ways of handling this task, but we will talk about two broad classes of methods we use:
We have trained BERT-based models in-house that can help classify relationship types given a pair of nodes or generate candidates. Figure 6 depicts the flow of the link discovery process:
The first step in connecting nodes is candidate generation, which identifies a limited number of nodes that could be connected to a given node. For this process, we use an in-domain SBERT model to generate candidate embeddings and select nearest-neighbor candidates to send to the final pairwise classification model. This step reduces the computation cost of comparing all pairs in the candidate bank.
The second step is pairwise classification for a given relationship. The pairwise classification model takes the base form and each candidate selected by SBERT, one at a time, and predicts whether a given type of relationship holds.
We also include a Human in the Loop (HITL) step, in which a human annotator verifies links when needed to ensure high accuracy.
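The two-stage flow above can be sketched end to end. A bag-of-characters cosine similarity stands in for the in-domain SBERT embeddings, and a similarity threshold stands in for the BERT pairwise classifier; both are placeholders so the example is self-contained, not the production models.

```python
import math
from collections import Counter

def embed(text):
    """Toy character-count embedding (stand-in for SBERT)."""
    return Counter(text.lower())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def candidates(base_form, nodes, k=2):
    """Step 1: nearest-neighbor candidate generation."""
    q = embed(base_form)
    return sorted(nodes, key=lambda n: cosine(q, embed(n)), reverse=True)[:k]

def is_synonym(base_form, candidate):
    """Step 2 placeholder: pairwise relationship classification."""
    return cosine(embed(base_form), embed(candidate)) > 0.9

def discover_links(base_form, nodes):
    """Candidate generation followed by pairwise classification."""
    return [c for c in candidates(base_form, nodes) if is_synonym(base_form, c)]
```

The point of the structure is the same as in production: the cheap first stage prunes the candidate bank so the expensive second stage only scores a handful of pairs.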
The KG has been a great tool for aggregating data across different sources, standardizing it, and powering many new experiences and products. It enabled us to launch the first natural language search experience in the real estate domain, with lifts in customer experience measured through A/B tests. We also observed a significant lift in the number of properties shown for keyword searches, a better understanding of user queries, and better relevance scores for the properties shown to users. The standardization also improved our understanding of users and our search and ranking algorithms. The initial success has been encouraging and paves the way for future extensions of the KG, as well as delighting our customers through new products and services.
In summary, a knowledge graph can help Zillow:
Understand and enrich structured and unstructured data from a variety of sources and normalize them to the same vocabulary
Represent relevant relationships, dependencies, and summaries for a variety of use cases
Create seamless experiences for our users by better understanding their needs and creating relevant product features
As shown below, we solve several important problems with the help of Knowledge Graphs:
Search Query Autocomplete: The KG provides a variety of concepts that can be presented to users in order to provide them with the right options as they search for homes. The KG offers suggestions related to various home concepts, such as amenities, specific architectures, or locations:
Query Understanding: The KG helps in understanding various components of a user’s natural language query and enables concept search for items they are looking for in the query:
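A minimal sketch of query understanding: structured filters are pulled out with patterns, and remaining phrases are matched against KG concepts. The patterns and the concept list are invented for illustration; the real system resolves phrases against the full KG.

```python
import re

# Tiny stand-in for the KG concept vocabulary.
KG_CONCEPTS = ("fireplace", "pool", "garage")

def parse_query(query: str) -> dict:
    """Split a natural-language query into structured filters and KG concepts."""
    q = query.lower()
    parsed = {"filters": {}, "concepts": []}
    beds = re.search(r"(\d+)\s*beds?", q)
    baths = re.search(r"(\d+)\s*baths?", q)
    if beds:
        parsed["filters"]["beds"] = int(beds.group(1))
    if baths:
        parsed["filters"]["baths"] = int(baths.group(1))
    for concept in KG_CONCEPTS:
        if concept in q:
            parsed["concepts"].append(concept)
    return parsed
```

The structured filters feed the standard filter UI, while the extracted concepts feed concept search over the indexed listings.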
User Profile: A user profile in search and recommendation engines is a personalized dataset that captures a user's preferences, behavior, and attributes in order to tailor the experience and improve the relevance of search results. The KG helps create new user-profile features based on a user's explicit searches, or their interactions with listings that are relevant to specific KG nodes they might be interested in. A more accurate user profile helps improve personalized recommendations:
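One way to derive such profile features, sketched under the assumption that each listing is already indexed with KG concepts (the field names are hypothetical):

```python
from collections import Counter

def build_profile(interactions, listing_concepts):
    """Aggregate a user's listing interactions into KG-concept affinity counts.

    interactions: list of listing ids the user engaged with
    listing_concepts: mapping of listing id -> KG concepts indexed for it
    """
    profile = Counter()
    for listing_id in interactions:
        profile.update(listing_concepts.get(listing_id, []))
    return profile
```

The resulting concept counts can be normalized into affinity scores and fed to ranking and recommendation models.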
Creating a Real Estate Knowledge Graph
At Zillow, we work with data sources that are both structured and unstructured. Here are some examples of each type:
Structured data: Structured data includes data syndicated from MLSs or agents about a property and data sources about the location and region of the property. We also have structured data representing user interactions on the website, user searches, search sessions, and user/agent profiles.
Unstructured data: Unstructured data comes in a variety of formats, such as text in a property listing description, images of the property, 3D/floor plans, documents, scanned images related to the property, etc.
Knowledge extraction: This process deals with aggregating information across different data sources and getting it ready for ingestion into the KG. In our case, these are some of the data sources listed above. We use both statistical models and the latest transformer-based models to extract this information. (Read this blog post and this blog post for more information on keyphrase extraction.) We will briefly touch on the processes through which we extract this data:
• Listing description: We apply various NLP/information extraction techniques to extract interesting home-related information from the natural language description of a home. For example, we extract important home attributes from the listing description, as shown below:
Figure 3: Example of extracting important home-related attributes from a listing description.
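A toy version of this extraction step, using a fixed vocabulary and word-boundary matching. The production pipeline uses transformer-based keyphrase extraction models; this regex pass is only illustrative, and the vocabulary is invented.

```python
import re

# Illustrative attribute vocabulary (not Zillow's actual list).
VOCAB = ["heated pool", "pool", "granite countertops", "fenced backyard", "fireplace"]

def extract_attributes(description: str) -> list[str]:
    """Return the vocabulary phrases that appear in a listing description."""
    text = description.lower()
    found = []
    for phrase in VOCAB:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.append(phrase)
    return found
```

Extracted phrases become base forms, which are later normalized to canonical home concepts.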
• MLS structured data: We capture a lot of structured data from agent input and MLS feeds, which is stored directly in our data stores. This data usually has some structure and is easier to consume than unstructured data. We also capture user interactions on the website and natural language queries, which are another great source of information and learning.
• Capturing a static list of various forms: Under this method, we keep a list of the observed forms for a concept as a synonym or mapping list and use it to disambiguate a base form. This approach is fast and offers better interpretability and quality control, but it fails on out-of-vocabulary words, and the list must be updated constantly to maintain high coverage. At Zillow, we maintain such a knowledge base as part of the Knowledge Graph for easy disambiguation and linking. It also acts as a data source for the next method discussed.
• ML models for disambiguation: In this method, ML models link a new base form to existing nodes in the knowledge graph. These could be image models that link an image to a concept class, text or graph models that identify synonyms/same concepts, or other open-source KGs used for entity disambiguation and association to an existing node. At Zillow, we have trained BERT-based models to identify synonyms and help with entity disambiguation: we send a pair of phrases to the model, and it classifies whether the pair are synonyms. This model not only helps with online inference but can also generate new candidates to expand the static list discussed above.
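The two methods compose naturally into a two-tier lookup: try the static list first, then fall back to a model for out-of-vocabulary forms. The synonym map is illustrative, and `model_predict` is a stub standing in for the BERT pairwise classifier.

```python
# Illustrative static synonym map (first tier).
SYNONYM_MAP = {
    "swimming pool": "pool",
    "swimmingpool": "pool",
    "has_pool: true": "pool",
}

def model_predict(base_form):
    """Stub for the BERT-based disambiguation model (second tier)."""
    return "pool" if "pool" in base_form else None

def disambiguate(base_form):
    """Resolve a base form to a canonical concept, or None if unknown."""
    form = base_form.lower().strip()
    if form in SYNONYM_MAP:          # fast path: static list
        return SYNONYM_MAP[form]
    return model_predict(form)       # fallback: ML disambiguation
```

Forms resolved by the model can be written back into the static list, which is how the model expands the knowledge base over time.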
Connecting nodes in the Knowledge Graph: A major step in KG creation is connecting the nodes with relevant links to make the KG more informative and help find newer insights. A part of the linking process can be done with the help of:
• Structured data that tells us which links exist across nodes. For example, listing data may capture "dishwasher" as one of the kitchen amenities, so we can connect those two nodes for that listing. However, this data is normally incomplete and does not cover all the links we may be interested in.
• For such missing links, we need to mine relationships from the data and connect nodes appropriately. Methods for finding missing links in a KG depend on both the node type and the relation type, ranging from heuristic or rule-based approaches to more complex methods requiring language models or graph-based models.
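The first, structured-data source of links can be sketched as a rule that turns fields of a listing record into edge triples. The field names and relation labels are invented for illustration.

```python
def edges_from_listing(listing):
    """Yield (subject, relation, object) triples from one structured record.

    Field names ("kitchen_amenities", "has_pool") are hypothetical.
    """
    for amenity in listing.get("kitchen_amenities", []):
        yield (amenity, "AMENITY_OF", "kitchen")
    if listing.get("has_pool"):
        yield ("pool", "FEATURE_OF", listing["listing_id"])
```

Rules like these are cheap and precise where the structured data exists; the ML-based link discovery described next fills in the links the rules cannot see.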
We will quickly touch on one such method we use for home concepts, which are normally in text form. For one of the use cases, we care mostly about detecting synonym, parent, and child links:
• If a user is looking for the concept large backyard, it could be a good idea to show all listings that mention base forms such as big backyard or huge backyard. In addition, we can show the user homes with concepts such as covered large backyard or fenced huge backyard, which are children of the concept backyard. This not only improves listing discoverability, but also lets us create relevant touch points for the user to refine their search preferences and look for more nuanced concepts. This knowledge can also help us relax queries in areas with less inventory.
• The same process is followed for synonyms, parents, and other relationships.
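The expansion described above can be sketched as a traversal of a concept's synonym and child edges before retrieval. The graph fragment is hand-built for illustration.

```python
# Illustrative KG fragment for the "large backyard" example.
GRAPH = {
    "large backyard": {
        "synonyms": ["big backyard", "huge backyard"],
        "children": ["covered large backyard", "fenced huge backyard"],
    },
}

def expand_concept(concept):
    """Expand a query concept with its synonyms and children for retrieval."""
    node = GRAPH.get(concept, {})
    return [concept] + node.get("synonyms", []) + node.get("children", [])
```

Listings indexed under any of the expanded terms become retrievable for the original query; dropping the children from the expansion is one way to relax a query in low-inventory areas.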
KG updates and versioning: As you may have guessed by now, creating and maintaining the KG is a dynamic process: there is a constant inflow of new information and updates across data sources. This makes maintaining and updating the KG a critical and challenging task if we are to return correct information at any point in time. Typical updates range from common changes to listing descriptions, images, and property structure data, to rarer updates to the definitions of key concepts, parent-child relationships, and new node creation. To handle these changes, we need a KG update workflow and a versioning methodology for easy tracking and analysis in the future. Two broad classes of updates we see at Zillow, and the complexities associated with them, are:
• Point-wise updates: These are local changes to the KG with a limited scope of impact, normally confined to small sets of nodes: a listing-description update that adds or removes a home concept, new images added to a property, new base forms ingested from a new source, etc. Because these changes are localized and have limited impact on other nodes in the graph, they are easy to make and maintain.
• Knowledge base updates: These are bigger updates that affect many nodes and the downstream applications that consume the data, such as an ontology change, the addition of a new relationship type, or an update to a concept's parent/child or synonym list. Because these changes alter how data is consumed and inferred by applications, they need stricter control and tracking mechanisms, including support for any point-in-time analysis we may need in the future.
There are multiple ways to handle the above-mentioned updates and tracking of the KG depending on the use case, frequency of data updates, and consuming applications. At Zillow, we adopt the following mechanism of KG updates and versioning:
• We classify our major update tasks as either point-wise updates or knowledge base updates. A point-wise update is easy to execute, requires smaller changes to the KG, and needs minimal communication to consumers. We conduct knowledge base updates less frequently and rely on subject matter experts and humans in the loop to ensure high accuracy, with extensive communication to downstream clients to limit impact.
• We use time-based versioning of our KG and can maintain multiple versions at the same time. This improves tracking and gives consumer teams enough time to move to the next version of the KG. Not all teams use all parts of the KG, so they can migrate as needed when a new version is rolled out.
• Major releases and updates are made available to teams for review and feedback before a new version is released.
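The versioning scheme above can be sketched as an immutable snapshot store keyed by release date, with consumers pinning the version they read. The class and method names are illustrative, not Zillow's actual infrastructure.

```python
from datetime import date

class VersionedKG:
    """Time-based KG versioning: each release is a frozen snapshot."""

    def __init__(self):
        self.versions = {}  # version tag -> graph snapshot

    def release(self, snapshot, tag=None):
        """Publish a new immutable version; default tag is today's date."""
        tag = tag or date.today().isoformat()
        self.versions[tag] = dict(snapshot)  # copy so later edits don't leak in
        return tag

    def get(self, tag):
        """Consumers pin an explicit version for reproducible reads."""
        return self.versions[tag]
```

Keeping older snapshots alongside the new one is what lets consumer teams migrate on their own schedule and supports point-in-time analysis of past releases.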