With millions of creative items across thousands of categories – many of which are unique – it’s difficult to accurately capture all possible product attributes, which range from standard attributes like “color” and “material” to niche attributes like “bead hole size” and “slime additives.” The range of possible attributes and their values is so broad that it’s a challenge even to enumerate them, let alone label listings with specific attribute data. And unlike other online retailers that may also have enormous inventories, products on Etsy are listed by third-party sellers and are often handmade or customized, so we have no global SKUs (stock keeping units) and no mappings from SKUs to product attributes.
We collect both structured and unstructured data from sellers, and they serve different roles in our marketplace.
• Unstructured data comes in the form of free-text descriptions, creative titles, and listing photos. While this content is full of useful product information, it’s harder for machines to interpret consistently and quickly at scale.
• Structured data - in the form of product attributes like size and color - is easy for our systems to parse. It powers the buyer experience through tools such as search filtering options (offered through selectors in the UI) and product-to-product comparison on characteristics of interest (material, price, etc.).
While Etsy does ask sellers to provide structured data on their listings’ attributes, most fields are not required. This reduces friction in the listing process and gives sellers the flexibility to represent their often unique items accurately.
Sellers can fill in these attributes, or leave them blank and continue through the listing process
As a result, most sellers provide only (or mostly) unstructured data in the form of listing titles, descriptions, and photos. Frequently, key information like product dimensions is buried in the listing description or only available in listing photos.
While our powerful search and discovery algorithms can process unstructured data such as descriptions and listing photos, passing long context and images directly to search poses latency concerns. For these algorithms, every millisecond counts as they work to deliver relevant results to buyers as quickly as possible. Filtering through unstructured data for every query is simply not feasible.
These constraints led us to a clear conclusion: to fully unlock the potential of all inventory listed on Etsy’s site, unstructured product information needs to be distilled into structured data to power both ML models and buyer experiences.
Before the availability of scalable LLMs, we explored various ML-based solutions to this challenge. Supervised product attribute extraction models had limited efficacy; even if we could enumerate all possible product attributes and values, many of them would be so sparse that traditional classification models would struggle to capture the long tail. Sequence tagging approaches also had difficulty scaling to multiple attributes. Transformer-based question-answering models (e.g., AVEQA, MAVEQA) allowed for generalization to unseen attribute values, but still required large amounts of application-specific training data.
This is where the availability of foundational LLMs presented a transformational opportunity for Etsy. These models have a vast amount of general knowledge from pre-training, can process large context windows quickly and affordably, and can follow instructions given a small number of examples.
When working with LLMs, one of the biggest challenges is evaluating model performance. We needed to ensure that, at scale across our 100M+ listings, the LLMs were consistently and reliably producing accurate, actionable results. To do this, we initially worked with a third-party labeling vendor to collect a large sample of human-annotated data containing attribute annotations for listings across multiple categories. We evaluated performance by comparing LLM inferences to this human-annotated dataset and calculated metrics like precision, recall, and Jaccard index. We used these ground truth metrics as a benchmark for model improvements via prompt and context engineering.
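To make that concrete, here’s a minimal sketch (not our production evaluation code) of how precision, recall, and Jaccard index can be computed for a single listing when attributes can have multiple values. The data shapes and example values are assumptions for illustration.

```python
from typing import Dict, Set

# Each listing maps an attribute name to the set of extracted values,
# e.g. {"color": {"red", "white"}, "material": {"cotton"}}.
Annotations = Dict[str, Set[str]]

def score_listing(predicted: Annotations, gold: Annotations) -> Dict[str, float]:
    """Compare one listing's predicted attributes to its human-annotated ones."""
    pred_pairs = {(attr, v) for attr, vals in predicted.items() for v in vals}
    gold_pairs = {(attr, v) for attr, vals in gold.items() for v in vals}
    tp = len(pred_pairs & gold_pairs)

    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    union = pred_pairs | gold_pairs
    jaccard = tp / len(union) if union else 1.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}

# Example: the model found the color but missed the material.
print(score_listing(
    predicted={"color": {"red"}},
    gold={"color": {"red"}, "material": {"cotton"}},
))  # {'precision': 1.0, 'recall': 0.5, 'jaccard': 0.5}
```

Per-listing scores like these are then averaged across the labeled sample to benchmark prompt and context changes.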
Unfortunately, there were several significant drawbacks to relying on human-labeled data. In many cases, we found that human annotators made mistakes, especially when annotating thousands of listings (after all, no one’s perfect). In the example below, a human annotator labeled the light fixture’s width as ½ inch, while the LLM correctly extracted 5.5 inches.
Furthermore, human labeling is time-consuming and expensive. To start scaling attribute inference across thousands of categories, we needed an automated labeling process that did not rely exclusively on human annotation.
Instead, we’ve started using high-performance, state-of-the-art LLMs to generate ground truth labels (often called “silver labels”). Human-in-the-loop is still an essential part of this process: Etsy domain experts review silver labels and iterate on the prompt to ensure high quality results. Once we’re confident in our silver label generation, we produce a larger dataset for evaluating a more scalable LLM. The diagram below shows the updated process for model development.
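In spirit, the workflow looks something like the sketch below. The model names, prompt, and helper function are hypothetical stand-ins, not our actual configuration: a stronger “labeler” model produces silver labels that domain experts review, and a cheaper “candidate” model is later evaluated against them.

```python
import json
import litellm

LABELER_MODEL = "gpt-4o"         # hypothetical: high-performance model for silver labels
CANDIDATE_MODEL = "gpt-4o-mini"  # hypothetical: the cheaper model we want to scale

PROMPT = (
    "Extract the product attributes (color, material, dimensions) from this "
    "listing and answer as a JSON object.\n\nListing:\n{listing}"
)

def extract_attributes(model: str, listing: dict) -> dict:
    """Ask a model for structured attributes; returns a parsed JSON dict."""
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(listing=json.dumps(listing))}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# 1. Generate silver labels with the strong model; domain experts review a sample
#    and the prompt is iterated until quality is acceptable.
listings = [{"title": "Hand-poured soy candle", "description": "8 oz, lavender scent"}]
silver_labels = [extract_attributes(LABELER_MODEL, l) for l in listings]

# 2. Once the silver labels are trusted, score the cheaper model against them
#    using the same metrics as before.
candidate_labels = [extract_attributes(CANDIDATE_MODEL, l) for l in listings]
```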
The core of our LLM pipeline is context engineering. We’ve worked with partners in product, merchandising, and taxonomy to ensure that the LLM has the right context for attribute extraction, including:
• Seller-provided listing data, including listing titles, descriptions, and images
• Few-shot examples hand-selected by domain experts
• Business logic from Etsy’s product taxonomy
• Category-specific extraction rules
Each listing is represented as a JSON string of context information. This context is injected into a series of prompts to extract product attributes in parallel. LLM requests are routed through LiteLLM to different regions, enabling higher parallelization and removing the dependency on any single cloud location. Finally, LLM responses are parsed into Pydantic dataclasses, which provide both basic type validation and custom validation based on business logic.
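The snippet below is a simplified sketch of that flow, not our production code: the context fields, attribute schema, model name, and business rule shown are hypothetical, and a Pydantic model stands in for the validation layer described above.

```python
import json
import litellm
from pydantic import BaseModel, field_validator

# Hypothetical context payload for one listing (the real context also carries
# few-shot examples, taxonomy signals, and category-specific extraction rules).
listing_context = {
    "title": "Vintage brass pendant light, 5.5 inch shade",
    "description": "Mid-century fixture, rewired. Shade width 5.5 in.",
    "category": "home_and_living.lighting",
    "extraction_rules": "Report dimensions in inches; one value per attribute.",
}

class LightingAttributes(BaseModel):
    """Schema for the attributes we expect back for this category."""
    material: str | None = None
    width_inches: float | None = None

    @field_validator("width_inches")
    @classmethod
    def width_must_be_plausible(cls, v):
        # Example of business-logic validation layered on top of type checks.
        if v is not None and not (0 < v < 1000):
            raise ValueError("width_inches outside plausible range")
        return v

response = litellm.completion(
    model="gpt-4o-mini",  # illustrative model; requests are routed across regions
    messages=[{
        "role": "user",
        "content": "Extract material and width_inches as JSON from this listing:\n"
                   + json.dumps(listing_context),
    }],
    response_format={"type": "json_object"},
)

attributes = LightingAttributes.model_validate_json(response.choices[0].message.content)
print(attributes)
```

In practice, each attribute (or group of related attributes) gets its own prompt, and the requests fan out in parallel rather than running one at a time as shown here.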
Beyond the challenges of evaluating the LLM output, inference itself may fail for many reasons: code bugs, permissions issues, transient errors, quota-exceeded errors, safety filters, and more. Rather than failing the pipeline for any individual error, errors are logged, and error metrics are surfaced via our observability platform. Our team is alerted if the number of failed inferences exceeds a certain threshold. To support debugging, we log a sample of traces to Honeycomb.
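As an illustration of that error-isolation pattern, here’s a minimal sketch; the threshold, sampling rate, and plain-logging setup are assumptions standing in for our actual observability tooling.

```python
import logging
import random

logger = logging.getLogger("attribute_inference")

FAILURE_ALERT_THRESHOLD = 0.05  # hypothetical: alert if more than 5% of inferences fail
TRACE_SAMPLE_RATE = 0.01        # hypothetical: record ~1% of requests as traces

def run_batch(listings, infer_fn):
    """Run inference per listing; log failures instead of failing the whole batch."""
    results, failures = [], 0
    for listing in listings:
        try:
            results.append(infer_fn(listing))
        except Exception as exc:  # code bugs, quota errors, safety filters, etc.
            failures += 1
            logger.warning("inference failed for listing %s: %s", listing.get("id"), exc)
        if random.random() < TRACE_SAMPLE_RATE:
            logger.info("sampled trace for listing %s", listing.get("id"))

    failure_rate = failures / len(listings) if listings else 0.0
    if failure_rate > FAILURE_ALERT_THRESHOLD:
        logger.error("failure rate %.1f%% exceeded threshold; alerting", failure_rate * 100)
    return results
```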
Even if the error rate is low, it’s possible that model performance has degraded. To track changes in model performance, we added performance evaluation to our pipeline. First, we run LLM inference on a sample of a ground-truth dataset and calculate performance metrics like precision and recall. These metrics are compared to baseline scores from the full ground-truth dataset. If any metrics deviate significantly, the pipeline is terminated. This process allows us to confirm that third-party LLMs are working as expected before we run production-scale inference.
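A minimal sketch of that gate might look like the following, assuming precomputed baseline scores and a per-listing scoring function like the one shown earlier; the tolerance value and numbers are illustrative assumptions.

```python
METRIC_TOLERANCE = 0.05  # hypothetical: allowed absolute drop per metric

def evaluate_sample(sample, infer_fn, score_fn) -> dict:
    """Run inference on a ground-truth sample and average the metrics."""
    scores = [score_fn(infer_fn(item["listing"]), item["gold"]) for item in sample]
    return {
        metric: sum(s[metric] for s in scores) / len(scores)
        for metric in ("precision", "recall")
    }

def check_against_baseline(sample_metrics: dict, baseline_metrics: dict) -> None:
    """Terminate the pipeline if any metric degrades beyond the tolerance."""
    for metric, baseline in baseline_metrics.items():
        if sample_metrics[metric] < baseline - METRIC_TOLERANCE:
            raise RuntimeError(
                f"{metric} dropped from {baseline:.2f} to {sample_metrics[metric]:.2f}; "
                "halting before production-scale inference"
            )

# Example: baseline from the full ground-truth dataset vs. today's sample run.
check_against_baseline(
    sample_metrics={"precision": 0.90, "recall": 0.84},
    baseline_metrics={"precision": 0.92, "recall": 0.85},
)  # within tolerance, so the pipeline proceeds
```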
Where we’ve applied LLM-generated product attribute data to buyer- and seller-facing experiences, we’ve seen promising results. In target categories, we’ve increased the number of listings with complete attribute coverage from 31% to 91%. And earlier this year, we added LLM-inferred attributes to search filters, leading to more engagement from buyers:
• Engagement with relevant Search filters increased
• Overall post-click conversion rate increased
Most recently, all of this work came together in leveraging LLM-inferred color attributes to display color swatches for each listing on the search results page. These give buyers additional at-a-glance information to find exactly what they want, faster.