In this study, we research use cases from Booking.com and define 239 valuable topics in total. Each topic is given a topic name and a topic description, to better match natural customer language and to improve model training results. The main data source is user-generated content on Booking.com, including customer reviews and forum posts from hotel owners and travelers.
Multi-label text classification is a critical task in industry: it helps extract structured information from large amounts of textual data. We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance by employing a Bi-Encoder Transformer architecture that combines concatenation, subtraction, and multiplication of the text and topic embeddings. Text2Topic also supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch inference with high throughput. The final model achieves accurate and comprehensive results compared with state-of-the-art baselines, including large language models (LLMs).
Developing an architecture that ensures high accuracy, scalability to a large number of topics, and low cost and low latency for real-world inference is of utmost importance. Sentence-BERT (Reimers and Gurevych, 2019) extends BERT (Devlin et al., 2019) to sentence-level embeddings, achieving impressive performance on tasks like sentence similarity and semantic retrieval.
In this study, a total of 239 topics are defined, and around 1.6 million text-topic pair annotations (of which 200K are positive) are collected over approximately 120K texts from 3 main data sources on Booking.com. The data is collected with optimized smart sampling and partial labeling.
The Multilingual Universal Sentence Encoder for Semantic Retrieval (MUSE) (Yang et al., 2019), a multilingual extension of the Universal Sentence Encoder (Cer et al., 2018), enables cross-lingual semantic retrieval and provides multiple open-source models. Though there are other state-of-the-art approaches, the two methods above are prevalent in real industry applications due to their computational efficiency, high and robust in-domain performance after fine-tuning, zero-shot ability, and scalability.
Our proposed Text2Topic framework adopts a fine-tuning approach on top of pre-trained language models.
We employ the bi-encoder Transformer (Vaswani et al., 2017) architecture proposed by Sentence-BERT, which allows the text and topic information to be injected separately.
In the digital age, large-scale online travel platforms (OTPs) face the challenge of effectively extracting valuable insights from massive volumes of textual data. Such an OTP can receive hundreds of millions of customer reviews in one year, so structured insights are crucial for understanding customer behavior and making data-driven decisions that improve the overall travel experience.
This architecture not only gives the model zero-shot capabilities (handling new topics at inference time) but also provides text embedding abilities.
1) Inference efficiency: we pre-calculate and cache all topic embeddings, embed each text only once, and reuse the text vector to score against all topics (see the sketch after this list). Given N texts and T topics, to obtain N × T predictions the bi-encoder needs O(N + T) encoding operations, whereas the cross-encoder needs O(N · T). 2) In-house embeddings: the bi-encoder gives us the text-side embeddings, which can be used as features for other tasks. 3) For the same base model, the bi-encoder allows longer text input since text and topic are embedded separately.
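As an illustration of the encoding-count argument, here is a minimal sketch using sentence-transformers; the model name and inputs are placeholders, and scoring is simplified to a dot product rather than the FFN head used in Text2Topic:

```python
from sentence_transformers import SentenceTransformer

# Placeholder encoder; in production this would be the fine-tuned bi-encoder checkpoint.
model = SentenceTransformer("bert-base-multilingual-cased")

topics = ["romantic trip: a stay for couples ...", "surfing: the text mentions surfing ..."]  # T topic descriptions
texts = ["Great balcony and a very romantic atmosphere", "The beach was perfect for surfing"]  # N texts

# Encode topics once (T encoding operations) and cache the result.
topic_emb = model.encode(topics, normalize_embeddings=True)   # shape (T, d)

# Encode each text once (N encoding operations); every text is scored against all cached topics.
text_emb = model.encode(texts, normalize_embeddings=True)     # shape (N, d)
scores = text_emb @ topic_emb.T                               # shape (N, T): N*T scores from only N+T encodings
print(scores)
```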
All of the above architectures can easily be extended to include new topics for training, and also offer zero-shot capability when an unseen topic is well defined with a description.
With hundreds of topics, annotation becomes challenging: how to define, merge, and distinguish topics; how to decide the annotation volume and candidate texts per topic; and how to reduce cost.
For each text-topic pair (with its topic description), we know one binary ground truth label.
Cross-encoder: the text and topic description are tokenized and concatenated into one input with a [SEP] separator (“TEXT[SEP]TOPIC”), then passed through the transformer encoder and a classification head to derive logits, to which Binary Cross Entropy (BCE) loss is applied.
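A minimal sketch of this cross-encoder setup with Hugging Face transformers (the example text, topic description, and label are illustrative; the real pipeline adds the training settings described later):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1  # single logit -> binary text-topic relevance
)

texts = ["Lovely balcony overlooking the sea"]
topics = ["balcony: the text mentions a balcony or terrace"]   # illustrative topic description
labels = torch.tensor([[1.0]])                                  # binary ground truth per pair

# The tokenizer inserts the [SEP] separator between the two segments ("TEXT[SEP]TOPIC").
batch = tokenizer(texts, topics, padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits                                  # shape (batch, 1)
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```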
We start with a proof-of-concept stage, where 43 topics are pre-defined and annotated by domain experts on 12K texts. With this data, we run multiple cross-encoder training rounds while increasing the number of positive annotations per topic in the training data. Figure 5 in the Appendix shows that for most topics, the mAP metric saturates at around 200 positive annotations, which provides basic guidance on annotation volume.
• Bi-encoder Concatenation (Figure 1): we first generate a pair of embeddings (U, V), both of dimension d_model, where U is the topic description embedding and V is the text embedding. We then feed E (the concatenation of the embeddings, their subtraction, and their element-wise multiplication) into 2 feedforward layers (FFN1 ∈ R^(d_E × d_model) with ReLU activation and dropout, then FFN2 ∈ R^(d_model × 1)), and finally apply BCE loss.
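A minimal PyTorch sketch of this scoring head, assuming E is the combination [U, V, U - V, U * V] so that d_E = 4 · d_model (the exact combination follows Figure 1 of the paper):

```python
import torch
import torch.nn as nn

class ConcatHead(nn.Module):
    """Scores a (topic, text) embedding pair via concatenation, subtraction, and multiplication."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        d_e = 4 * d_model                      # assumed combination: [U, V, U - V, U * V]
        self.ffn1 = nn.Linear(d_e, d_model)
        self.ffn2 = nn.Linear(d_model, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        e = torch.cat([u, v, u - v, u * v], dim=-1)
        h = self.dropout(torch.relu(self.ffn1(e)))
        return self.ffn2(h)                    # logit, trained with BCE-with-logits

# Toy usage with random vectors standing in for the topic (U) and text (V) encoder outputs.
u, v = torch.randn(8, 768), torch.randn(8, 768)
logits = ConcatHead(768)(u, v)
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones(8, 1))
```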
In the end, we define 239 topics from user research, covering broader topics such as trip types (romantic trip, city trip, etc.), travel activities (surfing, hiking, etc.), and specific user needs such as hotel facilities (garden, balcony, etc.). Each topic has a name and a description, which is refined with the help of LIME (see Section 5.5).
• Bi-encoder Cosine: similar to Figure 1, but at step 4, instead of combining the embeddings, we apply cosine similarity directly to U and V, and then use mean-squared-error loss as the objective function.
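For comparison, this cosine variant reduces to a similarity score regressed against the binary label; a few illustrative lines:

```python
import torch
import torch.nn.functional as F

# Bi-encoder cosine variant: score is cos(U, V), regressed against the 0/1 label with MSE.
u, v = torch.randn(8, 768), torch.randn(8, 768)   # topic / text embeddings from the shared encoder
labels = torch.randint(0, 2, (8,)).float()
scores = F.cosine_similarity(u, v, dim=-1)
loss = F.mse_loss(scores, labels)
```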
Smart Sampling and Partial Labeling: In our corpus, texts are typically short and contain a small number of topics, so we apply partial labeling instead of full annotation.
The topics are grouped into 38 multi-choice question groups (e.g., food topics are in one group). With the best 43-topic model (from the training described in Section 3.1), we predict on a large corpus, detecting all 239 topics and generating text-topic scores. Notably, the predictions for the 196 unseen topics are generated zero-shot.
The auditing team performs regular performance checks, and we achieve > 95% annotation accuracy. Finally, we collected almost 1.6 million annotations at a low cost, with 200K of them being positive (about 12.5%). These annotations are gathered from approximately 120K unique multilingual texts, including English, German, French, Russian, etc., written by guests, travelers, and property owners, and sourced from reviews, the travel community, and the partner hub.
With the predictions, we perform smart text sampling: 1) First, for each topic, we do probability-weighted sampling over the texts whose scores pass a threshold, and assign the selected texts to the multi-choice group that contains that topic. Figure 4 in the Appendix shows an example of a text that passes the threshold for the “romantic trip” topic and is assigned to the group of topics containing “romantic trip”. 2) In addition, to avoid annotation bias, each selected text is also assigned to one random group (besides the already assigned relevant groups). For example, the text in Figure 4 is also randomly assigned to another group. 3) Besides the model-based text sampling, we also randomly sample some texts from the corpus and randomly assign them to groups.
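A minimal sketch of these three sampling steps (the function name, threshold, and sampling sizes are illustrative, not the production values):

```python
import numpy as np

def smart_sample(scores, topic_to_group, n_groups, threshold=0.5,
                 per_topic=50, n_random=100, seed=0):
    """Sketch of smart sampling: 1) probability-weighted sampling of texts passing the
    threshold per topic, 2) one extra random group per selected text to reduce bias,
    3) fully random texts assigned to random groups."""
    rng = np.random.default_rng(seed)
    n_texts, n_topics = scores.shape            # scores: (N, T) model predictions
    assignments = []                             # (text_index, group_id) pairs sent for annotation

    for t in range(n_topics):
        passing = np.where(scores[:, t] >= threshold)[0]
        if passing.size == 0:
            continue
        weights = scores[passing, t] / scores[passing, t].sum()
        size = min(per_topic, passing.size)
        for i in rng.choice(passing, size=size, replace=False, p=weights):
            assignments.append((int(i), topic_to_group[t]))            # relevant group
            assignments.append((int(i), int(rng.integers(n_groups))))  # extra random group

    for i in rng.choice(n_texts, size=min(n_random, n_texts), replace=False):
        assignments.append((int(i), int(rng.integers(n_groups))))      # model-independent sample
    return assignments
```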
We use AWS SageMaker Ground Truth as the platform for annotation collection, and leverage some of the MuMIC (Wang et al., 2023) annotation pipelines and strategies (majority voting, etc.). We recruit specialized annotators to form one auditing team and multiple worker teams.
All experiments are performed on a compute instance equipped with 1 NVIDIA Tesla T4 Tensor Core GPU, 4 vCPUs, and 16GB RAM. The experiments share the following general settings: bert-base-multilingual-cased as the pre-trained base model; fine-tuning of all layers; mixed-precision training; batch size of 12 text-topic pairs with 8 gradient accumulation steps; weight decay (on all weights that are not gains or biases) with coefficient 0.01; AdamW optimizer (Loshchilov and Hutter, 2017); initial learning rate 1e-5 with a linear scheduler; a maximum of 6 epochs with early stopping patience of 3 steps; and 10K warm-up steps. We apply stratified sampling (on topic frequency) to the texts to obtain training/validation/test sets with a 70/15/15 ratio.
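A sketch of an optimizer and scheduler setup matching these settings (the weight-decay parameter grouping and the total step count are illustrative):

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# Exclude biases and LayerNorm gains from weight decay, as described in the settings.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") or "LayerNorm" in name else decay).append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-5,
)

# Batch size 12 text-topic pairs with gradient accumulation of 8 -> effective batch of 96.
num_training_steps = 6 * 10_000   # illustrative: epochs * steps-per-epoch in practice
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=num_training_steps
)
```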
Given T topics and N texts, the ground truth and the predictions can both be represented as matrices of size N × T (with partial labeling, the ground truth matrix has null values, which we filter out when computing metrics).
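A small sketch of metric computation over such partially labeled matrices, masking out the null (here NaN) entries per topic before computing average precision:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def masked_map(y_true, y_score):
    """Mean average precision over topics, ignoring unannotated (NaN) cells of the
    N x T partial-label matrices. A sketch, not the paper's exact evaluation code."""
    aps = []
    for t in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, t])
        if mask.sum() and y_true[mask, t].max() > 0:   # need annotations and at least one positive
            aps.append(average_precision_score(y_true[mask, t], y_score[mask, t]))
    return float(np.mean(aps))
```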
Table 1 compares performance across multiple model architectures and the MUSE baseline. We perform hyperparameter tuning on all methods, report the highest achievable performance, and find that all of them beat the MUSE baseline. The cross-encoder outperforms all other architectures because it learns the topic-text relation through attention, layer by layer inside the transformer. Generally, the bi-encoder concatenation method is better than the simple cosine similarity architecture, and the embedding subtraction and multiplication are both necessary.
MUSE (Yang et al., 2019): Google provides multiple versions of the MUSE models, and we use “multilingual-large-3” (https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3). The model covers the languages in our dataset, is trained with multi-task learning on the Transformer architecture, and is optimized for multi-word-length text. Given a text, MUSE generates a 512-dimensional vector as the embedding. For each text-topic pair, we calculate the cosine similarity between the text embedding and the topic embedding as the model's prediction score.
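A minimal sketch of this MUSE baseline scoring with TensorFlow Hub (the example texts and topic description are placeholders):

```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops MUSE needs)

# MUSE baseline: cosine similarity between the 512-d text and topic embeddings.
muse = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

texts = ["The breakfast buffet was amazing"]                     # illustrative inputs
topics = ["food: the text mentions meals, breakfast, restaurants"]

text_emb = muse(texts).numpy()                                   # shape (N, 512)
topic_emb = muse(topics).numpy()                                 # shape (T, 512)

# Normalize and take dot products -> cosine similarity scores, shape (N, T).
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
topic_emb /= np.linalg.norm(topic_emb, axis=1, keepdims=True)
scores = text_emb @ topic_emb.T
```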
GPT-3.5: we choose gpt-3.5-turbo-0301, which supports a maximum context length of 4K tokens.
It is worth mentioning that for all methods, model training typically saturates at around the 2nd or 3rd epoch, which takes less than one day per model. The cross-encoder has the highest performance but its inference time complexity is too high, so we choose the “bi-encoder (concat, sub, mult)” variant for production and refer to it as the “bi-encoder concat” model in this paper.
In our case, Text2Topic is a better choice because of: 1) less dependency on non-open-source models, so that model iteration and rate limits remain under control; 2) avoiding tedious prompt tuning procedures; 3) lower cost and better eco-friendliness: the Text2Topic model has fewer than 200M parameters (considering that we cache the topic embeddings during inference).
We randomly split all topics into 5 groups, and each time train on 4 groups and evaluate the zero-shot ability on the remaining group. Table 3 provides an overall performance comparison, and we see that the bi-encoder concat model performs best.
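A sketch of this topic-wise 5-fold zero-shot protocol (train_model and evaluate_zero_shot are hypothetical helpers standing in for the actual training and evaluation code):

```python
import numpy as np

# Split the 239 topics into 5 groups; train on 4 groups, evaluate zero-shot on the held-out group.
rng = np.random.default_rng(0)
groups = np.array_split(rng.permutation(239), 5)

for k, held_out in enumerate(groups):
    train_topics = np.concatenate([g for i, g in enumerate(groups) if i != k])
    # model = train_model(train_topics)            # hypothetical helper
    # evaluate_zero_shot(model, held_out)          # hypothetical helper
```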
We use Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) for model interpretation. LIME generates local explanations by perturbing individual text instances and approximating the model's behavior with a surrogate model that highlights the importance of words in the original text. LIME is effective in the error-analysis flow and helps refine topic descriptions at an early stage (see the example in Figure 3 in the Appendix).
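A minimal LIME sketch for a single topic (predict_topic_proba is a hypothetical wrapper around the classifier; here it returns random scores as a placeholder):

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_topic_proba(texts):
    """Hypothetical wrapper: for a list of texts, return an (n, 2) array of
    [P(not topic), P(topic)] for one chosen topic. Placeholder scores here."""
    p = np.random.rand(len(texts))
    return np.stack([1 - p, p], axis=1)

explainer = LimeTextExplainer(class_names=["other", "romantic trip"])
explanation = explainer.explain_instance(
    "Perfect hideaway for couples, candle-lit dinner on the balcony",
    predict_topic_proba,
    num_features=8,
)
print(explanation.as_list())   # (word, weight) pairs driving the prediction
```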
Considering the GPT-3.5 context length, we select 24 topics for the evaluation, covering 3 representative groups: food, trip types, and room conditions. Through multiple prompt iterations, we find that few-shot prompting is necessary because it regulates the output format by showing examples. We end up with two best prompts: 8-shot and 38-shot. Both prompts include the list of 24 topic definitions with descriptions, plus Chain-of-Thought (CoT) (Wei et al., 2023) rules: ask the model to quote each part of the text, infer topics, and then output a topic list. Besides the topic descriptions and CoT rules, the 8-shot prompt (around 1,700 tokens) has 3 text examples covering 8 positive annotations on 8 topics, while the 38-shot prompt (around 2,900 tokens) has 16 examples covering 38 positive annotations on 24 topics.
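A rough skeleton of such a prompt (the topic definitions, rules wording, and examples below are illustrative placeholders, not the actual prompts used in the evaluation):

```python
# Skeleton of a few-shot, chain-of-thought topic-labelling prompt for the GPT-3.5 comparison.
TOPIC_DEFINITIONS = """\
- food: the text mentions meals, breakfast, restaurants, ...
- romantic trip: the text describes a stay for couples, ...
(... 24 topic definitions in total ...)"""

FEW_SHOT_EXAMPLES = """\
Text: "Candle-lit dinner on the terrace, perfect for our anniversary."
Quotes: "Candle-lit dinner" -> food; "perfect for our anniversary" -> romantic trip
Topics: [food, romantic trip]
(... more labelled examples ...)"""

prompt = f"""You are labelling travel reviews with topics.
Topic definitions:
{TOPIC_DEFINITIONS}

Rules: quote each relevant part of the text, infer the topic it supports,
then output the final topic list in square brackets.

Examples:
{FEW_SHOT_EXAMPLES}

Text: "{{review_text}}"
Quotes:"""
```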
Detect Property Type: With Text2Topic predictions on reviews, we are able to detect hidden property categorizations by analyzing the frequencies of relevant topics (guest house, farm stay, resort, chalet, etc.). For example, an Apartment property that is described as a guest house could be surfaced to users searching for a guest house. We detect over a million extra properties of supply (774K more apartments, 25K more villas, and 60K more cabins/chalets).
Fintech: The Text2Topic training pipeline enables us to train a new model on Fintech data and topics, such as payments and questions about invoices and commissions. The model auto-classifies incoming messages from customers and correctly re-routes them to the right self-service solution, which increases the auto-reply success rate by 9% and reduces manual handling time.
Property Recommendation: Reviews contain rich information that encapsulates users' preferences towards different properties. Text2Topic turns them into structured features, which enhance the performance of the in-house property recommendation models. With classification scores on reviews, we perform property-level score aggregations to extract a variety of insightful attributes, such as a property's relevancy for different themes (e.g., beach, spa/wellness). These attributes are integrated into the recommendation models to increase relevant inventory (e.g., the number of beach properties is increased by 4% by leveraging Text2Topic) and to create novel and nuanced categories of recommendations (such as castle-type hotels).
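A small sketch of such a property-level aggregation (column names and thresholds are illustrative):

```python
import pandas as pd

# Aggregate review-level topic scores into property-level theme attributes.
reviews = pd.DataFrame({
    "property_id": [1, 1, 2],
    "topic": ["beach", "beach", "spa/wellness"],
    "score": [0.91, 0.65, 0.88],
})
attributes = (reviews.groupby(["property_id", "topic"])["score"]
              .agg(["mean", "count"])
              .reset_index())
# e.g. flag a property as relevant for a theme if enough reviews score highly on that topic
attributes["relevant"] = (attributes["mean"] > 0.7) & (attributes["count"] >= 2)
```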
Aiming to minimize the cost per prediction, we start by grid searching the optimal batch size through stress tests on a single model endpoint. For each batch size, we randomly sample batches of texts from the corpus, then iterate over the batches sequentially and invoke the model. We observe that as the batch size increases, the throughput (number of text predictions per minute) first increases and then starts decreasing as the available GPU cores are exhausted. An optimal and memory-safe batch size is 300. We then run experiments comparing batch invocations against asynchronous I/O invocations.
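A rough sketch of the throughput probe behind this batch-size grid search (invoke_endpoint is a hypothetical wrapper around the SageMaker endpoint call):

```python
import time

def measure_throughput(invoke_endpoint, texts, batch_size):
    """Rough throughput probe for one batch size: text predictions per minute."""
    start, n = time.time(), 0
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        invoke_endpoint(batch)       # hypothetical endpoint invocation
        n += len(batch)
    return n / ((time.time() - start) / 60.0)

# Grid search over candidate batch sizes; throughput peaked near batch size 300 in our tests.
# for bs in (50, 100, 200, 300, 400, 500):
#     print(bs, measure_throughput(invoke_endpoint, sampled_texts, bs))
```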