• One of our jobs as data scientists is to train machine learning models. This work requires careful preparation of the training data. When we face supervised classification problems, we need to label the data. These labels classify the information and ultimately allow the model to make decisions. (View Highlight)
  • The quality and consistency of labels are fundamental to model development. Traditionally, labeling has been a manual task where human annotators identify which data belongs to one category or another. This tedious task can be affected by the subjective factor it involves, so the consistency of the labeling can be compromised, especially when the data being labeled is unstructured. (View Highlight)
  • There are several ways to streamline this task. One novel way is to use Large Language Models, which can automatically recognize and label specific categories in large text sets. Because these models have been trained on a large corpus of text data, they can understand the complexity of such texts, contextualize them as a human would, and even reduce human error. (View Highlight)
  • When using an LLM for automatic annotation, it is essential to write a clear and detailed prompt1 that defines the categories to be annotated, for which we can provide concrete examples and guidelines. For instance, in what context and for what purpose is the data being annotated? This allows the model to generate accurate and consistent labels. (View Highlight)
  • One of the most important phases in the development cycle of an analytical model, in this case a language model, is evaluating its results. This helps us assess and improve its performance in production and even fine-tune the prompts we provide. However, this evaluation is less well-defined than that of more traditional machine learning models, where we have established metrics such as accuracy for classification problems or MSE (Mean Squared Error) for regression. (View Highlight)
  • In generative AI, defining a set of metrics to evaluate aspects such as response relevance, consistency, and completeness is essential. Once defined, scoring each response according to these criteria can become a bottleneck due to the time required and the subjectivity of the evaluation, as different human evaluators may have different criteria. For this reason, the alternative is to use another LLM to score the generated responses. (View Highlight)
  • It has been observed that Large Language Models have a certain “reasoning” or “critical” capacity that can be very helpful in evaluating the text generated by other LLMs or even by themselves, thus facilitating the task of manual evaluation. (View Highlight)
  • LLMs have an excellent ability to understand and process large amounts of data, whether text, images, or video. This ability allows them to synthesize information and process requests in a conversational format. (View Highlight)
  • The first possibility, and probably the most obvious, is to take advantage of the large context size that some of the current LLMs can handle (the amount of text we can feed them) to enrich our query with all the necessary information. For example, let’s suppose we ask the model a question regarding government regulations. We can get a better answer if we include in the query a document with the details of that specific regulation rather than just relying on the model’s general knowledge. (View Highlight)
  • However, other techniques come into play when we have more information than the model can process or when we do not know exactly which documents can help us answer the question. The two main ones are Retrieval Augmented Generation (RAG) and fine-tuning. RAG retrieves information from documents using a language model. In the case of a query, the first step is to search the database for information relevant to the answer. The LLM then uses this retrieved information to generate an enriched and contextualized answer. (View Highlight)
  • Suppose we are working on a multi-class classification problem where we label financial products according to their subject matter: cards, accounts, insurance, mutual funds, investment funds, pension plans, cryptocurrencies, and so on. When training a classifier, it may show some tendency or bias in its predictions towards some categories, possibly because other categories are less represented in the dataset. This can affect the model’s performance. (View Highlight)
  • One solution to this problem is to generate synthetic data. Synthetic data is information artificially created by computational models that mimic the structure and characteristics of actual data but do not come from real-world events. In the context of artificial intelligence, this data can help balance a dataset by providing additional examples in underrepresented categories. (View Highlight)
  • LLM are particularly useful for this task. Using prompting techniques, we provide specific examples of the type of text we want the model to generate to balance our dataset. This process can be implemented using “few-shot learning,” where we show the model a few examples to learn the task, or the “person pattern,” which defines a specific format the generated text must follow. (View Highlight)
  • LLM-based agents are systems that can perform actions. As data scientists, we can use this new functionality in applications like chatbots. Traditionally, invoking an action in a conversation, such as when interacting with a device like Alexa, requires a specific command. For example, if you wanted to turn on the lights in your living room, you’d have to specify a particular action and the exact name you gave that lamp to Alexa. However, these agents allow LLMs to interpret the user’s intent, even if the command is not explicitly mentioned, and invoke the necessary actions. (View Highlight)
  • These actions must be clearly defined as tools2 so that agents can perform them. Tools are components that extend the functionality of LLM beyond its dialog capabilities. They must contain a detailed description of the action to be performed, the necessary parameters for its execution, and the corresponding code to carry it out. (View Highlight)
  • These LLM applications are profoundly changing our daily work as data scientists. Tools like GitHub Copilot make coding more accessible, while techniques like prompt critique improve the quality of the prompts we write. They allow us to identify gaps and correct instructions that could be clearer and more accurate in automated tagging tasks. (View Highlight)