Highlights

  • Datadog releases a state-of-the-art open-weights time series foundation model and an observability benchmark
  • We are excited to announce a new open-weights release of Toto, our state-of-the-art time series foundation model (TSFM), and BOOM, a new public observability benchmark that contains 350 million observations across 2,807 real-world time series. Both are open source under the Apache 2.0 license and available to use immediately.
  • The open-weights Toto model, trained with observability data sourced exclusively from Datadog’s own internal telemetry metrics, achieves state-of-the-art performance by a wide margin compared to all other existing TSFMs. It does so not only on BOOM, but also on the widely used general-purpose time series benchmarks GIFT-Eval and LSF (long sequence forecasting).
  • BOOM, meanwhile, introduces a time series (TS) benchmark that focuses specifically on observability metrics, which have their own challenging and unique characteristics compared to other typical time series. As with Toto, BOOM’s data is sourced exclusively from our internal metrics, but from a separate environment to eliminate any cross-contamination with Toto training data. Further details can be found in our latest technical paper.
  • Observability metrics represent an important, unique, and challenging subset of TS data, raising the need for observability-focused models and benchmarks. Accurately modeling observability metrics is essential for critical operations like anomaly detection (e.g., identifying spikes in error rates), predictive forecasting (e.g., anticipating resource exhaustion or scaling needs), and dynamic autoscaling. However, observability time series have challenging distributional properties, including sparsity, spikes, and noisiness; anomaly contamination; and high-cardinality multivariate structure, with complex relationships among variates.
  • Real-world observability systems also routinely generate millions or even billions of distinct time series, making individual fine-tuning or supervised training of complex models per time series infeasible. Finally, many of these series are short-lived and correspond to ephemeral infrastructure, creating a “cold-start” problem. All of these operational challenges motivate the need for a zero-shot TSFM with excellent performance on observability data specifically.
  • The new (151 million parameter) open-weights Toto model (model card) builds upon our initial proprietary model architecture, which combined a Student-t mixture model prediction head to model heavy-tailed observability time series and a proportional factorized attention mechanism to efficiently capture variate interactions.
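To make the Student-t mixture head concrete, here is a minimal NumPy sketch of the negative log-likelihood of a Student-t mixture for a single observation. This is an illustration only: in Toto the mixture parameters are produced by a learned network head, and the function and parameter names here (`student_t_logpdf`, `mixture_nll`) are ours, not the model's.

```python
import numpy as np
from math import lgamma, pi

def student_t_logpdf(x, mu, sigma, nu):
    """Log density of a Student-t with location mu, scale sigma, dof nu."""
    z = (x - mu) / sigma
    return (lgamma((nu + 1) / 2) - lgamma(nu / 2)
            - 0.5 * np.log(nu * pi) - np.log(sigma)
            - (nu + 1) / 2 * np.log1p(z * z / nu))

def mixture_nll(x, weights, mus, sigmas, nus):
    """Negative log-likelihood of x under a Student-t mixture
    (log-sum-exp over components for numerical stability)."""
    log_comps = [np.log(w) + student_t_logpdf(x, m, s, n)
                 for w, m, s, n in zip(weights, mus, sigmas, nus)]
    top = max(log_comps)
    return -(top + np.log(sum(np.exp(lc - top) for lc in log_comps)))
```

The heavy tails matter: a low-degrees-of-freedom Student-t assigns far less extreme penalty to a spiky outlier than a near-Gaussian does, which is exactly the behavior wanted for spiky observability metrics.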
  • Our new model also uses a novel patch-based causal normalization approach, which enables the model to generalize better to highly nonstationary data, and uses a composite robust loss to stabilize training dynamics. Detailed descriptions of all of these features, as well as ablations, are presented in our paper.
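The key property of causal normalization is that each patch's statistics use only past data, so no future information leaks into the normalized input. A minimal sketch of that idea (the actual scheme in Toto may differ in its exact statistics and windowing):

```python
import numpy as np

def causal_patch_normalize(series, patch_len, eps=1e-6):
    """Normalize each patch of `series` using the mean/std of all
    observations up to the end of that patch (causal: statistics never
    depend on future values)."""
    n = len(series)
    out = np.empty(n, dtype=float)
    for start in range(0, n, patch_len):
        end = min(start + patch_len, n)
        history = series[:end]                 # causal window only
        mu, sd = history.mean(), history.std() + eps
        out[start:end] = (series[start:end] - mu) / sd
    return out
```

Because statistics adapt as the causal window grows, a level shift later in the series changes only the normalization of later patches, which helps on highly nonstationary data.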
  • One of the biggest factors in Toto’s success is that it is trained on a large amount of real-world observability data. The bar chart below shows the number of unique (non-synthetic) time series observations in the training repositories of several TSFMs and the fraction of Toto’s training data that consists of observability metrics specifically.
  • As shown in the bar plots at the top of this post, the Toto model achieves state-of-the-art (SOTA) performance, gaining the best rank score by a significant margin on the BOOM and GIFT-Eval benchmarks (as of May 19, 2025).
  • On the smaller LSF benchmark, we also measure the performance of a fine-tuned Toto model (TotoFT) and, again, find state-of-the-art performance. As is standard for this benchmark, we compute normalized mean squared error (MSE) and mean absolute error (MAE) for each of six different datasets within the LSF benchmark. Toto achieves the best mean performance across datasets, both in terms of MSE and MAE, in particular achieving a 5.8 percent improvement in MAE over the next best method.
  • BOOM (Benchmark of Observability Metrics) (data card) contains 350 million observations across 2,807 real-world production series, where each series can contain multiple variates. We highlight the diversity of BOOM by visualizing a two-dimensional projection of each series, along with series from GIFT-Eval and LSF. The projection is generated using the first two principal components computed from six statistical features of each series.
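The projection described above is a standard feature-space PCA. Here is a minimal NumPy sketch; the six features chosen below are an illustrative guess, not necessarily the six used in the paper:

```python
import numpy as np

def series_features(x):
    """Six simple per-series statistics (an illustrative feature set)."""
    diffs = np.diff(x)
    return np.array([x.mean(), x.std(), x.min(), x.max(),
                     np.abs(diffs).mean(),      # roughness
                     (x == 0).mean()])          # sparsity

def pca_2d(feature_matrix):
    """Project rows onto the first two principal components:
    center and standardize the features, then take the top-2
    right singular vectors from an SVD."""
    X = feature_matrix - feature_matrix.mean(axis=0)
    X = X / (X.std(axis=0) + 1e-12)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```

Each series becomes one point in the 2-D plot, so clusters and outliers reveal how BOOM's observability series differ from GIFT-Eval and LSF series.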
  • The evaluation procedure for the benchmark closely follows the methodology established by the GIFT-Eval benchmark. In particular, we calculate MASE (mean absolute scaled error) and CRPS (continuous ranked probability score) for all zero-shot models on three forecasting lengths: short, medium, and long (detailed numbers can be found in our paper). We then normalize these metrics by the seasonal naive performance and rank models based on the normalized CRPS.