• Today, a lot of LLM output is primarily for human consumption. A human might read around 250 words per minute, which is around 6 tokens per second (250 words/min / (0.75 words/token) / (60 secs/min)). So it might initially seem like there’s little value to generating tokens much faster than this. (View Highlight)
  • But in an agentic workflow, an LLM might be prompted repeatedly to reflect on and improve its output, use tools, plan and execute sequences of steps, or implement multiple agents that collaborate with each other. In such settings, we might easily generate hundreds of thousands of tokens or more before showing any output to a user. This makes fast token generation very desirable and makes slower generation a bottleneck to taking better advantage of existing foundation models. (View Highlight)
  • That’s why I’m excited about the work of companies like Groq, which can generate hundreds of tokens per second. Recently, SambaNova published an impressive demo that hit hundreds of tokens per second. (View Highlight)
  • Incidentally, faster, cheaper token generation will also help make running evaluations (evals), a step that can be slow and expensive today since it typically involves iterating over many examples, more palatable. Having better evals will help many developers with the process of tuning models to improve their performance. (View Highlight)
  • Fortunately, it appears that both training and inference are rapidly becoming cheaper. I recently spoke with Cathie Wood and Charles Roberts of the investment firm ARK, which is famous for its bullish predictions on tech. They estimate that AI training costs are falling at 75% a year. If they are right, a foundation model that costs 25M to train next year. Further, they report that for “enterprise scale use cases, inference costs seem to be falling at an annual rate of ~86%, even faster than training costs.” (View Highlight)
  • What’s new: Udio launched a web-based, text-to-song generator that creates songs in styles from barbershop to heavy metal. Suno, which debuted its service late last year with similar capabilities, upgraded to its offering. (View Highlight)
  • How it works: Both services take text prompts and generate full-band productions complete with lyrics, vocals, and instrumental solos, two separate generations per prompt. Users can generate lyrics to order or upload their own words, and they can download, share, and/or post the results for others to hear. Leaderboards rank outputs according to plays and likes. (View Highlight)
  • Founded by alumni of Google’s DeepMind division, Udio lets registered users generate up to 1,200 songs monthly for free and expects to offer paid services at an unspecified future date. Users enter a text prompt and/or choose style tags. The system automatically replaces artist names with stylistic descriptions but sometimes produces results that sound uncannily like the artists requested. Users can choose to generate an instrumental track or add lyrics, allocating them to verse, chorus, or background vocals. Udio generates audio segments 33 seconds long, which users can extend, remix, and modify. The company has not released information about the underlying technology. (View Highlight)
  • Suno lets users generate 10 songs daily for free or pay to generate more. Enter a prompt, and the system generates complete songs up to 2 minutes long; alternatively, users can specify lyrics, style, and title in separate prompts. The system refuses to generate music from prompts that include the name of a real-world artist. Suno hasn’t disclosed technical information, but last year it released an open-source model called Bark that turns a text prompt into synthetic music, speech, and/or sound effects. (View Highlight)
  • Behind the news: Most earlier text-to-music generators were designed to produce relatively free-form instrumental compositions rather than songs with structured verses, choruses, and vocals. Released earlier this month, Stable Audio 2 generates instrumental tracks up to three minutes long that have distinct beginnings, middles, and endings. Users can also upload audio tracks and use Stable Audio 2.0 to modify them. (View Highlight)
  • Yes, but: Like text-to-image generators circa last year, current text-to-music models offer little ability to steer their output. They don’t respond consistently to basic musical terminology such as “tempo” and “harmony,” and requesting a generic style like “pop” can summon a variety of subgenres from the last 50 years of popular music. (View Highlight)
  • Why it matters: With the advent of text-to-music models that produce credible songs, audio generation seems primed for a Midjourney moment, when the public realizes that it can produce customized music at the drop of a prompt. Already Udio’s and Suno’s websites are full of whimsical paeans to users’ pets and hobbies. The technology has clear implications for professional performers and producers, who, regrettably, have little choice but to adapt to increasing automation. But for now fans have fun, new toys to play with. (View Highlight)
  • What’s new: Vals.AI, an independent model testing service, developed benchmarks that rank large language models’ performance of tasks associated with income taxes, corporate finance, and contract law; it also maintains a pre-existing legal benchmark. Open AI’s GPT-4 and Anthropic’s Claude 3 Opus did especially well in recent tests. (View Highlight)
  • How it works: Vals AI hosts leaderboards that compare the performance of several popular large language models (LLMs) with respect to accuracy, cost, and speed, along with with analysis of the results. The company worked with independent experts to develop multiple-choice and open-ended questions in industrial domains. The datasets are not publicly available. (View Highlight)
  • Behind the news: Many practitioners in finance and law use LLMs in applications that range from processing documents to predicting interest rates. However, LLM output in such applications requires oversight. In 2023, a New York state judge reprimanded a lawyer for submitting an AI-generated brief that referred to fictitious cases. (View Highlight)
  • Why it matters: Typical AI benchmarks are designed to evaluate general knowledge and cognitive abilities. Many developers would like to measure more directly performance in real-world business contexts, where specialized knowledge may come into play. (View Highlight)
  • We’re thinking: Open benchmarks can benefit from public scrutiny, and they’re available to all developers. However, they can be abused when developers cherry-pick benchmarks on which their models perform especially well. Moreover, they may find their way into training sets, making for unfair comparisons. Independent testing on proprietary benchmarks is one way to address these issues. (View Highlight)
  • We’re thinking: One key to successful implementation of AI in manufacturing is tailoring systems to the unique circumstances of each individual facility. The highly heterogeneous tasks, equipment, and surroundings in different factories mean that one model doesn’t fit all. Developers who can solve this long-tail problem stand to reap rewards. (View Highlight)
  • Video diffusion provides a new basis for generating 3D models. What’s new: Vikram Voleti, Chun-Han Yao, Mark Boss, Varun Jampani, and colleagues at Stability AI produced a method that generates a 3D model from a single image based on Stability’s video diffusion model. You can see its output here. Key insight: The approach known as a Neural Radiance Field (NeRF) learns to create a 3D model from images of the same object shot at various angles. Given a single image of an object, a video diffusion model can learn to generate videos that orbit around it. The frames from such orbital videos give NeRF the information it needs to produce a 3D model. How it works: To generate an image, the authors took one step before and two steps during inference. Before inference: Learn to generate an orbital video. During inference: (i) Train a NeRF model on an orbital video. (ii) Improve the 3D model using diffusion following DreamFusion. (View Highlight)
  • Results: The authors produced 3D models from images of 50 objects in GSO, a 3D object dataset of scanned household items. They compared their 3D models to those produced by other methods including EscherNet, a method that uses an image diffusion model to generate images of an object from different angles that are used to train a pair of vanilla neural networks to produce a 3D model. Evaluated according to Chamfer distance, a measure of the distance between the points on the ground truth and generated 3D models (lower is better), their method achieved .024, while EscherNet achieved .042. (View Highlight)
  • Why it matters: Video diffusion models must generate different views of the same object, so they require a greater understanding of 3D objects than image diffusion models, which need to generate only one view at a time. Upgrading from an image diffusion model to a video diffusion model makes for better 3D object generation. We’re thinking: Building 3D models used to be difficult, but with models like this, it’s becoming less of a mesh. (View Highlight)