Why AI Costs Are Soaring: The Hidden Expenses Behind Training Massive Models

As artificial intelligence (AI) continues to advance, the cost of developing and training large-scale models has become a significant concern. Building today’s largest AI models can cost hundreds of millions of dollars, and projections suggest the figure could reach $1 billion within a few years. While much of this cost is attributed to demand for specialized computing power, particularly Nvidia GPUs, there is another often-overlooked and rising expense: data labeling.

The Cost of Computing Power

To understand the soaring costs of AI, it’s essential to start with the hardware. Training state-of-the-art AI models requires immense computing power, typically provided by Nvidia GPUs. These GPUs, which may cost as much as $30,000 each, are crucial for handling the extensive calculations needed for training large models. Companies often need tens of thousands of these GPUs, driving up the overall expense significantly.
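To make the scale concrete, here is a rough back-of-the-envelope sketch of the hardware bill alone. The $30,000 unit price is the upper-end figure quoted above; the GPU counts are hypothetical values chosen only for illustration, and the function name is made up for this example. Real budgets would also include networking, power, and staffing, which are not modeled here.

```python
# Upper-end per-GPU price cited in the article (USD).
GPU_UNIT_PRICE_USD = 30_000


def gpu_hardware_cost(gpu_count: int, unit_price_usd: int = GPU_UNIT_PRICE_USD) -> int:
    """Return the total purchase cost in USD for a cluster of identical GPUs."""
    return gpu_count * unit_price_usd


if __name__ == "__main__":
    # Hypothetical cluster sizes, not figures reported for any company.
    for gpu_count in (10_000, 50_000):
        cost = gpu_hardware_cost(gpu_count)
        print(f"{gpu_count:,} GPUs: ${cost:,}")
```

Under these assumptions, 10,000 GPUs come to $300 million and 50,000 GPUs come to $1.5 billion, before any other expense is counted.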

The Hidden Expense: Data Labeling

Beyond the hardware, another major cost driver in AI development is data labeling. Data labeling involves annotating datasets with tags or metadata to help AI models recognize and interpret patterns. This process is painstaking and labor-intensive. For example, in the development of self-driving cars, images captured by cameras need to be labeled with terms like “pedestrian,” “truck,” or “stop sign” to train the model effectively.

Data labeling isn’t just a technical necessity; it’s also a growing ethical concern. After the release of ChatGPT in 2022, OpenAI faced criticism for outsourcing data labeling to workers in Kenya who were paid less than $2 per hour. This incident highlighted the ethical implications and potential exploitation in the data labeling industry.

The Complexities of Modern AI Models

Many of today’s AI models, particularly large language models (LLMs), are refined with a technique known as Reinforcement Learning from Human Feedback (RLHF). In this method, human annotators rate or rank the model’s outputs, and that feedback is used to adjust the model. RLHF is costly because it requires ongoing human input to refine the model’s performance.
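One way to picture what annotators produce is shown in the minimal sketch below. It assumes a common setup in which a single human ranking of several model outputs is expanded into (preferred, rejected) pairs that a training pipeline could learn from. The function name and the sample responses are hypothetical, and this is not a description of any particular company's pipeline.

```python
from itertools import combinations


def ranking_to_preference_pairs(ranked_outputs: list[str]) -> list[tuple[str, str]]:
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always the output the annotator ranked higher.
    """
    return list(combinations(ranked_outputs, 2))


if __name__ == "__main__":
    # Hypothetical annotator ranking of three model responses, best first.
    ranking = ["response A", "response B", "response C"]
    for preferred, rejected in ranking_to_preference_pairs(ranking):
        print(f"prefer {preferred!r} over {rejected!r}")
```

Each ranking of n outputs yields n(n-1)/2 pairs, but a person still has to read and judge every output, which is where the labor cost comes from.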

Moreover, the expense of data labeling increases when dealing with specialized data. Labeling data in fields like law, finance, and healthcare often requires expert knowledge. This has led companies to hire high-cost professionals such as doctors, lawyers, and PhDs to ensure the accuracy of the labeled data. Outsourcing to third-party firms like Scale AI, which recently secured $1 billion in funding, is another option, though it comes with its own high costs.

William Falcon, CEO of AI development platform Lightning AI, notes, “You now need a lawyer to label stuff, [which is] a crazy use of legal hours.” He emphasizes that expert-level labeling is crucial for high-stakes applications, such as legal advice, where precision is paramount.

Budget Strains for Startups

The rising cost of data labeling poses significant challenges for tech startups, particularly those operating in high-stakes areas like healthcare. Neal Shah, CEO of CareYaya, a platform for elder caregivers, reveals that data labeling costs for their AI caregiver trainer for dementia patients have increased by 40% over the past year. The specialized knowledge required from gerontologists and dementia experts drives these costs higher. Shah is exploring ways to mitigate these expenses by involving healthcare students and professors in the labeling process.

Innovations in Cost Reduction

In response to the mounting costs, several innovative solutions are emerging. Bob Rogers, CEO of Oii.ai, a data science company specializing in supply chain modeling, points to platforms like BeeKeeper AI, which facilitate cost-sharing among companies by allowing them to collaborate on data and algorithms while keeping their private data secure.

Kjell Carlsson, head of AI strategy at Domino Data Lab, highlights the use of synthetic data as another cost-saving measure. Synthetic data is generated by AI models themselves, which can help automate the data collection and labeling process. For example, biopharma companies are using generative AI to develop synthetic proteins and then conduct experiments based on these AI-generated outputs, creating new training data with labels in the process.

Finding Cost-Effective AI

While data labeling remains a costly and time-intensive aspect of AI development, its importance cannot be overstated. Properly labeled data is essential for training accurate and effective AI models, and the potential benefits of well-trained AI systems can be immense. As Neal Shah of CareYaya puts it, “Data labeling’s a beast, but the potential payoff is massive.”

The soaring costs associated with AI development are driven by a combination of expensive computing power and the often-overlooked expense of data labeling. As the industry continues to evolve, finding cost-effective solutions and innovations will be key to sustaining the growth and advancement of AI technology.
