As the field of artificial intelligence (AI) continues to advance, the financial burden of developing and training large-scale AI models has become a significant concern. Building today’s massive AI models can cost hundreds of millions of dollars, with projections suggesting that these expenses could reach $1 billion within a few years. While much of this cost is attributed to the high demand for specialized computing power, particularly Nvidia GPUs, there is another often-overlooked and rising expense: data labeling.
The Cost of Computing Power
To understand the soaring costs of AI, it’s essential to start with the hardware. Training state-of-the-art AI models requires immense computing power, typically provided by Nvidia GPUs. These GPUs, which may cost as much as $30,000 each, are crucial for handling the extensive calculations needed for training large models. Companies often need tens of thousands of these GPUs, driving up the overall expense significantly.
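To make the scale concrete, here is a back-of-the-envelope sketch in Python. It uses only the article's figure of up to $30,000 per GPU; the fleet sizes are illustrative assumptions standing in for "tens of thousands," and the estimate ignores power, networking, and staffing.

```python
def gpu_hardware_cost(num_gpus: int, price_per_gpu_usd: float) -> float:
    """Rough hardware-only estimate: number of GPUs times unit price."""
    return num_gpus * price_per_gpu_usd


PRICE_PER_GPU_USD = 30_000  # upper-end price cited in the article

# Fleet sizes below are assumptions chosen to illustrate "tens of thousands".
for num_gpus in (10_000, 20_000, 30_000):
    cost = gpu_hardware_cost(num_gpus, PRICE_PER_GPU_USD)
    print(f"{num_gpus:,} GPUs -> ${cost / 1e6:,.0f} million")
```

Even at the low end of 10,000 GPUs, the hardware alone comes to $300 million, which is in line with the hundreds of millions of dollars cited for today's largest models.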
The Hidden Expense: Data Labeling
Beyond the hardware, another major cost driver in AI development is data labeling. Data labeling involves annotating datasets with tags or metadata to help AI models recognize and interpret patterns. This process is painstaking and labor-intensive. For example, in the development of self-driving cars, images captured by cameras need to be labeled with terms like “pedestrian,” “truck,” or “stop sign” to train the model effectively.
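As a rough illustration of what a labeled example might look like, the Python sketch below pairs a camera frame with a tag and a bounding box. The record layout, frame identifiers, and pixel coordinates are hypothetical and not taken from any particular labeling tool; only the three label names come from the example above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledBox:
    image_id: str                    # hypothetical camera frame identifier
    label: str                       # tag assigned by a human annotator
    box: tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


# Labels from the self-driving example in the text.
ALLOWED_LABELS = {"pedestrian", "truck", "stop sign"}


def invalid_annotations(annotations):
    """Return annotations whose label is outside the agreed vocabulary."""
    return [a for a in annotations if a.label not in ALLOWED_LABELS]


# Made-up annotations for illustration only.
annotations = [
    LabeledBox("frame_0001", "pedestrian", (412, 220, 470, 380)),
    LabeledBox("frame_0001", "stop sign", (900, 100, 960, 160)),
    LabeledBox("frame_0002", "truck", (150, 240, 620, 540)),
]

label_counts = Counter(a.label for a in annotations)
print(label_counts)
print("Invalid:", invalid_annotations(annotations))
```

Every record like this has to be drawn and checked by a person, which is why the work scales with the size of the dataset rather than with the hardware budget.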
Data labeling isn’t just a technical necessity; it’s also a growing ethical concern. After the release of ChatGPT in 2022, OpenAI faced criticism for outsourcing data labeling to workers in Kenya who were paid less than $2 per hour. This incident highlighted the ethical implications and potential exploitation in the data labeling industry.
The Complexities of Modern AI Models
Today’s AI models, particularly large language models (LLMs), are commonly refined with a technique known as Reinforcement Learning from Human Feedback (RLHF). In this method, human annotators rate or rank the model’s outputs, and that feedback is used to refine the model’s performance. The costs associated with RLHF are substantial because the process depends on continuous human intervention.
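One common way to prepare this kind of feedback for training, sketched below in Python, is to turn an annotator's ranking of several responses into pairwise "chosen versus rejected" comparisons. This is an illustrative assumption about the data format, not a description of any specific company's pipeline; the function name, field names, and sample prompt are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand one human ranking (ordered best first) into preference pairs.

    combinations() keeps the input order, so the first element of each pair
    is always the response the annotator ranked higher.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


# Hypothetical annotation: an annotator ranked three model outputs.
pairs = ranking_to_pairs(
    "Summarize the contract clause.",
    ["Response A", "Response B", "Response C"],
)

for pair in pairs:
    print(pair["chosen"], "preferred over", pair["rejected"])
```

A single ranking of three responses yields three comparisons, and every one of them traces back to a human judgment, which helps explain why this stage is expensive.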
Moreover, the expense of data labeling increases when dealing with specialized data. Labeling data in fields such as law, finance, and healthcare often requires expert knowledge. This has led companies to hire high-cost professionals such as doctors, lawyers, and PhDs to ensure the accuracy of the labeled data. Outsourcing to third-party firms like Scale AI, which recently secured $1 billion in funding, is another option, though it comes with its own high costs.
William Falcon, CEO of AI development platform Lightning AI, notes, “You now need a lawyer to label stuff, [which is] a crazy use of legal hours.” He emphasizes that expert-level labeling is crucial for high-stakes applications, such as legal advice, where precision is paramount.
Budget Strains for Startups
The rising cost of data labeling poses significant challenges for tech startups, particularly those operating in high-stakes areas like healthcare. Neal Shah, CEO of CareYaya, a platform for elder caregivers, reveals that data labeling costs for their AI caregiver trainer for dementia patients have increased by 40% over the past year. The specialized knowledge required from gerontologists and dementia experts drives these costs higher. Shah is exploring ways to mitigate these expenses by involving healthcare students and professors in the labeling process.
Innovations in Cost Reduction
In response to the mounting costs, several innovative solutions are emerging. Bob Rogers, CEO of Oii.ai, a data science company specializing in supply chain modeling, points to platforms like BeeKeeper AI, which facilitate cost-sharing among companies by allowing them to collaborate on data and algorithms while keeping their private data secure.
Kjell Carlsson, head of AI strategy at Domino Data Lab, highlights the use of synthetic data as another cost-saving measure. Synthetic data is generated by AI models themselves, which can help automate the data collection and labeling process. For example, biopharma companies are using generative AI to develop synthetic proteins and then conduct experiments based on these AI-generated outputs, creating new training data with labels in the process.
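The appeal of synthetic data is that the label can be known at the moment the example is created. The toy Python sketch below illustrates that idea with simple text templates standing in for a generative model; the categories, templates, and item names are invented for illustration and have nothing to do with the biopharma workflow described above.

```python
import random

# Hypothetical templates: each label maps to sentences that express it.
TEMPLATES = {
    "billing": [
        "I was charged twice for my {item}.",
        "Why is the invoice for my {item} so high?",
    ],
    "shipping": [
        "My {item} has not arrived yet.",
        "Where is the package with my {item}?",
    ],
}
ITEMS = ["laptop", "headphones", "keyboard"]


def generate_synthetic_examples(n, seed=0):
    """Generate n labeled examples; the label is known by construction."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    examples = []
    for _ in range(n):
        label = rng.choice(sorted(TEMPLATES))
        text = rng.choice(TEMPLATES[label]).format(item=rng.choice(ITEMS))
        examples.append({"text": text, "label": label})
    return examples


for example in generate_synthetic_examples(4):
    print(example["label"], "|", example["text"])
```

No human annotator is involved in this loop, which is where the savings come from. In practice, the quality of the generator determines whether the resulting data is worth training on.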
Finding Cost-Effective AI
While data labeling remains a costly and time-intensive aspect of AI development, its importance cannot be overstated. Properly labeled data is essential for training accurate and effective AI models, and the potential benefits of well-trained AI systems can be immense. As Neal Shah of CareYaya puts it, “Data labeling’s a beast, but the potential payoff is massive.”
The soaring costs associated with AI development are driven by a combination of expensive computing power and the often-overlooked expense of data labeling. As the industry continues to evolve, finding cost-effective solutions and innovations will be key to sustaining the growth and advancement of AI technology.