💡What we do
Background
The AI data labeling, collection, and trading industry is crucial for training machine learning models and is growing rapidly as demand for high-quality datasets rises. In the web2 world, key players like Scale AI, Appen, and Lionbridge dominate the market, using a mix of human annotators and automated tools to label images, text, audio, and video.
Crowdsourcing is widely used, leveraging a global workforce for large-scale tasks. While cost-effective and scalable, it makes consistency and quality harder to maintain, which is addressed through rigorous quality control measures such as multiple reviews and automated validation.
Automation is a growing trend, with AI-powered tools enhancing speed and reducing costs. However, human oversight remains essential for complex tasks. The industry also faces ethical and logistical challenges, including fair wages, data privacy, and managing a diverse workforce.
Current AI data labeling faces several challenges:
For AI Entities:
High Costs: For AI entities to obtain high-quality data annotations, they must either build their own annotation team or outsource the work to third-party annotation companies. This involves significant costs related to business cooperation, personnel management, and the establishment of companies across borders. Additionally, demand for annotations fluctuates: during peak times a large number of annotators must be hired, while during low-demand periods layoffs become necessary. Overall, this keeps annotation costs consistently high.
Poor Annotation Quality: Currently, annotated data needs to undergo multiple rounds of review. However, due to limitations in traditional review models, increasing the number of reviewers requires hiring more annotators, which involves high marginal costs. As a result, the number of reviews in traditional models is limited, leading to poor annotation quality.
Poor Scalability: If an AI entity wishes to expand its business into more countries or languages, it needs to recruit teams fluent in the relevant languages or renegotiate outsourcing contracts. The lengthy business cycles involved severely impact the speed of business expansion, placing the company at a disadvantage in competition with other AI companies.
For Annotators:
Low Income: From the time raw data leaves the AI company until it reaches the annotator, it incurs significant costs. The income eventually received by annotators is minimal, with those working in Asia and Africa earning just a few dollars a day. This is a common criticism faced by many Web2 companies, as they exploit annotators to control annotation costs.
High Costs of Securing Work: Annotators face long and cumbersome processes to secure annotation work, including job searching and trial periods. Once employed, they often encounter poor working conditions in developing countries, including delayed payments and malicious pay cuts, making it difficult for many to participate in annotation tasks. Overall, the cost of securing annotation work is very high for annotators.
Restricted Payment Methods: In the Web2 model, annotators receive tasks through crowdsourcing platforms, but to get paid, they must link a bank card. Moreover, these platforms do not support the currencies of most countries, preventing many annotators from participating in data annotation tasks.
Our AI data labeling Solution
The web3 industry has developed for over a decade, with both the industry and technology maturing steadily. We leverage the advantages of web3 technology to transform the existing data labeling industry.
DeData wants to redefine the data annotation paradigm, build a trusted data foundation layer for decentralized AI, and provide trusted data support for decentralized AI networks and computation. DeData also allows users to verify the original data and annotations to ensure that the data has not been altered, so that it can be used by AI blockchains to train more powerful AI models.
LTE Engine: We enable any web3 user to participate in data labeling. Web3 users do not need to join any corporate entity to label and earn rewards. The platform uses a decentralized reward and penalty engine to incentivize high-quality data labeling. The final labeling result is determined based on the principle of majority rule.
Pre-Label Engine: We are training a dedicated pre-annotation AI model for the DeData platform based on labeled data. Before raw data enters the platform for annotation, it can be pre-annotated using this model. This approach combines the advantages of manual annotation and model annotation, with the aim of producing higher-quality annotated data more efficiently than either approach alone.
Dual Incentives: Web3 users can earn bounties from AI companies, labs, and non-profit organizations, as well as token rewards from the platform. For web3 users in developing countries, this allows them to earn double compensation, encouraging more web2 users to join the web3 data labeling industry.
Multiple checkers: Traditional web2 data labeling is limited by the cost of reviewers. Each additional reviewer increases marginal costs, making multiple reviews infeasible and limiting data quality. In addition to the platform's proprietary reward and penalty mechanism, DeData supports a review mechanism of up to 100 people, improving accuracy through human voting review.
Collect data to earn: DeData has developed a data marketplace platform called Databazaar, enabling users to earn money by collecting online and offline data for AI entities. At the same time, AI companies can share labeled data with other AI companies to maximize the value of the data and generate additional revenue.
Transparency and security: On the DeData platform, users' task and points data are uploaded to the blockchain. Following the TGE (token generation event), the exchange rate between points and tokens will be displayed in real time, allowing users to receive dual income rewards at any time. This system ensures transparency and prevents issues like wage withholding or inaccurate payments, which can occur in less developed regions.
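The majority-rule settlement described for the LTE Engine and the multi-reviewer voting can be illustrated with a minimal sketch. This is not DeData's actual engine: the function name settle_task, the reward and penalty amounts, and the choice to require a strict majority (more than half of the votes) are all assumptions made for illustration.

```python
from collections import Counter


def settle_task(votes, reward=1.0, penalty=0.5):
    """Settle one labeling task by strict majority vote.

    votes:   dict mapping a (hypothetical) annotator id to the label they chose.
    reward:  assumed payout for agreeing with the majority label.
    penalty: assumed deduction for disagreeing with it.

    Returns (winning_label, payouts). If no label is chosen by more than
    half of the voters, returns (None, {}) and nobody is paid or penalized.
    """
    if not votes:
        return None, {}

    # Count how many annotators chose each label and take the most common one.
    label, count = Counter(votes.values()).most_common(1)[0]

    # Strict majority: the top label must have more than half of all votes.
    if count * 2 <= len(votes):
        return None, {}

    payouts = {
        annotator: (reward if chosen == label else -penalty)
        for annotator, chosen in votes.items()
    }
    return label, payouts
```

For example, with votes {"a": "cat", "b": "cat", "c": "dog"}, the sketch settles on "cat", credits annotators a and b, and deducts the penalty from c. A production system would also need a rule for unsettled tasks, such as escalating them to additional reviewers.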
Current AI data trading faces several challenges:
Lack of a Trusted Third-Party Platform: Traditional web2 platforms store data on centralized servers, enabling platform owners to download and privately sell user-uploaded data without consent. This undermines trust between data owners and platforms, discouraging data owners from engaging in transactions, making it difficult to establish a trustworthy third-party platform.
Weak Copyright Protection: After data is labeled by traditional teams, even if buyers take precautions to prevent unauthorized sales, labeling teams often sell the data privately, resulting in losses for the buyers. These losses involve multiple steps in the process, making it difficult to track responsible individuals. Additionally, teams who purchase labeled data privately struggle to verify the data's ownership, which could lead to copyright infringement and legal risks.
Unreliable Data Sampling: Before purchasing data, buyers typically rely on sampling to verify the quality. However, data owners may deceive buyers by providing high-quality samples, while the full dataset being pledged is of inconsistent quality. This results in buyers receiving lower-quality data than expected.
Our AI data trading Solution:
Privacy-Encrypted Storage: All data is stored in a decentralized manner and encrypted using zk (zero-knowledge) technology. Only users holding the relevant NFTs can access the data, while the DeData platform itself has no access. This ensures that the data is immutable and tamper-proof.
Copyright Deposit: Before selling data, data owners must provide a deposit. This deposit is held on the platform for a period after the sale is completed. If no copyright disputes arise during this time, the deposit is returned to the data owner. In the event of a complaint, the platform will initiate an investigation process. If the data owner is found not to hold the original copyright, the deposit will be forfeited.
Random Sampling Algorithm: The platform uses a random sampling algorithm to ensure the randomness and validity of samples. When buyers check data via sampling, the results carry statistical significance, ensuring that buyers are not deceived by selective high-quality samples.
Royalty Rights: Data owners are entitled to royalties from any future revenue generated by the data. They can also decide whether the buyer is allowed to resell the data and set the terms for royalty sharing on any revenue generated from resale.
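The random sampling idea above can be sketched as follows. The key property is that the data owner does not get to pick which records the buyer inspects. This is only an illustration under stated assumptions, not the platform's algorithm: the function name sample_indices, the buyer_nonce and dataset_id parameters, and the use of a SHA-256 hash to derive the seed are all hypothetical. Python's random module is also not cryptographically secure, so a real deployment would need a stronger source of randomness.

```python
import hashlib
import random


def sample_indices(dataset_size, sample_size, buyer_nonce, dataset_id):
    """Pick which record indices a buyer gets to inspect.

    The seed is derived from a buyer-supplied nonce and the dataset id
    (both hypothetical inputs), so the seller cannot choose the sample,
    and anyone with the same inputs can reproduce and audit the selection.
    """
    if not 0 <= sample_size <= dataset_size:
        raise ValueError("sample_size must be between 0 and dataset_size")

    # Derive a deterministic integer seed from the two inputs.
    digest = hashlib.sha256(f"{dataset_id}:{buyer_nonce}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))

    # Draw distinct indices uniformly at random, without replacement.
    return sorted(rng.sample(range(dataset_size), sample_size))
```

Calling sample_indices(1000, 10, "nonce-1", "ds-1") twice returns the same ten indices, which is what lets a buyer or an auditor confirm that the inspected records were not hand-picked.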