🛞DataEngine

DataEngine, the Next-Generation Annotator Language Model, is a large model dedicated to data pre-labeling, unlike traditional general-purpose or vertical-domain large models. It pre-labels text and image data and generates multiple candidate labels for each item so that users can vote for the best one. The winning labels are then used as training data for further rounds, making DataEngine's pre-labeling more accurate over time.
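As a rough illustration of the vote-and-select step, the sketch below tallies user votes over a list of candidate labels and returns the winner with its vote share. The function name `select_best_label`, the vote format (a list of candidate indices), and the tie-breaking rule are assumptions made for this example, not DataEngine's actual interface.

```python
from collections import Counter


def select_best_label(candidates, votes):
    """Pick the candidate label with the most votes.

    candidates: list of candidate labels for one data item (hypothetical format).
    votes: list of integer indices into `candidates`, one per voter.
    Returns (best_label, vote_share).
    """
    if not candidates:
        raise ValueError("need at least one candidate label")
    # Ignore votes that do not point at a valid candidate.
    tally = Counter(v for v in votes if 0 <= v < len(candidates))
    if not tally:
        # No valid votes: fall back to the first candidate (an assumption).
        return candidates[0], 0.0
    # Most votes wins; ties go to the lowest index (an assumption).
    best_index = min(tally, key=lambda i: (-tally[i], i))
    share = tally[best_index] / sum(tally.values())
    return candidates[best_index], share


label, share = select_best_label(["positive", "negative", "neutral"], [0, 1, 0, 2, 0])
print(label, share)
```

The vote share is returned alongside the label so that low-agreement items can be filtered out before they are reused as training data, if that is desired.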

DataEngine is a language model annotator designed to transform the landscape of data annotation. It is built on a standard autoregressive, decoder-only Transformer architecture, which lets it process and generate text efficiently and makes it applicable to a wide range of natural language processing (NLP) tasks.

Optimized by a Modified Instruction Tuning Strategy

What sets DataEngine apart is its innovative optimization through a modified instruction tuning strategy. This strategy enhances the model's ability to perform annotations by following a three-step algorithm that leverages statistical voting results. Specifically, DataEngine is trained on LLM-annotated data in conjunction with user-provided annotations, which helps refine its performance and accuracy over time.

At the heart of DataEngine's instruction tuning strategy is a three-step algorithm, visualized in the flowchart. This algorithm enables DataEngine to integrate user feedback effectively, improving its annotation capabilities with each iteration. Here’s how the process works:

1. SFT (supervised fine-tuning): First, we collect open-source annotated data and use it to train our LLM in a supervised manner.

2. RM (reward model) training: A statistical voting system evaluates annotations produced by a variety of open-source LLMs. Human voters vote on these annotations, and the annotations are ranked by the resulting vote statistics.

3. RLHF (reinforcement learning from human feedback): DataEngine then uses this feedback to fine-tune its annotation abilities, so that it continues to learn and adapt to user needs.
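The source does not say how the vote rankings feed the reward model. One common approach, shown here only as a minimal sketch under that assumption, is to turn vote counts into (chosen, rejected) pairs and train the reward model with a pairwise Bradley-Terry style loss. The names `preference_pairs` and `pairwise_loss` and the vote-count dictionary format are hypothetical.

```python
import math
from itertools import combinations


def preference_pairs(vote_counts):
    """Turn vote counts into (chosen, rejected) annotation pairs.

    vote_counts: dict mapping an annotation to its number of votes
    (hypothetical format). Tied annotations produce no pair.
    """
    pairs = []
    for a, b in combinations(vote_counts, 2):
        if vote_counts[a] == vote_counts[b]:
            continue  # a tie carries no preference signal
        if vote_counts[a] > vote_counts[b]:
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs


def pairwise_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == softplus(-m) == max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


pairs = preference_pairs({"label A": 5, "label B": 2, "label C": 2})
print(pairs)
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

The loss is small when the reward model already scores the higher-voted annotation above the lower-voted one, and large when the order is reversed, which is the signal a reward model would need before the RLHF step.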

Empowering Data Annotation

DataEngine is fine-tuned specifically for annotating data across various NLP tasks, making it a versatile tool for researchers and developers. It is designed to handle data from open-source datasets accurately and efficiently. DataEngine also supports annotating crowd-sourced data, which suits projects that need to integrate diverse data sources.

With DataEngine, we are not just offering a tool; we aim to streamline annotation workflows and set new standards for accuracy and efficiency in NLP projects.
