# DataEngine

DataEngine, the next-generation annotator language model, is a large model dedicated to data pre-labeling, distinct from traditional general-purpose or vertical-domain large models. It pre-labels text and image data, generating multiple candidate pre-labeled results so that users can vote and select the best one. The selected result is then fed back as a training data source, making DataEngine's pre-labeling progressively more accurate.
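The vote-and-select step above can be sketched as simple majority voting over candidate labels. The helper below (`select_best_label` is an illustrative name, not DataEngine's actual API) picks whichever label the most candidates agree on:

```python
from collections import Counter

def select_best_label(candidates):
    """Return the label that the most candidate annotations agree on.

    `candidates` is a list of label strings produced by repeated
    pre-labeling runs; ties are broken by first appearance.
    """
    if not candidates:
        raise ValueError("no candidate labels given")
    label, _count = Counter(candidates).most_common(1)[0]
    return label

# Five pre-labeled results for the same item; three voters chose "positive".
votes = ["positive", "positive", "neutral", "positive", "negative"]
best = select_best_label(votes)  # "positive"
```

In practice the winning label would also be stored as new training data, which is how the feedback loop described above tightens over time.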

DataEngine is a state-of-the-art language-model annotator designed to transform the landscape of data annotation. It is built on a standard autoregressive, decoder-only Transformer architecture, which lets it efficiently process and generate text, making it powerful across a wide range of natural language processing (NLP) tasks.
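As a rough intuition for decoder-only autoregressive generation, the sketch below uses a toy bigram table in place of the real Transformer: at each step the model conditions on the tokens generated so far and emits the most likely next token. The vocabulary and probabilities here are invented purely for illustration:

```python
# Toy next-token distributions standing in for the decoder-only model:
# for each previous token, a probability over possible next tokens.
BIGRAMS = {
    "<s>": {"label": 0.9, "the": 0.1},
    "label": {":": 1.0},
    ":": {"positive": 0.6, "negative": 0.4},
}

def generate_greedy(max_tokens=3):
    """Autoregressive greedy decoding: repeatedly pick the most
    probable next token given the sequence produced so far."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAMS.get(tokens[-1])
        if dist is None:
            break  # no continuation known for this token
        tokens.append(max(dist, key=dist.get))
    return tokens[1:]  # drop the start-of-sequence marker

print(generate_greedy())  # ['label', ':', 'positive']
```

A real Transformer replaces the lookup table with attention layers over the whole prefix, but the generation loop is the same shape.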

<figure><img src="https://2886267311-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FyP4CqYO2HuHwTXo4I4R0%2Fuploads%2F7A5Dm4phCPvwhdqB4fKM%2F%E6%88%AA%E5%B1%8F2024-08-11%2008.53.35.png?alt=media&#x26;token=48f777ed-560b-448c-bf3c-10520db7df06" alt="" width="304"><figcaption><p>Figure 1. DataEngine’s model architecture is a Transformer-based decoder-only autoregressive model.</p></figcaption></figure>

#### Optimized by a Modified Instruction Tuning Strategy

What sets DataEngine apart is its innovative optimization through a modified instruction tuning strategy. This strategy enhances the model's ability to perform annotations by following a three-step algorithm that leverages statistical voting results. Specifically, DataEngine is trained on LLM-annotated data in conjunction with user-provided annotations, which helps refine its performance and accuracy over time.

<figure><img src="https://2886267311-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FyP4CqYO2HuHwTXo4I4R0%2Fuploads%2FnFVuv3R08ryc619xx8VR%2FFlow%202.png?alt=media&#x26;token=823bc871-b464-4172-b936-696c357796ca" alt="" width="563"><figcaption><p>Figure 2. The three-step algorithm to train an LLM to be a data annotator.</p></figcaption></figure>

At the heart of DataEngine's instruction tuning strategy is a three-step algorithm, visualized in the flowchart. This algorithm enables DataEngine to integrate user feedback effectively, improving its annotation capabilities with each iteration. Here’s how the process works:


1\. SFT: First, we collect open-source annotated data and train our LLM on it in a supervised manner.

2\. RM Training: A statistical voting system evaluates annotations produced by a variety of open-source LLMs, and the annotations are ranked according to the aggregated votes of human voters.

3\. RLHF: Reinforcement learning from human feedback then uses this ranking signal to fine-tune DataEngine's annotation abilities, ensuring that it continues to learn and adapt to user needs.
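One common way a voting-derived ranking (step 2) is turned into reward-model training data is by expanding the ranked list into pairwise preferences, where every higher-ranked annotation is preferred over every lower-ranked one. This is a hypothetical sketch of that expansion, not DataEngine's published pipeline:

```python
from itertools import combinations

def preference_pairs(ranked_annotations):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs
    suitable for training a reward model on pairwise comparisons."""
    return list(combinations(ranked_annotations, 2))

# Three candidate annotations, ranked by human votes (A judged best).
ranked = ["annotation_A", "annotation_B", "annotation_C"]
pairs = preference_pairs(ranked)
# [('annotation_A', 'annotation_B'),
#  ('annotation_A', 'annotation_C'),
#  ('annotation_B', 'annotation_C')]
```

A reward model trained on such pairs learns to score the preferred annotation higher, and that score is what the RLHF step (step 3) optimizes against.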

#### Empowering Data Annotation

DataEngine is fine-tuned specifically for annotating data across various NLP tasks, making it a versatile tool for researchers and developers. It excels in handling data from open-source datasets, ensuring that your annotation process is both accurate and efficient. Additionally, DataEngine welcomes the annotation of crowd-sourced data, providing a robust solution for projects that require the integration of diverse data sources.

With DataEngine, we are not just offering a tool—we are empowering the future of data annotation, streamlining workflows, and setting new standards for accuracy and efficiency in NLP projects.
