# DataEngine

DataEngine, the Next-Generation Annotator Language Model, is a large model dedicated to data pre-labeling, which distinguishes it from traditional general-purpose or vertical-domain large models. It pre-labels text and image data, generating multiple pre-labeled results so that users can vote for and select the best one. The winning result is then fed back as a training data source, making DataEngine's pre-labeling progressively more accurate.
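The exact selection mechanism is not documented here; a minimal sketch, assuming a simple majority vote over the candidate pre-labeled results (the function name and vote format are illustrative assumptions):

```python
from collections import Counter

def select_best_annotation(candidates, votes):
    """Pick the winning pre-labeled result by simple majority vote.

    candidates: pre-labeled results generated by the model for one sample.
    votes: one index into `candidates` per user vote.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    winner_index, _ = Counter(votes).most_common(1)[0]
    return candidates[winner_index]

# Three candidate labels for one text sample; five user votes.
labels = ["positive", "negative", "neutral"]
best = select_best_annotation(labels, votes=[0, 0, 1, 0, 2])
# `best` is "positive": index 0 received the most votes.
```

The selected result can then be appended to the training corpus for the next fine-tuning round.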

DataEngine is a state-of-the-art language-model annotator designed to transform the landscape of data annotation. It is built on a standard autoregressive, decoder-only Transformer architecture. This architecture lets DataEngine process and generate text efficiently, making it exceptionally powerful for a wide range of natural language processing (NLP) tasks.

<figure><img src="https://2886267311-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FyP4CqYO2HuHwTXo4I4R0%2Fuploads%2F7A5Dm4phCPvwhdqB4fKM%2F%E6%88%AA%E5%B1%8F2024-08-11%2008.53.35.png?alt=media&#x26;token=48f777ed-560b-448c-bf3c-10520db7df06" alt="" width="304"><figcaption><p>Figure 1. DataEngine’s model architecture is a Transformer-based decoder-only autoregressive model.</p></figcaption></figure>

#### Optimized by a Modified Instruction Tuning Strategy

What sets DataEngine apart is its innovative optimization through a modified instruction tuning strategy. This strategy enhances the model's ability to perform annotations by following a three-step algorithm that leverages statistical voting results. Specifically, DataEngine is trained on LLM-annotated data in conjunction with user-provided annotations, which helps refine its performance and accuracy over time.

<figure><img src="https://2886267311-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FyP4CqYO2HuHwTXo4I4R0%2Fuploads%2FnFVuv3R08ryc619xx8VR%2FFlow%202.png?alt=media&#x26;token=823bc871-b464-4172-b936-696c357796ca" alt="" width="563"><figcaption><p>Figure 2. The three-step algorithm to train an LLM to be a data annotator.</p></figcaption></figure>

At the heart of DataEngine's instruction tuning strategy is a three-step algorithm, visualized in the flowchart. This algorithm enables DataEngine to integrate user feedback effectively, improving its annotation capabilities with each iteration. Here’s how the process works:

1\. SFT: First, we collect open-source annotated data to train our LLM in a supervised manner.

2\. RM Training: A statistical voting system is employed to evaluate annotations produced by a variety of open-source LLMs. The annotations are then ranked according to the statistics of the human votes, and this ranking is used to train the reward model.

3\. RLHF: DataEngine then uses this feedback to fine-tune its annotation abilities, ensuring that it continues to learn and adapt to user needs.
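The ranking produced in step 2 has to be turned into training signal for the reward model. A hedged sketch of one common way to do this, converting vote counts into (chosen, rejected) preference pairs (the function, field names, and pairing scheme are assumptions, not DataEngine's documented format):

```python
from itertools import combinations

def preference_pairs(annotations, vote_counts):
    """Turn vote statistics into (chosen, rejected) pairs for RM training.

    annotations: candidate annotations from different open-source LLMs.
    vote_counts: votes each annotation received, same order as `annotations`.
    """
    ranked = sorted(zip(annotations, vote_counts), key=lambda p: p[1], reverse=True)
    pairs = []
    for (hi, hi_votes), (lo, lo_votes) in combinations(ranked, 2):
        if hi_votes > lo_votes:  # skip ties: no preference signal
            pairs.append({"chosen": hi, "rejected": lo})
    return pairs

pairs = preference_pairs(["A", "B", "C"], vote_counts=[5, 2, 2])
# Produces [{"chosen": "A", "rejected": "B"}, {"chosen": "A", "rejected": "C"}];
# B and C are tied, so no pair is emitted between them.
```

Each pair tells the reward model which of two annotations human voters preferred, which is the signal the RLHF step in (3) optimizes against.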

#### Empowering Data Annotation

DataEngine is fine-tuned specifically for annotating data across various NLP tasks, making it a versatile tool for researchers and developers. It excels at handling data from open-source datasets, ensuring that your annotation process is both accurate and efficient. DataEngine also supports annotating crowd-sourced data, providing a robust solution for projects that require the integration of diverse data sources.

With DataEngine, we are not just offering a tool—we are empowering the future of data annotation, streamlining workflows, and setting new standards for accuracy and efficiency in NLP projects.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.dedata.io/product-guides/dataengine.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
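Since the question is passed as a query parameter, it must be URL-encoded. A minimal sketch of building such a request URL with the Python standard library (the helper name and example question are illustrative; any HTTP client can then perform the GET):

```python
from urllib.parse import quote

def build_ask_url(page_url, question):
    """Append a URL-encoded `ask` query parameter to a documentation page URL."""
    return f"{page_url}?ask={quote(question)}"

url = build_ask_url(
    "https://docs.dedata.io/product-guides/dataengine.md",
    "How does DataEngine rank candidate annotations?",
)
# Spaces become %20 and the trailing "?" becomes %3F in the encoded question.
```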

Use this mechanism when the answer is not explicitly present in the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
