EasyNLP is a Comprehensive and Easy-to-use NLP Toolkit
EasyNLP Chinese Introduction
EasyNLP is an easy-to-use NLP development and application toolkit in PyTorch, first released inside Alibaba in 2021. It is built with scalable distributed training strategies and supports a comprehensive suite of NLP algorithms for various NLP applications. EasyNLP integrates knowledge distillation and few-shot learning for landing large pre-trained models, together with various popular multi-modality pre-trained models. It provides a unified framework of model training, inference, and deployment for real-world applications. It has powered more than 10 BUs and more than 20 business scenarios within the Alibaba Group. It is seamlessly integrated to Platform of AI (PAI) products, including PAI-DSW for development, PAI-DLC for cloud-native training, PAI-EAS for serving, and PAI-Designer for zero-code model training.
- Easy to use and highly customizable: In addition to providing easy-to-use and concise commands to call cutting-edge models, it also abstracts custom modules such as AppZoo and ModelZoo to make it easy to build NLP applications. It is equipped with the PAI PyTorch distributed training framework TorchAccelerator to speed up distributed training.
- Compatible with open-source libraries: EasyNLP has APIs to support the training of models from Huggingface/Transformers with the PAI distributed framework. It also supports the pre-trained models in EasyTransfer ModelZoo.
- Knowledge-injected pre-training: The PAI team has conducted extensive research on knowledge-injected pre-training and has built a knowledge-injected model that won first place in the CCF knowledge pre-training competition. EasyNLP integrates these cutting-edge knowledge pre-trained models, including DKPLM and KGBERT.
- Landing large pre-trained models: EasyNLP provides few-shot learning capabilities, allowing users to finetune large models with only a few samples to achieve good results. At the same time, it provides knowledge distillation functions to quickly distill large models into small and efficient ones to facilitate online deployment.
- Multi-modality pre-trained models: EasyNLP is not about NLP only. It also supports various popular multi-modality pre-trained models for vision-language tasks that require visual knowledge. For example, it is equipped with CLIP-style models for text-image matching and DALLE-style models for text-to-image generation.
We have a series of technical articles on the functionalities of EasyNLP.
- BeautifulPrompt: PAI releases its self-developed prompt beautifier, empowering AIGC to produce beautiful images in one click
- PAI-Diffusion Chinese models fully upgraded: one-click generation of massive high-resolution artistic images
- EasyNLP integrates the K-Global Pointer algorithm to support Chinese information extraction
- Alibaba Cloud PAI-Diffusion upgraded again: full-pipeline support for model tuning, with average inference speed improved by over 75%
- PAI-Diffusion models are here! The Alibaba Cloud machine learning team takes you on a tour of the ocean of Chinese art
- Model accuracy improved again: the unified cross-task few-shot learning algorithm UPT provides a solution!
- What new sparks can span extraction and meta-learning strike? Few-shot named entity recognition tells you!
- The KECP algorithm is accepted by the top conference EMNLP: machine reading comprehension with very little training data
- When popular text-to-image generation models meet knowledge graphs, AI paintings approach the real world
- EasyNLP releases CKBERT, a Chinese pre-trained model that fuses linguistic and factual knowledge
- EasyNLP walks you through Chinese and English machine reading comprehension
- Cross-modal learning capability upgraded again: EasyNLP sets a new SOTA for e-commerce text-image retrieval
- Playing with text summarization (news headline) generation in EasyNLP
- Landing sparse Chinese GPT large models: a key milestone toward low-cost & high-performance multi-task general natural language understanding
- EasyNLP integrates the K-BERT algorithm, leveraging knowledge graphs for better finetuning
- EasyNLP Chinese text-to-image generation models turn you into an artist in seconds
- Transformer model optimization for long code sequences, improving performance in long-code scenarios
- EasyNLP walks you through CLIP-based image-text retrieval
- Alibaba Cloud Machine Learning PAI open-sources the Chinese NLP algorithm framework EasyNLP, helping land large NLP models
- Champion of the pre-trained knowledge metric competition! Alibaba Cloud PAI releases a knowledge pre-training tool
You can set up from the source:

```bash
$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ python setup.py install
```
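After installation, a quick sanity check (a minimal sketch, assuming the package installs under the module name `easynlp` used throughout the examples below) is:

```python
# Sanity check: confirm that EasyNLP is importable after installation.
# "easynlp" is the module name used by the examples in this README.
import easynlp
print(easynlp.__file__)  # location of the installed package
```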
This repo is tested on Python 3.6 with PyTorch >= 1.8.
Now let's show how to use just a few lines of code to build a text classification model based on BERT.
```python
from easynlp.appzoo import ClassificationDataset
from easynlp.appzoo import get_application_model, get_application_evaluator
from easynlp.core import Trainer
from easynlp.utils import initialize_easynlp, get_args
from easynlp.utils.global_vars import parse_user_defined_parameters
from easynlp.utils import get_pretrain_model_path

# Initialize the distributed environment and parse command-line arguments.
initialize_easynlp()
args = get_args()

# Resolve user-defined parameters, including which pre-trained model to load.
user_defined_parameters = parse_user_defined_parameters(args.user_defined_parameters)
pretrained_model_name_or_path = get_pretrain_model_path(
    user_defined_parameters.get('pretrain_model_name_or_path', None))

# Build the training and validation datasets from the input tables.
train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    data_file=args.tables.split(",")[0],
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    second_sequence=args.second_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    user_defined_parameters=user_defined_parameters,
    is_training=True)

valid_dataset = ClassificationDataset(
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    data_file=args.tables.split(",")[-1],
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    second_sequence=args.second_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    user_defined_parameters=user_defined_parameters,
    is_training=False)

# Instantiate the application model, then train with an evaluator on the dev set.
model = get_application_model(
    app_name=args.app_name,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    num_labels=len(valid_dataset.label_enumerate_values),
    user_defined_parameters=user_defined_parameters)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    user_defined_parameters=user_defined_parameters,
    evaluator=get_application_evaluator(
        app_name=args.app_name,
        valid_dataset=valid_dataset,
        user_defined_parameters=user_defined_parameters,
        eval_batch_size=args.micro_batch_size))

trainer.train()
```
The complete example can be found here.
You can also use AppZoo Command Line Tools to quickly train an App model. Take text classification on the SST-2 dataset as an example. First, download train.tsv and dev.tsv, then start training:
```bash
$ easynlp \
  --mode=train \
  --worker_gpu=1 \
  --tables=train.tsv,dev.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./classification_model \
  --epoch_num=1 \
  --sequence_length=128 \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'
```
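The `--input_schema` flag declares the tab-separated columns of the input files as name:type:length triples. As a minimal illustration of our reading of this flag (the parsing below is illustrative, not EasyNLP's internal code):

```python
# Illustrative parsing of the --input_schema declaration above.
# Each comma-separated field is a name:type:length triple; the names
# give the column order of a tab-separated line in train.tsv/dev.tsv.
schema = "label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1"
columns = [spec.split(":")[0] for spec in schema.split(",")]

line = "0\tid-1\tid-2\tthe movie was dull\t-"  # a hypothetical dev.tsv row
record = dict(zip(columns, line.split("\t")))
print(record["sent1"])  # -> "the movie was dull"
```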
And then predict:
```bash
$ easynlp \
  --mode=predict \
  --tables=dev.tsv \
  --outputs=dev.pred.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --output_schema=predictions,probabilities,logits,output \
  --append_cols=label \
  --first_sequence=sent1 \
  --checkpoint_path=./classification_model \
  --app_name=text_classify
```
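A minimal sketch for inspecting the prediction file written above; the assumption here is that the TSV columns follow `--output_schema` with the `--append_cols` column appended last:

```python
# Hypothetical reader for dev.pred.tsv; the column layout
# (output_schema columns first, appended label last) is an assumption.
import csv

with open("dev.pred.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        predictions, probabilities, logits, output, label = row
        print(predictions, label)
        break  # just show the first row
```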
To learn more about the usage of AppZoo, please refer to our documentation.
EasyNLP currently provides the following models in ModelZoo:
- PAI-BERT-zh (from Alibaba PAI): pre-trained BERT models with a large Chinese corpus.
- DKPLM (from Alibaba PAI): released with the paper DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding by Taolin Zhang, Chengyu Wang, Nan Hu, Minghui Qiu, Chengguang Tang, Xiaofeng He and Jun Huang.
- KGBERT (from Alibaba Damo Academy & PAI): pre-trained BERT models with knowledge graph embeddings injected.
- BERT (from Google): released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
- RoBERTa (from Facebook): released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
- Chinese RoBERTa (from HFL): the Chinese version of RoBERTa.
- MacBERT (from HFL): released with the paper Revisiting Pre-trained Models for Chinese Natural Language Processing by Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang and Guoping Hu.
- WOBERT (from ZhuiyiTechnology): the word-based BERT for the Chinese language.
- FashionBERT (from Alibaba PAI & ICBU): in progress.
- GEEP (from Alibaba PAI): in progress.
- Mengzi (from Langboat): released with the paper Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese by Zhuosheng Zhang, Hanqing Zhang, Keming Chen, Yuhang Guo, Jingyun Hua, Yulong Wang and Ming Zhou.
- Erlangshen (from IDEA): released from the repo.
Please refer to this readme for the usage of these models in EasyNLP. Meanwhile, EasyNLP supports loading pre-trained models from Huggingface/Transformers; please refer to this tutorial for details.
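As a hypothetical sketch reusing the helper from the classification example above (whether a raw Huggingface hub identifier is accepted here is an assumption; the tutorial linked above is authoritative):

```python
# Hypothetical sketch: build an application model on top of a
# Huggingface/Transformers backbone by passing its hub identifier.
# The helper mirrors the classification example above; accepting a raw
# HF identifier here is an assumption, not confirmed EasyNLP behavior.
from easynlp.appzoo import get_application_model

model = get_application_model(
    app_name="text_classify",
    pretrained_model_name_or_path="hfl/chinese-roberta-wwm-ext",  # HF hub name
    num_labels=2,
    user_defined_parameters={})
```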
EasyNLP also supports various popular multi-modality pre-trained models to support vision-language tasks that require visual knowledge. For example, it is equipped with CLIP-style models for text-image matching and DALLE-style models for text-to-image generation.
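As a generic illustration of what CLIP-style text-image matching computes (this uses the open-source CLIP model via Huggingface Transformers, not EasyNLP's own API):

```python
# Generic CLIP-style text-image matching: score an image against captions.
# This illustrates the technique; it is not EasyNLP's own API.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # text-image match probabilities
print(probs)
```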
EasyNLP provides few-shot learning and knowledge distillation to help land large pre-trained models.
- PET (from LMU Munich and Sulzer GmbH): released with the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference by Timo Schick and Hinrich Schutze. We have made some slight modifications to make the algorithm suitable for the Chinese language.
- P-Tuning (from Tsinghua University, Beijing Academy of AI, MIT and Recurrent AI, Ltd.): released with the paper GPT Understands, Too by Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang and Jie Tang. We have made some slight modifications to make the algorithm suitable for the Chinese language.
- CP-Tuning (from Alibaba PAI): released with the paper Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning by Ziyun Xu, Chengyu Wang, Minghui Qiu, Fuli Luo, Runxin Xu, Songfang Huang and Jun Huang.
- Vanilla KD (from Alibaba PAI): distilling the logits of large BERT-style models to smaller ones (see the sketch after this list).
- Meta KD (from Alibaba PAI): released with the paper Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains by Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li and Jun Huang.
- Data Augmentation (from Alibaba PAI): augmenting the data based on the MLM head of pre-trained language models.
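For readers unfamiliar with vanilla logit distillation, the sketch below shows the standard soft-label objective the term refers to; this is a generic PyTorch illustration, not EasyNLP's internal implementation:

```python
# Generic vanilla knowledge distillation loss: KL divergence between the
# temperature-scaled teacher and student output distributions.
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage with random logits for a batch of 4 examples and 2 classes.
student = torch.randn(4, 2)
teacher = torch.randn(4, 2)
print(vanilla_kd_loss(student, teacher))
```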
EasyNLP provides a simple toolkit to benchmark CLUE datasets. You can use the following command to benchmark a CLUE dataset:
```bash
# Format: bash run_clue.sh device_id train/predict dataset
# e.g.:
bash run_clue.sh 0 train csl
```
We have tested the Chinese BERT and RoBERTa models on these datasets. The results on the dev sets are:
(1) bert-base-chinese:

| Task | AFQMC | CMNLI | CSL | IFLYTEK | OCNLI | TNEWS | WSC |
|------|-------|-------|-----|---------|-------|-------|-----|
| P | 72.17% | 75.74% | 80.93% | 60.22% | 78.31% | 57.52% | 75.33% |
| F1 | 52.96% | 75.74% | 81.71% | 60.22% | 78.30% | 57.52% | 80.82% |
(2) chinese-roberta-wwm-ext:

| Task | AFQMC | CMNLI | CSL | IFLYTEK | OCNLI | TNEWS | WSC |
|------|-------|-------|-----|---------|-------|-------|-----|
| P | 73.10% | 80.75% | 80.07% | 60.98% | 80.75% | 57.93% | 86.84% |
| F1 | 56.04% | 80.75% | 81.50% | 60.98% | 80.75% | 57.93% | 89.58% |
Here is the detailed CLUE benchmark example.
- Custom Text Classification Example
- QuickStart - Text Classification
- QuickStart - PAI DSW
- QuickStart - MaxCompute/ODPS Data
- AppZoo - Text Vectorization
- AppZoo - Text Classification/Matching
- AppZoo - Sequence Labeling
- AppZoo - GEEP Text Classification
- AppZoo - Text Generation
- Basic Pre-training Practice
- Knowledge Pre-training Practice
- Knowledge Distillation Practice
- Cross-task Knowledge Distillation Practice
- Few-shot Learning Practice
- Rapidformer Model Training Acceleration Practice
- API docs: http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/easynlp/easynlp_docs/html/index.html
This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.
- EasyNLP v0.0.3 was released on 01/04/2022. Please refer to tag_v0.0.3 for more details and history.
Scan the following QR codes to join the DingTalk discussion group. The group discussions are mostly in Chinese, but English is also welcome.
- DKPLM: https://paperswithcode.com/paper/dkplm-decomposable-knowledge-enhanced-pre
- MetaKD: https://paperswithcode.com/paper/meta-kd-a-meta-knowledge-distillation
- CP-Tuning: https://paperswithcode.com/paper/making-pre-trained-language-models-end-to-end-1
- FashionBERT: https://paperswithcode.com/paper/fashionbert-text-and-image-matching-with
We have an arXiv paper for you to cite for the EasyNLP library:
```bibtex
@article{easynlp,
  doi = {10.48550/ARXIV.2205.00258},
  url = {https://arxiv.org/abs/2205.00258},
  author = {Wang, Chengyu and Qiu, Minghui and Zhang, Taolin and Liu, Tingting and Li, Lei and Wang, Jianing and Wang, Ming and Huang, Jun and Lin, Wei},
  title = {EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing},
  publisher = {arXiv},
  year = {2022}
}
```