# MetaDialog **Repository Path**: Gramyd/MetaDialog ## Basic Information - **Project Name**: MetaDialog - **Description**: No description available - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-02-15 - **Last Updated**: 2022-02-15 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Meta Dialog Platform (MDP) Meta Dialog Platform: a toolkit platform for **NLP Few-Shot Learning** tasks of: - Text Classification - Sequence Labeling It also provides the baselines for: - [Track-1 of SMP2020: Few-shot dialog language understanding](https://smp2020.aconf.cn/smp.html#3). - [Benchmark Paper: "FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding"]("https://arxiv.org/abs/2009.08138") ### Updates - Updates 2021.3.8: Fix wrong default setting for few-shot data generator scripts. - Updates 2020.9.17: FewJoint benchmark (Dataset for SMP) is available: [paper](https://arxiv.org/abs/2009.08138), [data](https://atmahou.github.io/attachments/FewJoint.zip), [reformatted data (for MetaDialog)](https://atmahou.github.io/attachments/FewJoint_for_MetaDialog.zip) ### Features State-of-the-art solutions for Few-shot NLP: - Support Few-shot Learning for sequence-labeling task with state-of-the-art methods: CDT [(Hou et al., 2020)](https://arxiv.org/abs/2006.05702). - Support to use semantic within label name or label description. - Support various deep pre-trained embedding compatible with [huggingface/transformers](https://github.com/huggingface/transformers), such as **[BERT](https://arxiv.org/abs/1810.04805)** and **[Electra](https://openreview.net/forum?id=r1xMH1BtvB)**. - Support pair-wise embedding mechanism ([Hou et al., 2020](https://arxiv.org/abs/2006.05702), [Gao et al., 2019](https://www.aclweb.org/anthology/D19-1649)). Easy-to-start & flexible framework: - Provide tools for easy training & testing. - Support various few-shot models with unified and extendable interfaces, such as ProtoNet and TapNet. - Support easy-to-switch similarity-metrics and logits-scaling methods. - Provide tools of generating episode-style data for meta-learning. ## Citation Please cite code and data: ``` @article{hou2020fewjoint, title={FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding}, author={Yutai Hou, Jiafeng Mao, Yongkui Lai, Cheng Chen, Wanxiang Che, Zhigang Chen, Ting Liu}, journal={arXiv preprint}, year={2020} } ``` ## Get Started ### Environment Requirement ``` python>=3.6 torch>=1.2.0 transformers>=2.9.0 numpy>=1.17.0 tqdm>=4.31.1 allennlp>=0.8.4 pytorch-nlp ``` ### Example for Sequence Labeling Here, we take the few-shot slot tagging and NER task from [(Hou et al., 2020)](https://arxiv.org/abs/2006.05702) as quick start examples. #### Step1: Prepare pre-trained embedding - Download the pytorch bert model, or convert tensorflow param by yourself with [scripts](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py). - Set BERT path in the `./scripts/run_1_shot_slot_tagging.sh` to your setting: ```bash bert_base_uncased=/your_dir/uncased_L-12_H-768_A-12/ bert_base_uncased_vocab=/your_dir/uncased_L-12_H-768_A-12/vocab.txt ``` #### Step2: Prepare data - Download the **compatible** few-shot data at here: [download](https://atmahou.github.io/attachments/new_FewShotNLU_data(ACL20).zip) - Set test, train, dev data file path in `./scripts/run_1_shot_slot_tagging.sh` to your setting. > For simplicity, your only need to set the root path for data as follow: ```bash base_data_dir=/your_dir/ACL2020data/ ``` #### Step3: Train and test the main model - Build a folder to collect running log ```bash mkdir result ``` - Execute cross-evaluation script with two params: -[gpu id] -[dataset name] ##### Example for 1-shot slot tagging: ```bash source ./scripts/run_1_shot_slot_tagging.sh 0 snips ``` ##### Example for 1-shot NER: ```bash source ./scripts/run_1_shot_slot_tagging.sh 0 ner ``` > To run 5-shots experiments, use `./scripts/run_5_shot_slot_tagging.sh` ### Other detailed functions and options: You can experiment freely by passing parameters to `main.py` to choose different model architectures, hyperparameters, etc. To view detailed options and corresponding descriptions, run commandline: ```bash python main.py --h ``` We provide scripts for general few-shot classification and sequence labeling task respectively: - classification - `run_electra_sc.sh` - `run_bert_sc.sh` - sequence labeling - `run_electra_sl.sh` - `run_bert_sl.sh` The usage of these scripts are similar to process in Get Started. ## Run with FewJoint/SMP data - Get reformatted FewJoint data at [here](https://atmahou.github.io/attachments/FewJoint_for_MetaDialog.zip) or construct episode-style data by yourself with [our tool](https://github.com/AtmaHou/MetaDialog#few-shot-data-construction-tool). - Use script `./scripts/run_smp_bert_sc.sh` and `./scripts/run_smp_bert_sl.sh` to perform few-shot intent detection or few-shot slot filling respectively. - Notice that: 1. Change train/dev/test path in the scripts before running. 2. Find predicted results at `trained_model_path` within running scripts. ## Few-shot Data Construction Tool We also provide a generation tool for converting normal data into few-shot/meta-episode style. The tool is included at path: `scripts/other_tool/meta_dataset_generator.py`. Run following commandline to view detailed interface: ```bash python generate_meta_dataset.py --h ``` For simplicity, we provide an example script to help generate few-shot data: `./scripts/gen_meta_data.sh`. The following are some key params for you to control the generation process: - `input_dir`: raw data path - `output_dir`: output data path - `episode_num`: the number of episode which you want to generate - `support_shots_lst`: to specified the support shot size in each episode, we can specified multiple number to generate at the same time. - `query_shot`: to specified the query shot size in each episode - `seed_lst`: random seed list to control random generation - `use_fix_support`: set the fix support in dev dataset - `dataset_lst`: specified the dataset type which our tool can handle, there are some choices: `stanford` & `SLU` & `TourSG` & `SMP`. > If you want to handle other type of dataset, > you can add your code for load raw dataset in `meta_dataset_generator/raw_data_loader.py`. ##### few-shot/meta-episode style data example ```json { "domain_name": [ { // episode "support": { // support set "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]], // input sequence "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]], // output sequence in sequence labeling task "labels": [["statement"], ["query"]] // output labels in classification task }, "query": { // query set "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]], "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]], "labels": [["statement"], ["query"]] } }, ... ], ... } ``` ## Acknowledgment The platform is developed by [HIT-SCIR](http://ir.hit.edu.cn/). If you have any question and advice for it, please contact us(Yutai Hou - [ythou@ir.hit.edu.cn]() or Yongkui Lai - [yklai@ir.hit.edu.cn]()).