# multiconer-baseline

**Repository Path**: mirrors_amzn/multiconer-baseline

## Basic Information

- **Project Name**: multiconer-baseline
- **License**: Apache-2.0
- **Default Branch**: main
- **Created**: 2021-10-22

## README

# MULTI-CONER NER Baseline

This repository provides a baseline approach for Named Entity Recognition (NER). It includes the following functionality:

- CoNLL data readers
- Support for any HuggingFace pre-trained transformer model
- Training and testing through PyTorch Lightning

A more detailed description of how to use this code is provided below.

### Running the Code

#### Arguments

```
p.add_argument('--train', type=str, help='Path to the train data.', default=None)
p.add_argument('--test', type=str, help='Path to the test data.', default=None)
p.add_argument('--dev', type=str, help='Path to the dev data.', default=None)
p.add_argument('--out_dir', type=str, help='Output directory.', default='.')
p.add_argument('--iob_tagging', type=str, help='IOB tagging scheme.', default='wnut')
p.add_argument('--max_instances', type=int, help='Maximum number of instances.', default=-1)
p.add_argument('--max_length', type=int, help='Maximum number of tokens per instance.', default=50)
p.add_argument('--encoder_model', type=str, help='Pretrained encoder model to use.', default='xlm-roberta-large')
p.add_argument('--model', type=str, help='Model path.', default=None)
p.add_argument('--model_name', type=str, help='Model name.', default=None)
p.add_argument('--stage', type=str, help='Training stage.', default='fit')
p.add_argument('--prefix', type=str, help='Prefix for storing evaluation files.', default='test')
p.add_argument('--batch_size', type=int, help='Batch size.', default=128)
p.add_argument('--gpus', type=int, help='Number of GPUs.', default=1)
p.add_argument('--epochs', type=int, help='Number of epochs for training.', default=5)
p.add_argument('--lr', type=float, help='Learning rate.', default=1e-5)
p.add_argument('--dropout', type=float, help='Dropout rate.', default=0.1)
```

#### Running

###### Train an XLM-RoBERTa base model

```
python -m ner_baseline.train_model --train train.txt --dev dev.txt --out_dir . --model_name xlmr_ner --gpus 1 \
    --epochs 2 --encoder_model xlm-roberta-base --batch_size 64 --lr 0.0001
```

###### Evaluate the trained model

```
python -m ner_baseline.evaluate --test test.txt --out_dir . --gpus 1 --encoder_model xlm-roberta-base \
    --model MODEL_FILE_PATH --prefix xlmr_ner_results
```

###### Predict tags with a pretrained model

```
python -m ner_baseline.predict_tags --test test.txt --out_dir . --gpus 1 --encoder_model xlm-roberta-base \
    --model MODEL_FILE_PATH --prefix xlmr_ner_results --max_length 500
```

- For this functionality we have implemented an efficient approach for predicting the output tags, independent of the tokenizer used.
- While reading the data in CoNLL format, the method _parse_tokens_for_ner_ in [reader.py](https://github.com/amzn/multiconer-baseline/blob/86a1c309f19f7664a75b63c8814e7d60009c09d5/utils/reader.py#L67) splits each token into its subwords and additionally generates a mask in which only the first subword of each token is marked with `True`.
- For example, with XLM-RoBERTa the token `MultiCoNER` is split into the subwords `['▁Multi', 'Co', 'NER']`, which yields the token mask `[True, False, False]`.
- These token masks are part of the output returned by the provided reader.
- Finally, when predicting the token tags, the model's [predict_tags](https://github.com/amzn/multiconer-baseline/blob/86a1c309f19f7664a75b63c8814e7d60009c09d5/model/ner_model.py#L187) method picks only the tag predicted for the first subword of each token.
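The first-subword masking and tag filtering described above can be sketched as follows. This is an illustrative example, not code from the repository: `first_subword_mask` is a hypothetical helper, and the subword splits and tag values are made up for demonstration.

```python
from itertools import compress

def first_subword_mask(subword_lists):
    """Hypothetical helper: for each token's subword list, mark only the first subword True."""
    mask = []
    for subwords in subword_lists:
        mask.extend([True] + [False] * (len(subwords) - 1))
    return mask

# Hypothetical XLM-RoBERTa subword splits for the tokens "MultiCoNER" and "challenge".
subword_lists = [['▁Multi', 'Co', 'NER'], ['▁challenge']]
mask = first_subword_mask(subword_lists)
print(mask)  # [True, False, False, True]

# The model emits one tag per subword (illustrative values);
# compress() keeps only the tags at first-subword positions.
pred_tags = ['B-CORP', 'I-CORP', 'I-CORP', 'O']
token_tags = list(compress(pred_tags, mask))
print(token_tags)  # ['B-CORP', 'O']
```

The result has exactly one tag per original token, regardless of how aggressively the tokenizer splits each token into subwords.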
- This process is efficient and implemented with native Python functionality, e.g. `[compress(pred_tags_, mask_) for pred_tags_, mask_ in zip(pred_tags, token_mask)]`, which is executed for an entire batch.

### Setting up the code environment

```
$ pip install -r requirements.txt
```

# License

The code in this repository is licensed under the [Apache 2.0 License](https://github.com/amzn/multiconer-baseline/blob/main/LICENSE).