# PyCode-TextEE
**Repository Path**: zml2016055/PyCode-TextEE
## Basic Information
- **Project Name**: PyCode-TextEE
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-10
- **Last Updated**: 2025-08-10
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Instruction Tuning with Annotation Guidelines for Event Extraction (Findings of ACL 2025)
> Efficient and extensible Event Extraction with Code Prompts and Annotation Guidelines, built on top of [TextEE](https://github.com/ej0cl6/TextEE).
This repository includes code for:
- `PyCode-TextEE`: Tools to obtain code prompts for 15 event extraction datasets supported by TextEE.
- `Instruction Tuning with Guidelines`: Source code to reproduce [our work on utilizing code prompts and annotation guidelines for Event Extraction](https://arxiv.org/abs/2502.16377). Please navigate to the directory `instruction_tuning_with_guidelines_ACL_2025` for the source code.
If you find our work helpful, please cite it:
```
@inproceedings{srivastava-etal-2025-instruction,
title = "Instruction-Tuning {LLM}s for Event Extraction with Annotation Guidelines",
author = "Srivastava, Saurabh and
Pati, Sweta and
Yao, Ziyu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.677/",
pages = "13055--13071",
ISBN = "979-8-89176-256-5",
abstract = "In this work, we study the effect of annotation guidelines{--}textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight its effectiveness in improving cross-schema generalization and low-frequency event-type performance."
}
```
---
**Authors**:
Saurabh Srivastava, Sweta Pati, Ziyu Yao
---
## Introduction
**PyCode-TextEE** extends [TextEE](https://github.com/ej0cl6/TextEE), bringing **event extraction into the era of prompt-based large language models**.
While **TextEE** standardizes 10+ event extraction datasets into a unified JSON format, making them reproducible and comparable, **PyCode-TextEE** takes the next leap:
> We transform TextEE-formatted data into **code-style prompts**, a format that is both readable and executable by LLMs and ideal for structured evaluation. In addition, we annotate the code prompts with annotation guidelines. Below, we provide an example of a code prompt and show how we integrate annotation guidelines within it:
### What are Code Prompts and Annotation Guidelines?
- `Code prompting` is a technique that enhances reasoning abilities in text+code LLMs by transforming natural language (NL) tasks into code representations. Instead of executing the code, the model uses it as a structured input format to reason and generate answers. *Labels such as event classes and arguments are represented as Python classes, and the guidelines or instructions are introduced as docstrings.* The model starts generating after the `result =` line.
- `Annotation Guidelines` define how to identify and classify events and their arguments within a text or other data. These guidelines help ensure consistency and quality in the annotation process, which is crucial for training machine learning models for event extraction. The performance of current SoTA models heavily depends on the quantity of human-annotated data, as the model learns the guidelines from these examples.
Note that not all datasets release their annotation guidelines. We provide code to generate these annotation guidelines automatically using a few training samples.
#### An example of a code prompt with annotation guidelines is shown below:
```python
# This is an event extraction task where the goal is to extract structured events from the text. A structured event contains an event trigger word, an event type, the arguments participating in the event, and their roles in the event. For each different event type, please output the extracted information from the text into python-style dictionaries where the first key will be 'mention' with the value of the event trigger. Next, please output the arguments and their roles following the same format. The event type definitions and their argument roles are defined next.
# Here are the event definitions:
@dataclass
class Meet(ContactEvent):
    """A 'Meet(ContactEvent)' is triggered by interactions where individuals or groups come together for a specific purpose, either physically or virtually. This event involves direct interaction, distinguishing it from remote communication events like 'PhoneWrite'. It encompasses formal and informal gatherings such as diplomatic talks, business meetings, press conferences, and forums, but excludes casual or unplanned encounters."""
    mention: str  # The text span that triggers the event.
    entity: List  # Entities are individuals, groups, organizations, or countries participating in the meeting. They represent the participants involved in the event.
    time: List  # (Assumed field, implied by `time=` in the result below.) When the meeting takes place.
    place: List  # The place is the location where the meeting occurs, providing context for the event. It can be a city, building, specific venue, or virtual platform.
# This is the text to analyze
text = "The meeting concluded with the delegates voting by show of hands to meet again in 10 days."
result = [
    Meet(mention='meeting', entity=['delegates'], time=[], place=[]),
    Meet(mention='meet', entity=['delegates'], time=['10 days'], place=[])
]
```
> PyCode-TextEE transforms EE datasets into the above format, which has been shown to perform well with LLMs. For more details, please refer to our paper [Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines](https://arxiv.org/abs/2502.16377).
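
A minimal sketch of how a record could be rendered into this prompt shape. The simplified record layout (`text`, `event_mentions`, `trigger`, `arguments`), the `ROLES` table, and the helper names `event_to_code` and `build_code_prompt` are assumptions made for illustration; they are not the repository's actual API.

```python
from collections import defaultdict

# Hypothetical, simplified record; real TextEE files carry more fields.
record = {
    "text": "The delegates agreed to meet again in 10 days.",
    "event_mentions": [
        {
            "event_type": "Meet",
            "trigger": {"text": "meet"},
            "arguments": [
                {"role": "entity", "text": "delegates"},
                {"role": "time", "text": "10 days"},
            ],
        }
    ],
}

# Assumed role inventory; in practice this would come from the event schema.
ROLES = {"Meet": ["entity", "time", "place"]}


def event_to_code(event: dict) -> str:
    """Render one event mention as a Python-style constructor call."""
    by_role = defaultdict(list)
    for arg in event["arguments"]:
        by_role[arg["role"].lower()].append(arg["text"])
    parts = [f"mention={event['trigger']['text']!r}"]
    for role in ROLES[event["event_type"]]:
        # Roles without any argument are rendered as empty lists.
        parts.append(f"{role}={by_role[role]!r}")
    return f"{event['event_type']}({', '.join(parts)})"


def build_code_prompt(record: dict, schema_source: str) -> tuple[str, str]:
    """Return (prompt, completion); the model continues after `result =`."""
    prompt = (
        f"{schema_source}\n"
        "# This is the text to analyze\n"
        f"text = {record['text']!r}\n"
        "result ="
    )
    calls = ",\n    ".join(event_to_code(e) for e in record["event_mentions"])
    completion = f" [\n    {calls}\n]" if calls else " []"
    return prompt, completion


prompt, completion = build_code_prompt(record, "# (event definitions go here)")
print(prompt + completion)
```

Splitting the output into a prompt and a completion mirrors the description above: everything up to `result =` is given to the model, and the list of event objects is what it is expected to generate.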
---
### What's New in PyCode-TextEE?
- **CodePrompt Format Conversion**
We convert event structures (event triggers and, if available, arguments) into Python-like prompts (e.g., `Attack(mention="...", attacker=[...], target=[...])`) to help LLMs handle structured outputs.
- **Annotation Guideline Generation**
While annotation guidelines have helped LLMs achieve SOTA results for EE, previous approaches assume that these guidelines are available, which is not always the case. We take the next step by generating these guidelines automatically from a few training samples.
- **Plug-and-Play with TextEE**
Directly load standardized datasets from TextEE and transform them with one command into training-ready CodePrompts.
- **Evaluation Toolkit for Prompted LLMs**
We provide exact-match evaluation utilities that compute **precision, recall, and F1 scores** over structured LLM outputs.
- **Code to Reproduce LLaMAEvents**
Includes all data transformations and training scripts used for our paper on utilizing code prompts and annotation guidelines. Code for that will live in `LLaMAEvents/`.
## Updates
- **April 23, 2025**: We release **PyCode-TextEE**, a modular framework for converting standardized event extraction datasets (via TextEE) into code-style prompts, along with exact-match evaluation scripts.
Feel free to reach out if you'd like to contribute your **models**, **datasets**, or ideas!
## Supported Datasets
We support **15 datasets** for Event Detection (ED), Event Argument Extraction (EAE), and End-to-End (E2E) Event Extraction. All are converted into **code-style prompts** and support evaluation using our exact-match metric suite.
The table below also shows whether annotation **guidelines** are included for each dataset.
| Dataset | Task(s) | Paper Title | Source | Guidelines |
|---------|---------|-------------|--------|------------|
| ACE05 | ED, EAE, E2E | The Automatic Content Extraction (ACE) Program | LDC | Yes |
| ERE | ED, EAE, E2E | From Light to Rich ERE | LDC | Yes |
| MLEE | ED, EAE, E2E | Biological Event Extraction | Bioinformatics | No |
| Genia2011 | ED, EAE, E2E | Genia Event Task (2011) | BioNLP 2011 | No |
| Genia2013 | ED, EAE, E2E | Genia Event Task (2013) | BioNLP 2013 | No |
| M2E2 | ED, EAE, E2E | Cross-media Structured Common Space | ACL 2020 | No |
| CASIE | ED, EAE, E2E | CASIE: Cybersecurity Event Extraction | AAAI 2020 | No |
| PHEE | ED, EAE, E2E | Pharmacovigilance Event Extraction | EMNLP 2022 | No |
| MEE | ED | Multilingual Event Extraction | EMNLP 2022 | No |
| FewEvent | ED | Few-Shot Event Detection | WSDM 2020 | No |
| MAVEN | ED | Massive General-Domain ED | EMNLP 2020 | No |
| SPPED | ED | ED from Social Media for Epidemic Prediction | NAACL 2024 | No |
| MUC-4 | EAE | Fourth Message Understanding Conference | MUC 1992 | No |
| RAMS | EAE | Multi-Sentence Argument Linking | ACL 2020 | Yes |
| WikiEvents | EAE | Conditional Generation for Doc-level EAE | NAACL 2021 | Yes |
| GENEVA | EAE | Benchmarking Generalizability for EAE | ACL 2023 | Yes |
## Environment
Although PyCode-TextEE needs no additional packages to run, we recommend using **Python 3.9+** with a clean virtual environment (e.g., via `venv` or `conda`).
### Install Dependencies
```bash
# Clone the repo
git clone https://github.com/yourname/PyCode-TextEE.git
cd PyCode-TextEE
# Create a virtual environment (optional)
python3 -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
# Install requirements (optional)
pip install -r requirements.txt
```
### Core Dependencies
> These are the minimal dependencies to run the code.
- `datasets`
- `openai` (used for guideline generation)
- `wandb` (optional, for experiment tracking)
### Note
Some datasets (e.g., ACE, ERE) require an **LDC license** to access raw files. We provide code for preprocessing them, but not the data itself.
## Running the Code
Below is a step-by-step guide to run PyCode-TextEE.
Our pipeline is divided into 4 main stages:
---
### Step 0: Obtaining TextEE Format Dataset
Our code accepts data formatted by TextEE pre-processing. Please follow the instructions in the `data` directory of the [TextEE repo](https://github.com/ej0cl6/TextEE/tree/main).
After running TextEE, make sure your data is saved in the following structure:
#### Expected dataset layout:
```
/
├── ace05-en/
│   ├── split1/
│   │   ├── train.json
│   │   ├── dev.json
│   │   └── test.json
│   └── ...
├── casie/
└── ...
```
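
Before moving on, it can help to confirm that each split directory is complete. The following is a small sketch under the layout shown above; the helper name `find_complete_splits` is hypothetical, and the demo builds a throwaway directory tree rather than touching real data.

```python
import json
import tempfile
from pathlib import Path

SPLIT_FILES = ("train.json", "dev.json", "test.json")


def find_complete_splits(root: Path, dataset: str) -> list[Path]:
    """Return split directories under root/dataset that contain all three files."""
    dataset_dir = root / dataset
    if not dataset_dir.is_dir():
        return []
    return sorted(
        d for d in dataset_dir.iterdir()
        if d.is_dir() and all((d / name).is_file() for name in SPLIT_FILES)
    )


# Self-contained demo: create a temporary tree that mirrors the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    split_dir = root / "ace05-en" / "split1"
    split_dir.mkdir(parents=True)
    for name in SPLIT_FILES:
        (split_dir / name).write_text(json.dumps([]))
    found = [d.name for d in find_complete_splits(root, "ace05-en")]
    print(found)
```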
### Step 1: [Optional] Generate Code Schema
If you're working with custom datasets (or want to regenerate schemas for the 15 supported ones), you'll first convert them into **TextEE format** and generate the corresponding **Python-style event definitions**.
Directory structure:
```
PyCode-TextEE/
├── code_schema_generation/
│   ├── generate_schema.py
│   ├── init_prompts/          # Contains per-dataset class schemas (*.txt)
│   ├── python_event_defs/     # Python classes for eval (dataset-wise + all_ee_definitions.py)
│   ├── mapper.json            # Maps cleaned names -> class names
│   └── schema.json            # All cleaned event/arg schemas
```
To generate the schema:
```bash
cd code_schema_generation
python generate_schema.py --dataset_folder
```
Example output schema (for ACE05 `Attack` event):
```python
@dataclass
class Attack(ConflictEvent):
    mention: str
    target: List
    victim: List
    attacker: List
    instrument: List
    place: List
    agent: List
```
**Note**: We've already generated schemas for all 15 supported datasets. This step is only required for new datasets.
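
To illustrate the idea behind schema generation, here is a rough sketch that renders an event type and its roles as dataclass source text and then executes it to confirm it is valid Python. The function `render_event_class` and the stand-in `ConflictEvent` parent are assumptions for this example and do not reflect how `generate_schema.py` is implemented.

```python
from dataclasses import dataclass, fields
from typing import List


def render_event_class(name: str, parent: str, roles: list[str]) -> str:
    """Return Python source for one event class: a trigger plus list-valued roles."""
    lines = ["@dataclass", f"class {name}({parent}):", "    mention: str"]
    lines += [f"    {role}: List" for role in roles]
    return "\n".join(lines)


@dataclass
class ConflictEvent:
    """Stand-in parent class so the generated source can be executed here."""


source = render_event_class(
    "Attack",
    "ConflictEvent",
    ["target", "victim", "attacker", "instrument", "place", "agent"],
)
print(source)

# Executing the generated source checks that it defines a usable dataclass.
namespace = {
    "__name__": "generated_schema",  # hypothetical module name for the new class
    "dataclass": dataclass,
    "List": List,
    "ConflictEvent": ConflictEvent,
}
exec(source, namespace)
attack_fields = [f.name for f in fields(namespace["Attack"])]
```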
---
### Step 2: Generating Annotation Guidelines from a few Training Samples
While code prompts convert EE datasets into a structured format, annotating the schema with guidelines helps LLMs understand event and argument definitions. As shown in [our paper](https://arxiv.org/abs/2502.16377), schemas annotated with these guidelines help us achieve SOTA results with LLaMA-3.1-8B. However, not all datasets release their annotation guidelines; we address this in our paper by proposing 5 different ways to generate them:
- **Guideline-P**: Uses training samples from an event class e to generate guidelines. We denote such instances as positive samples in our approach.
- **Guideline-PN**: In addition to positive training samples, we also utilize 15 negative samples from different event classes to generate guidelines.
- **Guideline-PS**: We designate sibling event classes in event schema as negative samples and utilize them to generate guidelines.
- **Guideline-PN-Int and Guideline-PS-Int**: We create two more variants that integrate the 5 diverse guideline samples from Guideline-PN and Guideline-PS, respectively, into a comprehensive one.
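
The sample selection behind the P, PN, and PS variants can be sketched as follows. This is an illustration only: the function `select_samples`, the `(event_type, text)` sample format, and the choice to keep every sibling sample for PS are assumptions, not the repository's implementation.

```python
import random


def select_samples(samples, event_type, variant, siblings=None, n_negative=15, seed=0):
    """Pick positive and negative samples for one event class.

    samples: list of (event_type, text) pairs.
    variant: "P" (positives only), "PN" (random negatives from other classes),
             or "PS" (negatives drawn from sibling classes in the schema).
    """
    rng = random.Random(seed)
    positives = [s for s in samples if s[0] == event_type]
    if variant == "P":
        negatives = []
    elif variant == "PN":
        pool = [s for s in samples if s[0] != event_type]
        negatives = rng.sample(pool, min(n_negative, len(pool)))
    elif variant == "PS":
        sibling_set = set(siblings or []) - {event_type}
        negatives = [s for s in samples if s[0] in sibling_set]
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    return positives, negatives


# Toy data for demonstration purposes only.
toy = [
    ("Meet", "The leaders met in Geneva."),
    ("Meet", "Delegates held talks on Monday."),
    ("PhoneWrite", "She called her lawyer."),
    ("Attack", "Rebels shelled the town."),
    ("Die", "Two people were killed."),
]
pos_p, neg_p = select_samples(toy, "Meet", "P")
pos_pn, neg_pn = select_samples(toy, "Meet", "PN", n_negative=2)
pos_ps, neg_ps = select_samples(toy, "Meet", "PS", siblings=["PhoneWrite"])
```

The selected positives and negatives would then be placed into the LLM prompt that asks for a guideline; that prompting step is not shown here.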
**Note**: We've already included the synthesized guidelines and the available human guidelines in the directory `guideline_generation/synthesize_guidelines/synthesized_guidelines`.
To generate the guidelines, please run the following command:
```bash
cd guideline_generation
python synthesize_guidelines/create_dictionaries.py --dataset_name
python prompting/prompt_llms.py #generates guidelines P, PN, PS
python prompting/prompt_llm_adv_guidelines.py #generates Int- guidelines
cd .. # to navigate to home directory
```
Here, the value passed to `--dataset_name` is the dataset for which the guidelines are generated (e.g., `ace05-en`), and the guideline variant is one of the 5 discussed above: Guideline-P (P), Guideline-PN (PN), Guideline-PS (PS), Guideline-PN-Int (PNI), or Guideline-PS-Int (PSI).
### Guideline File Format
After running the above commands, the guidelines will be stored in the file ``. Please make sure that your guideline file looks like:
```json
{
"EventName1": {
"description": [
"One possible definition.",
"Another variation of the same."
],
"attributes": {
"mention": "Trigger span of the event.",
"arg_1": ["One definition for arg_1", "another definition for arg_1"]
}
}
}
```
This enables *randomized sampling* during conversion to avoid overfitting to one phrasing, an approach highlighted in our paper.
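
One way such randomized sampling could work is sketched below: whenever a field holds a list of alternative phrasings, one is chosen at random, and plain strings are used as-is. The helper `sample_guideline` is hypothetical and only assumes the JSON shape shown above.

```python
import json
import random

guidelines = json.loads("""
{
  "EventName1": {
    "description": ["One possible definition.", "Another variation of the same."],
    "attributes": {
      "mention": "Trigger span of the event.",
      "arg_1": ["One definition for arg_1", "another definition for arg_1"]
    }
  }
}
""")


def sample_guideline(entry: dict, rng: random.Random) -> dict:
    """Pick one phrasing per field; strings are kept, lists are sampled."""
    def pick(value):
        return rng.choice(value) if isinstance(value, list) else value

    return {
        "description": pick(entry["description"]),
        "attributes": {name: pick(text) for name, text in entry["attributes"].items()},
    }


rng = random.Random(13)  # fixed seed so repeated runs give the same choice
sampled = sample_guideline(guidelines["EventName1"], rng)
print(sampled)
```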
---
### Step 3: Obtaining Code Prompts
We first need to make sure that the Python event definitions are importable in the current environment so that the code prompts can be verified.
```bash
# This directory is already included in the code, or can be regenerated using Step 1.
cd code_schema_generation/python_event_defs
export PYCODE_HOME=$(pwd)
export PYTHONPATH=$PYCODE_HOME:$PYCODE_HOME:$PYTHONPATH
cd ../../ # redirect the terminal to PyCode home directory
```
Run the following:
```bash
cd code_prompts
# --annotate_schema: if unspecified, the schema is left unannotated because the flag defaults to False.
# --guideline_file: if unspecified, the guidelines are generated automatically as described in Step 2.
# --add_negative_samples: used to reproduce our LLaMAEvents results.
# (Comments are kept above the command because text after a trailing backslash breaks line continuation.)
python prepare_dataset.py \
    --input_dir \
    --dataset_name \
    --annotate_schema \
    --guideline_file \
    --add_negative_samples \
    --output_dir ./processed_code_prompts/
```
---
### Argument Descriptions
| Argument | Description |
|------------------------|-------------|
| `--input_dir` | Path to TextEE-formatted JSONs (default: `../../TextEE/processed_data`) |
| `--dataset_name` | Name of the dataset to process (e.g., `ace05-en`) |
| `--annotate_schema` | Add class docstrings and inline comments using guidelines (default: `False`) |
| `--guideline_file` | Guideline JSON file for schema annotation (required if `annotate_schema=True`) |
| `--add_negative_samples` | Add negative examples to training set (default: `False`) |
| `--output_dir` | Where to save the converted code prompts (default: `./processed_code_prompts/`) |
---
### 𧬠Annotated Prompt Example (with Guidelines)
When `--annotate_schema=True`, we generate prompts like:
```python
@dataclass
class Event(ParentEvent):
    """the event definition"""
    mention: str  # Event trigger definition
    arg_1: List  # Definition of argument 1
    arg_2: List  # Definition of argument 2
```
This format supports **LLM-compatible** structure learning and improves interpretability.
---
### Tip
1. Skip `--guideline_file` and `--annotate_schema` if you're only interested in raw code prompts. If `annotate_schema` is True but the `guideline_file` is unspecified or not found, Step 2 will be executed automatically to produce `guideline_file`.
2. Use `--add_negative_samples` if you want to add a negative sample per instance, similar to [DEGREE](https://github.com/PlusLabNLP/DEGREE).
---
### Step 4: Training Models
To train a model, use the following script, which defaults to LLaMA models:
```bash
cd training_scripts
python train_completion.py # train a chat completion model with LLaMA-3.1-8B as backbone
```
You can also run the following command to resume training from a checkpoint:
```bash
python resume_from_ckpt.py # please specify the checkpoint directory in the script. By default, it will download and run LLaMA-3.1-8B
```
## Evaluation
Once you've trained your model to generate Python-style event prompts, you can use our evaluation suite in `code_evaluation/` to compute standard **precision, recall, and F1 scores** via exact-match comparison of predicted and gold structured outputs.
### Directory Overview
```
code_evaluation/
├── all_ee_definitions.py   # Event classes copied from schema generation (Step 1)
├── event_scorer.py         # Main evaluation logic
└── utils_typing.py         # Type helper module (attribution to GoLLIE)
```
---
### `event_scorer.py`: Evaluation in a Nutshell
The core script compares model-generated code prompts with gold ones using Python object introspection.
#### Key Features:
- Extracts arguments from predicted and gold event objects
- Computes **micro/macro F1** across all examples
- Identifies:
  - **Trigger-level mismatches**
  - **Argument-level hallucinations**
- Logs detailed stats (TP / FP / FN per role)
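
As a rough illustration of comparing outputs through Python objects, the sketch below turns a generated `result` list back into dataclass instances. The `Meet` class with default-empty roles and the helper `parse_events` are assumptions for this example; the actual logic in `event_scorer.py` may differ. Predictions that use unknown classes or unknown roles are simply discarded here, whereas the real scorer logs them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Meet:
    """Simplified stand-in for a generated event class."""
    mention: str
    entity: List = field(default_factory=list)
    time: List = field(default_factory=list)
    place: List = field(default_factory=list)


def parse_events(generated: str, event_classes: dict) -> list:
    """Evaluate the text after `result =` into event objects, or [] on failure.

    Caution: eval with empty builtins is not a real sandbox; only use this
    on outputs you are willing to execute.
    """
    namespace = {"__builtins__": {}, **event_classes}
    try:
        result = eval(generated.strip(), namespace)
    except Exception:
        # Covers syntax errors, unknown classes, and hallucinated roles.
        return []
    if not isinstance(result, list):
        return []
    allowed = tuple(event_classes.values())
    return [event for event in result if isinstance(event, allowed)]


classes = {"Meet": Meet}
good = parse_events("[Meet(mention='meet', entity=['delegates'], time=['10 days'])]", classes)
unknown_class = parse_events("[Attack(mention='fired')]", classes)
unknown_role = parse_events("[Meet(mention='meet', weapon=['gun'])]", classes)
```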
#### Core Functions:
- `compute_f1(...)`: calculates precision, recall, and F1 from match counts
- `extract_objects(...)`: extracts fields except for `mention` to compare arguments
- `micro_ed_scores`: calculates the micro F1 score on the Event Detection task
- `micro_eae_scores`: calculates the micro F1 score on the Event Argument Extraction task
- `micro_e2e_scores`: calculates the micro F1 score on the End-to-End Event Extraction task
- `log_hallucinations_and_mismatches(...)`: logs mismatches like hallucinated roles
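
For reference, exact-match micro scoring from match counts can be sketched as below. The names `prf1` and `micro_scores`, and the representation of each prediction as a set of `(event_type, mention)` tuples, are assumptions for this example and are not the signatures used in `event_scorer.py`.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_scores(pairs) -> tuple[float, float, float]:
    """Pool counts over all examples, then compute the scores once.

    pairs: iterable of (predicted, gold) sets of hashable items.
    Sets collapse duplicate items within a single example.
    """
    tp = fp = fn = 0
    for predicted, gold in pairs:
        tp += len(predicted & gold)   # exact matches
        fp += len(predicted - gold)   # predicted but not in gold
        fn += len(gold - predicted)   # in gold but missed
    return prf1(tp, fp, fn)


# Toy event-detection example: items are (event_type, trigger mention).
pairs = [
    ({("Meet", "meeting"), ("Meet", "meet")}, {("Meet", "meeting"), ("Meet", "meet")}),
    ({("Attack", "fired")}, {("Attack", "fired"), ("Die", "killed")}),
    ({("Meet", "talks")}, {("Attack", "bombing")}),
]
precision, recall, f1 = micro_scores(pairs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy input the pooled counts are 3 true positives, 1 false positive, and 2 false negatives, so precision is 0.75 and recall is 0.6.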
---
### Run the Demo Evaluation
We provide a ready-to-run example in:
```
demo/e2e_demo.json
```
This file contains three illustrative cases:
- One fully correct prediction
- One partially correct
- One incorrect
To run the evaluation:
```bash
cd code_evaluation
python event_scorer.py --input_file ./../demo/e2e_demo.json
```