# concurrentqa
**Repository Path**: mirrors_facebookresearch/concurrentqa
## Basic Information
- **Project Name**: concurrentqa
- **Description**: This repo contains data and code for the paper "Reasoning over Public and Private Data in Retrieval-Based Systems."
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-03-18
- **Last Updated**: 2026-04-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Reasoning over Public and Private Data in Retrieval-Based Systems
Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn*, Christopher Ré*
[**Paper**](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00580/117168/Reasoning-over-Public-and-Private-Data-in)
| [**Blog Post**](https://ai.facebook.com/blog/building-systems-to-reason-securely-over-private-data)
| [**Download**](#getting-the-concurrentqa-dataset-and-model-checkpoints)
| [**Citing**](#citation)
This repository contains dataset resources and code for ConcurrentQA, a textual QA benchmark that requires concurrent retrieval over multiple data distributions and privacy scopes. It also contains result analysis code and other resources for research in the private QA setting.

### Set up
Clone the repository as follows.
```bash
git clone git@github.com:facebookresearch/concurrentqa.git
cd concurrentqa
cd multihop_dense_retrieval
git submodule init
git submodule update
```
Set up the environment as follows (according to the MDR instructions). We encourage the use of conda environments.
```bash
conda create --name cqa python=3.6
conda activate cqa
cd concurrentqa/multihop_dense_retrieval/
bash setup.sh
```
If you are using CUDA 11, we find that the following changes to the above setup work well: 1) use `python=3.7`, and 2) in `/multihop_dense_retrieval/setup.sh`, replace the faiss-gpu and pytorch install commands with the following:
```
conda install faiss-gpu cudatoolkit=11.3 -c pytorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
```
### Getting the ConcurrentQA Dataset and Model Checkpoints
To download the train, dev, and test sets, the email and Wikipedia passage corpora, and the model checkpoints, run:
```bash
bash scripts/download_cqa.sh
```
To download retriever and reader models trained on HotpotQA data, run:
```bash
bash scripts/download_hotpot.sh
```
The datasets can also be downloaded via Hugging Face:
- [Retrieval benchmark](https://huggingface.co/datasets/simarora/ConcurrentQA-Retrieval)
- [QA benchmark](https://huggingface.co/datasets/simarora/ConcurrentQA)
### Code
We include instructions 1) for training and evaluating models on ConcurrentQA data in the absence of privacy concerns and 2) for evaluating performance under the PAIR privacy framework.
#### Training and Evaluating Models on ConcurrentQA
To run evaluation with provided model checkpoints, use the script:
```bash
cd multihop_dense_retrieval
bash CQA_Scripts/MDR_Eval_CQA.sh
```
The script prints retrieval and reader scores on the test split. With the provided checkpoints, the output should resemble the following:
```
Retrieval scores on test split ...
Avg PR: 0.604375
Avg P-EM: 0.190625
Avg 1-Recall: 0.276875
Path Recall: 0.184375
bridge Questions num: 1400
Avg PR: 0.5985714285714285
Avg P-EM: 0.18785714285714286
Avg 1-Recall: 0.265
Path Recall: 0.18428571428571427
comparison Questions num: 200
Avg PR: 0.645
Avg P-EM: 0.21
Avg 1-Recall: 0.36
Path Recall: 0.185
Reader scores on test split ...
'em': 0.48875,
'f1': 0.5650013858314458,
'joint_em': 0.1175,
'joint_f1': 0.3439091595024459,
'sp_em': 0.154375,
'sp_f1': 0.4496642766955267
```
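The overall retrieval scores in the log above appear to be question-count-weighted averages of the per-type scores (1,400 bridge and 200 comparison questions). The sketch below is not part of the repository; it only checks this relationship using the numbers printed above. The helper name `weighted_average` and the dictionary layout are hypothetical.

```python
import math

# Per-question-type retrieval scores, copied from the evaluation log above.
bridge = {"n": 1400, "Avg PR": 0.5985714285714285,
          "Avg P-EM": 0.18785714285714286,
          "Avg 1-Recall": 0.265, "Path Recall": 0.18428571428571427}
comparison = {"n": 200, "Avg PR": 0.645, "Avg P-EM": 0.21,
              "Avg 1-Recall": 0.36, "Path Recall": 0.185}

# Overall retrieval scores reported at the top of the same log.
overall = {"Avg PR": 0.604375, "Avg P-EM": 0.190625,
           "Avg 1-Recall": 0.276875, "Path Recall": 0.184375}


def weighted_average(metric, groups):
    """Average a metric over groups, weighting each group by its question count."""
    total = sum(group["n"] for group in groups)
    return sum(group["n"] * group[metric] for group in groups) / total


for metric, reported in overall.items():
    combined = weighted_average(metric, [bridge, comparison])
    matches = math.isclose(combined, reported, abs_tol=1e-9)
    print(f"{metric}: combined={combined:.6f} reported={reported:.6f} match={matches}")
```

For the numbers shown, each combined value agrees with the reported overall score, so the bridge and comparison breakdowns together account for all 1,600 test questions.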
To train your own MDR model from scratch, use the script:
```bash
cd multihop_dense_retrieval
bash CQA_Scripts/MDR_end2end_CQA.sh
```
#### Evaluating QA Performance Under PAIR Framework
Set the desired privacy mode and retrieval mode in the script and run as follows:
```bash
cd multihop_dense_retrieval
bash CQA_Scripts/MDR_PairBaselines.sh
```
Descriptions of privacy and retrieval modes are included in the script.
- Privacy modes include preserving document privacy (DOC_PRIVACY), query privacy (QUERY_PRIVACY), and no privacy.
- Retrieval modes include ranking the OVERALL top k after each hop (4combo_overallrank) and selecting the top passages from EACH DOMAIN after each hop (4combo_separaterank).
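
To make the difference between the two retrieval modes concrete, here is a minimal sketch. It is not the code run by `MDR_PairBaselines.sh`: the function names, the `(passage_id, score)` representation, the toy scores, and the assumption that scores from different domains are directly comparable are all illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a retrieved passage: (passage_id, retrieval score).
Candidate = Tuple[str, float]


def overall_rank(per_domain: Dict[str, List[Candidate]], k: int) -> List[Candidate]:
    """Pool the candidates from every domain and keep the k highest-scoring ones."""
    pooled = [cand for cands in per_domain.values() for cand in cands]
    return sorted(pooled, key=lambda cand: cand[1], reverse=True)[:k]


def separate_rank(per_domain: Dict[str, List[Candidate]], k_per_domain: int) -> List[Candidate]:
    """Keep the top k_per_domain candidates from each domain independently."""
    kept: List[Candidate] = []
    for cands in per_domain.values():
        ranked = sorted(cands, key=lambda cand: cand[1], reverse=True)
        kept.extend(ranked[:k_per_domain])
    return kept


# Toy candidates for a single hop; identifiers and scores are made up.
hop_candidates = {
    "wiki": [("wiki_1", 0.9), ("wiki_2", 0.8), ("wiki_3", 0.7)],
    "email": [("email_1", 0.6), ("email_2", 0.2)],
}

print(overall_rank(hop_candidates, k=2))
print(separate_rank(hop_candidates, k_per_domain=1))
```

In this toy example, overall ranking keeps only Wikipedia passages because they score highest, whereas separate ranking guarantees that each domain contributes a passage to the next hop.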
## Citation
Please use the following BibTeX entry when using the dataset:
```
@article{arora2022reasoning,
title={Reasoning over Public and Private Data in Retrieval-Based Systems},
author={Simran Arora and Patrick Lewis and Angela Fan and Jacob Kahn and Christopher Ré},
year={2023},
url={https://aclanthology.org/2023.tacl-1.51/},
journal={Transactions of the Association for Computational Linguistics},
}
```
If you use MDR, please also cite the [Multi-Hop Dense Text Retrieval](https://github.com/facebookresearch/multihop_dense_retrieval) work.
## License
ConcurrentQA and related code are released under an MIT license. See [LICENSE](LICENSE) for more information.