# concurrentqa

**Repository Path**: mirrors_facebookresearch/concurrentqa

## Basic Information

- **Project Name**: concurrentqa
- **Description**: This repo contains data and code for the paper "Reasoning over Public and Private Data in Retrieval-Based Systems."
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-03-18
- **Last Updated**: 2026-04-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Reasoning over Public and Private Data in Retrieval-Based Systems

Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn*, Christopher Ré*

[**Paper**](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00580/117168/Reasoning-over-Public-and-Private-Data-in)
| [**Blog Post**](https://ai.facebook.com/blog/building-systems-to-reason-securely-over-private-data)
| [**Download**](#getting-the-concurrentqa-dataset-and-model-checkpoints)
| [**Citing**](#citation)

This repository contains dataset resources and code for ConcurrentQA, a textual QA benchmark that requires concurrent retrieval over multiple data distributions and privacy scopes. It also contains result analysis code and other resources for research in the private QA setting.

### Set up

Clone the repository as follows.

```bash
git clone git@github.com:facebookresearch/concurrentqa.git
cd concurrentqa
cd multihop_dense_retrieval
git submodule init
git submodule update
```

Set up the environment as follows (according to the MDR instructions). We encourage the use of conda environments.

```bash
conda create --name cqa python=3.6
conda activate cqa
cd concurrentqa/multihop_dense_retrieval/
bash setup.sh
```

If you are using CUDA 11, we find the following changes to the above setup work well: 1) use `python=3.7`, and 2) in `/multihop_dense_retrieval/setup.sh`, modify the faiss-gpu and pytorch instructions to the following:

```
conda install faiss-gpu cudatoolkit=11.3 -c pytorch
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
```

### Getting the ConcurrentQA Dataset and Model Checkpoints

To download the train, dev, and test sets along with the email and Wikipedia passage corpora and the model checkpoints, run:

```bash
bash scripts/download_cqa.sh
```

To download retriever and reader models trained on HotpotQA data, run:

```bash
bash scripts/download_hotpot.sh
```

The datasets can also be downloaded via Hugging Face:

- [Retrieval benchmark](https://huggingface.co/datasets/simarora/ConcurrentQA-Retrieval)
- [QA benchmark](https://huggingface.co/datasets/simarora/ConcurrentQA)

### Code

We include instructions 1) for training and evaluating models on ConcurrentQA data in the absence of privacy concerns and 2) for evaluating performance under the PAIR privacy framework.

#### Training Models on ConcurrentQA

To run evaluation with the provided model checkpoints, use the script:

```bash
cd multihop_dense_retrieval
bash CQA_Scripts/MDR_Eval_CQA.sh
```

```
Retrieval scores on test split ...
Avg PR: 0.604375
Avg P-EM: 0.190625
Avg 1-Recall: 0.276875
Path Recall: 0.184375
bridge Questions num: 1400
Avg PR: 0.5985714285714285
Avg P-EM: 0.18785714285714286
Avg 1-Recall: 0.265
Path Recall: 0.18428571428571427
comparison Questions num: 200
Avg PR: 0.645
Avg P-EM: 0.21
Avg 1-Recall: 0.36
Path Recall: 0.185

Reader scores on test split ...
'em': 0.48875,
'f1': 0.5650013858314458,
'joint_em': 0.1175,
'joint_f1': 0.3439091595024459,
'sp_em': 0.154375,
'sp_f1': 0.4496642766955267
```

To train your own MDR model from scratch, use the script:

```bash
cd multihop_dense_retrieval
bash CQA_Scripts/MDR_end2end_CQA.sh
```

#### Evaluating QA Performance Under PAIR Framework

Set the desired privacy mode and retrieval mode in the script and run as follows:

```bash
cd multihop_dense_retrieval
bash CQA_Scripts/MDR_PairBaselines.sh
```

Descriptions of privacy and retrieval modes are included in the script.

- Privacy modes include preserving document privacy (DOC_PRIVACY), preserving query privacy (QUERY_PRIVACY), and no privacy.
- Retrieval modes include ranking the OVERALL top k after each hop (4combo_overallrank) and selecting the top passages from EACH DOMAIN after each hop (4combo_separaterank).

## Citation

Please use the following BibTeX when using the dataset:

```
@article{arora2022reasoning,
  title={Reasoning over Public and Private Data in Retrieval-Based Systems},
  author={Simran Arora and Patrick Lewis and Angela Fan and Jacob Kahn and Christopher Ré},
  year={2023},
  url={https://aclanthology.org/2023.tacl-1.51/},
  journal={Transactions of the Association for Computational Linguistics},
}
```

If you use MDR, please also cite the [Multi-Hop Dense Text Retrieval](https://github.com/facebookresearch/multihop_dense_retrieval) work.

## License

ConcurrentQA and related code are under an MIT license. See [LICENSE](LICENSE) for more information.
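
### Illustration: the two retrieval modes

To make the difference between the two retrieval modes listed under the PAIR framework more concrete, here is a minimal Python sketch. It is not the code in `CQA_Scripts/MDR_PairBaselines.sh` or in MDR. The function names (`overall_rank`, `separate_rank`), the domain labels, the passage IDs, and the scores are all hypothetical and chosen only for illustration. The sketch assumes each domain has already returned scored candidate passages for a single hop.

```python
from typing import Dict, List, Tuple

# Hypothetical types: a candidate is (passage_id, retrieval_score).
Candidate = Tuple[str, float]
Selected = Tuple[str, str, float]  # (domain, passage_id, retrieval_score)


def overall_rank(candidates: Dict[str, List[Candidate]], k: int) -> List[Selected]:
    """Pool candidates from all domains and keep the k highest-scoring overall."""
    pooled = [
        (domain, pid, score)
        for domain, passages in candidates.items()
        for pid, score in passages
    ]
    pooled.sort(key=lambda item: item[2], reverse=True)
    return pooled[:k]


def separate_rank(candidates: Dict[str, List[Candidate]], k_per_domain: int) -> List[Selected]:
    """Keep the k_per_domain highest-scoring candidates from each domain independently."""
    selected: List[Selected] = []
    for domain, passages in candidates.items():
        top = sorted(passages, key=lambda p: p[1], reverse=True)[:k_per_domain]
        selected.extend((domain, pid, score) for pid, score in top)
    return selected


# Made-up scores for one hop over two domains (illustrative only).
hop_candidates = {
    "wiki": [("w1", 0.9), ("w2", 0.8), ("w3", 0.7)],
    "email": [("e1", 0.6), ("e2", 0.2)],
}

overall = overall_rank(hop_candidates, k=2)
separate = separate_rank(hop_candidates, k_per_domain=1)
print(overall)
print(separate)
```

With these made-up scores, pooling keeps only passages from the higher-scoring domain, while per-domain selection guarantees that each domain contributes a passage. This is one way to see why the two modes can behave differently when score distributions differ across domains.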