# The TARS Database Creation
This folder contains code to build the underlying database for the Text Analysis and Retrieval Tool (TARS) described in *Section 3 - Solution Architecture* of the paper *An AI-powered Tool for Central Bank Business Liaisons: Quantitative Indicators and On-demand Insights from Firms*.

The TARS database allows users to efficiently filter liaison textual data using the dashboard in the `frontend` code directory, and to build textual measures relevant to macroeconomic conditions using the code in `Capabilities`.

> Note: The demo data used in this repo has been generated by ChatGPT to broadly reflect the type of text extracted and enriched in the Business Liaison TARS database described in the paper. However, because this data is artificial, the output will not exactly match the measures described in the paper.

# Code structure
```
Repo
├── Data
│   ├── Example Liaison Summary Note.docx (input into Extraction step)
│   └── Example_liaison_data.csv (input into Enrichment step)
└── backend
    ├── TARS_Enrichment.ipynb (demo on enriching liaison-like text)
    ├── TARS_Extraction.ipynb (demo on extracting text from a Word document)
    ├── TARSml.py (contains NLP model files used to enrich the text)
    ├── TARSutils.py (contains utility functions for extracting, cleaning and enriching text)
    └── environment.yml (file for building working python environment to run code)
```

# Getting started
### Quick Start 
1. Ensure Python and all required packages are installed.
2. Open one of the Python notebooks and run each cell.
3. Observe the output of each cell to gain an understanding of the process of building a TARS for liaison text.

### Requirements
For the Python and package versions used in the original paper, see `environment.yml`. Check your installed package versions from a notebook cell using `!pip list | grep <package_name>`, or create a virtual environment that mirrors the one used in the paper with `conda env create -f environment.yml`. See the [conda documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) for more details.

To run the notebooks with the example data, ensure the following files are in the `../Data/` directory:
* `Example Liaison Summary Note.docx` for running `TARS_Extraction.ipynb`
* `Example_liaison_data.csv` for running `TARS_Enrichment.ipynb`

## File Details
#### TARS Extraction
The first step in creating the business liaison TARS is to extract the textual information from summary notes of liaison meetings, which are written as Word documents. The example code in the repository extracts the text from the Word document in `../Data/` and saves it at the paragraph level, based on the formatting in the document. The example document was simulated using ChatGPT to reflect the style used in summary liaison notes created by the RBA's Regional and Industry Analysis team.
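
As a rough illustration of paragraph-level extraction, the sketch below reads a `.docx` file using only the Python standard library (a `.docx` file is a zip archive whose main text lives in `word/document.xml`). It is not the code in `TARS_Extraction.ipynb` or `TARSutils.py`, which may well use a dedicated library; the function name `extract_paragraphs` and the output fields `style` and `text` are assumptions made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(docx_path):
    """Return one dict per non-empty paragraph, with its style id and text.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (<w:t>) that belong to this paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t")).strip()
        if not text:
            continue  # skip empty paragraphs used only for spacing
        # The style id (e.g. "Heading1") is stored at w:pPr/w:pStyle/@w:val
        style = para.find(f"{W_NS}pPr/{W_NS}pStyle")
        style_id = style.get(W_NS + "val") if style is not None else None
        paragraphs.append({"style": style_id, "text": text})
    return paragraphs
```

Calling `extract_paragraphs("../Data/Example Liaison Summary Note.docx")` would return a list of paragraphs whose `style` field can then be used to separate headings from body text, which is one way formatting can guide how the text is saved.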

#### TARS Enrichment
Once text has been systematically extracted from the Word documents, it is enriched with the metadata of those documents and with outputs from language models (such as topic tags and sentiment scores). The second notebook loads the language models used in the TARS backend pipeline and runs each over another simulated dataset of liaison text to enrich it with the model outputs.
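
The enrichment step can be pictured as adding columns to each row of extracted text. The sketch below is a minimal, assumption-laden version of that idea: the column name `text`, the metadata field `source_file`, and the toy "models" are all hypothetical stand-ins so the example runs without downloading anything. The actual notebook applies the language models listed under Language Models.

```python
import csv
import io


def enrich_rows(rows, metadata, enrichers, text_field="text"):
    """Return new row dicts with document metadata and model outputs added.

    `enrichers` maps an output column name to a callable that takes the text.
    """
    enriched = []
    for row in rows:
        new_row = dict(row)       # copy so the input rows are left untouched
        new_row.update(metadata)  # document-level fields shared by every row
        for column, model in enrichers.items():
            new_row[column] = model(row[text_field])
        enriched.append(new_row)
    return enriched


# Stand-in "models": plain functions, NOT the models used in the paper.
def toy_topic(text):
    return "wages" if "wage" in text.lower() else "other"


def toy_sentiment(text):
    return 1.0 if "improved" in text.lower() else 0.0


# Tiny in-memory CSV with a hypothetical `text` column
sample_csv = io.StringIO(
    "text\n"
    "Wage growth has picked up.\n"
    "Demand improved over the quarter.\n"
)
rows = list(csv.DictReader(sample_csv))
result = enrich_rows(
    rows,
    metadata={"source_file": "example.docx"},
    enrichers={"topic": toy_topic, "sentiment": toy_sentiment},
)
for row in result:
    print(row)
```

Passing the models in as callables keeps the row-handling logic separate from model loading, so the toy functions can be swapped for real model wrappers without changing `enrich_rows`.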

## Language Models

The textual data is enriched with a suite of language models that allow for the extraction of topic and tone measures, as well as precise numerical quantities contained within the text. The models are fine-tuned, transformer-based, open-source language models pulled from the Hugging Face public repository. The models used in this work are:
* [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli): A zero-shot classification model; used for extracting topics relevant to business liaison discussions. Also used in the numerical extraction pipeline to extract the sign of numerical quantities (increasing, decreasing or neither).
* [finbert](https://huggingface.co/ProsusAI/finbert): A sentiment model fine-tuned on financial text; used to extract business liaison-relevant sentiment. 
* [roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2): A question answering model; used for extracting numerical quantities from text.
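
For orientation, the sketch below shows one way the three models could be loaded and called through the Hugging Face `transformers` `pipeline` API. It is not taken from `TARSml.py`: the helper names, the 0.5 topic threshold, the signed sentiment mapping, and any question text passed to the question answering model are assumptions made for illustration. Calling `load_models()` requires `transformers` and a backend such as PyTorch, and downloads the model weights on first use.

```python
def load_models():
    """Build the three pipelines (hypothetical helper, not from TARSml.py)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return {
        "topics": pipeline("zero-shot-classification", model="facebook/bart-large-mnli"),
        "sentiment": pipeline("text-classification", model="ProsusAI/finbert"),
        "qa": pipeline("question-answering", model="deepset/roberta-base-squad2"),
    }


def tag_topics(classifier, text, candidate_topics, threshold=0.5):
    """Keep each candidate topic whose zero-shot score reaches the threshold.

    The threshold value is an assumption, not a figure from the paper.
    """
    result = classifier(text, candidate_labels=candidate_topics, multi_label=True)
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


def sentiment_score(classifier, text):
    """Map the top FinBERT label to a signed score in [-1, 1] (assumed mapping)."""
    top = classifier(text)[0]  # a dict with "label" and "score" keys
    sign = {"positive": 1.0, "negative": -1.0}.get(top["label"].lower(), 0.0)
    return sign * top["score"]


def extract_quantity(qa_model, text, question):
    """Return the answer span chosen by the QA model, or None if it is empty."""
    answer = qa_model(question=question, context=text)
    return answer["answer"].strip() or None


# Example use (commented out because it downloads several large models):
# models = load_models()
# tag_topics(models["topics"], "Wage growth has picked up.", ["wages", "prices"])
# sentiment_score(models["sentiment"], "Demand improved over the quarter.")
# extract_quantity(models["qa"], "Prices rose by 5 per cent.", "By how much did prices rise?")
```

Each helper takes the pipeline object as an argument, so the post-processing logic can be exercised with lightweight stand-ins before the full models are loaded.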
