Repo for the Themis Debiasing system.
### Themis Debiasing System
#### A Quick Note
This is not a production-quality repo. I've tried to provide instructions below, but I realize that some of the code is messy. Also, all of my instruction files for run commands use the fish shell.
#### General Instructions
For all instructions, you may have to alter the folder structure, IP addresses, and user names for the files and databases.
At a high level, the Data directory stores the data along with the files used to download and prepare it. The Code directory stores code. Under SelectionBias/ there are some general files I use, plus files in ReweightV1 and UniversalModelV2. ReweightV1 stores the linear regression and IPF (iterative proportional fitting) code. UniversalModelV2 stores the Bayesian network (BN) code and the aggregate constraint model.
I used three main datasets in my experiments: Flights, IMDB, and the CHILD Bayesian network. Each folder under Data has instructions for generating its dataset. I also have a copy of each data file in the folder.
To generate the population aggregates (projections of population columns, precomputed so that later code does not have to deal with summations), run Code/SelectionBias/generate_subselections_data.R. You'll have to hard-code the data to aggregate and the attributes you want to generate aggregates for. I generated all possible 2-3 attribute combinations and then selected the right ones later in the pipeline.
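
As a rough illustration of what "all possible 2-3 attribute combinations" means, here is a minimal Python sketch. It is not the repo's R script; the function name `generate_aggregates` and the column names in the toy population are hypothetical. It computes a count table for every 2- and 3-attribute projection of a small population.

```python
from collections import Counter
from itertools import combinations


def generate_aggregates(rows, attributes, sizes=(2, 3)):
    """Return {attribute_tuple: Counter({value_tuple: count})} for each combination."""
    aggregates = {}
    for k in sizes:
        for attrs in combinations(attributes, k):
            # Project each row onto the chosen attributes and count occurrences.
            aggregates[attrs] = Counter(tuple(row[a] for a in attrs) for row in rows)
    return aggregates


# Toy population; the column names are made up for illustration.
population = [
    {"carrier": "AA", "origin": "SEA", "dest": "JFK"},
    {"carrier": "AA", "origin": "SEA", "dest": "LAX"},
    {"carrier": "DL", "origin": "SEA", "dest": "JFK"},
    {"carrier": "DL", "origin": "ATL", "dest": "JFK"},
]

aggs = generate_aggregates(population, ["carrier", "origin", "dest"])
for attrs, counts in aggs.items():
    print(attrs, dict(counts))
```

With three attributes this produces three pairwise aggregates and one three-way aggregate; picking the ones an experiment needs can then happen later, as described above.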
To generate samples, run Code/SelectionBias/ I have some existing filters you can select, but feel free to add your own.
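
The repo's actual filters are not shown here, so the following is only a sketch of the general idea of a filter-based biased sample, in Python. The helper `filtered_sample`, its parameters, and the toy columns are all hypothetical: rows that match a predicate are kept with one probability, and all other rows with another.

```python
import random


def filtered_sample(rows, predicate, p_match=1.0, p_other=0.0, seed=0):
    """Keep each row with probability p_match if it satisfies predicate, else p_other."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample = []
    for row in rows:
        p = p_match if predicate(row) else p_other
        if rng.random() < p:
            sample.append(row)
    return sample


# Toy population; the column names are made up for illustration.
population = [{"carrier": c, "delay": d} for c in ("AA", "DL") for d in range(50)]


def is_aa(row):
    return row["carrier"] == "AA"


# Hard filter: only AA rows survive.
only_aa = filtered_sample(population, is_aa)

# Soft filter: AA rows are heavily over-represented, but some DL rows remain.
mostly_aa = filtered_sample(population, is_aa, p_match=0.9, p_other=0.1, seed=1)

print(len(only_aa), len(mostly_aa))
```

Adding your own filter under this sketch amounts to writing another predicate function.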
To generate random aggregates or queries, I have some helper functions in Code/SelectionBias.
For the BN experiments on the CHILD data, see ForGraphs/CherryTree/run_instructions.txt
For the Point Query experiments on the Flights and IMDB data, see ForGraphs/ModelV2ErrorSummary/run_instructions.txt
For the IDEBench experiments on the Flights data, see ForGraphs/IDEErrorSummary/run_instructions.txt. You could use this to do point queries, too.
