Dummy data and expectations
Because OpenSAFELY doesn't allow direct access to individual patient records, researchers must use dummy data for developing their analytic code on their own computer.
OpenSAFELY requires you to define expectations in your study definition: these describe the properties of each variable, and are used to generate random data that match the expectations.
You can also provide your own dummy data.
Defining return_expectations
Every variable in a study definition must have a return_expectations argument defined (with the exception of the population variable).
This defines the general shape or distribution of the variables in the dummy data used for developing the code. The framework is currently relatively unsophisticated: each variable is generated independently of all others, so relationships between variables are not reproduced. In most cases, the dummy data is good enough to test that your study runs from start to finish, but not always. You can find (and contribute to!) discussions on improving the dummy data framework.
Specifying default distributions
All variables use a default defined at the top of the study definition, with the default_expectations argument, as follows:
    from cohortextractor import StudyDefinition

    study = StudyDefinition(
        # Configure the expectations framework
        default_expectations={
            "date": {"earliest": "1900-01-01", "latest": "today"},
            "rate": "exponential_increase",
            "incidence": 0.5,
        },
        ...
    )
These defaults apply to all subsequently defined variables. incidence and rate have slightly different meanings depending on the variable type.
In this case, we are saying that:
- Event dates are expected to be distributed between 1900 and today, with exponentially increasing frequency, and with events occurring for 50% of patients.
- Values for binary variables are expected to be positive 50% of the time.
- Values for categorical variables are expected to be present (i.e., non-missing) 50% of the time.
- Values for numeric variables are expected to be present (i.e., non-missing) 50% of the time.
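To make the two readings of incidence concrete, here is a minimal sketch using only the Python standard library. It is not the cohortextractor implementation; the helper names dummy_binary and dummy_numeric are hypothetical, and the mean and standard deviation are example values only.

```python
import random


def dummy_binary(incidence, rng):
    """Return 1 with probability `incidence`, otherwise 0."""
    return 1 if rng.random() < incidence else 0


def dummy_numeric(incidence, mean, stddev, rng):
    """Return a normally distributed value with probability `incidence`,
    otherwise None to represent a missing value."""
    if rng.random() < incidence:
        return rng.gauss(mean, stddev)
    return None


rng = random.Random(42)  # fixed seed so the sketch is reproducible
flags = [dummy_binary(0.5, rng) for _ in range(10_000)]
values = [dummy_numeric(0.5, 25, 5, rng) for _ in range(10_000)]

# For a binary variable, incidence is the share of positive values.
positive_share = sum(flags) / len(flags)
# For a numeric variable, incidence is the share of non-missing values.
present_share = sum(v is not None for v in values) / len(values)
print(f"positive: {positive_share:.2f}, present: {present_share:.2f}")
```

Both shares should come out close to 0.5, but they describe different things: positives in one case, non-missing values in the other.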
Specifying variable-specific distributions
If the defaults need to be overridden, use the return_expectations argument within the variable extractor function. For example, passing return_expectations={"incidence": 0.2} when defining a copd variable overrides the incidence set by the default_expectations argument at the start of the StudyDefinition() with 20% for the copd variable.
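One way to picture the override for the copd variable is as a variable-specific dictionary layered on top of the defaults. The sketch below assumes the settings are combined key by key, so that any key not overridden keeps its default. The helper resolve_expectations is hypothetical and is not part of cohortextractor.

```python
import copy


def resolve_expectations(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied on top,
    merging nested dictionaries recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = resolve_expectations(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


default_expectations = {
    "date": {"earliest": "1900-01-01", "latest": "today"},
    "rate": "exponential_increase",
    "incidence": 0.5,
}

# Only the incidence changes for copd; the other settings keep their defaults.
copd_expectations = resolve_expectations(default_expectations, {"incidence": 0.2})
print(copd_expectations)
```

The defaults themselves are left untouched, so other variables still see an incidence of 0.5.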
All options
The following options are currently available for dummy data (note numeric values are shown as examples only):
integers
    {"int": {"distribution": "normal", "mean": 25, "stddev": 5}, "incidence": 0.5}
    {"int": {"distribution": "population_ages"}, "incidence": 1}
numeric
    {"float": {"distribution": "normal", "mean": 25, "stddev": 5}, "incidence": 0.75}
binary
    {"incidence": 0.33}
categorical
    {"category": {"ratios": {"cat1": 0.1, "cat2": 0.2, "cat3": 0.7}}, "incidence": 1}
date
    {"date": {"earliest": "1900-01-01", "latest": "today"}, "rate": "exponential_increase"}
    {"date": {"earliest": "1900-01-01", "latest": "today"}, "rate": "uniform"}
Specific parameters/variable notes
"incidence" has a slightly different meaning depending on the variable type it is applied to:
* binary: describes actual incidence (0.5 means values are expected to be positive 50% of the time)
* int/float/categorical: indicates non-missingness (0.5 means values are expected to be present, i.e. non-missing, 50% of the time)
"rate"
* used for the distribution of date values, with either:
    * "exponential_increase"
    * "uniform"
* or for non-date values:
    * "universal": indicates every patient is expected to have a value (i.e. an alias for incidence=1)
"distribution" (numeric variables) currently has two possible options:
* normal
* population_ages: samples from the distribution of ages in the UK taken from the Office for National Statistics.
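The difference between the two date rates can be illustrated with a small sketch. It assumes that "uniform" makes every day equally likely and that "exponential_increase" concentrates events towards the latest date. The growth parameter and the dummy_date helper are illustrative assumptions, not the framework's actual model.

```python
import math
import random
from datetime import date, timedelta


def dummy_date(earliest, latest, rate, rng, growth=5.0):
    """Draw one date between `earliest` and `latest` (inclusive).

    `growth` is an assumed shape parameter used only in this sketch.
    """
    span_days = (latest - earliest).days
    u = rng.random()
    if rate == "uniform":
        fraction = u
    elif rate == "exponential_increase":
        # Inverse CDF of a density proportional to exp(growth * x) on [0, 1].
        fraction = math.log(1 + u * (math.exp(growth) - 1)) / growth
    else:
        raise ValueError(f"unknown rate: {rate!r}")
    return earliest + timedelta(days=round(fraction * span_days))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
earliest, latest = date(1900, 1, 1), date.today()
uniform_dates = [dummy_date(earliest, latest, "uniform", rng) for _ in range(5_000)]
increasing_dates = [
    dummy_date(earliest, latest, "exponential_increase", rng) for _ in range(5_000)
]

# Compare the share of dates falling in the later half of the range.
midpoint = earliest + (latest - earliest) / 2
late_uniform = sum(d > midpoint for d in uniform_dates) / len(uniform_dates)
late_increasing = sum(d > midpoint for d in increasing_dates) / len(increasing_dates)
print(f"after midpoint: uniform {late_uniform:.2f}, increasing {late_increasing:.2f}")
```

Under these assumptions, a much larger share of the exponentially increasing dates falls in the later half of the range than of the uniform dates.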
Providing your own dummy data
If the expectations framework does not offer enough control over the dummy data that is generated, you can provide your own.
In your project.yaml, you can add a dummy_data_file value to a cohortextractor action.
For instance:
    generate_cohort:
      run: cohortextractor:latest generate_cohort
      dummy_data_file: test-data/dummy-data.csv
      outputs:
        highly_sensitive:
          cohort: output/input.csv
The dummy data file must be committed to the repo. You should generate the dummy data using a script, and commit the script to the repo.
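A generating script might look something like the following sketch. The column names (patient_id, age, sex, copd), their distributions, and the script name make_dummy_data.py are hypothetical; replace them with whatever your own study needs. Every value is drawn from a seeded random number generator, so the file can be regenerated exactly and contains no real data.

```python
import csv
import random
import sys


def write_dummy_data(stream, n_patients=100, seed=2021):
    """Write `n_patients` rows of randomly generated data as CSV to `stream`."""
    rng = random.Random(seed)  # fixed seed: the committed file can be regenerated exactly
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["patient_id", "age", "sex", "copd"])  # hypothetical columns
    for patient_id in range(1, n_patients + 1):
        age = max(0, min(110, round(rng.gauss(45, 20))))  # clamp to a plausible range
        sex = rng.choice(["F", "M"])
        copd = 1 if rng.random() < 0.2 else 0  # roughly 20% of rows are positive
        writer.writerow([patient_id, age, sex, copd])


if __name__ == "__main__":
    # Example: python make_dummy_data.py > test-data/dummy-data.csv
    write_dummy_data(sys.stdout)
```

Writing to a stream rather than a fixed path keeps the function easy to test. Redirecting the output, as in the comment, produces the file to commit alongside the script.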
Warning
You must not use real clinical data for dummy data!