Constraints
Synthetic data is generated to match the underlying statistical patterns of a given source dataset, with some added noise. Sometimes a dataset must follow strict rules in order to be valid, and data synthesis will not recognize those rules by default. For example, if you have a state column alongside a zip_code column, only certain combinations of the two are allowed: you cannot have a California ZIP code in a row whose state is Texas.
Note that defining constraints will generally not affect statistical fidelity; their impact is mostly cosmetic. In the above example, even without a constraint you will still mostly see realistic ZIP/state pairs. Adding a constraint ensures that every row follows the rule.
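To make the ZIP/state rule concrete, here is a minimal sketch in plain Python that finds synthetic rows whose state/zip_code pair never appears in the source data. The sample rows and the unseen_combinations helper are hypothetical and written only for illustration; they are not part of the product.

```python
# Hypothetical source rows: the only state/zip_code pairs treated as valid.
source_rows = [
    {"state": "CA", "zip_code": "94105"},
    {"state": "CA", "zip_code": "90001"},
    {"state": "TX", "zip_code": "73301"},
]

# Hypothetical synthetic rows; the last one pairs a California ZIP with Texas.
synthetic_rows = [
    {"state": "CA", "zip_code": "90001"},
    {"state": "TX", "zip_code": "73301"},
    {"state": "TX", "zip_code": "94105"},
]


def unseen_combinations(source, synthetic, columns):
    """Return synthetic rows whose values for `columns` never occur together in source."""
    allowed = {tuple(row[c] for c in columns) for row in source}
    return [row for row in synthetic if tuple(row[c] for c in columns) not in allowed]


violations = unseen_combinations(source_rows, synthetic_rows, ["state", "zip_code"])
print(violations)
```

Without a constraint, a check like this would be expected to flag only a small share of rows; with one, it should flag none.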
Adding constraints
You can add constraints to a database during onboarding; the standard onboarding flow has a step for adding constraints, and you can also add them later via the "Constraints" tab on the database details page.
The full list of supported constraints is below. Each entry gives the constraint name, what it ensures, and an example.
Derive
Ensures that specified columns will be populated with data from another column
birth_year should be populated based on the "year" portion of birth_date
Group
Ensures that the specified columns never appear in a combination that is not present in the source data
State, city, ZIP can only be combined as observed in source data
Compare
Ensures that inequalities are maintained between numeric or date fields
arrival_date should always be earlier than departure_date
Conditional
Ensures that a target column will be populated with a specific value when another column contains a specified value
Values generated for dischargable will be "Y" when healthy is "true"
Calculate
Ensures that a target column will be populated with the result of a calculation between two columns
Values generated for 'total_cost' will be the result of 'base_cost' + 'fee'
SpecialValues
Ensures that special values within columns are preserved
'-1' has a special meaning in a continuous column and should be modeled independently from the primary distribution
Existence
Ensures that a target column will be populated with 1 when a user-specified condition is met and 0 otherwise. Note: only integer columns are supported
Mask the age column by setting values to 1 if age is greater than 20, else 0
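The row-level examples above (Derive, Compare, Conditional, Calculate, Existence) can be read as simple checks on a single row. The sketch below expresses them in plain Python to show what each rule guarantees; it is not how the product implements constraints. The check_row function, the sample rows, and the age_over_20 flag column are hypothetical names chosen for illustration. Group and SpecialValues are omitted because they depend on the source data as a whole rather than on one row.

```python
from datetime import date


def check_row(row):
    """Return the names of the illustrative rules that `row` violates."""
    failed = []
    # Derive: birth_year comes from the year portion of birth_date.
    if row["birth_year"] != row["birth_date"].year:
        failed.append("Derive")
    # Compare: arrival_date is earlier than departure_date.
    if not row["arrival_date"] < row["departure_date"]:
        failed.append("Compare")
    # Conditional: dischargable is "Y" whenever healthy is "true".
    if row["healthy"] == "true" and row["dischargable"] != "Y":
        failed.append("Conditional")
    # Calculate: total_cost equals base_cost + fee (integers, to avoid float rounding).
    if row["total_cost"] != row["base_cost"] + row["fee"]:
        failed.append("Calculate")
    # Existence: the hypothetical age_over_20 flag is 1 if age > 20, else 0.
    if row["age_over_20"] != (1 if row["age"] > 20 else 0):
        failed.append("Existence")
    return failed


# Hypothetical row that satisfies every rule.
valid_row = {
    "birth_date": date(1990, 5, 17), "birth_year": 1990,
    "arrival_date": date(2024, 3, 1), "departure_date": date(2024, 3, 4),
    "healthy": "true", "dischargable": "Y",
    "base_cost": 100, "fee": 15, "total_cost": 115,
    "age": 34, "age_over_20": 1,
}

# Same row with a departure before arrival and a wrong total.
invalid_row = dict(valid_row, departure_date=date(2024, 2, 28), total_cost=120)

print(check_row(valid_row))
print(check_row(invalid_row))
```

With the corresponding constraints defined, every generated row would be expected to pass the matching check.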