Constraints

Synthetic data is generated to match the underlying statistical patterns of a source dataset, with some added noise. Some datasets must also follow strict rules in order to be valid, and data synthesis does not recognize those rules by default. For example, if a dataset has a state column and a zip_code column, only certain combinations of the two are valid: you cannot have a California ZIP code in a row whose state is Texas.

Note that defining constraints generally does not affect statistical fidelity; their effect is mostly cosmetic. In the example above, even without a constraint you will still mostly see realistic ZIP/state pairs. Adding a constraint ensures that every row follows the rule.
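
To make the state/ZIP example concrete, the following is a minimal sketch of the kind of check a constraint guarantees. It is not the product's API: the helper name `unseen_combinations` and the sample rows are hypothetical, and the sketch assumes rows are plain dictionaries.

```python
# Hypothetical illustration only; this is not how the product implements constraints.

def unseen_combinations(source_rows, synthetic_rows, columns):
    """Return synthetic rows whose values for `columns` never co-occur in the source."""
    # Collect every combination of the given columns that appears in the source data.
    allowed = {tuple(row[c] for c in columns) for row in source_rows}
    return [
        row for row in synthetic_rows
        if tuple(row[c] for c in columns) not in allowed
    ]

source = [
    {"state": "CA", "zip_code": "94105"},
    {"state": "TX", "zip_code": "73301"},
]
synthetic = [
    {"state": "CA", "zip_code": "94105"},  # pair observed in the source
    {"state": "TX", "zip_code": "94105"},  # California ZIP paired with Texas
]

violations = unseen_combinations(source, synthetic, ["state", "zip_code"])
print(violations)  # [{'state': 'TX', 'zip_code': '94105'}]
```

Without a constraint, rows like the second one can occasionally appear; with a constraint in place, the list of violations would always be empty.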

Adding constraints

You can add constraints to a database during onboarding: the standard onboarding flow includes a step for adding constraints. You can also add them later via the "Constraints" tab on the database details page.

The full list of supported constraints is below.

| Constraint | Description | Example |
| --- | --- | --- |
| Derive | Ensures that specified columns are populated with data from another column | birth_year should be populated based on the "year" portion of birth_date |
| Group | Ensures that the specified columns never appear in a combination that was not observed in the source data | state, city, and ZIP can only be combined as observed in the source data |
| Compare | Ensures that inequalities are maintained between numeric fields | arrival_date should always be earlier than departure_date |
| Conditional | Ensures that a target column is populated with a specific value when another column contains a specified value | Values generated for dischargable will be "Y" when healthy is "true" |
| Calculate | Ensures that a target column is populated with the result of a calculation between two columns | Values generated for total_cost will be the result of base_cost + fee |
| SpecialValues | Ensures that special values within columns are preserved | "-1" has a special meaning in a continuous column and should be modeled independently from the primary distribution |
| Existence | Ensures that a target column is populated with 1 if the condition matches the user input, or 0 otherwise. Note: only integer columns are supported | Mask the age column by setting values to 1 if age is greater than 20, and 0 otherwise |
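
The examples in the table can be read as simple per-row rules. The sketch below expresses several of them (Derive, Conditional, Calculate, Existence, and Compare) in plain Python to show what each guarantees. It is an illustration under stated assumptions, not the product's implementation: the function names `apply_rules` and `satisfies_compare` are hypothetical, and the Existence result is written to a hypothetical age_flag column so the original age stays visible.

```python
from datetime import date

# Hypothetical illustration of the table's examples; not the product's API.

def apply_rules(row):
    """Return a copy of `row` with the constrained target columns filled in."""
    out = dict(row)
    # Derive: birth_year comes from the "year" portion of birth_date.
    out["birth_year"] = out["birth_date"].year
    # Conditional: dischargable is "Y" when healthy is "true".
    if out["healthy"] == "true":
        out["dischargable"] = "Y"
    # Calculate: total_cost is base_cost + fee.
    out["total_cost"] = out["base_cost"] + out["fee"]
    # Existence: 1 if age is greater than 20, else 0 (integer column).
    out["age_flag"] = 1 if out["age"] > 20 else 0
    return out

def satisfies_compare(row):
    """Compare: arrival_date must be earlier than departure_date."""
    return row["arrival_date"] < row["departure_date"]

row = {
    "birth_date": date(1985, 6, 14),
    "healthy": "true",
    "dischargable": "N",
    "base_cost": 100,
    "fee": 25,
    "age": 38,
    "arrival_date": date(2024, 1, 3),
    "departure_date": date(2024, 1, 5),
}

fixed = apply_rules(row)
print(fixed["birth_year"], fixed["dischargable"], fixed["total_cost"], fixed["age_flag"])
print(satisfies_compare(fixed))
```

Group and SpecialValues are left out here because they depend on the source data as a whole rather than on a single row.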
