Data quality
Every query you issue to Subsalt generates a quality report that compares the synthetic data you received to the data in the source system. You can access this report in the Queries section of the Subsalt portal.
Interpreting the report
In most cases the quality report contains four sections:
Univariate similarity
Bivariate similarity
Multivariate similarity
Machine learning efficacy
Univariate similarity
Univariate similarity measures how similar each field in your synthetic data is to its source, independently of all other columns. This is done using a variety of statistical tests, such as Kolmogorov-Smirnov and chi-squared tests. Each field receives a score from 0 to 1, where 1 means the synthetic data has a statistically identical distribution for the given field and 0 means the distributions are not at all similar. The report also includes side-by-side histograms and bar charts to help interpret these scores.
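As a rough illustration of how per-field scores like these can be derived, here is a minimal Python sketch using scipy. The function name, the use of a p-value as the categorical score, and the inversion of the KS statistic are all assumptions for illustration, not Subsalt's actual implementation.

```python
from collections import Counter

import numpy as np
from scipy import stats

def univariate_similarity(source, synthetic, categorical=False):
    """Score one field's distributional similarity: 0 (dissimilar) to 1 (identical)."""
    if categorical:
        # Chi-squared test comparing category frequencies. Assumes every
        # category observed in the source also appears in the synthetic data.
        src, syn = Counter(source), Counter(synthetic)
        categories = sorted(set(src) | set(syn))
        observed = np.array([src[c] for c in categories], dtype=float)
        expected = np.array([syn[c] for c in categories], dtype=float)
        # Rescale so both frequency tables share the same total,
        # as scipy's chisquare requires.
        expected *= observed.sum() / expected.sum()
        _, p_value = stats.chisquare(observed, expected)
        return p_value  # a high p-value means the distributions are indistinguishable
    # Two-sample Kolmogorov-Smirnov: the D statistic is 0 for identical
    # distributions and approaches 1 for disjoint ones, so invert it.
    d_stat, _ = stats.ks_2samp(source, synthetic)
    return 1.0 - d_stat
```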

Bivariate similarity
Bivariate similarity measures how well relationships between pairs of fields are preserved: for each pair of fields, the relationship observed in the synthetic data is compared to the relationship between the same pair in the source data.
For continuous fields, the score is based on correlations: a score of 1 means the correlation between the two fields is identical in the real and synthetic data, while 0 means the correlations are completely inverse. A sketch of such a score follows below.
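Because correlations lie in [-1, 1], the largest possible gap between a real and a synthetic correlation is 2, which yields exactly the 1-to-0 scale described above. A minimal sketch (the function name and the use of Pearson correlation are assumptions for illustration):

```python
import numpy as np

def correlation_similarity(real_a, real_b, syn_a, syn_b):
    """Score preservation of the correlation between two continuous fields:
    1 = identical correlations, 0 = completely inverse correlations."""
    real_corr = np.corrcoef(real_a, real_b)[0, 1]
    syn_corr = np.corrcoef(syn_a, syn_b)[0, 1]
    # Correlations live in [-1, 1], so the largest possible gap is 2.
    return 1.0 - abs(real_corr - syn_corr) / 2.0
```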
For categorical fields, the score is calculated by building a contingency table for both the source and synthetic data, then computing the cell-wise difference between the two tables. A score of 1 means the contingency tables are identical, while 0 means they are extremely dissimilar.
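One plausible way to turn that cell-wise difference into a 0-to-1 score, assuming pandas DataFrames (the normalization to joint frequencies and the halving of the total difference are my assumptions, chosen so identical tables score 1 and disjoint ones score 0):

```python
import pandas as pd

def contingency_similarity(real_df, syn_df, col_a, col_b):
    """Score agreement of two categorical fields' joint distribution."""
    # Normalized contingency tables (joint frequencies summing to 1).
    real_table = pd.crosstab(real_df[col_a], real_df[col_b], normalize=True)
    syn_table = pd.crosstab(syn_df[col_a], syn_df[col_b], normalize=True)
    # Align on the union of categories, treating missing cells as 0.
    real_table, syn_table = real_table.align(syn_table, fill_value=0)
    # Half the total absolute cell-wise difference is 0 for identical
    # tables and 1 for completely disjoint ones; invert it for a score.
    return 1.0 - (real_table - syn_table).abs().to_numpy().sum() / 2.0
```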

Multivariate similarity
Multivariate similarity measures how well deeper multivariate relationships are maintained in your synthetic data. Currently, multivariate similarity considers only continuous fields, as it is based on principal component analysis (PCA).
The source data is reduced via PCA and plotted on a 2D plane using the first two principal components. The same fitted PCA is then used to project the synthetic data onto that plane. If the deeper multivariate relationships are preserved, the two plots will have similar shapes.
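A minimal sketch of this kind of projection with scikit-learn; the standardization step, plotting details, and function name are assumptions for illustration, not documented Subsalt behavior:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_pca_overlay(real_df, syn_df, continuous_cols):
    """Project both datasets onto the first two principal components of the
    source data and plot them side by side for visual comparison."""
    # Fit the scaler and the PCA on the source data only, then reuse
    # the same fitted transforms for the synthetic data.
    scaler = StandardScaler().fit(real_df[continuous_cols])
    pca = PCA(n_components=2).fit(scaler.transform(real_df[continuous_cols]))

    real_2d = pca.transform(scaler.transform(real_df[continuous_cols]))
    syn_2d = pca.transform(scaler.transform(syn_df[continuous_cols]))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
    ax1.scatter(real_2d[:, 0], real_2d[:, 1], s=5, alpha=0.4)
    ax1.set_title("Source")
    ax2.scatter(syn_2d[:, 0], syn_2d[:, 1], s=5, alpha=0.4)
    ax2.set_title("Synthetic")
    for ax in (ax1, ax2):
        ax.set_xlabel("PC 1")
    ax1.set_ylabel("PC 2")
    plt.show()
```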

Machine learning efficacy
ML efficacy is measured by comparing two models: one trained on the source data and one trained on the synthetic data. Both models are scored on the same holdout set of real data from the source. This is meant to replicate the pattern of developing models on synthetic data and then using them in production on real data.
In all cases you will see a table of model scores appropriate to the target type: continuous targets are scored with metrics like mean squared error (MSE), while categorical targets are scored with metrics like AUC. You will also see a feature importance plot comparing the two models' reasoning. Depending on the specific attributes of your target, you may also see ROC curves or cumulative gain plots.
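A minimal sketch of this comparison, assuming a binary categorical target; the gradient-boosted model, the AUC metric, and the function name are illustrative choices, not a description of Subsalt's pipeline:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def ml_efficacy(real_train, syn_train, holdout, features, target):
    """Train one model on real data and one on synthetic data, then score
    both on the same real holdout set."""
    scores = {}
    for name, train in [("real", real_train), ("synthetic", syn_train)]:
        model = GradientBoostingClassifier().fit(train[features], train[target])
        # Score both models on identical real holdout data, mirroring
        # a develop-on-synthetic, deploy-on-real workflow.
        preds = model.predict_proba(holdout[features])[:, 1]
        scores[name] = roc_auc_score(holdout[target], preds)
    return scores  # e.g. {"real": ..., "synthetic": ...}
```

A synthetic-data score close to the real-data score suggests the synthetic data supports the same modeling decisions as the source.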
