# Data quality

Every query you issue to Subsalt generates a quality report that compares the synthetic data you received to the data in the source system. You can access this report in the **Queries** section of the Subsalt portal.

### Interpreting the report

In most cases the quality report has four sections:

* Univariate similarity
* Bivariate similarity
* Multivariate similarity
* Machine learning efficacy

#### Univariate similarity

Univariate similarity measures how closely each field in your synthetic data matches the same field in the source, ignoring any dependencies between columns. It is computed with statistical tests such as the Kolmogorov-Smirnov and chi-squared tests. Each field receives a score from 0 to 1, where 1 means the synthetic field's distribution is statistically indistinguishable from the source's, and 0 means the two are not at all similar. The report also includes side-by-side histograms or bar charts to help you interpret these scores.
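As a rough illustration of how a per-field score like this can behave for a continuous field, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) and reports one minus that value. The `1 - statistic` mapping and the function names `ks_statistic` and `univariate_score` are assumptions made for illustration; the exact formula Subsalt uses is not stated on this page.

```python
from bisect import bisect_right


def ks_statistic(source, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(source), sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def univariate_score(source, synthetic):
    """Hypothetical 0-1 score (assumption): 1 minus the KS statistic."""
    return 1.0 - ks_statistic(source, synthetic)


# Made-up example values for a single continuous field.
source_ages = [23, 31, 35, 42, 58]
synthetic_ages = [25, 30, 36, 44, 55]
score = univariate_score(source_ages, synthetic_ages)
```

Under this assumed mapping, identical samples score 1 and samples with no overlap in range score 0, which matches the interpretation given above.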

<figure><img src="/files/vsy0gq6gHGFTgsgypXQf" alt=""><figcaption></figcaption></figure>

#### Bivariate similarity

Bivariate similarity measures how well the relationship between each pair of fields in the source data is preserved in your synthetic data.

* For continuous fields, the score is based on correlations: a score of 1 means the correlation between the two fields is the same in the source and synthetic data, and 0 means the correlations are completely inverse.
* For categorical fields, the score is calculated by building a contingency table for the source data and another for the synthetic data, then computing the difference between corresponding cells. A score of 1 means there is no difference between the contingency tables, while 0 means they are extremely dissimilar.
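The two pair-level scores described above can be sketched as follows. The formulas are assumptions chosen to match the stated endpoints, not Subsalt's documented method: the continuous score is taken as `1 - |r_source - r_synthetic| / 2` using Pearson correlation, and the categorical score as one minus half the summed absolute difference between normalized contingency cells. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_score(source_pairs, synthetic_pairs):
    """Assumed score: 1 for equal correlations, 0 for +1 versus -1."""
    r_source = pearson(*zip(*source_pairs))
    r_synthetic = pearson(*zip(*synthetic_pairs))
    return 1.0 - abs(r_source - r_synthetic) / 2.0


def contingency_score(source_pairs, synthetic_pairs):
    """Assumed score: 1 minus half the summed cell-wise frequency gap."""
    source_cells = Counter(source_pairs)
    synthetic_cells = Counter(synthetic_pairs)
    n_source, n_synthetic = len(source_pairs), len(synthetic_pairs)
    # Counter returns 0 for cells that appear in only one table.
    gap = sum(
        abs(source_cells[cell] / n_source - synthetic_cells[cell] / n_synthetic)
        for cell in set(source_cells) | set(synthetic_cells)
    )
    return 1.0 - gap / 2.0
```

Each input is a list of `(field_a, field_b)` tuples, one per row, so the same row layout works for both the continuous and the categorical case.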

<figure><img src="/files/4Tmh5n5OHShO8rwJM1lB" alt=""><figcaption></figcaption></figure>

#### Multivariate similarity

Multivariate similarity measures how well deeper multivariate relationships are maintained in your synthetic data. Currently, multivariate similarity only considers continuous fields because it is based on principal component analysis (PCA).

The source data is reduced with PCA and plotted on a 2D plane defined by its first two principal components. The synthetic data is then projected onto that same plane using the same PCA. If the deeper multivariate relationships are similar, the two plots will have similar shapes.
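The key detail is that the PCA is fitted once, on the source data, and then reused unchanged for the synthetic data. A minimal sketch of that idea with NumPy is shown below; the random arrays are made-up stand-ins for real tables, the function names are hypothetical, and any preprocessing Subsalt applies (such as scaling) is not covered here.

```python
import numpy as np


def fit_pca_2d(source):
    """Fit a two-component PCA on the source data only."""
    mean = source.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(source - mean, full_matrices=False)
    return mean, vt[:2]


def project(data, mean, components):
    """Project rows of `data` onto the plane fitted from the source."""
    return (data - mean) @ components.T


rng = np.random.default_rng(0)
source = rng.normal(size=(200, 4))     # stand-in for continuous source fields
synthetic = rng.normal(size=(200, 4))  # stand-in for continuous synthetic fields

mean, components = fit_pca_2d(source)
source_2d = project(source, mean, components)
synthetic_2d = project(synthetic, mean, components)  # same mean, same axes
```

Plotting `source_2d` and `synthetic_2d` as two scatter plots on shared axes gives the kind of side-by-side picture the report describes.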

<figure><img src="/files/yfBSbgNHMMdgVQXMTbkP" alt=""><figcaption></figcaption></figure>

#### Machine learning efficacy

ML efficacy is measured with an A/B comparison of model accuracy: one model is trained on the source data and another on the synthetic data. Both models are scored on the same holdout set of real data from the source. This replicates a workflow in which you develop models on synthetic data and then use them in production on real data.
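The evaluation pattern can be sketched in a few lines. In the example below a deliberately tiny nearest-centroid classifier stands in for whatever models Subsalt actually trains, and all data values and function names are made up for illustration. The point is only the protocol: two training sets, one shared real holdout set.

```python
from math import dist


def fit_centroids(rows, labels):
    """Toy nearest-centroid classifier: one mean vector per class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    return min(centroids, key=lambda label: dist(centroids[label], row))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


# Made-up training data: one real set, one synthetic set.
real_X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
real_y = [0, 0, 1, 1]
synthetic_X = [(0.9, 1.1), (1.1, 1.0), (4.1, 3.9), (3.9, 4.1)]
synthetic_y = [0, 0, 1, 1]

# Real holdout rows that neither model saw during training.
holdout_X = [(1.1, 0.9), (3.9, 4.1), (0.8, 1.2), (4.2, 3.8)]
holdout_y = [0, 1, 0, 1]

real_accuracy = accuracy(fit_centroids(real_X, real_y), holdout_X, holdout_y)
synthetic_accuracy = accuracy(fit_centroids(synthetic_X, synthetic_y), holdout_X, holdout_y)
```

A small gap between `real_accuracy` and `synthetic_accuracy` suggests that a model developed on the synthetic data would carry over to real data.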

In all cases you will see a table of model scores whose metrics depend on the target type: continuous targets get scores such as MSE, while categorical targets get scores such as AUC. You will also always see a feature importance plot comparing the two models' reasoning. Depending on specific attributes of your target, you may also see ROC curves or cumulative gain plots.

<figure><img src="/files/5WBY0m0FBbrZweWEKQ6d" alt=""><figcaption></figcaption></figure>

