How to format your dataset?

This article will help you understand what data you need to provide to run an analysis and how to format these data. It also includes some nomenclature you might encounter when using Snitch.

Quality Analysis requirements

To perform a Quality Analysis, you will need:

  • Model: Your machine learning model (see the What's currently supported? article for supported frameworks and architectures).
  • Training Dataset: Input features (X matrix) used for training your model.
  • Training Target Dataset: Corresponding targets (y matrix) for the Training Dataset.
  • Testing Dataset: Input features (X matrix) used for testing your model.
  • Testing Target Dataset: Corresponding targets (y matrix) for the Testing Dataset.

Drift Analysis requirements

To perform a Drift Analysis, here's what you will need to have on hand:

  • Training Dataset: Input features (X matrix) used for training your model.
  • Updated Dataset: Input features (X matrix) that you gathered after the training phase. It is also known as the in-production dataset.

NOTE: All datasets must be ready to be fed to the model without any additional transformations, and all targets must have exactly the same shape as your model's output.
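
The requirement in the note above can be checked locally before uploading anything. The following is a minimal sketch, not part of Snitch: `check_datasets` is a hypothetical helper, and `model_output_shape` is assumed to be the per-sample output shape of your model, which you need to know in advance.

```python
import numpy as np


def check_datasets(train_x, train_y, test_x, test_y, model_output_shape):
    """Return a list of problems found; an empty list means every check passed.

    model_output_shape is the per-sample output shape of the model (assumed known),
    e.g. () for one value per row or (3,) for three class probabilities.
    """
    expected = tuple(model_output_shape)
    problems = []
    for name, x, y in (("training", train_x, train_y), ("testing", test_x, test_y)):
        x, y = np.asarray(x), np.asarray(y)
        # Each row of features needs exactly one target.
        if x.shape[0] != y.shape[0]:
            problems.append(f"{name}: {x.shape[0]} feature rows but {y.shape[0]} targets")
        # Targets must match the model's per-sample output shape.
        if y.shape[1:] != expected:
            problems.append(f"{name}: target shape {y.shape[1:]} != model output {expected}")
    # Training and testing features must have the same per-sample layout.
    if np.asarray(train_x).shape[1:] != np.asarray(test_x).shape[1:]:
        problems.append("training and testing features have different shapes")
    return problems
```

For a Drift Analysis, the same per-sample comparison could be applied to the Training Dataset and the Updated Dataset.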

Supported data formats

When using the cloud version, the datasets and targets listed above must be provided in one of the following formats:

  • .csv with or without header, no row index
  • .npy or .npz
  • Serialized pandas DataFrames in .joblib format
  • Serialized NumPy arrays in .joblib format
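
As an illustration, some of these formats can be produced as sketched below. The data, column names, and file names are made up for the example, and the key used inside the .npz archive is arbitrary, since this article does not state what name Snitch expects.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data: 4 samples, 2 features, 1 target per sample.
train_x = np.array([[34.0, 1.0], [51.0, 0.0], [27.0, 1.0], [45.0, 0.0]])
train_y = np.array([0, 1, 0, 1])

out_dir = tempfile.mkdtemp()

# .csv with a header and no row index
csv_path = os.path.join(out_dir, "train_x.csv")
pd.DataFrame(train_x, columns=["age", "is_member"]).to_csv(csv_path, index=False)

# .npy stores a single array
npy_path = os.path.join(out_dir, "train_y.npy")
np.save(npy_path, train_y)

# .npz stores one or more named arrays (the key "train_x" is an arbitrary choice)
npz_path = os.path.join(out_dir, "train_x.npz")
np.savez(npz_path, train_x=train_x)
```

The .joblib variants would presumably be written in a similar way with joblib.dump from the joblib library, which is not shown here.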

NOTE: When launching an analysis in Hybrid or On-Prem mode, pandas.DataFrame and numpy.ndarray objects are also accepted. As with files, the data must be ready for model ingestion.

Can data contain column headers?

Yes. Datasets can contain column headers, and they are encouraged because they help Snitch present the analysis results to you. If no header is provided, Snitch assigns generic names to the features ("column_1", "column_2", etc.).

Can data contain row headers?

No. If you use pandas to generate a .csv file, make sure to explicitly specify that you wish to exclude the row index like so: train_x_df.to_csv("train_x.csv", index=False).
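
To see the difference, the short sketch below writes the same hypothetical DataFrame twice, to in-memory buffers rather than files, and compares the resulting header lines.

```python
import io

import pandas as pd

# Hypothetical features, for illustration only.
train_x_df = pd.DataFrame({"age": [34, 51], "income": [42000.0, 58500.0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
train_x_df.to_csv(with_index)

# index=False writes only the feature columns.
without_index = io.StringIO()
train_x_df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # ,age,income
print(header_without_index)  # age,income
```

The leading comma in the first header is the unnamed index column, which is what needs to be excluded.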

Can data contain boolean values?

No. Any boolean feature should be represented with zeros and ones.
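
With pandas, one possible way to do this is to cast every boolean column to an integer type. The DataFrame and column names below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 27], "is_member": [True, False, True]})

# Find the boolean columns and cast each one to integers (True -> 1, False -> 0).
bool_columns = df.select_dtypes(include="bool").columns
df = df.astype({column: "int64" for column in bool_columns})

print(df["is_member"].tolist())  # [1, 0, 1]
```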

Can data contain DateTime values?

No.
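
This article does not prescribe how to handle such columns. One common approach, sketched here as an assumption rather than a Snitch recommendation, is to derive numeric features from each datetime column and then drop the original. Any replacement features would also need to match what your model was trained on. The column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2021-01-15", "2021-06-30"]),
    "amount": [12.5, 40.0],
})

# Derive plain numeric features from the datetime column.
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek  # Monday is 0

# Drop the original datetime column so only numeric values remain.
df = df.drop(columns=["signup"])
```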
