Overview of the Predictive Modelling Metadata Interchange Format

The Predictive Modelling Metadata Interchange Format (PMMIF) is designed to facilitate the exchange of data for use in predictive modelling.

There are two initial use cases it seeks to support:

  • datasets for conventional predictive modelling of integer, continuous and categorical outcomes
  • datasets for uplift modelling of integer and continuous outcomes

PMMIF has the following goals:

  • Capability to describe datasets fully and accurately
  • Machine-readable metadata
  • Initial expression in JSON, with future ability to support other representations such as XML, YAML and HTML
  • Reasonably readable in raw form by humans
  • Enough extra metadata to detect faulty reading of the data
  • Ability to write programs that verify both the well-formedness of the PMMIF metadata description and the conformance of the data to the specification contained in that metadata
  • Extensibility
  • Simple minimum requirements
  • Inclusion of key structural information, e.g. the role that various fields play in modelling
  • Ability to describe all (or nearly all) of the datasets in the Machine Learning Repository at the University of California, Irvine, and also some example datasets for use with uplift modelling

At present, each .pmm file describes a single dataset, stored as a flat file, consisting of a number of records, each with a fixed set of named fields. Later enhancements will allow multiple flat files to be documented together.

Very Simple Example

Here is a very simple example of a PMMIF (.pmm) file, hillstrom3.pmm:

{
    "pmmversion": "0.1",
    "name": "hillstrom",
    "recordcount": 64000,
    "fieldcount": 3,
    "fields": [
        {
            "name": "recency",
            "type": "integer",
            "role": "independent",
            "tags": [],
            "stats": {
                "nnulls": 0,
                "nuniques": 12,
                "min": 1,
                "max": 12,
                "mean": 5.763734375
            }
        },
        {
            "name": "history_segment",
            "type": "string",
            "role": "independent",
            "tags": [
                "ordinal"
            ],
            "stats": {
                "nnulls": 0,
                "nuniques": 7
            },
            "values": [
                "1) $0 - $100",
                "2) $100 - $200",
                "3) $200 - $350",
                "4) $350 - $500",
                "5) $500 - $750",
                "6) $750 - $1,000",
                "7) $1,000 +"
            ]
        },
        {
            "name": "conversion",
            "type": "boolean",
            "role": "dependent",
            "tags": [],
            "stats": {
                "nnulls": 0,
                "nuniques": 2,
                "min": 0,
                "max": 1,
                "mean": 0.00903125
            }
        }
    ],
    "data": {
        "flatfile": {
            "name": "hillstrom3.csv",
            "format": {
                "separator": ",",
                "quote": "\"",
                "escape": "\\",
                "nullmarker": "",
                "headerrowcount": 1
            }
        }
    }
}

Note that not all of the data shown here is required for every PMMIF file.
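
As a minimal sketch of how the structural information in such a file might be used, the following Python groups field names by their role. The function name fields_by_role and the abbreviated metadata string are illustrative assumptions, and the code assumes that every field carries both a "name" and a "role" key, which this overview does not guarantee.

```python
import json

# Abbreviated metadata in the style of hillstrom3.pmm (stats and data omitted).
PMM_TEXT = """
{
    "pmmversion": "0.1",
    "name": "hillstrom",
    "fields": [
        {"name": "recency", "type": "integer", "role": "independent"},
        {"name": "history_segment", "type": "string", "role": "independent"},
        {"name": "conversion", "type": "boolean", "role": "dependent"}
    ]
}
"""


def fields_by_role(meta):
    """Group field names by role, preserving the order in which fields appear."""
    roles = {}
    for field in meta["fields"]:
        # Assumption: each field dictionary has "name" and "role" keys.
        roles.setdefault(field["role"], []).append(field["name"])
    return roles


roles = fields_by_role(json.loads(PMM_TEXT))
print(roles["independent"])  # ['recency', 'history_segment']
print(roles["dependent"])    # ['conversion']
```

A modelling tool could use such a grouping to pick out the outcome field and the candidate predictors without any dataset-specific configuration.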

Validity and Well-Formedness

PMMIF borrows a useful pair of notions from XML, namely those of well-formedness and validity.

A PMMIF (.pmm) file is well-formed if it conforms syntactically to this definition, i.e. if it is valid JSON, is encoded in UTF-8, contains all the required elements, and has the correct type and structure for every element. To be well-formed, a PMMIF file should also be internally consistent; for example, if a number of fields is specified, it should be the same as the number of fields actually described in the file. More generally, a .pmm file is well-formed if there are no errors that can be detected without looking at the datafiles it describes.
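
The checks named above can be sketched in a few lines of Python. This is only a partial illustration, not a full well-formedness checker: it covers the UTF-8, JSON and field-count checks, and does not attempt to check required elements or types, which this overview does not enumerate. The function name check_basic_well_formedness is hypothetical, and treating "fields" as a required list is an assumption.

```python
import json


def check_basic_well_formedness(raw):
    """Return a list of problems found in raw .pmm bytes; empty means none found."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not UTF-8: {exc}"]
    try:
        meta = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(meta, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    fields = meta.get("fields")
    # Assumption: "fields" is a required list; the overview does not say so explicitly.
    if not isinstance(fields, list):
        problems.append('"fields" is missing or is not a list')
    elif "fieldcount" in meta and meta["fieldcount"] != len(fields):
        problems.append(
            f'"fieldcount" is {meta["fieldcount"]} but {len(fields)} fields are described'
        )
    return problems
```

Nothing here touches the data file, which is what separates this kind of check from a validity check.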

By contrast, a PMMIF file is valid only if the datafiles it refers to exist and are consistent with the metadata in the .pmm file.
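
A validity check therefore has to open the data. The sketch below, in Python, confirms that the flat file exists, that the number of records matches "recordcount", and that every record has as many values as there are described fields. It makes several assumptions that are not stated in this overview: the data file name is resolved relative to the .pmm file, the data file is UTF-8, all five "format" keys and "recordcount" are present, and quotes inside values are escaped rather than doubled. The function name check_against_data is hypothetical.

```python
import csv
import json
from pathlib import Path


def check_against_data(pmm_path):
    """Return a list of mismatches between a .pmm file and its flat file."""
    pmm_path = Path(pmm_path)
    meta = json.loads(pmm_path.read_text(encoding="utf-8"))
    flatfile = meta["data"]["flatfile"]
    fmt = flatfile["format"]

    # Assumption: the data file name is relative to the .pmm file's directory.
    data_path = pmm_path.parent / flatfile["name"]
    if not data_path.is_file():
        return [f"data file not found: {data_path}"]

    with data_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(
            handle,
            delimiter=fmt["separator"],
            quotechar=fmt["quote"],
            escapechar=fmt["escape"] or None,
            doublequote=False,  # assumption: quotes are escaped, never doubled
        )
        rows = list(reader)
    records = rows[fmt["headerrowcount"]:]  # skip the header rows

    problems = []
    if len(records) != meta["recordcount"]:
        problems.append(
            f"found {len(records)} records, metadata says {meta['recordcount']}"
        )
    expected = len(meta["fields"])
    for number, row in enumerate(records, start=1):
        if len(row) != expected:
            problems.append(
                f"record {number} has {len(row)} values, expected {expected}"
            )
    return problems
```

A fuller checker would go on to compare each field's values with its declared type and with the "stats" entries such as "nnulls", "min" and "max".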

Extensibility

Extra keys may be added to any of the dictionaries used to define PMMIF, provided those keys begin with a capital letter. This allows users to embed arbitrary extra information in a .pmm file without significantly compromising the ability of PMMIF checkers to detect errors such as mis-typed keys, because an unrecognised key that does not begin with a capital letter can still be reported as an error.
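
One way a checker might apply this rule is sketched below in Python. The function name classify_keys, the "Source" extension key and the mis-typed "tpye" key are invented for illustration, and the set of known field keys is taken only from the example earlier in this overview, not from the full definition.

```python
def classify_keys(dictionary, known_keys):
    """Split the keys of one PMMIF dictionary into known, extension and suspect."""
    result = {"known": [], "extension": [], "suspect": []}
    for key in dictionary:
        if key in known_keys:
            result["known"].append(key)
        elif key[:1].isupper():
            # Keys beginning with a capital letter are user extensions.
            result["extension"].append(key)
        else:
            # Anything else is probably a mis-typed standard key.
            result["suspect"].append(key)
    return result


# Assumption: key set copied from the example fields, not from the full definition.
FIELD_KEYS = {"name", "type", "role", "tags", "stats", "values"}

field = {"name": "recency", "tpye": "integer", "Source": "customer database"}
print(classify_keys(field, FIELD_KEYS))
# {'known': ['name'], 'extension': ['Source'], 'suspect': ['tpye']}
```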

Generation