Logo

Unified Schema Design

Unified Schema Design

  • The goal is to create a single, unified schema across datasets. This schema aims to strike a balance between flexibility to accommodate arbitrarily shaped data along with consistency in core tables.
  • Datasets are built around two concepts: entities and timeseries.
    • Entities are concrete things or objects (a geography, a company, a mortgage application).
    • Timeseries are abstract measures (ie. statistics) related to an entity and a date.
  • The core tables are:
    • entities: Contains the entities that are being tracked. For example, Spain, Madrid, etc.
      • Should be something like province_index or weather_station_index to be able to join with the timeseries.
      • This table contains permanent characteristics describing an entity. E.g: for Provinces, the name, the region.
      • Each row represents a distinct entity. The table is wide, in that immutable characteristics are expressed in their own fields.
    • attributes: Attributes are descriptors of a timeseries. An attribute is the equivalent of a characteristic except for the abstract timeseries rather than the concrete entity.
      • Columns:
        • variable_id: Unique identifier for the attribute
        • name: Name of the attribute
        • description: Description of the attribute
        • unit: Unit of the attribute
        • source: Source of the attribute
        • frequency: Frequency of the attribute (daily, monthly, etc.)
        • measurement_type: Type of measurement (e.g. nominal, ordinal, interval, ratio, percentage)
        • Metadata columns;
          • category: Category of the attribute
          • namespace: Namespace of the attribute
          • tags: JSON with tags of the attribute?
          • aggregation_function: Aggregation function to use when aggregating the attribute
    • timeseries: Timeseries are abstract measures (ie. statistics, metrics) related to an entity and a date. Timeseries are temporal statistics or measures centered around an entity and timestamp. For example, GDP of Spain, population of Madrid, etc. Timeseries are abstract concepts (ie. a measure) rather than a concrete thing.
      • Could be something like weather_timeseries to be able to join with the entities.
      • Columns:
        • variable_id: Unique identifier for the attribute
        • geography_id: Unique identifier for the geography
        • date: Date of the metric
        • value: Value of the metric
    • relationshipts: Contains the relationships between entities. For example, Spain is composed of provinces, Madrid is a province, etc.
      • Relationships can also be temporal – valid for an interval defined by specific start and end dates.
    • characteristics: Descriptors of an entity that are temporal. They have a start date and end date.