Unified Schema Design
Unified Schema Design
- The goal is to create a single, unified schema across datasets. This schema aims to strike a balance between flexibility to accommodate arbitrarily shaped data along with consistency in core tables.
- Datasets are built around two concepts: entities and timeseries.
- Entities are concrete things or objects (a geography, a company, a mortgage application).
- Timeseries are abstract measures (ie. statistics) related to an entity and a date.
- The core tables are:
entities
: Contains the entities that are being tracked. For example, Spain, Madrid, etc.- Should be something like
province_index
orweather_station_index
to be able to join with the timeseries. - This table contains permanent characteristics describing an entity. E.g: for Provinces, the name, the region.
- Each row represents a distinct entity. The table is wide, in that immutable characteristics are expressed in their own fields.
- Should be something like
attributes
: Attributes are descriptors of a timeseries. An attribute is the equivalent of a characteristic except for the abstract timeseries rather than the concrete entity.- Columns:
variable_id
: Unique identifier for the attributename
: Name of the attributedescription
: Description of the attributeunit
: Unit of the attributesource
: Source of the attributefrequency
: Frequency of the attribute (daily, monthly, etc.)measurement_type
: Type of measurement (e.g. nominal, ordinal, interval, ratio, percentage)- Metadata columns;
category
: Category of the attributenamespace
: Namespace of the attributetags
: JSON with tags of the attribute?aggregation_function
: Aggregation function to use when aggregating the attribute
- Columns:
timeseries
: Timeseries are abstract measures (ie. statistics, metrics) related to an entity and a date. Timeseries are temporal statistics or measures centered around an entity and timestamp. For example, GDP of Spain, population of Madrid, etc. Timeseries are abstract concepts (ie. a measure) rather than a concrete thing.- Could be something like
weather_timeseries
to be able to join with the entities. - Columns:
variable_id
: Unique identifier for the attributegeography_id
: Unique identifier for the geographydate
: Date of the metricvalue
: Value of the metric
- Could be something like
relationshipts
: Contains the relationships between entities. For example, Spain is composed of provinces, Madrid is a province, etc.- Relationships can also be temporal – valid for an interval defined by specific start and end dates.
characteristics
: Descriptors of an entity that are temporal. They have a start date and end date.