Data Quality Package

Inheritance

Submodules

Models

:copyright (c) 2014 - 2022, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Department of Energy) and contributors. All rights reserved.

exception seed.models.data_quality.ComparisonError

Bases: Exception

class seed.models.data_quality.DataQualityCheck(*args, **kwargs)

Bases: django.db.models.base.Model

Object that stores the high level configuration per organization of the DataQualityCheck

exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

REQUIRED_FIELDS = {'PropertyState': ['address_line_1', 'custom_id_1', 'pm_property_id'], 'TaxLotState': ['address_line_1', 'custom_id_1', 'jurisdiction_tax_lot_id']}
add_invalid_geometry_entry_provided(row_id, rule, display_name, value)
add_result_comparison_error(row_id, rule, display_name, value, rule_check)
add_result_dimension_error(row_id, rule, display_name, value)
add_result_is_null(row_id, rule, display_name, value)
add_result_max_error(row_id, rule, display_name, value, rule_max)
add_result_min_error(row_id, rule, display_name, value, rule_min)
add_result_missing_and_none(row_id, rule, display_name, value)
add_result_missing_req(row_id, rule, display_name, value)
add_result_string_error(row_id, rule, display_name, value)
add_result_type_error(row_id, rule, display_name, value)
add_rule(rule)

Add a new rule to the Data Quality Checks

Parameters

rule – dict to be added as a new rule

Returns

None

add_rule_if_new(rule)

Add a new rule to the Data Quality Checks only if rule does not exist

Parameters

rule – dict to be added as a new rule

Returns

None
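The difference between add_rule and add_rule_if_new can be illustrated with a minimal, framework-free sketch. The list below stands in for the rules attached to a DataQualityCheck, and the notion of a rule being "new" (no identical dict already stored) is an assumption made for illustration, not SEED's actual implementation.

```python
# Hypothetical stand-in for the rules attached to a DataQualityCheck.
# In SEED these are Rule model rows; a plain list keeps the sketch runnable.
rules = []


def add_rule(rule):
    """Always store a copy of the rule dict."""
    rules.append(dict(rule))


def add_rule_if_new(rule):
    """Store the rule only if an identical dict is not already present.

    Assumption: "new" means no stored rule has the same keys and values.
    """
    if rule not in rules:
        rules.append(dict(rule))


# A rule dict shaped like the entries of Rule.DEFAULT_RULES.
energy_score_rule = {
    'table_name': 'PropertyState',
    'field': 'energy_score',
    'data_type': 0,   # Rule.TYPE_NUMBER
    'rule_type': 1,   # Rule.RULE_TYPE_CUSTOM
    'min': 0,
    'max': 100,
    'severity': 0,    # Rule.SEVERITY_ERROR
    'condition': 'range',
}

add_rule_if_new(energy_score_rule)
add_rule_if_new(energy_score_rule)  # second call leaves the list unchanged
```

With this sketch, calling add_rule a further time would store a duplicate, while add_rule_if_new would not.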

static cache_key(identifier, organization_id)

Static method to return the location of the data_quality results from redis.

Parameters

identifier – Import file primary key

Returns

check_data(record_type, rows)

Send in data as a queryset from the Property/Taxlot ids.

Parameters
  • record_type – one of PropertyState | TaxLotState

  • rows – rows of data to be checked for data quality

Returns

None

get_fieldnames(record_type)

Get fieldnames to apply to results.

id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

static initialize_cache(identifier, organization_id)

Initialize the cache for storing the results. This is called before the celery tasks are chunked up.

The cache_key is different from the identifier. The cache_key is where all the results of the data quality checks are stored; the identifier is the random number (or specified value) that is used to identify both the progress and the data storage.

Parameters

identifier – Identifier for cache, if None, then creates a random one

Returns

list, [cache_key and the identifier]

initialize_rules()

Initialize the default rules for a DataQualityCheck object

Returns

None

name

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

objects = <django.db.models.manager.Manager object>
organization

Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Child.parent is a ForwardManyToOneDescriptor instance.

organization_id
remove_all_rules()

Removes all the rules associated with this DataQualityCheck instance.

Returns

None

remove_status_label(label_class, rule, linked_id)

Remove label because it did not match any of the range exceptions

Parameters
  • label_class – statuslabel object, either property label or taxlot label

  • rule – rule object

  • linked_id – id of propertystate or taxlotstate object

Returns

boolean, whether the label was applied

reset_all_rules()

Delete all rules and reinitialize the default set of rules

Returns

None

reset_default_rules()

Reset only the default rules

Returns

reset_results()
classmethod retrieve(organization_id)

DataQualityCheck was previously a simple object but has been migrated to a django model. This method ensures that the data quality model will be backwards compatible.

This is the preferred method to initialize a new object.

Parameters

organization_id – ID of the Organization

Returns

obj, DataQualityCheck

retrieve_result_by_address(address)

Retrieve the results of the data quality checks for a specific address.

Parameters

address – string, address to find the result for

Returns

dict, results of data quality check for specific building

retrieve_result_by_tax_lot_id(tax_lot_id)

Retrieve the results of the data quality checks by the jurisdiction ID.

Parameters

tax_lot_id – string, jurisdiction tax lot id

Returns

dict, results of data quality check for specific building

rules

Accessor to the related objects manager on the reverse side of a many-to-one relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Parent.children is a ReverseManyToOneDescriptor instance.

Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.

save_to_cache(identifier, organization_id)

Save the results to the cache database. The data in the cache are stored as a list of dictionaries, whereas the data in this class are stored as a dict of dicts. This is important to remember because the data from the cache cannot simply be loaded back into the class's structure.

Parameters

identifier – Import file primary key

Returns

None
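The difference between the two shapes can be sketched as below. The row IDs, the field names, and the choice of an 'id' key to carry the row ID in the cached list are assumptions for illustration; the description above only states that the class holds a dict of dicts and the cache holds a list of dicts.

```python
# Assumed in-memory shape: results keyed by row ID (a dict of dicts).
results = {
    101: {'address_line_1': '123 Main St', 'data_quality_results': []},
    102: {'address_line_1': '456 Oak Ave', 'data_quality_results': []},
}


def to_cache_rows(results_by_id):
    """Flatten {row_id: row} into [{'id': row_id, ...row}, ...]."""
    return [{'id': row_id, **row} for row_id, row in results_by_id.items()]


def from_cache_rows(rows):
    """Rebuild the dict-of-dicts shape from the cached list of dicts."""
    return {
        row['id']: {k: v for k, v in row.items() if k != 'id'}
        for row in rows
    }


cached = to_cache_rows(results)
restored = from_cache_rows(cached)
```

The explicit conversion in both directions is the point: a list read from the cache has to be re-keyed before it matches the structure the class works with.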

update_status_label(label_class, rule, linked_id, row_id, add_to_results=True)
Parameters
  • label_class – statuslabel object, either propertyview label or taxlotview label

  • rule – rule object

  • linked_id – id of propertyview or taxlotview object

  • row_id

  • add_to_results – bool

Returns

boolean, whether the label was applied

exception seed.models.data_quality.DataQualityTypeCastError

Bases: Exception

class seed.models.data_quality.Rule(*args, **kwargs)

Bases: django.db.models.base.Model

Rules for DataQualityCheck

DATA_TYPES = [(0, 'number'), (1, 'string'), (2, 'date'), (3, 'year'), (4, 'area'), (5, 'eui')]
DEFAULT_RULES = [{'table_name': 'PropertyState', 'field': 'address_line_1', 'data_type': 1, 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'}, {'table_name': 'PropertyState', 'field': 'pm_property_id', 'data_type': 1, 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'}, {'table_name': 'PropertyState', 'field': 'custom_id_1', 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'}, {'table_name': 'TaxLotState', 'field': 'jurisdiction_tax_lot_id', 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'}, {'table_name': 'TaxLotState', 'field': 'address_line_1', 'data_type': 1, 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'}, {'table_name': 'PropertyState', 'field': 'conditioned_floor_area', 'data_type': 4, 'rule_type': 0, 'min': 0, 'max': 7000000, 'severity': 0, 'units': 'ft**2', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'conditioned_floor_area', 'data_type': 4, 'rule_type': 0, 'min': 100, 'severity': 1, 'units': 'ft**2', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'energy_score', 'data_type': 0, 'rule_type': 0, 'min': 0, 'max': 100, 'severity': 0, 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'energy_score', 'data_type': 0, 'rule_type': 0, 'min': 10, 'severity': 1, 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'generation_date', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'gross_floor_area', 'data_type': 0, 'rule_type': 0, 'min': 100, 'max': 7000000, 'severity': 0, 'units': 'ft**2', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'occupied_floor_area', 'data_type': 0, 'rule_type': 0, 'min': 100, 'max': 7000000, 'severity': 0, 'units': 'ft**2', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'recent_sale_date', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'release_date', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'site_eui', 'data_type': 5, 'rule_type': 0, 'min': 0, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'site_eui', 'data_type': 5, 'rule_type': 0, 'min': 10, 'severity': 1, 'units': 'kBtu/ft**2/year', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'site_eui_weather_normalized', 'data_type': 5, 'rule_type': 0, 'min': 0, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'source_eui', 'data_type': 5, 'rule_type': 0, 'min': 0, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'source_eui', 'data_type': 5, 'rule_type': 0, 'min': 10, 'severity': 1, 'units': 'kBtu/ft**2/year', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'source_eui_weather_normalized', 'data_type': 5, 'rule_type': 0, 'min': 10, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'year_built', 'data_type': 3, 'rule_type': 0, 'min': 1700, 'max': 2019, 'severity': 0, 'condition': 'range'}, {'table_name': 'PropertyState', 'field': 'year_ending', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'}]
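Several fields in DEFAULT_RULES carry two range rules: one with severity 0 (error) and wide bounds, and one with severity 1 (warning) and a tighter minimum. The sketch below, which uses a three-entry excerpt copied from the listing above and a hypothetical helper named rules_for, shows one way to pull out the rules that apply to a single field.

```python
# Excerpt of Rule.DEFAULT_RULES, copied from the listing above.
DEFAULT_RULES_EXCERPT = [
    {'table_name': 'PropertyState', 'field': 'conditioned_floor_area',
     'data_type': 4, 'rule_type': 0, 'min': 0, 'max': 7000000,
     'severity': 0, 'units': 'ft**2', 'condition': 'range'},
    {'table_name': 'PropertyState', 'field': 'conditioned_floor_area',
     'data_type': 4, 'rule_type': 0, 'min': 100,
     'severity': 1, 'units': 'ft**2', 'condition': 'range'},
    {'table_name': 'PropertyState', 'field': 'year_built',
     'data_type': 3, 'rule_type': 0, 'min': 1700, 'max': 2019,
     'severity': 0, 'condition': 'range'},
]


def rules_for(table_name, field, rules=DEFAULT_RULES_EXCERPT):
    """Hypothetical helper: select the rule dicts for one table and field."""
    return [
        r for r in rules
        if r['table_name'] == table_name and r['field'] == field
    ]


area_rules = rules_for('PropertyState', 'conditioned_floor_area')

# (severity, min, max) per rule; a missing bound shows up as None.
bounds = [(r['severity'], r.get('min'), r.get('max')) for r in area_rules]
```

Here bounds holds one error-level entry spanning 0 to 7000000 and one warning-level entry with only a minimum of 100.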
exception DoesNotExist

Bases: django.core.exceptions.ObjectDoesNotExist

exception MultipleObjectsReturned

Bases: django.core.exceptions.MultipleObjectsReturned

RULE_EXCLUDE = 'exclude'
RULE_INCLUDE = 'include'
RULE_NOT_NULL = 'not_null'
RULE_RANGE = 'range'
RULE_REQUIRED = 'required'
RULE_TYPE = [(0, 'default'), (1, 'custom')]
RULE_TYPE_CUSTOM = 1
RULE_TYPE_DEFAULT = 0
SEVERITY = [(0, 'error'), (1, 'warning'), (2, 'valid')]
SEVERITY_ERROR = 0
SEVERITY_VALID = 2
SEVERITY_WARNING = 1
TYPE_AREA = 4
TYPE_DATE = 2
TYPE_EUI = 5
TYPE_NUMBER = 0
TYPE_STRING = 1
TYPE_YEAR = 3
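The integer codes stored in a rule map to labels through the DATA_TYPES, RULE_TYPE, and SEVERITY choice lists. On a model instance, Django's get_data_type_display(), get_rule_type_display(), and get_severity_display() methods perform this lookup; the sketch below does the same for a plain rule dict, using a hypothetical helper named describe.

```python
# Choice lists copied from the Rule class attributes.
DATA_TYPES = [(0, 'number'), (1, 'string'), (2, 'date'),
              (3, 'year'), (4, 'area'), (5, 'eui')]
RULE_TYPE = [(0, 'default'), (1, 'custom')]
SEVERITY = [(0, 'error'), (1, 'warning'), (2, 'valid')]


def describe(rule):
    """Return human-readable labels for the integer codes in a rule dict.

    Codes that are absent from the rule map to None.
    """
    return {
        'data_type': dict(DATA_TYPES).get(rule.get('data_type')),
        'rule_type': dict(RULE_TYPE).get(rule.get('rule_type')),
        'severity': dict(SEVERITY).get(rule.get('severity')),
    }


labels = describe(
    {'field': 'site_eui', 'data_type': 5, 'rule_type': 0, 'severity': 1}
)
```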
condition

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

data_quality_check

Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Child.parent is a ForwardManyToOneDescriptor instance.

data_quality_check_id
data_type

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

description

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

enabled

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

field

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

for_derived_column

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

format_strings(value)
get_data_type_display(*, field=<django.db.models.fields.IntegerField: data_type>)
get_rule_type_display(*, field=<django.db.models.fields.IntegerField: rule_type>)
get_severity_display(*, field=<django.db.models.fields.IntegerField: severity>)
id

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

max

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

maximum_valid(value)

Validate that the value is not greater than the maximum specified by the rule.

Parameters

value – Value to validate rule against

Returns

bool, True if valid, False if the value is out of range

min

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

minimum_valid(value)

Validate that the value is not less than the minimum specified by the rule.

Parameters

value – Value to validate rule against

Returns

bool, True if valid, False if the value is out of range
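The two range checks can be sketched as standalone functions. The bounds are inclusive, matching "not greater than" and "not less than" in the descriptions of maximum_valid and minimum_valid; treating an unset bound (None) as always valid is an assumption, as are the explicit rule_min and rule_max arguments, which on the model would come from the rule's own min and max fields.

```python
def minimum_valid(value, rule_min):
    """True when no minimum is set or value is not less than it."""
    return rule_min is None or value >= rule_min


def maximum_valid(value, rule_max):
    """True when no maximum is set or value is not greater than it."""
    return rule_max is None or value <= rule_max


def in_range(value, rule_min=None, rule_max=None):
    """Hypothetical helper combining both checks."""
    return minimum_valid(value, rule_min) and maximum_valid(value, rule_max)


# energy_score default error rule: min 0, max 100.
score_ok = in_range(87, rule_min=0, rule_max=100)
score_too_high = in_range(101, rule_min=0, rule_max=100)
```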

name

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

not_null

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

objects = <django.db.models.manager.Manager object>
required

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

rule_type

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

severity

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

status_label

Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.

In the example:

class Child(Model):
    parent = ForeignKey(Parent, related_name='children')

Child.parent is a ForwardManyToOneDescriptor instance.

status_label_id
str_to_data_type(value)

If the check is coming from a field in the database then it will be typed correctly; however, for extra_data, the values are typically strings or unicode. Therefore, the values are typed before they are checked using the rule’s data type definition.

Parameters

value – variant, value to type

Returns

typed value
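A rough sketch of this kind of typing step is shown below. It is not SEED's implementation: the function takes the data type as an explicit argument, handles only the number, string, and year codes, and raises a local stand-in for DataQualityTypeCastError. How dates, areas, and EUI values are cast is left out because it is not described here.

```python
# Type codes copied from the Rule class attributes.
TYPE_NUMBER, TYPE_STRING, TYPE_YEAR = 0, 1, 3


class TypeCastError(Exception):
    """Local stand-in for seed.models.data_quality.DataQualityTypeCastError."""


def str_to_data_type(value, data_type):
    """Cast a string to the rule's data type; pass typed values through."""
    if not isinstance(value, str):
        # Already typed, e.g. the value came from a database column.
        return value
    try:
        if data_type == TYPE_NUMBER:
            return float(value)
        if data_type == TYPE_YEAR:
            return int(value)
    except ValueError as exc:
        raise TypeCastError(f"cannot cast {value!r}") from exc
    # Strings (and any type code this sketch does not handle) are unchanged.
    return value


typed_number = str_to_data_type('12.5', TYPE_NUMBER)
typed_year = str_to_data_type('1999', TYPE_YEAR)
```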

table_name

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

text_match

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

units

A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.

valid_text(value)

Validate the rule matches the specified text. Text is matched by regex.

Parameters

value – Value to validate rule against

Returns

bool, True if valid, False if the value does not match
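A regex-based text check might look like the sketch below. Two details are assumptions: the pattern is searched for anywhere in the value (re.search rather than a full match), and a rule with no text_match pattern is treated as valid. The explicit text_match argument stands in for the rule's text_match field.

```python
import re


def valid_text(value, text_match):
    """True when no pattern is set or the pattern is found in the value."""
    if not text_match:
        return True
    return re.search(text_match, str(value)) is not None


matches = valid_text('Office - Large', r'^Office')
does_not_match = valid_text('Retail Store', r'^Office')
```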

exception seed.models.data_quality.UnitMismatchError

Bases: Exception

seed.models.data_quality.format_pint_violation(rule, source_value)

Format a pint min, max violation for human readability.

Parameters
  • rule

  • source_value – Quantity, value to format into range

Returns

(formatted_value, formatted_min, formatted_max), a tuple of (String, String, String)

Tests

Views

class seed.views.data_quality.DataQualityViews(**kwargs)

Bases: rest_framework.viewsets.ViewSet, seed.utils.api.OrgMixin

Handles Data Quality API operations within Inventory backend. (1) Post, wait, get… (2) Respond with what changed

create(request)

This API endpoint creates a new cleansing operation process in the background, potentially on a subset of properties/taxlots, and returns a query key.

Parameters
  • organization_id – Organization ID (type: integer, required: true, paramType: query)

  • data_quality_ids – An object containing the IDs of the records to perform data quality checks on. It should contain two keys, property_state_ids and taxlot_state_ids, each of which is an array of the appropriate IDs (required: true, paramType: body)

Returns

status – string, success or error (required: true)
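The request inputs for this endpoint could be assembled as in the sketch below. The ID values are made up, and the sketch assumes that the JSON body is itself the object holding the two keys rather than nesting it under another name. The endpoint URL is not given in this reference, so the sketch stops at building the query parameters and the body.

```python
import json

# Hypothetical values, for illustration only.
organization_id = 1
property_state_ids = [11, 12, 13]
taxlot_state_ids = [21]

# organization_id is sent as a query parameter.
query_params = {'organization_id': organization_id}

# Assumed body shape: the two ID arrays at the top level of the JSON body.
body = {
    'property_state_ids': property_state_ids,
    'taxlot_state_ids': taxlot_state_ids,
}
payload = json.dumps(body)
```

An HTTP client would then send payload in a POST request with query_params attached, and read the status field of the response.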

csv(request, pk)

Download a CSV of the data quality checks by the pk, which is the cache_key.

parameter_strategy: replace

Parameters

pk – Import file ID or cache key (required: true, paramType: path)

data_quality_rules(request)

Returns the data_quality rules for an org.

Parameters

organization_id – Organization ID (type: integer, required: true, paramType: query)

Returns
  • status – string, success or error (required: true)

  • rules – object, an object containing ‘properties’ and ‘taxlots’ arrays of rules (required: true)

reset_all_data_quality_rules(request)

Resets an organization’s data_quality rules.

Parameters

organization_id – Organization ID (type: integer, required: true, paramType: query)

Returns
  • status – string, success or error (required: true)

  • in_range_checking – array[string], an array of in-range error rules (required: true)

  • missing_matching_field – array[string], an array of fields to verify existence (required: true)

  • missing_values – array[string], an array of fields to ignore missing values (required: true)

reset_default_data_quality_rules(request)

Resets an organization’s default data_quality rules.

Parameters

organization_id – Organization ID (type: integer, required: true, paramType: query)

Returns
  • status – string, success or error (required: true)

  • in_range_checking – array[string], an array of in-range error rules (required: true)

  • missing_matching_field – array[string], an array of fields to verify existence (required: true)

  • missing_values – array[string], an array of fields to ignore missing values (required: true)

results(request)

Return the results of the data quality check based on the ID that was given when the data quality task was created. Note that this ID is not related to an object in the database, since the results are stored in redis.

save_data_quality_rules(request, pk=None)

Saves an organization’s settings: name, query threshold, shared fields. The method passes in all the fields again, so it is okay to remove all the rules in the db and just recreate them (albeit inefficiently).

parameter_strategy: replace

Parameters
  • organization_id – Organization ID (type: integer, required: true, paramType: query)

  • body – JSON body containing organization rules information (paramType: body, pytype: RulesSerializer, required: true)

Returns
  • status – string, success or error (required: true)

  • message – string, error message, if any (required: true)

class seed.views.data_quality.RulesIntermediateSerializer(*args, **kwargs)

Bases: rest_framework.serializers.Serializer

class seed.views.data_quality.RulesSerializer(*args, **kwargs)

Bases: rest_framework.serializers.Serializer

class seed.views.data_quality.RulesSubSerializer(*args, **kwargs)

Bases: rest_framework.serializers.Serializer

class seed.views.data_quality.RulesSubSerializerB(*args, **kwargs)

Bases: rest_framework.serializers.Serializer