Data Quality Package¶
Inheritance¶
Submodules¶
Models¶
:copyright (c) 2014 - 2022, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Department of Energy) and contributors. All rights reserved. :author
- exception seed.models.data_quality.ComparisonError¶
Bases:
Exception
- class seed.models.data_quality.DataQualityCheck(*args, **kwargs)¶
Bases:
django.db.models.base.Model
Object that stores the high-level, per-organization configuration of the DataQualityCheck
- exception DoesNotExist¶
Bases:
django.core.exceptions.ObjectDoesNotExist
- exception MultipleObjectsReturned¶
Bases:
django.core.exceptions.MultipleObjectsReturned
- REQUIRED_FIELDS = {'PropertyState': ['address_line_1', 'custom_id_1', 'pm_property_id'], 'TaxLotState': ['address_line_1', 'custom_id_1', 'jurisdiction_tax_lot_id']}¶
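One plausible reading of REQUIRED_FIELDS is that each record type must carry a value for every listed field. The sketch below illustrates that reading with plain dicts; the helper name missing_required_fields and the dict-based row are assumptions made for illustration, not part of the SEED API.

```python
# Copied from the REQUIRED_FIELDS attribute documented above.
REQUIRED_FIELDS = {
    'PropertyState': ['address_line_1', 'custom_id_1', 'pm_property_id'],
    'TaxLotState': ['address_line_1', 'custom_id_1', 'jurisdiction_tax_lot_id'],
}


def missing_required_fields(record_type, row):
    """Return the required fields that are absent or None in ``row``.

    Hypothetical helper: ``row`` is a plain dict standing in for a
    PropertyState or TaxLotState record.
    """
    return [
        field
        for field in REQUIRED_FIELDS[record_type]
        if row.get(field) is None
    ]


example_row = {'address_line_1': '1 Cyclotron Rd', 'custom_id_1': None}
print(missing_required_fields('PropertyState', example_row))
```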
- add_invalid_geometry_entry_provided(row_id, rule, display_name, value)¶
- add_result_comparison_error(row_id, rule, display_name, value, rule_check)¶
- add_result_dimension_error(row_id, rule, display_name, value)¶
- add_result_is_null(row_id, rule, display_name, value)¶
- add_result_max_error(row_id, rule, display_name, value, rule_max)¶
- add_result_min_error(row_id, rule, display_name, value, rule_min)¶
- add_result_missing_and_none(row_id, rule, display_name, value)¶
- add_result_missing_req(row_id, rule, display_name, value)¶
- add_result_string_error(row_id, rule, display_name, value)¶
- add_result_type_error(row_id, rule, display_name, value)¶
- add_rule(rule)¶
Add a new rule to the Data Quality Checks
- Parameters
rule – dict to be added as a new rule
- Returns
None
- add_rule_if_new(rule)¶
Add a new rule to the Data Quality Checks only if rule does not exist
- Parameters
rule – dict to be added as a new rule
- Returns
None
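The difference between add_rule and add_rule_if_new can be sketched with a plain list of rule dicts. This is a minimal in-memory stand-in, not the Django-backed implementation: the function below takes the list explicitly and returns a bool for illustration, whereas the documented method returns None.

```python
def add_rule_if_new(existing_rules, rule):
    """Append ``rule`` unless an identical dict is already present.

    Hypothetical stand-in: returns True when the rule was added and
    False when it was a duplicate.
    """
    if rule in existing_rules:
        return False
    existing_rules.append(rule)
    return True


year_built_rule = {
    'table_name': 'PropertyState', 'field': 'year_built', 'data_type': 3,
    'rule_type': 0, 'min': 1700, 'max': 2019, 'severity': 0,
    'condition': 'range',
}
```

Calling the function twice with equal dicts leaves a single entry in the list, which is the behavior the method name suggests.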
- static cache_key(identifier, organization_id)¶
Static method to return the location of the data_quality results from redis.
- Parameters
identifier – Import file primary key
- Returns
- check_data(record_type, rows)¶
Send in data as a queryset from the Property/Taxlot ids.
- Parameters
record_type – one of PropertyState | TaxLotState
rows – rows of data to be checked for data quality
- Returns
None
- get_fieldnames(record_type)¶
Get fieldnames to apply to results.
- id¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- static initialize_cache(identifier, organization_id)¶
Initialize the cache for storing the results. This is called before the celery tasks are chunked up.
The cache_key is different from the identifier. The cache_key is where all the results of the data quality checks are stored; the identifier is the random number (or specified value) that is used to identify both the progress and the data storage.
- Parameters
identifier – Identifier for cache, if None, then creates a random one
- Returns
list, [cache_key and the identifier]
- initialize_rules()¶
Initialize the default rules for a DataQualityCheck object
- Returns
None
- name¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- objects = <django.db.models.manager.Manager object>¶
- organization¶
Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.
In the example:
class Child(Model):
    parent = ForeignKey(Parent, related_name='children')
Child.parent is a ForwardManyToOneDescriptor instance.
- organization_id¶
- remove_all_rules()¶
Removes all the rules associated with this DataQualityCheck instance.
- Returns
None
- remove_status_label(label_class, rule, linked_id)¶
Remove label because it did not match any of the range exceptions
- Parameters
label_class – statuslabel object, either property label or taxlot label
rule – rule object
linked_id – id of propertystate or taxlotstate object
- Returns
boolean, whether the label was applied
- reset_all_rules()¶
Delete all rules and reinitialize the default set of rules
- Returns
None
- reset_default_rules()¶
Reset only the default rules
- Returns
- reset_results()¶
- classmethod retrieve(organization_id)¶
DataQualityCheck was previously a simple object but has been migrated to a django model. This method ensures that the data quality model will be backwards compatible.
This is the preferred method to initialize a new object.
- Parameters
organization_id – ID of the Organization
- Returns
obj, DataQualityCheck
- retrieve_result_by_address(address)¶
Retrieve the results of the data quality checks for a specific address.
- Parameters
address – string, address to find the result for
- Returns
dict, results of data quality check for specific building
- retrieve_result_by_tax_lot_id(tax_lot_id)¶
Retrieve the results of the data quality checks by the jurisdiction ID.
- Parameters
tax_lot_id – string, jurisdiction tax lot id
- Returns
dict, results of data quality check for specific building
- rules¶
Accessor to the related objects manager on the reverse side of a many-to-one relation.
In the example:
class Child(Model):
    parent = ForeignKey(Parent, related_name='children')
Parent.children is a ReverseManyToOneDescriptor instance.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
- save_to_cache(identifier, organization_id)¶
Save the results to the cache database. The data in the cache are stored as a list of dictionaries, whereas the data in this class are stored as a dict of dicts. This is important to remember because the data from the cache cannot simply be loaded back into the class structure.
- Parameters
identifier – Import file primary key
- Returns
None
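The mismatch between the two layouts described for save_to_cache can be sketched like this. The 'id' key used to carry the row id into each cached dict is an assumption made for illustration; the real field names are not given in this reference.

```python
def results_to_cache_rows(results):
    """Flatten {row_id: {...}} into [{'id': row_id, ...}, ...]."""
    return [{'id': row_id, **result} for row_id, result in results.items()]


def cache_rows_to_results(rows):
    """Rebuild the dict-of-dicts layout from the cached list of dicts."""
    return {
        row['id']: {k: v for k, v in row.items() if k != 'id'}
        for row in rows
    }


in_memory = {
    10: {'address_line_1': '1 Cyclotron Rd', 'data_quality_results': []},
    11: {'address_line_1': None, 'data_quality_results': []},
}
cached = results_to_cache_rows(in_memory)
```

Loading from the cache therefore needs an explicit conversion step such as cache_rows_to_results rather than a direct assignment.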
- update_status_label(label_class, rule, linked_id, row_id, add_to_results=True)¶
- Parameters
label_class – statuslabel object, either propertyview label or taxlotview label
rule – rule object
linked_id – id of propertyview or taxlotview object
row_id –
add_to_results – bool
- Returns
boolean, whether the label was applied
- exception seed.models.data_quality.DataQualityTypeCastError¶
Bases:
Exception
- class seed.models.data_quality.Rule(*args, **kwargs)¶
Bases:
django.db.models.base.Model
Rules for DataQualityCheck
- DATA_TYPES = [(0, 'number'), (1, 'string'), (2, 'date'), (3, 'year'), (4, 'area'), (5, 'eui')]¶
- DEFAULT_RULES = [
{'table_name': 'PropertyState', 'field': 'address_line_1', 'data_type': 1, 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'},
{'table_name': 'PropertyState', 'field': 'pm_property_id', 'data_type': 1, 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'},
{'table_name': 'PropertyState', 'field': 'custom_id_1', 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'},
{'table_name': 'TaxLotState', 'field': 'jurisdiction_tax_lot_id', 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'},
{'table_name': 'TaxLotState', 'field': 'address_line_1', 'data_type': 1, 'not_null': True, 'rule_type': 0, 'severity': 0, 'condition': 'not_null'},
{'table_name': 'PropertyState', 'field': 'conditioned_floor_area', 'data_type': 4, 'rule_type': 0, 'min': 0, 'max': 7000000, 'severity': 0, 'units': 'ft**2', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'conditioned_floor_area', 'data_type': 4, 'rule_type': 0, 'min': 100, 'severity': 1, 'units': 'ft**2', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'energy_score', 'data_type': 0, 'rule_type': 0, 'min': 0, 'max': 100, 'severity': 0, 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'energy_score', 'data_type': 0, 'rule_type': 0, 'min': 10, 'severity': 1, 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'generation_date', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'gross_floor_area', 'data_type': 0, 'rule_type': 0, 'min': 100, 'max': 7000000, 'severity': 0, 'units': 'ft**2', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'occupied_floor_area', 'data_type': 0, 'rule_type': 0, 'min': 100, 'max': 7000000, 'severity': 0, 'units': 'ft**2', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'recent_sale_date', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'release_date', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'site_eui', 'data_type': 5, 'rule_type': 0, 'min': 0, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'site_eui', 'data_type': 5, 'rule_type': 0, 'min': 10, 'severity': 1, 'units': 'kBtu/ft**2/year', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'site_eui_weather_normalized', 'data_type': 5, 'rule_type': 0, 'min': 0, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'source_eui', 'data_type': 5, 'rule_type': 0, 'min': 0, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'source_eui', 'data_type': 5, 'rule_type': 0, 'min': 10, 'severity': 1, 'units': 'kBtu/ft**2/year', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'source_eui_weather_normalized', 'data_type': 5, 'rule_type': 0, 'min': 10, 'max': 1000, 'severity': 0, 'units': 'kBtu/ft**2/year', 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'year_built', 'data_type': 3, 'rule_type': 0, 'min': 1700, 'max': 2019, 'severity': 0, 'condition': 'range'},
{'table_name': 'PropertyState', 'field': 'year_ending', 'data_type': 2, 'rule_type': 0, 'min': 18890101, 'max': 20201231, 'severity': 0, 'condition': 'range'}
]¶
- exception DoesNotExist¶
Bases:
django.core.exceptions.ObjectDoesNotExist
- exception MultipleObjectsReturned¶
Bases:
django.core.exceptions.MultipleObjectsReturned
- RULE_EXCLUDE = 'exclude'¶
- RULE_INCLUDE = 'include'¶
- RULE_NOT_NULL = 'not_null'¶
- RULE_RANGE = 'range'¶
- RULE_REQUIRED = 'required'¶
- RULE_TYPE = [(0, 'default'), (1, 'custom')]¶
- RULE_TYPE_CUSTOM = 1¶
- RULE_TYPE_DEFAULT = 0¶
- SEVERITY = [(0, 'error'), (1, 'warning'), (2, 'valid')]¶
- SEVERITY_ERROR = 0¶
- SEVERITY_VALID = 2¶
- SEVERITY_WARNING = 1¶
- TYPE_AREA = 4¶
- TYPE_DATE = 2¶
- TYPE_EUI = 5¶
- TYPE_NUMBER = 0¶
- TYPE_STRING = 1¶
- TYPE_YEAR = 3¶
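The integer codes used in the rule dicts map to labels through the DATA_TYPES, SEVERITY, and RULE_TYPE choice lists. The sketch below decodes one rule into a readable sentence; describe_rule is a hypothetical helper, and the 'unspecified' fallback for rules without a data_type is an assumption.

```python
# Choice lists copied from the Rule attributes documented above.
DATA_TYPES = [(0, 'number'), (1, 'string'), (2, 'date'), (3, 'year'), (4, 'area'), (5, 'eui')]
SEVERITY = [(0, 'error'), (1, 'warning'), (2, 'valid')]
RULE_TYPE = [(0, 'default'), (1, 'custom')]


def describe_rule(rule):
    """Turn the integer codes of a rule dict into a readable summary."""
    # Some default rules (e.g. custom_id_1) carry no data_type at all.
    data_type = dict(DATA_TYPES).get(rule.get('data_type'), 'unspecified')
    severity = dict(SEVERITY)[rule['severity']]
    rule_type = dict(RULE_TYPE)[rule['rule_type']]
    return (
        f"{rule['table_name']}.{rule['field']}: {rule_type} "
        f"{rule['condition']} rule on {data_type} data, severity {severity}"
    )


site_eui_warning = {
    'table_name': 'PropertyState', 'field': 'site_eui', 'data_type': 5,
    'rule_type': 0, 'min': 10, 'severity': 1,
    'units': 'kBtu/ft**2/year', 'condition': 'range',
}
print(describe_rule(site_eui_warning))
```

On a model instance, the generated get_data_type_display(), get_rule_type_display(), and get_severity_display() methods listed in this reference serve the same purpose.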
- condition¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- data_quality_check¶
Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.
In the example:
class Child(Model):
    parent = ForeignKey(Parent, related_name='children')
Child.parent is a ForwardManyToOneDescriptor instance.
- data_quality_check_id¶
- data_type¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- description¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- enabled¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- field¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- for_derived_column¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- format_strings(value)¶
- get_data_type_display(*, field=<django.db.models.fields.IntegerField: data_type>)¶
- get_rule_type_display(*, field=<django.db.models.fields.IntegerField: rule_type>)¶
- get_severity_display(*, field=<django.db.models.fields.IntegerField: severity>)¶
- id¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- max¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- maximum_valid(value)¶
Validate that the value is not greater than the maximum specified by the rule.
- Parameters
value – Value to validate rule against
- Returns
bool, True if valid, False if the value is out of range
- min¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- minimum_valid(value)¶
Validate that the value is not less than the minimum specified by the rule.
- Parameters
value – Value to validate rule against
- Returns
bool, True if valid, False if the value is out of range
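The range checks can be sketched with plain numbers. The functions below take the bound as an explicit argument, unlike the documented methods, which read it from the rule instance. Treating a missing bound as "no limit" is an assumption, motivated by default rules that define only a min.

```python
def minimum_valid(value, rule_min):
    """True when there is no minimum or ``value`` is not less than it."""
    return rule_min is None or value >= rule_min


def maximum_valid(value, rule_max):
    """True when there is no maximum or ``value`` is not greater than it."""
    return rule_max is None or value <= rule_max


def in_range(value, rule):
    """Check a value against the optional 'min' and 'max' of a rule dict."""
    return (
        minimum_valid(value, rule.get('min'))
        and maximum_valid(value, rule.get('max'))
    )


# Bounds taken from the two default energy_score rules.
energy_score_error = {'min': 0, 'max': 100, 'severity': 0}
energy_score_warning = {'min': 10, 'severity': 1}
```

With these two rules, an energy_score of 5 passes the error-level rule but violates the warning-level rule, which is how a pair of default rules on the same field can grade a value.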
- name¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- not_null¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- objects = <django.db.models.manager.Manager object>¶
- required¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- rule_type¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- severity¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- status_label¶
Accessor to the related object on the forward side of a many-to-one or one-to-one (via ForwardOneToOneDescriptor subclass) relation.
In the example:
class Child(Model):
    parent = ForeignKey(Parent, related_name='children')
Child.parent is a ForwardManyToOneDescriptor instance.
- status_label_id¶
- str_to_data_type(value)¶
If the check is coming from a field in the database then it will be typed correctly; however, for extra_data, the values are typically strings or unicode. Therefore, the values are typed before they are checked using the rule’s data type definition.
- Parameters
value – variant, value to type
- Returns
typed value
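The idea of typing extra_data strings before checking them can be sketched as below. This covers only the number, year, and string types. The empty-string handling, the choice of float and int, and the explicit data_type argument are assumptions; the exception class is a local stand-in for the one documented in this module.

```python
TYPE_NUMBER, TYPE_STRING, TYPE_YEAR = 0, 1, 3  # codes from Rule.DATA_TYPES


class DataQualityTypeCastError(Exception):
    """Local stand-in for seed.models.data_quality.DataQualityTypeCastError."""


def str_to_data_type(data_type, value):
    """Cast a string ``value`` according to ``data_type``.

    Values that are not strings are assumed to be typed already and are
    returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    if text == '':
        return None  # assumption: blank strings count as missing
    try:
        if data_type == TYPE_NUMBER:
            return float(text)
        if data_type == TYPE_YEAR:
            return int(text)
    except ValueError as exc:
        raise DataQualityTypeCastError(f'cannot cast {value!r}') from exc
    return text
```

Casting first means a later range check compares numbers with numbers rather than a string with the rule's numeric bounds.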
- table_name¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- text_match¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- units¶
A wrapper for a deferred-loading field. When the value is read from this object the first time, the query is executed.
- valid_text(value)¶
Validate the rule matches the specified text. Text is matched by regex.
- Parameters
value – Value to validate rule against
- Returns
bool, True if valid, False if the value does not match
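A regex-based text check can be sketched as follows. Using re.search (match anywhere in the string), treating an empty pattern as always valid, and passing the pattern as an explicit argument are all assumptions; the documented method reads the pattern from the rule's text_match field.

```python
import re


def valid_text(value, text_match):
    """True when ``value`` matches the ``text_match`` regex.

    Assumption: a rule without a pattern accepts every value, and a
    non-string value cannot match a pattern.
    """
    if not text_match:
        return True
    if not isinstance(value, str):
        return False
    return re.search(text_match, value) is not None
```

For example, the pattern ^Office accepts 'Office - Large' and rejects 'Retail'.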
- exception seed.models.data_quality.UnitMismatchError¶
Bases:
Exception
- seed.models.data_quality.format_pint_violation(rule, source_value)¶
Format a pint min, max violation for human readability.
- Parameters
rule –
source_value – Quantity, value to format into range
- Returns
(formatted_value, formatted_min, formatted_max), a tuple of (String, String, String)
Tests¶
Views¶
- class seed.views.data_quality.DataQualityViews(**kwargs)¶
Bases:
rest_framework.viewsets.ViewSet, seed.utils.api.OrgMixin
Handles Data Quality API operations within Inventory backend. (1) Post, wait, get… (2) Respond with what changed
- create(request)¶
This API endpoint creates a new cleansing operation process in the background, potentially on a subset of properties/taxlots, and returns a query key.
- Parameters
organization_id – Organization ID (integer, required, query parameter)
data_quality_ids – An object containing the IDs of the records to run data quality checks on. It should contain two keys, property_state_ids and taxlot_state_ids, each of which is an array of the appropriate IDs (required, body parameter)
- Returns
status – string, required; success or error
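A request to the create endpoint can be sketched by assembling the query string and JSON body described above. The base URL is hypothetical, and the sketch assumes the object with property_state_ids and taxlot_state_ids is sent as the JSON body itself; check the deployed API before relying on either.

```python
import json
from urllib.parse import urlencode


def build_create_request(base_url, organization_id, property_state_ids, taxlot_state_ids):
    """Return the (url, body) pair for starting a data quality check.

    ``base_url`` is a placeholder; the real route is not given in this
    reference.
    """
    query = urlencode({'organization_id': organization_id})
    url = f'{base_url}?{query}'
    body = json.dumps({
        'property_state_ids': list(property_state_ids),
        'taxlot_state_ids': list(taxlot_state_ids),
    })
    return url, body


url, body = build_create_request('https://seed.example.org/dq/', 1, [101, 102], [201])
print(url)
print(body)
```

The response is expected to carry a status of success or error; the results themselves are fetched separately through the results endpoint.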
- csv(request, pk)¶
Download a CSV of the data quality checks by pk, which is the cache_key.
- Parameters
pk – Import file ID or cache key (required, path parameter)
- data_quality_rules(request)¶
Returns the data_quality rules for an org.
- Parameters
organization_id – Organization ID (integer, required, query parameter)
- Returns
status – string, required; success or error
rules – object, required; an object containing ‘properties’ and ‘taxlots’ arrays of rules
- reset_all_data_quality_rules(request)¶
Resets an organization’s data quality rules.
- Parameters
organization_id – Organization ID (integer, required, query parameter)
- Returns
status – string, required; success or error
in_range_checking – array[string], required; an array of in-range error rules
missing_matching_field – array[string], required; an array of fields to verify existence
missing_values – array[string], required; an array of fields to ignore missing values
- reset_default_data_quality_rules(request)¶
Resets an organization’s data quality rules.
- Parameters
organization_id – Organization ID (integer, required, query parameter)
- Returns
status – string, required; success or error
in_range_checking – array[string], required; an array of in-range error rules
missing_matching_field – array[string], required; an array of fields to verify existence
missing_values – array[string], required; an array of fields to ignore missing values
- results(request)¶
Return the result of the data quality based on the ID that was given during the creation of the data quality task. Note that it is not related to the object in the database, since the results are stored in redis!
- save_data_quality_rules(request, pk=None)¶
Saves an organization’s settings: name, query threshold, shared fields. The method passes in all the fields again, so it is acceptable, albeit inefficient, to remove all the rules in the db and recreate them.
- Parameters
organization_id – Organization ID (integer, required, query parameter)
body – JSON body containing organization rules information (RulesSerializer, required, body parameter)
- Returns
status – string, required; success or error
message – string, required; error message, if any
- class seed.views.data_quality.RulesIntermediateSerializer(*args, **kwargs)¶
Bases:
rest_framework.serializers.Serializer
- class seed.views.data_quality.RulesSerializer(*args, **kwargs)¶
Bases:
rest_framework.serializers.Serializer
- class seed.views.data_quality.RulesSubSerializer(*args, **kwargs)¶
Bases:
rest_framework.serializers.Serializer
- class seed.views.data_quality.RulesSubSerializerB(*args, **kwargs)¶
Bases:
rest_framework.serializers.Serializer