This repository contains the source code and the dataset for the paper "Towards Data-Driven Electricity Management: Multi-Region Harmonized Data and Knowledge Graph".
The repository includes downloads for the datasets and all the necessary code to run the pipeline for preprocessing the data and generating the knowledge graph. The knowledge graph is generated from a set of raw datasets containing electricity consumption data from multiple regions and households. The data is preprocessed and harmonized to generate a knowledge graph containing information about the households, appliances, and electricity consumption. We also provide a model training pipeline that can be used to train a model for on/off appliance classification.
- data - contains the raw datasets and their metadata and, later, the preprocessed data
- scripts - contains the scripts for preprocessing the data and populating the database
- notebooks - contains notebooks for data exploration and model training (TODO may remove)
- src - contains model training and evaluation code and some helper scripts
- requirements.txt - contains the Python dependencies for the project without TensorFlow
- requirements.tensorflow.txt - contains the Python dependencies for TensorFlow
The scripts are described in more detail in the scripts README.
To use the pipeline, you need to download either the full harmonized data dump from our Figshare repository or the sample dataset. Additionally, the metadata needs to be downloaded and extracted into the data folder. The ./data/ folder should contain two subdirectories: one for the metadata and another for the harmonized datasets.
Before proceeding, ensure that the PARSED_DATA_PATH and METADATA_PATH paths in the pipeline_config.py file point to the correct folders for the harmonized data and metadata. Once everything is set up, you can start from step 4 below.
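The folder layout described above can be sanity-checked before running anything. The following is a minimal sketch, not part of the repository: the function name is hypothetical, and the subdirectory names "metadata" and "harmonized" are assumptions that should be replaced with whatever PARSED_DATA_PATH and METADATA_PATH actually point to.

```python
from pathlib import Path


def missing_data_dirs(data_root, subdirs=("metadata", "harmonized")):
    """Return the expected subdirectories that are absent under data_root.

    The default subdirectory names are assumptions for illustration only.
    """
    root = Path(data_root)
    return [name for name in subdirs if not (root / name).is_dir()]
```

Calling `missing_data_dirs("./data")` from the repository root returns an empty list when both folders are present, and otherwise the names of the folders that still need to be created or extracted.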
1. Start a terminal and clone this repository:
   git clone https://github.com/sensorlab/energy-knowledge-graph
2. Unzip the data_dump_full.tar.gz file into the data directory:
   tar -xvf data_dump_full.tar.gz -C data
   Optionally, you can use only the data_sample.tar.gz file instead:
   tar -xvf data_sample.tar.gz -C data
3. Make sure that the ./data/ folder contains the required datasets and metadata. It should have the following structure: ./data/metadata/ containing the metadata and ./data/raw/ containing the raw datasets.
4. Navigate into the energy-knowledge-graph directory, activate a conda or virtualenv environment, and install the dependencies for data preprocessing with
   pip install -r requirements.txt
   and, if you want to use the machine learning part of the pipeline, with
   pip install -r requirements.tensorflow.txt --extra-index-url https://pypi.nvidia.com
5. Create an .env file in the scripts directory with the following content:
   DATABASE_USER=<username to access PostgreSQL database>
   DATABASE_PASSWORD=<password to access PostgreSQL database>
6. Check pipeline_config.py for the configuration of the pipeline; leave it as is for the default configuration.
7. Run
   python scripts/process_data.py
   By default, this preprocesses the datasets defined in pipeline_config.py and stores them in the database. With the command line argument --sample, it preprocesses only the datasets present in the sample dump; with --full, it preprocesses all the datasets present in the full dump.
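To illustrate how the two credentials from the .env file could end up in a database URL, here is a minimal sketch using only the standard library. It is not the pipeline's own loading code: the function names are hypothetical, and the host, port, and database name are placeholder assumptions rather than project defaults.

```python
from urllib.parse import quote


def read_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_postgres_url(env, host="localhost", port=5432, database="energy"):
    # host, port and database are hypothetical placeholders.
    # Credentials are percent-encoded so characters such as "@" stay valid.
    user = quote(env["DATABASE_USER"], safe="")
    password = quote(env["DATABASE_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Percent-encoding the password matters in practice: a password containing "@" or "/" would otherwise produce a URL that is parsed incorrectly.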
The full raw dump contains all the datasets and their corresponding metadata, while the sample raw dump contains a subset of the full raw dump. The triples dump contains the triples forming the knowledge graph in Turtle format, while the harmonized data dump contains the harmonized data in pickle files. The harmonized data is the same as the output of the pipeline in step 1.
- Full raw dump (91.2 GB): data_dump_full.tar.gz
- Sample raw dump (10.4 GB): data_sample.tar.gz
- All raw datasets can be downloaded separately from the data folder
The following files are also available for download on Figshare (TODO link when available):
- Metadata dump (1.2 GB): metadata.tar.gz
- Triples dump (125 MB): triples.ttl
- Harmonized data dump (80 GB): harmonized.tar.gz
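Given the size of these archives, it can be useful to peek at their contents before extracting them. The sketch below uses Python's standard tarfile module to list the first few member names of a .tar.gz archive; the function name and the limit of 20 entries are illustrative choices, not something the repository provides.

```python
import tarfile


def list_archive_members(archive_path, limit=20):
    """Return up to `limit` member names from a .tar.gz archive without extracting it."""
    names = []
    with tarfile.open(archive_path, "r:gz") as archive:
        # Iterating reads member headers one by one, so we can stop early.
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For example, `list_archive_members("metadata.tar.gz")` shows which top-level folder the archive will create, which helps confirm that extracting with `-C data` produces the expected layout.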
The datasets used in this project are unified from the following open research datasets:
- DEDDIAG
- DEKN
- DRED
- ECDUY
- ECO
- EEUD
- ENERTALK
- HEART
- HES
- HUE
- IDEAL
- IAWE
- LERTA
- PRECON
- REDD
- REFIT
- SUST1
- SUST2
- UCIML
- UKDALE
This project is meant for scientific and research purposes and so are the datasets.
The pipeline can be customized by changing the following parameters in the pipeline_config.py file:
- STEPS - list of data processing steps to be executed
- DATASETS - list of datasets to be preprocessed
- TRAINING_DATASETS - list of datasets to be used to generate the training data
- PREDICT_DATASETS - list of unlabelled datasets to run the pretrained model on
- POSTGRES_URL - the URL of the PostgreSQL database in which to store the data
- various paths for where to store the data and where to read it from; these are explained in more detail in the pipeline_config.py file
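As an illustration of what such a configuration might look like, here is a hypothetical fragment. Only the parameter names come from this README; every value, including the dataset selections, the database URL, and the paths, is an assumption and should be checked against the actual pipeline_config.py.

```python
# Illustrative values only; the real defaults live in pipeline_config.py.
STEPS = ["parse", "loadprofiles", "metadata", "consumption-data", "db-reset"]
DATASETS = ["REFIT", "UKDALE"]      # assumed spelling of dataset names
TRAINING_DATASETS = ["REFIT"]       # assumed choice of training data
PREDICT_DATASETS = ["LERTA"]        # assumed choice of unlabelled data
POSTGRES_URL = "postgresql://user:password@localhost:5432/energy"  # placeholder
PARSED_DATA_PATH = "./data/parsed/"     # placeholder path
METADATA_PATH = "./data/metadata/"
```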
The pipeline contains the following data processing steps:
- parse - This script runs the parsers for the datasets and stores the parsed datasets in pickle files
- loadprofiles - This script calculates the load profiles for the households
- metadata - This script generates metadata for the households and stores it in a dataframe as a parquet file
- consumption-data - This script calculates the electricity consumption data for the households and their appliances
- db-reset - This script resets and populates the database with the household metadata, load profiles and consumption data
and the following steps for predicting devices using a pretrained model (requires tensorflow):
- predict-devices - This script predicts the devices for the households using a pretrained model
- add_predicted_devices - This script adds the predicted devices to the knowledge graph
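A configured STEPS list can be checked against the step names above before a long run starts. The following sketch is an assumption about how such a check could look, not the pipeline's own logic: the helper name is hypothetical, and it assumes that the order in which the steps are listed here is also the intended execution order.

```python
KNOWN_STEPS = [
    "parse", "loadprofiles", "metadata", "consumption-data", "db-reset",
    "predict-devices", "add_predicted_devices",
]
# Steps that rely on the pretrained model and therefore on TensorFlow.
TENSORFLOW_STEPS = {"predict-devices", "add_predicted_devices"}


def check_steps(steps):
    """Split configured steps into (valid, unknown, needs_tensorflow).

    Valid steps are returned in the order in which they are listed above.
    """
    unknown = [s for s in steps if s not in KNOWN_STEPS]
    valid = sorted((s for s in steps if s in KNOWN_STEPS), key=KNOWN_STEPS.index)
    needs_tensorflow = any(s in TENSORFLOW_STEPS for s in valid)
    return valid, unknown, needs_tensorflow
```

A misspelled step name shows up in the `unknown` list, and the `needs_tensorflow` flag indicates whether requirements.tensorflow.txt has to be installed for the selected steps.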
The training pipeline consists of the following steps:
- generate_training_data.py - This script generates the training data for on/off appliance classification from the training datasets specified in the model_config.py file. The training datasets have to be parsed beforehand by running the parse step of the pipeline.
- train.py - This script trains the model on the training data generated in the previous step and saves the model to the path specified in the model_config.py file. Various hyperparameters and training settings can also be set in the config file.
- eval.py - This script evaluates the model on the test data and saves the evaluation metrics to the path specified in the model_config.py file.
Quick start guide for training pipeline:
# navigate to the scripts directory
cd energy-knowledge-graph/scripts
# open the model_config.py file, set the paths to the data, and select the datasets to be used for training
# generate training data
python generate_training_data.py
# open the model_config.py file and set the hyperparameters and training settings
# train the model
python train.py
# evaluate the model
python eval.py
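Once model_config.py has been filled in, the three scripts can also be chained from Python so that a failure in one stops the rest. This is a minimal sketch under assumptions: the function name is hypothetical, the working directory "scripts" assumes it is run from the repository root, and the `runner` parameter exists only so the function can be exercised without launching real processes.

```python
import subprocess
import sys

TRAINING_SCRIPTS = ["generate_training_data.py", "train.py", "eval.py"]


def run_training_pipeline(scripts=TRAINING_SCRIPTS, cwd="scripts", runner=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code."""
    completed = []
    for script in scripts:
        # Use the current interpreter so the active environment is reused.
        result = runner([sys.executable, script], cwd=cwd)
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed
```

Because the quick start edits model_config.py between steps, this only fits the case where paths, datasets, and hyperparameters have all been set before the first script runs.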
We provide some example SPARQL queries that can be run on the knowledge graph. We also host a SPARQL endpoint where you can test your queries.
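Example 1: Query all locations with a GDP value greater than 50000, together with the name of the country each location is in
PREFIX voc: <https://elkg.ijs.si/ontology/>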
PREFIX saref: <https://saref.etsi.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
SELECT ?gdp ?location ?countryName WHERE {
    ?location rdf:type schema:Place .
    ?location voc:hasGDPOf ?gdp .
    ?location schema:containedInPlace ?country .
    ?country rdf:type schema:Country .
    ?country schema:name ?countryName .
    FILTER(?gdp > 50000) .
}
Example 2: Query all devices contained in household "LERTA_4" and their names
PREFIX voc: <https://elkg.ijs.si/ontology/>
PREFIX saref: <https://saref.etsi.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?house ?devices ?deviceNames ?houseName WHERE {
    ?house rdf:type schema:House .
    ?house schema:name ?houseName .
    ?house voc:containsDevice ?devices .
    ?devices schema:name ?deviceNames .
    FILTER(?houseName = "LERTA_4") .
}
Example 3: Query household "UKDALE_1" and the city it is in, as well as the corresponding city in DBpedia and Wikidata
PREFIX voc: <https://elkg.ijs.si/ontology/>
PREFIX saref: <https://saref.etsi.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?houseName ?city ?dbpediaCity ?wikidataCity WHERE {
    ?house rdf:type schema:House .
    ?house schema:name ?houseName .
    ?house schema:containedInPlace ?place .
    ?place schema:containedInPlace ?city .
    ?city rdf:type schema:City .
    OPTIONAL {
        ?city owl:sameAs ?linkedCity .
        FILTER(STRSTARTS(STR(?linkedCity), "http://dbpedia.org/resource/"))
        BIND(?linkedCity AS ?dbpediaCity)
    }
    OPTIONAL {
        ?city owl:sameAs ?linkedCity2 .
        FILTER(STRSTARTS(STR(?linkedCity2), "http://www.wikidata.org/entity/"))
        BIND(?linkedCity2 AS ?wikidataCity)
    }
    FILTER(?houseName = "UKDALE_1")
}
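Queries like the ones above can also be sent to a SPARQL endpoint programmatically. The sketch below follows the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter) and the SPARQL JSON results format, using only the Python standard library. The endpoint URL is a placeholder because this README does not state it, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: replace with the address of the hosted SPARQL endpoint.
ENDPOINT_URL = "https://example.org/sparql"


def build_sparql_request(query, endpoint=ENDPOINT_URL):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Variables left unbound (e.g. by OPTIONAL) are absent from a binding.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


def run_query(query, endpoint=ENDPOINT_URL):
    """Send the query and return its rows; requires network access."""
    with urllib.request.urlopen(build_sparql_request(query, endpoint)) as response:
        return rows_from_results(json.load(response))
```

In Example 3, a city without a DBpedia or Wikidata link leaves `?dbpediaCity` or `?wikidataCity` unbound, which `rows_from_results` reports as `None` rather than raising an error.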
If you use this dataset or pipeline in your research, citation of the following paper, which also provides additional details about the dataset and the processing pipeline, would be greatly appreciated:
@article{hanzel2024datadriven,
title={Towards Data-Driven Electricity Management: Multi-Region Harmonized Data and Knowledge Graph},
author={Vid Hanžel and Blaž Bertalanič and Carolina Fortuna},
year={2024},
eprint={2405.18869},
archivePrefix={arXiv},
primaryClass={cs.LG}
}