Data management templates¶
Usage¶
Data templates help you load tables into Orca, create columns of derived data, or save tables or subsets of tables to disk.
from urbansim_templates.data import LoadTable
t = LoadTable()
t.table = 'buildings' # a name for the Orca table
t.source_type = 'csv'
t.path = 'buildings.csv'
t.csv_index_cols = 'building_id'
t.name = 'load_buildings' # a name for the model step that sets up the table
You can run this directly using t.run(), or register the configured template to be part of a larger workflow:
from urbansim_templates import modelmanager
modelmanager.register(t)
Registration does two things: (a) it saves the configured template to disk as a yaml file, and (b) it creates a model step with logic for loading the table. Running the model step is equivalent to running the configured template object:
t.run()
# equivalent:
import orca
orca.run(['load_buildings'])
Strictly speaking, running the model step doesn’t load the data; it just sets up an Orca table with instructions for loading the data when it’s needed. (This is called lazy evaluation.)
orca.run(['load_buildings']) # now an Orca table named 'buildings' is registered
orca.get_table('buildings').to_frame() # now the data is read from disk
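The lazy-evaluation pattern can be illustrated without Orca. The following is a minimal sketch, not Orca's implementation: LazyTableRegistry and load_buildings are hypothetical names, and the loader returns a small in-memory list instead of reading buildings.csv.

```python
class LazyTableRegistry:
    """Hypothetical stand-in for a table registry (not Orca's code)."""

    def __init__(self):
        self._loaders = {}  # table name -> zero-argument loader function
        self._cache = {}    # table name -> data that has already been loaded

    def register(self, name, loader):
        # Cheap: only the loading instructions are stored, nothing is read.
        self._loaders[name] = loader
        self._cache.pop(name, None)

    def get(self, name):
        # The loader runs on first access; later calls reuse the cached data.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]


calls = []  # records each time the (pretend) disk read happens


def load_buildings():
    # Stands in for reading 'buildings.csv' from disk.
    calls.append('read')
    return [{'building_id': 1, 'sqft': 1200}, {'building_id': 2, 'sqft': 800}]


registry = LazyTableRegistry()
registry.register('buildings', load_buildings)
calls_after_register = len(calls)  # registration alone triggers no read

rows = registry.get('buildings')   # first access triggers the read
registry.get('buildings')          # second access is served from the cache
calls_after_access = len(calls)
```

Registration costs almost nothing in this sketch because only a function reference is stored, which is the property that makes it reasonable to run a table-loading step automatically.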
Because “running” the table-loading step is costless, it’s done automatically when you register a configured template. It’s also done automatically when you initialize a ModelManager session and table-loading configs are read from yaml. (If you’d like to disable this for a particular table, you can set t.autorun = False.)
Recommended data schemas¶
The LoadTable template will work with any data that can be loaded into a Pandas DataFrame. But we highly recommend following stricter data schema rules:
Each table should include a unique, named index column (a.k.a. primary key) or set of columns (multi-index, a.k.a. composite key).
If a column is meant to be a join key for another table, it should have the same name as the index of that table.
Duplication of column names across tables (except for the join keys) is discouraged, for clarity.
If you follow these rules, tables can be automatically merged on the fly, for example to assemble estimation data or calculate indicators.
You can use validate_table() or validate_all_tables() to check whether these expectations are met. When templates merge tables on the fly, they use merge_tables().
These utility functions work with any Orca table that meets the schema expectations, whether or not it was created with a template.
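To make the schema rules concrete, here is a minimal sketch that checks them by hand with plain Pandas and then merges two conforming tables. The check_schema helper is hypothetical and is not the library's validate_table(); the table contents are made up for illustration.

```python
import pandas as pd


def check_schema(tables):
    """Return a list of problems; an empty list means the rules hold.

    `tables` maps table names to DataFrames. Hypothetical helper for
    illustration only, not the library's validate_table().
    """
    problems = []
    key_names = set()
    for name, df in tables.items():
        # Rule 1: each table needs a unique, named index (or multi-index).
        if any(level is None for level in df.index.names):
            problems.append(f"{name}: index is unnamed")
        else:
            key_names.update(df.index.names)
        if not df.index.is_unique:
            problems.append(f"{name}: index has duplicate values")

    # Rule 3: apart from join keys, column names should not repeat.
    first_seen = {}
    for name, df in tables.items():
        for col in df.columns:
            if col in key_names:
                continue
            if col in first_seen:
                problems.append(
                    f"column '{col}' is in both {first_seen[col]} and {name}")
            else:
                first_seen[col] = name
    return problems


parcels = pd.DataFrame(
    {'parcel_id': [10, 11], 'zone': ['A', 'B']}).set_index('parcel_id')
buildings = pd.DataFrame(
    {'building_id': [1, 2, 3],
     'parcel_id': [10, 10, 11],   # join key, named after the parcels index
     'sqft': [1200, 800, 950]}).set_index('building_id')

problems = check_schema({'parcels': parcels, 'buildings': buildings})

# Rule 2 in action: the join key column matches the other table's index name,
# so the merge needs no extra configuration.
merged = buildings.join(parcels, on='parcel_id')
```

Because the join key in buildings carries the same name as the index of parcels, the merge can be inferred from names alone, which is why no separate relationship has to be declared.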
Compatibility with Orca¶
From Orca’s perspective, tables set up using the LoadTable template are equivalent to tables that are registered using orca.add_table() or the @orca.table decorator. Technically, they are orca.TableFuncWrapper objects.
Unlike the templates, Orca relies on user-specified “broadcast” relationships to perform automatic merging of tables. LoadTable does not register any broadcasts, because they’re not needed if tables follow the schema rules above. So if you use these tables in non-template model steps, you may need to add broadcasts separately.
Data loading API¶
- class urbansim_templates.data.LoadTable(table=None, source_type=None, path=None, csv_index_cols=None, extra_settings={}, cache=True, cache_scope='forever', copy_col=True, name=None, tags=[], autorun=True)
Template for registering data tables from local CSV or HDF files. Parameters can be passed to the constructor or set as attributes.
An instance of this template class stores instructions for loading a data table, packaged into an Orca step. Running the instructions registers the table with Orca.
- Parameters
table (str, optional) – Name of the Orca table to be created. Must be provided before running the step.
source_type ('csv' or 'hdf', optional) – Source type. Must be provided before running the step.
path (str, optional) – Local file path to load data from, either absolute or relative to the ModelManager config directory. Please provide a Unix-style path (this will work on any platform, but a Windows-style path won’t, and they’re hard to normalize automatically).
url (str, optional - NOT YET IMPLEMENTED) – Remote url to download file from.
csv_index_cols (str or list of str, optional) – Required for tables loaded from csv.
extra_settings (dict, optional) – Additional arguments to pass to pd.read_csv() or pd.read_hdf(). For example, you could automatically extract csv data from a gzip file using {'compression': 'gzip'}, or specify the table identifier within a multi-object hdf store using {'key': 'table-name'}. See Pandas documentation for additional settings.
orca_test_spec (dict, optional - NOT YET IMPLEMENTED) – Data characteristics to be tested when the table is validated.
cache (bool, default True) – Passed to orca.table(). Note that the default is True, unlike in the underlying general-purpose Orca function, because tables read from disk should not need to be regenerated during the course of a model run.
cache_scope ('step', 'iteration', or 'forever', default 'forever') – Passed to orca.table(). Default is 'forever', as in Orca.
copy_col (bool, default True) – Passed to orca.table(). Default is True, as in Orca.
name (str, optional) – Name of the model step.
tags (list of str, optional) – Tags, passed to ModelManager.
autorun (bool, default True) – Automatically run the step whenever it’s registered with ModelManager.
- classmethod from_dict(d)
Create an object instance from a saved dictionary representation.
- Parameters
d (dict)
- Return type
Table
Column creation API¶
- class urbansim_templates.data.ColumnFromExpression(meta=None, data=None, output=None)
Template to register a column of derived data with Orca, based on an expression. Parameters may be passed to the constructor, but they are easier to set as attributes. The expression can refer to any columns in the same table, and will be evaluated using df.eval(). Values will be calculated lazily, only when the column is needed for a specific operation.
- Parameters
meta (CoreTemplateSettings, optional) – Standard parameters. This template sets the default value of meta.autorun to True.
data (ExpressionSettings, optional) – Special parameters for this template.
output (OutputColumnSettings, optional) – Parameters for the column that will be generated. This template uses data.table as the default value for output.table.
- classmethod from_dict_0_2_dev5(d)
Converter to read saved data from 0.2.dev5 or earlier. Automatically invoked by from_dict() as needed.
- class urbansim_templates.data.ExpressionSettings(table=None, expression=None)
Stores custom parameters used by the ColumnFromExpression template. Parameters can be passed to the constructor or set as attributes.
- Parameters
table (str, optional) – Name of Orca table the expression will be evaluated on. Required before running the template.
expression (str, optional) – String describing operations on existing columns of the table, for example “a/log(b+c)”. Required before running. Supports arithmetic and math functions including sqrt, abs, log, log1p, exp, and expm1 – see Pandas df.eval() documentation for further details.
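As an illustration of the kind of expression string described above, the sketch below evaluates “a/log(b+c)” directly with Pandas, outside of any template. The column names and values are made up for the example, and the comparison against NumPy is only there to show what the expression computes.

```python
import numpy as np
import pandas as pd

# Made-up data; columns a, b, and c exist only for this example.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [2.0, 3.0, 4.0],
    'c': [1.0, 1.0, 2.0],
})

# The expression refers to columns by name, as a template expression would.
derived = df.eval('a / log(b + c)')

# The same calculation written out with NumPy, for comparison.
expected = df['a'] / np.log(df['b'] + df['c'])
```

The result is a Series aligned with the table's index, so it can be attached to the table as a new column.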
Data output API¶
- class urbansim_templates.data.SaveTable(table=None, columns=None, filters=None, output_type=None, path=None, extra_settings=None, name=None, tags=[])
Template for saving Orca tables to local CSV or HDF5 files. Parameters can be passed to the constructor or set as attributes.
- Parameters
table (str, optional) – Name of the Orca table. Must be provided before running the step.
columns (str or list of str, optional) – Names of columns to include. None will return all columns. Indexes will always be included.
filters (str or list of str, optional) – Filters to apply to the data before saving. Will be passed to pd.DataFrame.query().
output_type ('csv' or 'hdf', optional) – Type of file to be created. Must be provided before running the step.
path (str, optional) – Local file path to save the data to, either absolute or relative to the ModelManager config directory. Please provide a Unix-style path (this will work on any platform, but a Windows-style path won’t, and they’re hard to normalize automatically). For dynamic file names, you can include the characters “%RUN%”, “%ITER%”, or “%TS%”. These will be replaced by the run id, the model iteration value, or a timestamp when the output file is created.
extra_settings (dict, optional) – Additional arguments to pass to pd.DataFrame.to_csv() or pd.DataFrame.to_hdf(). For example, you could automatically compress csv data using {'compression': 'gzip'}, or specify a custom table name for an hdf store using {'key': 'table-name'}. See Pandas documentation for additional settings.
name (str, optional) – Name of the model step.
tags (list of str, optional) – Tags, passed to ModelManager.
- classmethod from_dict(d)
Create an object instance from a saved dictionary representation.
- Parameters
d (dict)
- Return type
Table
- get_dynamic_filepath()
Substitute run id, model iteration, and/or timestamp into the filename.
For the run id and model iteration, we look for Orca injectables named run_id and iter_var, respectively. If none is found, we use 0.
The timestamp is UTC, formatted as YYYYMMDD-HHMMSS.
- Return type
str
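The placeholder substitution described above can be sketched in a few lines. This is an illustrative approximation, not the library's code: fill_dynamic_path is a hypothetical function, and the run id and iteration are passed as plain arguments (defaulting to 0) instead of being looked up as Orca injectables.

```python
from datetime import datetime, timezone


def fill_dynamic_path(path, run_id=0, iteration=0, now=None):
    """Replace %RUN%, %ITER%, and %TS% in a file path (hypothetical helper)."""
    if now is None:
        now = datetime.now(timezone.utc)  # timestamps are in UTC
    timestamp = now.strftime('%Y%m%d-%H%M%S')  # YYYYMMDD-HHMMSS
    return (path.replace('%RUN%', str(run_id))
                .replace('%ITER%', str(iteration))
                .replace('%TS%', timestamp))


# A fixed time keeps the example reproducible.
fixed_time = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
example = fill_dynamic_path(
    'output/buildings-%RUN%-%ITER%-%TS%.csv',
    run_id=7, iteration=2030, now=fixed_time)
# example == 'output/buildings-7-2030-20200102-030405.csv'
```

Including %ITER% or %TS% in the path gives each save a distinct file name, so outputs from successive iterations or runs do not overwrite each other.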