Orca Core

Note

In the documentation below the following imports are implied:

import orca
import pandas as pd

Tables

Tables are Pandas DataFrames. Use the add_table() function to register a DataFrame under a given name:

df = pd.DataFrame({'a': [1, 2, 3]})
orca.add_table('my_table', df)

Or you can use the decorator table() to register a function that returns a DataFrame:

@orca.table('halve_my_table')
def halve_my_table(my_table):
    df = my_table.to_frame()
    return df / 2

The decorator argument, which specifies the name to register the table with, is optional. If left out, the table is registered under the name of the function that is being decorated. The decorator example above could be written more concisely:

@orca.table()
def halve_my_table(my_table):
    df = my_table.to_frame()
    return df / 2

Note that the decorator parentheses are still required.

By registering halve_my_table as a function, its values will always be half those in my_table, even if my_table is later changed. If you’d like a function to not be evaluated every time it is used, pass the cache=True keyword when registering it.

Here’s a demo of the above table definitions shown in IPython:

In [19]: wrapped = orca.get_table('halve_my_table')

In [20]: wrapped.to_frame()
Out[20]:
     a
0  0.5
1  1.0
2  1.5

Table Wrappers

Notice in the table function above that we had to call a to_frame() method before using the table in a math operation. The values injected into functions are not DataFrames, but specialized wrappers. The wrappers facilitate caching, computed columns, and lazy evaluation of table functions. Learn more in the API documentation for DataFrameWrapper and TableFuncWrapper.
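To make the idea of lazy evaluation with optional caching concrete, here is a minimal sketch in plain pandas. The class LazyTableSketch and its calls counter are hypothetical names invented for this illustration; this is not Orca's actual wrapper code, only one way such a wrapper could behave.

```python
import pandas as pd


class LazyTableSketch:
    """Hypothetical stand-in for a table wrapper; not Orca's actual class."""

    def __init__(self, name, func, cache=False):
        self.name = name
        self._func = func
        self._cache = cache
        self._cached = None
        self.calls = 0  # counts evaluations, for demonstration only

    def to_frame(self):
        # Re-use the stored result only when caching was requested.
        if self._cache and self._cached is not None:
            return self._cached.copy()
        self.calls += 1
        frame = self._func()
        if self._cache:
            self._cached = frame
        return frame.copy()


source = pd.DataFrame({'a': [1, 2, 3]})
uncached = LazyTableSketch('halved', lambda: source / 2)
cached = LazyTableSketch('halved_cached', lambda: source / 2, cache=True)

for _ in range(3):
    uncached.to_frame()
    cached.to_frame()
```

After the loop, the uncached wrapper has called its function three times and the cached wrapper only once, which is the trade-off the cache=True keyword controls.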

Automated Merges

Certain analyses can be easiest when some tables are merged together, but in other places it may be best to keep the tables separate. Orca can make these on-demand merges easy by letting you define table relationships up front and then performing the merges for you as needed. We call these relationships “broadcasts” (as in a rule for how to broadcast one table onto another) and you register them using the broadcast() function.

For an example we’ll first define some DataFrames that contain links to one another and register them with Orca:

df_a = pd.DataFrame(
    {'a': [0, 1]},
    index=['a0', 'a1'])
df_b = pd.DataFrame(
    {'b': [2, 3, 4, 5, 6],
     'a_id': ['a0', 'a1', 'a1', 'a0', 'a1']},
    index=['b0', 'b1', 'b2', 'b3', 'b4'])
df_c = pd.DataFrame(
    {'c': [7, 8, 9]},
    index=['c0', 'c1', 'c2'])
df_d = pd.DataFrame(
    {'d': [10, 11, 12, 13, 15, 16, 16, 17, 18, 19],
     'b_id': ['b2', 'b0', 'b3', 'b3', 'b1', 'b4', 'b1', 'b4', 'b3', 'b3'],
     'c_id': ['c0', 'c1', 'c1', 'c0', 'c0', 'c2', 'c1', 'c2', 'c1', 'c2']},
    index=['d0', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9'])

orca.add_table('a', df_a)
orca.add_table('b', df_b)
orca.add_table('c', df_c)
orca.add_table('d', df_d)

The tables have data so that ‘a’ can be broadcast onto ‘b’, and ‘b’ and ‘c’ can be broadcast onto ‘d’. We use the broadcast() function to register those relationships:

orca.broadcast(cast='a', onto='b', cast_index=True, onto_on='a_id')
orca.broadcast(cast='b', onto='d', cast_index=True, onto_on='b_id')
orca.broadcast(cast='c', onto='d', cast_index=True, onto_on='c_id')

The syntax is similar to that of the pandas merge function, and indeed merge is used behind the scenes. Once the broadcasts are defined, use the merge_tables() function to get a merged DataFrame. Some examples in IPython:

In [4]: orca.merge_tables(target='b', tables=['a', 'b'])
Out[4]:
   a_id  b  a
b0   a0  2  0
b3   a0  5  0
b1   a1  3  1
b2   a1  4  1
b4   a1  6  1

In [5]: orca.merge_tables(target='d', tables=['a', 'b', 'c', 'd'])
Out[5]:
   b_id c_id   d  c a_id  b  a
d0   b2   c0  10  7   a1  4  1
d3   b3   c0  13  7   a0  5  0
d2   b3   c1  12  8   a0  5  0
d8   b3   c1  18  8   a0  5  0
d9   b3   c2  19  9   a0  5  0
d4   b1   c0  15  7   a1  3  1
d6   b1   c1  16  8   a1  3  1
d1   b0   c1  11  8   a0  2  0
d5   b4   c2  16  9   a1  6  1
d7   b4   c2  17  9   a1  6  1

Note that it’s the target table’s index that you find in the final merged table, though the order may have changed. merge_tables() has an optional columns= keyword that can contain column names from any of the tables going into the merge, so you can limit which columns end up in the final table. (Columns necessary for performing merges will be included whether or not they are in the columns= list.)

Note

merge_tables() calls pandas.merge with how='inner', meaning that only items that appear in both tables are kept in the merged table.
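As a rough illustration of what an inner merge means for a broadcast, the following sketch uses pandas.merge directly. The data is a cut-down, modified version of the ‘a’ and ‘b’ tables above (the 'a9' link is made up for the example), and the exact arguments Orca passes to pandas.merge are an assumption here.

```python
import pandas as pd

df_a = pd.DataFrame({'a': [0, 1]}, index=['a0', 'a1'])
df_b = pd.DataFrame(
    {'b': [2, 3, 4],
     'a_id': ['a0', 'a1', 'a9']},  # 'a9' has no match in df_a
    index=['b0', 'b1', 'b2'])

# Plain-pandas analogue of broadcasting 'a' onto 'b':
# the index of df_a is matched against the 'a_id' column of df_b.
merged = pd.merge(
    df_b, df_a,
    left_on='a_id', right_index=True,
    how='inner')
```

Row 'b2' is dropped from merged because its a_id value does not appear in the index of df_a. If you need to keep unmatched rows, you would have to do the merge yourself with a different how= setting.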

Columns

Often, not all the columns you need are preexisting on your tables. You may need to collect information from other tables or perform a calculation to generate a column. Orca allows you to register a Series or function as a column on a registered table. Use the add_column() function or the column() decorator:

s = pd.Series(['a', 'b', 'c'])
orca.add_column('my_table', 'my_col', s)

@orca.column('my_table')
def my_col_x2(my_table):
    df = my_table.to_frame(columns=['my_col'])
    return df['my_col'] * 2

In the my_col_x2 function we use the columns= keyword on to_frame() to get only the one column necessary for our calculation. This can be useful for avoiding unnecessary computation or to avoid recursion (as would happen in this case if we called to_frame() with no arguments).

Accessing columns on a table is such a common occurrence that there are additional ways to do so without first calling to_frame() to create an actual DataFrame.

DataFrameWrapper supports accessing individual columns in the same ways as DataFrames:

@orca.column('my_table')
def my_col_x2(my_table):
    return my_table['my_col'] * 2  # or my_table.my_col * 2

Or you can use an expression to have a single column injected into a function:

@orca.column('my_table')
def my_col_x2(data='my_table.my_col'):
    return data * 2

In this case, the label data, expressed as my_table.my_col, refers to the column my_col, which is a pandas Series within the table my_table.

A demonstration in IPython using the column definitions from above:

In [29]: wrapped = orca.get_table('my_table')

In [30]: wrapped.columns
Out[30]: ['a', 'my_col', 'my_col_x2']

In [31]: wrapped.local_columns
Out[31]: ['a']

In [32]: wrapped.to_frame()
Out[32]:
   a my_col_x2 my_col
0  1        aa      a
1  2        bb      b
2  3        cc      c

DataFrameWrapper has columns and local_columns attributes that, respectively, list all the columns on a table and only those columns that are part of the underlying DataFrame.

Columns are stored separately from tables, so it is safe to define a column on a table and then replace that table with something else. The column will remain associated with the table.
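The following sketch shows why separate storage makes table replacement safe. The tables and columns dictionaries and the to_frame helper are hypothetical and exist only for this illustration; Orca's real internals may differ.

```python
import pandas as pd

# Hypothetical registries; Orca's real internals may differ.
tables = {}
columns = {}  # keyed by (table name, column name)


def to_frame(table_name):
    """Assemble the local DataFrame plus any columns registered for it."""
    frame = tables[table_name].copy()
    for (tname, cname), series in columns.items():
        if tname == table_name:
            frame[cname] = series
    return frame


tables['my_table'] = pd.DataFrame({'a': [1, 2, 3]})
columns[('my_table', 'my_col')] = pd.Series(['a', 'b', 'c'])

before = to_frame('my_table')

# Replace the table; the column registration is untouched.
tables['my_table'] = pd.DataFrame({'z': [7, 8, 9]})
after = to_frame('my_table')
```

Because the column is looked up by table name at the moment the frame is assembled, it shows up in both before and after even though the underlying DataFrame changed.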

Injectables

You will probably want to have things besides tables injected into functions, for which Orca has “injectables”. You can register anything and have it injected into functions. Use the add_injectable() function or the injectable() decorator:

orca.add_injectable('z', 5)

@orca.injectable(autocall=False)
def pow(x, y):
    return x ** y

@orca.injectable()
def zsquared(z, pow):
    return pow(z, 2)

@orca.table()
def ztable(my_table, zsquared):
    df = my_table.to_frame(columns=['a'])
    return df * zsquared

By default injectable functions are evaluated before injection and the return value is passed into other functions. Use autocall=False to disable this behavior and instead inject the function itself. Like tables and columns, injectable functions that are automatically evaluated can have their results cached with cache=True.

Functions that are not automatically evaluated can also have their results cached using the memoize=True keyword along with autocall=False. A memoized injectable will cache results based on the function inputs, so this only works if the function inputs are hashable (usable as dictionary keys). Memoized functions can have their caches cleared manually using their clear_cached function attribute. The caches of memoized functions are also hooked into the global Orca caching system, so you can also manage their caches via the cache_scope keyword argument and the clear_cache() function.
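The sketch below illustrates memoization keyed on function inputs, and why those inputs need to be hashable. The memoize_sketch decorator is a hypothetical stand-in written for this example, not Orca's implementation; only the clear_cached attribute name is taken from the text above.

```python
import functools


def memoize_sketch(func):
    """Cache results keyed by the call arguments (which must be hashable)."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    # Mirrors the idea of a clear_cached function attribute.
    wrapper.clear_cached = cache.clear
    return wrapper


calls = []  # records every real evaluation of the function body


@memoize_sketch
def power(x, y):
    calls.append((x, y))
    return x ** y


power(5, 2)
power(5, 2)  # served from the cache; the body is not re-run
power(2, 3)
n_calls_before_clear = len(calls)
power.clear_cached()
power(5, 2)  # recomputed after clearing
```

Passing an unhashable argument such as a list would raise a TypeError when the cache key is looked up, which is the restriction mentioned above.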

An example of the above injectables in IPython:

In [38]: wrapped = orca.get_table('ztable')

In [39]: wrapped.to_frame()
Out[39]:
    a
0  25
1  50
2  75

Caching

Orca has a cache system so that function results can be stored for re-use when it is not necessary to recompute them every time they are used.

The decorators table(), column(), and injectable() all take two keyword arguments related to caching: cache and cache_scope.

By default results are not cached. Register functions with cache=True to enable caching of their results.

Cache Scope

Cached items have an associated “scope” that allows Orca to automatically manage how long functions have their results cached before re-evaluating them. The three scope settings are:

  • 'forever' (the default setting) - Results are cached until manually cleared by user commands.

  • 'iteration' - Results are cached for the remainder of the current pipeline iteration.

  • 'step' - Results are cached until the current pipeline step finishes.
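One way to picture the three scopes is a cache whose entries carry a scope label and are cleared in groups. The store and clear helpers below are hypothetical; the CacheItem field names follow the CacheItem(name, value, scope) entry in the API docs, but how Orca actually uses it is an assumption.

```python
from collections import namedtuple

# Field names follow CacheItem(name, value, scope) from the API docs.
CacheItem = namedtuple('CacheItem', ['name', 'value', 'scope'])

cache = {}


def store(name, value, scope='forever'):
    cache[name] = CacheItem(name, value, scope)


def clear(scope=None):
    """Remove items with the given scope, or everything when scope is None."""
    if scope is None:
        cache.clear()
        return
    for name in [n for n, item in cache.items() if item.scope == scope]:
        del cache[name]


store('zoning', 'expensive result A')  # default scope: 'forever'
store('prices', 'expensive result B', scope='iteration')
store('scratch', 'expensive result C', scope='step')

clear('step')  # what might happen when a step finishes
after_step = sorted(cache)
clear('iteration')  # what might happen when an iteration finishes
after_iteration = sorted(cache)
```

In this sketch only the 'forever' entry survives both clears, and it stays until clear() is called with no scope.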

Managing the Cache

We hope that users will be able to do most of their cache management via cache scopes, but there may be situations, especially during testing, when more manual management is required.

Caching can be turned off globally using the disable_cache() function (and turned back on by enable_cache()).

To run a block of commands with the cache disabled, but have it automatically re-enabled, use the cache_disabled() context manager:

with orca.cache_disabled():
    result = orca.eval_variable('my_table')

Finally, users can manually clear the cache using clear_cache().

Steps

A step is a function run by Orca with argument matching. Use the step() decorator to register a step function. Steps are generally important for their side effects; their return values are discarded during pipeline runs. For example, a step might replace a column in a table (a new table, though similar to my_table above):

df = pd.DataFrame({'a': [1, 2, 3]})
orca.add_table('new_table', df)

@orca.step()
def replace_col(new_table):
    new_table['a'] = [4, 5, 6]

Or update some values in a column:

@orca.step()
def update_col(new_table):
    s = pd.Series([99], index=[1])
    new_table.update_col_from_series('a', s)

Or add rows to a table:

@orca.step()
def add_rows(new_table):
    new_rows = pd.DataFrame({'a': [100, 101]}, index=[3, 4])
    df = new_table.to_frame()
    df = pd.concat([df, new_rows])
    orca.add_table('new_table', df)

The first two of the above examples update new_table’s underlying DataFrame and so require it to be a DataFrameWrapper. If your table is a wrapped function, not a DataFrame, you can update columns by replacing them entirely with a new Series using the add_column() function.
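For comparison, the three steps correspond roughly to the plain pandas operations below. Treating update_col_from_series() as a label-aligned assignment is an assumption made for this sketch, not a statement about how Orca implements it.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Replace a whole column (compare the replace_col step).
df['a'] = [4, 5, 6]

# Update only the rows named in the Series index (compare update_col).
s = pd.Series([99], index=[1])
df.loc[s.index, 'a'] = s

# Append rows (compare add_rows).
new_rows = pd.DataFrame({'a': [100, 101]}, index=[3, 4])
df = pd.concat([df, new_rows])
```

The first two operations modify an existing DataFrame in place, while the third builds a new DataFrame, which is why the add_rows step has to register its result again with add_table().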

A demonstration of running the above steps:

In [68]: orca.run(['replace_col', 'update_col', 'add_rows'])
Running step 'replace_col'
Running step 'update_col'
Running step 'add_rows'

In [69]: orca.get_table('new_table').to_frame()
Out[69]:
     a
0    4
1   99
2    6
3  100
4  101

In the context of a simulation, steps can be thought of as model steps that will often advance the simulation by updating data. Steps are plain Python functions, though, and there is no restriction on what they are allowed to do.

Running Pipelines

You start pipelines by calling the run() function and listing which steps you want to run. Calling run() with just a list of steps, as in the above example, will run through the steps once. To run the pipeline over a sequence of values, provide that sequence to run() using the iter_vars argument.

The iter_var injectable stores the current value from the iter_vars argument to the run() function. The iter_step injectable is a namedtuple with fields named step_num and step_name, stored in that order. step_num is a zero-based index based on the list of step names passed to the run() function.
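The relationship between iter_var and iter_step can be sketched with a nested loop. IterStep and run_sketch are hypothetical names for this illustration, and using None as the iteration value when no iter_vars are given is an assumption.

```python
from collections import namedtuple

# Hypothetical reconstruction of the iter_step structure described above.
IterStep = namedtuple('IterStep', ['step_num', 'step_name'])


def run_sketch(steps, iter_vars=None):
    """Yield (iter_var, iter_step) pairs in the order steps would run."""
    # A single pass with iter_var=None stands in for a run without iter_vars.
    for iter_var in (iter_vars if iter_vars is not None else [None]):
        for step_num, step_name in enumerate(steps):
            yield iter_var, IterStep(step_num, step_name)


log = list(run_sketch(['step_a', 'step_b'], iter_vars=range(2010, 2012)))
```

Here log holds four entries: both steps for 2010 followed by both steps for 2011, with step_num restarting at zero in every iteration.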

In [77]: @orca.step()
   ....: def print_year(iter_var, iter_step):
   ....:     print('*** the iteration value is {} ***'.format(iter_var))
   ....:     print('*** step number {0} is named {1} ***'.format(
   ....:         iter_step.step_num, iter_step.step_name))
   ....:

In [78]: orca.run(['print_year'], iter_vars=range(2010, 2015))
Running iteration 1 with iteration value 2010
Running step 'print_year'
*** the iteration value is 2010 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 1 with iteration value 2010: 0.00 s
Running iteration 2 with iteration value 2011
Running step 'print_year'
*** the iteration value is 2011 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 2 with iteration value 2011: 0.00 s
Running iteration 3 with iteration value 2012
Running step 'print_year'
*** the iteration value is 2012 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 3 with iteration value 2012: 0.00 s
Running iteration 4 with iteration value 2013
Running step 'print_year'
*** the iteration value is 2013 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 4 with iteration value 2013: 0.00 s
Running iteration 5 with iteration value 2014
Running step 'print_year'
*** the iteration value is 2014 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 5 with iteration value 2014: 0.00 s

Running Orca Components a la Carte

It can be useful to have Orca evaluate single variables and steps, especially during development and testing. To achieve this, use the eval_variable() and eval_step() functions.

eval_variable() takes the name of a variable (including variable expressions) and returns that variable as it would be injected into a function by Orca. eval_step() takes the name of a step, runs that step with variable injection, and returns any result.

Note

Most steps don’t have return values because Orca ignores them, but they can be useful for testing.

Both eval_variable() and eval_step() take arbitrary keyword arguments that are temporarily turned into injectables within Orca while the evaluation is taking place. When the evaluation is complete Orca’s state is reset to whatever it was before calling the eval function.
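The set-then-restore behavior can be sketched with a small context manager. The registry dictionary and both helper functions are hypothetical and only illustrate the pattern; they are not Orca's code.

```python
from contextlib import contextmanager

# Hypothetical registry standing in for Orca's injectable store.
registry = {'x': 10}


@contextmanager
def temporary_injectables(**kwargs):
    """Add kwargs to the registry, then restore the original state."""
    original = dict(registry)
    registry.update(kwargs)
    try:
        yield
    finally:
        registry.clear()
        registry.update(original)


def eval_sketch(func_args, **kwargs):
    """Look up the named arguments while the temporary values are set."""
    with temporary_injectables(**kwargs):
        return tuple(registry[name] for name in func_args)


result = eval_sketch(['x', 'y'], x=1, y=2)
```

Once eval_sketch returns, the registry holds only its original contents, so any lookup of y that happens later will fail. A lazily evaluated table runs into exactly that situation.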

An example of eval_variable():

In [15]: @orca.injectable()
   ....: def func(x, y):
   ....:     return x + y
   ....:

In [16]: orca.eval_variable('func', x=1, y=2)
Out[16]: 3

The keyword arguments are only temporarily set as injectables, which can lead to errors in a situation like this with a table where the evaluation of the table is delayed until to_frame() is called:

In [12]: @orca.table()
   ....: def table(x, y):
   ....:     return pd.DataFrame({'a': [x], 'b': [y]})
   ....:

In [13]: orca.eval_variable('table', x=1, y=2)
Out[13]: <orca.TableFuncWrapper at 0x100733850>

In [14]: orca.eval_variable('table', x=1, y=2).to_frame()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-5bf660fb07b7> in <module>()
----> 1 orca.eval_variable('table', x=1, y=2).to_frame()

<truncated>

KeyError: 'y'

To keep the injectables set for a controlled period, use the injectables() context manager:

In [12]: @orca.table()
   ....: def table(x, y):
   ....:     return pd.DataFrame({'a': [x], 'b': [y]})
   ....:

In [20]: with orca.injectables(x=1, y=2):
   ....:     df = orca.eval_variable('table').to_frame()
   ....:

In [21]: df
Out[21]:
   a  b
0  1  2

Archiving Data

An option to the run() function is to have it save table data at set intervals. Tables (and only tables) are saved as DataFrames to an HDF5 file via pandas’ HDFStore feature. If Orca is running only one loop the tables are stored under their registered names. If it is running multiple iterations the tables are stored under names like '<iter_var>/<table name>'. For example, if iter_var is 2020 the “buildings” table would be stored as '2020/buildings'. The out_interval keyword to run() controls how often the tables are saved out. For example, out_interval=5 saves tables every fifth iteration. In addition, the final data is always saved under the key 'final/<table name>'.
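The naming pattern for stored tables can be written down as a small helper. The store_key function is hypothetical and only reproduces the key format described above; it does not touch an HDF5 file.

```python
def store_key(table_name, iter_var=None):
    """Build a storage key following the naming pattern described above."""
    if iter_var is None:
        # Single run: the table is stored under its registered name.
        return table_name
    return '{}/{}'.format(iter_var, table_name)


single_run_key = store_key('buildings')
iteration_key = store_key('buildings', 2020)
final_key = store_key('buildings', 'final')
```

With these keys, a saved table could later be read back with pandas, for example through the HDFStore interface, using the same string.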

Argument Matching

A key feature of Orca is that it matches the names of function arguments to the names of registered variables in order to inject variables when evaluating functions. For that reason, it’s important that variables be registered with names that are also valid Python identifiers.
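Name-based injection of this kind can be sketched with the standard library's inspect module. The registry dictionary and call_with_matching function are hypothetical; Orca's own matching logic is more involved (it also handles tables, columns, and caching).

```python
import inspect

# Hypothetical registry of named variables; not Orca's internal storage.
registry = {'z': 5, 'scale': 2}


def call_with_matching(func):
    """Call func, filling each argument from the registry by name."""
    names = inspect.signature(func).parameters
    missing = [n for n in names if n not in registry]
    if missing:
        raise KeyError('no registered variable named: ' + ', '.join(missing))
    return func(**{n: registry[n] for n in names})


def scaled_z(z, scale):
    return z * scale


result = call_with_matching(scaled_z)
```

Because the lookup goes through argument names, a variable registered under a name like 'my-var' could never be requested by a function, which is the reason for the naming rule above.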

Variable Expressions

Argument matching is extended by a feature we call “variable expressions”. Expressions allow you to specify a variable to inject with Python keyword arguments. Here’s an example redone from above using variable expressions:

@orca.table()
def halve_my_table(data='my_table'):
    df = data.to_frame()
    return df / 2

The variable registered as 'my_table' is injected into this function as the argument data.

Expressions can also be used to refer to columns within a registered table:

@orca.column('my_table')
def halved(data='my_table.a'):
    return data / 2

In this case, the expression my_table.a refers to the column a, which is a pandas Series within the table my_table. We return a new Series to register a new column on my_table using the column() decorator. We can take a look in IPython:

In [21]: orca.get_table('my_table').to_frame()
Out[21]:
   a  halved
0  1     0.5
1  2     1.0
2  3     1.5

Expressions referring to columns may be useful in situations where a function requires only a single column from a table and the user would like to specifically document that in the function’s arguments.
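A variable expression can be thought of as a string that is resolved before the function is called. The sketch below is a hypothetical illustration in plain pandas: the tables dictionary, resolve, and call_with_expressions are invented names, and the real resolution rules in Orca may differ.

```python
import inspect

import pandas as pd

# Hypothetical table registry for this sketch.
tables = {'my_table': pd.DataFrame({'a': [1, 2, 3]})}


def resolve(expression):
    """Resolve 'table' to a DataFrame or 'table.column' to a Series."""
    table_name, _, column_name = expression.partition('.')
    table = tables[table_name]
    return table[column_name] if column_name else table


def call_with_expressions(func):
    """Use each argument's default string as the expression, else its name."""
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        has_default = param.default is not inspect.Parameter.empty
        kwargs[name] = resolve(param.default if has_default else name)
    return func(**kwargs)


def halved(data='my_table.a'):
    return data / 2


result = call_with_expressions(halved)
```

The function body only ever sees a Series called data, while the signature documents exactly which table and column it depends on.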

API

Table API

add_table(table_name, table[, cache, …])

Register a table with Orca.

table([table_name, cache, cache_scope, copy_col])

Decorates functions that return DataFrames.

get_table(table_name)

Get a registered table.

list_tables()

List of table names.

DataFrameWrapper(name, frame[, copy_col])

Wraps a DataFrame so it can provide certain columns and handle computed columns.

TableFuncWrapper(name, func[, cache, …])

Wrap a function that provides a DataFrame.

Column API

add_column(table_name, column_name, column)

Add a new column to a table from a Series or callable.

column(table_name[, column_name, cache, …])

Decorates functions that return a Series.

list_columns()

List of (table name, registered column name) pairs.

Injectable API

add_injectable(name, value[, autocall, …])

Add a value that will be injected into other functions.

injectable([name, autocall, cache, …])

Decorates functions that will be injected into other functions.

get_injectable(name)

Get an injectable by name.

list_injectables()

List of registered injectables.

Merge API

broadcast(cast, onto[, cast_on, onto_on, …])

Register a rule for merging two tables by broadcasting one onto the other.

list_broadcasts()

List of registered broadcasts as (cast table name, onto table name).

merge_tables(target, tables[, columns, …])

Merge a number of tables onto a target table.

Step API

add_step(step_name, func)

Add a step function to Orca.

step([step_name])

Decorates functions that will be called by the run function.

get_step(step_name)

Get a wrapped step by name.

list_steps()

List of registered step names.

run(steps[, iter_vars, data_out, …])

Run steps in series, optionally repeatedly over some sequence.

Cache API

clear_cache([scope])

Clear all cached data.

disable_cache()

Turn off caching across Orca, even for registered variables that have caching enabled.

enable_cache()

Allow caching of registered variables that explicitly have caching enabled.

cache_on()

Whether caching is currently enabled or disabled.

cache_disabled()

Context manager that temporarily disables caching.

API Docs

class orca.orca.Broadcast(cast, onto, cast_on, onto_on, cast_index, onto_index)
cast

Alias for field number 0

cast_index

Alias for field number 4

cast_on

Alias for field number 2

onto

Alias for field number 1

onto_index

Alias for field number 5

onto_on

Alias for field number 3

class orca.orca.CacheItem(name, value, scope)
name

Alias for field number 0

scope

Alias for field number 2

value

Alias for field number 1

class orca.orca.DataFrameWrapper(name, frame, copy_col=True)

Wraps a DataFrame so it can provide certain columns and handle computed columns.

Parameters
name : str

Name for the table.

frame : pandas.DataFrame
copy_col : bool, optional

Whether to return copies when evaluating columns.

Attributes
name : str

Table name.

copy_col : bool

Whether to return copies when evaluating columns.

local : pandas.DataFrame

The wrapped DataFrame.

clear_cached()

Remove cached results from this table’s computed columns.

column_type(column_name)

Report column type as one of ‘local’, ‘series’, or ‘function’.

Parameters
column_name : str
Returns
col_type : {‘local’, ‘series’, ‘function’}

‘local’ means that the column is part of the registered table, ‘series’ means the column is a registered Pandas Series, and ‘function’ means the column is a registered function providing a Pandas Series.

property columns

Columns in this table.

get_column(column_name)

Returns a column as a Series.

Parameters
column_name : str
Returns
column : pandas.Series

property index

Table index.

property local_columns

Columns that are part of the wrapped DataFrame.

to_frame(columns=None)

Make a DataFrame with the given columns.

Will always return a copy of the underlying table.

Parameters
columns : sequence or string, optional

Sequence of the column names desired in the DataFrame. A string can also be passed if only one column is desired. If None all columns are returned, including registered columns.

Returns
frame : pandas.DataFrame

update_col(column_name, series)

Add or replace a column in the underlying DataFrame.

Parameters
column_name : str

Column to add or replace.

series : pandas.Series or sequence

Column data.

update_col_from_series(column_name, series, cast=False)

Update existing values in a column from another series. Index values must match in both column and series. Optionally casts data type to match the existing column.

Parameters
column_name : str
series : pandas.Series
cast : bool, optional, default False

exception orca.orca.OrcaError

class orca.orca.TableFuncWrapper(name, func, cache=False, cache_scope='forever', copy_col=True)

Wrap a function that provides a DataFrame.

Parameters
name : str

Name for the table.

func : callable

Callable that returns a DataFrame.

cache : bool, optional

Whether to cache the results of calling the wrapped function.

cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional

Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.

copy_col : bool, optional

Whether to return copies when evaluating columns.

Attributes
name : str

Table name.

cache : bool

Whether caching is enabled for this table.

copy_col : bool

Whether to return copies when evaluating columns.

clear_cached()

Remove this table’s cached result and that of associated columns.

column_type(column_name)

Report column type as one of ‘local’, ‘series’, or ‘function’.

Parameters
column_name : str
Returns
col_type : {‘local’, ‘series’, ‘function’}

‘local’ means that the column is part of the registered table, ‘series’ means the column is a registered Pandas Series, and ‘function’ means the column is a registered function providing a Pandas Series.

property columns

Columns in this table. (May contain only computed columns if the wrapped function has not been called yet.)

func_source_data()

Return data about the wrapped function source, including file name, line number, and source code.

Returns
filename : str
lineno : int

The line number on which the function starts.

source : str

get_column(column_name)

Returns a column as a Series.

Parameters
column_name : str
Returns
column : pandas.Series

property index

Index of the underlying table. Will be None if that index is unknown.

property local_columns

Only the columns contained in the DataFrame returned by the wrapped function. (No registered columns included.)

to_frame(columns=None)

Make a DataFrame with the given columns.

Will always return a copy of the underlying table.

Parameters
columns : sequence, optional

Sequence of the column names desired in the DataFrame. If None all columns are returned.

Returns
frame : pandas.DataFrame

orca.orca.add_column(table_name, column_name, column, cache=False, cache_scope='forever')

Add a new column to a table from a Series or callable.

Parameters
table_name : str

Table with which the column will be associated.

column_name : str

Name for the column.

column : pandas.Series or callable

Series should have an index matching the table to which it is being added. If a callable, the function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The function should return a Series.

cache : bool, optional

Whether to cache the results of a provided callable. Does not apply if column is a Series.

cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional

Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.

orca.orca.add_injectable(name, value, autocall=True, cache=False, cache_scope='forever', memoize=False)

Add a value that will be injected into other functions.

Parameters
name : str
value

If a callable and autocall is True then the function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The return value will be passed to any functions using this injectable. In all other cases, value will be passed through untouched.

autocall : bool, optional

Set to True to have injectable functions automatically called (with argument matching) and the result injected instead of the function itself.

cache : bool, optional

Whether to cache the return value of an injectable function. Only applies when value is a callable and autocall is True.

cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional

Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.

memoize : bool, optional

If autocall is False it is still possible to cache function results by setting this flag to True. Cached values are stored in a dictionary keyed by argument values, so the argument values must be hashable. Memoized functions have their caches cleared according to the same rules as universal caching.

orca.orca.add_step(step_name, func)

Add a step function to Orca.

The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.

Parameters
step_name : str
func : callable

orca.orca.add_table(table_name, table, cache=False, cache_scope='forever', copy_col=True)

Register a table with Orca.

Parameters
table_name : str

Should be globally unique to this table.

table : pandas.DataFrame or function

If a function, the function should return a DataFrame. The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca.

cache : bool, optional

Whether to cache the results of a provided callable. Does not apply if table is a DataFrame.

cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional

Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.

copy_col : bool, optional

Whether to return copies when evaluating columns.

Returns
wrapped : DataFrameWrapper or TableFuncWrapper

orca.orca.broadcast(cast, onto, cast_on=None, onto_on=None, cast_index=False, onto_index=False)

Register a rule for merging two tables by broadcasting one onto the other.

Parameters
cast, onto : str

Names of registered tables.

cast_on, onto_on : str, optional

Column names used for merge, equivalent of left_on/right_on parameters of pandas.merge.

cast_index, onto_index : bool, optional

Whether to use table indexes for merge. Equivalent of left_index/right_index parameters of pandas.merge.

orca.orca.cache_on()

Whether caching is currently enabled or disabled.

Returns
on : bool

True if caching is enabled.

orca.orca.clear_all()

Clear any and all stored state from Orca.

orca.orca.clear_cache(scope=None)

Clear all cached data.

Parameters
scope : {None, ‘step’, ‘iteration’, ‘forever’}, optional

Clear cached values with a given scope. By default all cached values are removed.

orca.orca.column(table_name, column_name=None, cache=False, cache_scope='forever')

Decorates functions that return a Series.

Decorator version of add_column. Series index must match the named table. Column name defaults to name of function.

The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected. The index of the returned Series must match the named table.

orca.orca.column_map(tables, columns)

Take a list of tables and a list of column names and resolve which columns come from which table.

Parameters
tables : sequence of _DataFrameWrapper or _TableFuncWrapper

Could also be a sequence of modified pandas.DataFrames; the important thing is that they have .name and .columns attributes.

columns : sequence of str

The column names of interest.

Returns
col_map : dict

Maps table names to lists of column names.

orca.orca.disable_cache()

Turn off caching across Orca, even for registered variables that have caching enabled.

orca.orca.enable_cache()

Allow caching of registered variables that explicitly have caching enabled.

orca.orca.eval_step(name, **kwargs)

Evaluate a step as would be done within the pipeline environment and return the result. Any keyword arguments are temporarily set as injectables.

Parameters
name : str

Name of step to run.

Returns
object

Anything returned by a step. (Though note that in Orca runs return values from steps are ignored.)

orca.orca.eval_variable(name, **kwargs)

Execute a single variable function registered with Orca and return the result. Any keyword arguments are temporarily set as injectables. This gives the value as would be injected into a function.

Parameters
name : str

Name of variable to evaluate. Use variable expressions to specify columns.

Returns
object

For injectables and columns this directly returns whatever object is returned by the registered function. For tables this returns a DataFrameWrapper as if the table had been injected into a function.

orca.orca.get_broadcast(cast_name, onto_name)

Get a single broadcast.

Broadcasts are stored data about how to do a Pandas join. A Broadcast object is a namedtuple with these attributes:

  • cast: the name of the table being broadcast

  • onto: the name of the table onto which “cast” is broadcast

  • cast_on: The optional name of a column on which to join. None if the table index will be used instead.

  • onto_on: The optional name of a column on which to join. None if the table index will be used instead.

  • cast_index: True if the table index should be used for the join.

  • onto_index: True if the table index should be used for the join.

Parameters
cast_name: str

The name of the table being broadcast.

onto_name: str

The name of the table onto which cast_name is broadcast.

Returns
broadcast: Broadcast
orca.orca.get_injectable(name)

Get an injectable by name. Does not evaluate wrapped functions.

Parameters
name: str
Returns
injectable

Original value or evaluated value of an _InjectableFuncWrapper.

orca.orca.get_injectable_func_source_data(name)

Return data about an injectable function’s source, including file name, line number, and source code.

Parameters
name: str
Returns
filename: str
lineno: int

The line number on which the function starts.

source: str
orca.orca.get_raw_column(table_name, column_name)

Get a wrapped, registered column.

This function cannot return columns that are part of wrapped DataFrames; it is only for columns registered directly through Orca.

Parameters
table_name: str
column_name: str
Returns
wrapped: _SeriesWrapper or _ColumnFuncWrapper
orca.orca.get_raw_injectable(name)

Return a raw, possibly wrapped injectable.

Parameters
name: str
Returns
inj: _InjectableFuncWrapper or object
orca.orca.get_raw_table(table_name)

Get a wrapped table by name and don’t do anything to it.

Parameters
table_name: str
Returns
table: DataFrameWrapper or TableFuncWrapper
orca.orca.get_step(step_name)

Get a wrapped step by name.

orca.orca.get_step_table_names(steps)

Returns a list of table names injected into the provided steps.

Parameters
steps: list of str

Steps to gather table inputs from.

Returns
list of str
orca.orca.get_table(table_name)

Get a registered table.

Decorated functions will be converted to DataFrameWrapper.

Parameters
table_name: str
Returns
table: DataFrameWrapper
orca.orca.injectable(name=None, autocall=True, cache=False, cache_scope='forever', memoize=False)

Decorates functions that will be injected into other functions.

Decorator version of add_injectable. Name defaults to name of function.

The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.

orca.orca.injectable_type(name)

Classify an injectable as either ‘variable’ or ‘function’.

Parameters
name: str
Returns
inj_type: {‘variable’, ‘function’}

If the injectable is an automatically called function or any other type of callable, the type will be ‘function’; all other injectables will have type ‘variable’.

orca.orca.injectables(**kwargs)

Temporarily add injectables to the pipeline environment. Takes only keyword arguments.

Injectables will be returned to their original state when the context manager exits.

orca.orca.is_broadcast(cast_name, onto_name)

Checks whether a relationship exists for broadcast cast_name onto onto_name.

orca.orca.is_expression(name)

Checks whether a given name is a simple variable name or a compound variable expression.

Parameters
name: str
Returns
is_expr: bool
orca.orca.is_injectable(name)

Checks whether a given name can be mapped to an injectable.

orca.orca.is_step(step_name)

Check whether a given name refers to a registered step.

orca.orca.is_table(name)

Returns whether a given name refers to a registered table.

class orca.orca.iter_step(step_num, step_name)
step_name

Alias for field number 1

step_num

Alias for field number 0

orca.orca.list_broadcasts()

List of registered broadcasts as (cast table name, onto table name).

orca.orca.list_columns()

List of (table name, registered column name) pairs.

orca.orca.list_columns_for_table(table_name)

Return a list of all the extra columns registered for a given table.

Parameters
table_name: str
Returns
columns: list of str
orca.orca.list_injectables()

List of registered injectables.

orca.orca.list_steps()

List of registered step names.

orca.orca.list_tables()

List of table names.

orca.orca.merge_tables(target, tables, columns=None, drop_intersection=True)

Merge a number of tables onto a target table. Tables must have registered merge rules via the broadcast function.

Parameters
target: str, DataFrameWrapper, or TableFuncWrapper

Name of the table (or wrapped table) onto which tables will be merged.

tables: list of DataFrameWrapper, TableFuncWrapper, or str

All of the tables to merge. Should include the target table.

columns: list of str, optional

If given, columns will be mapped to tables and only those columns will be requested from each table. The final merged table will have only these columns. By default all columns are used from every table.

drop_intersection: bool

If True, keep the leftmost occurrence of any column name that occurs on more than one table. This prevents getting back the same column with suffixes applied by pd.merge. If False, column names will be suffixed with the table names, e.g. zone_id_buildings and zone_id_parcels.

Returns
merged: pandas.DataFrame
orca.orca.run(steps, iter_vars=None, data_out=None, out_interval=1, out_base_tables=None, out_run_tables=None, compress=False, out_base_local=True, out_run_local=True)

Run steps in series, optionally repeatedly over some sequence. The current iteration variable is set as a global injectable called iter_var.

Parameters
steps: list of str

List of steps to run identified by their name.

iter_vars: iterable, optional

The values of iter_vars will be made available as an injectable called iter_var when repeatedly running steps.

data_out: str, optional

An optional filename to which all tables injected into any step in steps will be saved every out_interval iterations. File will be a pandas HDF data store.

out_interval: int, optional

Iteration interval on which to save data to data_out. For example, 2 will save out every 2 iterations, 5 every 5 iterations. Default is every iteration. The results of the first and last iterations are always included. The input (base) tables are also included and prefixed with base/; these represent the state of the system before any steps have been executed. The interval is defined relative to the first iteration. For example, a run beginning in 2015 with an out_interval of 2 will write out results for 2015, 2017, etc.

out_base_tables: list of str, optional, default None

List of base tables to write. If not provided, tables injected into ‘steps’ will be written.

out_run_tables: list of str, optional, default None

List of run tables to write. If not provided, tables injected into ‘steps’ will be written.

compress: boolean, optional, default False

Whether to compress output file using standard HDF5 zlib compression. Compression yields much smaller files using slightly more CPU.

out_base_local: boolean, optional, default True

For tables in out_base_tables, whether to store only local columns (True) or both local and computed columns (False).

out_run_local: boolean, optional, default True

For tables in out_run_tables, whether to store only local columns (True) or both local and computed columns (False).

orca.orca.step(step_name=None)

Decorates functions that will be called by the run function.

Decorator version of add_step. Step name defaults to name of function.

The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.

orca.orca.table(table_name=None, cache=False, cache_scope='forever', copy_col=True)

Decorates functions that return DataFrames.

Decorator version of add_table. Table name defaults to name of function.

The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.

orca.orca.table_type(table_name)

Returns the type of a registered table.

The type can be either “dataframe” or “function”.

Parameters
table_name: str
Returns
table_type: {‘dataframe’, ‘function’}
orca.orca.temporary_tables(**kwargs)

Temporarily set DataFrames as registered tables.

Tables will be returned to their original state when the context manager exits. Caching is not enabled for tables registered via this function.

orca.orca.write_tables(fname, table_names=None, prefix=None, compress=False, local=False)

Writes tables to a pandas.HDFStore file.

Parameters
fname: str

File name for HDFStore. Will be opened in append mode and closed at the end of this function.

table_names: list of str, optional, default None

List of tables to write. If None, all registered tables will be written.

prefix: str

If not None, used to prefix the output table names so that multiple iterations can go in the same file.

compress: boolean

Whether to compress output file using standard HDF5-readable zlib compression, default False.