Orca Core
Note
In the documentation below the following imports are implied:
import orca
import pandas as pd
Tables
Tables are Pandas DataFrames.
Use the add_table() function to register a DataFrame under a given name:
df = pd.DataFrame({'a': [1, 2, 3]})
orca.add_table('my_table', df)
Or you can use the table() decorator to register a function that returns a DataFrame:
@orca.table('halve_my_table')
def halve_my_table(my_table):
    df = my_table.to_frame()
    return df / 2
The decorator argument, which specifies the name to register the table with, is optional. If left out, the table is registered under the name of the function that is being decorated. The decorator example above could be written more concisely:
@orca.table()
def halve_my_table(my_table):
    df = my_table.to_frame()
    return df / 2
Note that the decorator parentheses are still required.
By registering halve_my_table as a function, its values will always be half those in my_table, even if my_table is later changed. If you’d like a function not to be evaluated every time it is used, pass the cache=True keyword when registering it.
Here’s a demo of the above table definitions shown in IPython:
In [19]: wrapped = orca.get_table('halve_my_table')
In [20]: wrapped.to_frame()
Out[20]:
a
0 0.5
1 1.0
2 1.5
Table Wrappers
Notice in the table function above that we had to call the to_frame() method before using the table in a math operation. The values injected into functions are not DataFrames, but specialized wrappers. The wrappers facilitate caching, computed columns, and lazy evaluation of table functions. Learn more in the API documentation for DataFrameWrapper and TableFuncWrapper.
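To illustrate why a wrapper is injected instead of a DataFrame, here is a minimal sketch of lazy evaluation with optional caching, using only the standard library. `LazyTable` is a hypothetical name and not an Orca class, and plain lists stand in for DataFrames; the real wrappers do considerably more.

```python
class LazyTable:
    """Hypothetical stand-in for a table wrapper; not part of Orca."""

    def __init__(self, name, func, cache=False):
        self.name = name
        self._func = func
        self._cache = cache
        self._cached = None

    def to_frame(self):
        # The wrapped function runs only when data is actually requested.
        if self._cache and self._cached is not None:
            return list(self._cached)
        result = self._func()
        if self._cache:
            self._cached = result
        return list(result)


source = [1, 2, 3]
halved = LazyTable('halved', lambda: [x / 2 for x in source])
cached = LazyTable('cached', lambda: [x / 2 for x in source], cache=True)

first = halved.to_frame()
cached.to_frame()            # evaluates once and stores the result
source[0] = 10               # change the upstream data
second = halved.to_frame()   # re-evaluated, so it reflects the change
stale = cached.to_frame()    # served from the cache, so it does not
```

The uncached wrapper tracks its source on every call, which matches the behavior described for function-backed tables; the cached one trades freshness for speed.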
Automated Merges
Certain analyses are easiest when some tables are merged together, but in other places it may be best to keep the tables separate. Orca can make these on-demand merges easy by letting you define table relationships up front and then performing the merges for you as needed. We call these relationships “broadcasts” (as in a rule for how to broadcast one table onto another), and you register them using the broadcast() function.
For an example we’ll first define some DataFrames that contain links to one another and register them with Orca:
df_a = pd.DataFrame(
    {'a': [0, 1]},
    index=['a0', 'a1'])
df_b = pd.DataFrame(
    {'b': [2, 3, 4, 5, 6],
     'a_id': ['a0', 'a1', 'a1', 'a0', 'a1']},
    index=['b0', 'b1', 'b2', 'b3', 'b4'])
df_c = pd.DataFrame(
    {'c': [7, 8, 9]},
    index=['c0', 'c1', 'c2'])
df_d = pd.DataFrame(
    {'d': [10, 11, 12, 13, 15, 16, 16, 17, 18, 19],
     'b_id': ['b2', 'b0', 'b3', 'b3', 'b1', 'b4', 'b1', 'b4', 'b3', 'b3'],
     'c_id': ['c0', 'c1', 'c1', 'c0', 'c0', 'c2', 'c1', 'c2', 'c1', 'c2']},
    index=['d0', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9'])
orca.add_table('a', df_a)
orca.add_table('b', df_b)
orca.add_table('c', df_c)
orca.add_table('d', df_d)
The tables have data so that ‘a’ can be broadcast onto ‘b’, and ‘b’ and ‘c’ can be broadcast onto ‘d’. We use the broadcast() function to register those relationships:
orca.broadcast(cast='a', onto='b', cast_index=True, onto_on='a_id')
orca.broadcast(cast='b', onto='d', cast_index=True, onto_on='b_id')
orca.broadcast(cast='c', onto='d', cast_index=True, onto_on='c_id')
The syntax is similar to that of the pandas merge function, and indeed merge is used behind the scenes. Once the broadcasts are defined, use the merge_tables() function to get a merged DataFrame. Some examples in IPython:
In [4]: orca.merge_tables(target='b', tables=['a', 'b'])
Out[4]:
a_id b a
b0 a0 2 0
b3 a0 5 0
b1 a1 3 1
b2 a1 4 1
b4 a1 6 1
In [5]: orca.merge_tables(target='d', tables=['a', 'b', 'c', 'd'])
Out[5]:
b_id c_id d c a_id b a
d0 b2 c0 10 7 a1 4 1
d3 b3 c0 13 7 a0 5 0
d2 b3 c1 12 8 a0 5 0
d8 b3 c1 18 8 a0 5 0
d9 b3 c2 19 9 a0 5 0
d4 b1 c0 15 7 a1 3 1
d6 b1 c1 16 8 a1 3 1
d1 b0 c1 11 8 a0 2 0
d5 b4 c2 16 9 a1 6 1
d7 b4 c2 17 9 a1 6 1
Note that it’s the target table’s index that you find in the final merged
table, though the order may have changed.
merge_tables() has an optional columns= keyword that can contain column names from any of the tables going into the merge, so you can limit which columns end up in the final table. (Columns necessary for performing merges are included whether or not they appear in the columns= list.)
Note
merge_tables() calls pandas.merge with how='inner', meaning that only items that appear in both tables are kept in the merged table.
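The consequence of an inner merge can be sketched without pandas. The example below is only an illustration of the idea: plain dicts stand in for the registered tables, `inner_broadcast` is a hypothetical helper, and the row 'b5' is invented to show what happens when a key has no match.

```python
# Plain dicts stand in for tables 'a' and 'b'; keys are index labels.
table_a = {'a0': {'a': 0}, 'a1': {'a': 1}}
table_b = {
    'b0': {'b': 2, 'a_id': 'a0'},
    'b1': {'b': 3, 'a_id': 'a1'},
    'b5': {'b': 7, 'a_id': 'a9'},  # hypothetical row with no match in 'a'
}


def inner_broadcast(cast, onto, onto_on):
    """Copy the columns of `cast` onto matching rows of `onto`."""
    merged = {}
    for idx, row in onto.items():
        key = row[onto_on]
        if key in cast:  # inner join: rows without a match are dropped
            merged[idx] = {**row, **cast[key]}
    return merged


merged = inner_broadcast(table_a, table_b, onto_on='a_id')
```

The result keeps the index of the target table ('b0', 'b1') and silently drops 'b5', so it is worth checking row counts after a merge if unmatched keys are possible in your data.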
Columns
Often, not all the columns you need are preexisting on your tables.
You may need to collect information from other tables
or perform a calculation to generate a column. Orca allows you to
register a Series or function as a column on a registered table.
Use the add_column() function or the column() decorator:
s = pd.Series(['a', 'b', 'c'])
orca.add_column('my_table', 'my_col', s)

@orca.column('my_table')
def my_col_x2(my_table):
    df = my_table.to_frame(columns=['my_col'])
    return df['my_col'] * 2
In the my_col_x2 function we use the columns= keyword on to_frame() to get only the one column necessary for our calculation. This can be useful for avoiding unnecessary computation or recursion (as would happen in this case if we called to_frame() with no arguments).
Accessing columns on a table is such a common occurrence that there are additional ways to do so without first calling to_frame() to create an actual DataFrame. DataFrameWrapper supports accessing individual columns in the same ways as DataFrames:
@orca.column('my_table')
def my_col_x2(my_table):
    return my_table['my_col'] * 2  # or my_table.my_col * 2
Or you can use an expression to have a single column injected into a function:
@orca.column('my_table')
def my_col_x2(data='my_table.my_col'):
    return data * 2
In this case, the label data, expressed as my_table.my_col, refers to the column my_col, which is a pandas Series within the table my_table.
A demonstration in IPython using the column definitions from above:
In [29]: wrapped = orca.get_table('my_table')
In [30]: wrapped.columns
Out[30]: ['a', 'my_col', 'my_col_x2']
In [31]: wrapped.local_columns
Out[31]: ['a']
In [32]: wrapped.to_frame()
Out[32]:
a my_col_x2 my_col
0 1 aa a
1 2 bb b
2 3 cc c
DataFrameWrapper has columns and local_columns attributes that, respectively, list all the columns on a table and only those columns that are part of the underlying DataFrame.
Columns are stored separately from tables, so it is safe to define a column on a table and then replace that table with something else. The column will remain associated with the table.
Injectables
You will probably want to have things besides tables injected into functions, for which Orca has “injectables”. You can register anything and have it injected into functions. Use the add_injectable() function or the injectable() decorator:
orca.add_injectable('z', 5)

@orca.injectable(autocall=False)
def pow(x, y):
    return x ** y

@orca.injectable()
def zsquared(z, pow):
    return pow(z, 2)

@orca.table()
def ztable(my_table, zsquared):
    df = my_table.to_frame(columns=['a'])
    return df * zsquared
By default, injectable functions are evaluated before injection and the return value is passed into other functions. Use autocall=False to disable this behavior and instead inject the function itself.
Like tables and columns, injectable functions that are automatically evaluated can have their results cached with cache=True.
Functions that are not automatically evaluated can also have their results cached using the memoize=True keyword along with autocall=False. A memoized injectable caches results based on the function inputs, so this only works if the function inputs are hashable (usable as dictionary keys). Memoized functions can have their caches cleared manually using their clear_cached function attribute. The caches of memoized functions are also hooked into the global Orca caching system, so you can also manage their caches via the cache_scope keyword argument and the clear_cache() function.
An example of the above injectables in IPython:
In [38]: wrapped = orca.get_table('ztable')
In [39]: wrapped.to_frame()
Out[39]:
a
0 25
1 50
2 75
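The memoization behavior described for memoize=True can be pictured with a small decorator. This is a sketch under stated assumptions, not Orca's implementation: the `memoize` decorator, the `power` function, and the exposed `cache` attribute are hypothetical, and only the `clear_cached` attribute name is taken from the text above.

```python
import functools


def memoize(func):
    """Cache results keyed by the (hashable) positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # The argument tuple is the dictionary key, so every argument
        # must be hashable; a list argument raises TypeError here.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    wrapper.clear_cached = cache.clear  # mirrors the attribute named above
    wrapper.cache = cache               # exposed only for inspection
    return wrapper


@memoize
def power(x, y):
    return x ** y


power(5, 2)
power(5, 2)                  # second call is served from the cache
cached_keys = list(power.cache)
power.clear_cached()
```

Because the cache is keyed on argument values, passing an unhashable argument such as a list fails immediately instead of silently skipping the cache.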
Caching
Orca has a cache system so that function results can be stored for reuse when it is not necessary to recompute them every time they are used.
The decorators table(), column(), and injectable() all take two keyword arguments related to caching: cache and cache_scope.
By default, results are not cached. Register functions with cache=True to enable caching of their results.
Cache Scope
Cached items have an associated “scope” that allows Orca to automatically manage how long functions have their results cached before re-evaluating them. The three scope settings are:
'forever' (the default setting) - Results are cached until manually cleared by user commands.
'iteration' - Results are cached for the remainder of the current pipeline iteration.
'step' - Results are cached until the current pipeline step finishes.
An item’s cache scope can be modified using update_injectable_scope(), update_table_scope(), or update_column_scope(). Omitting the scope or passing None turns caching off for the item. These functions were added in Orca v1.6.
Disabling Caching
There may be situations, especially during testing, that require disabling the caching system. Caching can be turned off globally using the disable_cache() function (and turned back on by enable_cache()).
To run a block of commands with the cache disabled, but have it automatically re-enabled, use the cache_disabled() context manager:
with orca.cache_disabled():
    result = orca.eval_variable('my_table')
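For readers curious how such a context manager can guarantee that caching is switched back on, here is a minimal sketch built on contextlib. The `_state` dictionary and these function bodies are assumptions made for illustration; only the names cache_on and cache_disabled echo the Orca API.

```python
from contextlib import contextmanager

_state = {'caching': True}  # hypothetical global switch


def cache_on():
    return _state['caching']


@contextmanager
def cache_disabled():
    """Turn caching off for the duration of a with-block."""
    previous = _state['caching']
    _state['caching'] = False
    try:
        yield
    finally:
        _state['caching'] = previous  # restored even if the block raises


with cache_disabled():
    inside = cache_on()
after = cache_on()
```

Restoring the previous value in a finally clause, instead of unconditionally setting it to True, means that nesting the context manager inside an already disabled region leaves caching disabled afterward.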
Manually Clearing Cache
Orca’s entire cache can be cleared using clear_cache().
Cache can also be cleared manually for individual items, to allow finer control over re-computation. These functions were added in Orca v1.6.
To clear the cached value of an injectable, use clear_injectable(). To clear the cached copy of an entire table, use clear_table().
A dynamically generated column can be cleared using clear_column():
orca.clear_column('my_table', 'my_col')
To clear all dynamically generated columns from a table, use clear_columns():
orca.clear_columns('my_table')
Or clear a subset of the columns like this:
orca.clear_columns('my_table', ['col1', 'col2'])
Steps
A step is a function run by Orca with argument matching. Use the step() decorator to register a step function. Steps are generally important for their side effects; their return values are discarded during pipeline runs. For example, a step might replace a column in a table (a new table, though similar to my_table above):
df = pd.DataFrame({'a': [1, 2, 3]})
orca.add_table('new_table', df)
@orca.step()
def replace_col(new_table):
    new_table['a'] = [4, 5, 6]
Or update some values in a column:
@orca.step()
def update_col(new_table):
    s = pd.Series([99], index=[1])
    new_table.update_col_from_series('a', s)
Or add rows to a table:
@orca.step()
def add_rows(new_table):
    new_rows = pd.DataFrame({'a': [100, 101]}, index=[3, 4])
    df = new_table.to_frame()
    df = pd.concat([df, new_rows])
    orca.add_table('new_table', df)
The first two of the above examples update new_table’s underlying DataFrame and so require it to be a DataFrameWrapper. If your table is a wrapped function, not a DataFrame, you can update columns by replacing them entirely with a new Series using the add_column() function.
A demonstration of running the above steps:
In [68]: orca.run(['replace_col', 'update_col', 'add_rows'])
Running step 'replace_col'
Running step 'update_col'
Running step 'add_rows'
In [69]: orca.get_table('new_table').to_frame()
Out[69]:
a
0 4
1 99
2 6
3 100
4 101
In the context of a simulation, steps can be thought of as model steps that will often advance the simulation by updating data. Steps are plain Python functions, though, and there is no restriction on what they are allowed to do.
Running Pipelines
You start pipelines by calling the run() function and listing which steps you want to run. Calling run() with just a list of steps, as in the above example, will run through the steps once. To run the pipeline over a sequence of values, provide those values as a sequence to run() using the iter_vars argument.
The iter_var injectable stores the current value from the iter_vars argument to the run() function. The iter_step injectable is a namedtuple with fields named step_num and step_name, stored in that order. step_num is a zero-based index based on the list of step names passed to the run() function.
In [77]: @orca.step()
   ....: def print_year(iter_var, iter_step):
   ....:     print('*** the iteration value is {} ***'.format(iter_var))
   ....:     print('*** step number {0} is named {1} ***'.format(iter_step.step_num, iter_step.step_name))
   ....:
In [78]: orca.run(['print_year'], iter_vars=range(2010, 2015))
Running iteration 1 with iteration value 2010
Running step 'print_year'
*** the iteration value is 2010 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 1 with iteration value 2010: 0.00 s
Running iteration 2 with iteration value 2011
Running step 'print_year'
*** the iteration value is 2011 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 2 with iteration value 2011: 0.00 s
Running iteration 3 with iteration value 2012
Running step 'print_year'
*** the iteration value is 2012 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 3 with iteration value 2012: 0.00 s
Running iteration 4 with iteration value 2013
Running step 'print_year'
*** the iteration value is 2013 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 4 with iteration value 2013: 0.00 s
Running iteration 5 with iteration value 2014
Running step 'print_year'
*** the iteration value is 2014 ***
*** step number 0 is named print_year ***
Time to execute step 'print_year': 0.00 s
Total time to execute iteration 5 with iteration value 2014: 0.00 s
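The iteration logic shown in the output above can be approximated with a short loop. This is a sketch, not Orca's run() implementation: the local `run` function and the `IterStep` type are hypothetical, steps are passed as a dict of name to function, and return values are collected only so the example has something to inspect (Orca discards them).

```python
from collections import namedtuple

# Field order follows the description above: step_num, then step_name.
IterStep = namedtuple('IterStep', ['step_num', 'step_name'])


def run(steps, iter_vars=None):
    """Call each step once per iteration value (a single pass if None)."""
    log = []
    for iter_var in (iter_vars if iter_vars is not None else [None]):
        for step_num, (step_name, func) in enumerate(steps.items()):
            log.append(func(iter_var, IterStep(step_num, step_name)))
    return log


def print_year(iter_var, iter_step):
    return '{} step {} ({})'.format(
        iter_var, iter_step.step_num, iter_step.step_name)


log = run({'print_year': print_year}, iter_vars=range(2010, 2013))
```

The outer loop advances the iteration value and the inner loop walks the steps in order, which is why step_num restarts at zero in every iteration.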
Running Orca Components a la Carte
It can be useful to have Orca evaluate single variables and steps, especially during development and testing. To achieve this, use the eval_variable() and eval_step() functions.
eval_variable takes the name of a variable (including variable expressions) and returns that variable as it would be injected into a function by Orca. eval_step takes the name of a step, runs that step with variable injection, and returns any result.
Note
Most steps don’t have return values because Orca ignores them, but they can be useful for testing.
Both eval_variable() and eval_step() take arbitrary keyword arguments that are temporarily turned into injectables within Orca while the evaluation is taking place. When the evaluation is complete, Orca’s state is reset to whatever it was before calling the eval function.
An example of eval_variable():
In [15]: @orca.injectable()
   ....: def func(x, y):
   ....:     return x + y
   ....:
In [16]: orca.eval_variable('func', x=1, y=2)
Out[16]: 3
The keyword arguments are only temporarily set as injectables, which can lead to errors in a situation like this with a table where the evaluation of the table is delayed until to_frame() is called:
In [12]: @orca.table()
   ....: def table(x, y):
   ....:     return pd.DataFrame({'a': [x], 'b': [y]})
   ....:
In [13]: orca.eval_variable('table', x=1, y=2)
Out[13]: <orca.TableFuncWrapper at 0x100733850>
In [14]: orca.eval_variable('table', x=1, y=2).to_frame()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-14-5bf660fb07b7> in <module>()
----> 1 orca.eval_variable('table', x=1, y=2).to_frame()
<truncated>
KeyError: 'y'
To keep the injectables set for a controlled period, use the injectables() context manager:
In [12]: @orca.table()
   ....: def table(x, y):
   ....:     return pd.DataFrame({'a': [x], 'b': [y]})
   ....:
In [20]: with orca.injectables(x=1, y=2):
   ....:     df = orca.eval_variable('table').to_frame()
   ....:
In [21]: df
Out[21]:
a b
0 1 2
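The KeyError above comes from the timing of evaluation, which a small stand-alone sketch can reproduce. Everything here is hypothetical scaffolding (the `registry` dict, this `injectables` function, and `lazy_table`); it only mimics the pattern of temporary registration followed by delayed evaluation.

```python
from contextlib import contextmanager

registry = {}  # hypothetical stand-in for Orca's registered injectables


@contextmanager
def injectables(**kwargs):
    """Temporarily register values, then restore the previous state."""
    saved = dict(registry)
    registry.update(kwargs)
    try:
        yield
    finally:
        registry.clear()
        registry.update(saved)


def lazy_table():
    # Returns a callable that looks up 'x' and 'y' only when called,
    # mimicking a table whose evaluation is delayed until to_frame().
    return lambda: {'a': [registry['x']], 'b': [registry['y']]}


with injectables(x=1, y=2):
    deferred = lazy_table()
    inside = deferred()      # evaluated while x and y are registered

try:
    deferred()               # evaluated after the values were removed
    failed = False
except KeyError:
    failed = True
```

The practical rule is to finish every deferred evaluation, such as a to_frame() call, before leaving the with-block.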
Archiving Data
An option to the run() function is to have it save table data at set intervals. Tables (and only tables) are saved as DataFrames to an HDF5 file via pandas’ HDFStore feature. If Orca is running only one loop, the tables are stored under their registered names. If it is running multiple iterations, the tables are stored under names like '<iter_var>/<table name>'. For example, if iter_var is 2020, the “buildings” table would be stored as '2020/buildings'.
The out_interval keyword to run() controls how often the tables are saved out. For example, out_interval=5 saves tables every fifth iteration. In addition, the final data is always saved under the key 'final/<table name>'.
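The key naming described above is simple enough to express as a helper, which can be handy when reading archived tables back. `archive_key` is a hypothetical function written for this illustration and is not part of Orca; it encodes only the naming rules stated in this section.

```python
def archive_key(table_name, iter_var=None, final=False):
    """Build the HDFStore key under which a table would be saved."""
    if final:
        return 'final/{}'.format(table_name)
    if iter_var is None:
        return table_name  # single run: the registered name
    return '{}/{}'.format(iter_var, table_name)


single_run_key = archive_key('buildings')
iteration_key = archive_key('buildings', iter_var=2020)
final_key = archive_key('buildings', final=True)
```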
Argument Matching
A key feature of Orca is that it matches the names of function arguments to the names of registered variables in order to inject variables when evaluating functions. For that reason, it’s important that variables be registered with names that are also valid Python variables.
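Name-based injection of this kind can be sketched with the standard library's inspect module. The `registry` dictionary and `call_with_injection` helper below are assumptions made for illustration and say nothing about how Orca is implemented internally.

```python
import inspect

registry = {'z': 5, 'factor': 2}  # hypothetical registered variables


def call_with_injection(func):
    """Call func, filling each argument from the registry by name."""
    names = inspect.signature(func).parameters
    return func(**{name: registry[name] for name in names})


def scaled(z, factor):
    return z * factor


result = call_with_injection(scaled)
```

Since lookup is by argument name, a registered name such as 'my-table' could never be requested, which is the reason registered names should be valid Python identifiers.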
Variable Expressions
Argument matching is extended by a feature we call “variable expressions”. Expressions allow you to specify a variable to inject with Python keyword arguments. Here’s an example redone from above using variable expressions:
@orca.table()
def halve_my_table(data='my_table'):
    df = data.to_frame()
    return df / 2
The variable registered as 'my_table' is injected into this function as the argument data.
Expressions can also be used to refer to columns within a registered table:
@orca.column('my_table')
def halved(data='my_table.a'):
    return data / 2
In this case, the expression my_table.a refers to the column a, which is a pandas Series within the table my_table. We return a new Series to register a new column on my_table using the column() decorator. We can take a look in IPython:
In [21]: orca.get_table('my_table').to_frame()
Out[21]:
a halved
0 1 0.5
1 2 1.0
2 3 1.5
Expressions referring to columns may be useful in situations where a function requires only a single column from a table and the user would like to specifically document that in the function’s arguments.
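A rough picture of how a keyword default can double as a lookup expression is given below. The sketch assumes a dict of column lists in place of a registered DataFrame, and the helpers `resolve` and `call_with_expressions` are hypothetical; Orca's own resolution logic may differ.

```python
import inspect

# A dict of column lists stands in for a registered DataFrame.
tables = {'my_table': {'a': [1, 2, 3]}}


def resolve(expression):
    """Resolve 'table' or 'table.column' against the registered tables."""
    table_name, _, column = expression.partition('.')
    table = tables[table_name]
    return table[column] if column else table


def call_with_expressions(func):
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        # A default value is treated as the expression; otherwise the
        # argument name itself is looked up.
        has_default = param.default is not inspect.Parameter.empty
        kwargs[name] = resolve(param.default if has_default else name)
    return func(**kwargs)


def halved(data='my_table.a'):
    return [x / 2 for x in data]


def column_names(my_table):
    return sorted(my_table)


result = call_with_expressions(halved)
names = call_with_expressions(column_names)
```

Here `halved` receives only the single column it asks for, while `column_names` receives the whole table through ordinary name matching.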
API
Table API
add_table() - Register a table with Orca.
table() - Decorates functions that return DataFrames.
get_table() - Get a registered table.
list_tables() - List of table names.
DataFrameWrapper - Wraps a DataFrame so it can provide certain columns and handle computed columns.
TableFuncWrapper - Wrap a function that provides a DataFrame.
Column API
add_column() - Add a new column to a table from a Series or callable.
column() - Decorates functions that return a Series.
list_columns() - List of (table name, registered column name) pairs.
Injectable API
add_injectable() - Add a value that will be injected into other functions.
injectable() - Decorates functions that will be injected into other functions.
get_injectable() - Get an injectable by name.
list_injectables() - List of registered injectables.
Merge API
broadcast() - Register a rule for merging two tables by broadcasting one onto the other.
list_broadcasts() - List of registered broadcasts as (cast table name, onto table name).
merge_tables() - Merge a number of tables onto a target table.
Step API
add_step() - Add a step function to Orca.
step() - Decorates functions that will be called by the run function.
get_step() - Get a wrapped step by name.
list_steps() - List of registered step names.
run() - Run steps in series, optionally repeatedly over some sequence.
Cache API
clear_cache() - Clear all cached data.
disable_cache() - Turn off caching across Orca, even for registered variables that have caching enabled.
enable_cache() - Allow caching of registered variables that explicitly have caching enabled.
cache_on() - Whether caching is currently enabled or disabled.
clear_injectable() - Clear the cached value of an injectable.
clear_table() - Clear the cached copy of an entire table.
clear_column() - Clear the cached copy of a dynamically generated column.
clear_columns() - Clear all (or a specified list) of the dynamically generated columns associated with a table.
update_injectable_scope() - Update the cache scope for a wrapped injectable function.
update_table_scope() - Update the cache scope for a wrapped table function.
update_column_scope() - Update the cache scope for a wrapped column function.
API Docs
class orca.orca.Broadcast(cast, onto, cast_on, onto_on, cast_index, onto_index)
    cast
        Alias for field number 0
    cast_index
        Alias for field number 4
    cast_on
        Alias for field number 2
    onto
        Alias for field number 1
    onto_index
        Alias for field number 5
    onto_on
        Alias for field number 3
class orca.orca.CacheItem(name, value, scope)
    name
        Alias for field number 0
    scope
        Alias for field number 2
    value
        Alias for field number 1
class orca.orca.DataFrameWrapper(name, frame, copy_col=True)
    Wraps a DataFrame so it can provide certain columns and handle computed columns.
    Parameters
        name : str
            Name for the table.
        frame : pandas.DataFrame
        copy_col : bool, optional
            Whether to return copies when evaluating columns.
    Attributes
        name : str
            Table name.
        copy_col : bool
            Whether to return copies when evaluating columns.
        local : pandas.DataFrame
            The wrapped DataFrame.
    clear_cached()
        Remove cached results from this table’s computed columns.
    column_type(column_name)
        Report column type as one of ‘local’, ‘series’, or ‘function’.
        Parameters
            column_name : str
        Returns
            col_type : {‘local’, ‘series’, ‘function’}
                ‘local’ means that the column is part of the registered table, ‘series’ means the column is a registered Pandas Series, and ‘function’ means the column is a registered function providing a Pandas Series.
    property columns
        Columns in this table.
    get_column(column_name)
        Returns a column as a Series.
        Parameters
            column_name : str
        Returns
            column : pandas.Series
    property index
        Table index.
    property local_columns
        Columns that are part of the wrapped DataFrame.
    to_frame(columns=None)
        Make a DataFrame with the given columns.
        Will always return a copy of the underlying table.
        Parameters
            columns : sequence or string, optional
                Sequence of the column names desired in the DataFrame. A string can also be passed if only one column is desired. If None all columns are returned, including registered columns.
        Returns
            frame : pandas.DataFrame
    update_col(column_name, series)
        Add or replace a column in the underlying DataFrame.
        Parameters
            column_name : str
                Column to add or replace.
            series : pandas.Series or sequence
                Column data.
    update_col_from_series(column_name, series, cast=False)
        Update existing values in a column from another series. Index values must match in both column and series. Optionally casts data type to match the existing column.
        Parameters
            column_name : str
            series : pandas.Series
            cast : bool, optional, default False
exception orca.orca.OrcaError
class orca.orca.TableFuncWrapper(name, func, cache=False, cache_scope='forever', copy_col=True)
    Wrap a function that provides a DataFrame.
    Parameters
        name : str
            Name for the table.
        func : callable
            Callable that returns a DataFrame.
        cache : bool, optional
            Whether to cache the results of calling the wrapped function.
        cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional
            Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.
        copy_col : bool, optional
            Whether to return copies when evaluating columns.
    Attributes
        name : str
            Table name.
        cache : bool
            Whether caching is enabled for this table.
        copy_col : bool
            Whether to return copies when evaluating columns.
    clear_cached()
        Remove this table’s cached result and that of associated columns.
    column_type(column_name)
        Report column type as one of ‘local’, ‘series’, or ‘function’.
        Parameters
            column_name : str
        Returns
            col_type : {‘local’, ‘series’, ‘function’}
                ‘local’ means that the column is part of the registered table, ‘series’ means the column is a registered Pandas Series, and ‘function’ means the column is a registered function providing a Pandas Series.
    property columns
        Columns in this table. (May contain only computed columns if the wrapped function has not been called yet.)
    func_source_data()
        Return data about the wrapped function source, including file name, line number, and source code.
        Returns
            filename : str
            lineno : int
                The line number on which the function starts.
            source : str
    get_column(column_name)
        Returns a column as a Series.
        Parameters
            column_name : str
        Returns
            column : pandas.Series
    property index
        Index of the underlying table. Will be None if that index is unknown.
    property local_columns
        Only the columns contained in the DataFrame returned by the wrapped function. (No registered columns included.)
    to_frame(columns=None)
        Make a DataFrame with the given columns.
        Will always return a copy of the underlying table.
        Parameters
            columns : sequence, optional
                Sequence of the column names desired in the DataFrame. If None all columns are returned.
        Returns
            frame : pandas.DataFrame
orca.orca.add_column(table_name, column_name, column, cache=False, cache_scope='forever')
    Add a new column to a table from a Series or callable.
    Parameters
        table_name : str
            Table with which the column will be associated.
        column_name : str
            Name for the column.
        column : pandas.Series or callable
            Series should have an index matching the table to which it is being added. If a callable, the function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The function should return a Series.
        cache : bool, optional
            Whether to cache the results of a provided callable. Does not apply if column is a Series.
        cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional
            Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.
orca.orca.add_injectable(name, value, autocall=True, cache=False, cache_scope='forever', memoize=False)
    Add a value that will be injected into other functions.
    Parameters
        name : str
        value
            If a callable and autocall is True then the function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The return value will be passed to any functions using this injectable. In all other cases, value will be passed through untouched.
        autocall : bool, optional
            Set to True to have injectable functions automatically called (with argument matching) and the result injected instead of the function itself.
        cache : bool, optional
            Whether to cache the return value of an injectable function. Only applies when value is a callable and autocall is True.
        cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional
            Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.
        memoize : bool, optional
            If autocall is False it is still possible to cache function results by setting this flag to True. Cached values are stored in a dictionary keyed by argument values, so the argument values must be hashable. Memoized functions have their caches cleared according to the same rules as universal caching.
orca.orca.add_step(step_name, func)
    Add a step function to Orca.
    The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.
    Parameters
        step_name : str
        func : callable
orca.orca.add_table(table_name, table, cache=False, cache_scope='forever', copy_col=True)
    Register a table with Orca.
    Parameters
        table_name : str
            Should be globally unique to this table.
        table : pandas.DataFrame or function
            If a function, the function should return a DataFrame. The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca.
        cache : bool, optional
            Whether to cache the results of a provided callable. Does not apply if table is a DataFrame.
        cache_scope : {‘step’, ‘iteration’, ‘forever’}, optional
            Scope for which to cache data. Default is to cache forever (or until manually cleared). ‘iteration’ caches data for each complete iteration of the pipeline, ‘step’ caches data for a single step of the pipeline.
        copy_col : bool, optional
            Whether to return copies when evaluating columns.
    Returns
        wrapped : DataFrameWrapper or TableFuncWrapper
orca.orca.broadcast(cast, onto, cast_on=None, onto_on=None, cast_index=False, onto_index=False)
    Register a rule for merging two tables by broadcasting one onto the other.
    Parameters
        cast, onto : str
            Names of registered tables.
        cast_on, onto_on : str, optional
            Column names used for merge, equivalent of left_on/right_on parameters of pandas.merge.
        cast_index, onto_index : bool, optional
            Whether to use table indexes for merge. Equivalent of left_index/right_index parameters of pandas.merge.
orca.orca.cache_on()
    Whether caching is currently enabled or disabled.
    Returns
        on : bool
            True if caching is enabled.
orca.orca.clear_all()
    Clear any and all stored state from Orca.
orca.orca.clear_cache(scope=None)
    Clear all cached data.
    Parameters
        scope : {None, ‘step’, ‘iteration’, ‘forever’}, optional
            Clear cached values with a given scope. By default all cached values are removed.
orca.orca.clear_column(table_name, column_name)
    Clear the cached copy of a dynamically generated column. Added in Orca v1.6.
    Parameters
        table_name : str
            Table containing the column to clear.
        column_name : str
            Name of the column to clear.
orca.orca.clear_columns(table_name, columns=None)
    Clear all (or a specified list) of the dynamically generated columns associated with a table. Added in Orca v1.6.
    Parameters
        table_name : str
            Table name.
        columns : list of str, optional, default None
            List of columns to clear. If None, all extra/computed columns in the table will be cleared.
orca.orca.clear_injectable(injectable_name)
    Clear the cached value of an injectable. Added in Orca v1.6.
    Parameters
        injectable_name : str
            Name of injectable to clear.
orca.orca.clear_table(table_name)
    Clear the cached copy of an entire table. Added in Orca v1.6.
    Parameters
        table_name : str
            Name of table to clear.
orca.orca.column(table_name, column_name=None, cache=False, cache_scope='forever')
    Decorates functions that return a Series.
    Decorator version of add_column. Series index must match the named table. Column name defaults to name of function.
    The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected. The index of the returned Series must match the named table.
orca.orca.column_map(tables, columns)
    Take a list of tables and a list of column names and resolve which columns come from which table.
    Parameters
        tables : sequence of _DataFrameWrapper or _TableFuncWrapper
            Could also be sequence of modified pandas.DataFrames, the important thing is that they have .name and .columns attributes.
        columns : sequence of str
            The column names of interest.
    Returns
        col_map : dict
            Maps table names to lists of column names.
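The resolution that column_map performs can be pictured with a simplified stand-in. In this sketch, `sketch_column_map` is a hypothetical function that takes a dict of table name to column names instead of wrapper objects with .name and .columns attributes, so it illustrates the idea only and is not the library function.

```python
def sketch_column_map(tables, columns):
    """Map each table name to the requested columns found in that table.

    `tables` is a dict of table name -> list of column names, a
    simplification of the wrapper objects the real function accepts.
    """
    wanted = set(columns)
    return {name: [c for c in cols if c in wanted]
            for name, cols in tables.items()}


tables = {'a': ['a'], 'b': ['b', 'a_id'], 'd': ['d', 'b_id', 'c_id']}
col_map = sketch_column_map(tables, ['a', 'b_id', 'd'])
```

A mapping like this lets a merge request only the needed columns from each table before joining.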
-
orca.orca.
disable_cache
()¶ Turn off caching across Orca, even for registered variables that have caching enabled.
-
orca.orca.
enable_cache
()¶ Allow caching of registered variables that explicitly have caching enabled.
-
orca.orca.
eval_step
(name, **kwargs)¶ Evaluate a step as would be done within the pipeline environment and return the result. Any keyword arguments are temporarily set as injectables.
- Parameters
- namestr
Name of step to run.
- Returns
- object
Anything returned by a step. (Though note that in Orca runs return values from steps are ignored.)
-
orca.orca.
eval_variable
(name, **kwargs)¶ Execute a single variable function registered with Orca and return the result. Any keyword arguments are temporarily set as injectables. This gives the value as would be injected into a function.
- Parameters
- name: str
Name of variable to evaluate. Use variable expressions to specify columns.
- Returns
- object
For injectables and columns this directly returns whatever object is returned by the registered function. For tables this returns a DataFrameWrapper as if the table had been injected into a function.
-
orca.orca.
get_broadcast
(cast_name, onto_name)¶ Get a single broadcast.
Broadcasts are stored data about how to do a Pandas join. A Broadcast object is a namedtuple with these attributes:
cast: the name of the table being broadcast
onto: the name of the table onto which “cast” is broadcast
cast_on: The optional name of a column on which to join. None if the table index will be used instead.
onto_on: The optional name of a column on which to join. None if the table index will be used instead.
cast_index: True if the table index should be used for the join.
onto_index: True if the table index should be used for the join.
- Parameters
- cast_name: str
The name of the table being broadcast.
- onto_name: str
The name of the table onto which cast_name is broadcast.
- Returns
- broadcast: Broadcast
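To make the attribute list above concrete, here is a minimal sketch that builds a namedtuple with the same six fields and turns it into join arguments. The Broadcast defined here is a local stand-in rather than an import from Orca, the table names are made up, and treating "onto" as the left side of a pandas merge is an assumption of this example.

```python
from collections import namedtuple

# Field names follow the attribute list above; this is a local stand-in,
# not Orca's own Broadcast type.
Broadcast = namedtuple(
    'Broadcast',
    ['cast', 'onto', 'cast_on', 'onto_on', 'cast_index', 'onto_index'])

# Join the index of a hypothetical 'zones' table to the 'zone_id'
# column of a hypothetical 'buildings' table.
b = Broadcast(cast='zones', onto='buildings',
              cast_on=None, onto_on='zone_id',
              cast_index=True, onto_index=False)

# Keyword arguments one might pass to pandas.merge, assuming the
# "onto" table is the left frame and the "cast" table is the right.
merge_kwargs = {
    'left_on': b.onto_on, 'right_on': b.cast_on,
    'left_index': b.onto_index, 'right_index': b.cast_index,
}
print(merge_kwargs)
```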
-
orca.orca.
get_injectable
(name)¶ Get an injectable by name. Does not evaluate wrapped functions.
- Parameters
- name: str
- Returns
- injectable
Original value or evaluated value of an _InjectableFuncWrapper.
-
orca.orca.
get_injectable_func_source_data
(name)¶ Return data about an injectable function’s source, including file name, line number, and source code.
- Parameters
- name: str
- Returns
- filename: str
- lineno: int
The line number on which the function starts.
- source: str
-
orca.orca.
get_raw_column
(table_name, column_name)¶ Get a wrapped, registered column.
This function cannot return columns that are part of wrapped DataFrames; it is only for columns registered directly through Orca.
- Parameters
- table_name: str
- column_name: str
- Returns
- wrapped: _SeriesWrapper or _ColumnFuncWrapper
-
orca.orca.
get_raw_injectable
(name)¶ Return a raw, possibly wrapped injectable.
- Parameters
- name: str
- Returns
- inj: _InjectableFuncWrapper or object
-
orca.orca.
get_raw_table
(table_name)¶ Get a wrapped table by name and don’t do anything to it.
- Parameters
- table_name: str
- Returns
- table: DataFrameWrapper or TableFuncWrapper
-
orca.orca.
get_step
(step_name)¶ Get a wrapped step by name.
-
orca.orca.
get_step_table_names
(steps)¶ Returns a list of table names injected into the provided steps.
- Parameters
- steps: list of str
Steps to gather table inputs from.
- Returns
- list of str
-
orca.orca.
get_table
(table_name)¶ Get a registered table.
Decorated functions will be converted to DataFrameWrapper.
- Parameters
- table_name: str
- Returns
- table: DataFrameWrapper
-
orca.orca.
injectable
(name=None, autocall=True, cache=False, cache_scope='forever', memoize=False)¶ Decorates functions that will be injected into other functions.
Decorator version of add_injectable. Name defaults to name of function.
The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.
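The matching of argument names and keyword argument values to registered variables can be pictured with a small standard-library sketch. The registry dictionary and call_with_injection helper below are hypothetical and greatly simplified; Orca's own resolution also deals with tables, columns, and caching, none of which is modelled here.

```python
import inspect

# Hypothetical registry standing in for Orca's registered variables.
registry = {'rate': 0.05, 'iter_var': 2015}


def call_with_injection(func, registry):
    """Call func, looking up each argument in the registry."""
    kwargs = {}
    for arg_name, param in inspect.signature(func).parameters.items():
        # A keyword default names the registered variable to inject;
        # otherwise the argument's own name is used as the lookup key.
        if param.default is inspect.Parameter.empty:
            key = arg_name
        else:
            key = param.default
        kwargs[arg_name] = registry[key]
    return func(**kwargs)


def growth(rate, year='iter_var'):
    # 'rate' is matched by name, 'year' via its keyword value 'iter_var'.
    return (year, rate)


result = call_with_injection(growth, registry)
print(result)
```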
-
orca.orca.
injectable_type
(name)¶ Classify an injectable as either ‘variable’ or ‘function’.
- Parameters
- name: str
- Returns
- inj_type: {‘variable’, ‘function’}
If the injectable is an automatically called function or any other type of callable, the type will be ‘function’; all other injectables will have type ‘variable’.
-
orca.orca.
injectables
(**kwargs)¶ Temporarily add injectables to the pipeline environment. Takes only keyword arguments.
Injectables will be returned to their original state when the context manager exits.
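The set-then-restore behaviour described above follows a common context manager pattern. The sketch below shows that pattern against a plain dictionary; _registry and temporary_values are hypothetical names for this example, and Orca's internal bookkeeping is likely more involved.

```python
from contextlib import contextmanager

# Hypothetical registry; Orca keeps its own internal store.
_registry = {'rate': 0.05}


@contextmanager
def temporary_values(**kwargs):
    """Set registry entries for the duration of a with-block, then restore."""
    saved = dict(_registry)
    _registry.update(kwargs)
    try:
        yield
    finally:
        # Restore the original state even if the block raised.
        _registry.clear()
        _registry.update(saved)


with temporary_values(rate=0.10, year=2015):
    inside = dict(_registry)

after = dict(_registry)
print(inside, after)
```

Using try/finally means the original values come back even when the body of the with-block raises an exception.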
-
orca.orca.
is_broadcast
(cast_name, onto_name)¶ Checks whether a relationship exists for broadcast cast_name onto onto_name.
-
orca.orca.
is_expression
(name)¶ Checks whether a given name is a simple variable name or a compound variable expression.
- Parameters
- name: str
- Returns
- is_expr: bool
-
orca.orca.
is_injectable
(name)¶ Checks whether a given name can be mapped to an injectable.
-
orca.orca.
is_step
(step_name)¶ Check whether a given name refers to a registered step.
-
orca.orca.
is_table
(name)¶ Returns whether a given name refers to a registered table.
-
class
orca.orca.
iter_step
(step_num, step_name)¶ -
step_name
¶ Alias for field number 1
-
step_num
¶ Alias for field number 0
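Because iter_step is a two-field named tuple, its values can be read either by attribute or by position. The sketch below recreates an equivalent type locally for illustration instead of importing Orca's; the step name is made up.

```python
from collections import namedtuple

# Local stand-in mirroring the two documented fields and their positions.
iter_step = namedtuple('iter_step', ['step_num', 'step_name'])

current = iter_step(0, 'my_step')

# Field 0 is step_num and field 1 is step_name, as documented above.
print(current.step_num, current[1])
```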
-
-
orca.orca.
list_broadcasts
()¶ List of registered broadcasts as (cast table name, onto table name).
-
orca.orca.
list_columns
()¶ List of (table name, registered column name) pairs.
-
orca.orca.
list_columns_for_table
(table_name)¶ Return a list of all the extra columns registered for a given table.
- Parameters
- table_name: str
- Returns
- columns: list of str
-
orca.orca.
list_injectables
()¶ List of registered injectables.
-
orca.orca.
list_steps
()¶ List of registered step names.
-
orca.orca.
list_tables
()¶ List of table names.
-
orca.orca.
merge_tables
(target, tables, columns=None, drop_intersection=True)¶ Merge a number of tables onto a target table. Tables must have registered merge rules via the broadcast function.
- Parameters
- target: str, DataFrameWrapper, or TableFuncWrapper
Name of the table (or wrapped table) onto which tables will be merged.
- tables: list of DataFrameWrapper, TableFuncWrapper, or str
All of the tables to merge. Should include the target table.
- columns: list of str, optional
If given, columns will be mapped to tables and only those columns will be requested from each table. The final merged table will have only these columns. By default all columns are used from every table.
- drop_intersection: bool
If True, keep the leftmost occurrence of any column name that occurs on more than one table. This prevents getting back the same column with suffixes applied by pd.merge. If False, column names will be suffixed with the table names, e.g. zone_id_buildings and zone_id_parcels.
- Returns
- merged: pandas.DataFrame
-
orca.orca.
run
(steps, iter_vars=None, data_out=None, out_interval=1, out_base_tables=None, out_run_tables=None, compress=False, out_base_local=True, out_run_local=True)¶ Run steps in series, optionally repeatedly over some sequence. The current iteration variable is set as a global injectable called iter_var.
- Parameters
- steps: list of str
List of steps to run, identified by name.
- iter_vars: iterable, optional
The values of iter_vars will be made available as an injectable called iter_var when repeatedly running steps.
- data_out: str, optional
An optional filename to which all tables injected into any step in steps will be saved every out_interval iterations. File will be a pandas HDF data store.
- out_interval: int, optional
Iteration interval on which to save data to data_out. For example, 2 will save out every 2 iterations, 5 every 5 iterations. Default is every iteration. The results of the first and last iterations are always included. The input (base) tables are also included and prefixed with base/; these represent the state of the system before any steps have been executed. The interval is defined relative to the first iteration. For example, a run beginning in 2015 with an out_interval of 2 will write out results for 2015, 2017, etc.
- out_base_tables: list of str, optional, default None
List of base tables to write. If not provided, tables injected into ‘steps’ will be written.
- out_run_tables: list of str, optional, default None
List of run tables to write. If not provided, tables injected into ‘steps’ will be written.
- compress: boolean, optional, default False
Whether to compress output file using standard HDF5 zlib compression. Compression yields much smaller files using slightly more CPU.
- out_base_local: boolean, optional, default True
For tables in out_base_tables, whether to store only local columns (True) or both local and computed columns (False).
- out_run_local: boolean, optional, default True
For tables in out_run_tables, whether to store only local columns (True) or both local and computed columns (False).
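The out_interval rule described above (counted from the first iteration, with the first and last iterations always written) can be checked with a few lines of plain Python. iterations_written is a hypothetical helper written for this illustration; it does not call Orca and does not write any files.

```python
def iterations_written(iter_vars, out_interval=1):
    """Return the iteration values whose results would be saved."""
    iter_vars = list(iter_vars)
    last = len(iter_vars) - 1
    return [
        var for i, var in enumerate(iter_vars)
        # Interval is counted from the first iteration; the last
        # iteration is always kept as well.
        if i % out_interval == 0 or i == last
    ]


written = iterations_written(range(2015, 2021), out_interval=2)
print(written)  # [2015, 2017, 2019, 2020]
```

Here 2020 appears even though it is off the two-year cadence, because it is the final iteration of the run.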
-
orca.orca.
step
(step_name=None)¶ Decorates functions that will be called by the run function.
Decorator version of add_step. Step name defaults to name of function.
The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.
-
orca.orca.
table
(table_name=None, cache=False, cache_scope='forever', copy_col=True)¶ Decorates functions that return DataFrames.
Decorator version of add_table. Table name defaults to name of function.
The function’s argument names and keyword argument values will be matched to registered variables when the function needs to be evaluated by Orca. The argument name “iter_var” may be used to have the current iteration variable injected.
-
orca.orca.
table_type
(table_name)¶ Returns the type of a registered table.
The type can be either “dataframe” or “function”.
- Parameters
- table_name: str
- Returns
- table_type: {‘dataframe’, ‘function’}
-
orca.orca.
temporary_tables
(**kwargs)¶ Temporarily set DataFrames as registered tables.
Tables will be returned to their original state when the context manager exits. Caching is not enabled for tables registered via this function.
-
orca.orca.
update_column_scope
(table_name, column_name, new_scope=None)¶ Update the cache scope for a wrapped column function. Clears the cache if the new scope is more granular than the existing. Added in Orca v1.6.
- Parameters
- table_name: str
Name of the table.
- column_name: str
Name of the column to update.
- new_scope: str, optional, default None
Valid values: None, ‘forever’, ‘iteration’, ‘step’. None implies no caching.
-
orca.orca.
update_injectable_scope
(name, new_scope=None)¶ Update the cache scope for a wrapped injectable function. Clears the cache if the new scope is more granular than the existing. Added in Orca v1.6.
- Parameters
- name: str
Name of the injectable to update.
- new_scope: str, optional, default None
Valid values: None, ‘forever’, ‘iteration’, ‘step’. None implies no caching.
-
orca.orca.
update_table_scope
(name, new_scope=None)¶ Update the cache scope for a wrapped table function. Clears the cache if the new scope is more granular than the existing. Added in Orca v1.6.
- Parameters
- name: str
Name of the table to update.
- new_scope: str, optional, default None
Valid values: None, ‘forever’, ‘iteration’, ‘step’. None implies no caching.
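The three update_*_scope functions above share the rule that the cache is cleared when the new scope is more granular than the existing one. One way to picture that rule is to rank the scopes and compare ranks. The ordering and the should_clear_cache helper below are assumptions made for illustration, not taken from Orca's source: None is ranked lowest (no caching), then ‘step’, ‘iteration’, and ‘forever’.

```python
# Assumed ranking from shortest-lived to longest-lived cache;
# None means no caching at all.
GRANULARITY = {None: 0, 'step': 1, 'iteration': 2, 'forever': 3}


def should_clear_cache(old_scope, new_scope):
    """Return True if moving to new_scope should drop cached values."""
    if new_scope not in GRANULARITY:
        raise ValueError('invalid cache scope: {!r}'.format(new_scope))
    # A lower rank means cached values live for a shorter time, so
    # anything cached under the old, longer-lived scope may be stale.
    return GRANULARITY[new_scope] < GRANULARITY[old_scope]


print(should_clear_cache('forever', 'step'))
print(should_clear_cache('step', 'forever'))
```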
-
orca.orca.
write_tables
(fname, table_names=None, prefix=None, compress=False, local=False)¶ Writes tables to a pandas.HDFStore file.
- Parameters
- fname: str
File name for HDFStore. Will be opened in append mode and closed at the end of this function.
- table_names: list of str, optional, default None
List of tables to write. If None, all registered tables will be written.
- prefix: str
If not None, used to prefix the output table names so that multiple iterations can go in the same file.
- compress: boolean
Whether to compress output file using standard HDF5-readable zlib compression, default False.