Orca Core
=========

.. note::
   In the documentation below the following imports are implied::

       import orca
       import pandas as pd

Tables
------

Tables are Pandas DataFrames_.
Use the :py:func:`~orca.orca.add_table` function to register
a DataFrame under a given name::

    df = pd.DataFrame({'a': [1, 2, 3]})

    orca.add_table('my_table', df)

Or you can use the decorator :py:func:`~orca.orca.table`
to register a function that returns a DataFrame::

    @orca.table('halve_my_table')
    def halve_my_table(my_table):
        df = my_table.to_frame()
        return df / 2

The decorator argument, which specifies the name to register the table with,
is optional. If left out, the table is registered under the name of the
function that is being decorated. The decorator example above could be
written more concisely::

    @orca.table()
    def halve_my_table(my_table):
        df = my_table.to_frame()
        return df / 2

Note that the decorator parentheses are still required.

By registering ``halve_my_table`` as a function, its values will always be
half those in ``my_table``, even if ``my_table`` is later changed.
If you'd like a function to *not* be evaluated every time it
is used, pass the ``cache=True`` keyword when registering it.

Here's a demo of the above table definitions shown in IPython:

.. code-block:: python

    In [19]: wrapped = orca.get_table('halve_my_table')

    In [20]: wrapped.to_frame()
    Out[20]:
         a
    0  0.5
    1  1.0
    2  1.5

Table Wrappers
~~~~~~~~~~~~~~

Notice in the table function above that we had to call a
:py:meth:`~orca.orca.DataFrameWrapper.to_frame` method
before using the table in a math operation. The values injected into
functions are not DataFrames, but specialized wrappers.
The wrappers facilitate caching, `computed columns <#columns>`__,
and lazy evaluation of table functions.
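
To make the idea of lazy evaluation concrete, here is a minimal sketch in
plain Python. It is not Orca's implementation: ``LazyTable`` is a hypothetical
stand-in for a table wrapper, and a dict of lists stands in for a DataFrame.
It only illustrates why a wrapped function reflects later changes to its
inputs when ``to_frame()`` is called.

.. code-block:: python

    class LazyTable:
        """Hypothetical stand-in for a table wrapper (not Orca's class)."""

        def __init__(self, func):
            self._func = func
            self.calls = 0  # number of times the function has been evaluated

        def to_frame(self):
            # Nothing is computed until the data is actually requested.
            self.calls += 1
            return self._func()


    # A dict of lists stands in for a DataFrame in this sketch.
    source = {'a': [1, 2, 3]}


    def halve():
        return {'a': [value / 2 for value in source['a']]}


    lazy = LazyTable(halve)
    first = lazy.to_frame()

    source['a'] = [10, 20, 30]  # change the underlying data
    second = lazy.to_frame()    # evaluated again, so it sees the change

Because the function runs on every ``to_frame()`` call, ``second`` is computed
from the updated data while ``first`` keeps the earlier values.
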
Learn more in the API documentation:

* :py:class:`~orca.orca.DataFrameWrapper`
* :py:class:`~orca.orca.TableFuncWrapper`

Automated Merges
~~~~~~~~~~~~~~~~

Certain analyses can be easiest when some tables are merged together,
but in other places it may be best to keep the tables separate.
Orca can make these on-demand merges easy by letting you define table
relationships up front and then performing the merges for you as needed.
We call these relationships "broadcasts" (as in a rule for how to broadcast
one table onto another) and you register them using the
:py:func:`~orca.orca.broadcast` function.

For an example we'll first define some DataFrames that contain links
to one another and register them with Orca::

    df_a = pd.DataFrame(
        {'a': [0, 1]},
        index=['a0', 'a1'])
    df_b = pd.DataFrame(
        {'b': [2, 3, 4, 5, 6],
         'a_id': ['a0', 'a1', 'a1', 'a0', 'a1']},
        index=['b0', 'b1', 'b2', 'b3', 'b4'])
    df_c = pd.DataFrame(
        {'c': [7, 8, 9]},
        index=['c0', 'c1', 'c2'])
    df_d = pd.DataFrame(
        {'d': [10, 11, 12, 13, 15, 16, 16, 17, 18, 19],
         'b_id': ['b2', 'b0', 'b3', 'b3', 'b1', 'b4', 'b1', 'b4', 'b3', 'b3'],
         'c_id': ['c0', 'c1', 'c1', 'c0', 'c0', 'c2', 'c1', 'c2', 'c1', 'c2']},
        index=['d0', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9'])

    orca.add_table('a', df_a)
    orca.add_table('b', df_b)
    orca.add_table('c', df_c)
    orca.add_table('d', df_d)

The tables have data so that 'a' can be broadcast onto 'b',
and 'b' and 'c' can be broadcast onto 'd'.
We use the :py:func:`~orca.orca.broadcast` function
to register those relationships::

    orca.broadcast(cast='a', onto='b', cast_index=True, onto_on='a_id')
    orca.broadcast(cast='b', onto='d', cast_index=True, onto_on='b_id')
    orca.broadcast(cast='c', onto='d', cast_index=True, onto_on='c_id')

The syntax is similar to that of the pandas ``merge`` function,
and indeed ``merge`` is used behind the scenes.
Once the broadcasts are defined, use the
:py:func:`~orca.orca.merge_tables` function to get a merged DataFrame.
Some examples in IPython:

.. code-block:: python

    In [4]: orca.merge_tables(target='b', tables=['a', 'b'])
    Out[4]:
       a_id  b  a
    b0   a0  2  0
    b3   a0  5  0
    b1   a1  3  1
    b2   a1  4  1
    b4   a1  6  1

    In [5]: orca.merge_tables(target='d', tables=['a', 'b', 'c', 'd'])
    Out[5]:
       b_id c_id   d  c a_id  b  a
    d0   b2   c0  10  7   a1  4  1
    d3   b3   c0  13  7   a0  5  0
    d2   b3   c1  12  8   a0  5  0
    d8   b3   c1  18  8   a0  5  0
    d9   b3   c2  19  9   a0  5  0
    d4   b1   c0  15  7   a1  3  1
    d6   b1   c1  16  8   a1  3  1
    d1   b0   c1  11  8   a0  2  0
    d5   b4   c2  16  9   a1  6  1
    d7   b4   c2  17  9   a1  6  1

Note that it's the target table's index that you find in the final merged
table, though the order may have changed.
:py:func:`~orca.orca.merge_tables` has an optional
``columns=`` keyword that can contain column names from any of the tables
going into the merge so you can limit which columns end up in the final table.
(Columns necessary for performing merges will be included whether or not
they are in the ``columns=`` list.)

.. note::
   :py:func:`~orca.orca.merge_tables` calls ``pandas.merge``
   with ``how='inner'``, meaning that only items that
   appear in both tables are kept in the merged table.

Columns
-------

Often, not all the columns you need are preexisting on your tables.
You may need to collect information from other tables
or perform a calculation to generate a column. Orca allows you to
register a Series_ or function as a column on a registered table.
Use the :py:func:`~orca.orca.add_column` function or
the :py:func:`~orca.orca.column` decorator::

    s = pd.Series(['a', 'b', 'c'])
    orca.add_column('my_table', 'my_col', s)

    @orca.column('my_table')
    def my_col_x2(my_table):
        df = my_table.to_frame(columns=['my_col'])
        return df['my_col'] * 2

In the ``my_col_x2`` function we use the ``columns=`` keyword on
:py:meth:`~orca.orca.DataFrameWrapper.to_frame` to get only
the one column necessary for our calculation. This can be useful for
avoiding unnecessary computation or to avoid recursion (as would happen
in this case if we called ``to_frame()`` with no arguments).

Accessing columns on a table is such a common occurrence that there
are additional ways to do so without first calling ``to_frame()``
to create an actual ``DataFrame``.

:py:class:`~orca.orca.DataFrameWrapper` supports accessing
individual columns in the same ways as ``DataFrames``::

    @orca.column('my_table')
    def my_col_x2(my_table):
        return my_table['my_col'] * 2  # or my_table.my_col * 2

Or you can use an expression to have a single column injected into a function::

    @orca.column('my_table')
    def my_col_x2(data='my_table.my_col'):
        return data * 2

In this case, the label ``data``, expressed as ``my_table.my_col``,
refers to the column ``my_col``, which is a pandas Series_ within
the table ``my_table``.

A demonstration in IPython using the column definitions from above:

.. code-block:: python

    In [29]: wrapped = orca.get_table('my_table')

    In [30]: wrapped.columns
    Out[30]: ['a', 'my_col', 'my_col_x2']

    In [31]: wrapped.local_columns
    Out[31]: ['a']

    In [32]: wrapped.to_frame()
    Out[32]:
       a my_col_x2 my_col
    0  1        aa      a
    1  2        bb      b
    2  3        cc      c

:py:class:`~orca.orca.DataFrameWrapper` has
:py:attr:`~orca.orca.DataFrameWrapper.columns`
and :py:attr:`~orca.orca.DataFrameWrapper.local_columns`
attributes that, respectively, list all the columns on a table and
only those columns that are part of the underlying DataFrame.

Columns are stored separately from tables, so it is safe to define a column
on a table and then replace that table with something else. The column
will remain associated with the table.

Injectables
-----------

You will probably want to have things besides tables injected into functions,
for which Orca has "injectables". You can register *anything* and have
it injected into functions.
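
Conceptually, injection works by matching a function's argument names against
registered names. The following is a rough sketch of that idea in plain
Python, not Orca's actual code: ``registry``, ``register``, ``resolve``, and
``call_with_injection`` are hypothetical names used only for illustration.

.. code-block:: python

    import inspect

    # Hypothetical store of registered values: name -> (value, autocall flag).
    registry = {}


    def register(name, value, autocall=True):
        registry[name] = (value, autocall)


    def call_with_injection(func):
        # Look up each argument name in the registry and pass the result in.
        names = inspect.signature(func).parameters
        return func(**{name: resolve(name) for name in names})


    def resolve(name):
        value, autocall = registry[name]
        if callable(value) and autocall:
            # Registered functions are evaluated before being injected.
            return call_with_injection(value)
        # Plain values, and functions with autocall=False, are passed as-is.
        return value


    register('z', 5)
    register('power', lambda x, y: x ** y, autocall=False)


    def zsquared(z, power):
        return power(z, 2)


    register('zsquared', zsquared)
    result = resolve('zsquared')

Here ``zsquared`` receives the value ``5`` for ``z`` and the unevaluated
function for ``power``, because only ``power`` was registered with
``autocall=False``.
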
Use the :py:func:`~orca.orca.add_injectable` function or the
:py:func:`~orca.orca.injectable` decorator::

    orca.add_injectable('z', 5)

    @orca.injectable(autocall=False)
    def pow(x, y):
        return x ** y

    @orca.injectable()
    def zsquared(z, pow):
        return pow(z, 2)

    @orca.table()
    def ztable(my_table, zsquared):
        df = my_table.to_frame(columns=['a'])
        return df * zsquared

By default injectable functions are evaluated before injection and the return
value is passed into other functions. Use ``autocall=False`` to disable this
behavior and instead inject the function itself.
Like tables and columns, injectable functions that are automatically evaluated
can have their results cached with ``cache=True``.

Functions that are not automatically evaluated can also have their results
cached using the ``memoize=True`` keyword along with ``autocall=False``.
A memoized injectable will cache results based on the function inputs,
so this only works if the function inputs are hashable
(usable as dictionary keys).
Memoized functions can have their caches cleared manually using their
``clear_cached`` function attribute.
The caches of memoized functions are also hooked into the global Orca
caching system, so you can also manage their caches via the ``cache_scope``
keyword argument and the :py:func:`~orca.orca.clear_cache` function.

An example of the above injectables in IPython:

.. code-block:: python

    In [38]: wrapped = orca.get_table('ztable')

    In [39]: wrapped.to_frame()
    Out[39]:
        a
    0  25
    1  50
    2  75

Caching
-------

Orca has a cache system so that function results can be stored for re-use
when it is not necessary to recompute them every time they are used.

The decorators :py:func:`~orca.orca.table`, :py:func:`~orca.orca.column`, and
:py:func:`~orca.orca.injectable` all take two keyword arguments related to
caching: ``cache`` and ``cache_scope``.

By default results are not cached. Register functions with ``cache=True``
to enable caching of their results.

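
The effect of ``cache=True`` can be pictured with a small sketch in plain
Python. This is an illustration under simplifying assumptions, not how Orca
stores results: ``_cache`` and ``evaluate`` are hypothetical names, and the
``calls`` list exists only to count evaluations.

.. code-block:: python

    # Hypothetical global cache: registered name -> stored result.
    _cache = {}


    def evaluate(name, func, cache=False):
        """Return func(), reusing a stored result when cache=True."""
        if cache and name in _cache:
            return _cache[name]
        result = func()
        if cache:
            _cache[name] = result
        return result


    calls = []  # one entry is appended per real evaluation


    def expensive():
        calls.append(1)
        return sum(range(1000))


    evaluate('total', expensive)               # evaluated
    evaluate('total', expensive)               # evaluated again, nothing stored
    uncached_calls = len(calls)

    evaluate('total', expensive, cache=True)   # evaluated and stored
    evaluate('total', expensive, cache=True)   # served from the cache
    cached_calls = len(calls) - uncached_calls

    _cache.clear()                             # analogous to clearing the cache
    evaluate('total', expensive, cache=True)   # evaluated again after clearing

Without caching the function runs on every use; with caching it runs once
and then again only after the stored result has been cleared.
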
Cache Scope
~~~~~~~~~~~

Cached items have an associated "scope" that allows Orca to automatically
manage how long functions have their results cached before re-evaluating them.
The three scope settings are:

* ``'forever'`` (the default setting) -
  Results are cached until manually cleared by user commands.
* ``'iteration'`` -
  Results are cached for the remainder of the current pipeline iteration.
* ``'step'`` -
  Results are cached until the current pipeline step finishes.

An item's cache scope can be modified using
:py:func:`~orca.orca.update_injectable_scope`,
:py:func:`~orca.orca.update_table_scope`, or
:py:func:`~orca.orca.update_column_scope`. Omitting the scope or passing
``None`` turns caching off for the item. These functions were added in
Orca v1.6.

Disabling Caching
~~~~~~~~~~~~~~~~~

There may be situations, especially during testing, that require disabling
the caching system.

Caching can be turned off globally using the
:py:func:`~orca.orca.disable_cache` function
(and turned back on by :py:func:`~orca.orca.enable_cache`).

To run a block of commands with the cache disabled, but have it automatically
re-enabled, use the :py:func:`~orca.orca.cache_disabled` context manager::

    with orca.cache_disabled():
        result = orca.eval_variable('my_table')

Manually Clearing Cache
~~~~~~~~~~~~~~~~~~~~~~~

Orca's entire cache can be cleared using :py:func:`~orca.orca.clear_cache`.

Cache can also be cleared manually for individual items, to allow finer
control over re-computation. These functions were added in Orca v1.6.

To clear the cached value of an injectable, use
:py:func:`~orca.orca.clear_injectable`. To clear the cached copy of an entire
table, use :py:func:`~orca.orca.clear_table`.

A dynamically generated column can be cleared using
:py:func:`~orca.orca.clear_column`::

    orca.clear_column('my_table', 'my_col')

To clear all dynamically generated columns from a table, use
:py:func:`~orca.orca.clear_columns`::

    orca.clear_columns('my_table')

Or clear a subset of the columns like this::

    orca.clear_columns('my_table', ['col1', 'col2'])

Steps
-----

A step is a function run by Orca with argument matching.
Use the :py:func:`~orca.orca.step` decorator to register a step function.
Steps are generally important for their side effects; their
return values are discarded during pipeline runs.
For example, a step might replace a column
in a table (a new table, though similar to ``my_table`` above)::

    df = pd.DataFrame({'a': [1, 2, 3]})
    orca.add_table('new_table', df)

    @orca.step()
    def replace_col(new_table):
        new_table['a'] = [4, 5, 6]

Or update some values in a column::

    @orca.step()
    def update_col(new_table):
        s = pd.Series([99], index=[1])
        new_table.update_col_from_series('a', s)

Or add rows to a table::

    @orca.step()
    def add_rows(new_table):
        new_rows = pd.DataFrame({'a': [100, 101]}, index=[3, 4])
        df = new_table.to_frame()
        df = pd.concat([df, new_rows])
        orca.add_table('new_table', df)

The first two of the above examples update ``new_table``'s underlying
DataFrame and so require it to be a :py:class:`~orca.orca.DataFrameWrapper`.
If your table is a wrapped function, not a DataFrame, you can update
columns by replacing them entirely with a new Series_ using the
:py:func:`~orca.orca.add_column` function.

A demonstration of running the above steps:

.. code-block:: python

    In [68]: orca.run(['replace_col', 'update_col', 'add_rows'])
    Running step 'replace_col'
    Running step 'update_col'
    Running step 'add_rows'

    In [69]: orca.get_table('new_table').to_frame()
    Out[69]:
         a
    0    4
    1   99
    2    6
    3  100
    4  101

In the context of a simulation steps can be thought of as model steps
that will often advance the simulation by updating data.
Steps are plain Python functions, though, and there is no restriction
on what they are allowed to do.

Running Pipelines
-----------------

You start pipelines by calling the :py:func:`~orca.orca.run` function and
listing which steps you want to run.
Calling :py:func:`~orca.orca.run` with just a list of steps,
as in the above example, will run through the steps once.
To run the pipeline over a sequence of values, provide those values
to :py:func:`~orca.orca.run` using the ``iter_vars`` argument.

The ``iter_var`` injectable stores the current value from the ``iter_vars``
argument to the :py:func:`~orca.orca.run` function. The ``iter_step``
injectable is a ``namedtuple`` with fields named ``step_num`` and
``step_name``, stored in that order.
``step_num`` is a zero-based index based on the list of step names passed to
the :py:func:`~orca.orca.run` function.

.. code-block:: python

    In [77]: @orca.step()
       ....: def print_year(iter_var, iter_step):
       ....:     print('*** the iteration value is {} ***'.format(iter_var))
       ....:     print('*** step number {0} is named {1} ***'.format(iter_step.step_num, iter_step.step_name))
       ....:

    In [78]: orca.run(['print_year'], iter_vars=range(2010, 2015))
    Running iteration 1 with iteration value 2010
    Running step 'print_year'
    *** the iteration value is 2010 ***
    *** step number 0 is named print_year ***
    Time to execute step 'print_year': 0.00 s
    Total time to execute iteration 1 with iteration value 2010: 0.00 s
    Running iteration 2 with iteration value 2011
    Running step 'print_year'
    *** the iteration value is 2011 ***
    *** step number 0 is named print_year ***
    Time to execute step 'print_year': 0.00 s
    Total time to execute iteration 2 with iteration value 2011: 0.00 s
    Running iteration 3 with iteration value 2012
    Running step 'print_year'
    *** the iteration value is 2012 ***
    *** step number 0 is named print_year ***
    Time to execute step 'print_year': 0.00 s
    Total time to execute iteration 3 with iteration value 2012: 0.00 s
    Running iteration 4 with iteration value 2013
    Running step 'print_year'
    *** the iteration value is 2013 ***
    *** step number 0 is named print_year ***
    Time to execute step 'print_year': 0.00 s
    Total time to execute iteration 4 with iteration value 2013: 0.00 s
    Running iteration 5 with iteration value 2014
    Running step 'print_year'
    *** the iteration value is 2014 ***
    *** step number 0 is named print_year ***
    Time to execute step 'print_year': 0.00 s
    Total time to execute iteration 5 with iteration value 2014: 0.00 s

Running Orca Components a la Carte
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It can be useful to have Orca evaluate single variables and steps,
especially during development and testing.
To achieve this, use the
:py:func:`~orca.orca.eval_variable` and
:py:func:`~orca.orca.eval_step` functions.

``eval_variable`` takes the name of a variable (including variable
expressions) and returns that variable as it would be injected into a
function by Orca. ``eval_step`` takes the name of a step, runs that step
with variable injection, and returns any result.

.. note::
   Most steps don't have return values because Orca ignores them,
   but they can be useful for testing.

Both :py:func:`~orca.orca.eval_variable` and
:py:func:`~orca.orca.eval_step` take arbitrary keyword arguments that are
temporarily turned into injectables within Orca while the evaluation
is taking place. When the evaluation is complete Orca's state is
reset to whatever it was before calling the ``eval`` function.

An example of :py:func:`~orca.orca.eval_variable`:

.. code-block:: python

    In [15]: @orca.injectable()
       ....: def func(x, y):
       ....:     return x + y
       ....:

    In [16]: orca.eval_variable('func', x=1, y=2)
    Out[16]: 3

The keyword arguments are only temporarily set as injectables,
which can lead to errors in a situation like this with a table
where the evaluation of the table is delayed until
:py:meth:`~orca.orca.DataFrameWrapper.to_frame` is called:

.. code-block:: python

    In [12]: @orca.table()
       ....: def table(x, y):
       ....:     return pd.DataFrame({'a': [x], 'b': [y]})
       ....:

    In [13]: orca.eval_variable('table', x=1, y=2)
    Out[13]:

    In [14]: orca.eval_variable('table', x=1, y=2).to_frame()
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
     in ()
    ----> 1 orca.eval_variable('table', x=1, y=2).to_frame()

    KeyError: 'y'

To keep the injectables set for a controlled duration, use the
:py:func:`~orca.orca.injectables` context manager:

.. code-block:: python

    In [12]: @orca.table()
       ....: def table(x, y):
       ....:     return pd.DataFrame({'a': [x], 'b': [y]})
       ....:

    In [20]: with orca.injectables(x=1, y=2):
       ....:     df = orca.eval_variable('table').to_frame()
       ....:

    In [21]: df
    Out[21]:
       a  b
    0  1  2

Archiving Data
~~~~~~~~~~~~~~

An option to the :py:func:`~orca.orca.run` function is to have it save
table data at set intervals.
Tables (and only tables) are saved as DataFrames_ to an HDF5 file via pandas'
``HDFStore`` feature. If Orca is running only one loop the tables are stored
under their registered names. If it is running multiple iterations the tables
are stored under names like ``'<iter_var>/<table name>'``.
For example, if ``iter_var`` is ``2020`` the "buildings" table would be
stored as ``'2020/buildings'``.
The ``out_interval`` keyword to :py:func:`~orca.orca.run` controls how often
the tables are saved out. For example, ``out_interval=5`` saves tables
every fifth iteration. In addition, the final data is always saved
under the key ``'final/<table name>'``.

Argument Matching
-----------------

A key feature of Orca is that it matches the names of function arguments
to the names of registered variables in order to inject variables
when evaluating functions.
For that reason, it's important that variables be registered with names
that are also valid Python variable names.

Variable Expressions
~~~~~~~~~~~~~~~~~~~~

Argument matching is extended by a feature we call "variable expressions".
Expressions allow you to specify a variable to inject with Python keyword
arguments. Here's an example redone from above using
variable expressions::

    @orca.table()
    def halve_my_table(data='my_table'):
        df = data.to_frame()
        return df / 2

The variable registered as ``'my_table'`` is injected into this function
as the argument ``data``.

Expressions can also be used to refer to columns within a registered table::

    @orca.column('my_table')
    def halved(data='my_table.a'):
        return data / 2

In this case, the expression ``my_table.a`` refers to the column ``a``,
which is a pandas Series_ within the table ``my_table``. We return
a new Series to register a new column on ``my_table`` using the
:py:func:`~orca.orca.column` decorator. We can take a look in IPython:

.. code-block:: python

    In [21]: orca.get_table('my_table').to_frame()
    Out[21]:
       a  halved
    0  1     0.5
    1  2     1.0
    2  3     1.5

Expressions referring to columns may be useful in situations where a
function requires only a single column from a table and the user would
like to specifically document that in the function's arguments.

API
---

.. currentmodule:: orca.orca

Table API
~~~~~~~~~

.. autosummary::

   add_table
   table
   get_table
   list_tables
   DataFrameWrapper
   TableFuncWrapper

Column API
~~~~~~~~~~

.. autosummary::

   add_column
   column
   list_columns

Injectable API
~~~~~~~~~~~~~~

.. autosummary::

   add_injectable
   injectable
   get_injectable
   list_injectables

Merge API
~~~~~~~~~

.. autosummary::

   broadcast
   list_broadcasts
   merge_tables

Step API
~~~~~~~~

.. autosummary::

   add_step
   step
   get_step
   list_steps
   run

Cache API
~~~~~~~~~

.. autosummary::

   clear_cache
   disable_cache
   enable_cache
   cache_on
   clear_injectable
   clear_table
   clear_column
   clear_columns
   update_injectable_scope
   update_table_scope
   update_column_scope

API Docs
~~~~~~~~

.. automodule:: orca.orca
   :members:

.. _DataFrame: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
.. _DataFrames: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
.. _Series: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series