What is a Node?
Nodes are the building blocks of every ETL. They contain the code that either generates or processes Items, the units of data passed between connected Nodes. Since cupyd is a Python framework, an Item can be any pickleable Python object.
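Since Items must be pickleable, one quick way to check a candidate Item is to try serializing it with the standard pickle module. This is only a minimal sketch: the helper is_pickleable is a hypothetical name used for illustration and is not part of cupyd.

```python
import pickle


def is_pickleable(item):
    """Return True if `item` can be serialized with pickle.

    Hypothetical helper for illustration; not part of cupyd.
    """
    try:
        pickle.dumps(item)
        return True
    except Exception:
        return False


# Plain data structures (dicts, lists, strings, numbers) work as Items.
dict_ok = is_pickleable({"id": 1, "tags": ["a", "b"]})

# Lambdas cannot be serialized by the standard pickle module.
lambda_ok = is_pickleable(lambda x: x)
```

Objects that hold open sockets, file handles or locks usually fail this check too, which is one reason to pass plain data between Nodes.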
Categories
Nodes are categorized like this:
- Extractor
- Transformer
- Filter
- Loader
- Bulker
- DeBulker
When implementing your own Nodes, you will inherit from one of those (except Bulker & DeBulker).
Lifecycle
All Nodes have a lifecycle, which is specified by these functions:
- start(): The Node sets up its variables and client connections (aka state).
- The Node runs continuously. Each Node type has its own function, to be implemented:
  - Extractor: extract()
  - Transformer: transform(item)
  - Filter: filter(item)
  - Loader: load(item)
- finalize(): The Node frees up its resources and closes its connections. In case an exception occurs in the Node, the function handle_exception(exception: NodeException) is called, which, by default, will call finalize().
Every Node could be interrupted at any of those lifecycle points if an exception is raised or the ETL is stopped.
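The lifecycle order can be sketched with a small stand-in class. Everything here is an assumption made for illustration: SketchExtractor and run_node are hypothetical names, the class does not inherit from cupyd, extract() is assumed to be a generator, and a plain exception is used in place of NodeException. How cupyd actually drives Nodes internally may differ.

```python
class SketchExtractor:
    """Stand-in for an Extractor-like Node; NOT cupyd's real base class."""

    def __init__(self):
        self.calls = []      # records lifecycle calls, for illustration only
        self.source = None   # defined here, created in start()

    def start(self):
        # Set up state: variables and connections.
        self.calls.append("start")
        self.source = iter([1, 2, 3])

    def extract(self):
        # Assumed to be a generator that yields Items one at a time.
        self.calls.append("extract")
        for value in self.source:
            if value == 3:
                raise ValueError("simulated failure")
            yield value

    def finalize(self):
        # Free resources and close connections.
        self.calls.append("finalize")
        self.source = None

    def handle_exception(self, exception):
        # Mirrors the default behaviour described above: call finalize().
        self.calls.append("handle_exception")
        self.finalize()


def run_node(node):
    """Hypothetical driver that mimics the lifecycle order described above."""
    items = []
    node.start()
    try:
        for item in node.extract():
            items.append(item)
    except Exception as exc:
        node.handle_exception(exc)
    else:
        node.finalize()
    return items


node = SketchExtractor()
items = run_node(node)
```

In this sketch the third value triggers the simulated failure, so the two Items produced before it are kept and finalize() still runs through handle_exception().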
Write your own Nodes
Some guidelines about how to write your own Nodes:
- Define all variables at __init__
- You can set custom parameters at __init__
- Variables should be initialized at start()
- All clients/connections should be created at start()
- All clients/connections should be closed at finalize()
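The guidelines above might look like this in practice. This is a sketch under stated assumptions, not cupyd's own code: the Loader class below is a minimal stand-in defined locally so the block runs on its own, and FakeClient and TableLoader are hypothetical names. In a real project you would inherit from the Loader that cupyd provides.

```python
class Loader:
    """Minimal local stand-in for a Loader base class (assumption, not cupyd's)."""

    def start(self):
        pass

    def load(self, item):
        raise NotImplementedError

    def finalize(self):
        pass


class FakeClient:
    """Hypothetical client that stores rows in memory."""

    def __init__(self, table):
        self.table = table
        self.rows = []
        self.closed = False

    def insert(self, row):
        self.rows.append(row)

    def close(self):
        self.closed = True


class TableLoader(Loader):
    def __init__(self, table):
        super().__init__()
        self.table = table   # custom parameter, set at __init__
        self.client = None   # defined at __init__, created at start()
        self.loaded = 0      # defined at __init__, initialized at start()

    def start(self):
        # Initialize variables and create clients/connections here.
        self.loaded = 0
        self.client = FakeClient(self.table)

    def load(self, item):
        self.client.insert(item)
        self.loaded += 1

    def finalize(self):
        # Close clients/connections here; guard against a failed start().
        if self.client is not None:
            self.client.close()


# Driving the lifecycle by hand, only to show the order of calls.
loader = TableLoader("events")
loader.start()
loader.load({"id": 1})
client = loader.client
loader.finalize()
```

Keeping __init__ free of live connections means constructing the Node is cheap and side-effect free, and anything opened in start() has a matching close in finalize().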
Example
todo
Restrictions
Since using cupyd relies on subclassing the archetype Nodes, your own Nodes must respect the following restrictions on attribute and function naming and overriding.
Attributes you CANNOT override:
- _id: autogenerated, internal ID for the Node.
- _name: name of the Node. Will be autogenerated if none was provided. Can be set with the name() property.
- _input: Node that will send Items to this Node.
- _outputs: Nodes that will receive outputted Items from this Node.
- _configuration: stores the configuration parameters cupyd defines for this Node based on its category. Can be updated with the configuration() property.
Functions you CANNOT override:
- __str__: represents a Node as a str using its name, if provided, or the CamelCase Node class name otherwise.
- __repr__: calls the above function. Useful when debugging.
- __rshift__: used to make connections between Nodes, i.e. setting the _input and _outputs.
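To see why overriding these would break things, here is a rough sketch of how such methods could work together. SketchNode is a hypothetical stand-in written for illustration, not cupyd's implementation. In particular, returning the right-hand Node from __rshift__ so that connections can be chained is an assumption.

```python
class SketchNode:
    """Illustrative stand-in; NOT cupyd's real Node class."""

    def __init__(self, name=None):
        self._name = name
        self._input = None    # Node that sends Items to this Node
        self._outputs = []    # Nodes that receive Items from this Node

    def __rshift__(self, other):
        # a >> b registers b as an output of a, and a as the input of b.
        self._outputs.append(other)
        other._input = self
        # Assumption: return the right-hand Node so that a >> b >> c chains.
        return other

    def __str__(self):
        # Use the name if provided, else the CamelCase class name.
        return self._name or type(self).__name__

    def __repr__(self):
        return str(self)


a = SketchNode("a")
b = SketchNode("b")
c = SketchNode()
a >> b >> c
```

A subclass that redefined __rshift__ or reused the name _outputs for its own data would silently break the wiring between Nodes, which is what the restrictions above guard against.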