
What is a Node?

Nodes are the building blocks of every ETL. They contain the code that either generates or processes Items. Items are the units of data passed between connected Nodes. Since cupyd is a Python framework, an Item can be any picklable Python object.
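A quick way to check whether a candidate Item is picklable is to round-trip it through the standard pickle module. The helper name is_picklable below is hypothetical and not part of cupyd; this is only a sketch using the standard library.

```python
import pickle
import threading


def is_picklable(item) -> bool:
    """Return True if `item` can be serialized with pickle (hypothetical helper)."""
    try:
        pickle.dumps(item)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# Plain data structures survive a pickle round trip unchanged.
record = {"id": 1, "tags": ["a", "b"]}
assert is_picklable(record)
assert pickle.loads(pickle.dumps(record)) == record

# Objects that wrap OS-level resources, such as a lock, cannot be pickled.
assert not is_picklable(threading.Lock())
```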

Categories

Nodes are categorized like this:

  • Extractor
  • Transformer
  • Filter
  • Loader
  • Bulker
  • DeBulker

When implementing your own Nodes, you inherit from one of these categories, except Bulker and DeBulker.

Lifecycle

All Nodes have a lifecycle, which is specified by these functions:

  1. start()

    • The Node sets up its variables and client connections (aka state).
  2. The Node runs continuously. Each Node type will have its own function, to be implemented:

    • Extractor: extract()
    • Transformer: transform(item)
    • Filter: filter(item)
    • Loader: load(item)
  3. finalize()

    • The Node frees up its resources and closes its connections.

If an exception occurs in the Node, the function handle_exception(exception: NodeException) is called, which, by default, calls finalize().

Every Node can be interrupted at any of these lifecycle points if an exception is raised or the ETL is stopped.
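The lifecycle order described above can be sketched with stand-in classes. Everything here is an assumption made for illustration: SketchLoader, the run() driver, and this NodeException definition are hypothetical and do not reproduce how cupyd schedules Nodes internally.

```python
class NodeException(Exception):
    """Stand-in for cupyd's NodeException (hypothetical definition)."""


class SketchLoader:
    """Stand-in for a Loader; not cupyd's real base class."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def load(self, item):
        if item is None:
            raise ValueError("cannot load None")
        self.calls.append(f"load({item})")

    def finalize(self):
        self.calls.append("finalize")

    def handle_exception(self, exception):
        # Default behaviour described above: delegate to finalize().
        self.calls.append("handle_exception")
        self.finalize()


def run(node, items):
    """Hypothetical driver that mimics the lifecycle order."""
    try:
        node.start()
        for item in items:
            node.load(item)
    except Exception as exc:
        node.handle_exception(NodeException(str(exc)))
    else:
        node.finalize()
    return node.calls


# Normal run: start, one load per item, then finalize.
assert run(SketchLoader(), [1, 2]) == ["start", "load(1)", "load(2)", "finalize"]

# Failing run: the exception handler takes over and still finalizes.
assert run(SketchLoader(), [1, None]) == [
    "start", "load(1)", "handle_exception", "finalize",
]
```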

Write your own Nodes

Some guidelines on how to write your own Nodes:

  • Declare all variables in __init__
  • You can accept custom parameters in __init__
  • Variables should be initialized in start()
  • All clients/connections should be created in start()
  • All clients/connections should be closed in finalize()
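A minimal sketch of a Node that follows these guidelines is shown below. To keep it runnable without cupyd installed, it defines its own stand-in Loader base class; with cupyd you would presumably inherit from the framework's Loader instead. LineFileLoader and its path parameter are hypothetical names chosen for this illustration.

```python
import os
import tempfile


class Loader:
    """Stand-in base class so the sketch runs without cupyd installed."""

    def start(self): ...

    def load(self, item): ...

    def finalize(self): ...


class LineFileLoader(Loader):
    """Hypothetical Loader that writes each Item as one line of a text file."""

    def __init__(self, path: str):
        super().__init__()
        # Custom parameter, plus a declaration of every attribute.
        self.path = path
        self.file = None
        self.written = 0

    def start(self):
        # Initialize variables and open the connection here.
        self.written = 0
        self.file = open(self.path, "w", encoding="utf-8")

    def load(self, item):
        self.file.write(f"{item}\n")
        self.written += 1

    def finalize(self):
        # Close whatever start() opened.
        if self.file is not None:
            self.file.close()
            self.file = None


# Drive the Node by hand to show the intended call order.
path = os.path.join(tempfile.mkdtemp(), "items.txt")
loader = LineFileLoader(path)
loader.start()
for item in ["a", "b"]:
    loader.load(item)
loader.finalize()
```

Opening the file in start() rather than in __init__ keeps the constructor free of live resources, which matches the guideline that clients and connections are created in start() and closed in finalize().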

Example

todo

Restrictions

Because cupyd is used by subclassing the archetype Nodes, your own Nodes must respect the following restrictions on naming and overriding attributes and functions.

Attributes you CANNOT override:

  • _id: autogenerated, internal ID for the Node.
  • _name: name of the Node. Will be autogenerated if none was provided. Can be set with the name() property.
  • _input: Node that will send Items to this Node.
  • _outputs: Nodes that will receive outputted Items from this Node.
  • _configuration: stores the configuration parameters cupyd defines for this Node based on its category. Can be updated with the configuration() property.

Functions you CANNOT override:

  • __str__: used to represent a Node as a str: its name, if one was provided, otherwise the CamelCase class name of the Node.
  • __repr__: calls __str__. Useful when debugging.
  • __rshift__: used to make connections between Nodes, that is, to set _input and _outputs.
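To illustrate why these names are reserved, here is a rough sketch of how a base class could implement them. SketchNode, Source, and Sink are hypothetical stand-ins, not cupyd's actual code, and the choice to return the right-hand Node from __rshift__ (so that connections can be chained) is an assumption.

```python
class SketchNode:
    """Hypothetical stand-in for a Node base class."""

    def __init__(self, name=None):
        self._name = name
        self._input = None
        self._outputs = []

    def __str__(self):
        # Use the given name, or fall back to the class name.
        return self._name or type(self).__name__

    def __repr__(self):
        return str(self)

    def __rshift__(self, other):
        # `self >> other` wires both directions of the connection.
        self._outputs.append(other)
        other._input = self
        return other  # assumption: allows a >> b >> c


class Source(SketchNode):
    pass


class Sink(SketchNode):
    pass


source, sink = Source(), Sink(name="warehouse")
source >> sink

assert sink._input is source
assert source._outputs == [sink]
assert repr(source) == "Source"
assert str(sink) == "warehouse"
```

A subclass that reused _input or _outputs for its own data, or redefined __rshift__, would silently break this wiring, which is the reason for the restrictions listed above.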