Quick start

Let's get started. This page gives you a brief overview of the basic concepts and usage.

The idea of ETLBox

ETLBox is a lightweight ETL (extract, transform, load) library and data integration toolbox for .NET. You can use it to read almost any data from different sources, connect these sources with highly customizable transformations and finally export your data into the most common databases, file types or other destinations. While data is processed in memory, you can choose between different transformation components to harmonize, filter, aggregate, validate and clean your data.

ETLBox is a fully functional alternative to other ETL tools like SqlServer Integrations Services (SSIS) or Azure Data Factory. The main difference to other toolsets is that it doesn’t come with a User Interface. But no worries! Working with ETLBox will convince you that creating data flows programmatically is the number one choice for most of your ETL related scenarios.

Getting ETLBox

ETLBox is written in C# and targets .NET Standard 2.0 and higher. It can be used by almost every .NET version   out there (as long as they is still supported by Microsoft). If you are still using .NET Framework, we recommend to use a Framework version > 4.7.2.

You will find all ETLBox packages hosted on nuget   . There is one package which you will always need when working with ETLBox: the core ETLBox   package which contains default connectors and transformations.

Add it to your project using your package manager.

dotnet add package ETLBox

The connectors are in separate packages - depending on your needs, choose the right connector package from the list (e.g. pick the ETLBox.SqlServer   and ETLBox.Csv   packages if you want to copy data from Csv file into a SqlServer table).

Data Flow and Control Flow

ETLBox is split into two main components: Data Flow components and Control Flow Tasks. The Data Flow part offers the core ETL components. The tasks in the Control Flow allow you to manage your databases with a simple syntax. Both components come with customizable logging functionalities based on NLog   .

Creating your first data flow

Let’s dive directly into creating our first ETL code.

ETLBox comes with a big set of Data Flow components to construct your own ETL pipeline. To create a data flow pipeline, you basically need to follow three steps:

  1. Define your dataflow components
  2. Link the components
  3. Start reading the data from the source

Now we just need to wait at the destination for all data to arrive, while the source will start reading and post its data into the components connected to its output. As soon as a connected component retrieves any data in its input, the component will start with processing the data and then send it further down the line to its own connected components. The dataflow will finish when all data from the sources are read, processed by the transformations and arrived in the destinations.

Transformations are not always needed - you can directly connect a source to a destination. Normally, each source has one output, each destination one input and each transformation at least one input and one or more outputs.

All data is processed asynchronously by the components. Each component has its own set of buffers, so while the source is still reading data, the transformations can already process it and the destinations can start writing the processed information into their target. So in an optimal flow only the currently processed row is stored in memory. Depending on the processing speed of your components, the buffer of each component can store additional rows to optimize throughput. E.g., database destinations will always wait until their buffer contains enough data to perform a bulk insert which is much faster than insert each row one by one.

Setup

This example data flow will read data from a MySql database, modify a value and then store the modified data in a Sql Server table and a csv file, depending on a filter expression.

This setup requires the following: A source table in a MySql database, a destination table in a SqlServer database and some space for the Csv file.

Let’s start with creating the tables on the corresponding databases. If you have docker installed, it is quite easy to run a MySql database in a container   - which is also true for SqlServer   .

--Run on mySql:
CREATE TABLE  `SourceTable` (
    Id INT NOT NULL,
    Value INT NULL
)

INSERT INTO SourceTable (Id,Value) VALUES (1, 10)
INSERT INTO SourceTable (Id,Value) VALUES (1, 20)

--Run on SqlServer
CREATE TABLE  DestinationTable (
    Id INT NOT NULL,
    Result INT NULL
)

Now let’s create a strongly typed object, also know as POCO (Plain old component object) that we can later use in our data flow.

public class MyRow {

    public int Id { get;set; }
    
    [ColumnMap("Result")]
    public int Value { get;set; }
}

You may have noticed that the Destination table doesn’t have a column Value, but instead the column Result. By default ETLBox will try to match the column names with your property names. As we have a property called Id, this will match with the column Id in the destination table. The ColumnMap attribute above the Property Value will take care of doing the right mapping.

Step 1 - creating the components

Step 1 is to create a source, the transformations and destinations. This is the code snipped for this task:

string mySqlConnString = 
  @"Server=localhost;Database=ETLBox;Uid=user;Pwd=pw;";
string sqlServerConnString = 
  @"Data Source=.;Trusted_Connection=true;Initial Catalog=ETLBox;";

var mySqlCon = new MySqlConnectionManager(mySqlConnString);
var sqlConn = new SqlConnectionManager(sqlServerConnString);

var mySqlSource = new DbSource<MyRow>(mySqlCon, "SourceTable");
var rowTrans = new RowTransformation<MyRow>(
    row => {  
        row.Value += 5;
        return row;
    });
var multicast = new Multicast<MyRow>();
var sqlDest = new DbDestination<MyRow>(sqlConn, "DestinationTable");
var csvDest = new CsvDestination<MyRow>("example.csv");

The first lines create the connection managers needed to establish the connection the proper database. The Connection Manager for MySql and SqlServer are then used to construct our destination components: A MySql database source and a SqlServer database destination. Also, we construct a RowTransformation that changes the value of each rows read from the source. The multicast will be used to brodacast our data into two destination: The SqlServer destination and additionally a csv destination.

Step 2 - linking

Now we are all setup with the setup. Let’s continue with step 2 and link the components together.

mySqlSource.LinkTo(rowTrans);
rowTrans.LinkTo(multicast);
multicast.LinkTo(sqlDest, row => row.FilterValue > 0);
multicast.LinkTo(csvDest, row => row.FilterValue < 0);

We use the LinkTo - method to create our data flow network. The MySql Source will send data to the RowTransformation. The transformed value is send to the Multicast which sends a copy of each row into the Sql Server destination and the Csv destination.

Step 3 - execution

Finally, step 3 is to start the dataflow at the source (and wait for the destinations to receive all data). To simplify your programmer’s life, this can be either with a synchronous call that blocks execution until the flow has finished.

Network.Execute(source);

Now let’s see our producued output. If we run SELECT * FROM DestinationTable on our SqlServer database, we should get the following resutl:

IdResult
115
225

And our csv file example.csvcontains this data:

Id,Result
1,15
2,25

Using the Control Flow tasks

Control Flow Tasks let you directly communicate with your database: they allow you to create or delete databases, tables, procedures, schemas or other objects in your database. There are also tasks to truncate a tables, count the rows or execute just some Sql code. The advantage of the tasks is that they work on every supported database (e.g. creating a table could need different Sql on MySql and SqlServer), but with only one line of easy-to-read C# code. This improves the readability of your code a lot, encapsulates your logic and gives you more time to focus on your real work. It also elinates the need to write or wrap the “boilerplate” code which is needed for ADO.NET.

Here is some example code that shows how run database commands:

//Create a connection manager 
var conn = new SqlConnectionManager
  ("Server=.;Trusted_Connection=true;Initial Catalog=ETLBox");
//Execute some Sql
SqlTask.ExecuteNonQuery(conn, "Do some sql",$@"EXEC myProc");
//Count rows
int count = RowCountTask.Count(conn, "demo.table1").Value;
//Create a table (works on all supported databases)
var col1 = new TableColumn(name:"key", dataType:"INT", 
        allowNulls:false, isPrimaryKey:true, isIdentity:true);
var col2 = new TableColumn(name:"value",dataType:"NVARCHAR(100)",
        allowNulls:true);
CreateTableTask.Create(conn, "Table1", 
    new List<TableColumn>() { col1, col2 });

Logging

By default, ETLBox uses and extends NLog   . ETLBox already comes with NLog as dependency - so you don’t need to include additional packages from nuget. In order to have the logging activating, you just have to set up a nlog configuration called nlog.config, and create a target and a logger rule. After adding this, you will already get logging output for all tasks and components in ETLBox.

If you add a file nlog.config with the below content to your project (don’t forget to enable “Copy to output”), then all ETLBox components and task will send log output to the console window.

<?xml version="1.0" encoding="utf-8"?>
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd"
      xsi:schemaLocation="NLog NLog.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <rules>
    <logger name="*" minlevel="Info" writeTo="console" />
  </rules>
  <targets>
    <target name="console" xsi:type="Console" />     
  </targets>
</nlog>

Where to continue

Now you are ready to try out ETLBox. All ETLBox packages are available on nuget   . The free versions allows you to process up to 10.000 for a component - this should be enough to perform tests in your test environment. If you are ready to deploy ETLBox in your production environment, please purchase the adequate license. If you want to do some performamce test, you can request a trial key.

To learn more about ETLBox, just continue reading.