Parquet

The ETLBox Parquet Connector makes it easy to read and write Parquet files in your ETL pipelines. It works with both strongly typed objects and dynamic data, so you can flexibly integrate columnar storage into your workflows. Whether you're handling data from local files, web services, or cloud storage, this connector helps you process large datasets efficiently.

Overview

The ETLBox.Parquet package provides the ParquetSource and ParquetDestination components for reading and writing Parquet files in ETL workflows. Parquet is a columnar storage format optimized for efficient data compression and retrieval, making it well-suited for handling large datasets.

ETLBox integrates with the Parquet.NET library and processes Parquet files in a row-based manner, so Parquet data can flow through any other ETL component.

Shared Features

Common functionalities such as resource types (files, HTTP, Azure Blob), streaming, and row modifications are shared across all streaming connectors. See Shared Features for details.
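For example, a ParquetSource can read directly from an HTTP endpoint by switching its resource type. The following is a minimal sketch; the URL is hypothetical, and ResourceType is one of the shared streaming connector features described in Shared Features:

ParquetSource source = new ParquetSource() {
    Uri = "https://example.com/demo.parquet", // hypothetical endpoint
    ResourceType = ResourceType.Http
};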

ParquetSource

ParquetSource reads data from Parquet files and converts it into structured rows. Internally, the columnar format is translated into rows to integrate with ETLBox’s row-based processing.

Reading Parquet as POCOs

You can deserialize Parquet files into strongly typed objects (POCOs).

Example: Reading a Parquet File

If demo.parquet contains two columns, Col1 (integer) and Col2 (string), the following code reads the file into an in-memory list:

public class MyRow
{
    public int Col1 { get; set; }
    public string Col2 { get; set; }
}

ParquetSource<MyRow> source = new ParquetSource<MyRow>() {
    Uri = "demo.parquet"
};
MemoryDestination<MyRow> dest = new MemoryDestination<MyRow>();

source.LinkTo(dest);
Network.Execute(source);
// All rows are available in dest.Data

Mapping Columns with Attributes

If column names in the Parquet file differ from your POCO property names, use the ParquetColumn attribute:

public class MyRow
{
    [ParquetColumn(ColumnName = "Col1")]
    public int Id { get; set; }

    [ParquetColumn(ColumnName = "Col2")]
    public string Value { get; set; }
}

This maps Col1 → Id and Col2 → Value, ensuring the correct column-to-property mapping.

Using Dynamic Objects

If you do not want to define a fixed object structure, ParquetSource supports ExpandoObject for dynamic processing.

ParquetSource source = new ParquetSource() {
    Uri = "demo.parquet"
};
MemoryDestination dest = new MemoryDestination();

source.LinkTo(dest);
Network.Execute(source);
// Data is stored as a list of ExpandoObjects
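
Each row in dest.Data is an ExpandoObject whose property names match the column names in the file. A minimal sketch of accessing the values afterwards:

foreach (dynamic row in dest.Data)
{
    // Properties are created dynamically from the Parquet columns, e.g. Col1 and Col2
    Console.WriteLine($"{row.Col1}: {row.Col2}");
}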

Custom Column Mapping for Dynamic Objects

You can manually specify column mappings when using dynamic objects:

ParquetSource source = new ParquetSource("demo.parquet");
source.ParquetColumns = new[]
{
    new ParquetColumn { PropertyName = "Id", ColumnName = "Col1" },
    new ParquetColumn { PropertyName = "Value", ColumnName = "Col2" }
};

MemoryDestination dest = new MemoryDestination();
source.LinkTo(dest);
Network.Execute(source);

This ensures that the Id and Value properties correctly map to the Col1 and Col2 columns in the Parquet file.

ParquetDestination

ParquetDestination writes structured data into Parquet files, converting rows into a columnar format. By default, row groups are created for every 1000 records, but this can be adjusted using the BatchSize property.
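
For example, to write a row group for every 5000 records instead (a sketch; the best value depends on your data volume and available memory):

var dest = new ParquetDestination("output.parquet");
// Flush a row group to the file after every 5000 records (default: 1000)
dest.BatchSize = 5000;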

Writing Data to a Parquet File

public class MyRow
{
    public int Col1 { get; set; }
    public string Col2 { get; set; }
}

var source = new MemorySource<MyRow>();
source.DataAsList.Add(new MyRow { Col1 = 1, Col2 = "Test1" });
source.DataAsList.Add(new MyRow { Col1 = 2, Col2 = null });
source.DataAsList.Add(new MyRow { Col1 = 3, Col2 = "Test3" });

var dest = new ParquetDestination<MyRow>("output.parquet");
source.LinkTo(dest);
Network.Execute(source);
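
To verify the output, you can read the file back with a ParquetSource, reusing the same MyRow class (a minimal round-trip sketch):

var verifySource = new ParquetSource<MyRow>() {
    Uri = "output.parquet"
};
var verifyDest = new MemoryDestination<MyRow>();
verifySource.LinkTo(verifyDest);
Network.Execute(verifySource);
// verifyDest.Data now contains the three rows written above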

Mapping Columns for Writing

You can change column names and control the order of columns in the Parquet file using the ParquetColumn attribute:

public class MyOrderedRow
{
    [ParquetColumn(ColumnName = "Col1", WriteOrder = 2)]
    public int Id { get; set; }

    public string Clutter { get; set; }

    [ParquetColumn(ColumnName = "Col2", WriteOrder = 1)]
    public string Value { get; set; }
}

var source = new MemorySource<MyOrderedRow>();
source.DataAsList.Add(new MyOrderedRow { Id = 1, Value = "Test1" });
source.DataAsList.Add(new MyOrderedRow { Id = 2, Value = null });
source.DataAsList.Add(new MyOrderedRow { Id = 3, Value = "Test3" });

var dest = new ParquetDestination<MyOrderedRow>("OrderedOutput.parquet");
source.LinkTo(dest);
Network.Execute(source);

This stores Col2 before Col1 in the Parquet file.

Writing Dynamic Objects

Instead of using a fixed object structure, you can store ExpandoObjects in Parquet:

var source = new MemorySource();
dynamic r1 = new ExpandoObject();
r1.Col1 = 1;
r1.Col2 = "Test1";
source.DataAsList.Add(r1);

dynamic r2 = new ExpandoObject();
r2.Col1 = 2;
r2.Col2 = null;
source.DataAsList.Add(r2);

var dest = new ParquetDestination("dynamicOutput.parquet");
source.LinkTo(dest);
Network.Execute(source);

Defining Column Attributes Manually

For dynamic objects, you can manually set column attributes:

var source = new MemorySource();
dynamic r1 = new ExpandoObject();
r1.Id = 1;
r1.Value = "Test1";
source.DataAsList.Add(r1);

var dest = new ParquetDestination("CustomAttributes.parquet");
dest.ParquetColumns = new[]
{
    new ParquetColumn { PropertyName = "Id", ColumnName = "Col1", WriteOrder = 2 },
    new ParquetColumn { PropertyName = "Value", ColumnName = "Col2" }
};
source.LinkTo(dest);
Network.Execute(source);

This explicitly sets column names and write order for dynamic objects.