Distinct

The Distinct transformation will filter out duplicate records.

Overview

The Distinct transformation will filter out duplicate records. For each incoming row, the distinct will create a hash value based on the values in the properties. This hash value will be stored in an internal list. If another record with the same hash values arrives, this record will be filtered out as it is a duplicate. By default, all public properties are used for the hash value generation. You can use the attribute DistinctColumn to specify particular properties.

Buffer

The Distinct is a non blocking transformation - every row is send to the next component after the hash values was generated. The component has a slight bigger memory footprint as it stores a hash value for each incoming row in a memory table.

Code snippet

public class MyRow
{
    [DistinctColumn]
    public int DistinctId { get; set; }

    public string OtherValue { get; set; } 
}

Distinct<MyRow> trans = new Distinct<MyRow>();

Example

We can define a simple object with two properties: Id and Value. Now we set the DistinctColumn attribute for each property.

Now we send the following example data through the Distinct transformation:

IdValue
1A
2A
2B
1A
2A

We would expect that we receive only three rows from the output of the Distinct:

IdValue
1A
2A
2B

Here is the code to prove this:

public class MyRow
{
    [DistinctColumn]
    public int Id { get; set; }
    [DistinctColumn]
    public string Value { get; set; }
}

public static void Main()
{
    var source = new MemorySource<MyRow>();
    source.DataAsList.Add(new MyRow() { Id = 1, Value = "A" });
    source.DataAsList.Add(new MyRow() { Id = 2, Value = "A" });
    source.DataAsList.Add(new MyRow() { Id = 2, Value = "B" });
    source.DataAsList.Add(new MyRow() { Id = 1, Value = "A" });
    source.DataAsList.Add(new MyRow() { Id = 2, Value = "A" });

    var trans = new Distinct<MyRow>();
    var dest = new MemoryDestination<MyRow>();

    source.LinkTo(trans);
    trans.LinkTo(dest);
    Network.Execute(source);

    foreach (var row in dest.Data)
        Console.WriteLine($"Id:{row.Id} Value:{row.Value}");

    //Output
    //Id:1 Value:A
    //Id:2 Value:A
    //Id:2 Value:B
}

Processing the duplicates

The Distinct also allows to send the duplicates into another data flow. This can be easily implemented with the LinkDuplicatesTo method. We could enhance the example above to redirect the duplicates into another destination. Of course we could also use another more complex data flow to process the duplicates.

public class MyRow
{
    [DistinctColumn]
    public int Id { get; set; }
    [DistinctColumn]
    public string Value { get; set; }
}

public static void Main()
{
    var source = new MemorySource<MyRow>();
    source.DataAsList.Add(new MyRow() { Id = 1, Value = "A" });
    source.DataAsList.Add(new MyRow() { Id = 2, Value = "A" });
    source.DataAsList.Add(new MyRow() { Id = 2, Value = "B" });
    source.DataAsList.Add(new MyRow() { Id = 1, Value = "A" });
    source.DataAsList.Add(new MyRow() { Id = 2, Value = "A" });

    var trans = new Distinct<MyRow>();
    var dest = new MemoryDestination<MyRow>();
    var duplicateDest = new MemoryDestination<MyRow>();

    source.LinkTo(trans);
    trans.LinkTo(dest);
    trans.LinkDuplicatesTo(duplicateDest);
    Network.Execute(source);

    foreach (var row in dest.Data)
        Console.WriteLine($"Id:{row.Id} Value:{row.Value}");

    foreach (var row in duplicateDest.Data)
        Console.WriteLine($"Duplicate - Id:{row.Id} Value:{row.Value}");

    //Output
    //Id:1 Value:A
    //Id:2 Value:A
    //Id:2 Value:B
    //Duplicate - Id:1 Value: A
    //Duplicate - Id:2 Value: A
}