Lookup Transformation
Enrich your data with the LookupTransformation. This guide shows you how to add more information to your data as it moves through your data flow, making sure your dataset is complete and useful.
Overview
The LookupTransformation in ETLBox is a powerful component used to enhance your ETL processes by enriching input data with additional information from a secondary data source, referred to as the lookup source. This is essential when you need to combine data from different sources to ensure your dataset is complete and comprehensive.
The lookup is a partially blocking transformation. When a row arrives at the LookupTransformation, the component will start to load the lookup source data. It will block execution until either all lookup data has been loaded (full cache mode) or the relevant lookup data has been loaded into memory (partial cache mode).
Basic Lookup
The core functionality of the LookupTransformation is to merge data from the lookup source with your primary data flow. You start by defining your input and lookup data models. For example:
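A minimal sketch of the two models; the Order and Customer classes used throughout this page are illustrative examples, not part of the library:

```csharp
public class Order
{
    public int OrderNumber { get; set; }
    public int CustomerId { get; set; }
    public string CustomerName { get; set; }   // filled in by the lookup
}

public class Customer
{
    public int CustomerId { get; set; }
    public string Name { get; set; }
}
```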
You then create the sources for both input and lookup data:
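For example, using in-memory sources (any other ETLBox source works the same way; seeding via DataAsList assumes the usual MemorySource pattern):

```csharp
var source = new MemorySource<Order>();
source.DataAsList.Add(new Order { OrderNumber = 1, CustomerId = 10 });
source.DataAsList.Add(new Order { OrderNumber = 2, CustomerId = 20 });

var lookupSource = new MemorySource<Customer>();
lookupSource.DataAsList.Add(new Customer { CustomerId = 10, Name = "Alice" });
lookupSource.DataAsList.Add(new Customer { CustomerId = 20, Name = "Bob" });
```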
Configure the LookupTransformation to specify how the input data should be enriched:
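A sketch of the configuration using the MatchColumns and RetrieveColumns properties; the member names on the MatchColumn and RetrieveColumn objects shown here are assumptions for illustration, so check the API reference for the exact names (the attribute-based setup shown later avoids this explicit configuration):

```csharp
var lookup = new LookupTransformation<Order, Customer>();
lookup.Source = lookupSource;

// Property names on MatchColumn/RetrieveColumn are assumptions for illustration.
lookup.MatchColumns = new List<MatchColumn>
{
    new MatchColumn { InputPropertyName = "CustomerId", LookupSourcePropertyName = "CustomerId" }
};
lookup.RetrieveColumns = new List<RetrieveColumn>
{
    new RetrieveColumn { InputPropertyName = "CustomerName", LookupSourcePropertyName = "Name" }
};
```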
Finally, execute the transformation by linking the sources and destination:
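A sketch of the wiring; recent ETLBox versions start a flow with Network.Execute, while older versions use source.Execute() followed by dest.Wait():

```csharp
var dest = new MemoryDestination<Order>();

source.LinkTo(lookup);
lookup.LinkTo(dest);

Network.Execute(source);   // older versions: source.Execute(); dest.Wait();
```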
After executing the transformation, you can access the enriched data from the destination:
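For a MemoryDestination, the enriched rows are available through its Data collection:

```csharp
foreach (var order in dest.Data)
    Console.WriteLine($"Order {order.OrderNumber}: {order.CustomerName}");
```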
Lookup source types
Any ETLBox source component will do as a source for the lookup transformation. You could use the DbSource to read the lookup data from a database table, or the CsvSource or JsonSource to gather lookup data from a file. When you use the streaming connectors (like JsonSource, CsvSource, XmlSource, etc.) you are not limited to reading data from files - you can also change the ResourceType to ResourceType.Http or ResourceType.AzureBlob to read from a web endpoint or an Azure blob. If you need custom logic to retrieve your lookup data, you can use the CustomSource to define how your lookup source data is gathered.
Caching Modes
The LookupTransformation offers two caching modes to optimize performance.
Full Cache
Full caching loads all the lookup data into memory before processing starts, making it ideal for smaller datasets. To enable full caching, set the CacheMode property:
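A sketch, assuming the CacheMode enum exposes a Full member:

```csharp
lookup.CacheMode = CacheMode.Full;
```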
This will work for all source types.
Partial Cache
Partial caching loads only the necessary lookup data as needed, which is suitable for larger datasets. This helps manage memory usage efficiently:
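A sketch, assuming a DbSource as the lookup source and a CacheMode enum with a Partial member:

```csharp
var connectionManager = new SqlConnectionManager("Data Source=...;");   // any supported connection manager
var lookupSource = new DbSource<Customer>(connectionManager, "CustomerTable");
var lookup = new LookupTransformation<Order, Customer>(lookupSource);
lookup.CacheMode = CacheMode.Partial;
```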
Note
The partial cache will only work in combination with the DbSource as the source for the lookup.
Cache Eviction Policies
When using partial caching, it’s important to manage the cache size. The LookupTransformation provides several eviction policies to control how data is removed from the cache when it reaches its maximum size:
- Least Recently Used (LRU): Evicts the least recently accessed items from the cache.
- Least Frequently Used (LFU): Evicts the least frequently accessed items from the cache.
- First In First Out (FIFO): Evicts the oldest items in the cache first.
- Last In First Out (LIFO): Evicts the most recently added items first.
Set the eviction policy using the EvictionPolicy property:
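A sketch; the enum member name and the MaxCacheSize property used here are assumptions, so check the API reference for the exact names:

```csharp
lookup.CacheMode = CacheMode.Partial;
lookup.EvictionPolicy = EvictionPolicy.LeastRecentlyUsed;   // assumed member name; LFU/FIFO/LIFO analogous
lookup.MaxCacheSize = 10000;                                // assumed property; eviction starts at this size
```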
Partial Cache with custom SQL
By default, the lookup transformation doesn’t know how to retrieve the data from the database source. You can define your own SQL code that describes how to retrieve the corresponding columns from that database to refill your partial lookup cache:
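A purely illustrative sketch; the property name and the way the current key value is referenced are hypothetical and only meant to show the idea:

```csharp
// Hypothetical property name and parameter placeholder - check the API reference for the real syntax.
lookup.Sql = @"SELECT CustomerId, Name
               FROM CustomerTable
               WHERE CustomerId = @MatchValue";
```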
This is only necessary if you want full control over how the lookup retrieves data from the database source. If you provide the MatchColumns and RetrieveColumns, the SQL for retrieving data from the lookup database source is generated automatically.
Handling Multiple Matches
If your lookup key can match multiple records, you can enable handling multiple matches:
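A sketch, using the PermitMultipleEntriesPerKey flag described further below:

```csharp
// Allow more than one lookup record per key.
lookup.PermitMultipleEntriesPerKey = true;
// ApplyRetrievedCacheForMultipleOutputs can then return an array of output rows
// instead of a single row (see the note below).
```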
Note
The difference between ApplyRetrievedCacheForMultipleOutputs and ApplyRetrievedCacheForInput is that the first one can return an array of rows instead of a single row. This allows you to return multiple rows if necessary, which can then be processed by the next component. If you always want to return a single row, use ApplyRetrievedCacheForInput.
Using the CachedData collections
You may have noticed that the ApplyRetrievedCacheToInput or ApplyRetrievedCacheForMultipleOutputs functions give you a CachedData object that holds multiple collections with data from the lookup source. One is called List and is an IEnumerable, which you can use to inspect your retrieved data. But accessing data in this collection can sometimes take a while, depending on the size of your lookup source data. Consider this enumerable a convenient view of your data.
Alternatively, you can use the Items or ItemsCollection dictionaries. The first one will only contain data if you have set PermitMultipleEntriesPerKey to false. If this is set to true, you are allowing the lookup to have more than one match per key, and you will then find all data that matches the key in the ItemsCollection dictionary. Both are dictionaries, and accessing data in them is much faster than using the List property (which is only a wrapper around the values of the currently applicable dictionary).
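A hypothetical sketch of the idea, assuming the delegate receives the input row and the CachedData object (the exact signature may differ):

```csharp
// Hypothetical delegate signature - shown only to illustrate preferring the dictionaries over List.
lookup.ApplyRetrievedCacheToInput = (row, cache) =>
{
    // Items is a dictionary, so lookups by key are fast; iterating cache.List would be slower.
    if (cache.Items.TryGetValue(row.CustomerId, out Customer match))
        row.CustomerName = match.Name;
    return row;
};
```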
These performance considerations only apply if you are using the ApplyRetrievedCacheToInput or ApplyRetrievedCacheForMultipleOutputs functions and access the CachedData object within them. When you use the LookupTransformation via the Match/Retrieve attributes only, you don’t have to worry about performance. Internally, the lookup cache is accessed via the dictionary keys, and everything should run fast.
Defining your own keys for matching
The lookup expects that the incoming object and the lookup object have a unique key that is then used for matching. You can define your own custom logic for how the key is retrieved from both:
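A sketch, assuming the key selectors are exposed as delegate properties; the names used here follow typical ETLBox conventions but may differ, so check the API reference:

```csharp
// Assumed member names for illustration.
lookup.GetInputRecordKeyFunc = inputRow => inputRow.CustomerId;
lookup.GetSourceRecordKeyFunc = lookupRow => lookupRow.CustomerId;
```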
Dynamic Object Support
The LookupTransformation can handle dynamic objects, allowing flexible data structures. This is particularly useful for handling data with varying schemas or for on-the-fly manipulation:
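A sketch, assuming the non-generic components default to ExpandoObject rows:

```csharp
var source = new MemorySource();               // non-generic components work on ExpandoObject rows
var lookupSource = new MemorySource();
var lookup = new LookupTransformation { Source = lookupSource };

dynamic order = new ExpandoObject();
order.CustomerId = 10;
source.DataAsList.Add(order);

dynamic customer = new ExpandoObject();
customer.CustomerId = 10;
customer.Name = "Alice";
lookupSource.DataAsList.Add(customer);
```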
Error Handling
Handling errors during the lookup process is crucial for maintaining data integrity and ensuring the ETL process completes successfully. The LookupTransformation allows you to define custom error handling logic:
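A sketch of one way to capture these errors, by linking an error destination with LinkErrorTo:

```csharp
var errorDest = new MemoryDestination<ETLBoxError>();
lookup.LinkErrorTo(errorDest);
```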
Whenever an exception occurs, the error details and the problematic row are made available to your error handling, allowing you to log the error, skip the row, or take other appropriate actions.
Attribute-Based Configuration
Attributes can be used to define which columns should be used for matching and retrieving data, simplifying the setup of the LookupTransformation. Define your data models with custom attributes:
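A sketch, assuming the attributes are placed on the lookup source model and reference the corresponding property names of the input model (the exact placement can differ between ETLBox versions):

```csharp
public class OrderRow
{
    public int CustomerId { get; set; }
    public string CustomerName { get; set; }
}

public class CustomerRow
{
    [MatchColumn("CustomerId")]        // match against OrderRow.CustomerId
    public int CustomerId { get; set; }

    [RetrieveColumn("CustomerName")]   // write Name into OrderRow.CustomerName
    public string Name { get; set; }
}
```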
Multiple Match and Retrieve Columns
You can define models with multiple match and retrieve columns, allowing for complex data transformations:
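An illustrative sketch with two match columns and two retrieve columns (one of them nullable):

```csharp
public class PriceLookupRow
{
    [MatchColumn("ArticleNumber")]
    public string ArticleNumber { get; set; }

    [MatchColumn("Region")]
    public string Region { get; set; }

    [RetrieveColumn("ListPrice")]
    public decimal? ListPrice { get; set; }

    [RetrieveColumn("Currency")]
    public string Currency { get; set; }
}
```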
The LookupTransformation can handle nullable values in both input and lookup data models, ensuring that all relevant data is considered during the lookup process.
Attributes with dynamic objects
Instead of using classic objects for the flow, the same can be achieved using the (dynamic) ExpandoObject. This object allows you to dynamically add properties to it, and there won’t be any compile-time type checks for this object.
This is how the code looks when using the ExpandoObject:
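Since attributes cannot be attached to an ExpandoObject, a sketch using the MatchColumns and RetrieveColumns properties instead (the member names on MatchColumn/RetrieveColumn are assumptions for illustration):

```csharp
var lookup = new LookupTransformation { Source = lookupSource };   // non-generic, ExpandoObject rows

// Property names on MatchColumn/RetrieveColumn are assumptions for illustration.
lookup.MatchColumns = new List<MatchColumn>
{
    new MatchColumn { InputPropertyName = "CustomerId", LookupSourcePropertyName = "CustomerId" }
};
lookup.RetrieveColumns = new List<RetrieveColumn>
{
    new RetrieveColumn { InputPropertyName = "CustomerName", LookupSourcePropertyName = "Name" }
};
```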
Additional Properties
The MaxBufferSize property sets the maximum buffer size for the lookup transformation. This number determines how many rows are processed for each lookup operation. The default is 1000.
The UseExactNumericTypes property determines whether exact numeric types should be used. For example, the string “7” and the integer 7 have different data types but will still be identified as a match by the lookup. Setting this property to true takes the data types into account.
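A short sketch of both properties:

```csharp
lookup.MaxBufferSize = 500;          // up to 500 rows per lookup batch (default: 1000)
lookup.UseExactNumericTypes = true;  // "7" (string) and 7 (int) no longer count as a match
```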