Ingesting Files with Different Metadata Automatically in SSIS

When ingesting multiple files from a folder in SSIS, a common pattern is a Foreach Loop task with a data flow inside the loop. This approach assumes that all the files share the same columns and can therefore be loaded into a single table. In practice, however, the files to be ingested often have different metadata, meaning each file can contain a different set of columns. One possible solution is to create a separate package (data flow) for each type of destination table and use the Execute Package task within the loop to call the appropriate package (as described in this article). That approach requires creating and maintaining multiple packages and workflows and mapping every column in each data flow, which quickly becomes cumbersome. A simpler, more flexible approach is preferable.

Solution

We will utilize BULK INSERT within the Foreach Loop task. However, it is necessary to create all the target tables beforehand, ensuring that the columns in the tables match those in the files. Although SSIS provides a Bulk Insert task, its options are limited. To achieve maximum flexibility, we will employ an Execute SQL task and construct our own BULK INSERT statement.
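
For each file, the Execute SQL task will build and run a statement like the following sketch, which truncates the target table and then bulk inserts the file. The table name customers, the file customers.csv, and the folder C:\SourceFiles\ are purely illustrative:

TRUNCATE TABLE dbo.customers;
BULK INSERT dbo.customers
FROM 'C:\SourceFiles\customers.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2);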

Assumptions

        1.       All the files are standard CSV files with the first row as the header (the FORMAT = 'CSV' option used below requires SQL Server 2017 or later).

        2.       The file names will be used as the corresponding table names in the database.

        3.       The necessary permissions have been granted to perform BULK INSERT.
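
As a minimal sketch of those permissions (assuming a hypothetical login/user named stg_loader behind the STG connection): BULK INSERT needs the server-level ADMINISTER BULK OPERATIONS permission plus INSERT on the target table, and the TRUNCATE, ALTER TABLE, and UPDATE statements used later need ALTER and UPDATE on the table. Note also that BULK INSERT reads the file from the database server's file system, so the SQL Server service account must be able to read the source folder.

-- Server level (run in master); stg_loader is a hypothetical login
GRANT ADMINISTER BULK OPERATIONS TO stg_loader;
-- Database level; stg_loader is the corresponding database user
GRANT INSERT, ALTER, UPDATE ON dbo.customers TO stg_loader;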

Detailed Steps

        1.       Create the target tables:

·         Create all the necessary target tables in the database, ensuring that the columns in the tables match the columns in the files. Use appropriate data types wherever possible.
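
For example, a hypothetical customers.csv with the columns customer_id, customer_name, and city would need a matching table such as:

CREATE TABLE dbo.customers
(
    customer_id   INT,
    customer_name VARCHAR(100),
    city          VARCHAR(100)
);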

        2.       Create a connection manager:

·         In the SSIS package, create a connection manager named "STG" to connect to the database.

        3.       Define variables:

·         In the SSIS package, create the following two variables:

o   current_file (string): Leave it blank.

o   sql_ingest_file (string): Set EvaluateAsExpression to True and use the expression below. Because current_file holds only the file name (no folder or extension, per the Foreach Loop settings in the next step), the expression must add the source folder and the .csv extension itself; replace C:\SourceFiles\ with your own folder (backslashes are escaped as \\ in SSIS expressions):

"TRUNCATE TABLE dbo." + @[User::current_file] + "; BULK INSERT dbo." + @[User::current_file] + " FROM 'C:\\SourceFiles\\" + @[User::current_file] + ".csv' WITH (FORMAT = 'CSV', FIRSTROW = 2);"

        4.       Configure the Foreach Loop task:

·         Within the SSIS package, add a Foreach Loop task and configure it as follows:

o   In the Collection tab:

§  Choose Foreach File Enumerator.

§  Specify the folder containing the files.

§  Select "Names only" under "Retrieve file name".

o   In the Variable Mappings tab:

§  Map the variable current_file to Index 0.

        5.       Add an Execute SQL task:

·         Inside the Foreach Loop task, add an Execute SQL task called "Ingest File" with the following specifications in the General tab:

o   Connection: STG.

o   SQLSourceType: Variable.

o   SourceVariable: User::sql_ingest_file.

Adding an ETL Timestamp to the Tables

While BULK INSERT offers a much simpler way to ingest multiple files than building separate packages, it comes with certain limitations. One such limitation is that the columns in the target table must exactly match the columns in the file being ingested, whereas it is common practice to include an ETL timestamp column in the target table. This cannot be handled inside the BULK INSERT statement itself, but we can still achieve the desired outcome with a few additional tasks: the timestamp column is dropped before each load, re-added immediately after it, and then populated. Here are the detailed steps:

        1.       Add a DATETIME column named etl_timestamp to each target table in the database, for example:
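
For the hypothetical customers table, this one-time change would be:

ALTER TABLE dbo.customers ADD etl_timestamp DATETIME;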

        2.       Modify the sql_ingest_file variable expression:

·         Append the following to the end of the expression in the sql_ingest_file variable (the leading + concatenates it onto the existing string):

+ " ALTER TABLE dbo." + @[User::current_file] + " ADD etl_timestamp DATETIME;"

        3.       Create two additional variables:

·         sql_add_timestamp: Define this variable with the expression: "UPDATE dbo." + @[User::current_file] + " SET etl_timestamp = GETDATE();"

·         sql_drop_timestamp: Define this variable with the expression: "ALTER TABLE dbo." + @[User::current_file] + " DROP COLUMN etl_timestamp;"

        4.       Add Execute SQL tasks for timestamp management:

·         Within the Foreach Loop task, add an Execute SQL task called "Drop Timestamp" and connect it to the existing "Ingest File" task with a precedence constraint so that it runs first. Configure it as follows:

o   Connection: STG.

o   SQLSourceType: Variable.

o   SourceVariable: User::sql_drop_timestamp.

·         Add another Execute SQL task called "Add Timestamp" and connect it after the existing "Ingest File" task with a precedence constraint. Configure it as follows:

o   Connection: STG.

o   SQLSourceType: Variable.

o   SourceVariable: User::sql_add_timestamp.
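
Putting it all together, each iteration of the loop now runs the following sequence against the (illustrative) customers table: Drop Timestamp, then Ingest File, then Add Timestamp.

-- Drop Timestamp
ALTER TABLE dbo.customers DROP COLUMN etl_timestamp;
-- Ingest File
TRUNCATE TABLE dbo.customers;
BULK INSERT dbo.customers
FROM 'C:\SourceFiles\customers.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2);
ALTER TABLE dbo.customers ADD etl_timestamp DATETIME;
-- Add Timestamp
UPDATE dbo.customers SET etl_timestamp = GETDATE();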

Conclusion

By following these steps, you can efficiently handle files with different metadata in SSIS, automatically ingesting them into the corresponding tables while maintaining the option to add an ETL timestamp if necessary.
