Ingesting Files with Different Metadata Automatically in SSIS
When ingesting multiple files in a folder using SSIS, a Foreach Loop task is commonly used with a data flow inside the loop. However, this approach assumes that all the files have the same columns, allowing them to be ingested into a single table. Unfortunately, files that need to be ingested often have different metadata, meaning they contain different columns in different files. One possible solution is to create a separate package (data flow) for each type of destination table and use the Execute Package task within the loop to call each package (as described in this article). However, this approach requires creating multiple packages and workflows, mapping all the columns in each data flow, which can be cumbersome. A more desirable approach would be preferred.
Solution
We will utilize BULK INSERT within the Foreach Loop task.
However, it is necessary to create all the target tables beforehand, ensuring
that the columns in the tables match those in the files. Although SSIS provides
a Bulk Insert task, its options are limited. To achieve maximum flexibility, we
will employ an Execute SQL task and construct our own BULK INSERT statement.
Assumptions
1.
All the files are standard csv files with the
first row as the header.
2.
The file names will be used as the corresponding
table names in the database.
3.
The necessary permissions have been granted to
perform BULK INSERT.
Detailed Steps
1.
Create the target tables:
·
Create all the necessary target tables in the
database, ensuring that the columns in the tables match the columns in the
files. Use appropriate data types wherever possible.
2.
Create a connection manager:
·
In the SSIS package, create a connection manager
named "STG" to connect to the database.
3.
Define variables:
·
In the SSIS package, create the following two
variables:
o
current_file (string): Leave it blank.
o
sql_ingest_file (string): Set its expression as
follows:
"TRUNCATE TABLE dbo."
+ @[User::current_file] + ";
" + "BULK INSERT dbo." + @[User::current_file] + " FROM
'" + @[User::current_file] + "' WITH (FORMAT = 'CSV', FIRSTROW = 2);”
4.
Configure the Foreach Loop task:
·
Within the SSIS package, add a Foreach Loop task
and configure it as follows:
o
In the Collection tab:
§
Choose Foreach File Enumerator.
§
Specify the folder containing the files.
§
Select "Names only" under
"Retrieve file name".
o
In the Variable Mappings tab:
§
Map the variable current_file to Index 0.
5.
Add an Execute SQL task:
·
Inside the Foreach Loop task, add an Execute SQL
task called “Ingest File” with the following specifications in the General tab:
o
Connection: STG.
o
SQLSourceType: Variable.
o
SourceVariable: User::sql_ingest_file.
Adding ETL Timestamp into
Tables
While BULK INSERT indeed offers a simpler method for
ingesting multiple files compared to using workflows, it does come with certain
limitations. One such limitation is that the columns in the target table must
match the columns in the file being ingested. However, it is common practice to
include an ETL timestamp in the target table. Although this feature cannot be
implemented directly within the BULK INSERT statement, we can still achieve the
desired outcome by incorporating a few additional tasks. Here are the detailed
steps to accomplish this goal:
1.
Add a datatime column etl_timestamp in each
table in the database.
2.
Modify the sql_ingest_file variable expression:
·
Append the following to the end of the
expression in the sql_ingest_file variable:
“ALTER TABLE
dbo." + @[User::current_file] + " ADD etl_timestamp DATETIME;”
3.
Create two additional variables:
·
sql_add_timestamp: Define this variable with the
expression: "UPDATE " + @[User::current_file] + " SET
etl_timestamp = GETDATE();"
·
sql_drop_timestamp: Define this variable with
the expression: "ALTER TABLE dbo." + @[User::current_file] + "
DROP COLUMN etl_timestamp;"
4.
Add Execute SQL tasks for timestamp management:
·
Within the Foreach Loop task, add an Execute SQL
task called "Drop Timestamp" before the existing "Ingest File"
task. Configure it as follows:
o
Connection: STG.
o
SQLSourceType: Variable.
o
SourceVariable: User::drop_timestamp.
·
Add another Execute SQL task called "Add
Timestamp" after the existing "Ingest File" task. Configure it
as follows:
o
Connection: STG.
o
SQLSourceType: Variable.
o
SourceVariable: User::sql_add_timestamp.
Conclusion
By following these steps, you can efficiently handle files
with different metadata in SSIS, automatically ingesting them into the
corresponding tables while maintaining the option to add an ETL timestamp if
necessary.
Comments
Post a Comment