Fetch data from multiple sources in parallel in PySpark
So, the other day I was working on a data sanity check framework. This framework is supposed to fetch data from multiple tables. For example, the SQL query looks like this —
SELECT .. FROM table_1
UNION ALL
SELECT .. FROM table_2
UNION ALL
SELECT .. FROM table_3 (may fail 🔥🔥)
UNION ALL
.
.
.
SELECT .. FROM table_200 (may fail 🔥🔥)
Now, obviously, I can run this query and Spark
will be smart enough to build a physical plan that runs the subqueries in parallel and then unions them (i.e. concatenates them vertically). But my problem was that some of the SELECT .. FROM table
pieces could potentially fail, taking the whole query down with them. So I had to split the query on UNION ALL
and run the pieces separately.
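The splitting itself is just string work plus a thread pool. Here is a minimal, Spark-free sketch of the idea (the fetch argument is an assumption for illustration — it stands in for whatever actually executes one subquery, e.g. a call to spark.sql):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def split_union_all(query):
    """Split a big UNION ALL query into its individual SELECT pieces."""
    return [part.strip() for part in query.split("UNION ALL")]

def run_subqueries(query, fetch, max_workers=8):
    """Run each piece concurrently, keeping only the ones that succeed.

    `fetch` is a stand-in (an assumption in this sketch) for whatever
    executes a single subquery -- e.g. spark.sql(sub) in a real job.
    Failed pieces are collected instead of failing the whole query.
    """
    pieces = split_union_all(query)
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, sub): sub for sub in pieces}
        for fut in as_completed(futures):
            sub = futures[fut]
            try:
                results.append(fut.result())
            except Exception:
                failed.append(sub)  # log and move on; don't sink the rest
    return results, failed
```

In a real PySpark job you would then concatenate the surviving DataFrames vertically, for example with functools.reduce over DataFrame.unionAll.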
Splitting also opens up the possibility of parallelizing the pieces myself. And this is not limited to tables from a single database — you can potentially UNION ALL
across multiple data sources like Teradata, Postgres, Hive, etc.
Here is an example for Teradata —
I have shown the implementation for Teradata
only, so when you fire a query with Teradata(**credentials).run_query(query),
it will split the query and run the pieces in parallel. You can modify the code for PostgreSQL
and Hive
similarly to the implementation I have provided.
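For readers without access to the gist, here is a minimal sketch of what such a wrapper might look like. The class shape, the JDBC options, and the reduce-based union are assumptions on my part — only the Teradata(**credentials).run_query(query) call shape comes from the text, and the driver class/URL format should be checked against your Teradata JDBC driver:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

class Teradata:
    """Hypothetical sketch: splits a UNION ALL query and reads each
    piece over JDBC in parallel, skipping the pieces that fail."""

    def __init__(self, spark, host, user, password):
        self.spark = spark
        self.url = f"jdbc:teradata://{host}"  # assumed URL format
        self.user, self.password = user, password

    def _read_one(self, subquery):
        # Push each subquery down as a JDBC "dbtable" subselect.
        return (self.spark.read.format("jdbc")
                .option("url", self.url)
                .option("driver", "com.teradata.jdbc.TeraDriver")
                .option("dbtable", f"({subquery}) t")
                .option("user", self.user)
                .option("password", self.password)
                .load())

    def run_query(self, query, max_workers=8):
        pieces = [p.strip() for p in query.split("UNION ALL")]
        frames = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(self._read_one, p) for p in pieces]
            for fut in futures:  # submission order preserved
                try:
                    frames.append(fut.result())
                except Exception:
                    pass  # one failed source shouldn't sink the rest
        # Vertically concatenate whatever survived.
        return reduce(lambda a, b: a.unionAll(b), frames)
```

A PostgreSQL or Hive variant would keep run_query as-is and only change the connection options in _read_one, which is presumably what the post means by modifying the code similarly.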