PySpark: fetch data in parallel

Probhakar
2 min read · Apr 12, 2022

Fetch data from multiple sources in parallel in PySpark

So, the other day I was working on a data sanity check framework. This framework has to fetch data from multiple tables, with a SQL query that looks like this —

SELECT .. FROM table_1
UNION ALL
SELECT .. FROM table_2
UNION ALL
SELECT .. FROM table_3 (may fail 🔥🔥)
UNION ALL
.
.
.
SELECT .. FROM table_200 (may fail 🔥🔥)

Now, obviously, I could run this query as-is and Spark would be smart enough to build a physical plan that runs the sub-queries in parallel and then unions them (i.e. concatenates them vertically). My problem was that some of the SELECT .. FROM table_n pieces could fail, which would make the whole query fail. So I had to split the query on UNION ALL and run the pieces separately.
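The split itself is just a string operation. A minimal sketch (the query text here is illustrative):

query = """SELECT .. FROM table_1
UNION ALL
SELECT .. FROM table_2"""

# Break the combined query into independent SELECT statements
sub_queries = [q.strip() for q in query.split("UNION ALL")]
# ['SELECT .. FROM table_1', 'SELECT .. FROM table_2']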

Running the pieces separately also opens up the possibility of parallelizing them ourselves. And this is not limited to tables from a single database: you can potentially UNION ALL results from multiple data sources like Teradata, Postgres, Hive, etc.

Here is an example for Teradata —
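The class name Teradata and the run_query method are from the article; everything else below (the constructor signature, JDBC options, and pool size) is a minimal sketch, assuming Spark's JDBC reader and a thread pool:

from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("parallel-fetch").getOrCreate()

class Teradata:
    """Splits a UNION ALL query and runs each piece over JDBC in parallel."""

    def __init__(self, host, user, password):
        self.url = f"jdbc:teradata://{host}"
        self.user = user
        self.password = password

    def _run_single(self, sub_query):
        # Push one SELECT down to Teradata; return None instead of raising,
        # so one bad table does not sink the whole fetch
        try:
            return (spark.read.format("jdbc")
                    .option("url", self.url)
                    .option("driver", "com.teradata.jdbc.TeraDriver")
                    .option("dbtable", f"({sub_query}) AS t")
                    .option("user", self.user)
                    .option("password", self.password)
                    .load())
        except Exception as exc:
            print(f"Sub-query failed, skipping: {exc}")
            return None

    def run_query(self, query):
        # Split on UNION ALL so each piece succeeds or fails on its own
        sub_queries = [q.strip() for q in query.split("UNION ALL")]
        # Spark job submission is thread-safe, so a thread pool is enough
        with ThreadPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(self._run_single, sub_queries))
        frames = [df for df in results if df is not None]
        if not frames:
            raise RuntimeError("every sub-query failed")
        # Concatenate the surviving DataFrames vertically (UNION ALL semantics)
        return reduce(DataFrame.unionAll, frames)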

Here I have shown the implementation for Teradata only: when you fire a query with Teradata(**credentials).run_query(query), it splits the query and runs the pieces in parallel. You can modify the code for PostgreSQL and Hive along the same lines.
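Usage then looks like the call above (the credential keys match the sketch and are assumptions):

credentials = {"host": "td.example.com", "user": "etl_user", "password": "***"}  # illustrative
df = Teradata(**credentials).run_query(query)  # failed pieces are skipped, the rest are unioned
df.show()

For PostgreSQL, the main change would be swapping the JDBC URL prefix to jdbc:postgresql:// and the driver to org.postgresql.Driver; for Hive tables registered in the metastore, reading each piece with spark.sql(sub_query) is likely simpler than going through JDBC.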
