Hi there! I agree that it would be great if we had a way to provide a local Spark-dbt environment. I've been looking into this issue myself and wanted to discuss some implementation details and questions I have.

First, I think setting up a Thrift server locally is quite hard, and I'm not sure there are other options for connecting to Spark through a JDBC-like interface. So I went down the route that was suggested, looking at one of the possible shell interfaces to Spark (scala/python/java/R/spark-sql). Ideally I would love to support all of them, possibly by sending commands to a subprocess, wrapping the SQL in the appropriate programming interface, and writing the results to a temporary CSV file to fetch them back. This would not be performant, but it would have the advantage of being flexible and not introducing extra dependencies (the local environment's objective would be a "sandbox" anyway!).

I wanted to try to connect to an existing REPL and dynamically send code to it from the dbt process, but I haven't figured out how to do that yet (the code sent would change depending on the "method" set in the profile, an enum of "spark", "java", "R", etc.). Some IDEs offer that functionality, and I like that we would be able to see the SQL queries appear on the terminal. The other way I could go about it would be to launch the Spark shell as a subprocess, but then the user would have to configure the command used to launch it, and they wouldn't be able to see the queries being sent to the REPL.
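As a minimal sketch of that wrap-and-fetch idea (not dbt-spark's actual interface), assuming pyspark and spark-shell are on the PATH; the wrapper templates and the run_via_shell helper are hypothetical names for illustration:

```python
import subprocess
import tempfile

# Hypothetical wrapper templates: each embeds the SQL statement in that
# shell's programming interface and dumps the result to a temp CSV directory.
WRAPPERS = {
    "pyspark": 'spark.sql("""{sql}""").coalesce(1).write.mode("overwrite").csv("{out}", header=True)',
    "scala": 'spark.sql("""{sql}""").coalesce(1).write.mode("overwrite").option("header", "true").csv("{out}")',
}
SHELLS = {"pyspark": ["pyspark"], "scala": ["spark-shell"]}

def run_via_shell(sql: str, method: str = "pyspark") -> str:
    """Run one SQL statement through a Spark shell subprocess; return the CSV dir."""
    out = tempfile.mkdtemp(prefix="dbt_spark_")
    snippet = WRAPPERS[method].format(sql=sql, out=out)
    # Pipe the wrapped snippet to the shell's stdin and wait for it to exit.
    subprocess.run(SHELLS[method], input=snippet, text=True, check=True)
    return out  # results land in part-*.csv files inside this directory

print(run_via_shell("SELECT 1 AS x"))
```

Fetching results back through the filesystem is what keeps the shells interchangeable, at the cost of the round-trip the comment above already flags as slow.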
I have gotten this working successfully in a docker container, and I have gotten these two options to work:

- Run dbt-spark from within a customized spark container. The container launches spark and then thrift, and then runs some dbt tasks connecting to its own thrift endpoint.
- Run a docker container locally that hosts spark and thrift; then you can run dbt locally using the container's thrift port (see the connection check after this list).

I'd appreciate any feedback on the approach so far; hopefully I'll have code to share soon (so far I did minor refactoring to put a structure in place for additional methods of running).
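As a quick way to verify that the container's Thrift endpoint is reachable before pointing dbt at it, something like the following should work. It uses the same pyhive interface that dbt-spark's thrift method relies on; the localhost:10000 mapping is an assumption about how the container's port is published:

```python
from pyhive import hive

# Assumes the container maps the Spark Thrift server to localhost:10000.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")  # trivial statement just to prove the endpoint works
print(cursor.fetchall())
cursor.close()
conn.close()
```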
I have not yet found a way around the Thrift requirement, but if you already have a spark context, the code sample here might be helpful. From a pyspark SparkSession object, I think we could just run spark.sql("SELECT * FROM FOOBAR") with basically the same result as is achieved with Thrift. In order to skip the Thrift requirement, I think one would have to replace references to pyhive with pyspark and then connect to the cluster's spark endpoint (7077) instead of the thrift JDBC endpoint (10000).
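A sketch of what that could look like (this is an illustration of the idea, not something dbt-spark ships; the master URL is an assumption for a local standalone cluster, and FOOBAR is a placeholder table):

```python
from pyspark.sql import SparkSession

# Attach to the cluster's spark master (7077) rather than the Thrift
# JDBC endpoint (10000); the URL assumes a local standalone cluster.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("dbt-spark-local")
    .enableHiveSupport()
    .getOrCreate()
)

# The same query dbt would otherwise send over pyhive/Thrift.
spark.sql("SELECT * FROM FOOBAR").show()
```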