Spark SQL Performance Optimization
Performance Tuning Parameters
The following parameters can be used to tune Spark SQL performance:
Code Example
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class PerformanceTuneDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("simpledemo").setMaster("local");
        // Tuning parameters are set on the SparkConf before the contexts are created.
        conf.set("spark.sql.codegen", "false");
        conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false");
        conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "1000");
        conf.set("spark.sql.parquet.compression.codec", "snappy");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaSQLContext sqlCtx = new JavaSQLContext(sc);
        JavaHiveContext hiveCtx = new JavaHiveContext(sc);

        // Query the Hive table and print the three selected columns of each row.
        List<Row> result = hiveCtx.sql("SELECT foo,bar,name from pokes2 limit 10").collect();
        for (Row row : result) {
            System.out.println(row.getString(0) + "," + row.getString(1) + "," + row.getString(2));
        }
    }
}
Setting Tuning Parameters from the Beeline Command Line
beeline> set spark.sql.codegen=true;
SET spark.sql.codegen=true
spark.sql.codegen=true
Time taken: 1.196 seconds
Important Parameters
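spark.sql.codegen:
When enabled, Spark SQL compiles each SQL query to Java bytecode before executing it. For long-running or frequently executed queries, this can speed up execution, because specialized bytecode is generated to run the query. For very short ad hoc queries (1 to 2 seconds), however, it can add overhead, because every query has to be compiled first.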
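The trade-off described for spark.sql.codegen can be sketched as a simple cost model. The following is only an illustration, not a measurement of Spark: the class name CodegenBreakEven, the 1000 ms compile cost, and the 1.5x speedup are all hypothetical values chosen to make the arithmetic visible.

```java
public class CodegenBreakEven {

    /** Estimated time with code generation: pay the compile cost once, then run faster. */
    public static double withCodegenMillis(double queryMillis, double compileMillis, double speedup) {
        return compileMillis + queryMillis / speedup;
    }

    /** Query duration at which code generation neither helps nor hurts (requires speedup > 1). */
    public static double breakEvenMillis(double compileMillis, double speedup) {
        return compileMillis * speedup / (speedup - 1.0);
    }

    public static void main(String[] args) {
        double compileMillis = 1000.0; // hypothetical compile overhead per query
        double speedup = 1.5;          // hypothetical speedup of the generated bytecode

        System.out.println("break-even query time (ms): " + breakEvenMillis(compileMillis, speedup));

        // A short ad hoc query and a long-running query, both hypothetical.
        for (double queryMillis : new double[] {1500.0, 60000.0}) {
            double tuned = withCodegenMillis(queryMillis, compileMillis, speedup);
            String verdict = tuned < queryMillis ? "codegen helps" : "codegen adds overhead";
            System.out.println(queryMillis + " ms without codegen, " + tuned + " ms with codegen: " + verdict);
        }
    }
}
```

Under these assumed numbers, queries shorter than the break-even point get slower with code generation and longer ones get faster. Real compile costs and speedups depend on the query and the Spark version, so they should be measured rather than taken from this sketch.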
spark.sql.inMemoryColumnarStorage.batchSize:
When caching SchemaRDDs, Spark SQL groups together the records in the RDD in batches of the size given by this option (default: 1000), and compresses each batch. Very small batch sizes lead to low compression, but on the other hand very large sizes can also be problematic, as each batch might be too large to build up in memory.
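The effect of batch size on compression can be illustrated outside Spark. The sketch below is an assumption-laden stand-in: it uses java.util.zip.Deflater instead of Spark's own columnar compression, and the class name BatchSizeSketch and the synthetic records are hypothetical. It only shows the general pattern that compressing many tiny batches separately yields a larger total than compressing fewer, larger batches.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class BatchSizeSketch {

    /** Compresses one batch of records and returns the compressed size in bytes. */
    public static int compressedSize(List<String> batch) {
        byte[] input = String.join("\n", batch).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Splits the records into batches of batchSize and sums the compressed size of every batch. */
    public static long totalCompressedSize(List<String> records, int batchSize) {
        long total = 0;
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            total += compressedSize(records.subList(start, end));
        }
        return total;
    }

    /** Builds synthetic, repetitive records (hypothetical data, not from a real table). */
    public static List<String> sampleRecords(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add("user" + (i % 50) + ",city" + (i % 7) + ",active");
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(10_000);
        for (int batchSize : new int[] {10, 1_000, 10_000}) {
            System.out.println("batchSize=" + batchSize + " -> "
                    + totalCompressedSize(records, batchSize) + " compressed bytes");
        }
    }
}
```

The sketch does not model the other side of the trade-off: a very large batch has to be held in memory in full while it is being built, which is why the largest batch size is not automatically the best choice.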