Spark SQL

Executing SQL operations on DataFrames in Spark

Join types in Spark

The join method signature in Spark

    def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

right
Right side of the join.

joinExprs
Join expression.

joinType
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

left and left_outer joins are the same.

outer, full and full_outer joins are the same.

    case "inner" => Inner
    case "outer" | "full" | "fullouter" => FullOuter
    case "leftouter" | "left" => LeftOuter
    case "rightouter" | "right" => RightOuter
    case "leftsemi" => LeftSemi
    case "leftanti" => LeftAnti
    case "cross" => Cross

Example

    val df_remark_ref_cover = df_remark_ref.join(df_sn_sep, df_remark_ref("scid_albumid") === df_sn_sep("mixsongid"), "left")
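To make the different joinType strings concrete, here is a self-contained sketch with toy data (the DataFrames, column names, and app name below are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")
    val cond  = left("id") === right("id")

    // "left" and "left_outer" resolve to the same LeftOuter join:
    left.join(right, cond, "left").show()
    left.join(right, cond, "left_outer").show()  // identical output

    // left_semi keeps only the matching left rows; left_anti keeps the non-matching ones:
    left.join(right, cond, "left_semi").show()   // ids 1 and 3
    left.join(right, cond, "left_anti").show()   // id 2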

Running Hive SQL from Spark

Spark SQL has to be executed statement by statement (each spark.sql call takes a single statement).

    // Verified to work directly: these are actions, so no extra trigger or waiting is needed.
    spark.sql("use temp")
    spark.sql("DROP TABLE IF EXISTS temp.jdummy_dt")
    spark.sql("create table temp.jdummy_dt(label int, dt string)")

Working with external tables

One option is to create the table and then insert into it, producing a new table in the data warehouse:

    spark.sql("insert into table temp.jdummy_dt values(1, '2017-11-29')") // Note: there must be no space after "values", otherwise it fails with: ERROR KeyProviderCache: Could not find uri with key
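A related route is writing a DataFrame straight into the warehouse as a table, sketched below (saveAsTable creates a managed table; the overwrite mode is an assumption about the desired semantics):

    import spark.implicits._

    val df = Seq((1, "2017-11-29")).toDF("label", "dt")
    df.write.mode("overwrite").saveAsTable("temp.jdummy_dt")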

The other option is to cache the DataFrame first and then register it as a temporary view for later use (the first use still triggers the computation, but subsequent uses are served from the cache and are much faster).

    import scala.collection.mutable.ListBuffer
    import spark.implicits._  // needed for .toDF

    case class TempRow(label: Int, dt: String)

    // Build one row per date in the lookback window ending at date_end.
    val date_period = getDaysPeriod(date_end, period)
    val date_start = date_period.takeRight(1)(0)
    var date_list_buffer = new ListBuffer[TempRow]()
    for (dt <- date_period) {
        date_list_buffer += TempRow(1, dt)
    }
    val date_list = date_list_buffer.toList
    val df_date = date_list.toDF
    df_date.cache
    df_date.createOrReplaceTempView("date_table")
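The snippet above assumes a helper getDaysPeriod that returns the dates from date_end going back period days, newest first (so the oldest date is the last element, matching takeRight(1)(0)). A hypothetical sketch of it, plus an example of querying the registered view:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    // Hypothetical implementation of the helper used above.
    def getDaysPeriod(dateEnd: String, period: Int): List[String] = {
      val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
      val end = LocalDate.parse(dateEnd, fmt)
      (0 to period).map(i => end.minusDays(i.toLong).format(fmt)).toList
    }

    // The cached view can now be used from SQL like any other table:
    spark.sql("select d.dt, count(*) as cnt from date_table d left join temp.jdummy_dt t on t.dt = d.dt group by d.dt").show()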
