Spark Function

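Following the previous page on data types in Spark, this page covers some of the main function operations. We assume the operations are performed on a DataFrame, so the relevant functions are the ones documented for Dataset.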

groupBy

Groups the Dataset using the specified columns, so we can run aggregation on them.

See RelationalGroupedDataset for all the available aggregate functions.

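// Assumes a SparkSession named spark and a DataFrame named df are in scope.
import spark.implicits._                     // enables the $"column" syntax
import org.apache.spark.sql.functions.count

// Compute the average for all numeric columns grouped by department.
df.groupBy($"department").avg()
// Count the non-null reorder values in each reorder group.
df.groupBy($"reorder").agg(count("reorder").alias("cnt"))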

// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))

orderBy and sort

In Spark the two are the same: orderBy is an alias for sort.

Sorting
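A minimal sketch of sorting, written as a spark-shell style script. The local SparkSession, the sample rows, and the column names are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("sales", 30, 4000.0), ("hr", 45, 3000.0), ("sales", 28, 5000.0))
  .toDF("department", "age", "salary")

// sort and orderBy take the same arguments and produce the same ordering.
val bySort = df.sort($"department".asc, $"salary".desc)
val byOrderBy = df.orderBy($"department".asc, $"salary".desc)

// The desc helper builds a descending sort expression from a column name.
val oldestFirst = df.orderBy(desc("age"))
```

Calling show() on any of these values prints the sorted rows.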

Sampling
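Sampling could look like the sketch below, which uses Dataset.sample. The data, the fraction, and the seed are arbitrary assumptions. Note that fraction is the probability of keeping each row, so the size of the result is only approximately fraction times the row count.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("sample-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: 1000 rows with a single id column.
val df = (1 to 1000).toDF("id")

// Keep each row with probability 0.1, without replacement.
// A fixed seed makes the sample repeatable for the same input.
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
```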

Select

Return a subset

Selects a set of columns and returns a DataFrame.
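As an illustration, select can take column names or column expressions. The data and column names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("sales", "F", 30, 4000.0), ("hr", "M", 45, 3000.0))
  .toDF("department", "gender", "age", "salary")

// Select by column name: the result is a new DataFrame with only these columns.
val subset = df.select("department", "salary")

// Select with expressions: derived columns can be computed and renamed.
val withExpr = df.select($"department", ($"salary" * 2).alias("double_salary"))
```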

Return summary statistics
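One way to get summary statistics through select is to pass aggregate expressions, which collapses the DataFrame to a single row. The sketch below also shows describe as an alternative. Which of the two the original page used is not known, and the data is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("select-stats-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("sales", 30, 4000.0), ("hr", 45, 3000.0), ("sales", 28, 5000.0))
  .toDF("department", "age", "salary")

// Aggregate expressions inside select yield a one-row DataFrame.
val stats = df.select(avg("salary"), max("age"))

// describe returns count, mean, stddev, min and max for the given columns.
val described = df.describe("age", "salary")
```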

agg aggregate function

Return summary statistics
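A small sketch of agg, both over the whole DataFrame and after groupBy. The data and the alias names are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, max}

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("sales", 30, 4000.0), ("hr", 45, 3000.0), ("sales", 28, 5000.0))
  .toDF("department", "age", "salary")

// Aggregating without grouping treats the whole DataFrame as one group.
val overall = df.agg(avg("salary").alias("avg_salary"), max("age").alias("max_age"))

// After groupBy, agg produces one row per group.
val perDept = df.groupBy("department")
  .agg(count("age").alias("cnt"), avg("salary").alias("avg_salary"))
```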

distinct

Returns a new Dataset that contains only the unique rows from this Dataset.
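For example, under the assumption of a small two-column DataFrame with one duplicated row, distinct compares entire rows, while dropDuplicates can restrict the comparison to chosen columns.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("distinct-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data containing one exact duplicate row.
val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

// distinct compares whole rows, so only the exact duplicate is removed.
val unique = df.distinct()

// dropDuplicates keeps one row per distinct value of the listed columns.
val uniqueKeys = df.dropDuplicates("key")
```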

collect (combine with toMap to convert a DataFrame to a Map)

Returns an array that contains all the Rows in this Dataset.
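The DataFrame-to-Map conversion mentioned in the heading might be sketched as follows. The column names and types are assumptions. Because collect pulls every row to the driver, this is only suitable for small results.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("collect-sketch").getOrCreate()
import spark.implicits._

// Hypothetical two-column DataFrame: the first column is the key, the second the value.
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// collect returns Array[Row]; map each Row to a (key, value) pair, then call toMap.
val asMap: Map[String, Int] = df.collect()
  .map(row => row.getString(0) -> row.getInt(1))
  .toMap
```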

struct (combine two columns into a tuple, then into a map; split into multiple columns)

Creates a new struct column that composes multiple input columns.

A column created directly this way has struct type.

To split it into multiple columns, use it together with select and field names such as _1.

Returning multiple columns from a UDF relies on this property: the UDF first returns a struct, which is then split with select.
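The ideas in this section can be sketched as below: packing two columns into a struct, turning the pairs into a Map, splitting a struct back into columns, and splitting the struct returned by a UDF. The data, the column names, and the bounds UDF are hypothetical. A Scala UDF that returns a tuple yields a struct whose fields are named _1 and _2.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, udf}

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("struct-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Pack two columns into a single struct column named pair.
val packed = df.select(struct($"key", $"value").alias("pair"))

// Read each struct back as a (key, value) pair and build a Map on the driver.
val pairMap: Map[String, Int] = packed.collect().map { row =>
  val p = row.getStruct(0)
  p.getString(0) -> p.getInt(1)
}.toMap

// Split the struct into separate columns by selecting its fields.
val unpacked = packed.select($"pair.key", $"pair.value")

// Hypothetical UDF returning a tuple, which Spark stores as a struct (_1, _2).
val bounds = udf((x: Int) => (x - 1, x + 1))
val split = df.withColumn("bounds", bounds($"value"))
  .select($"key", $"bounds._1".alias("lower"), $"bounds._2".alias("upper"))
```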

join
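A minimal sketch of join, assuming two hypothetical tables, employees and departments, that share a dept_id column. Joining on a sequence of column names keeps a single copy of the join column, while joining on an expression keeps both.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical tables sharing the dept_id column.
val employees = Seq((1, "Ann", 10), (2, "Bob", 20), (3, "Cid", 30)).toDF("id", "name", "dept_id")
val departments = Seq((10, "sales"), (20, "hr")).toDF("dept_id", "dept_name")

// Inner join on a shared column name (inner is the default join type).
val inner = employees.join(departments, Seq("dept_id"))

// Left join keeps employees without a matching department, with null dept_name.
val left = employees.join(departments, Seq("dept_id"), "left")

// Joining on an expression keeps the dept_id column from both sides.
val byExpr = employees.join(departments, employees("dept_id") === departments("dept_id"), "inner")
```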

