Sampling
Hive抽样
Let’s say you have a Hive table with ten billion rows, but you want to efficiently randomly sample a fixed number- maybe ten thousand.
使用此方法并非真正抽样。
select * from my_table
limit 10000;
而order by效率太差。会造成集中在一个reducer,使得性能不平衡
select * from my_table
order by rand()
limit 10000;
而sort by只是保证在各个reducer中,并非整体随机。
select * from my_table
sort by rand()
limit 10000;
所以应该在分配时就指定随机分布。
select * from my_table
distribute by rand()
sort by rand()
limit 10000;
最后可以优化的地方是,若知道要取的数量,则限制rand大小,为了保险起见,选择比0.000001更大一点,否则会导致数量不够。
select * from my_table
where rand() <= 0.0001
distribute by rand()
sort by rand()
limit 10000;
Reference
Last updated
Was this helpful?