Selecting columns in a PySpark DataFrame

I am looking for a way to select columns from my dataframe in PySpark. For the first row I know I can use

df.first()

but I am not sure how to do the same for columns, given that they have no column names.

I have 5 columns and I want to loop through each of them:


+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
|1 |0.0|0.0|0.0|1.0|0.0|0.0|
|2 |1.0|0.0|0.0|0.0|0.0|0.0|
|3 |0.0|0.0|1.0|0.0|0.0|0.0|
+--+---+---+---+---+---+---+

冰洋


Try something like this:


df.select([c for c in df.columns if c in ['_2','_4','_5']]).show()
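
Note that because the comprehension iterates over df.columns, the selected columns come out in the dataframe's own order, and any name missing from the frame is silently skipped. If you want the output to follow your own list instead, a minimal variation (same idea, iteration reversed):

wanted = ['_2', '_4', '_5']
df.select([c for c in wanted if c in df.columns]).show()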

八刀丁二


The first two columns and the first 5 rows:


df.select(df.columns[:2]).take(5)
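
take(5) returns the first 5 rows to the driver as a list of Row objects; use show(5) instead if you only want to print them. The same slicing idea also picks non-contiguous columns by position, a sketch for the auto-named frame above:

df.select([df.columns[i] for i in [1, 3, 4]]).show()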

小明明


You can use a list and unpack it inside select:


cols = ['_2','_4','_5']
df.select(*cols).show()
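
The unpacking is optional here: select also accepts a list of column names directly, so the following is equivalent:

df.select(cols).show()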

奔跑吧少年


Use df.schema.names:


spark.version
# u'2.2.0'

df = spark.createDataFrame([("foo", 1), ("bar", 2)])
df.show()
# +---+---+
# | _1| _2|
# +---+---+
# |foo| 1|
# |bar| 2|
# +---+---+

df.schema.names
# ['_1', '_2']

for i in df.schema.names:
    # df_new = df.withColumn(i, [do-something])
    print(i)
# _1
# _2
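
A minimal sketch of what could go in the [do-something] slot, assuming you want to apply the same transformation to every column; the cast to string is an arbitrary placeholder:

from pyspark.sql.functions import col

df_new = df
for i in df.schema.names:
    # hypothetical transformation: cast each column to string
    df_new = df_new.withColumn(i, col(i).cast('string'))
df_new.printSchema()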

小明明


The dataset ss.csv contains some columns I am interested in:


ss_ = spark.read.csv("ss.csv", header=True,
                     inferSchema=True)
ss_.columns



['Reporting Area', 'MMWR Year', 'MMWR Week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017, flag', 'Shiga toxin-producing Escherichia coli, Current week', 'Shiga toxin-producing Escherichia coli, Current week, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max, flag', 'Shiga toxin-producing Escherichia coli, Cum 2018', 'Shiga toxin-producing Escherichia coli, Cum 2018, flag', 'Shiga toxin-producing Escherichia coli, Cum 2017', 'Shiga toxin-producing Escherichia coli, Cum 2017, flag', 'Shigellosis, Current week', 'Shigellosis, Current week, flag', 'Shigellosis, Previous 52 weeks Med', 'Shigellosis, Previous 52 weeks Med, flag', 'Shigellosis, Previous 52 weeks Max', 'Shigellosis, Previous 52 weeks Max, flag', 'Shigellosis, Cum 2018', 'Shigellosis, Cum 2018, flag', 'Shigellosis, Cum 2017', 'Shigellosis, Cum 2017, flag']


But I only need a few of them:


columns_lambda = lambda k: k.endswith(', Current week') or k == 'Reporting Area' or k == 'MMWR Year' or k == 'MMWR Week'


filter returns the required columns lazily (an iterator in Python 3), so materialize it into a list:


sss = filter(columns_lambda, ss_.columns)
to_keep = list(sss)
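
The same list can be built a bit more idiomatically with a list comprehension, a sketch equivalent to the lambda-plus-filter above:

to_keep = [c for c in ss_.columns
           if c.endswith(', Current week')
           or c in ('Reporting Area', 'MMWR Year', 'MMWR Week')]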


The list of desired columns is then unpacked into the arguments of the dataframe's select, which returns a dataframe containing only the columns from the list:


dfss = ss_.select(*to_keep)
dfss.columns


The result:


['Reporting Area',
'MMWR Year',
'MMWR Week',
'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week',
'Shiga toxin-producing Escherichia coli, Current week',
'Shigellosis, Current week']



The function
df.select()

has a complement:
http://spark.apache.org/docs/2 ... .drop
which removes the listed columns.
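
As a sketch of that inverse workflow on the frame above, you can name the columns to remove rather than the ones to keep; picking the flag columns here is just an illustrative choice:

flags = [c for c in ss_.columns if c.endswith(', flag')]
dfss = ss_.drop(*flags)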
