
Merging pandas DataFrames with duplicate keys

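I have two dataframes, both with a key column that may contain duplicates, though the two dataframes mostly share the same duplicated keys. I want to merge them on that key, but in such a way that when both have the same duplicated key, the duplicates are paired up in order. Also, if one dataframe has more duplicates of a key than the other, I want the missing values filled with NaN. For example: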


import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
                   columns=['key', 'A'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
                    'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']},
                   columns=['key', 'B'])

  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K2  A3
4  K2  A4
5  K3  A5

  key   B
0  K0  B0
1  K1  B1
2  K2  B2
3  K2  B3
4  K3  B4
5  K3  B5
6  K4  B6


I am trying to get the following result:


   key    A    B
0   K0   A0   B0
1   K1   A1   B1
2   K2   A2   B2
3   K2   A3   B3
6   K2   A4  NaN
8   K3   A5   B4
9   K3  NaN   B5
10  K4  NaN   B6


So, in principle, I want to treat the duplicated key K2 as K2_1, K2_2, ..., and then do a how='outer' merge on the dataframes.
Any ideas on how I can do this?

三叔


Even faster


%%cython
# using cython in a jupyter notebook
# in another cell, run `%load_ext Cython` first
from collections import defaultdict
import numpy as np

def cg(x):
    # yield the running (1-based) occurrence number of each key
    cnt = defaultdict(int)

    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]


def fastcount(x):
    return list(cg(x))

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

# merge on the shared columns ('key' and 'cc'), then drop the helper column
df1.merge(df2, how='outer').drop(columns='cc')


Quick answer; not scalable


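import numpy as np

def fastcount(x):
    # inv maps each element of x to the position of its value in unq
    unq, inv = np.unique(x, return_inverse=True)
    # boolean mask: one row per unique key, one column per element of x
    m = np.arange(len(unq))[:, None] == inv
    # running count along each row, kept only where the key occurs
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop(columns='cc')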
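
A rough illustration of why the NumPy version above is labeled "not scalable": the broadcasted comparison builds a boolean mask with one row per unique key and one column per input row, so memory grows with the product of the two. The sketch below just replays those steps on the keys of df1 from the question; the variable names `mask` and `counts` are introduced here for illustration only.

```python
import numpy as np

keys = np.array(['K0', 'K1', 'K2', 'K2', 'K2', 'K3'])

# unq holds the distinct keys; inv maps each input element to its slot in unq
unq, inv = np.unique(keys, return_inverse=True)

# one row per unique key, one column per input element
mask = np.arange(len(unq))[:, None] == inv

# running count per key, read off at the positions where the key occurs
counts = (mask.cumsum(axis=1) * mask).sum(axis=0)

print(mask.shape)  # (4, 6)
print(counts)      # [1 1 1 2 3 1]
```

With, say, a million rows and a hundred thousand distinct keys, the mask alone would need on the order of 10^11 entries, which is presumably what the label refers to.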


Old answer


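# number each occurrence of a key within its group (0, 1, 2, ...)
df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()

# merge on the shared columns ('key' and 'cc'), then drop the helper column
df1.merge(df2, how='outer').drop(columns='cc')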


https://i.stack.imgur.com/So2ej.png
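
For anyone who wants to try the cumcount idea end to end, here is a minimal self-contained sketch using the data from the question. The helper name `merge_on_occurrence` is made up for this example, and it assumes neither frame already has a column called `cc`. The explicit sort is only there to make the row order predictable; the index is reset, so it will not match the index shown in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4'],
                    'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})


def merge_on_occurrence(left, right, key='key'):
    """Outer-merge two frames, pairing the n-th occurrence of each key.

    Hypothetical helper for illustration; 'cc' is a temporary column name.
    """
    # assign() returns copies, so the callers' frames are left untouched
    left = left.assign(cc=left.groupby(key).cumcount())
    right = right.assign(cc=right.groupby(key).cumcount())
    merged = left.merge(right, on=[key, 'cc'], how='outer')
    # sort by key and occurrence number, then drop the helper column
    return (merged.sort_values([key, 'cc'])
                  .drop(columns='cc')
                  .reset_index(drop=True))


result = merge_on_occurrence(df1, df2)
print(result)
```

The third K2 row gets NaN in B, and the second K3 row and the K4 row get NaN in A, which matches the pairing asked for in the question.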
