Working with data frames

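1. Reading and writing files
2. head/tail
3. Selecting rows and columns
4. Adding and deleting rows and columns
5. Joining rows and columns
6. Sorting rows and columns
7. Aggregating rows and columns
8. Reference URLs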

#+File Created: <2017-02-07 Tue 17:46>
#+Last Updated: <2018-02-21 Wed 19:07>

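This note briefly summarizes how data frame handling differs across languages (Python, R, Julia).
The environment is OS X Yosemite 10.10.5, Python 3.5.2 (Anaconda), R 3.3.2, and Julia 0.5.0 (DataFrames v0.9.0). Several of the calls shown here have since been deprecated or removed in newer releases, so check them against the version you run.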

1 Reading and writing files

Reading a data frame from a file and writing it back out.

1.1 python

import pandas as pd
df = pd.read_csv('hoge.csv',sep=',',header=0)
print(df)
df.to_csv('fuga.tsv',sep="\t",header=True,index=False)

   id  leng  hei
0   1    10   20
1   2     5    7
2   3    15   22
3   4     3    5

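Reading:
read_csv(filename, sep="delimiter", header=0)
header: row number (0-based) of the line that holds the column names; use None if the file has no column names.
Writing:
to_csv(filename, sep=",", header=True, index=False)
header=True: write the column names.
index=True: write the index (row labels); index=False omits it.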
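
The effect of the header and index options can be checked without touching the disk. The following is a minimal sketch, assuming an in-memory copy of the hoge.csv contents shown above; the variable names (csv_text, raw, tsv_text) are made up for illustration.

```python
import io
import pandas as pd

# Same contents as the hoge.csv example, held in memory so the sketch
# does not depend on a file on disk.
csv_text = "id,leng,hei\n1,10,20\n2,5,7\n3,15,22\n4,3,5\n"

# header=0: the first line holds the column names.
df = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)

# header=None: no column-name line is assumed, so the columns are numbered
# 0, 1, 2 and the first line is read as data.
raw = pd.read_csv(io.StringIO(csv_text), sep=",", header=None)

# Write tab-separated text; index=False omits the row labels.
buf = io.StringIO()
df.to_csv(buf, sep="\t", header=True, index=False)
tsv_text = buf.getvalue()
print(tsv_text)
```

With header=None the frame has one more row than with header=0, which is an easy way to notice a wrong header setting.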

1.2 R

dat <- read.table('hoge.csv',sep=',',header=T)
dat
write.table(dat,file='fugaR.tsv',sep='\t',quote=F,row.names=F,col.names=T)

  id leng hei
1  1   10  20
2  2    5   7
3  3   15  22
4  4    3   5

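Reading:
read.table(filename, sep="delimiter", header=T)
header=T or F: whether the first line holds the column names.
Writing:
write.table(data, file=filename, sep="delimiter", quote=F, row.names=F, col.names=T)
quote=F: whether to wrap character values in "".
row.names=F: whether to write the row names.
col.names=T: whether to write the column names.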

1.3 julia

using DataFrames
dat = readtable("hoge.csv",separator=',')
print(dat)
# separator seems to be inferred from the file extension to some extent
writetable("fugaJ.dat", dat, quotemark = ' ', separator = '\t')

4×3 DataFrames.DataFrame
│ Row │ id │ leng │ hei │
├─────┼────┼──────┼─────┤
│ 1   │ 1  │ 10   │ 20  │
│ 2   │ 2  │ 5    │ 7   │
│ 3   │ 3  │ 15   │ 22  │
│ 4   │ 4  │ 3    │ 5   │

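Reading:
readtable(filename, separator=',')  (separator is the delimiter character)
Writing:
writetable(filename, data, quotemark=' ', separator='\t')
quotemark=' ': the quote character placed around written values. With quotemark='"' you get "hoge" and so on.
separator='\t': the delimiter.
Note: to write without quotes, quotemark='' raises an error. This is most likely because quotemark expects a single character (Char), and '' is not a valid character literal in Julia.
Julia also feels somewhat slow to start up.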

1.4 Summary

	Read	Write
python	pd.read_csv(file)	df.to_csv(file)
R	read.table(file)	write.table(df,file)
julia	readtable(file)	writetable(file,df)

2 head/tail

How to take a quick look at the data (head: the first rows, tail: the last rows).

python: df.head(5)
R, julia: head(df,5)

2.1 python

import pandas as pd
# Use R's sample dataset iris from Python.
import pyper
r = pyper.R(use_pandas=True)  # pass the boolean True rather than the string 'True'
df = pd.DataFrame(r.get('iris'))
pd.set_option('display.width', 150) # set the display width of one line
print(type(df))
print(df.head(5))
print(df.tail(5))

<class 'pandas.core.frame.DataFrame'>
    Sepal.Length    Sepal.Width    Petal.Length    Petal.Width     Species
0             5.1            3.5             1.4            0.2  b'setosa'
1             4.9            3.0             1.4            0.2  b'setosa'
2             4.7            3.2             1.3            0.2  b'setosa'
3             4.6            3.1             1.5            0.2  b'setosa'
4             5.0            3.6             1.4            0.2  b'setosa'
      Sepal.Length    Sepal.Width    Petal.Length    Petal.Width        Species
145             6.7            3.0             5.2            2.3  b'virginica'
146             6.3            2.5             5.0            1.9  b'virginica'
147             6.5            3.0             5.2            2.0  b'virginica'
148             6.2            3.4             5.4            2.3  b'virginica'
149             5.9            3.0             5.1            1.8  b'virginica'

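Another way to use this sample dataset from Python:

import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
# Convert to a pandas object.
# Passing feature_names keeps the column names, which are otherwise lost.
df   = pd.DataFrame(iris.data, columns=iris.feature_names)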

2.2 R

options(width=150) # set the line width (not needed here, but included for comparison)
head(iris,5)
tail(iris,5)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

2.3 julia

using DataFrames
using RDatasets
iris = dataset("datasets","iris")
println(head(iris,5))
println(tail(iris,5))

5×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species  │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ "setosa" │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ "setosa" │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ "setosa" │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ "setosa" │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ "setosa" │
5×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species     │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
│ 1   │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ "virginica" │
│ 2   │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ "virginica" │
│ 3   │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ "virginica" │
│ 4   │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ "virginica" │
│ 5   │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ "virginica" │

2.4 Summary

	head
python	df.head(5)
R	head(df,5)
julia	head(df,5)

3 Selecting rows and columns

Selecting a region of the data and printing it.

3.1 python

# Sample data
import pandas as pd
import pyper
r=pyper.R(use_pandas=True)
iris = pd.DataFrame(r.get('iris'))

# Show the first 5 rows for now
print(iris.head(5))
# Column names
print(iris.columns)
# (number of rows, number of columns) as a tuple
print(iris.shape)
# Number of rows
print(len(iris.index))
# Number of columns
print(len(iris.columns))

    Sepal.Length    Sepal.Width    Petal.Length    Petal.Width     Species
0             5.1            3.5             1.4            0.2  b'setosa'
1             4.9            3.0             1.4            0.2  b'setosa'
2             4.7            3.2             1.3            0.2  b'setosa'
3             4.6            3.1             1.5            0.2  b'setosa'
4             5.0            3.6             1.4            0.2  b'setosa'
Index([' Sepal.Length ', ' Sepal.Width ', ' Petal.Length ', ' Petal.Width ',
       'Species'],
      dtype='object')
(150, 5)
150
5

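# Sample data
import pandas as pd
import pyper
r=pyper.R(use_pandas=True)
iris = pd.DataFrame(r.get('iris'))

# Getting columns
# iris[[0]]             # by column number (worked in the pandas version used here)
# iris['Species']       # by column name
# iris.iloc[:,0]        # by column position
# iris.loc[:,'Species'] # by column label
# iris.ix[:,0]          # column 0 only (.ix was deprecated in pandas 0.20 and later removed)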
print("data[[column_number]]=")
print(iris[[0]].head(2))

print("\ndata['column_name']=")
print(iris['Species'].head(2))

print("\ndata.iloc[:,column_number]=")
print(iris.iloc[:,0].head(2))

print("\ndata.loc[:,column_name]=")
print(iris.loc[:,'Species'].head(2))

print("\ndata.column_name=")
print(iris.Species.head(2))

print("\ndata.ix[row_from:row_to, col_from:col_to]=")
print(iris.ix[0:2,1:3])   # row/column range: rows 0-2 by label, columns 1-2 by position

print("\ndata[row,col]=")
print(iris.ix[0,0])       # extract a single value

data[[column_number]]=
    Sepal.Length 
0             5.1
1             4.9

data['column_name']=
0    b'setosa'
1    b'setosa'
Name: Species, dtype: object

data.iloc[:,column_number]=
0    5.1
1    4.9
Name:  Sepal.Length , dtype: float128

data.loc[:,column_name]=
0    b'setosa'
1    b'setosa'
Name: Species, dtype: object

data.column_name=
0    b'setosa'
1    b'setosa'
Name: Species, dtype: object

data.ix[row_from:row_to, col_from:col_to]=
    Sepal.Width    Petal.Length 
0            3.5             1.4
1            3.0             1.4
2            3.2             1.3

data[row,col]=
5.1

# Sample data
import pandas as pd
import pyper
r=pyper.R(use_pandas=True)
iris = pd.DataFrame(r.get('iris'))
# Filtering
print(iris[iris.ix[:,1]>4.0])
# The column names come back padded with spaces, so 'Sepal.Width' raises an error
print(iris[iris.loc[:,' Sepal.Width ']>4.0])

     Sepal.Length    Sepal.Width    Petal.Length    Petal.Width     Species
15             5.7            4.4             1.5            0.4  b'setosa'
32             5.2            4.1             1.5            0.1  b'setosa'
33             5.5            4.2             1.4            0.2  b'setosa'
     Sepal.Length    Sepal.Width    Petal.Length    Petal.Width     Species
15             5.7            4.4             1.5            0.4  b'setosa'
32             5.2            4.1             1.5            0.1  b'setosa'
33             5.5            4.2             1.4            0.2  b'setosa'
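
The filter above only works with the padded name ' Sepal.Width '. One way around that is to strip the whitespace from the column labels once, right after loading. This is a minimal sketch: iris_like is a hypothetical stand-in for the frame returned through pyper, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned via pyper, whose
# column labels carry surrounding spaces.
iris_like = pd.DataFrame(
    [[5.7, 4.4], [5.2, 4.1], [5.1, 3.5]],
    columns=[" Sepal.Length ", " Sepal.Width "],
)

# Remove leading/trailing whitespace from every column label.
iris_like.columns = iris_like.columns.str.strip()

# The plain name now works for label-based filtering.
wide = iris_like[iris_like["Sepal.Width"] > 4.0]
print(wide)
```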

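Differences from R:

Python writes ":" when a dimension is unrestricted; R leaves it empty.
The end of a ":" range is handled differently:
python: iris.ix[:,1:3] x:y => from x to y-1
R: iris[,2:3] x:y => from x to y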

3.2 R

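names(iris)  # column names
dim(iris)    # number of rows, number of columns
nrow(iris)   # number of rows
ncol(iris)   # number of columns

head(iris[,1])           # by column number; R is 1-based, and ":" is not written when rows are unrestricted
head(iris$Sepal.Length)  # by column name
head(iris[1:3,2:3])      # range of rows and columns
head(iris[,c("Sepal.Width","Species")])  # multiple columns
iris[1,1]    # extract a single value
# Filtering: subset(data, condition)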
sel <- subset(iris,iris["Sepal.Width"]>4.0)
sel

[1] 5.1 4.9 4.7 4.6 5.0 5.4
[1] 5.1 4.9 4.7 4.6 5.0 5.4
  Sepal.Width Petal.Length
1         3.5          1.4
2         3.0          1.4
3         3.2          1.3
  Sepal.Width Species
1         3.5  setosa
2         3.0  setosa
3         3.2  setosa
4         3.1  setosa
5         3.6  setosa
6         3.9  setosa
[1] 150   5
[1] 150
[1] 5
[1] 5.1
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
16          5.7         4.4          1.5         0.4  setosa
33          5.2         4.1          1.5         0.1  setosa
34          5.5         4.2          1.4         0.2  setosa

3.3 julia

using DataFrames
using RDatasets
iris = dataset("datasets","iris")
println(size(iris))     # (number of rows, number of columns)
println(names(iris))    # column names
println(nrow(iris))     # number of rows
println(ncol(iris))     # number of columns; length(iris) gives the same

(150,5)
Symbol[:SepalLength,:SepalWidth,:PetalLength,:PetalWidth,:Species]
150
5
5

using DataFrames
using RDatasets
iris = dataset("datasets","iris")

println(head(iris))
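println(head(iris[:,1:1]))  # extract only column 1; Julia is 1-based
# println(head(iris[:,1]))  # this does not work, presumably because iris[:,1] is a plain vector and head expects a DataFrame
println(head(iris[:,[:Species]]))
# println(head(iris[:,:Species]))  # this does not seem to work either
println(iris[1:3,2:3])  # rows 1-3 and columns 2-3 (same as R!)
println(iris[1,1])  # extract a single value

println(head(iris[:,[1,3,5]])) # pass an array to extract multiple columns
println(head(iris[:,2:3]))     # extract columns 2-3
println(head(iris[1:1,])) # this seems to return column 1; presumably the trailing comma is ignored and a single index selects columns

# Filtering
# Note the . in front of the comparison operator:
# operators that start with . work element-wise
println(iris[iris[:, :SepalWidth] .> 4.0,:])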

3.4 Summary

	shape	column names	rows	columns
python	iris.shape	iris.columns	len(iris.index)	len(iris.columns)
R	dim(iris)	names(iris)	nrow(iris)	ncol(iris)
julia	size(iris)	names(iris)	nrow(iris)	ncol(iris)

	by column number	by column name	range	single value
python	iris.iloc[:,0]	iris.loc[:,name]	iris.ix[0:2,1:3]	iris.ix[0,0]
R	iris[,1]	iris$name	iris[1:3,2:3]	iris[1,1]
julia	iris[:,1:1]	iris[:,[:name]]	iris[1:3,2:3]	iris[1,1]

	filter
python	iris[iris.loc[:,"name"] > value]
R	subset(iris,iris["name"] > value)
julia	iris[iris[:,:name] .> value,:]

Differences in data access between Python and R/Julia:

Python is 0-based; R and Julia are 1-based.
That is, array indices start from different numbers.
Getting the first column:
python: iris.ix[:,0]
R: iris[,1]
When a dimension is unrestricted, Python and Julia write ":"; R writes nothing.
Range specification with ":":
python: iris.ix[:,1:3] = from column 1 (0-based) to column 2 (=3-1, 0-based)
R: iris[,2:3] = from column 2 (1-based) to column 3 (1-based)
julia: iris[:,2:3] = from column 2 (1-based) to column 3 (1-based)
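
Since .ix was deprecated in pandas 0.20 and later removed, the same range rules are easier to see with its two replacements, .iloc (position-based) and .loc (label-based). The following is a minimal sketch on a small made-up frame; the column names a, b, c are placeholders, not part of the iris data.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]}
)

# Position-based: both ends follow Python's half-open rule,
# so 0:2 means rows 0-1 and 1:3 means columns 1-2.
by_position = df.iloc[0:2, 1:3]

# Label-based: the end label is included, so 0:2 means rows 0, 1 and 2.
by_label = df.loc[0:2, ["b", "c"]]

# Single value at row 0, column 0.
first_value = df.iloc[0, 0]

print(by_position.shape, by_label.shape, first_value)
```

The label-based slice keeps its end point, which is why a row range such as 0:2 can return three rows when the index consists of integer labels.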

Other points to note:
In Julia, a column is accessed by name as a Symbol, written :column_name.
It all feels a bit tedious...

4 Adding and deleting rows and columns

4.1 python

import pandas as pd
import numpy  as np

df = pd.DataFrame([[1,2],[3,4]],columns=list('AB'))
print(df)

# 1. Adding
# Add rows
dfa= pd.DataFrame([[5,6],[7,8],[9,10],[11,12]],columns=list('AB'))
print()
print(dfa)

# ignore_index=True renumbers the rows
# (DataFrame.append was removed in pandas 2.0; pd.concat([df,dfa],ignore_index=True) is the replacement)
dfl=df.append(dfa,ignore_index=True)
print()
print(dfl)

# Add a column
dfl['C']=np.array([1,2,3,4,5,6])
print()
print(dfl)

# 2. Deleting
# Delete a row
# Use drop
# dfl.drop([3,4])  # to delete several rows at once
dfl2=dfl.drop(3)
print()
print(dfl2)

# Delete a column
# drop(column_name, axis=1)   # pass axis=1
dfl3=dfl.drop('A',axis=1)
print()
print(dfl3)

   A  B
0  1  2
1  3  4

    A   B
0   5   6
1   7   8
2   9  10
3  11  12

    A   B
0   1   2
1   3   4
2   5   6
3   7   8
4   9  10
5  11  12

    A   B  C
0   1   2  1
1   3   4  2
2   5   6  3
3   7   8  4
4   9  10  5
5  11  12  6

    A   B  C
0   1   2  1
1   3   4  2
2   5   6  3
4   9  10  5
5  11  12  6

    B  C
0   2  1
1   4  2
2   6  3
3   8  4
4  10  5
5  12  6
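
DataFrame.append is no longer available in pandas 2.0 and later, so on a current installation the same steps would go through pd.concat. The following is a minimal sketch on a shortened version of the data above; assign and the index=/columns= keywords of drop are used here as one possible style, not as the original author's method.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
dfa = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Row-wise addition: concat replaces df.append(dfa, ignore_index=True).
dfl = pd.concat([df, dfa], ignore_index=True)

# Column-wise addition without mutating dfl: assign returns a new frame.
dfl_c = dfl.assign(C=[1, 2, 3, 4])

# Deletion: drop rows by index label, columns by name.
no_row3 = dfl_c.drop(index=3)
no_col_a = dfl_c.drop(columns="A")

print(no_row3)
print(no_col_a)
```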

4.2 R

colA <- c(1,3)
colB <- c(2,4)
df <- data.frame(A=colA,B=colB)
df
# 1. Adding
# Add rows: rbind
colAadd <- c(5,7,9,11)
colBadd <- c(6,8,10,12)
dfa <- data.frame(A=colAadd,B=colBadd)
print("added:")
dfa
# rbind
dfl <- rbind(df,dfa)
print("rbind:")
dfl

# Add columns: cbind
colC <- c(12,13,14,15,16,17)
dfc <- data.frame(C=colC)
print("added:")
dfc
# cbind
dfk <- cbind(dfl,dfc)
print("cbind:")
dfk

# 2. Deleting
# Delete rows: use negative indices
dfl2 <- dfk[c(-1,-2),]
print("drop row:")
dfl2
# Delete a column
dfl3 <- dfl2[,-1]
print("drop col:")
dfl3

  A B
1 1 2
2 3 4
[1] "added:"
   A  B
1  5  6
2  7  8
3  9 10
4 11 12
[1] "rbind:"
   A  B
1  1  2
2  3  4
3  5  6
4  7  8
5  9 10
6 11 12
[1] "added:"
   C
1 12
2 13
3 14
4 15
5 16
6 17
[1] "cbind:"
   A  B  C
1  1  2 12
2  3  4 13
3  5  6 14
4  7  8 15
5  9 10 16
6 11 12 17
[1] "drop row:"
   A  B  C
3  5  6 14
4  7  8 15
5  9 10 16
6 11 12 17
[1] "drop col:"
   B  C
3  6 14
4  8 15
5 10 16
6 12 17

4.3 julia

using DataFrames
df  = DataFrame(A=[1,3],B=[2,4])
dfa = DataFrame(A=[5,7,9,11],B=[6,8,10,12])

# 1. Adding
# Add rows: vcat
dfl = vcat(df,dfa)
println("vcat:")
println(dfl)

# Add columns: hcat
dfc = DataFrame(C=[12,13,14,15,16,17])
dfk = hcat(dfl,dfc)
println("hcat:")
println(dfk)

# 2. Deleting
# Delete rows
dfl2 = dfk[setdiff(collect(1:1:nrow(dfk)),[3,4]),:]
println(collect(1:1:6))   # build the array from 1 to 6 in steps of 1
println(setdiff(collect(1:1:6),[3,4]))
println("delete row 3,4")
println(dfl2)

# Delete columns

# To delete with a destructive (in-place) method:
# delete!(dfk,[:C,:B])

# setdiff(names(dfk),[:C]) => [:A,:B]
#dfl3 = dfk[setdiff(names(dfk),[:C])]
dfl3 = dfk[setdiff(names(dfk),[:C,:B])]
println("delete column C")
println(setdiff(names(dfk),[:C]))  # Symbol[:A, :B]
println(dfl3)

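# dfl3 = dfk[(x in [:C]) for x in names(df)] # comprehension: does not run... Why? (probably because name[... for ...] is parsed as a typed comprehension, so dfk is taken as an element type)
# dfl3 = dfk[:,1:2]   # this works
# dfl3 = dfk[: [1:2]] # this does not work (the comma after : is missing)

dfl3 = dfk[:,[true,true,false]]   # this works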
println("delete column C")
println(dfl3)

vcat:
6×2 DataFrames.DataFrame
│ Row │ A  │ B  │
├─────┼────┼────┤
│ 1   │ 1  │ 2  │
│ 2   │ 3  │ 4  │
│ 3   │ 5  │ 6  │
│ 4   │ 7  │ 8  │
│ 5   │ 9  │ 10 │
│ 6   │ 11 │ 12 │
hcat:
6×3 DataFrames.DataFrame
│ Row │ A  │ B  │ C  │
├─────┼────┼────┼────┤
│ 1   │ 1  │ 2  │ 12 │
│ 2   │ 3  │ 4  │ 13 │
│ 3   │ 5  │ 6  │ 14 │
│ 4   │ 7  │ 8  │ 15 │
│ 5   │ 9  │ 10 │ 16 │
│ 6   │ 11 │ 12 │ 17 │
[1,2,3,4,5,6]
[1,2,5,6]
delete row 3,4
4×3 DataFrames.DataFrame
│ Row │ A  │ B  │ C  │
├─────┼────┼────┼────┤
│ 1   │ 1  │ 2  │ 12 │
│ 2   │ 3  │ 4  │ 13 │
│ 3   │ 9  │ 10 │ 16 │
│ 4   │ 11 │ 12 │ 17 │
delete column C
Symbol[:A,:B]
6×1 DataFrames.DataFrame
│ Row │ A  │
├─────┼────┤
│ 1   │ 1  │
│ 2   │ 3  │
│ 3   │ 5  │
│ 4   │ 7  │
│ 5   │ 9  │
│ 6   │ 11 │
delete column C
6×2 DataFrames.DataFrame
│ Row │ A  │ B  │
├─────┼────┼────┤
│ 1   │ 1  │ 2  │
│ 2   │ 3  │ 4  │
│ 3   │ 5  │ 6  │
│ 4   │ 7  │ 8  │
│ 5   │ 9  │ 10 │
│ 6   │ 11 │ 12 │

4.4 Summary

	add rows	add columns	delete rows	delete columns
python	df1.append(df2)	df1[new_col]=values	df1.drop([row1,row2])	df1.drop([col1,col2],axis=1)
R	rbind(df1,df2)	cbind(df1,df2)	df1[c(-row1, -row2),]	df1[,c(-col1,-col2)]
julia	vcat(df1,df2)	hcat(df1,df2)	df1[setdiff(collect(1:1:nrow(df1)),[row1,row2]),:]	df1[setdiff(names(df1),[:col1,:col2])]

Deleting in Julia is oddly cumbersome; there may well be a better way...

5 Joining rows and columns

5.1 python

import pandas as pd
import numpy  as np
# Joining horizontally
# pd.merge(left, right, on='key', how='inner')
cu1 = pd.DataFrame([[1,"John"],[2,"Mark"]],columns=['id','name'])
cu2 = pd.DataFrame([[1,'ok','mut'],[2,'not ok','wild']],columns=['id','OK','m/w'])
print(cu1)
print()
print(cu2)
print()
# Join using id as the key.
cum = pd.merge(cu1,cu2,on='id',how='inner')
print(cum)

# Joining vertically
# pd.concat([table1, table2],ignore_index=True)
cu3 = pd.DataFrame([[3,"hoge",'not ok','wild'],
                    [4,"fuga",'ok','wild']], columns=['id','name','OK','m/w'])

cun=pd.concat([cum,cu3],ignore_index=True)
print()
print(cun)

   id  name
0   1  John
1   2  Mark

   id      OK   m/w
0   1      ok   mut
1   2  not ok  wild

   id  name      OK   m/w
0   1  John      ok   mut
1   2  Mark  not ok  wild

   id  name      OK   m/w
0   1  John      ok   mut
1   2  Mark  not ok  wild
2   3  hoge  not ok  wild
3   4  fuga      ok  wild

5.2 R

cu1 <- data.frame(id=c(1,2),name=c('John','Mark'))
print("cu1")
cu1
cu2 <- data.frame(id=c(1,2),OK=c('ok','not ok'),mut=c('mut','wild'))
print("cu2")
cu2
cum <- merge(cu1,cu2,by='id',all=T)
print("merge")
cum

[1] "cu1"
  id name
1  1 John
2  2 Mark
[1] "cu2"
  id     OK  mut
1  1     ok  mut
2  2 not ok wild
[1] "merge"
  id name     OK  mut
1  1 John     ok  mut
2  2 Mark not ok wild

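cbind = column bind: binds the columns side by side. cbind does no key matching; as the output shows, by= and all= are simply added as extra columns.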

cu1 <- data.frame(id=c(1,2),name=c('John','Mark'))
print("cu1")
cu1
cu2 <- data.frame(id=c(1,2),OK=c('ok','not ok'),mut=c('mut','wild'))
print("cu2")
cu2
cum <- cbind(cu1,cu2,by='id',all=T)
print("cbind")
cum

[1] "cu1"
  id name
1  1 John
2  2 Mark
[1] "cu2"
  id     OK  mut
1  1     ok  mut
2  2 not ok wild
[1] "cbind"
  id name id     OK  mut by  all
1  1 John  1     ok  mut id TRUE
2  2 Mark  2 not ok wild id TRUE

Vertical concatenation: rbind

cu1 <- data.frame(id=c(1,2),name=c('John','Mark'))
print("cu1")
cu1
cu2 <- data.frame(id=c(3,4),name=c('hoge','fuga'))
print("cu2")
cu2
cum <- rbind(cu1,cu2)
print("rbind")
cum

[1] "cu1"
  id name
1  1 John
2  2 Mark
[1] "cu2"
  id name
1  3 hoge
2  4 fuga
[1] "rbind"
  id name
1  1 John
2  2 Mark
3  3 hoge
4  4 fuga

5.3 julia

using DataFrames
cu1 = DataFrame(id=[1,2],name=["John","Mark"])
println("cu1:")
println(cu1)
cu2 = DataFrame(id=[1,2],OK=["ok","not ok"],mut=["mut","wild"])
println("cu2:")
println(cu2)
cum = join(cu1,cu2,on=:id, kind=:inner)
println("merged:")
println(cum)

cu1:
2×2 DataFrames.DataFrame
│ Row │ id │ name   │
├─────┼────┼────────┤
│ 1   │ 1  │ "John" │
│ 2   │ 2  │ "Mark" │
cu2:
2×3 DataFrames.DataFrame
│ Row │ id │ OK       │ mut    │
├─────┼────┼──────────┼────────┤
│ 1   │ 1  │ "ok"     │ "mut"  │
│ 2   │ 2  │ "not ok" │ "wild" │
merged:
2×4 DataFrames.DataFrame
│ Row │ id │ name   │ OK       │ mut    │
├─────┼────┼────────┼──────────┼────────┤
│ 1   │ 1  │ "John" │ "ok"     │ "mut"  │
│ 2   │ 2  │ "Mark" │ "not ok" │ "wild" │

5.4 Summary

	horizontal join	vertical concatenation
python	pd.merge(df1,df2, on=key_col_name,how='inner')	pd.concat([df1,df2])
R	merge(df1,df2,by=key_col_name,all=T), cbind	rbind(df1,df2)
julia	join(df1,df2,on=:key_col_name,kind=:inner)	vcat(df1,df2)
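
The Python and Julia examples use an inner join while the R example passes all=T, which is an outer join. With the sample data every id appears on both sides, so the results coincide. The difference shows up once the keys differ, as in this minimal sketch (the frames left and right are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["John", "Mark", "hoge"]})
right = pd.DataFrame({"id": [1, 2, 4], "OK": ["ok", "not ok", "ok"]})

# Inner join: only ids present in both frames (1 and 2).
inner = pd.merge(left, right, on="id", how="inner")

# Outer join: every id from either frame; missing cells become NaN.
# This corresponds to all=T in R's merge.
outer = pd.merge(left, right, on="id", how="outer")

print(inner)
print(outer)
```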

6 Sorting rows and columns

6.1 python

import pandas as pd
# Sorting
df2= pd.DataFrame([[1,3,'hokkaido'],[4,5,'tokyo'],[3,5,'saitama'],
                   [6,9,'oosaka'],  [1,1,'aomori']])
df2.index=['suzuki','tanaka','kimura','endo','yoshida']
df2.columns=['col1','col2','col3']
print(df2)
# Sort by value
# sort_values(by=[columns],ascending=True)
dfs1 = df2.sort_values(by=['col1'],ascending=True)
print("sorted by col1:")
print(dfs1)
# Sort by row name (index)
dfs2 = df2.sort_index(ascending=False)
print("sorted by index:")
print(dfs2)
# Sort by column name
dfs3 = df2.sort_index(axis=1, ascending=False)
print("sorted by col names:")
print(dfs3)

         col1  col2      col3
suzuki      1     3  hokkaido
tanaka      4     5     tokyo
kimura      3     5   saitama
endo        6     9    oosaka
yoshida     1     1    aomori
sorted by col1:
         col1  col2      col3
suzuki      1     3  hokkaido
yoshida     1     1    aomori
kimura      3     5   saitama
tanaka      4     5     tokyo
endo        6     9    oosaka
sorted by index:
         col1  col2      col3
yoshida     1     1    aomori
tanaka      4     5     tokyo
suzuki      1     3  hokkaido
kimura      3     5   saitama
endo        6     9    oosaka
sorted by col names:
             col3  col2  col1
suzuki   hokkaido     3     1
tanaka      tokyo     5     4
kimura    saitama     5     3
endo       oosaka     9     6
yoshida    aomori     1     1

6.2 R

df2 <- data.frame('col2'=c(1,4,3,6,1), 'col1'=c(3,5,5,9,1),
                  'col3'=c('hokkaido','tokyo','saitama','oosaka','aomori'),
                  'name'=c('suzuki','tanaka','kimura','endo','yoshida'))
df2
print("col1 の値で sort")
dfs1 <- df2[order(df2$col1),]
dfs1
# This also works (here in descending order, with the row names renumbered)
dfs2 <- df2[sort.list(df2$col1,decreasing=TRUE),]
rownames(dfs2) <- c(1:nrow(dfs2))
print("col1 の値で sort (2)")
dfs2

print("行名で sort")
dfsr <- dfs1[order(row.names(dfs1)),]
dfsr

print("列名で sort")
dfsc <- dfs1[,order(names(dfs1))]
dfsc

  col2 col1     col3    name
1    1    3 hokkaido  suzuki
2    4    5    tokyo  tanaka
3    3    5  saitama  kimura
4    6    9   oosaka    endo
5    1    1   aomori yoshida
[1] "col1 の値で sort"
  col2 col1     col3    name
5    1    1   aomori yoshida
1    1    3 hokkaido  suzuki
2    4    5    tokyo  tanaka
3    3    5  saitama  kimura
4    6    9   oosaka    endo
[1] "col1 の値で sort (2)"
  col2 col1     col3    name
1    6    9   oosaka    endo
2    4    5    tokyo  tanaka
3    3    5  saitama  kimura
4    1    3 hokkaido  suzuki
5    1    1   aomori yoshida
[1] "行名で sort"
  col2 col1     col3    name
1    1    3 hokkaido  suzuki
2    4    5    tokyo  tanaka
3    3    5  saitama  kimura
4    6    9   oosaka    endo
5    1    1   aomori yoshida
[1] "列名で sort"
  col1 col2     col3    name
5    1    1   aomori yoshida
1    3    1 hokkaido  suzuki
2    5    4    tokyo  tanaka
3    5    3  saitama  kimura
4    9    6   oosaka    endo

6.3 julia

using DataFrames
df2 = DataFrame(col1=[1,4,3,6,1],col2=[3,5,5,9,1],
                col3=["hokkaido","tokyo","saitama","oosaka","aomori"],
                name=["suzuki","tanaka","kimura","endo","yoshida"])
println(df2)
dfs1 = sort(df2, cols=[:col1,:col2],rev=true) # arrays take by=, but a DataFrame takes cols=
println("sorted:")
println(dfs1)

5×4 DataFrames.DataFrame
│ Row │ col1 │ col2 │ col3       │ name      │
├─────┼──────┼──────┼────────────┼───────────┤
│ 1   │ 1    │ 3    │ "hokkaido" │ "suzuki"  │
│ 2   │ 4    │ 5    │ "tokyo"    │ "tanaka"  │
│ 3   │ 3    │ 5    │ "saitama"  │ "kimura"  │
│ 4   │ 6    │ 9    │ "oosaka"   │ "endo"    │
│ 5   │ 1    │ 1    │ "aomori"   │ "yoshida" │
sorted:
5×4 DataFrames.DataFrame
│ Row │ col1 │ col2 │ col3       │ name      │
├─────┼──────┼──────┼────────────┼───────────┤
│ 1   │ 6    │ 9    │ "oosaka"   │ "endo"    │
│ 2   │ 4    │ 5    │ "tokyo"    │ "tanaka"  │
│ 3   │ 3    │ 5    │ "saitama"  │ "kimura"  │
│ 4   │ 1    │ 3    │ "hokkaido" │ "suzuki"  │
│ 5   │ 1    │ 1    │ "aomori"   │ "yoshida" │

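6.4 Summary

	by value	by row name	by column name
python	df.sort_values(by=[col_name])	df.sort_index()	df.sort_index(axis=1)
R	df[order(df$col_name),]	df[order(row.names(df)),]	df[,order(names(df))]
julia	sort(df,cols=[:col_name])

I ran out of energy before working out the remaining Julia entries, so those are left for another time.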
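
The Julia example sorts on two columns in descending order, whereas the pandas example sorts on one. A pandas counterpart for the two-column case could look like the following minimal sketch, which reuses the col1/col2 values from above and also shows a mixed-direction variant.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"col1": [1, 4, 3, 6, 1], "col2": [3, 5, 5, 9, 1]},
    index=["suzuki", "tanaka", "kimura", "endo", "yoshida"],
)

# Sort by col1 descending, breaking ties by col2 descending.
sorted_desc = df2.sort_values(by=["col1", "col2"], ascending=False)

# Mixed directions: col1 ascending, ties broken by col2 descending.
sorted_mixed = df2.sort_values(by=["col1", "col2"], ascending=[True, False])

print(sorted_desc)
print(sorted_mixed)
```

Passing a list to ascending gives one direction per sort key.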

7 Aggregating rows and columns

7.1 python

import pandas as pd
import numpy  as np
df = pd.DataFrame({'col1':[8,9,10],
                   'col2':[1,2,3],
                   'col3':[4,5,6],
                   'col4':[9,1,2]
                   })
# apply
# Apply a function to each column
ca = df.apply(lambda x: np.sum(x))
print("apply each col")
print(ca)
# Apply a function to each row
cr = df.apply(lambda x: np.sum(x),axis=1)
print("apply each row")
print(cr)

# Cross tabulation
import pyper
r = pyper.R(use_pandas=True)
df = pd.DataFrame(r.get('iris'))
# df['C']=[1 if x>3.0 else 0 for x in df.loc[:,' Sepal.Width ']]
# Written as a list comprehension:
# put 1 in column 'C' if the value in column 1 (0-based, i.e. Sepal.Width) is greater than 3, else 0.
df['C']=[1 if x>3.0 else 0 for x in df.ix[:,1]]
#print(df.loc[:,' Sepal.Length '].head(2))
print(df.head(5))
# pd.crosstab(rows (y axis), columns (x axis))
cross = pd.crosstab(df.C,df.Species)
print(cross)

apply each col
col1    27
col2     6
col3    15
col4    12
dtype: int64
apply each row
0    22
1    17
2    21
dtype: int64
    Sepal.Length    Sepal.Width    Petal.Length    Petal.Width     Species  C
0             5.1            3.5             1.4            0.2  b'setosa'  1
1             4.9            3.0             1.4            0.2  b'setosa'  0
2             4.7            3.2             1.3            0.2  b'setosa'  1
3             4.6            3.1             1.5            0.2  b'setosa'  1
4             5.0            3.6             1.4            0.2  b'setosa'  1
Species  b'setosa'  b'versicolor'  b'virginica'
C                                              
0                8             42            33
1               42              8            17

7.2 R

# apply
# Apply to each column
means <- sapply(iris,mean)
means

# 1 applies the function to each row
# 2 applies the function to each column
# c(1,2) applies the function to each element
max <- apply(iris,2,max)
max

# Cross tabulation
head(iris)
hoge   <- sapply(iris$Sepal.Width, function(p) { if(p>3.0) {return(1); }else{ return(0); }})
iris$C <- hoge
cross  <- table(iris$C, iris$Species)
cross

Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
    5.843333     3.057333     3.758000     1.199333           NA 
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
       "7.9"        "4.4"        "6.9"        "2.5"  "virginica" 
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

    setosa versicolor virginica
  0      8         42        33
  1     42          8        17

7.3 julia

using DataFrames

function ret01(x,thr)
    if(x > thr)
        return 1
    else
        return 0
    end
end

df = DataFrame(col1=[8,9,10],
               col2=[1,2,3],
               col3=[4,5,6],
               col4=[9,1,2])
println(df)
# apply
# Apply a function to each column
co = colwise(sum,df)
println(co)
for col in eachcol(df)
   println(mean(col[2]))
end
# Apply a function to each row
# There seems to be no way to do this for all rows in one call.
for row in eachrow(df)
   # mean(Array)
   # each row apparently has to be converted first...
   println(mean(convert(Array,row)))
end

# Cross tabulation
using RDatasets
df = dataset("datasets","iris")
println(head(df,5))

# println(df[Symbol("SepalLength")])
# Can a Julia comprehension contain an if/else? (the ternary form should work: [x > 3.0 ? 1 : 0 for x in ...])
# Put 0 where SepalWidth is 3 or less, and 1 otherwise.
selct = [ ret01(x,3.0) for x in df[Symbol("SepalWidth")]]
# println(selct)

# Store the data in column C
df[:C]=selct
println(head(df,5))

using FreqTables
cross = freqtable(df, :C, :Species)
println(cross)

3×4 DataFrames.DataFrame
│ Row │ col1 │ col2 │ col3 │ col4 │
├─────┼──────┼──────┼──────┼──────┤
│ 1   │ 8    │ 1    │ 4    │ 9    │
│ 2   │ 9    │ 2    │ 5    │ 1    │
│ 3   │ 10   │ 3    │ 6    │ 2    │
Any[[27],[6],[15],[12]]
9.0
2.0
5.0
4.0
5.5
4.25
5.25
5×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species  │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ "setosa" │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ "setosa" │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ "setosa" │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ "setosa" │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ "setosa" │
5×6 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species  │ C │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┼───┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ "setosa" │ 1 │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ "setosa" │ 0 │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ "setosa" │ 1 │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ "setosa" │ 1 │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ "setosa" │ 1 │
2×3 Named Array{Int64,2}
C ╲ Species │     setosa  versicolor   virginica
────────────┼───────────────────────────────────
0           │          8          42          33
1           │         42           8          17

7.4 Summary

	apply per column	apply per row	cross tabulation
python	df.apply(func)	df.apply(func,axis=1)	pd.crosstab(df.col1, df.col2)
R	sapply(df,func)	apply(df,1,func)	table(df$col1,df$col2)
	apply(df,2,func)
julia	colwise(func,df)		freqtable(df,:col1, :col2)
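
In pandas the 0/1 flag column can also be built without a list comprehension, by comparing the whole column at once. The following is a minimal sketch on a small made-up table; the column names width and species and their values are placeholders, not the iris data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "width": [3.5, 3.0, 3.2, 2.8, 3.3, 2.5],
        "species": ["setosa", "setosa", "setosa",
                    "versicolor", "virginica", "virginica"],
    }
)

# Vectorized comparison gives booleans; astype(int) maps them to 1/0.
df["C"] = (df["width"] > 3.0).astype(int)

# Rows are the values of C, columns are the species.
cross = pd.crosstab(df["C"], df["species"])

# Column-wise and row-wise sums without a lambda.
col_sums = df[["width", "C"]].sum()
row_sums = df[["width", "C"]].sum(axis=1)

print(cross)
```

Built-in reductions such as sum accept the same axis argument as apply, so the lambda is only needed for functions pandas does not provide.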

8 Reference URLs

The main ones are listed below.

Introducing Julia/DataFrames - Wikibooks, open books for an open world:
http://bit.ly/2kRXXQC
DataFrames : Apply a function by rows - Google グループ:
https://groups.google.com/forum/#!topic/julia-users/q52Sxff5lME
Learn Julia in Y Minutes:
https://learnxinyminutes.com/docs/ja-jp/julia-jp/
データフレーム | Julia の DataFrames:
http://stat.biopapyrus.net/julia/dataframes.html
10 Minutes to DataFrames.jl - StatsFragments:
http://sinhrks.hatenablog.com/entry/2015/12/23/003321
Vectors, Arrays and Matrices – Quantitative Economics:
https://lectures.quantecon.org/jl/julia_arrays.html
Database-Style Joins and Indexing — dataframesjl 0.6.0 documentation
http://dataframesjl.readthedocs.io/en/latest/joins_and_indexing.html
sort by column doesn't work · Issue #660 · JuliaStats/DataFrames.jl:
https://github.com/JuliaStats/DataFrames.jl/issues/660

Python(+Pandas), R, Julia(+DataFrames) でのテキストファイルおよび SQLite からの読み込み - Qiita:
http://qiita.com/ngr_t/items/57867de223f741f735b8
データ分析ライブラリPandasの使い方 - Librabuch:
https://librabuch.jp/blog/2013/12/pandas_python_advent_calendar_2013/
Python pandas の算術演算 / 集約関数 / 統計関数まとめ - StatsFragments:
http://sinhrks.hatenablog.com/entry/2014/11/27/232150
Python pandas データのイテレーションと関数適用、pipe - StatsFragments:
http://sinhrks.hatenablog.com/entry/2015/06/18/221747
python:データ処理tips その3 クロス集計したものをヒートマップで可視化する - MATHGRAM:
http://www.mathgram.xyz/entry/2016/02/28/141510

apply() ファミリー:
http://cse.naro.affrc.go.jp/takezawa/r-tips/r/24.html
How can I use the row.names attribute to order the rows of my dataframe in R? - Stack Overflow
http://stackoverflow.com/questions/20295787/how-can-i-use-the-row-names-attribute-to-order-the-rows-of-my-dataframe-in-r
r - Sort columns of a dataframe by column name - Stack Overflow
http://stackoverflow.com/questions/7334644/sort-columns-of-a-dataframe-by-column-name