9.6. Basic Index Operations

from toolkit import H
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

Windows 10
Python 3.8.8 @ MSC v.1928 64 bit (AMD64)
Latest build date 2021.02.28
pandas version:  1.2.2
numpy version:  1.20.1

Methods of Index Objects

tuples = [('bar', 'one'),
          ('bar', 'two'),
          ('baz', 'one'),
          ('baz', 'two'),
          ('foo', 'one'),
          ('foo', 'two'),
          ('qux', 'one'),
          ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
print(index)

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

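.levels is a property that returns a FrozenList (an immutable list) holding the unique labels of each level (the same values as the levels argument passed when constructing a MultiIndex directly).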

index.levels

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

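.get_level_values(level): returns an Index containing the label of every row at the given level of a MultiIndex. level may be an integer position or a level name.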

print(index.get_level_values(0), "\n")
print(index.get_level_values(1), "\n")
print(index.get_level_values('second'))

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
dtype='object', name='first')

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
dtype='object', name='second')

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
dtype='object', name='second')
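
The relationship between .levels and .get_level_values() can be made concrete with a small sketch. It assumes a shortened version of the index above and uses the MultiIndex .codes attribute, which stores, for each row, the integer position of its label within the corresponding level. The variable names are illustrative only.

```python
import pandas as pd

tuples = [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

# levels holds the unique labels of each level; codes holds, for every row,
# the position of that row's label inside the matching level.
rebuilt_first = index.levels[0].take(index.codes[0])

print(list(index.levels[0]))   # unique labels of level 0
print(list(index.codes[0]))    # per-row positions into level 0
print(list(rebuilt_first))     # per-row labels, rebuilt from the two

# The rebuilt labels agree with get_level_values for the same level.
same = list(rebuilt_first) == list(index.get_level_values('first'))
print(same)
```

In other words, get_level_values expands the compact (levels, codes) representation back into one label per row.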

df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

df.columns.levels  # original MultiIndex

df[['foo','qux']].columns.levels  # sliced
df[['foo', 'qux']].columns.to_numpy()
# for a specific level
df[['foo', 'qux']].columns.get_level_values(0)

Index(['foo', 'foo'], dtype='object', name='first')

new_mi = df[['foo', 'qux']].columns.remove_unused_levels()
new_mi.levels

FrozenList([['foo'], ['one', 'two']])

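Reindexing and Alignment

Both reindex() and align() on pandas objects accept a level parameter, which is useful for broadcasting values across one level of a MultiIndex. For example:

Create the example df: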

midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
                      codes=[[1, 1, 0, 0], [1, 0, 1, 0]])


df = pd.DataFrame(np.random.randn(4, 2), index=midx)

df

               0         1
one  y  0.969419 -1.461856
     x  1.060536 -0.447844
zero y -1.034162 -0.611150
     x  1.191578  0.220043

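Create the example df2:

# pandas 1.2 API: mean(level=...) was deprecated in pandas 1.3 and removed in 2.0;
# df.groupby(level=0).mean() is the replacement there
df2 = df.mean(level=0)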
df2

             0         1
one   1.014978 -0.954850
zero  0.078708 -0.195553

df2.reindex(df.index, level=0)

               0         1
one  y  1.014978 -0.954850
     x  1.014978 -0.954850
zero y  0.078708 -0.195553
     x  0.078708 -0.195553

# aligning
df_aligned, df2_aligned = df.align(df2, level=0)

print(df_aligned, "\n")
print(df2_aligned)

               0         1
one  y  0.969419 -1.461856
     x  1.060536 -0.447844
zero y -1.034162 -0.611150
     x  1.191578  0.220043

               0         1
one  y  1.014978 -0.954850
     x  1.014978 -0.954850
zero y  0.078708 -0.195553
     x  0.078708 -0.195553
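
A common use of this level-wise broadcasting is subtracting each group's mean from its rows. The following is a minimal sketch under stated assumptions: it uses fixed values instead of random ones so the result is easy to check, and it computes the group means with groupby(level=0).mean() so that it also runs on pandas versions that no longer accept mean(level=0).

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
                     codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
df = pd.DataFrame(np.arange(8, dtype=float).reshape(4, 2), index=midx)

# One row of means per label of level 0.
group_means = df.groupby(level=0).mean()

# Broadcast each group's mean back onto the two-level index, then subtract.
demeaned = df - group_means.reindex(df.index, level=0)

# After demeaning, every level-0 group averages to zero.
check = demeaned.groupby(level=0).mean()
print(demeaned)
print(check)
```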

Parameters of the `reindex` method

DataFrame.reindex(self, labels=None, index=None, columns=None,
                  axis=None, method=None, copy=True, level=None,
                  fill_value=nan, limit=None, tolerance=None)

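index: array-like. The labels of the new index.
columns: array-like. The labels of the new columns.
method: how to fill entries for labels that are missing from the original index (requires a monotonically increasing or decreasing index). Possible values:
- None: no filling; missing entries are left as NaN
- 'backfill'/'bfill': fill the gap with the next valid value (backward fill)
- 'pad'/'ffill': fill the gap with the previous valid value (forward fill)
- 'nearest': fill the gap with the nearest valid value
copy: boolean. If True, return a new object even if the passed index is the same as the current one.
level: int or name. Broadcast across the given level, matching index values on that level of the MultiIndex.
fill_value: scalar. The value used for missing entries, NaN by default. When given together with method, method is applied first and fill_value is used only for entries that remain unmatched.
limit: int. Maximum number of consecutive missing entries to fill forward or backward. Used together with method.
tolerance: optional. Maximum distance between an original label and a new label for an inexact match to be accepted. Used together with method.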
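
The interaction of method, fill_value, limit, and tolerance is easier to see on a small Series. This is only a sketch: the Series, its float index, and the variable names are made up for illustration and do not appear elsewhere in this section.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[0.0, 2.0, 4.0])
new_index = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

plain = s.reindex(new_index)                     # new labels become NaN
zeros = s.reindex(new_index, fill_value=0.0)     # new labels become 0.0
ffilled = s.reindex(new_index, method='ffill')   # carry the previous value forward
# limit=1: only the first label of each run of consecutive new labels is filled
limited = s.reindex(new_index, method='ffill', limit=1)
# tolerance=0.2: a nearest match is accepted only if it is at most 0.2 away
near = s.reindex([1.9, 2.9], method='nearest', tolerance=0.2)

print(ffilled.tolist())
print(limited.tolist())
print(near.tolist())
```

Here the labels 5.0 and 6.0 both follow the last original label 4.0, so with limit=1 only 5.0 is filled. In the tolerance example, 1.9 is within 0.2 of the label 2.0 and is matched, while 2.9 is 0.9 away from its nearest label and stays missing.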

Swapping Index Levels

The `swaplevel` method

DataFrame.swaplevel(self, i=-2, j=-1, axis=0)

swaplevel() exchanges the positions of two levels of a MultiIndex:

print(df, "\n")
print(df.swaplevel(0, 1, axis=0))

               0         1
one  y  0.969419 -1.461856
     x  1.060536 -0.447844
zero y -1.034162 -0.611150
     x  1.191578  0.220043

               0         1
y one   0.969419 -1.461856
x one   1.060536 -0.447844
y zero -1.034162 -0.611150
x zero  1.191578  0.220043
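
As the output above shows, swaplevel() only exchanges the levels and leaves the row order untouched, so rows that share the new outer label are no longer adjacent. A minimal sketch of following it with sort_index() is given below; the level names 'outer' and 'inner' are assumptions added for the example.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_tuples(
    [('one', 'y'), ('one', 'x'), ('zero', 'y'), ('zero', 'x')],
    names=['outer', 'inner'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=midx)

swapped = df.swaplevel('outer', 'inner')  # levels can be given by name
tidy = swapped.sort_index()               # regroup rows under the new outer level

print(list(swapped.index.names))
print(list(swapped.index))
print(list(tidy.index))
```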

The `reorder_levels` method

DataFrame.reorder_levels(self, order, axis=0)

reorder_levels() generalizes swaplevel(), letting you permute the index levels in a single step:

print(df, "\n")
print(df.reorder_levels([1, 0], axis=0))

               0         1
one  y  0.969419 -1.461856
     x  1.060536 -0.447844
zero y -1.034162 -0.611150
     x  1.191578  0.220043

               0         1
y one   0.969419 -1.461856
x one   1.060536 -0.447844
y zero -1.034162 -0.611150
x zero  1.191578  0.220043
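
With only two levels, reorder_levels() and swaplevel() give the same result, as above. The difference shows with three or more levels. The sketch below uses a hypothetical three-level Series (the names 'letter', 'number', and 'tag' are made up) to show an arbitrary permutation and to confirm that a swap is the special case that exchanges two positions.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1, 'x'), ('a', 2, 'y'), ('b', 1, 'x')],
    names=['letter', 'number', 'tag'])
s = pd.Series([10, 20, 30], index=idx)

# Any permutation of the levels, given by name or by position.
rotated = s.reorder_levels(['tag', 'letter', 'number'])

# Swapping levels 0 and 2 is the permutation [2, 1, 0].
same = s.swaplevel(0, 2).index.equals(s.reorder_levels([2, 1, 0]).index)

print(list(rotated.index.names))
print(list(rotated.index))
print(same)
```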

Renaming Index Labels: `rename`

DataFrame.rename(self, mapper=None, index=None, columns=None,
                 axis=None, copy=True, inplace=False, level=None,
                 errors='ignore')

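rename() renames the labels of an index, including a MultiIndex. Its index and columns parameters accept a dict, so that only the rows or columns you want to change are renamed: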

df.rename(columns={0: "col0", 1: "col1"})

            col0      col1
one  y  0.969419 -1.461856
     x  1.060536 -0.447844
zero y -1.034162 -0.611150
     x  1.191578  0.220043

df.rename(index={"one": "two", "y": "z"})

               0         1
two  z  0.969419 -1.461856
     x  1.060536 -0.447844
zero z -1.034162 -0.611150
     x  1.191578  0.220043
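
In the example above the dict is applied to every level of the MultiIndex. The level parameter from the signature restricts the renaming to one level, which matters when the same label occurs in several levels. A minimal sketch, using a made-up index in which 'x' appears in both levels:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('x', 'x'), ('x', 'y'), ('y', 'x')])
df = pd.DataFrame(np.zeros((3, 1)), index=idx)

both = df.rename(index={'x': 'X'})            # applied to every level
inner = df.rename(index={'x': 'X'}, level=1)  # applied to level 1 only
upper = df.rename(index=str.upper)            # a function works as well

print(list(both.index))
print(list(inner.index))
print(list(upper.index))
```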

Renaming Index Levels: `rename_axis`

DataFrame.rename_axis(self, mapper=None, index=None, columns=None,
                      axis=None, copy=True, inplace=False)

rename_axis() sets the name of an Index or the level names of a MultiIndex. In particular, naming the levels of a MultiIndex is useful if you later call reset_index() to move the index values into columns.

df.rename_axis(index=['abc', 'def'])

                 0         1
abc  def
one  y    0.969419 -1.461856
     x    1.060536 -0.447844
zero y   -1.034162 -0.611150
     x    1.191578  0.220043

df.rename_axis(columns="Cols").columns

RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis accept a dict, a Series, or a mapping function that maps labels or names to new values.

df.rename_axis(index=['abc', 'def'], inplace=True)
print(df)

                 0         1
abc  def
one  y    0.969419 -1.461856
     x    1.060536 -0.447844
zero y   -1.034162 -0.611150
     x    1.191578  0.220043
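
The connection to reset_index() mentioned above can be sketched as follows. The frame and the names 'value' and 'group' are assumptions made for this example. Unnamed levels end up in default columns called level_0 and level_1, whereas named levels keep their names.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('one', 'y'), ('one', 'x'), ('zero', 'y')])
df = pd.DataFrame({'value': [1, 2, 3]}, index=idx)

unnamed_cols = list(df.reset_index().columns)

named = df.rename_axis(index=['abc', 'def'])
# A dict renames existing level names selectively.
renamed = named.rename_axis(index={'abc': 'group'})
flat = renamed.reset_index()

print(unnamed_cols)
print(list(flat.columns))
```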

Sorting a MultiIndex: `sort_index`

DataFrame.sort_index(self, axis=0, level=None, ascending=True,
                     inplace=False, kind='quicksort',
                     na_position='last', sort_remaining=True,
                     ignore_index=False, key=None)

Objects with a MultiIndex are sorted with the sort_index method.

Create the example s:

import random

tuples = [('bar', 'one'),
          ('bar', 'two'),
          ('baz', 'one'),
          ('baz', 'two'),
          ('foo', 'one'),
          ('foo', 'two'),
          ('qux', 'one'),
          ('qux', 'two')]
random.shuffle(tuples)

s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
s

foo  two   -1.176215
qux  one    1.108736
foo  one    0.652951
qux  two   -0.402180
bar  one   -0.100761
baz  two   -1.303137
bar  two   -0.996656
baz  one   -0.249797
dtype: float64

# the two calls are equivalent
s.sort_index()
s.sort_index(level=0)

bar  one   -0.100761
     two   -0.996656
baz  one   -0.249797
     two   -1.303137
foo  one    0.652951
     two   -1.176215
qux  one    1.108736
     two   -0.402180
dtype: float64

s.sort_index(level=1)

bar  one   -0.100761
baz  one   -0.249797
foo  one    0.652951
qux  one    1.108736
bar  two   -0.996656
baz  two   -1.303137
foo  two   -1.176215
qux  two   -0.402180
dtype: float64

If the levels of the MultiIndex are named, you can also pass a level name to sort_index.

s.index.set_names(['L1', 'L2'], inplace=True)

print(s.sort_index(level='L1'), "\n")
print(s.sort_index(level='L2'))

L1   L2
bar  one   -0.100761
     two   -0.996656
baz  one   -0.249797
     two   -1.303137
foo  one    0.652951
     two   -1.176215
qux  one    1.108736
     two   -0.402180
dtype: float64

L1   L2
bar  one   -0.100761
baz  one   -0.249797
foo  one    0.652951
qux  one    1.108736
bar  two   -0.996656
baz  two   -1.303137
foo  two   -1.176215
qux  two   -0.402180
dtype: float64
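
The ascending parameter from the signature can also be a list, giving one sort direction per level. The sketch below assumes a small Series with fixed values (not the random s above) so that the resulting order can be read off directly.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('foo', 'two'), ('bar', 'one'), ('foo', 'one'), ('bar', 'two')],
    names=['L1', 'L2'])
s = pd.Series([1, 2, 3, 4], index=idx)

# L1 ascending, L2 descending within each L1 group.
mixed = s.sort_index(level=['L1', 'L2'], ascending=[True, False])
# A single bool applies to all levels.
descending = s.sort_index(ascending=False)

print(mixed)
print(descending)
```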

On higher-dimensional objects you can sort along any axis by level, provided that axis has a MultiIndex:

df.T.sort_index(level=1, axis=1)

abc       one      zero       one      zero
def         x         x         y         y
0    1.060536  1.191578  0.969419 -1.034162
1   -0.447844  0.220043 -1.461856 -0.611150

Indexing still works when the index is not sorted, but it is much less efficient and emits a PerformanceWarning. It also returns a copy of the data rather than a view:

dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
                    'joe': ['x', 'x', 'z', 'y'],
                    'jolie': np.random.rand(4)})
# 'y' should sort before 'z', so this index is not lexsorted
dfm = dfm.set_index(['jim', 'joe'])
print(dfm)
print(dfm.loc[(1, 'z')])

            jolie
jim joe
0   x    0.340107
    x    0.745068
1   z    0.260104
    y    0.458321
            jolie
jim joe
1   z    0.260104
<ipython-input-1-23c78e5aac00>:7: PerformanceWarning: indexing past
lexsort depth may impact performance.
  print(dfm.loc[(1, 'z')])

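Furthermore, if you try to slice an object whose index is not fully lexsorted, an error is raised:

try:
    dfm.loc[(0, 'y'):(1, 'z')]
except pd.errors.UnsortedIndexError as e:
    # print the exception class name together with its message
    print(f"{type(e).__name__}: {e}")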

UnsortedIndexError: 'Key length (2) was greater than MultiIndex
lexsort depth (1)'

After sorting, the same slice works without error:

dfm.sort_index().loc[(0, 'y'):(1, 'z')]

            jolie
jim joe
1   y    0.458321
    z    0.260104

The is_lexsorted() method of a MultiIndex reports whether the index is sorted, and the lexsort_depth property returns the depth to which it is sorted (is_lexsorted() was deprecated in pandas 1.3; later versions offer index.is_monotonic_increasing for the same check):

print("***** unsorted *****")
print("is lexsorted:", dfm.index.is_lexsorted())
print("lexsort depth:", dfm.index.lexsort_depth)

dfm = dfm.sort_index()

print("***** sorted *****")
print("is lexsorted:", dfm.index.is_lexsorted())
print("lexsort depth:", dfm.index.lexsort_depth)

***** unsorted *****
is lexsorted: False
lexsort depth: 1
***** sorted *****
is lexsorted: True
lexsort depth: 2

Slicing now works as expected.

dfm.loc[(0, 'y'):(1, 'z')]

            jolie
jim joe
1   y    0.458321
    z    0.260104
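
For code that has to run on pandas versions without is_lexsorted(), a possible check before slicing is the is_monotonic_increasing property of the index. The sketch below rebuilds dfm with fixed 'jolie' values (an assumption, since the original uses random numbers) and sorts only when needed.

```python
import pandas as pd

dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
                    'joe': ['x', 'x', 'z', 'y'],
                    'jolie': [0.1, 0.2, 0.3, 0.4]}).set_index(['jim', 'joe'])

before = dfm.index.is_monotonic_increasing   # 'z' comes before 'y', so not sorted
if not before:
    dfm = dfm.sort_index()
after = dfm.index.is_monotonic_increasing

# Label slicing is safe once the index is sorted.
sliced = dfm.loc[(0, 'y'):(1, 'z')]
print(before, after)
print(sliced)
```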

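Converting Columns to an Index: `set_index`

set_index turns column data into the row index (this applies only to DataFrame, since a Series has no columns). The column label becomes the index name and the column values become the row labels: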

DataFrame.set_index(self, keys, drop=True, append=False,
                    inplace=False, verify_integrity=False)

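keys: a single column label or a list of column labels. These columns become the row index.
drop: boolean. If True, the columns named in keys are removed from the frame; otherwise they are kept as columns as well.
append: boolean. If True, the existing row index is kept and the new levels are appended to it (the result is always a MultiIndex); otherwise the existing row index is discarded.
inplace: boolean. If True, modify the object in place and return None.
verify_integrity: boolean. If True, check the new index for duplicates; otherwise the check is deferred until it becomes necessary.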

df = DataFrame(np.random.randint(low=0, high=10, size=(4, 3)))
df["key"] = ["A", "B", "C", "D"]
print(df)

df.set_index(keys="key", append=True)

   0  1  2 key
0  2  1  5   A
1  3  7  0   B
2  1  8  4   C
3  2  4  2   D

       0  1  2
  key
0 A    2  1  5
1 B    3  7  0
2 C    1  8  4
3 D    2  4  2
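
The drop and verify_integrity parameters are not exercised by the example above. The following sketch, on a made-up frame with a deliberately duplicated key, shows their effect. The variable duplicate_error exists only to capture the outcome for display.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'B'], 'value': [1, 2, 3]})

kept = df.set_index('key', drop=False)    # 'key' remains a column as well
multi = df.set_index('key', append=True)  # the old index becomes level 0

# With verify_integrity=True, duplicate index values raise a ValueError.
try:
    df.set_index('key', verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(list(kept.columns))
print(multi.index.nlevels)
print(duplicate_error)
```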

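Converting an Index to Columns: `reset_index`

reset_index moves the levels of the row index into columns, one new column per level, and replaces the index with a default integer index numbered from 0: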

DataFrame.reset_index(self, level=None, drop=False, inplace=False,
                      col_level=0, col_fill='')

Series.reset_index(self, level=None, drop=False, name=None,
                   inplace=False)

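level: int, str, tuple, or list. The levels to remove from the index. If None, all levels are removed.
drop: boolean. If True, the specified levels are discarded instead of being inserted as columns; if False, they are converted to columns.
inplace: boolean. If True, modify the object in place and return None.
col_level: when the columns have multiple levels, the level at which the labels of the new columns are inserted.
col_fill: when the columns have multiple levels, the label used for the other levels of the new columns. Defaults to the empty string.

For a Series, name is the label of the column that holds the original Series values in the resulting DataFrame.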

columns = [["a", "a", "b"], ["1", "2", "3"]]
df = DataFrame(np.random.randint(low=0, high=10, size=(4, 3)), columns=columns)
print(df)

df.reset_index(drop=False, col_level=0)

  index  a     b
         1  2  3
0     0  3  6  5
1     1  6  8  5
2     2  0  6  9
3     3  1  8  1
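
The level, drop, and name parameters can be illustrated on a Series with a two-level index. This is a sketch with invented level names ('letter', 'number') and an invented column name ('score').

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)],
                                names=['letter', 'number'])
s = pd.Series([10, 20, 30], index=idx)

full = s.reset_index(name='score')                     # all levels become columns
partial = s.reset_index(level='number', name='score')  # 'letter' stays in the index
dropped = s.reset_index(level='number', drop=True)     # level discarded, still a Series

print(list(full.columns))
print(list(partial.columns), partial.index.name)
print(type(dropped).__name__, list(dropped.index))
```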

Dropping Rows or Columns by Label: `drop`

Drop one or more labels along an axis:

DataFrame.drop(self, labels=None, axis=0, index=None, columns=None,
               level=None, inplace=False, errors='raise')

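labels: a single label or a list-like of labels to drop.
axis: int or axis name. Defaults to axis 0.
level: int or level name, for a MultiIndex. Needed because the same label may appear in more than one level.
inplace: boolean. If True, modify the object in place and return None.
errors: 'ignore' or 'raise'. With 'ignore', labels that do not exist are skipped without raising an error.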

a = df.reset_index(drop=False, col_level=0)
print(a)
a.drop(labels="a", axis=1)

  index  a     b
         1  2  3
0     0  3  6  5
1     1  6  8  5
2     2  0  6  9
3     3  1  8  1
D:\Software\miniconda\envs\blog\lib\site-
packages\pandas\core\generic.py:4152: PerformanceWarning: dropping on
a non-lexsorted multi-index without a level parameter may impact
performance.
  obj = obj._drop_axis(labels, axis, level=level, errors=errors)
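
The level parameter decides where a label is looked up when the columns form a MultiIndex. The sketch below assumes a small frame whose column labels are chosen so that '1' occurs under both 'a' and 'b'.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([('a', '1'), ('a', '2'), ('b', '1')])
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=columns)

no_a = df.drop(columns='a')            # top-level label: removes ('a', '1') and ('a', '2')
no_1 = df.drop(columns='1', level=1)   # label looked up in level 1 only
# errors='ignore' skips a row label that does not exist instead of raising.
unchanged = df.drop(index=99, errors='ignore')

print(list(no_a.columns))
print(list(no_1.columns))
print(unchanged.shape)
```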
