9.4. Index Objects

from toolkit import H
import numpy as np
import pandas as pd
import pprint

Windows 10
Python 3.8.8 @ MSC v.1928 64 bit (AMD64)
Latest build date 2021.03.02
pandas version:  1.2.2
numpy version:  1.20.1

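Index Objects

pandas Index objects hold the axis labels and other metadata (such as the axis name). When a Series or DataFrame is constructed, any array or other sequence of labels is converted to an Index. Index objects are immutable: their elements cannot be modified in place. To change the contents of an index, you must assign a new one.

Immutability matters because it allows Index objects to be shared safely among multiple data structures.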
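
As a minimal sketch of what immutability means in practice (the variable names here are illustrative, not from the original), item assignment on an Index raises a TypeError, while replacing the whole index by assignment is allowed:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# In-place modification of a single element is rejected.
try:
    idx[1] = 'z'
    mutation_failed = False
except TypeError:
    mutation_failed = True

# Replacing the whole index by assignment works.
s = pd.Series([1, 2, 3], index=idx)
s.index = ['x', 'y', 'z']

print(mutation_failed)   # True
print(list(s.index))     # ['x', 'y', 'z']
print(list(idx))         # ['a', 'b', 'c'], the original Index is unchanged
```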

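The following table lists the main Index classes in pandas:

Class	Description
Index	The most general Index object, representing axis labels as a NumPy array of Python objects
Int64Index	Specialized Index for integer values
MultiIndex	"Hierarchical" index object representing multiple levels of indexing on a single axis; can be thought of as an array of tuples
DatetimeIndex	Stores nanosecond-resolution timestamps (represented with NumPy's datetime64 dtype)
PeriodIndex	Specialized Index for Period data (time spans)

The full list of Index classes in the pandas version used here (1.2.2):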

['CategoricalIndex',
 'DatetimeIndex',
 'Float64Index',
 'Index',
 'Int64Index',
 'IntervalIndex',
 'MultiIndex',
 'PeriodIndex',
 'RangeIndex',
 'TimedeltaIndex',
 'UInt64Index']

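Each index has a number of methods and properties for set logic and for answering common questions about the data it contains. The following table lists some of them:

Method	Description
append	Concatenates another Index object, producing a new Index
difference	Computes the set difference as an Index
intersection	Computes the set intersection
union	Computes the set union
isin	Computes a boolean array indicating whether each value is contained in the passed collection
delete	Deletes the element at position i, producing a new Index
drop	Deletes the passed values, producing a new Index
insert	Inserts an element at position i, producing a new Index
is_monotonic	Returns True if each element is greater than or equal to the previous element
is_unique	Returns True if the Index has no duplicate values
unique	Computes the array of unique values in the Index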
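
The methods in the table can be tried on two small indexes. This is a sketch with made-up labels; is_monotonic_increasing is used as the longer-named spelling of is_monotonic, which recent pandas releases no longer provide. Every call returns a new object and leaves the original Index untouched.

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

both = a.append(b)                    # concatenation, duplicates are kept
print(list(both))                     # ['a', 'b', 'c', 'b', 'c', 'd']
print(list(a.union(b)))               # labels found in either index
print(list(a.intersection(b)))        # labels found in both indexes
print(list(a.difference(b)))          # ['a']
print(a.isin(['a', 'z']).tolist())    # [True, False, False]
print(list(a.delete(0)))              # remove by position: ['b', 'c']
print(list(a.drop(['b'])))            # remove by value: ['a', 'c']
print(list(a.insert(1, 'x')))         # ['a', 'x', 'b', 'c']
print(both.is_unique)                 # False, because 'b' and 'c' repeat
print(list(both.unique()))            # ['a', 'b', 'c', 'd']
print(a.is_monotonic_increasing)      # True
```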

Index

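pd.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True,
         **kwargs)

Parameters:

data: an array-like; must be one-dimensional
name: a string giving the name of the Index
dtype: the data type. If None, pandas infers the dtype that best fits the data, falling back to object
copy: a boolean. If True, the input data is copied
tupleize_cols: a boolean. If True, pandas attempts to create a MultiIndex when possible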

pd.Index(['a', 'b', 'c'])

Index(['a', 'b', 'c'], dtype='object')

In fact, depending on the data and dtype passed, the pd.Index constructor can also create pd.Int64Index, pd.Float64Index, pd.RangeIndex, pd.UInt64Index, pd.DatetimeIndex, and pd.TimedeltaIndex objects:

from datetime import datetime, timedelta

print(type(pd.Index([1,2,3,4], dtype=np.uint64,  name='UInt64Index')))
print(type(pd.Index([1,2,3,4], dtype=np.int64,   name='Int64Index')))
print(type(pd.Index([1,2,3,4], dtype=np.float64, name='Float64Index')))
print(type(pd.Index(range(5),                    name='RangeIndex')))
print(type(pd.Index([datetime.today()],          name='DatetimeIndex')))
print(type(pd.Index([timedelta(microseconds=1)], name='TimedeltaIndex')))

<class 'pandas.core.indexes.numeric.UInt64Index'>
<class 'pandas.core.indexes.numeric.Int64Index'>
<class 'pandas.core.indexes.numeric.Float64Index'>
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>

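Hierarchical Index: MultiIndex

A MultiIndex represents a multi-level (hierarchical) index. It inherits from Index, and each multi-level label is represented as a tuple, so a MultiIndex can be viewed as an array of tuples. Internally, however, a MultiIndex does not store the tuples directly; it stores the labels of each level in a separate Index object.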

pd.MultiIndex(levels=None, codes=None, sortorder=None, names=None,
              dtype=None, copy=False, name=None,
              verify_integrity=True)

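Parameters:

levels: a list of arrays giving the unique labels of each level.
codes: a list of lists of integers; for each level, the integers give the position within levels of the label at each location.
sortorder: an integer giving the level of sortedness (the index must be lexicographically sorted by that level).
names: a sequence of strings giving the name of each level.
copy: a boolean. If True, the underlying data is copied.
verify_integrity: a boolean. If True, check that the levels and codes are consistent and valid.
name: the name of the MultiIndex.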
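
The relationship between levels and codes can be sketched as follows. The labels are a shortened version of the ones used in this section, and rebuilt is a hypothetical helper variable: the tuple at each position is recovered by looking up each level's code in the corresponding level.

```python
import pandas as pd

levels = [['A', 'B'], ['x1', 'x2', 'y1']]
codes = [[0, 0, 1], [0, 1, 2]]
mi = pd.MultiIndex(levels=levels, codes=codes, names=['class1', 'class2'])

# Rebuild the tuple at each position by looking up each level's code.
rebuilt = [
    tuple(level[code] for level, code in zip(levels, position_codes))
    for position_codes in zip(*codes)
]

print(rebuilt)        # [('A', 'x1'), ('A', 'x2'), ('B', 'y1')]
print(mi.tolist())    # the same tuples, as seen through the MultiIndex
```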

# index1 and index2 are equivalent
index1 = pd.Index([('A', 'x1'), ('A', 'x2'), ('B', 'y1'),
                   ('B', 'y2'), ('B', 'y3')], names=['class1', 'class2'])
index2 = pd.MultiIndex(levels=[['A', 'B'], ['x1', 'x2', 'y1', 'y2', 'y3']],
                       codes=[[0, 0, 1, 1, 1], [0, 1, 2, 3, 4]],
                       names=['class1', 'class2'])
pd.DataFrame(np.random.randint(1, 10, (5, 3)), index=index2)

               0  1  2
class1 class2
A      x1      8  1  7
       x2      7  2  1
B      y1      4  7  8
       y2      2  6  5
       y3      2  8  3

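A MultiIndex is usually created with one of the following alternative constructors:

MultiIndex.from_arrays(arrays, sortorder, names): converts a list of array-likes to a MultiIndex. Each array-like gives the labels of one level for all data points.
MultiIndex.from_tuples(tuples, sortorder, names): converts a list of tuple-likes to a MultiIndex. Each tuple-like is the full label of one row or column.
MultiIndex.from_product(iterables, sortorder, names): converts a list of iterables to a MultiIndex. Each iterable gives, in order, the labels of one level; the label of each data point is generated as the Cartesian product of the iterables.
MultiIndex.from_frame(df, sortorder, names): converts a DataFrame to a MultiIndex. Each column of the DataFrame gives the labels of one level for all data points.

A MultiIndex can also be created by passing a list of tuples to Index() with tupleize_cols set to True (the default).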
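
from_arrays is the only constructor in the list without its own example in this section, so here is a small sketch of it, together with the effect of tupleize_cols on Index(). The variable names and the shortened label lists are illustrative.

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']]
from_arrays = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

tuples = list(zip(*arrays))
via_index = pd.Index(tuples)                    # tupleize_cols=True by default
flat = pd.Index(tuples, tupleize_cols=False)    # plain Index whose elements are tuples

print(type(via_index).__name__)          # MultiIndex
print(from_arrays.equals(via_index))     # True (equals ignores the level names)
print(isinstance(flat, pd.MultiIndex))   # False
```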

`from_tuples`

arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))
pprint.pprint(tuples)
print('')
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

pprint.pprint(index)

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

`from_product`

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
pprint.pprint(iterables)
print('')
index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
pprint.pprint(index)

[['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

`from_frame`

df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
                   ['foo', 'one'], ['foo', 'two']],
                  columns=['first', 'second'])
pprint.pprint(df)
print('') 
index = pd.MultiIndex.from_frame(df)
pprint.pprint(index)

  first second
0   bar    one
1   bar    two
2   foo    one
3   foo    two

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])

Creating a MultiIndex from list-likes

arrays = [
    np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
    np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
pprint.pprint(arrays)

pd.DataFrame(np.random.randn(8, 4), index=arrays)

[array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
      dtype='<U3'),
 array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
      dtype='<U3')]

                0         1         2         3
bar one  1.438074  0.851745 -0.263231  0.285984
    two  0.830161 -1.887687  0.175476  1.057429
baz one -1.066494 -0.282858  0.760714  0.069183
    two  0.333587 -0.724593 -0.215265 -1.637538
foo one  0.050430 -1.658196 -0.409160  0.692135
    two  0.397801 -0.158865  1.922566  0.529642
qux one -1.022481  0.214911  0.513812  0.727262
    two -1.325092 -2.795077  0.403388 -1.136477

# pd.Series(np.random.randn(8), index=arrays)

Selecting Data

MultiIndex on the columns

Create an example DataFrame:

tuples = [('bar', 'one'),
          ('bar', 'two'),
          ('baz', 'one'),
          ('baz', 'two'),
          ('foo', 'one'),
          ('foo', 'two'),
          ('qux', 'one'),
          ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df

first        bar                 baz  ...       foo       qux          
second       one       two       one  ...       two       one       two
A      -0.936493  1.624851  0.381365  ...  0.510066 -0.079910 -0.659817
B      -1.170663 -0.427186  1.827135  ... -0.489893  0.119415  0.853525
C       0.358102 -1.269896  0.434656  ... -0.628121  0.234735 -0.563489

[3 rows x 8 columns]

Selecting columns with [] on the DataFrame:

df['bar']

second       one       two
A      -0.936493  1.624851
B      -1.170663 -0.427186
C       0.358102 -1.269896

Using [] with several levels:

# equivalent to df[('bar', 'one')]
df['bar', 'one']

A   -0.936493
B   -1.170663
C    0.358102
Name: (bar, one), dtype: float64

Selecting multiple columns with [[]]:

df[['bar', 'baz', 'foo']] # list

first        bar                 baz                 foo
second       one       two       one       two       one       two
A      -0.936493  1.624851  0.381365 -2.307657  0.144425  0.510066
B      -1.170663 -0.427186  1.827135 -0.303377  0.037561 -0.489893
C       0.358102 -1.269896  0.434656 -1.654805 -0.907289 -0.628121

A tuple () addresses several levels on the same axis:

df[[('bar', 'one'), ('baz','one')]] # list of tuples

first        bar       baz
second       one       one
A      -0.936493  0.381365
B      -1.170663  1.827135
C       0.358102  0.434656

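# row label and column label
print(df.loc['A', 'bar'], '\n')
# a tuple () addresses several levels on the same axis
print(df.loc['A', ('bar', 'one')], '\n')
# a list [] keeps the axis in the result (a Series instead of a scalar); the next three lines are equivalent
print(df.loc['A', [('bar', 'one')]], '\n')
print(df.loc['A', ('bar', ['one'])], '\n')
print(df.loc['A', (['bar'], 'one')], '\n')
# a list [] keeps the axis in the result
print(df.loc['A', ['bar']], '\n')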

second
one   -0.936493
two    1.624851
Name: A, dtype: float64

-0.9364927994329391

first  second
bar    one      -0.936493
Name: A, dtype: float64

first  second
bar    one      -0.936493
Name: A, dtype: float64

first  second
bar    one      -0.936493
Name: A, dtype: float64

first  second
bar    one      -0.936493
       two       1.624851
Name: A, dtype: float64

# tuple of lists
print(df.loc['A', (['bar', 'foo'], ['one', 'two'])], '\n')
# list of tuples
print(df.loc['A', [('bar', 'one'), ('foo', 'two')]], '\n')

first  second
bar    one      -0.936493
       two       1.624851
foo    one       0.144425
       two       0.510066
Name: A, dtype: float64

first  second
bar    one      -0.936493
foo    two       0.510066
Name: A, dtype: float64
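
The difference between a tuple of lists and a list of tuples is easier to see with deterministic data. The following sketch uses a small hypothetical frame named demo filled with np.arange instead of random numbers: a tuple of lists selects every combination of the listed labels, while a list of tuples selects exactly the listed pairs.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']],
                                  names=['first', 'second'])
demo = pd.DataFrame(np.arange(8).reshape(2, 4), index=['A', 'B'], columns=cols)

# Tuple of lists: every combination of the listed labels on each level.
product = demo.loc['A', (['bar', 'foo'], ['one', 'two'])]
# List of tuples: exactly the listed (first, second) pairs.
pairs = demo.loc['A', [('bar', 'one'), ('foo', 'two')]]

print(product.tolist())   # [0, 1, 2, 3]
print(pairs.tolist())     # [0, 3]
```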

MultiIndex on the rows

df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

                0         1         2         3
bar one  1.027278 -0.180081 -0.110635  0.724393
    two -0.189482 -0.946654  0.910129 -0.504019
baz one -1.099056  1.757185 -0.422867  1.919558
    two  0.066132  1.711503  0.033937 -0.691720
foo one -2.470554  0.587912 -2.073973 -0.550532
    two -2.011376  0.480204 -1.050221  1.501428
qux one -0.117892  2.151013  0.962524 -0.183576
    two  0.319457 -0.230228  1.655544 -0.364321

df.loc['bar', [0, 1]]

            0         1
one  1.027278 -0.180081
two -0.189482 -0.946654

df.loc[(['bar'], 'one'), [0, 1]]
# df.loc[(['bar'], 'one'), (0, 1)]

                0         1
bar one  1.027278 -0.180081

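CategoricalIndex

A CategoricalIndex is well suited to indexes that contain duplicates. It is a container built around a Categorical, which allows efficient storage and indexing of an index with a large number of duplicated elements.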
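
A brief sketch of where the efficiency comes from (ci is an illustrative name): each distinct label is stored once in categories, and every element of the index is represented by a small integer code pointing into them.

```python
import pandas as pd

ci = pd.CategoricalIndex(list('aabbca'), categories=list('cab'))

print(list(ci.categories))   # ['c', 'a', 'b'], each distinct label stored once
print(ci.codes.tolist())     # [1, 1, 2, 2, 0, 1], one integer code per element
```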

# H.get_param("pd.CategoricalIndex", globals())
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6),
                   'B': list('aabbca')})

df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

print(df, '\n')
print(df.dtypes, '\n')
print(df.B.cat.categories, '\n')

   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

A       int32
B    category
dtype: object

Index(['c', 'a', 'b'], dtype='object')

Use .set_index() to build df2, which has a CategoricalIndex:

df2 = df.set_index('B')
print(df2.index)

CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a',
'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the categories; otherwise a KeyError is raised.
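
The KeyError case can be sketched on a self-contained frame (demo is a hypothetical name; its index mirrors the one built in this section). The label 'e' is not one of the categories, so the lookup fails, whereas 'a' returns all matching rows.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': np.arange(6)},
                    index=pd.CategoricalIndex(list('aabbca'),
                                              categories=list('cab'), name='B'))

# 'e' is not among the categories ['c', 'a', 'b'].
try:
    demo.loc['e']
    raised = False
except KeyError:
    raised = True

print(raised)                         # True
print(demo.loc['a', 'A'].tolist())    # [0, 1, 5], all rows labelled 'a'
```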

df2.loc['a']

   A
B
a  0
a  1
a  5

The CategoricalIndex is preserved after indexing:

df2.loc['a'].index

CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'],
ordered=False, name='B', dtype='category')

Sorting the index follows the order of the categories (the index was built with CategoricalDtype(list('cab')), so the sort order is c, a, b):

df2.sort_index()

   A
B
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index also preserve the nature of the index.

df2.groupby(level=0).sum()
df2.groupby(level=0).sum().index

CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'],
ordered=False, name='B', dtype='category')

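Reindexing operations return an index whose type depends on the indexer passed in. Passing a list returns a plain Index; passing a Categorical returns a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. Just as you can reindex any pandas index, this lets you reindex with arbitrary values, even values that are not among the existing categories.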

df2.reindex(['a', 'e'])
df2.reindex(['a', 'e']).index

df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))

df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index

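IntervalIndex

IntervalIndex, together with its own dtype IntervalDtype and the Interval scalar type, was added in pandas 0.20.0 and gives pandas first-class support for interval data.

The IntervalIndex supports some distinctive indexing behavior and is also the type of the categories returned by cut() and qcut().

Indexing with an IntervalIndex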

df = pd.DataFrame({'A': [1, 2, 3, 4]},
                  index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))

df

        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

An IntervalIndex also supports label-based indexing with .loc:

# 2 lies in (1, 2]
df.loc[2]

A    2
Name: (1, 2], dtype: int64

# 2 lies in (1, 2]; 3 lies in (2, 3]
df.loc[[2, 3]]

        A
(1, 2]  2
(2, 3]  3

If a label is contained within an interval, that interval is selected as well:

print(df.loc[2.5])
print(df.loc[[2.5, 3.5]])

A    3
Name: (2, 3], dtype: int64
        A
(2, 3]  3
(3, 4]  4
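
These lookups follow from interval membership. As a small sketch (iv and idx are illustrative names), a right-closed Interval contains its right endpoint but not its left one, which is why the label 2 resolves to (1, 2] rather than (2, 3]:

```python
import pandas as pd

iv = pd.Interval(1, 2)     # closed='right' by default, i.e. (1, 2]
print(2 in iv)             # True: the right endpoint is included
print(1 in iv)             # False: the left endpoint is excluded
print(1.5 in iv)           # True

idx = pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
print(idx.get_loc(2))      # 1, the position of (1, 2]
print(idx.get_loc(2.5))    # 2, the position of (2, 3]
```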

Selecting with an Interval returns only exact matches (starting from pandas 0.25.0).

df.loc[pd.Interval(1, 2)]

A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that does not exactly match an interval in the IntervalIndex raises a KeyError.

try:
    df.loc[pd.Interval(0.5, 2.5)]
except KeyError as e:
    print('KeyError:', e)

KeyError: Interval(0.5, 2.5, closed='right')

To select all intervals that overlap a given Interval, use overlaps() to create a boolean indexer.

idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
print(idxr)
print(df[idxr])

[ True  True  True False]
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

Binning data with `cut` and `qcut`

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

c = pd.cut(range(4), bins=2)
print(c)
print(c.categories)

[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

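cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() on some data with bins set to a fixed number, which generates the bins.

Then we call cut() on other data and pass the value of .categories as the bins argument, so that the new data is assigned to the same bins.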

pd.cut([0, 3, 5, 1], bins=c.categories)

[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value that falls outside all bins is assigned NaN.
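
The same behavior can be sketched with bins built explicitly from breakpoints rather than taken from an earlier cut() call (bins and binned are illustrative names). In the resulting Categorical, the code -1 marks a value that fell outside every bin and therefore shows up as NaN.

```python
import pandas as pd

bins = pd.IntervalIndex.from_breaks([0, 1, 2])    # (0, 1] and (1, 2]
binned = pd.cut([0.5, 1.5, 5], bins=bins)

print(binned.codes.tolist())    # [0, 1, -1]
print(binned.isna().tolist())   # [False, False, True]
```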

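Generating ranges of intervals

If we need intervals at a regular frequency, we can use the interval_range() function to create an IntervalIndex from various combinations of start, end, and periods. The default frequency of interval_range is 1 for numeric intervals and one calendar day for datetime-like intervals.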

pd.interval_range(start=0, end=5)

pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)

pd.interval_range(end=pd.Timedelta('3 days'), periods=3)

IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2
days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')
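
Only the last call above has its output shown, so here is a small sketch that checks the default frequencies of the first two calls (numeric and daily are illustrative names): the numeric range steps by 1 and the datetime range steps by one calendar day.

```python
import pandas as pd

numeric = pd.interval_range(start=0, end=5)
daily = pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)

print(len(numeric))                    # 5 intervals: (0, 1] through (4, 5]
print(numeric[0], numeric[-1])
print(daily[0].left, daily[-1].right)  # spans 2017-01-01 to 2017-01-05
```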

The freq parameter specifies a non-default frequency and can make use of the various frequency aliases for datetime-like intervals.

pd.interval_range(start=0, periods=5, freq=1.5)

pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')

pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')

IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0
days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')

In addition, the closed parameter specifies which side(s) of the intervals are closed. By default, intervals are closed on the right.

pd.interval_range(start=0, end=4, closed='both')

pd.interval_range(start=0, end=4, closed='neither')

IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
              closed='neither',
              dtype='interval[int64]')
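
The effect of closed can be checked with membership tests on a shared endpoint. In this sketch (both and neither are illustrative names), the value 1 belongs to two adjacent intervals when both sides are closed, and to none when neither side is.

```python
import pandas as pd

both = pd.interval_range(start=0, end=4, closed='both')
neither = pd.interval_range(start=0, end=4, closed='neither')

# The endpoint 1 is shared by the first and second intervals.
print(1 in both[0], 1 in both[1])         # True True
print(1 in neither[0], 1 in neither[1])   # False False
```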

New in version 0.23.0

Specifying start, end, and periods generates evenly spaced intervals from start to end (inclusive), with periods elements (intervals) in the resulting IntervalIndex.

pd.interval_range(start=0, end=6, periods=4)

pd.interval_range(pd.Timestamp('2018-01-01'),
                  pd.Timestamp('2018-02-28'), periods=3)

IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20
08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')
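
For the numeric call above, whose output is not shown, the evenly spaced breakpoints can be inspected through the left and right attributes. This is a sketch; even is an illustrative name.

```python
import pandas as pd

even = pd.interval_range(start=0, end=6, periods=4)

# Four intervals of equal width 1.5 covering (0, 6].
print(even.left.tolist())    # [0.0, 1.5, 3.0, 4.5]
print(even.right.tolist())   # [1.5, 3.0, 4.5, 6.0]
```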