9.3. Selecting Data

from toolkit import H
import numpy as np
import pandas as pd
import copy
import time

Windows 10
Python 3.8.8 @ MSC v.1928 64 bit (AMD64)
Latest build date 2021.03.15
pandas version:  1.2.2
numpy version:  1.20.1

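Understanding and distinguishing the various pandas indexing methods is very helpful for data processing. Choosing the right indexing method often lets you avoid or reduce loops, which makes the code clearer and can also improve performance considerably.

pandas is built on NumPy and offers NumPy-like ways to index data. Users familiar with NumPy can pick up pandas data selection quickly, but may still find some of its indexing behavior confusing.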

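On the one hand, NumPy indexing is rich and pandas indexing is comparatively simple, so some NumPy indexing rules are not supported by pandas. For example, pandas does not support NumPy-style pointwise indexing with several integer arrays: df.iloc[[0, 1], [1, 2]] selects a 2x2 block rather than two individual elements. On the other hand, pandas provides ndarray-like data structures whose indexing rules still differ from NumPy's in some respects.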
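One such difference can be seen by passing two integer lists at once. The sketch below is only an illustration (the names arr, frame, pointwise, block and outer are made up for it): NumPy pairs the lists element by element, whereas iloc takes the cross product of rows and columns, which in NumPy corresponds to np.ix_.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
frame = pd.DataFrame(arr)

# NumPy: the two lists are paired, giving elements (0, 1) and (1, 2).
pointwise = arr[[0, 1], [1, 2]]

# pandas: rows 0-1 crossed with columns 1-2, giving a 2x2 block.
block = frame.iloc[[0, 1], [1, 2]]

# The NumPy spelling of the pandas behavior uses np.ix_.
outer = arr[np.ix_([0, 1], [1, 2])]

print(pointwise)
print(block.to_numpy())
```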

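Indexing rules

pandas provides two main data structures: Series and DataFrame. A Series resembles a one-dimensional ndarray, and a DataFrame resembles a two-dimensional ndarray. Unlike an ndarray, which has only integer indices (structured arrays aside), a Series or DataFrame can carry label indices. A Series can have row labels; a DataFrame can have both row and column labels.

pandas supports the following kinds of index:

Scalar: integer, label
1-D array-like: integer sequence, label sequence
Slice: integer slice, label slice
Boolean array

Unlike NumPy, pandas offers several ways to apply an index:

The indexing operator []
The attributes .loc / .iloc
The attributes .at / .iat
The methods take / get

Indexing rules: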

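A Series can only be indexed by row, because it has a single dimension. Whatever the indexing method, selecting from a Series takes exactly one index (the row index); anything else raises an error.
The indexing operator [] supports every kind of index, but it accepts only one index; a form such as df[0, 1] is not allowed. For a DataFrame, [] selects columns when given a scalar (integer or label) or a sequence, and selects rows when given a slice or a boolean array.
loc is label-based and accepts labels, label sequences, label slices, and boolean arrays; an integer passed to loc is treated as a label, not a position. iloc is short for integer-location and accepts only integers, one-dimensional integer sequences, integer slices, and boolean arrays. When selecting from a DataFrame, loc/iloc may omit the column index but not the row index.
at is similar to loc but accepts only scalar labels (integers count only when they are the labels); iat is similar to iloc but accepts only scalar integer positions. at/iat return or set a single scalar. When selecting from a DataFrame, at/iat can omit neither the row index nor the column index.
If a Series/DataFrame has row or column labels, a Series can get or set a row scalar through attribute access, and a DataFrame can get or set a column through attribute access.
By default the index of a Series/DataFrame holds integer values, in which case []/loc/at take those integer values; once row or column labels are set, loc/at take only label values. In the pandas version used here, [] on a Series with non-integer labels still falls back to positional lookup for an integer key (as in ser[1] below), a behavior deprecated in later pandas versions. iloc/iat take integer positions in both cases.
An integer index or a label index takes exactly one element from the corresponding axis, so each additional integer or label index reduces the dimensionality of the result by one. Taking one element from a Series returns a scalar; taking one row or one column from a DataFrame returns a Series.
pandas has two kinds of slice: integer slices and label slices. An integer slice is a half-open interval (closed on the left, open on the right), just like an ordinary Python slice; a label slice is a closed interval, that is, its end point is inclusive.
If the result is a scalar, its index information is lost; if the result is a Series or DataFrame, the corresponding index information is preserved.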
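Two of the rules above, slice end points and dimension reduction, are easy to check on a tiny example. This is a minimal sketch; the objects s and frame are illustrative and unrelated to the sample data used later in this section.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Integer (positional) slice: half-open, the end position is excluded.
by_position = s.iloc[0:2]

# Label slice: closed, the end label is included.
by_label = s.loc["a":"c"]

frame = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])

row = frame.loc["r0"]        # one scalar indexer  -> Series
cell = frame.loc["r0", "x"]  # two scalar indexers -> scalar

print(list(by_position.index))
print(list(by_label.index))
print(type(row).__name__, cell)
```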

Attention

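A Series or DataFrame can access a row or column through attribute access only when the index or columns value is a valid Python identifier. For example, ser.1 is not allowed.

In addition, when an index or columns value clashes with an existing method name, it cannot be accessed as an attribute.

Indexing examples

The examples below cover, in order: creating the sample data, selecting a row, selecting a column, selecting multiple rows, selecting multiple columns, selecting a scalar, and selecting multiple rows and columns.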
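A small sketch of the name clash, using a made-up frame with a column called "count" (which collides with DataFrame.count) next to a harmless column "price":

```python
import pandas as pd

frame = pd.DataFrame({"count": [1, 2, 3], "price": [9.5, 8.0, 7.25]})

# "price" is a valid identifier and clashes with nothing.
price_by_attr = frame.price

# "count" clashes with the DataFrame.count method, so the attribute
# resolves to the method, not to the column.
count_attr_is_method = callable(frame.count)

# Bracket indexing always reaches the column.
count_column = frame["count"]

print(count_attr_is_method)
print(list(count_column))
```

Bracket indexing is therefore the safer default whenever column names are not under your control.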

ser = pd.Series([4.5, 7.2, -5.3, 3.6, 6.1], index=['d', 'b', 'a', 'c', 'b'])
print(ser)

d    4.5
b    7.2
a   -5.3
c    3.6
b    6.1
dtype: float64

d = {'one': pd.Series([1, 2, 3, 5], index=['a', 'b', 'c', 'e']),
     'two': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])}

df = pd.DataFrame(d)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    5

Selecting a row by integer position, by label, and by attribute:

# Series --> Return scalar
print(ser[1])
print(ser.iloc[1])
print(ser.iat[1])

7.2
7.2
7.2

# DataFrame --> Return Series
print(df.iloc[0])  # equal to df.iloc[0, :]

one    1.0
two    1.0
Name: a, dtype: float64

# Series --> Return scalar (a Series here, because the label "b" is duplicated)
print(ser["b"])
print(ser.loc["b"])

b    7.2
b    6.1
dtype: float64
b    7.2
b    6.1
dtype: float64

# DataFrame --> Return Series
print(df.loc["a"])  # equal to df.loc["a", :]

one    1.0
two    1.0
Name: a, dtype: float64

print(ser.b)

b    7.2
b    6.1
dtype: float64

Selecting a column by integer position, by label, and by attribute:

# DataFrame --> Return Series
print(df.iloc[:, 0])

a    1.0
b    2.0
c    3.0
d    NaN
e    5.0
Name: one, dtype: float64

# DataFrame --> Return Series
print(df["one"])
print(df.loc[:, "one"])

a    1.0
b    2.0
c    3.0
d    NaN
e    5.0
Name: one, dtype: float64
a    1.0
b    2.0
c    3.0
d    NaN
e    5.0
Name: one, dtype: float64

# DataFrame --> Return Series
print(df.one)

a    1.0
b    2.0
c    3.0
d    NaN
e    5.0
Name: one, dtype: float64

Selecting multiple rows by integer sequence, label sequence, integer slice, and label slice:

# Series --> Return Series
print(ser[[1, 0]])
print(ser.iloc[[1, 0]])

b    7.2
d    4.5
dtype: float64
b    7.2
d    4.5
dtype: float64

# DataFrame --> Return DataFrame
print(df.iloc[[0, 1]])  # equal to df.iloc[[0, 1], :]

   one  two
a  1.0    1
b  2.0    2

# Series --> Return Series
print(ser[["b", "a"]])
print(ser.loc[["b", "a"]])

b    7.2
b    6.1
a   -5.3
dtype: float64
b    7.2
b    6.1
a   -5.3
dtype: float64

# DataFrame --> Return DataFrame
print(df.loc[["a", "b"]])

   one  two
a  1.0    1
b  2.0    2

A tuple cannot be used in place of a list:

try:
    ser.loc[("b", "a")]
except Exception as e:
    print(e)

Too many indexers

# Series --> Return Series
print(ser[1:3])
print(ser.iloc[1:3])

b    7.2
a   -5.3
dtype: float64
b    7.2
a   -5.3
dtype: float64

# DataFrame --> Return DataFrame
print(df[0:3:2])
print(df[0:2])

   one  two
a  1.0    1
c  3.0    3
   one  two
a  1.0    1
b  2.0    2

# Series --> Return Series
print(ser["d":"a"])

d    4.5
b    7.2
a   -5.3
dtype: float64

# DataFrame --> Return DataFrame
print(df["a":"c"])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3

Selecting multiple columns by label sequence and label slice:

# DataFrame --> Return DataFrame
print(df[["one", "two"]])
print(df.loc[:, ["one", "two"]])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    5
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    5

# DataFrame --> Return DataFrame
print(df.loc[:, "one":"two"])

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    5

With a slice, [] selects rows rather than columns:

print(df["one":"two"])

Empty DataFrame
Columns: [one, two]
Index: []

# Series --> Return scalar
print(ser.d)
print(ser[0])
print(ser.iloc[0])
print(ser.loc["d"])
print(ser.iat[0])
print(ser.at["d"])

4.5
4.5
4.5
4.5
4.5
4.5

# DataFrame --> Return scalar
print(df.loc["a", "one"])
print(df.iloc[0, 0])
print(df.at["a", "one"])
print(df.iat[0, 0])
print(df["one"]["a"])  # Not recommended

1.0
1.0
1.0
1.0
1.0

If a label is duplicated, selecting by that label returns a Series rather than a scalar, for both Series and DataFrame:

print(ser.loc["b"])

b    7.2
b    6.1
dtype: float64

print(df.loc[["a", "c"], ["one", "two"]])
print(df.iloc[[0, 2], [0, 1]])

# First select a single column, then select multiple rows
print(df["one"][["a", "c"]])  # Not recommended
print(df[["one", "two"]].loc[["a", "c"]])  # Not recommended

   one  two
a  1.0    1
c  3.0    3
   one  two
a  1.0    1
c  3.0    3
a    1.0
c    3.0
Name: one, dtype: float64
   one  two
a  1.0    1
c  3.0    3

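The `at` and `iat` accessors: fast scalar access

Because [] has to handle many cases (label lists, slices, boolean indexing, and so on), it carries some overhead. If you only need a scalar value, the fastest way is to use the at and iat accessors.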

print(ser.at["b"])
print(ser.iat[0])
print(df.at["a", "one"])
print(df.iat[0, 1])

b    7.2
b    6.1
dtype: float64
4.5
1.0
1

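at/iat are similar to loc/iloc, but at/iat support neither slices nor boolean indexing.

Duplicate index labels

Although it is often stressed that axis labels (index values) should be unique, this is not mandatory. Consider this simple DataFrame with duplicate index values: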

obj = pd.DataFrame(np.ones((5, 2)), index=['a', 'a', 'b', 'b', 'c'],
                   columns=['one', 'one'])

The is_unique attribute of an index tells you whether its values are unique:

print(obj.index.is_unique)
print(obj.columns.is_unique)

False
False

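Data selection behaves somewhat differently with duplicate index values. If a label maps to several values, a Series is returned; if it maps to a single value, a scalar is returned.

Alignment in arithmetic

Operations on a Series or DataFrame (boolean filtering, scalar multiplication, mathematical functions, and so on) preserve the link between labels and values. In arithmetic between two objects, values are matched by label rather than by position: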
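The difference in return type is easy to see on a short Series. A minimal sketch with an illustrative Series s whose label "a" appears twice:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "a", "c"])

dup = s.loc["a"]   # label occurs twice  -> Series with both values
uniq = s.loc["c"]  # label occurs once   -> scalar

print(s.index.is_unique)
print(type(dup).__name__, len(dup))
print(uniq)
```

Code that expects a scalar from loc or at may therefore want to check index.is_unique first.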

a = pd.DataFrame(np.ones((2, 2)), index=['a', 'b'], columns=['c', 'd'])
b = pd.DataFrame(np.arange(4).reshape(2, 2),
                 index=['a', 'b'], columns=['d', 'c'])
a + b

     c    d
a  2.0  1.0
b  4.0  3.0
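The same label-based alignment applies to Series, and labels present on only one side produce NaN. A minimal sketch with two illustrative Series; the fill_value argument of Series.add is one way to treat a missing side as zero:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Values are matched by label; "a" and "d" exist on one side only.
total = s1 + s2

# Treat the missing side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```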

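Indexing with missing labels

In older pandas versions, .loc[list-of-labels] worked as long as at least one of the labels was present, and raised KeyError otherwise. This behavior has been deprecated since version 0.21, and the recommended alternative is the .reindex() method. In newer pandas versions, a KeyError is raised as soon as any requested label is missing.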

df.reindex(["a", "b", "y", "z"])

   one  two
a  1.0  1.0
b  2.0  2.0
y  NaN  NaN
z  NaN  NaN

If df has duplicate index values, reindex raises ValueError:

df_tmp = pd.DataFrame({'one': pd.Series([1, 2, 3, 5], index=['a', 'a', 'c', 'e']),
                       'two': pd.Series([1, 2, 3, 4], index=['a', 'a', 'c', 'e'])})
try:
    df_tmp.reindex(["a", "c"])
except ValueError as e:
    print(e)

cannot reindex from a duplicate axis

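With slice indexing (integer or label slices), a slice that runs past the range does not raise an error. If the whole slice lies outside the index range, an empty Series or empty DataFrame is returned. For label slices this holds when the index is sorted; with an unsorted index, a slice bound that is not in the index raises KeyError.

Assignment

[]/iloc/loc/iat/at not only return the selected data but can also assign to it: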
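A minimal sketch of out-of-range slices, assuming a sorted label index (label slices with bounds that are absent from the index rely on that ordering):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])  # sorted labels

partly_out_pos = s.iloc[1:100]     # end position beyond the data
fully_out_pos = s.iloc[10:20]      # whole slice beyond the data

partly_out_label = s.loc["b":"z"]  # "z" is not in the index
fully_out_label = s.loc["x":"z"]   # neither bound is in the index

print(list(partly_out_pos.index), len(fully_out_pos))
print(list(partly_out_label.index), len(fully_out_label))
```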

df.loc["two"] = 2

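Assigning to a non-existent label with []/loc/at adds a new labelled row or column to the DataFrame, that is, it enlarges the target object (enlargement). The df.loc["two"] = 2 statement above already does this: df has no row labelled "two", so a new row is appended.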

df["three"] = 30
print(df)

     one  two  three
a    1.0    1     30
b    2.0    2     30
c    3.0    3     30
d    NaN    4     30
e    5.0    5     30
two  2.0    2     30

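The position-based iloc/iat do not support enlargement. The xs/take/lookup methods can only retrieve values, not assign them.

The `get` method: returning a default value

df.get(key, default=None)

Both Series and DataFrame have a get method that, like dict.get, can return a default value. For a Series, get indexes rows; for a DataFrame, get indexes columns.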
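A minimal sketch of the contrast, on an illustrative two-row frame: loc enlarges the frame when the label is new, while iloc refuses a position that does not exist yet.

```python
import pandas as pd

frame = pd.DataFrame({"one": [1, 2]}, index=["a", "b"])

# Label-based assignment to a new label appends a row.
frame.loc["c"] = 3

# Position-based assignment past the end is rejected.
iloc_failed = False
try:
    frame.iloc[5] = 0
except IndexError:
    iloc_failed = True

print(list(frame.index))
print(iloc_failed)
```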

print(ser.get("a"))
print(df.get(["one", "two"], 5))
print(df.get(["one", "three"], 5))

-5.3
     one  two
a    1.0    1
b    2.0    2
c    3.0    3
d    NaN    4
e    5.0    5
two  2.0    2
     one  three
a    1.0     30
b    2.0     30
c    3.0     30
d    NaN     30
e    5.0     30
two  2.0     30
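The calls above all hit existing keys, so the default is never used. A minimal sketch of the fallback itself, on illustrative objects s and frame:

```python
import pandas as pd

s = pd.Series([1, 2], index=["a", "b"])
frame = pd.DataFrame({"one": [1, 2]})

missing_row = s.get("z", -1)                 # row label absent -> default
missing_col = frame.get("missing")           # default is None
partly_missing = frame.get(["one", "missing"], 5)  # one absent key is enough

print(missing_row, missing_col, partly_missing)
```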

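The `lookup` method

Given a sequence of row labels and a sequence of column labels, extract the corresponding values as a NumPy array:

# Returns the values at ('c', 'one') and ('b', 'two')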
df.lookup(["c", "b"], ["one", "two"])

<ipython-input-1-9520edc4baa0>:2: FutureWarning: The 'lookup' method
is deprecated and will be removed in a future version. You can use
DataFrame.melt and DataFrame.loc as a substitute.
  df.lookup(["c", "b"], ["one", "two"])

array([3., 2.])
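Since lookup is deprecated, a replacement is needed in new code. The sketch below is one possible substitute, not the melt/loc approach named in the warning: it converts labels to positions with Index.get_indexer and then uses NumPy pointwise indexing. It assumes unique row and column labels, and the frame is illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
rows = ["c", "b"]
cols = ["one", "two"]

# Translate labels into integer positions (requires unique labels).
row_pos = frame.index.get_indexer(rows)
col_pos = frame.columns.get_indexer(cols)

# NumPy pairs the two position arrays element by element.
values = frame.to_numpy()[row_pos, col_pos]

print(values)
```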

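The `take` method

pd.DataFrame.take(self, indices, axis=0, is_copy=None, **kwargs)

indices: an array of integers indicating the positions of the elements to take.

Like ndarray, the pandas Index, Series, and DataFrame also provide a take() method, which retrieves the elements along a given axis at the given positions. take does not use labels: the given indices must be integer positions, supplied as a one-dimensional array-like. take also accepts negative integers, which count from the end of the object.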

# Index
index = pd.Index(np.random.randint(0, 1000, 10))
print(index)

positions = [0, 9, 3]

print(index[positions])
print(index.take(positions))

Int64Index([700, 922, 341, 534, 648, 885, 0, 215, 882, 889], dtype='int64')
Int64Index([700, 889, 534], dtype='int64')
Int64Index([700, 889, 534], dtype='int64')

# Series
ser = pd.Series(np.random.randn(10))
print(ser.iloc[positions])
print(ser.take(positions))

0    0.759892
9    2.094738
3   -0.599890
dtype: float64
0    0.759892
9    2.094738
3   -0.599890
dtype: float64

# DataFrame
frm = pd.DataFrame(np.random.randn(5, 3))
print(frm)

print(frm.take([1, 4, 3]))
print(frm.take([0, 2], axis=1))

          0         1         2
0 -0.578601  1.044800  1.194279
1 -1.600242  1.506052  1.185538
2  1.027127 -0.723504 -2.328613
3  0.285260 -1.157831  0.564503
4  1.219048  0.374449  0.014576
          0         1         2
1 -1.600242  1.506052  1.185538
4  1.219048  0.374449  0.014576
3  0.285260 -1.157831  0.564503
          0         2
0 -0.578601  1.194279
1 -1.600242  1.185538
2  1.027127 -2.328613
3  0.285260  0.564503
4  1.219048  0.014576

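Note that the take method of pandas objects is not meant to work with boolean indices and may return unexpected results: the booleans are interpreted as the integer positions 0 and 1.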

arr = np.arange(10)
print(arr.take([False, False, True, True]))

[0 0 1 1]

ser = pd.Series(np.arange(10))
print(ser.take([False, False, True, True]))

0    0
0    0
1    1
1    1
dtype: int32
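If the selection really is a boolean mask, one option is to apply it with [] or to convert it to positions before calling take. A minimal sketch, assuming np.flatnonzero as the conversion step (the Series here is illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(10, 20))
mask = (ser % 2 == 0).to_numpy()  # boolean array, same length as ser

# Boolean indexing keeps the elements where the mask is True.
by_mask = ser[mask]

# take needs integer positions, so convert the mask first.
positions = np.flatnonzero(mask)
by_take = ser.take(positions)

print(list(by_take.index))
print(by_mask.equals(by_take))
```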

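Finally, a small performance tip: because take handles a narrower range of inputs, it can be faster than fancy indexing. In the timings below the gain is clear for the ndarray and marginal for the Series.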

arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
np.random.shuffle(indexer)

%timeit arr[indexer]
%timeit arr.take(indexer, axis=0)

492 µs ± 148 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
196 µs ± 31.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

ser = pd.Series(arr[:, 0])
%timeit ser.iloc[indexer]
%timeit ser.take(indexer)

519 µs ± 175 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
441 µs ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

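Views and copies

Users familiar with NumPy know that the result of indexing an ndarray is either a view or a copy. A copy is a complete duplicate of the original data and lives at a different location in memory. A view is an alias of, or reference to, the original data and shares the same physical memory. Modifying a copy does not affect the original data, whereas modifying a view does.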
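The distinction can be checked directly in NumPy. A minimal sketch with illustrative names: a basic slice gives a view, fancy indexing with a list gives a copy, and np.shares_memory reports which is which.

```python
import numpy as np

arr = np.arange(10)

view = arr[2:5]           # basic slicing  -> view
copied = arr[[2, 3, 4]]   # fancy indexing -> copy

print(np.shares_memory(arr, view))    # the view shares arr's buffer
print(np.shares_memory(arr, copied))  # the copy has its own buffer

view[0] = 99     # writes through to arr[2]
copied[1] = -1   # leaves arr[3] untouched

print(arr[2], arr[3])
```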

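Indexing results in pandas are likewise either views or copies. Because pandas operations are built on NumPy, the logic that decides between a view and a copy also comes from NumPy.

All elements of an ndarray are homogeneous, that is, they share one dtype. A DataFrame is not always like that. If all columns of a DataFrame have the same dtype, it is a single-dtyped object; if they do not, it is a multi-dtyped object, also called mixed-type.

If the result of indexing is a multi-dtyped DataFrame, it is necessarily a copy, because pandas cannot return a multi-dtyped DataFrame as a view of a single homogeneous NumPy array. For efficiency, however, an indexing operation that returns a single-dtyped DataFrame may return a view, depending on the memory layout of the object.

The flags attribute of a NumPy array (flags.owndata) shows whether the array owns its data or is a view, but a pandas DataFrame exposes no such public interface. Some private DataFrame attributes do carry this information, and we can read them to judge whether an indexing result is a copy or a view.

The steps below create the sample data and then define the get_df_info function.

Create the sample DataFrames: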
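Whether a DataFrame is single-dtyped can also be checked with public API only. A minimal sketch, where is_single_dtyped is a hypothetical helper written for this illustration and not part of pandas:

```python
import pandas as pd


def is_single_dtyped(frame):
    """Hypothetical helper: True when every column has the same dtype."""
    return len(set(frame.dtypes)) <= 1


mixed = pd.DataFrame({"one": [1, 2], "two": [10, 20], "three": ["A", "B"]})
single = pd.DataFrame({"one": [1, 2], "two": [10, 20]})

print(is_single_dtyped(mixed))
print(is_single_dtyped(single))
```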

data = {"one": [1, 2, 3, 4, 5],
        "two": [10, 20, 30, 40, 50],
        "three": ["A", "B", "C", "D", "D"]}
mixed_df = pd.DataFrame(data)

print("Is view?", mixed_df._is_view)
print("Is mixed-type?", mixed_df._is_mixed_type)
print("Is single-dtyped?", mixed_df._mgr.is_single_block)

data = {"one": [1, 2, 3, 4, 5],
        "two": [10, 20, 30, 40, 50]}
single_df = pd.DataFrame(data)

print("Is view?", single_df._is_view)
print("Is mixed-type?", single_df._is_mixed_type)
print("Is single-dtyped?", single_df._mgr.is_single_block)

print("mixed_df memory address:", hex(id(mixed_df)).upper())
print("single_df memory address:", hex(id(single_df)).upper())

Is view? False
Is mixed-type? True
Is single-dtyped? False
Is view? False
Is mixed-type? False
Is single-dtyped? True
mixed_df memory address: 0X293C66A4850
single_df memory address: 0X293C6648640

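def get_df_info(df, original_df=None):
    """Print view/copy diagnostics for an indexing result `df`.

    If `original_df` is given, also write a timestamp into `df` and report
    whether it shows up in `original_df`. Lines starting with "*" flag an
    inconsistent result or a raised SettingWithCopyError. Relies on private
    pandas attributes. Returns a deep copy taken before any write.
    """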
    if original_df is not None:
        backup_df = copy.deepcopy(original_df)
    else:
        backup_df = copy.deepcopy(df)

    print("  Is view?", df._is_view)
    if df._is_view != (df.values.base is None):
        print("  Is 'values.base' None?", df.values.base is None)
    else:
        print("* Is 'values.base' None?", df.values.base is None)
    print("  Is mixed-type?", df._is_mixed_type)
    print("  Is copy?", df._is_copy.__str__().replace("0x00000", "0x"))
    try:
        print("  Result type:", df.dtypes.to_list())
    except AttributeError:
        print("  Result type:", df.dtypes)

    pd.set_option('mode.chained_assignment', None)
    time.sleep(1)
    timestamp_int = int(time.time())
    if original_df is not None:
        df[0] = timestamp_int
        if df._is_view == (timestamp_int in original_df.values):
            print("  Is original dataframe modified?",
                  timestamp_int in original_df.values)
        else:
            print("* Is original dataframe modified?",
                  timestamp_int in original_df.values)

    pd.set_option('mode.chained_assignment', "raise")
    try:
        df[0] = timestamp_int
    except pd.core.common.SettingWithCopyError:
        # print('\033[31mSettingWithCopyError\033[0m')
        print('* SettingWithCopyError')
    return backup_df

The results below cover integer, label, sequence, and slice indexing, in that order.

print("single row from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.iloc[0], mixed_df)
print("single row from a single-typed DataFrame:")
single_df = get_df_info(single_df.iloc[0], single_df)

print("single column from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.iloc[:, 2], mixed_df)
print("single column from a single-typed DataFrame:")
single_df = get_df_info(single_df.iloc[:, 0], single_df)

single row from a mixed-type DataFrame:
  Is view? False
  Is 'values.base' None? True
  Is mixed-type? False
  Is copy? <weakref at 0x293C669A4F0; to 'DataFrame' at 0x293C66A4850>
  Result type: object
  Is original dataframe modified? False
* SettingWithCopyError
single row from a single-typed DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: int64
  Is original dataframe modified? True
single column from a mixed-type DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: object
  Is original dataframe modified? True
* SettingWithCopyError
single column from a single-typed DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: int64
  Is original dataframe modified? True

Label indexing gives the same results as integer indexing.

print("single row from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.loc[0], mixed_df)
print("single row from a single-typed DataFrame:")
single_df = get_df_info(single_df.loc[0], single_df)

print("single column from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.loc[:, "one"], mixed_df)
print("single column from a single-typed DataFrame:")
single_df = get_df_info(single_df.loc[:, "one"], single_df)

single row from a mixed-type DataFrame:
  Is view? False
  Is 'values.base' None? True
  Is mixed-type? False
  Is copy? <weakref at 0x293C669AE50; to 'DataFrame' at 0x293C6648640>
  Result type: object
  Is original dataframe modified? False
* SettingWithCopyError
single row from a single-typed DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: int64
  Is original dataframe modified? True
single column from a mixed-type DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: int64
  Is original dataframe modified? True
* SettingWithCopyError
single column from a single-typed DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: int64
  Is original dataframe modified? True

Integer sequences and label sequences give the same results, so only the integer-sequence results are shown here.

index = [0, 1]
print("multi-rows from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.iloc[index], mixed_df)
print("multi-rows from a single-typed DataFrame:")
single_df = get_df_info(single_df.iloc[index], single_df)

print("multi-columns from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.iloc[:, index], mixed_df)
print("multi-columns from a single-typed DataFrame:")
single_df = get_df_info(single_df.iloc[:, index], single_df)

multi-rows from a mixed-type DataFrame:
  Is view? False
* Is 'values.base' None? False
  Is mixed-type? True
  Is copy? <weakref at 0x293C66A50E0; to 'DataFrame' at 0x293C66A4AF0>
  Result type: [dtype('int64'), dtype('int64'), dtype('O')]
  Is original dataframe modified? False
* SettingWithCopyError
multi-rows from a single-typed DataFrame:
  Is view? False
* Is 'values.base' None? False
  Is mixed-type? False
  Is copy? <weakref at 0x293C669AE50; to 'DataFrame' at 0x293C66A43A0>
  Result type: [dtype('int64'), dtype('int64')]
  Is original dataframe modified? False
* SettingWithCopyError
multi-columns from a mixed-type DataFrame:
  Is view? False
* Is 'values.base' None? False
  Is mixed-type? False
  Is copy? <weakref at 0x293C66A3180; to 'DataFrame' at 0x293C6698D00>
  Result type: [dtype('int64'), dtype('int64')]
  Is original dataframe modified? False
multi-columns from a single-typed DataFrame:
  Is view? False
* Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: [dtype('int64'), dtype('int64')]
  Is original dataframe modified? False

Integer slices and label slices give the same results.

print("multi-rows from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.iloc[0:2], mixed_df)
print("multi-rows from a single-typed DataFrame:")
single_df = get_df_info(single_df.iloc[0:2], single_df)

print("multi-columns from a mixed-type DataFrame:")
mixed_df = get_df_info(mixed_df.loc[:, "one":"three"], mixed_df)
print("multi-columns from a single-typed DataFrame:")
single_df = get_df_info(single_df.loc[:, "one":"three"], single_df)

multi-rows from a mixed-type DataFrame:
  Is view? False
* Is 'values.base' None? False
  Is mixed-type? True
  Is copy? <weakref at 0x293C66A5450; to 'DataFrame' at 0x293C6635C40>
  Result type: [dtype('int64'), dtype('int64'), dtype('O')]
  Is original dataframe modified? False
* SettingWithCopyError
multi-rows from a single-typed DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? <weakref at 0x293C664AC70; to 'DataFrame' at 0x293C668CA90>
  Result type: [dtype('int64'), dtype('int64')]
  Is original dataframe modified? False
* SettingWithCopyError
multi-columns from a mixed-type DataFrame:
  Is view? False
* Is 'values.base' None? False
  Is mixed-type? True
  Is copy? None
  Result type: [dtype('int64'), dtype('int64'), dtype('O')]
  Is original dataframe modified? False
multi-columns from a single-typed DataFrame:
  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? <weakref at 0x293C669D2C0; to 'DataFrame' at 0x293C666C460>
  Result type: [dtype('int64')]
  Is original dataframe modified? False

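When a DataFrame is created from a dictionary, pandas never creates a view; when a DataFrame df is created from an ndarray, pandas creates a view that points to the ndarray. The _is_view attribute of an indexing result of df is then also True, but that view does not point to df, so modifying the indexing result does not affect df.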

df = pd.DataFrame(np.ones((4, 3)))
get_df_info(df)
print()
df = pd.DataFrame(np.ones((4, 3)))
df = get_df_info(df[(df > 0.5).any(axis=1)], df)

  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: [dtype('float64'), dtype('float64'), dtype('float64')]

  Is view? True
  Is 'values.base' None? False
  Is mixed-type? False
  Is copy? None
  Result type: [dtype('float64'), dtype('float64'), dtype('float64')]
  Is original dataframe modified? False

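This means that even when _is_view is True for an indexing result, there is no guarantee that it is a view of the original DataFrame; it may be a view of some ndarray. Some suggest using df.values.base in place of the private df._is_view attribute. However, in the sequence and slice examples above, df.values.base can disagree with df._is_view. I am not an expert on pandas internals and cannot explain the disagreement, but it shows that df.values.base cannot reliably tell whether an indexing result is a view of the original DataFrame either.

In fact, pandas cannot determine whether an indexing result is a copy, a view of the original DataFrame, or a view of some ndarray.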

a = pd.DataFrame(np.arange(16).reshape((4,4)))
x = a[(a > 2).any(axis=1)]

# Assign using equivalent notation to below
x.iloc[:len(x), 1] = 100
print(x._is_view)

# Assign using slightly different syntax
x.iloc[:, 1] = 1
print(x._is_view)

True
False

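Because NumPy returns views or copies predictably, one may wonder why pandas, which is built on NumPy, cannot. NumPy's indexing rules are fairly complex, but its kinds of index and ways of indexing are relatively uniform. pandas, aiming for more powerful and general indexing, does not map its indexing one-to-one onto the underlying NumPy arrays. Over time, the interaction between the design of pandas indexing and the underlying NumPy functionality has produced a complex set of rules that decide whether a view or a copy is returned.

In the end, there is only one simple, brute-force way to be sure whether the result of an indexing operation x = df[...] / df.loc[...] / df.iloc[...] is a view of the original DataFrame: change a value in x and check whether df changes. If it does, x is a view; if not, x is a copy.

Chained assignment

Every indexing-set operation in pandas modifies df itself.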

data = {"one": [1, 2, 3, 4, 5],
        "two": [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df["one"] = 100
df.loc[0, "one"] = 200
df.iloc[0, 1] = 300
print(df)

   one  two
0  200  300
1  100   20
2  100   30
3  100   40
4  100   50

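If df is a DataFrame created from raw data (as in the code above), there is nothing to worry about. Once df is itself the result of indexing another DataFrame, an indexing-set operation on df may have side effects or produce unexpected results:

df is a (temporary) copy but you actually wanted to modify the original DataFrame, so the code does not work as intended.
df is a view, so modifying it affects the original DataFrame; if you did not want to modify the original, the code has a side effect.

These two situations arise only with chained assignment. Experienced pandas developers handle pandas indexing behavior with ease. For newcomers, however, chained indexing is almost unavoidable and, as described in the previous section, pandas does not predictably return a view or a copy, so the outcome of a chained assignment is unpredictable and the two unintended situations above occur easily. For this reason, pandas 0.13.0, at the end of 2013, introduced SettingWithCopyWarning to address the chained-assignment failures many developers were running into.

Unfortunately, as Jeff Reback (one of the pandas core developers) has put it, detecting chained assignment directly is not possible from the language's point of view; it can only be inferred.

To understand this, consider how the Python interpreter executes this code:

# indexing-set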
dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

Chained assignment is handled differently:

# chained assignment = indexing-get + indexing-set
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

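__getitem__ and __setitem__ are two independent operations; __setitem__ knows nothing about the preceding __getitem__.

The warning/exception about chained assignment is meant to tell the user that an assignment may be ineffective. It can produce false positives as well as false negatives.

False positives were fairly common in early pandas versions, but most have since been eliminated. For completeness, some common false-positive examples are collected here. If you hit any of the following with an older pandas version, you can safely ignore or suppress the warning (or avoid it altogether by upgrading).

False-positive examples

Adding a new column to a DataFrame from the values of an existing column used to generate a warning, but this has been fixed.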
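The two call sequences can be made visible without pandas. The sketch below uses Traced, a hypothetical stand-in class that only records which special method is called on which object; it does not model any real DataFrame behavior.

```python
class Traced:
    """Hypothetical stand-in for a DataFrame that logs indexing calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # A new object is returned; the later __setitem__ call only sees
        # this object and has no link back to the original.
        return Traced(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
dfmi = Traced("dfmi", log)

dfmi["one", "second"] = 0   # one __setitem__ call on dfmi itself
dfmi["one"]["second"] = 0   # __getitem__ on dfmi, then __setitem__ on the result

for entry in log:
    print(entry)
```

In the chained form the final __setitem__ lands on the intermediate object, which is why its effect on the original depends on whether that object is a view or a copy.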

data['bidtime_hours'] = data.bidtime.map(lambda x: x * 24)
data.head(2)

Using the apply method on a slice of a DataFrame in an indexing-set operation also produced a false positive; this has been fixed as well.

data.loc[:, 'bidtime_hours'] = data.bidtime.apply(lambda x: x * 24)
data.head(2)

The DataFrame.sample method in pandas 0.17.0 had a bug that caused false positives. DataFrame.sample now always returns a copy.

sample = data.sample(2)
sample.loc[:, 'price'] = 120
sample.head()

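False negatives are still common today; for examples, see the sample code in the "Views and copies" section.

Sometimes a SettingWithCopy warning appears even though the code has no obvious chained assignment. In fact, you may have used chained assignment after all: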

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo

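Unless you are completely sure what an indexing operation returns, ignoring the warning is like planting a mine in your code. Fortunately, SettingWithCopyWarning generally appears only with chained assignment, so learning a few techniques for avoiding chained assignment removes the warning:

To modify the original DataFrame, perform the indexing-set operation directly on it rather than on a view of it. For example, use df.loc[0, 'one'] = 10 rather than df['one'][0] = 10.
If you do not want to modify the original DataFrame, explicitly create a copy of the indexing-get result and perform the indexing-set operation on the copy. For example: a = df['one'].copy(); a[0] = 10.

Note that although .copy lets you get around the warning, using it blindly can cause another problem, MemoryError, depending on the size of the data and the available memory.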
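Both recommendations in one minimal sketch, on an illustrative frame (the names frame and col are made up for the example):

```python
import pandas as pd

frame = pd.DataFrame({"one": [1, 2, 3], "two": [10, 20, 30]})

# Intent: modify the original. Use a single loc call on the original.
frame.loc[0, "one"] = 100

# Intent: leave the original alone. Work on an explicit copy.
col = frame["two"].copy()
col.iloc[0] = -1

print(frame)
print(col)
```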

from memory_profiler import profile

@profile
def foo():
    df = pd.DataFrame(np.random.randn(2 * 10 ** 7))

    d1 = df[:]
    d1 = d1.copy()

foo()

Line #    Mem usage    Increment  Occurences   Line Contents
#============================================================
     6     75.4 MiB     75.4 MiB           1   @profile
     7                                         def foo():
     8    228.0 MiB    152.7 MiB           1       df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
     9
    10    228.0 MiB      0.0 MiB           1       d1 = df[:]
    11    380.6 MiB    152.6 MiB           1       d1 = d1.copy()

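In the example above, d1 is a copy of df, but the memory held by df is not reclaimed automatically and stays allocated. Even with df = df[:].copy(), the old df data and the new copy coexist in memory for a short time.

pandas provides the mode.chained_assignment option for explicit control over SettingWithCopy. It accepts the following values:

pd.set_option('mode.chained_assignment', 'raise') # raise an exception
pd.set_option('mode.chained_assignment', 'warn')  # show a warning (the default)
pd.set_option('mode.chained_assignment', None)    # turn the warning off entirely

This can be especially useful if the team includes less experienced pandas developers or the project demands a high level of rigor. A more precise way to use this setting is through a context manager.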

# resets the option we set in the previous code segment
pd.reset_option('mode.chained_assignment')
with pd.option_context('mode.chained_assignment', None):
    data[data.bidder == 'parakeet2004']['bidderrate'] = 100

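This approach gives fine-grained control over the warning instead of affecting the whole environment indiscriminately.

Summary of indexing methods

Type	Description
`df[val]`	Select a single column or a set of columns from a DataFrame; convenient in special cases: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values by condition)
`df.loc[val]`	Select a single row or a set of rows from a DataFrame by label
`df.loc[:, val]`	Select a single column or a subset of columns by label
`df.loc[val1, val2]`	Select rows and columns together by label
`df.iloc[where]`	Select a single row or a subset of rows from a DataFrame by integer position
`df.iloc[:, where]`	Select a single column or a subset of columns from a DataFrame by integer position
`df.iloc[where_i, where_j]`	Select rows and columns together by integer position
`df.at[label_i, label_j]`	Select a single scalar by row and column label
`df.iat[i, j]`	Select a single scalar by row and column position (integers)
`reindex`	Select rows or columns by label
`get_value, set_value`	Select a single value by row and column label (removed in recent pandas versions; use at/iat instead)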