2.5.String

Linux 5.4.0-74-generic
Python 3.9.5 @ GCC 7.3.0
Latest build date 2021.06.19

Multi-line strings

# tab = "input"
str_1 = """
This is a multi-line string
This is the second line
"""
print(str_1)

str_2 = "\nThis is a multi-line string\nThis is the second line\n"
print(str_2)

print(str_1 == str_2)

This is a multi-line string
This is the second line


This is a multi-line string
This is the second line

True
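As a small sketch of two common companions to triple-quoted strings: a backslash right after the opening quotes suppresses the leading newline, and the standard-library textwrap.dedent removes indentation shared by all lines. The variable names here are only illustrative.

```python
import textwrap

# A backslash after the opening quotes drops the leading newline.
text = """\
First line
Second line
"""


def build_indented():
    # dedent() strips the indentation that all lines have in common,
    # so the literal can follow the indentation of the surrounding code.
    return textwrap.dedent("""\
        First line
        Second line
        """)


dedented = build_indented()
print(repr(text))
print(text == dedented)
```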

Escaping

To put a special character in a string literal, escape it with a backslash (\). For example, to produce a literal backslash:

print("12\\14")

12\14

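If the backslash is followed by a character that does not form a recognized escape sequence, the backslash is left in the string unchanged (since Python 3.6 such sequences also trigger a DeprecationWarning):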

print("\g")

\g

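Prefixes

In Python 3, a string literal can carry the following prefixes:

r / R: a raw string in which backslashes are not treated as escape characters; commonly used for regular expressions and file paths.
b / B: a Python 3 bytes literal. In Python 2 the default string type is already bytes, so the b prefix exists there only for compatibility with Python 3 syntax.
u / U: a Python 2 unicode string, that is, a string of Unicode text. The default str in Python 3 is the equivalent of Python 2's unicode, so the u prefix exists in Python 3 only for compatibility with Python 2 syntax. Using UTF-8 consistently as the encoding is recommended.
f / F: a formatted string literal (f-string).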
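The prefixes can be compared side by side in a short sketch. The path and the sample values are made up for illustration.

```python
import re

raw_path = r"C:\new\table"     # raw: backslashes stay literal
escaped_path = "C:\\new\\table"  # the same text written with escapes
data = b"abc"                  # bytes literal; indexing yields integers
name = "world"
greeting = f"Hello, {name}!"   # f-string: expressions in {} are evaluated

print(raw_path == escaped_path)
print(len(r"\n"), len("\n"))   # two characters versus one newline
print(type(data).__name__, data[0])
print(greeting)
print(re.findall(r"\d+", "a1b22"))
print(u"abc" == "abc")         # the u prefix changes nothing in Python 3
```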

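Character encoding

Python 3 has two distinct string types:

str: the text string type, a sequence of Unicode code points. It corresponds to the unicode type in Python 2. CPython 3.3 and later store each str with 1, 2, or 4 bytes per character (Latin-1, UCS-2, or UCS-4) depending on the widest character it contains, whereas narrow builds of Python 2 store unicode as UCS-2. A str is converted to bytes by encoding it.
bytes: the byte string type, a sequence of integers in the range 0-255. It has no inherent encoding; UTF-8 is only the default used by encode() and decode(). Literals are displayed as ASCII characters or hexadecimal escapes. It corresponds to the str type in Python 2. A bytes object is converted to str by decoding it.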

$$ \begin{array}{rcl} \text{Python 3 } \mathbf{str} & \Longleftrightarrow & \text{Python 2 } \mathbf{unicode} \\ \text{Python 3 } \mathbf{bytes} & \Longleftrightarrow & \text{Python 2 } \mathbf{str} \end{array} $$

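The Python documentation says that str holds Unicode text while bytes holds a raw binary byte stream. That description can be misleading: both types are, at bottom, sequences of integers. The difference is that a Python 3 str stores every character of a given string in a fixed-width unit, while bytes usually holds the output of a variable-length encoding such as UTF-8.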
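The "sequences of integers" view can be made concrete with a minimal sketch: ord() exposes the code points of a str, and iterating over bytes yields the individual byte values.

```python
text = "中国"

# One integer (code point) per character in the str.
code_points = [ord(ch) for ch in text]

# UTF-8 is variable-length: each of these characters takes three bytes.
utf8 = text.encode("utf-8")
byte_values = list(utf8)

print(len(text), code_points)
print(len(utf8), byte_values)

# ASCII characters take a single byte each in UTF-8.
ascii_length = len("abc".encode("utf-8"))
print(ascii_length)
```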

Creating a string in Python 3

str_1 = "中国"
# encode() uses UTF-8 by default
print(str_1.encode())
print(str_1.encode(encoding="utf-8"))

b'\xe4\xb8\xad\xe5\x9b\xbd'
b'\xe4\xb8\xad\xe5\x9b\xbd'

# Create a bytes object
str_2 = b'\xe4\xb8\xad\xe5\x9b\xbd'
# decode() uses UTF-8 by default
print(str_2.decode())
print(str_2.decode("utf-8"))

中国
中国

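Use the str type when processing or manipulating text, and the bytes type when sending text over the network or saving it to disk. encode and decode fail when the data falls outside the range of the chosen codec; for example, Chinese characters cannot be encoded with .encode("ascii") because they are outside the ASCII range.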

try:
    '中文'.encode('ascii')
except UnicodeEncodeError as e:
    print(e)

try:
    b'\xe4\xb8\xad\xe6\x96\x87'.decode('ascii')
except UnicodeDecodeError as e:
    print(e)

'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

If the bytes object contains bytes that cannot be decoded, decode() raises an error:

try:
    b'\xe4\xb8\xad\xff'.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)

'utf-8' codec can't decode byte 0xff in position 3: invalid start byte

However, if only a few of the bytes are invalid, passing errors='ignore' skips them:

"中".encode("utf-8")  # the UTF-8 encoding of 中 is b'\xe4\xb8\xad'
b'\xe4\xb8\xad\xff'.decode('utf-8', errors='ignore')

'中'
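Besides 'ignore', the codec machinery offers other standard error handlers. The sketch below compares a few of them on the same invalid input; which one is appropriate depends on whether silent data loss is acceptable.

```python
broken = b'\xe4\xb8\xad\xff'

ignored = broken.decode("utf-8", errors="ignore")
# 'replace' substitutes U+FFFD (the replacement character) for bad bytes.
replaced = broken.decode("utf-8", errors="replace")
# 'backslashreplace' keeps the bad bytes visible as escape sequences.
escaped = broken.decode("utf-8", errors="backslashreplace")

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))

# Handlers also apply when encoding: here non-ASCII characters become
# XML character references instead of raising UnicodeEncodeError.
ascii_safe = "中文".encode("ascii", errors="xmlcharrefreplace")
print(ascii_safe)
```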

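If the code will run in more than one environment, convert between str and bytes with UTF-8 to avoid garbled text. Because Python 3 uses UTF-8 by default when encoding and decoding strings, encoding problems are less common there. A Python 2 str, however, may hold ASCII, UTF-8, GBK, gb2312, cp437, big5, and so on, often depending on the default encoding of the current system or terminal. If you do not specify UTF-8 explicitly, or do not know which encoding is in use, you may end up with garbled text.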
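The same advice applies to files. A minimal sketch, assuming a temporary file named demo.txt (a made-up name): passing encoding="utf-8" to open() makes the result independent of the platform's default encoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    # Write text with an explicit encoding instead of the locale default.
    with open(path, "w", encoding="utf-8") as f:
        f.write("中国")

    # Binary mode shows the bytes that were actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Read the text back with the same explicit encoding.
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(raw)
print(text)
```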

If you are unfamiliar with character encodings, see section 1.5, Character Encoding.

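The following code prints the encodings used by Python, the current system locale, standard input, standard output, and file names on the file system: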

import sys, locale

print(sys.getdefaultencoding())
print(locale.getdefaultlocale())

print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.getfilesystemencoding())

utf-8
('zh_CN', 'UTF-8')
utf-8
UTF-8
utf-8

The following code checks whether the Python 3 str type (or the Python 2 unicode type) uses UCS-2 or UCS-4. On Python 3.3 and later the value is always 1114111:

from __future__ import print_function
import sys

# 1114111 (0x10FFFF) means UCS-4
# 65535 (0xFFFF) means UCS-2
print(sys.maxunicode)

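Because Python source code is itself a text file, it is best to save source files that contain Chinese text as UTF-8. To make the interpreter read the file as UTF-8, these two lines are usually placed at the top of the file:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The first line tells Linux and OS X that the file is an executable Python program; Windows ignores it. The second line tells the Python interpreter to read the source as UTF-8; without it, Chinese text written in the source may come out garbled. Python 3 already assumes UTF-8 for source files, so the declaration matters mainly for Python 2 or when the file uses a different encoding.

The coding comment only tells the interpreter to decode the source as UTF-8; it does not guarantee that the file was actually saved as UTF-8. You must also make sure that the text editor saves the file in UTF-8.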

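Formatting

printf-style formatting

Strings support one special built-in operation: the % (modulo) operator, which in this role is also known as the string formatting or interpolation operator. Given format % values (where format is a string), the % conversion specifications in format are replaced with zero or more elements of values. The effect is similar to sprintf() in C.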

# tab = "input"
print('Hello, %s' % 'world')  # parentheses can be omitted for a single value
print('Hi, %s, you have $%d' % ('Michael', 10000.58))

Hello, world
Hi, Michael, you have $10000

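A conversion specification contains two or more characters and consists of the following components, which must appear in this order:

The % character, which marks the start of the specification.
Mapping key (optional), a parenthesised sequence of characters, for example (somename).
Conversion flags (optional)
- - left-aligns the converted value
- + puts a sign character (+ or -) in front of the converted value
- a space leaves a blank in front of a positive number
- 0 pads the converted value with zeros when it is narrower than the field
Minimum field width (optional), the minimum width of the converted field; shorter results are padded with spaces. If given as *, the width is read from the tuple of values.
Precision (optional), given as a . followed by the precision. If given as *, the precision is read from the tuple of values.
Length modifier (optional).
Conversion type.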
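The components listed above can be combined in a single specification. The following sketch uses a made-up dictionary named row to show a mapping key with flags, width, and precision, and then the * form, which is only available when the values are given as a tuple.

```python
row = {"item": "tea", "price": 3.14159}

# %(item)-6s     : mapping key, left-aligned, minimum width 6
# %(price)+010.2f: mapping key, explicit sign, zero padding, width 10, precision 2
line = "%(item)-6s|%(price)+010.2f|" % row
print(line)

# With *, width and precision are taken from the tuple before the value.
starred = "%*.*f" % (8, 1, 3.14159)
print(repr(starred))
```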

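Conversion	Meaning	Conversion	Meaning
`d`	Signed integer decimal	`f`	Floating-point decimal format
`i`	Signed integer decimal	`F`	Floating-point decimal format
`o`	Octal	`g`	Floating point; uses lowercase exponential format if the exponent is less than -4 or not less than the precision, decimal format otherwise
`u`	Obsolete, identical to `d`	`G`	Floating point; like `g`, but with an uppercase exponent
`x`	Hexadecimal, lowercase	`c`	Single character (accepts an integer code point or a one-character string)
`X`	Hexadecimal, uppercase	`r`	String; converts any Python object with repr()
`e`	Floating-point exponential format, lowercase	`s`	String; converts any Python object with str()
`E`	Floating-point exponential format, uppercase	`%`	A literal % character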
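A few of the conversion types from the table, applied to sample values chosen only for illustration:

```python
results = {
    "octal": "%o" % 8,
    "hex_lower": "%x" % 255,
    "hex_upper": "%X" % 255,
    "exponent": "%e" % 12345.678,
    "general_small": "%g" % 0.00001234,   # exponent below -4: exponential form
    "general_plain": "%g" % 1234.5,       # otherwise: decimal form
    "general_large": "%g" % 123456789.0,  # exponent >= precision: exponential form
    "char": "%c" % 65,
    "repr": "%r" % "a",
    "percent": "%d%%" % 50,
}

for name, value in results.items():
    print(name, value)
```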

# Mapping key
## If the values are given as a dictionary, a mapping key is required, e.g.:
print('%(num)d' % {"num": 10})

# Minimum field width
print('%2s' % 123456)
print('%8s' % 123456)
print('%*s' % (10, 123456))

123456
  123456
    123456

# Precision
print('%.2f' % 100.256)
print('%.*f' % (2, 100.256))

100.26
100.26

# Conversion flags
print('%-8d' % 123456)
print('%+8d' % 123456)
print('%08d' % 123456)

123456
 +123456
00123456

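The format method

<template string>.format(<comma-separated arguments>)

The <template string> contains a series of slots written as {}, which mark where values are inserted. The basic idea is that the <comma-separated arguments> of format() are substituted into the slots of the <template string> by index. If the braces contain no index, the arguments are substituted in order of appearance. The method creates a new string object and does not modify the original.

When {} contains an index, the arguments are substituted in the specified order: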

print("{2}:{1}:{0}".format("1", "2", "3"))

3:2:1

When no index is given, the arguments are substituted in order of appearance:

print("{}:{}:{}".format("1", "2", "3"))
# equivalent to
# "{0}:{1}:{2}".format("1", "2", "3")

1:2:3
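Two details of slot numbering are easy to check with a short sketch: an explicit index can be reused, and automatic numbering cannot be mixed with explicit indices in the same template. The template text is made up for illustration.

```python
template = "{}-{}"
result = template.format("a", "b")
print(template, result)  # the template itself is not modified

# An explicit index may appear more than once.
reused = "{0}-{0}-{1}".format("a", "b")
print(reused)

# Mixing explicit and automatic numbering raises ValueError.
message = ""
try:
    "{0}-{}".format("a", "b")
except ValueError as exc:
    message = str(exc)
print(message)
```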

Besides the argument index, a slot in the <template string> of format() can also carry formatting information. The full syntax of a slot is:

replacement_field ::=  "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::=  arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::=  [identifier | integer]
attribute_name    ::=  identifier
element_index     ::=  integer | index_string
index_string      ::=  <any source character except "]"> +
conversion        ::=  "r" | "s" | "a"
format_spec       ::=  <described in the next section>

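A slot {} has three parts:

field_name: starts with an arg_name, optionally followed by any number of attribute_name or element_index lookups.
1. arg_name is an integer or a keyword. An integer selects a positional argument of .format(); a keyword selects a keyword argument.
2. If the argument has attributes, arg_name.attribute_name retrieves the attribute value.
3. If the argument supports indexing, arg_name[integer|index_string] retrieves the element at that index or key.
conversion: begins with !; r calls repr(), s calls str(), and a calls ascii().
format_spec: begins with :.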

# field_name is a keyword
print("{name!r}".format(name="小明"))
# field_name is an integer
print("{0!s}".format("小明"))
# field_name accesses an attribute
class Person:
    name = "小明"

print("{.name!s}".format(Person))
# field_name uses an integer index
names = ["小红", "小明"]
print("{[1]!s}".format(names))
# field_name uses a string index (pandas is a third-party package)
from pandas import Series

names = Series({"name":"小明"})
print("{[name]!s}".format(names))

'小明'
小明
小明
小明
小明

The format_spec controls how the value is displayed. Its grammar is:

format_spec     ::=  [[fill]align][sign][#][0][width][grouping_option][.precision][type]
fill            ::=  <any character>
align           ::=  "<" | ">" | "=" | "^"
sign            ::=  "+" | "-" | " "
width           ::=  digit+
grouping_option ::=  "_" | ","
precision       ::=  digit+
type            ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G"| "n" | "o" | "s" | "x" | "X" | "%"

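Fill character: can be given only together with an alignment option. The default is a space.

Alignment: the default depends on the type of the value (left for most objects, right for numbers). The options are:

Option	Meaning
`'<'`	Forces the field to be left-aligned within the available space (the default for most objects).
`'>'`	Forces the field to be right-aligned within the available space (the default for numbers).
`'='`	Forces the padding to be placed after the sign (if any) but before the digits. This is used for printing fields in the form "+000000120". It is valid only for numeric types, and becomes the default when '0' immediately precedes the field width.
`'^'`	Forces the field to be centered within the available space.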

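sign: valid only for numeric types, and can be one of the following:

Option	Meaning
`'+'`	A sign is shown for both positive and negative numbers.
`'-'`	A sign is shown only for negative numbers (the default behavior).
space	A leading space is shown for positive numbers and a minus sign for negative numbers.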

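#: selects the alternate form and is valid only for numeric arguments. For integers shown in binary, octal, or hexadecimal it adds the prefix 0b, 0o, or 0x to the output.

Width: the minimum field width.

Grouping option: only _ or , can be used as the thousands separator.

Precision: for the f, F, e, and E types, the number of digits after the decimal point; for g and G, the number of significant digits; for strings, the maximum number of characters taken from the value. It is not allowed for integer types.

Type: determines how the data is presented.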

Presentation types for strings:

Type	Meaning
`s`	String format; the default for strings and may be omitted
`None`	Same as `s`

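Presentation types for integers:

Type	Meaning
`b`	Binary
`c`	The Unicode character with that code point
`d`	Decimal integer
`o`	Octal
`x`	Hexadecimal, using lowercase letters
`X`	Hexadecimal, using uppercase letters
`n`	Like `d`, but uses the current locale setting to insert the appropriate number separators
`None`	Same as `d`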

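Presentation types for floats and Decimal values:

Type	Meaning
`e`	Scientific notation using the letter `e` for the exponent; the default precision is 6
`E`	Like `e`, but uses the letter `E`
`f`	Fixed-point notation; the default precision is 6
`F`	Fixed-point notation; like `f`, but prints NAN and INF in uppercase
`g`	General format; chooses fixed-point or scientific notation depending on the magnitude and the precision
`G`	Like `g`, but uses `E` instead of `e` in scientific notation
`n`	Like `g`, but uses the current locale setting to insert the appropriate number separators
`%`	Percentage; multiplies the number by 100 and displays it in `f` format followed by a percent sign
`None`	Similar to `g`, except that fixed-point notation, when used, has at least one digit after the decimal point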
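The presentation types can be tried without a template by calling the built-in format(value, format_spec), which accepts the same format_spec. The values below are arbitrary samples.

```python
samples = {
    "binary": format(255, "b"),
    "hex_prefixed": format(255, "#x"),     # '#' adds the 0x prefix
    "octal_prefixed": format(255, "#o"),
    "char": format(65, "c"),
    "grouped": format(1234567, ","),
    "scientific": format(1234.5678, "e"),
    "percent": format(0.256, ".1%"),
    "zero_padded": format(120, "+010d"),   # '0' before the width implies '=' alignment
    "centered": format("abc", "*^9"),
    "truncated": format("abcdef", ".3"),   # precision limits the length of a string
}

for name, value in samples.items():
    print(name, value)
```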

# Fill, alignment, sign, width, grouping option, precision
# Fill with *, centered, explicit + sign, precision 2
"{0:*^+20.2f}".format(120)

'******+120.00*******'

# Fill with *, centered, explicit + sign, _ separator, precision 2
"{0:*^+20_.2f}".format(12000)

'*****+12_000.00*****'

# Fill with *, centered, explicit + sign, comma separator, precision 2
"{0:*^+20,.2f}".format(12000)

'*****+12,000.00*****'

# Fill with *, left-aligned, explicit + sign, precision 2
"{0:*<+20.2f}".format(-120)

'-120.00*************'

# Fill with *, right-aligned, sign shown only for negative numbers, precision 2
"{0:*>-20.2f}".format(-120)

'*************-120.00'
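The same format_spec is also accepted by f-strings and by the built-in format(), and a spec can itself contain nested slots so that width or precision come from arguments. A brief sketch with made-up variable names:

```python
value = 120

via_method = "{0:*^+20.2f}".format(value)
via_fstring = f"{value:*^+20.2f}"
via_builtin = format(value, "*^+20.2f")
print(via_method == via_fstring == via_builtin)

# Nested slots: width (w) and precision (p) are supplied as arguments.
nested = "{:>{w}.{p}f}".format(3.14159, w=12, p=3)
print(repr(nested))
```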

Methods

Predicate methods

Method	Description
`.isalnum`	True if all characters in the string are alphanumeric and there is at least one character
`.isalpha`	True if all characters in the string are alphabetic and there is at least one character
`.isnumeric`	True if all characters in the string are numeric and there is at least one character
`.isascii`	True if all characters in the string are ASCII (code points U+0000-U+007F); the empty string is ASCII too
`.isdecimal`	True if all characters in the string are decimal characters and there is at least one character
`.isdigit`	True if all characters in the string are digits and there is at least one character
`.isidentifier`	True if the string is a valid Python identifier; use `keyword.iskeyword()` to test for reserved identifiers such as "def" and "class"
`.islower`	True if all cased characters in the string are lowercase and there is at least one cased character
`.isupper`	True if all cased characters in the string are uppercase and there is at least one cased character
`.isprintable`	True if all characters are considered printable in `repr()` or the string is empty
`.isspace`	True if all characters in the string are whitespace and there is at least one character
`.istitle`	True if the string is title-cased and there is at least one character: uppercase and title-case characters may only follow uncased characters, and lowercase characters only cased ones
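The three numeric predicates are easy to confuse. The sketch below compares them on a few sample characters (superscript two and the fraction one half are written as escapes) and also checks the empty-string and keyword cases mentioned in the table.

```python
import keyword

samples = ["123", "\u00b2", "\u00bd", "abc", ""]  # "²" and "½" as escapes

table = {
    s: (s.isdecimal(), s.isdigit(), s.isnumeric())
    for s in samples
}
for s, flags in table.items():
    print(repr(s), flags)

# The empty string is ASCII and printable, but fails the other tests.
empty_flags = ("".isascii(), "".isprintable(), "".isalpha())
print(empty_flags)

# A keyword is still a syntactically valid identifier.
keyword_flags = ("class".isidentifier(), keyword.iskeyword("class"))
print(keyword_flags)
```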

Method	Description
`.startswith(prefix[, start[, end]])`	True if the string starts with `prefix`
`.endswith(suffix[, start[, end]])`	True if the string ends with `suffix`
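Both methods also accept a tuple of candidates, and the optional start and end arguments restrict the comparison to a slice. A small sketch with a made-up file name:

```python
name = "report_2021.csv"

has_prefix = name.startswith("report")
# A tuple matches if any one of its elements matches.
is_table = name.endswith((".csv", ".tsv"))
# start=7 begins the comparison right after "report_".
year_at_7 = name.startswith("2021", 7)

print(has_prefix, is_table, year_at_7)
```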

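Case conversion

Method	Description
`.lower`	Returns a copy of the string converted to lowercase.
`.upper`	Returns a copy of the string converted to uppercase.
`.capitalize`	Returns a copy with the first character capitalized and the rest lowercased.
`.swapcase`	Returns a copy with uppercase characters converted to lowercase and lowercase characters converted to uppercase.
`.title`	Returns a copy in which every word starts with an uppercase letter and the remaining letters are lowercase.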
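A quick sketch of the case-conversion methods on one sample string. The last line shows a known quirk: .title() treats an apostrophe as a word boundary.

```python
s = "hello WORLD"

converted = {
    "lower": s.lower(),
    "upper": s.upper(),
    "capitalize": s.capitalize(),
    "swapcase": s.swapcase(),
    "title": s.title(),
}
for name, value in converted.items():
    print(name, value)

# The original string is unchanged because str is immutable.
print(s)

apostrophe = "they're".title()
print(apostrophe)
```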

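Splitting strings

Method	Description
`.split(sep=None, maxsplit=-1)`	Splits the string on `sep`, working from left to right
`.rsplit(sep=None, maxsplit=-1)`	Splits the string on `sep`, working from right to left
`.splitlines(keepends=False)`	Splits the string at line boundaries such as carriage returns and line feeds
`.partition(sep, /)`	Splits the string at the first occurrence of `sep`, from left to right, and returns a 3-tuple (before, sep, after)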

"123#456#789".split(sep="#", maxsplit=1)

['123', '456#789']

"123#456#789".rsplit(sep="#", maxsplit=1)

['123#456', '789']

"123\r\n456\r\n789".splitlines(keepends=False)

['123', '456', '789']

"123\r\n456\r\n789".splitlines(keepends=True)

['123\r\n', '456\r\n', '789']

"123#456#789".partition("#")

('123', '#', '456#789')
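Some edge cases are worth a short sketch: split() without sep collapses runs of whitespace, an explicit sep does not, and partition has a right-to-left counterpart, rpartition. The sample strings are made up.

```python
spaced = "a  b   c"

by_whitespace = spaced.split()   # runs of whitespace count as one separator
by_space = spaced.split(" ")     # every single space is a separator
print(by_whitespace)
print(by_space)

first = "key=value=x".partition("=")
last = "key=value=x".rpartition("=")
missing = "novalue".partition("=")  # separator absent: two empty strings follow
print(first, last, missing)
```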

Joining strings

"_".join(["a", "b", "c"])

'a_b_c'
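join() only accepts strings, so other values have to be converted first. A minimal sketch:

```python
numbers = [1, 2, 3]

# Convert each element to str before joining.
joined = "_".join(str(n) for n in numbers)
print(joined)

# Passing non-string elements raises TypeError.
error_name = ""
try:
    "_".join(numbers)
except TypeError as exc:
    error_name = type(exc).__name__
print(error_name)

# Joining with an empty separator concatenates the pieces.
concatenated = "".join(["a", "b", "c"])
print(concatenated)
```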

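Replacing and stripping

Method	Description
`.replace(old, new, count=-1)`	Returns a copy in which occurrences of the substring `old` are replaced with `new`; by default all occurrences, otherwise at most `count`.
`.strip(chars=None)`	Returns a copy with leading and trailing whitespace removed (or the characters in `chars`, if given).
`.lstrip(chars=None)`	Returns a copy with leading whitespace removed (or the characters in `chars`, if given).
`.rstrip(chars=None)`	Returns a copy with trailing whitespace removed (or the characters in `chars`, if given).
`.expandtabs(tabsize=8)`	Returns a copy in which each tab is replaced by enough spaces to reach the next tab stop, with tab stops every `tabsize` columns.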

"  \nabc\t\n\r ".strip()

'abc'

"  \nabc\t\n\r ".lstrip()

'abc\t\n\r '

"  \nabc\t\n\r ".rstrip()

'  \nabc'

"  \nabc\t\n\r ".expandtabs(tabsize=8)

'  \nabc     \n\r '
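The chars argument is a set of characters, not a prefix or suffix, which is a frequent source of surprises. The sketch below also shows the count argument of replace and the column-based behavior of expandtabs; the sample strings are made up.

```python
trimmed = "xxabcxx".strip("x")
print(trimmed)

# rstrip(".txt") removes any trailing '.', 't', or 'x', not the suffix ".txt".
pitfall = "test.txt".rstrip(".txt")
print(pitfall)

# count=1 replaces only the first occurrence.
first_only = "a-b-c".replace("-", "+", 1)
print(first_only)

# Tabs advance to the next multiple of tabsize, so the number of
# inserted spaces varies with the current column.
tabs = "a\tbc\td".expandtabs(4)
print(repr(tabs))
```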

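Formatting and padding

Method	Description
`.format`	Formats the string using positional and keyword arguments.
`.format_map`	Formats the string using a single mapping.
`.ljust(width, fillchar=' ')`	Returns a left-justified string of length `width`, padded with `fillchar`.
`.rjust(width, fillchar=' ')`	Returns a right-justified string of length `width`, padded with `fillchar`.
`.center(width, fillchar=' ')`	Returns a centered string of length `width`, padded with `fillchar`.
`.zfill(width)`	Returns a string of length `width`, padded with zeros on the left if it is shorter.

.format_map works much like .format, with one difference: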

# Difference between format_map and format
# With format, named fields must be passed as keyword arguments
"{name}, {sex}".format(name="Jack", sex="male")
# format_map takes a dict directly
"{name}, {sex}".format_map({"name":"Jack", "sex":"male"})

'Jack, male'
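Because format_map uses the mapping directly instead of copying it, a dict subclass can customize lookups. The sketch below uses a hypothetical helper class, KeepMissing, that leaves unknown fields in place; format(**mapping) would not see the custom __missing__ because the keyword arguments are copied into a plain dict.

```python
class KeepMissing(dict):
    """Hypothetical helper: return the placeholder text for unknown keys."""

    def __missing__(self, key):
        return "{" + key + "}"


person = {"name": "Jack", "sex": "male"}

# For an ordinary dict the two spellings give the same result.
unpacked = "{name}, {sex}".format(**person)
mapped = "{name}, {sex}".format_map(person)
print(unpacked == mapped)

# Only format_map consults __missing__ on the subclass.
partial = "{name}, {age}".format_map(KeepMissing(name="Jack"))
print(partial)
```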

print("abc".ljust(10, "#"))
print("abc".rjust(10, "#"))
print("abc".center(10, "#"))
print("123".zfill(10))
# Non-numeric strings work too
print("abc".zfill(10))

abc#######
#######abc
###abc####
0000000123
0000000abc
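Two further details, shown as a brief sketch: zfill keeps a leading sign in front of the zeros, and the padding methods return the string unchanged when it is already at least width characters long.

```python
negative = "-42".zfill(5)   # zeros are inserted after the sign
positive = "+7".zfill(4)
print(negative, positive)

# No truncation happens when the string is longer than width.
unchanged = "abcdef".ljust(3, "#")
print(unchanged)
```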

Counting occurrences of a substring

"abcAbc".count("c", 0, 6)

Finding the first occurrence of a substring

.index and .find do the same thing, except that .index raises ValueError when the substring is not found, whereas .find returns -1.

"abcAbc".index("bc", 0, 6)
"abcAbc".find("bc", 0, 6)

"abcAbc".index("bc", 0, 2)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-f8cead4644a3> in <module>
----> 1 "abcAbc".index("bc", 0, 2)

ValueError: substring not found

"abcAbc".find("bc", 0, 2)

-1
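To round off counting and searching, here is a small sketch on the same sample string. It shows that count() is case-sensitive and counts non-overlapping matches, that rfind() searches from the right, and that the in operator is enough when only a yes/no answer is needed.

```python
text = "abcAbc"

total = text.count("c")            # both occurrences of "c"
in_prefix = text.count("c", 0, 2)  # the slice "ab" contains no "c"
non_overlapping = "aaaa".count("aa")
print(total, in_prefix, non_overlapping)

first = text.find("bc")   # lowest index of a match
last = text.rfind("bc")   # highest index of a match
print(first, last)

present = "Abc" in text   # membership test, no index needed
print(present)
```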