Literature Database | Reading BibTeX Files in Python

The groundwork is more or less in place.

Today I am going to start working in earnest on building our own personal search engine.

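As far as I can tell, the most valuable structured data on my computer is probably my literature data. It is easy to export from reference managers: EndNote, Mendeley and Zotero can all export BibTeX files. The format is highly structured and carries most of the data, or links to the data, that an analysis needs, such as the author (Author), digital object identifier (DOI), title (Title) and abstract (Abstract).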

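By parsing and analyzing the records held in a reference manager, we may be able to pick out the main topics, and perhaps even the relationships between them… Anything is possible, so let's start by parsing a BibTeX file. Everything here relies on the python-bibtexparser library on GitHub: https://github.com/sciunto-org/python-bibtexparser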

Basic usage…

Here is the raw, unedited record of an article that I exported from Zotero:

@article{lee_understanding_2018,
title = {Understanding the {Role} of {Functional} {Groups} in {Polymeric} {Binder} for {Electrochemical} {Carbon} {Dioxide} {Reduction} on {Gold} {Nanoparticles}},
volume = {28},
copyright = {{\textcopyright} 2018 WILEY-VCH Verlag GmbH \& Co. KGaA, Weinheim},
issn = {1616-3028},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/adfm.201804762},
doi = {10.1002/adfm.201804762},
abstract = {Electrochemical CO2 reduction reaction (CO2RR) is one of the promising strategies for converting CO2 to value-added chemicals. Gold (Au) catalysts are considered to be the best benchmarking materials for CO2RR to produce CO. In this work, the role of different functional groups of polymeric binders on CO2RR over Au catalysts is systematically investigated by combined experimental measurements and density functional theory (DFT) calculations. Especially, it is revealed that the functional groups can play a role in suppressing the undesired hydrogen evolution reaction, the main competing reaction against CO2RR, thus enabling more catalytic active sites to be available for CO2RR and enhancing the CO2RR activity. Consistent with the DFT prediction, fluorine (F)-containing functional groups in the F-rich polytetrafluoroethylene binder lead to a high Faradaic efficiency (?94.7\%) of CO production. This study suggests a new strategy by optimizing polymeric binders for the selective CO2RR.},
language = {en},
number = {45},
urldate = {2018-11-11},
journal = {Advanced Functional Materials},
author = {Lee, Ji Hoon and Kattel, Shyam and Xie, Zhenhua and Tackett, Brian M. and Wang, Jiajun and Liu, Chang-Jun and Chen, Jingguang G.},
month = nov,
year = {2018},
keywords = {binders, density functional theories, electrochemical carbon dioxide reduction, functional groups, gold},
pages = {1804762},
file = {Full Text PDF:E\:\\zotero\\storage\\QC4NU7WY\\Lee ?? - 2018 - Understanding the Role of Functional Groups in Pol.pdf:application/pdf}
}

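The structure is very clear and looks roughly like a dictionary of dictionaries, but the record still contains a few odd characters: LaTeX escapes such as {\textcopyright} and \&, and stray question marks where a character apparently did not survive as clean text.

This cannot be avoided.

UTF-8 is a variable-length encoding built from 8-bit code units, and it can represent every Unicode character, not only those common in English. The literature, however, has authors from all over the world whose names carry characters from their own languages, and a file may arrive in some other, unpredictable encoding or with characters written as LaTeX escapes. It is therefore safer to read this kind of BibTeX file as UTF-8 and to convert whatever it contains to Unicode.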
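
To make the point about UTF-8 concrete, here is a minimal sketch that encodes a few strings and counts the bytes. The sample strings and the helper name utf8_byte_length are my own choices for illustration; they are not taken from the BibTeX file above.

```python
def utf8_byte_length(text):
    """Return how many bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Illustrative samples only: an ASCII name, a name with an umlaut,
# a Chinese surname and the copyright sign.
samples = ["Lee", "M\u00fcller", "\u674e", "\u00a9"]

for text in samples:
    encoded = text.encode("utf-8")
    # ASCII letters take one byte each; other characters take two to four.
    print(text, "->", len(text), "character(s),", len(encoded), "byte(s)")
    # Decoding returns exactly the original text, so nothing is lost.
    assert encoded.decode("utf-8") == text
```

The number of bytes per character varies, but every sample round-trips unchanged, which is why opening the file as UTF-8 is a reasonable default.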

# coding: utf-8
import bibtexparser as bp
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import convert_to_unicode

with open('exampleBib.bib') as bibfile:
    parser = BibTexParser()  # create the parser
    parser.customization = convert_to_unicode  # convert LaTeX-encoded characters to Unicode
    bibdata = bp.load(bibfile, parser = parser)  # load the file with bp.load()

# Print the author and the DOI of the first entry
print(bibdata.entries[0]['author'])
print(bibdata.entries[0]['doi'])

As you can see, even this bare-minimum version needs at least three names from the library:

  • bibtexparser.load(bibfile, parser)
  • BibTexParser()
  • convert_to_unicode(record)
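
For readers who want to see the shape of the parsed data without running the parser, here is a minimal sketch. The list below is a hand-written stand-in for bibdata.entries, with a shortened author field; the 'ID' and 'ENTRYTYPE' keys are my assumption about what the parser adds to each entry, the second entry is made up, and summarize is a hypothetical helper.

```python
# Hand-written stand-in for bibdata.entries: a list of dicts, one per entry.
# The 'ID' and 'ENTRYTYPE' keys are assumed to mirror what the parser adds.
entries = [
    {
        "ENTRYTYPE": "article",
        "ID": "lee_understanding_2018",
        "author": "Lee, Ji Hoon and Kattel, Shyam",  # shortened for the example
        "doi": "10.1002/adfm.201804762",
    },
    {
        "ENTRYTYPE": "book",
        "ID": "hypothetical_book_2000",  # made-up entry without a DOI
        "author": "Doe, Jane",
    },
]


def summarize(entry):
    """Return (author, doi) for one entry, with None for a missing field."""
    return entry.get("author"), entry.get("doi")


# Index the list by citation key to get the dictionary-of-dictionaries view.
by_id = {entry["ID"]: entry for entry in entries}

for entry in entries:
    print(summarize(entry))
```

Using .get() instead of square brackets means an entry without a DOI yields None rather than raising KeyError, which matters once the file holds more than one kind of record.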

Troubleshooting

Just as I was rubbing my hands in excitement…

…it failed.

Traceback (most recent call last):
  File "readBibFile.py", line 50, in <module>
    bibdata = bp.load(bibfile, parser = parser)
  File "/usr/local/lib/python2.7/dist-packages/bibtexparser/__init__.py", line 71, in load
    return parser.parse_file(bibtex_file)
  File "/usr/local/lib/python2.7/dist-packages/bibtexparser/bparser.py", line 165, in parse_file
    return self.parse(file.read(), partial=partial)
  ...
bibtexparser.bibdatabase.UndefinedString: u'nov'

The last line tells me that the month field is the culprit.

month = nov,

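That is odd. Every other field is fine, so why is month the only one without {}?

Without braces, BibTeX treats nov as the name of a string macro rather than as literal text, and the parser stops because no such macro has been defined for it. A bare number needs no definition, which is why a numeric month parses without trouble. Standard BibTeX styles predefine the three-letter month macros, so Zotero's export is arguably legal and the gap may lie in the parser's defaults rather than in Zotero. I will still try other reference managers later to see what they export.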
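
Instead of editing the file by hand, the bare month macros can be wrapped in braces before the text reaches the parser. The following is a minimal sketch using only the standard library; brace_bare_months is a hypothetical helper of mine, not part of python-bibtexparser, and it assumes each field sits on its own line, as in the Zotero export above. The library's BibTexParser may also accept a common_strings=True argument that predefines the month macros; I have not checked that against the installed version, so treat it as an assumption.

```python
import re

# Three-letter month macros that standard BibTeX styles predefine.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

# Matches a line such as "month = nov," whose value is a bare macro name.
BARE_MONTH = re.compile(
    r"^([ \t]*month[ \t]*=[ \t]*)(%s)([ \t]*,?[ \t]*)$" % "|".join(MONTHS),
    re.IGNORECASE | re.MULTILINE,
)


def brace_bare_months(bibtex_text):
    """Wrap bare month macros in braces so they are read as plain strings."""
    return BARE_MONTH.sub(r"\1{\2}\3", bibtex_text)


print(brace_bare_months("month = nov,\nyear = {2018},"))
```

Lines that are already braced or numeric do not match the pattern and pass through unchanged. The cleaned text could then be handed to the parser as a string instead of a file object.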

Results

After fixing the month field by hand, the script at least runs. python-bibtexparser is really handy!

Lee, Ji Hoon and Kattel, Shyam and Xie, Zhenhua and Tackett, Brian M. and Wang, Jiajun and Liu, Chang-Jun and Chen, Jingguang G.
10.1002/adfm.201804762

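Going further: tidier output

The author(s) of python-bibtexparser anticipated that people would want tidier output and provide a few customization hooks. Separating the individual author names, for example, is handled in the library's customization module, which has to be imported first.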

# coding: utf-8
import bibtexparser as bp
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import *  # also imports the library's type(), which shadows the builtin


def customizations(record):
    """Use some functions delivered by the library

    :param record: a record
    :returns: -- customized record
    """
    record = type(record)  # the library's type(), not the builtin
    record = author(record)
    record = editor(record)
    record = journal(record)
    record = keyword(record)
    record = link(record)
    record = page_double_hyphen(record)
    record = doi(record)
    # record = month(record)
    return record


with open('exampleBib.bib') as bibfile:
    parser = BibTexParser()
    parser.customization = customizations  # unlike before, use our own customizations function
    bibdata = bp.load(bibfile, parser = parser)

print(bibdata.entries[0]['author'])
print(bibdata.entries[0]['doi'])

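The functions that customizations calls are actually defined in another file of the library, the customization module…

Among them is the handler for the author field. We can also write handlers of our own for other fields. I wonder whether the month compatibility problem could be solved this way. Here is a short excerpt first: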

def author(record):
    """
    Split author field into a list of "Name, Surname".
    :param record: the record.
    :type record: dict
    :returns: dict -- the modified record.
    """
    if "author" in record:
        if record["author"]:
            record["author"] = getnames([i.strip() for i in record["author"].replace('\n', ' ').split(" and ")])  # getnames is another fairly involved function that tidies up each name
        else:
            del record["author"]
    return record
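
A handler of our own for the month field could follow the same pattern: take a record, change one field, return the record. The sketch below is hypothetical; month_to_number is my own name and is not part of the library. It only normalizes records that have already been parsed. Whether it can also avoid the UndefinedString error depends on whether the parser resolves macros before it calls the customization, which I have not checked here.

```python
MONTH_NAMES = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
# Map "jan" -> "1", ..., "dec" -> "12".
MONTH_NUMBERS = {name: str(i) for i, name in enumerate(MONTH_NAMES, start=1)}


def month_to_number(record):
    """Replace a month name such as 'nov' or 'November' with its number.

    Hypothetical handler in the style of the library's author(record);
    values that are not recognized month names are left unchanged.
    """
    value = record.get("month")
    if value:
        key = value.strip().lower()[:3]  # 'November' -> 'nov'
        if key in MONTH_NUMBERS:
            record["month"] = MONTH_NUMBERS[key]
    return record


print(month_to_number({"month": "nov", "year": "2018"}))
```

It could be called from a customizations function in the same way as the other handlers, for example as record = month_to_number(record).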

The final output now differs in the author field:

[u'Lee, Ji Hoon', u'Kattel, Shyam', u'Xie, Zhenhua', u'Tackett, Brian M.', u'Wang, Jiajun', u'Liu, Chang-Jun', u'Chen, Jingguang G.']
10.1002/adfm.201804762

The original output did not separate the names (this is the default output, without the customization):

Lee, Ji Hoon and Kattel, Shyam and Xie, Zhenhua and Tackett, Brian M. and Wang, Jiajun and Liu, Chang-Jun and Chen, Jingguang G.
10.1002/adfm.201804762
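
Once the names arrive as a list, each one can be split further into surname and given name. This is a minimal sketch under the assumption that every name has the form "Surname, Given", as in the output above; split_name is a hypothetical helper of mine, and the list is shortened for the example.

```python
def split_name(name):
    """Split 'Surname, Given' into a (surname, given) tuple.

    A name without a comma is returned as the surname with an empty given name.
    """
    surname, _, given = name.partition(",")
    return surname.strip(), given.strip()


# A few names copied from the output above.
authors = ["Lee, Ji Hoon", "Kattel, Shyam", "Chen, Jingguang G."]

for surname, given in map(split_name, authors):
    print(surname, "|", given)
```

Splitting on the first comma only keeps multi-word given names such as "Ji Hoon" intact.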

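That's it for today.

The acute gastroenteritis I came down with the day before yesterday still hasn't cleared up, and today I lay awake until 5 a.m. Cold, hungry and sleepy…

Burp…

My skills are limited and I am happy to show off in front of the experts, so corrections by email are welcome.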

Email: mozheyang@outlook.com