
Batch-Downloading Nature Materials with Python Regular Expressions

Preface

This article discusses, purely from a technical standpoint, the possibility of batch downloading with Python; it is in no way an encouragement to bulk-download the literature. Anyone who uses this method to batch-download papers does so at their own risk. For the consequences of excessive downloading, see: Zhihu - How should we view the HIT international student's unauthorized bulk-downloading incident?

Python Source Code


Environment

  • Windows 10, 64-bit
  • Python 3.6

Other Notes

  • Tencent PC Manager was used to move the Desktop folder from the C drive to the F drive, so the "Desktop" directory here is "F:/personal/desktop/".
  • A folder named pdf was created on the desktop beforehand, at "F:/personal/desktop/pdf/".
  • For these reasons the program is for reference only and will not work in every environment; please adapt it as needed.
import urllib.request as rq
import sys
import re

def getHtml(url):
    with rq.urlopen(url) as page:
        html = page.read().decode('utf-8')
    print("HTML content retrieved.")
    return html

def getPdf(html):
    reg = r'nmat[0-9]{4}\.pdf'  # the dot is escaped so it matches a literal '.'
    pdfre = re.compile(reg)
    list_pdf = re.findall(pdfre, html)
    print("PDF name list retrieved.")
    return list_pdf

def urllist(url, list_pdf):
    list_html = []  # build a new list instead of mutating list_pdf in place
    for name in list_pdf:
        list_html.append('http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + name)
    print("URL list assembled.")
    return list_html

def downloadPdf(list_html):
    num = 1
    url = list_html[0]  # any entry works: the 'vXX/nX/' slice sits at the same offset

    path_toc = "F:\\personal\\desktop\\pdf\\toc.pdf"
    print(path_toc)
    link_toc = 'http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + 'toc.pdf'
    rq.urlretrieve(link_toc, path_toc)
    print(link_toc + " completed!")

    path_masthead = "F:\\personal\\desktop\\pdf\\masthead.pdf"
    print(path_masthead)
    link_masthead = 'http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + 'masthead.pdf'
    rq.urlretrieve(link_masthead, path_masthead)  # was downloading toc.pdf a second time
    print(link_masthead + " completed!")

    for pdf_link in list_html:
        path = "F:\\personal\\desktop\\pdf\\" + str(num) + ".pdf"
        print(path)
        rq.urlretrieve(pdf_link, path)
        print(pdf_link + " completed!")
        num = num + 1

def nmatmain():
    args = sys.argv
    if len(args) == 1:
        print("Format: python NmatDownload.py [url]")
        print("Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html")
        return
    elif len(args) == 2:
        url = args[1]
    else:
        print("Too many arguments!")
        return

    # main
    html = getHtml(url)
    list_pdf = getPdf(html)

    list_html = urllist(url, list_pdf)
    downloadPdf(list_html)

if __name__ == '__main__':
    nmatmain()

Code Walkthrough


Define (def) a function that takes a url and returns the corresponding HTML text:

import urllib.request as rq

def getHtml(url):
    with rq.urlopen(url) as page:  # Python 3 style: the with-statement closes the connection
        html = page.read().decode('utf-8')
    print("HTML content retrieved.")
    return html
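To try the fetch-and-decode step without hitting the Nature site, you can feed urlopen a data: URL (supported since Python 3.4); this is only a sketch of the same pattern, with the data: URL standing in for a real page:

```python
import urllib.request as rq

def getHtml(url):
    # open the URL, read the raw bytes, and decode them as UTF-8 text
    with rq.urlopen(url) as page:
        html = page.read().decode('utf-8')
    return html

# a data: URL stands in for a real page so the sketch runs offline
sample = getHtml('data:text/html;charset=utf-8,%3Chtml%3Ehello%3C/html%3E')
print(sample)  # <html>hello</html>
```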

The key step, written in the Python 2 style without the context manager:

1
2
3
page = urllib.request.urlopen(url)
htmlcode = page.read()           # get the raw HTML bytes
html = htmlcode.decode('utf-8')  # decode the bytes as UTF-8 text

Take the HTML text, extract the target strings with a regular expression, and return them as a list:

import re

def getPdf(html):
    reg = r'nmat[0-9]{4}\.pdf'  # regular expression; the dot is escaped to match a literal '.'
    pdfre = re.compile(reg)
    list_pdf = re.findall(pdfre, html)
    print("PDF name list retrieved.")
    return list_pdf
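A quick check of the pattern on a made-up HTML fragment (the article numbers here are invented for illustration), which also shows why escaping the dot matters:

```python
import re

html = '<a href="/nmat/journal/v16/n4/pdf/nmat4849.pdf">PDF</a>' \
       '<a href="/nmat/journal/v16/n4/pdf/nmat4863.pdf">PDF</a>'

# with the dot escaped, only literal '.pdf' endings match
list_pdf = re.findall(r'nmat[0-9]{4}\.pdf', html)
print(list_pdf)  # ['nmat4849.pdf', 'nmat4863.pdf']

# an unescaped dot matches any character, so r'nmat[0-9]{4}.pdf'
# would also accept strings like 'nmat4849xpdf'
assert re.findall(r'nmat[0-9]{4}.pdf', 'nmat4849xpdf') == ['nmat4849xpdf']
```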

Do some string arithmetic on that list to build the PDF download addresses, and collect them into a new list:

def urllist(url, list_pdf):
    list_html = []  # build a new list instead of mutating list_pdf in place
    for name in list_pdf:
        list_html.append('http://www.nature.com/nmat/journal/' + url[35:42] + 'pdf/' + name)
    print("URL list assembled.")
    return list_html
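The slice url[35:42] picks the volume/issue segment out of the fixed-length URL prefix. For the example URL from the usage message it works out as below; note this breaks for two-digit issue numbers or if the URL format ever changes (the article filename here is invented):

```python
url = 'http://www.nature.com/nmat/journal/v16/n4/index.html'
segment = url[35:42]
print(segment)  # v16/n4/

# assembling one download link from the segment and an invented filename
link = 'http://www.nature.com/nmat/journal/' + segment + 'pdf/' + 'nmat4849.pdf'
print(link)
```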

The key download step: iterate over the list, fetching each file with urllib.request.urlretrieve(link, download_path):

def downloadPdf(list_html):
    rq.urlretrieve(pdf_link, path)
    # pdf_link is the PDF's download link; path is the local save location
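urlretrieve streams the response straight to a file on disk. A self-contained sketch of the same call, using a data: URL and a temp directory so it runs offline (the "PDF" content here is just placeholder bytes):

```python
import os
import tempfile
import urllib.request as rq

link = 'data:application/pdf,dummy%20pdf%20bytes'  # stand-in for a real PDF link
path = os.path.join(tempfile.gettempdir(), 'sample.pdf')

# urlretrieve downloads the resource at `link` and saves it to `path`
rq.urlretrieve(link, path)

with open(path, 'rb') as f:
    content = f.read()
print(content)  # b'dummy pdf bytes'
```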

This implements passing the URL as a command-line argument:

def nmatmain():
    args = sys.argv
    if len(args) == 1:
        print("Format: python NmatDownload.py [url]")
        print("Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html")
        return
    elif len(args) == 2:
        url = args[1]
    else:
        print("Too many arguments!")
        return

    # main
    html = getHtml(url)
    list_pdf = getPdf(html)

    list_html = urllist(url, list_pdf)
    downloadPdf(list_html)

if __name__ == '__main__':
    nmatmain()
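The argument check can also be factored into a helper that is easy to test in isolation; this is a sketch, not part of the original script, and parse_url is an invented name:

```python
def parse_url(args):
    """Return the URL from an argv-style list, or None if usage is wrong."""
    if len(args) == 2:
        return args[1]
    if len(args) == 1:
        print("Format: python NmatDownload.py [url]")
    else:
        print("Too many arguments!")
    return None

url = parse_url(['NmatDownload.py',
                 'http://www.nature.com/nmat/journal/v16/n4/index.html'])
print(url)
```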

For example:

F:\nmat>python NmatDownload.py
Format: python NmatDownload.py [url]
Example: python NmatDownload.py http://www.nature.com/nmat/journal/v16/n4/index.html

My knowledge is limited; if there are errors or omissions in this article, I would be grateful if readers pointed them out.
Email: mozheyang@outlook.com

Appendix

Hexo code highlighting - Codeblock