A 13-Line MATLAB Web Crawler for Downloading Star Images from the NASA Image Gallery - 白红宇


Published: 2019-04-30


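Uploaded 2021/04/18

Updated 2021/04/21: changed how N is entered, added support for downloading PNG images, added code that handles several error cases automatically, and made it possible to save the download progress and error messages to a log.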

Source code

N = input('Input the number you want to download:');
URL = 'https://www.nasa.gov/api/2/ubernode/_search';
mainURL = 'https://www.nasa.gov/sites/default/files/';
opt = weboptions('Timeout',10);
for i=1:N
    data = webread(URL,'size',num2str(N),'from','0','sort','promo-date-time:desc','q','((ubernode-type:image) AND (routes:1446))','_source_include','promo-date-time,master-image,nid,title,topics,missions,collections,other-tags,ubernode-type,primary-tag,secondary-tag,cardfeed-title,type,collection-asset-link,link-or-attachment,pr-leader-sentence,image-feature-caption,attachments,uri',opt);
    imgURL = append(mainURL,data.hits.hits(i).x_source.master_image.uri(10:end));
    img = webread(imgURL,opt);
    filename = append('Img_',num2str(i),'_',data.hits.hits(i).x_source.master_image.title,'.jpg');
    imwrite(img,filename);
    disp(append('FINISHED:',num2str(i),'/',num2str(N)));
end
disp('Completed!');

Usage

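Add the folder containing the .m script to the MATLAB path and run the script. The command line prompts: Input the number you want to download:. Enter the number of images you want; the crawler then starts automatically and shows its progress, printing Completed! when it has finished. The images are saved in MATLAB's current folder (the script's folder, if you run it from there).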

Explanation

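This crawler only works for NASA's Image of the Day gallery, but the same approach can be applied to many other websites once you have obtained the image links.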

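On https://www.nasa.gov/multimedia/imagegallery/iotd.html, the Network tab of the browser developer tools (F12) reveals the API endpoint URL that the page uses to fetch image information. The request takes several parameters; size is the number of images returned per request, so changing size changes how many images you get.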
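As a rough illustration of how these parameters end up in the request, the sketch below builds the same kind of query URL with matlab.net.URI instead of letting webread() assemble it. Only size and from are shown, and the value 5 is an arbitrary example; the remaining parameters from the script would be added in the same way.

```matlab
% Sketch: how name-value pairs become the query string of the request.
apiURL = 'https://www.nasa.gov/api/2/ubernode/_search';
nImages = 5;   % arbitrary example value for the "size" parameter

% matlab.net.URI accepts query name-value pairs after the base address.
requestURI = matlab.net.URI(apiURL, 'size', nImages, 'from', 0);

% Convert the URI object to text to see the full address that would be requested.
requestText = char(requestURI);
disp(requestText);
```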

The response from URL contains part of the image link we need, namely the uri field.
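The field names used in the script (x_source, master_image) differ slightly from the keys in the raw JSON, because MATLAB rewrites JSON keys that are not valid identifiers. The sketch below decodes a made-up, heavily trimmed response with jsondecode() to show that mapping. The sample JSON and its values are assumptions for illustration, not a real NASA response.

```matlab
% Hypothetical, trimmed-down response containing a single hit.
sampleJSON = ['{"hits":{"hits":[{"_source":{"master-image":', ...
    '{"uri":"public://thumbnails/image/example.jpg","title":"Example"}}}]}}'];

% Keys that are not valid MATLAB identifiers are renamed on decoding:
% "_source" becomes x_source and "master-image" becomes master_image.
data = jsondecode(sampleJSON);

% Same access path as in the script, here for the first (and only) hit.
entry = data.hits.hits(1).x_source.master_image;
disp(entry.uri);
disp(entry.title);
```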

Concatenating mainURL with uri(10:end) gives the link to each individual image, which can then be read with webread().
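The index 10:end skips the first nine characters of uri. Assuming the field starts with a nine-character scheme prefix such as public:// (an assumption here, since the original does not show a raw uri value), the composition works as sketched below with a made-up path.

```matlab
mainURL = 'https://www.nasa.gov/sites/default/files/';
uri = 'public://thumbnails/image/example.jpg';   % hypothetical uri value

% Positional approach from the script: drop the first nine characters.
relPath = uri(10:end);
imgURL = [mainURL, relPath];
disp(imgURL);

% Alternative: strip any leading "scheme://" prefix by pattern, not position.
relPathByPattern = regexprep(uri, '^\w+://', '');
imgURLByPattern = [mainURL, relPathByPattern];
```

Stripping the prefix by pattern would keep working if the prefix length ever changed, at the cost of one regular expression.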

weboptions() sets the request options, here the response timeout (Timeout); webread() sends an HTTP GET request by default.
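If you prefer to state the request method explicitly rather than rely on the default, weboptions() also accepts a RequestMethod option. A minimal sketch, with the 10-second timeout taken from the script:

```matlab
% Explicit GET request with a 10-second timeout.
opt = weboptions('Timeout', 10, 'RequestMethod', 'get');

% The options object exposes the values as properties.
disp(opt.Timeout);
```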

append() concatenates strings, and imwrite() writes the image to a file with the given name.
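Because the file name is built from the image title, a title containing characters such as : or / can make imwrite() fail on some systems. The sketch below, which is not part of the original script, replaces such characters and takes the extension from uri instead of hard-coding .jpg. The title, uri, and index values are made up.

```matlab
imgTitle = 'Hubble: Spiral Galaxy/NGC 1234?';     % hypothetical title
uri = 'public://thumbnails/image/example.png';   % hypothetical uri
idx = 3;                                          % hypothetical image number

% Replace characters that are not allowed in Windows file names.
safeTitle = regexprep(imgTitle, '[\\/:*?"<>|]', '_');

% Take the file extension (including the dot) from the uri.
[~, ~, ext] = fileparts(uri);

filename = ['Img_', num2str(idx), '_', safeTitle, ext];
disp(filename);
```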

via nasa.gov


Source code, 2021/04/21:

disp('Input the number you want to download:[N1-N2]');
N1 = input('N1:');
N2 = input('N2:');
nStart = min(N1,N2);
nEnd = max(N1,N2);
nTotal = nEnd-nStart+1;
disp(append('From ',num2str(nStart),' to ',num2str(nEnd),'. There are ',num2str(nTotal),' pictures.'));
URL = 'https://www.nasa.gov/api/2/ubernode/_search';
mainURL = 'https://www.nasa.gov/sites/default/files/';
opt = weboptions('Timeout',10);
% Output folder. Named saveDir because "path" would shadow MATLAB's built-in path function.
saveDir = 'F:\PictureDownload\PictureDownload';
for i=nStart:nEnd
    % Request the i newest entries so that hits(i) exists.
    try
        data = webread(URL,'size',num2str(i),'from','0','sort','promo-date-time:desc','q','((ubernode-type:image) AND (routes:1446))','_source_include','promo-date-time,master-image,nid,title,topics,missions,collections,other-tags,ubernode-type,primary-tag,secondary-tag,cardfeed-title,type,collection-asset-link,link-or-attachment,pr-leader-sentence,image-feature-caption,attachments,uri',opt);
    catch
        disp('[ERROR]Failed to connect to the website. Check your network connection.');
        break
    end
    entry = data.hits.hits(i).x_source.master_image;
    imgURL = append(mainURL,entry.uri(10:end));
    try
        img = webread(imgURL,opt);
    catch
        % No "i = i+1" here: changing the loop variable has no effect in a MATLAB for loop.
        disp(append('[WARN]Failed to download picture No.',num2str(i),'. It has been skipped.'));
        disp(append('[LINK]',imgURL));
        continue
    end
    % fullfile inserts the file separator between the folder and the file name.
    baseName = fullfile(saveDir,append('Img_',num2str(i),'_',entry.title));
    try
        imwrite(img,append(baseName,'.jpg'));
        disp(append('[',num2str(i),']FINISHED:',num2str(i-nStart+1),'/',num2str(nTotal)));
    catch
        % Writing as JPEG failed; retry as PNG. Handling this per picture
        % means one failure no longer affects the messages for later pictures.
        try
            imwrite(img,append(baseName,'.png'));
            disp(append('[WARN]Picture No.',num2str(i),' is in PNG format; it has been downloaded successfully.'));
        catch
            disp(append('[WARN]Failed to write the image file. Picture No.',num2str(i),' has been skipped.'));
            disp(append('[LINK]:',imgURL));
        end
    end
end
disp('Completed!');
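The update note mentions saving progress and error messages to a log, while the listing above only prints them with disp(). One simple way to capture that output, sketched here as an assumption rather than the author's method, is MATLAB's diary, which copies Command Window output to a text file. The log file name and location are hypothetical.

```matlab
% Hypothetical log file in the system temporary folder.
logFile = fullfile(tempdir, 'nasa_download_log.txt');

% diary appends to an existing file, so start from a clean one.
if exist(logFile, 'file')
    delete(logFile);
end

diary(logFile);             % start copying Command Window output to the file
disp('[1]FINISHED:1/1');    % stand-ins for the messages the crawler prints
disp('Completed!');
diary off                   % stop logging and flush the file

% Read the log back to confirm what was recorded.
logText = fileread(logFile);
```

In the crawler itself, the diary(logFile) call would go before the loop and diary off after the final disp('Completed!').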

Reposted from: http://fjugf.baihongyu.com/
