magical_spider: a remote data-collection solution

Published: 2025-07-19

magical_spider (the "magical spider") is a project for data-collection tasks, with a simple, clearly organized source layout.

[Screenshot: example of the index page]

---

Project repository: https://github.com/lixi5338619/magical_spider


Usage guide

1. Configure settings.py and start the Flask service.

2. Test using the code in the demo folder as a reference; the flow is run mainly through runflow.py.

```python
import requests

host = 'http://127.0.0.1:5000'


def magical_start(project_name, base_url='https://www./link/b6f05a7baab2fe0eea07e59bd5b0b317'):
    # 1. Create a browser and get its session_id
    result = requests.post(f'{host}/create', data={'name': project_name, 'url': base_url}).json()
    session_id, process_url = result['session_id'], result['process_url']
    return session_id, process_url


def magical_request(session_id, process_url, request_url, request_type='get', formdata=None):
    # 2. Send an XHR request through the browser
    data = {'session_id': session_id, 'process_url': process_url,
            'request_url': request_url, 'request_type': request_type}
    # request_type and formdata are accepted here because the POST example passes them.
    # Assumption: the /xhr endpoint reads the POST body from a 'formdata' field;
    # check runflow.py in the repository for the exact field name.
    if formdata is not None:
        data['formdata'] = formdata
    result = requests.post(f'{host}/xhr', data=data).json()
    return result['result']


def magical_close(session_id, process_url, process_name):
    # 4. Close the browser
    close_data = {'session_id': session_id, 'process_url': process_url, 'process_name': process_name}
    requests.post(f'{host}/close', data=close_data).json()
```

3. Test code

GET request

```python
from demo.runflow import magical_start, magical_request, magical_close

project_name = 'cnipa'
base_url = 'https://www.cnipa.gov.cn'

session_id, process_url = magical_start(project_name, base_url)
# Print the length of the page content fetched through the remote browser
print(len(magical_request(session_id, process_url, 'https://www.cnipa.gov.cn/col/col57/index.html')))
magical_close(session_id, process_url, project_name)
```

POST request

```python
from demo.runflow import magical_start, magical_request, magical_close
import json

project_name = 'chinadrugtrials'
base_url = 'http://www.chinadrugtrials.org.cn'

session_id, process_url = magical_start(project_name, base_url)

# Form fields for the search request (page 2, sorted descending)
data = {"id": "", "ckm_index": "", "sort": "desc", "sort2": "", "rule": "CTR",
        "secondLevel": "0", "currentpage": "2", "keywords": "", "reg_no": "",
        "indication": "", "case_no": "", "drugs_name": "", "drugs_type": "",
        "appliers": "", "communities": "", "researchers": "", "agencies": "",
        "state": ""}
formdata = json.dumps(data)

print(magical_request(session_id=session_id,
                      process_url=process_url,
                      request_url='https://www./link/44f95c37a24495521a98b63f0bbf4268',
                      request_type='post',
                      formdata=formdata))
magical_close(session_id, process_url, project_name)
```

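4. The index page lets you view and manage the tasks that are currently running, and also shows the system's memory and disk usage.

5. The demo folder contains runflow.py, which summarizes the task flow, along with Douyin and drug-administration (药监局) case studies that include both single-task and multi-task examples.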
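The single-task and multi-task flows mentioned in step 5 could be sketched roughly as follows. This is a minimal sketch, not the code from the repository's demo folder: `magical_session`, `fetch_one`, and `run_tasks` are hypothetical helper names, and the start, request, and close functions are passed in as arguments (for example `magical_start`, `magical_request`, and `magical_close` from demo/runflow.py). It also assumes the Flask service can handle several browser sessions at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def magical_session(start, close, project_name, base_url):
    """Open a browser session and always close it, even if a request fails.

    `start` and `close` are assumed to behave like magical_start and
    magical_close from demo/runflow.py.
    """
    session_id, process_url = start(project_name, base_url)
    try:
        yield session_id, process_url
    finally:
        close(session_id, process_url, project_name)


def fetch_one(start, request, close, project_name, base_url, request_url):
    """Single task: open a session, fetch one URL, close the session."""
    with magical_session(start, close, project_name, base_url) as (sid, purl):
        return request(sid, purl, request_url)


def run_tasks(start, request, close, tasks, max_workers=2):
    """Multi-task: run (project_name, base_url, request_url) tasks in a
    thread pool and return a dict mapping project_name to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(fetch_one, start, request, close, name, base, url)
            for name, base, url in tasks
        }
        # result() re-raises any exception from the worker thread
        return {name: future.result() for name, future in futures.items()}
```

With the real functions, a call might look like `run_tasks(magical_start, magical_request, magical_close, [('cnipa', 'https://www.cnipa.gov.cn', 'https://www.cnipa.gov.cn/col/col57/index.html')])`. Wrapping the session in a context manager means the remote browser is closed even when a request raises, so failed tasks do not linger on the index page.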


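Linux deployment

1. Install Chrome (choose the install location): yum install https://www./link/6b17d006a2ed6f12f07c7ea60b8002b5

2. Check the Chrome version: google-chrome --version

3. Install the matching version of chromedriver_linux64. For example, my Chrome version is 104.0.5112.79: wget https://www./link/5bb5f8d030f5e6e4b6d137c6990f0772

4. Unzip chromedriver: unzip chromedriver_linux64

5. Grant permissions on chromedriver: chmod 777 chromedriver

6. Update the chromedriver path in the project's settings.py.

7. Install the Python dependencies and start the Flask project.

Python dependencies: flask, sqlite3 (bundled with Python's standard library), selenium, websockets, opencv-python, numpy

Start Flask with: python3 server.py

8. Open the server port for access.

9. Run the project and test it.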
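Steps 2 and 3 rely on Chrome and chromedriver having matching versions. A small check along these lines could catch a mismatch before the service is started. It is a sketch under assumptions: `parse_version`, `same_major`, and `command_output` are hypothetical helpers, the version output is assumed to look like `Google Chrome 104.0.5112.79` and `ChromeDriver 104.0.5112.79 (...)`, and comparing only the major number is a simplification.

```python
import re
import subprocess


def parse_version(text):
    """Return the first dotted version found in `text` as a tuple of ints,
    or None when no version is present."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def same_major(chrome_output, driver_output):
    """True when both outputs contain a version with the same major number."""
    chrome = parse_version(chrome_output)
    driver = parse_version(driver_output)
    return chrome is not None and driver is not None and chrome[0] == driver[0]


def command_output(*cmd):
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

On the server, the two outputs could be obtained with `command_output("google-chrome", "--version")` and `command_output("./chromedriver", "--version")` and then passed to `same_major`; the `./chromedriver` path here is an assumption and should match the path configured in settings.py.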
