python - How to start a Scrapy spider from another one


I have 2 spiders in 1 Scrapy project. Spider1 crawls a list of pages or an entire website and analyzes the content. Spider2 uses Splash to fetch URLs on Google and passes that list to Spider1.

So, Spider1 crawls and analyzes content, and can be used without being called by Spider2:

# coding: utf8
import scrapy


class Spider1(scrapy.Spider):
    name = "spider1"
    tokens = []
    query = ''

    def __init__(self, *args, **kwargs):
        '''
        The spider works in 2 modes:
        given a single URL it crawls the entire website,
        given a list of URLs it analyzes each page.
        '''
        super(Spider1, self).__init__(*args, **kwargs)
        start_url = kwargs.get('start_url') or ''
        start_urls = kwargs.get('start_urls') or []
        query = kwargs.get('q') or ''
        if query != '':
            self.query = query
        if start_url != '':
            self.start_urls = [start_url]
        if len(start_urls) > 0:
            self.start_urls = start_urls

    def parse(self, response):
        '''
        Analyze and store the data.
        '''
        if len(self.start_urls) == 1:
            for next_page in response.css('a::attr("href")'):
                yield response.follow(next_page, self.parse)

    def closed(self, reason):
        '''
        Finalize the crawl.
        '''
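For example, assuming the usual project layout, it can be launched on its own with scrapy's -a flag, which passes keyword arguments to the spider's __init__ (the URL and query below are placeholders):

scrapy crawl spider1 -a start_url=https://example.com -a q=keyword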

The code for Spider2:

# coding: utf8
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Spider2(scrapy.Spider):
    name = "spider2"
    urls = []
    page = 0

    def __init__(self, *args, **kwargs):
        super(Spider2, self).__init__(*args, **kwargs)
        self.query = kwargs.get('q')
        self.url = kwargs.get('url')
        self.start_urls = ['https://www.google.com/search?q=' + self.query]

    def start_requests(self):
        for url in self.start_urls:
            splash_args = {
                'wait': 1,
            }
            yield SplashRequest(url, self.parse, args=splash_args)

    def parse(self, response):
        '''
        Extract URLs into self.urls.
        '''
        self.page += 1

    def closed(self, reason):
        process = CrawlerProcess(get_project_settings())
        for url in self.urls:
            print(url)
        if len(self.urls) > 0:
            process.crawl('spider1', start_urls=self.urls, q=self.query)
            process.start(False)

When running Spider2 I get the error twisted.internet.error.ReactorAlreadyRunning, and Spider1 is called without its list of URLs. I tried using CrawlerRunner as advised in the Scrapy documentation, but it's the same problem. I tried using CrawlerProcess inside the parse method; it "works", but I still get the error message. When using CrawlerRunner inside the parse method, it doesn't work.

Currently it is not possible to start a spider from another spider if you're using the scrapy crawl command (see https://github.com/scrapy/scrapy/issues/1226). It is possible to start a spider from a spider if you write the startup script yourself - the trick is to use the same CrawlerProcess/CrawlerRunner instance.
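A minimal sketch of such a startup script, run with plain python rather than scrapy crawl; the spider names, the query and the URL hand-off between the two crawls are assumptions, while the CrawlerRunner/reactor pattern follows the Scrapy docs on running multiple spiders in the same process:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # Run spider2 first; it must leave the collected URLs somewhere
    # this script can read back (a file, a class attribute, ...).
    yield runner.crawl('spider2', q='some query')
    # Then start spider1 through the very same runner instance.
    yield runner.crawl('spider1', start_urls=['https://example.com'], q='some query')
    reactor.stop()

crawl()
reactor.run()  # blocks here until both crawls have finished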

I'd not do that though: you're fighting against the framework. It'd be nice to support this use case, but it is not really supported now.

An easier way is to either rewrite the code to use a single spider class, or to create a script (bash, Makefile, luigi/airflow if you want to be fancy) that runs scrapy crawl spider1 -o items.jl followed by scrapy crawl spider2; the second spider can then read the items created by the first spider and generate its start_requests accordingly (see the sketch below).
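A minimal sketch of that hand-off, assuming spider1 exported items with a url field into items.jl (the field name is a placeholder; the file name matches the command above):

import json
import scrapy


class Spider2(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        # Each line of a .jl (JSON Lines) export is one item from spider1.
        with open('items.jl') as f:
            for line in f:
                item = json.loads(line)
                yield scrapy.Request(item['url'], callback=self.parse)

    def parse(self, response):
        # Analyze the page found by the first spider.
        pass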

FTR: combining SplashRequests and regular scrapy.Requests in a single spider is supported (it should work), so you don't have to create separate spiders for them.
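For example, a minimal sketch of one spider mixing the two request types; the query and selectors are placeholders, and it assumes scrapy-splash is enabled in settings.py:

import scrapy
from scrapy_splash import SplashRequest


class CombinedSpider(scrapy.Spider):
    name = "combined"

    def start_requests(self):
        # Render the JavaScript-heavy search page through Splash...
        yield SplashRequest('https://www.google.com/search?q=foo',
                            self.parse_search, args={'wait': 1})

    def parse_search(self, response):
        # ...then follow the result links with plain Scrapy requests.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        # Analyze the regular (non-Splash) response here.
        pass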

