python - How to start a Scrapy spider from another one


I have 2 spiders in 1 Scrapy project. Spider1 crawls a list of pages or an entire website and analyzes the content. Spider2 uses Splash to fetch URLs on Google and passes that list to Spider1.

So, Spider1 crawls and analyzes content and can be used without being called by Spider2:

# coding: utf8
from scrapy.spiders import CrawlSpider
import scrapy


class Spider1(scrapy.Spider):
    name = "spider1"
    tokens = []
    query = ''

    def __init__(self, *args, **kwargs):
        '''
        The spider works in 2 modes:
        given a single URL it crawls the entire website,
        given a list of URLs it analyzes each page.
        '''
        super(Spider1, self).__init__(*args, **kwargs)
        start_url = kwargs.get('start_url') or ''
        start_urls = kwargs.get('start_urls') or []
        query = kwargs.get('q') or ''
        if query != '':
            self.query = query
        if start_url != '':
            self.start_urls = [start_url]
        if len(start_urls) > 0:
            self.start_urls = start_urls

    def parse(self, response):
        '''
        Analyze and store data.
        '''
        if len(self.start_urls) == 1:
            for next_page in response.css('a::attr("href")'):
                yield response.follow(next_page, self.parse)

    def closed(self, reason):
        '''
        Finalize the crawl.
        '''

Here is the code for Spider2:

# coding: utf8
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class Spider2(scrapy.Spider):
    name = "spider2"
    urls = []
    page = 0

    def __init__(self, *args, **kwargs):
        super(Spider2, self).__init__(*args, **kwargs)
        self.query = kwargs.get('q')
        self.url = kwargs.get('url')
        self.start_urls = ['https://www.google.com/search?q=' + self.query]

    def start_requests(self):
        for url in self.start_urls:
            splash_args = {
                'wait': 1,
            }
            yield SplashRequest(url, self.parse, args=splash_args)

    def parse(self, response):
        '''
        Extract URLs into self.urls.
        '''
        self.page += 1

    def closed(self, reason):
        process = CrawlerProcess(get_project_settings())
        for url in self.urls:
            print(url)
        if len(self.urls) > 0:
            # starting a second crawl here fails, because Spider2's
            # reactor is still running
            process.crawl('spider1', start_urls=self.urls, q=self.query)
            process.start(stop_after_crawl=False)

When running Spider2 I get the error twisted.internet.error.ReactorAlreadyRunning, and Spider1 is called without its list of URLs. I tried using CrawlerRunner as advised by the Scrapy documentation, but it's the same problem. I also tried using CrawlerProcess inside the parse method; it "works", but I still get the error message. When using CrawlerRunner inside the parse method, it doesn't work either.

Currently it is not possible to start a spider from another spider if you're using the scrapy crawl command (see https://github.com/scrapy/scrapy/issues/1226). It is possible to start a spider from a spider if you write the startup script yourself - the trick is to use the same CrawlerProcess/CrawlerRunner instance.
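A minimal sketch of such a script, based on the "running multiple spiders in the same process" pattern from the Scrapy docs; the spider names match your code, but the urls.txt hand-off between the two crawls is an assumption (Spider2 would have to write its collected URLs there, e.g. in its closed() method):

# run.py - run with `python run.py`, not with `scrapy crawl`
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl(query):
    # first crawl: spider2 searches Google and (by assumption) dumps
    # the URLs it found into urls.txt from its closed() method
    yield runner.crawl('spider2', q=query)
    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]
    # second crawl: same runner, same reactor - no ReactorAlreadyRunning
    if urls:
        yield runner.crawl('spider1', start_urls=urls, q=query)
    reactor.stop()

crawl('some query')
reactor.run()  # blocks until both crawls have finished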

I'd rather not do that though, because you'd be fighting against the framework. It'd be nice to support this use case, but it is not really supported now.

An easier way is to either rewrite the code to use a single spider class, or to create a script (bash, Makefile, luigi/airflow if you want to be fancy) that runs scrapy crawl spider1 -o items.jl followed by scrapy crawl spider2; the second spider can read the items created by the first spider and generate its start_requests accordingly, as in the sketch below.
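For example, the second spider's start_requests could look roughly like this, assuming the first run was scrapy crawl spider1 -o items.jl and that each exported item has a url field (both the class name and the item structure here are illustrative assumptions):

import json
import scrapy

class SecondSpider(scrapy.Spider):
    name = "second"

    def start_requests(self):
        # items.jl is JSON Lines: one JSON object per line,
        # as exported by the first `scrapy crawl ... -o items.jl` run
        with open('items.jl') as f:
            for line in f:
                item = json.loads(line)
                yield scrapy.Request(item['url'], callback=self.parse)

    def parse(self, response):
        # analyze the page here
        self.logger.info('Visited %s', response.url)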

FTR: combining SplashRequests and regular scrapy.Requests in a single spider is fully supported (it should just work), so you don't have to create separate spiders for them.
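Something along these lines, for example - a rough sketch assuming scrapy-splash is already configured in settings.py; the class and callback names are made up for illustration:

import scrapy
from scrapy_splash import SplashRequest

class CombinedSpider(scrapy.Spider):
    name = "combined"

    def __init__(self, *args, **kwargs):
        super(CombinedSpider, self).__init__(*args, **kwargs)
        self.query = kwargs.get('q', '')

    def start_requests(self):
        # the search results page is rendered through Splash
        yield SplashRequest('https://www.google.com/search?q=' + self.query,
                            self.parse_google, args={'wait': 1})

    def parse_google(self, response):
        # the result pages themselves are fetched with plain Requests
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), self.parse_page)

    def parse_page(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}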

