EzDev.org

scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python. Scrapy | A Fast and Powerful Scraping and Web Crawling Framework


How to pass a user defined argument in scrapy spider

I am trying to pass a user defined argument to a scrapy's spider. Can anyone suggest on how to do that?

I read about a parameter -a somewhere but have no idea how to use it.


Source: (StackOverflow)

how to filter duplicate requests based on url in scrapy

I am writing a crawler for a website using scrapy with CrawlSpider.

Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on urls. Also, I can filter requests using rules member of CrawlSpider.

What I want to do is to filter requests like:

http:://www.abc.com/p/xyz.html?id=1234&refer=5678

If I have already visited

http:://www.abc.com/p/xyz.html?id=1234&refer=4567

NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.

Now, if I have a set which accumulates all ids I could ignore it in my callback function parse_item (that's my callback function) to achieve this functionality.

But that would mean I am still at least fetching that page, when I don't need to.

So what is the way in which I can tell scrapy that it shouldn't send a particular request based on the url?


Source: (StackOverflow)

Scrapy: Follow link to get additional Item data?

I don't have a specific code issue I'm just not sure how to approach the following problem logistically with the Scrapy framework:

The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?

Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...

BUT the Details themselves aren't in the table -- but rather, a link to the page containing the details (if that doesn't make sense here's a table):

|-------------------------------------------------|
|             Title              |    Due Date    |
|-------------------------------------------------|
| Job Title (Clickable Link)     |    1/1/2012    |
| Other Job (Link)               |    3/2/2012    |
|--------------------------------|----------------|

I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.


Source: (StackOverflow)

How can i use multiple requests and pass items in between them in scrapy python

I have the item object and i need to pass that along many pages to store data in single item

LIke my item is

class DmozItem(Item):
    title = Field()
    description1 = Field()
    description2 = Field()
    description3 = Field()

Now those three description are in three separate pages. i want to do somrething like

Now this works good for parseDescription1

def page_parser(self, response):
      sites = hxs.select('//div[@class="row"]')
      items = []
      request =  Request("http://www.example.com/lin1.cpp",  callback =self.parseDescription1)
      request.meta['item'] = item
      return request 

def parseDescription1(self,response):
            item = response.meta['item']
            item['desc1'] = "test"
            return item

But i want something like

def page_parser(self, response):
      sites = hxs.select('//div[@class="row"]')
      items = []
      request =  Request("http://www.example.com/lin1.cpp",  callback =self.parseDescription1)
      request.meta['item'] = item

      request =  Request("http://www.example.com/lin1.cpp",  callback =self.parseDescription2)
      request.meta['item'] = item

      request =  Request("http://www.example.com/lin1.cpp",  callback =self.parseDescription2)
      request.meta['item'] = item


      return request 

def parseDescription1(self,response):
            item = response.meta['item']
            item['desc1'] = "test"
            return item

def parseDescription2(self,response):
            item = response.meta['item']
            item['desc2'] = "test2"
            return item

def parseDescription3(self,response):
            item = response.meta['item']
            item['desc3'] = "test3"
            return item

Source: (StackOverflow)

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.


Source: (StackOverflow)

Creating a generic scrapy spider

My question is really how to do the same thing as a previous question, but in Scrapy 0.14.

Using one Scrapy spider for several websites

Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.

Here is the code that I want to make generic. It also uses BeautifulSoup. I paired it down so hopefully didn't remove anything crucial to understand it.

class MySpider(CrawlSpider):

name = 'MySpider'
allowed_domains = ['somedomain.com', 'sub.somedomain.com']
start_urls = ['http://www.somedomain.com']

rules = (
    Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),

    Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
)

def parse_item(self, response):
    contentTags = []

    soup = BeautifulSoup(response.body)

    contentTags = soup.findAll('p', itemprop="myProp")

    for contentTag in contentTags:
        matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
        if matchedResult:
            print('URL Found: ' + response.url)

    pass

Source: (StackOverflow)

Scrapy crawl from script always blocks script execution after scraping

I am following this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run scrapy from my script. Here is part of my script:

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    print "It can't be printed out!"

It works at it should: visits pages, scrape needed info and stores output json where I told it(via FEED_URI). But when spider finishing his work(I can see it by number in output json) execution of my script wouldn't resume. Probably it isn't scrapy problem. And answer should somewhere in twisted's reactor. How could I release thread execution?


Source: (StackOverflow)

Crawling with an authenticated session in Scrapy

In my previous question, I wasn't very specific over my problem (scraping with an authenticated session with Scrapy), in the hopes of being able to deduce the solution from a more general answer. I should probably rather have used the word crawling.

So, here is my code so far:

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]


    def parse_item(self, response):
        i['url'] = response.url

        # ... do more things

        return i

As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function I tried to override in order to log in, now no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.

Anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider) Any help would be appreciated.


Source: (StackOverflow)

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items. Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.

But each item, has among other data, an url, with more details on that item. I want to follow that url and store in another item field (url_contents) the fetched contents of that item's url.

And I'm not sure how to organize code to achieve that, since the two links (listings link, and one particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.

My code so far looks like this:

class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths = '//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow = False),
    )


    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'

        sub_selectors = main_selector.select(xpath)

        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item = item, selector = sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()

Source: (StackOverflow)

How to setup and launch a Scrapy spider programmatically (urls and settings)

I've written a working crawler using scrapy,
now I want to control it through a Django webapp, that is to say:

  • Set 1 or several start_urls
  • Set 1 or several allowed_domains
  • Set settings values
  • Start the spider
  • Stop / pause / resume a spider
  • retrieve some stats while running
  • retrive some stats after spider is complete.

At first I thought scrapyd was made for this, but after reading the doc, it seems that it's more a daemon able to manage 'packaged spiders', aka 'scrapy eggs'; and that all the settings (start_urls , allowed_domains, settings ) must still be hardcoded in the 'scrapy egg' itself ; so it doesn't look like a solution to my question, unless I missed something.

I also looked at this question : How to give URL to scrapy for crawling? ; But the best answer to provide multiple urls is qualified by the author himeslf as an 'ugly hack', involving some python subprocess and complex shell handling, so I don't think the solution is to be found here. Also, it may work for start_urls, but it doesn't seem to allow allowed_domains or settings.

Then I gave a look to scrapy webservices : It seems to be the good solution for retrieving stats. However, it still requires a running spider, and no clue to change settings

There are a several questions on this subject, none of them seems satisfactory:

I know that scrapy is used in production environments ; and a tool like scrapyd shows that there are definitvely some ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd is dealing with are generated by hand !)

Thanks a lot for your help.


Source: (StackOverflow)

Force my scrapy spider to stop crawling

is there a chance to stop crawling when specific if condition is true (like scrap_item_id == predefine_value ). My problem is similar to Scrapy - how to identify already scraped urls but I want to 'force' my scrapy spider to stop crawling after discover the last scraped item.


Source: (StackOverflow)

How to use pycharm to debug scrapy projects

I am working on scrapy 0.20 with python 2.7.

I found pycharm a good python debugger.

I want to test my scrapy spiders using it.

Anyone knows how to do that please?

What I have tried

Actually I tried to run the spider as a scrip. As a result, I built that scrip.

Then, I tried to add my scrapy project to pycharm as a model like this:

file->setting->project structure->add content root.

But I don't know what else I have to do


Source: (StackOverflow)

Scrapy - how to manage cookies/sessions

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.

This is basically a simplified version of what I'm trying to do: enter image description here


The way the website works:

When you visit the website you get a session cookie.

When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.


My script:

My spider has a start url of searchpage_url

The searchpage is requested by parse() and the search form response gets passed to search_generator()

search_generator() then yields lots of search requests using FormRequest and the search form response.

Each of those FormRequests, and subsequent child requests need to have it's own session, so needs to have it's own individual cookiejar and it's own session cookie.


I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?

If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?

I assume I have to disable multiple concurrent requests.. otherwise one spider would be making multiple searches under the same session cookie, and future requests will only relate to the most recent search made?

I'm confused, any clarification would be greatly received!


EDIT:

Another options I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.

I suppose that would mean disabling cookies.. and then grabbing the session cookie from the search response, and passing it along to each subsequent request.

Is this what you should do in this situation?


Source: (StackOverflow)

How to use CrawlSpider from scrapy to click a link with javascript onclick?

I want scrapy to crawl pages where going on to the next link looks like this:

<a rel='nofollow' href="#" onclick="return gotoPage('2');"> Next </a>

Will scrapy be able to interpret javascript code of that?

With livehttpheaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:

encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n

I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it, with BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:

def logon(self, response):
    login_form_data={ 'email': 'user@example.com', 'password': 'mypass22', 'action': 'sign-in' }
    return [FormRequest.from_response(response, formnumber=0, formdata=login_form_data, callback=self.submit_next)]

And then I defined submit_next() to tell what to do next. I can't figure out how do I tell CrawlSpider which method to use on the first URL?

All requests in my crawling, except the first one, are POST requests. They are alternating two types of requests: pasting some data, and clicking "Next" to go to the next page.


Source: (StackOverflow)

Using Scrapy with authenticated (logged in) user session

In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:

class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...

I've got that working, and it's fine. But my question is: What do you have to do to continue scraping with authenticated session, as they say in the last line's comment?


Source: (StackOverflow)