
scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.


Headless Browser and scraping - solutions [closed]

I'm trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of scraping.


BROWSER TESTING / SCRAPING:

  • Selenium - the polyglot flagship of browser automation, with bindings for Python, Ruby, JavaScript, C#, Haskell and more, plus an IDE for Firefox (as an extension) for faster test deployment. Can act as a server and has tons of features.

JAVASCRIPT

  • PhantomJS - JavaScript, headless testing with screen capture and automation, uses WebKit. As of version 1.8 Selenium's WebDriver API is implemented, so you can use any WebDriver binding and tests will be compatible with Selenium.
  • SlimerJS - similar to PhantomJS, uses Gecko (Firefox) instead of WebKit
  • CasperJS - JavaScript, built on both PhantomJS and SlimerJS, has extra features
  • Ghost Driver - JavaScript implementation of the WebDriver Wire Protocol for PhantomJS.
  • new PhantomCSS - CSS regression testing. A CasperJS module for automating visual regression testing with PhantomJS and Resemble.js.
  • new WebdriverCSS - plugin for Webdriver.io for automating visual regression testing
  • new PhantomFlow - Describe and visualize user flows through tests. An experimental approach to Web user interface testing.
  • new trifleJS - ports the PhantomJS API to use the Internet Explorer engine.
  • new CasperJS IDE (commercial)

NODE.JS

  • Node-phantom - bridges the gap between PhantomJS and node.js
  • WebDriverJs - Selenium WebDriver bindings for node.js by Selenium Team
  • WD.js - node module for WebDriver/Selenium 2
  • yiewd - WD.js wrapper using latest Harmony generators! Get rid of the callback pyramid with yield
  • ZombieJs - Insanely fast, headless full-stack testing using node.js
  • NightwatchJs - Node JS based testing solution using Selenium Webdriver
  • Chimera - can do everything that PhantomJS does, but in a full JS environment
  • Dalek.js - Automated cross browser testing with JavaScript through Selenium Webdriver
  • Webdriver.io - better implementation of WebDriver bindings with 50+ predefined actions
  • new Nightmare - PhantomJS bridge with a high-level API. It uses PhantomJS-Node under the hood.

WEB SCRAPING / MINING

  • Scrapy - Python, mainly a scraper/miner - fast, well documented, can be linked with Django Dynamic Scraper for nice mining deployments or with Scrapy Cloud for PaaS (server-less) deployment, works in a terminal or as a stand-alone server process, can be used with Celery, built on top of Twisted (see the minimal spider sketch after this list)
  • Snailer - node.js module, untested yet.
  • Node-Crawler - node.js module, untested yet.
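
A minimal Scrapy spider sketch, just to illustrate the shape of a scraper built with it; the URL and item fields are hypothetical, and the older Scrapy selector API used elsewhere on this page is assumed:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class PageItem(Item):
    title = Field()
    url = Field()

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Extract data with XPath selectors and yield structured items.
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//title/text()").extract()
        yield PageItem(title=titles[0] if titles else None, url=response.url)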

ONLINE TOOLS

  • new CasperBox - Run CasperJS scripts online

RELATED LINKS & RESOURCES

Questions:

  • Any pure Node.js solution, or a Node.js-to-PhantomJS/CasperJS module, that actually works and is documented?

Answer: Chimera seems to go in that direction; check out Chimera

  • Other solutions capable of easier JavaScript injection than Selenium?

  • Do you know any pure Ruby solutions?

Answer: Check out the list created by rjk with Ruby-based solutions

  • Do you know any related tech or solution?

Feel free to re-edit this question and add content as you wish! Thank you for your contributions!


Updates

  1. added SlimerJS to the list
  2. added Snailer and Node-Crawler and Node-phantom
  3. added Yiewd WebDriver wrapper
  4. added WebDriverJs and WD.js
  5. added Ghost Driver
  6. added Comparison of Webscraping software on Screen Scraper Blog
  7. added ZombieJs
  8. added Resemble.js and PhantomCSS and PhantomFlow, categorised and re-edited content
  9. 04.01.2014, added Chimera, answered 2 questions
  10. added NightWatchJs
  11. added DalekJS
  12. added WebdriverCSS
  13. added CasperBox
  14. added trifleJS
  15. added CasperJS IDE
  16. added Nightmare

Source: (StackOverflow)

Cannot install lxml on Mac OS X 10.9

I want to install lxml so I can then install Scrapy.

When I updated my Mac today it wouldn't let me reinstall lxml; I get the following error:

In file included from src/lxml/lxml.etree.c:314:
/private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
#include "libxml/xmlversion.h"
         ^
1 error generated.
error: command 'cc' failed with exit status 1

I have tried using brew to install libxml2 and libxslt; both installed fine, but I still cannot install lxml.

Last time I was installing, I needed to enable the developer tools in Xcode, but since it updated to Xcode 5 it doesn't give me that option anymore.

Does anyone know what I need to do?


Source: (StackOverflow)

difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows the comparison between Amazon and eBay product prices. Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.


Source: (StackOverflow)

Access django models inside of Scrapy

Is it possible to access my Django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model?

I've seen this, but I don't really get how to set it up?
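
One common approach is to point the pipeline module at your Django settings and then use the Django ORM directly from process_item. A minimal sketch, assuming your Django project is importable; the project, app and model names (mysite, myapp, Article) are hypothetical, and newer Django versions additionally require django.setup():

# pipelines.py
import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

from myapp.models import Article  # hypothetical app and model

class DjangoWriterPipeline(object):
    def process_item(self, item, spider):
        # Map the scraped fields onto the Django model and save it.
        Article.objects.create(
            title=item.get("title"),
            url=item.get("link"),
        )
        return item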


Source: (StackOverflow)

How to give a delay between requests in Scrapy?

I don't want to crawl simultaneously and get blocked. I would like to send 1 request per second.
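
Scrapy has a built-in setting for exactly this; a minimal settings.py sketch:

# settings.py - wait roughly one second between requests to the same site
DOWNLOAD_DELAY = 1

# Scrapy randomizes the delay (0.5x to 1.5x of DOWNLOAD_DELAY) by default;
# disable that if you want a strict one request per second.
RANDOMIZE_DOWNLOAD_DELAY = False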


Source: (StackOverflow)

Learning Python and also trying to implement Scrapy... getting this error

I am going through the scrapy tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html and I followed it till I ran this command

scrapy crawl dmoz

And it gave me output with an error

2013-08-25 13:11:42-0700 [scrapy] INFO: Scrapy 0.18.0 started (bot: tutorial)
2013-08-25 13:11:42-0700 [scrapy] DEBUG: Optional features available: ssl, http11
2013-08-25 13:11:42-0700 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2013-08-25 13:11:42-0700 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/scrapy/commands/crawl.py", line 46, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Python/2.7/site-packages/scrapy/command.py", line 34, in crawler
    self._crawler.configure()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 44, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 62, in __init__
    self.downloader = Downloader(crawler)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 73, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 18, in __init__
    cls = load_object(clspath)
  File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 38, in load_object
    mod = __import__(module, {}, {}, [''])
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/s3.py", line 4, in <module>
    from .http import HTTPDownloadHandler
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/http.py", line 5, in <module>
    from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 13, in <module>
    from scrapy.xlib.tx import Agent, ProxyAgent, ResponseDone, \
  File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/__init__.py", line 6, in <module>
    from . import client, endpoints
  File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/client.py", line 37, in <module>
    from .endpoints import TCP4ClientEndpoint, SSL4ClientEndpoint
  File "/Library/Python/2.7/site-packages/scrapy/xlib/tx/endpoints.py", line 222, in <module>
    interfaces.IProcessTransport, '_process')):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/zope/interface/declarations.py", line 495, in __call__
    raise TypeError("Can't use implementer with classes.  Use one of "
TypeError: Can't use implementer with classes.  Use one of the class-declaration functions instead.

I am not very familiar with Python and I am not sure what it is complaining about.

here is my domz_spider.py file

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

And here is my items file

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

and here is the directory structure

scrapy.cfg
tutorial/
tutorial/items.py
tutorial/pipelines.py
tutorial/settings.py
tutorial/spiders/
tutorial/spiders/domz_spider.py

here is the settings.py file

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

Source: (StackOverflow)

scrapy: Call a function when a spider quits

Is there a way to trigger a method in a Spider class just before it terminates?

I can terminate the spider myself, like this:

from scrapy.contrib.spiders import CrawlSpider  # scrapy.spiders in newer Scrapy versions
from scrapy.exceptions import CloseSpider

class MySpider(CrawlSpider):
    #Config stuff goes here...

    def quit(self):
        #Do some stuff...
        raise CloseSpider('MySpider is quitting now.')

    def my_parser(self, response):
        if termination_condition:
            self.quit()

        #Parsing stuff goes here...

But I can't find any information on how to determine when the spider is about to quit naturally.
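
One way to get this is the spider_closed signal, which Scrapy fires when the spider is about to finish, whether it stops naturally or is closed early. A minimal sketch, assuming the older Scrapy layout used in these questions (scrapy.contrib.spiders, scrapy.xlib.pydispatch):

from scrapy import signals
from scrapy.contrib.spiders import CrawlSpider
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Run self.spider_closed once, right before the spider terminates.
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # Final cleanup / post-processing goes here.
        pass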


Source: (StackOverflow)

How to pass a user-defined argument to a Scrapy spider

I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?

I read about a parameter -a somewhere but have no idea how to use it.
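
For reference, arguments passed with -a arrive as keyword arguments to the spider's constructor (always as strings). A minimal sketch with hypothetical spider and argument names:

# Invoked as:  scrapy crawl myspider -a category=electronics
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # The -a value shows up here as a plain string keyword argument.
        self.start_urls = ["http://www.example.com/categories/%s" % category]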


Source: (StackOverflow)

Installing scrapy/pyopenssl in Windows' virtualenv

I am trying to install Scrapy in a Windows XP (32-bit) virtualenv:

pip install scrapy

The installer spits out this ambiguous error message:

error: Only found improper OpenSSL directories: ['E:\\cygwin', 'E:\\Program Files\\Git']

How should I configure openssl / pyOpenSSL to make pip work?


Source: (StackOverflow)

How can I use different pipelines for different spiders in a single Scrapy project

I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable for every spider.

Thanks
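
One straightforward approach is to keep all pipelines enabled in ITEM_PIPELINES and let each pipeline decide from spider.name whether it should touch the item. A minimal sketch with hypothetical spider names:

class MyCustomPipeline(object):
    # Spiders this pipeline applies to (hypothetical names).
    applicable_spiders = ["spider_one", "spider_two"]

    def process_item(self, item, spider):
        if spider.name not in self.applicable_spiders:
            return item  # pass items from other spiders through untouched
        # ... pipeline-specific processing for the listed spiders ...
        return item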


Source: (StackOverflow)

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.

Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue: if a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is sometimes live, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this is something I'm having trouble getting my head around.

I think Java or JavaScript is key; this pops up often.

The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the Scrapy library with Python 2.7.

I do apologize if this question is too open-ended. In short, my question is: how can Scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real time?

Cheers people :)
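
One way to handle this with Scrapy alone is to skip the rendered page and request the AJAX endpoint the browser itself calls (you can find it in the browser's developer tools / network tab), then parse the JSON it returns. A minimal sketch with a hypothetical endpoint and field names, using the older Scrapy API seen elsewhere on this page:

import json

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider

class OddsItem(Item):
    horse = Field()
    price = Field()

class OddsSpider(BaseSpider):
    name = "odds"
    # Hypothetical AJAX endpoint; the real one is visible in the network tab.
    start_urls = ["http://www.example.com/ajax/odds?event=123"]

    def parse(self, response):
        # The endpoint returns JSON rather than HTML, so parse it directly.
        data = json.loads(response.body)
        for row in data.get("horses", []):
            yield OddsItem(horse=row.get("name"), price=row.get("price"))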


Source: (StackOverflow)

Best way for a beginner to learn screen scraping with python

This might be one of those questions that are difficult to answer, but here goes:

I don't consider myself a programmer - but I would like to :-) I've learned R, because I was sick and tired of SPSS, and because a friend introduced me to the language - so I am not a complete stranger to programming logic.

Now I would like to learn Python - primarily to do screen scraping and text analysis, but also for writing web apps with Pylons or Django.

So: how should I go about learning to screen scrape with Python? I started going through the Scrapy docs but I feel too much "magic" is going on - after all, I am trying to learn, not just do.

On the other hand: there is no reason to reinvent the wheel, and if Scrapy is to screen scraping what Django is to web pages, then it might after all be worth jumping straight into Scrapy. What do you think?

Oh - BTW: the kind of screen scraping: I want to scrape newspaper sites (i.e. fairly complex and big) for mentions of politicians etc. - that means I will need to scrape daily, incrementally and recursively - and I need to log the results into a database of sorts - which leads me to a bonus question: everybody is talking about NoSQL DBs. Should I learn to use e.g. MongoDB right away (I don't think I need strong consistency), or is that foolish for what I want to do?

Thank you for any thoughts - and I apologize if this is too general to be considered a programming question.


Source: (StackOverflow)

Running Scrapy spiders in a Celery task

I have a Django site where a scrape happens when a user requests it, and my code kicks off a standalone Scrapy spider script in a new process. Naturally, this doesn't scale as the number of users increases.

Something like this:

class StandAloneSpider(Spider):
    #a regular spider

settings.overrides['LOG_ENABLED'] = True
#more settings can be changed...

crawler = CrawlerProcess( settings )
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl( spider )
crawler.start()

I've decided to use Celery and use workers to queue up the crawl requests.

However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spiders run successfully, but subsequent spiders will throw the ReactorNotRestartable error.

Can anyone share any tips on running spiders within the Celery framework?
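
One workaround people use is to launch each crawl in its own child process from the Celery task, so every crawl gets a fresh Twisted reactor and ReactorNotRestartable never comes up. A rough sketch reusing the old crawler API from the snippet above; names are hypothetical:

from multiprocessing import Process

def _crawl(spider_cls):
    # Everything Scrapy-related happens inside the child process, so the
    # reactor is created and torn down there and never has to restart
    # inside the long-lived Celery worker.
    from scrapy.conf import settings          # old Scrapy API, as in the snippet above
    from scrapy.crawler import CrawlerProcess

    crawler = CrawlerProcess(settings)
    crawler.install()
    crawler.configure()
    crawler.crawl(spider_cls())
    crawler.start()

def run_spider(spider_cls):
    p = Process(target=_crawl, args=(spider_cls,))
    p.start()
    p.join()

# Inside the Celery task body: run_spider(StandAloneSpider)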


Source: (StackOverflow)

How to get the scrapy failure URLs?

I'm a newbie to Scrapy, and it's the most amazing crawler framework I have known!

In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can just see some statistics but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
 'downloader/request_bytes': 46282582,
 'downloader/request_count': 92383,
 'downloader/request_method_count/GET': 92383,
 'downloader/response_bytes': 123766459,
 'downloader/response_count': 92382,
 'downloader/response_status_count/200': 92382,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
 'item_scraped_count': 46191,
 'request_depth_max': 1,
 'scheduler/memory_enqueued': 92383,
 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, showing those failed URLs. Thanks!
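
One option is to attach an errback to your requests so every download failure gets logged together with its URL (the stats above only count them). A minimal sketch using the older Scrapy API; the spider details are hypothetical:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class PdSpider(BaseSpider):
    name = "pd_spider"
    start_urls = ["http://www.example.com/"]

    def make_requests_from_url(self, url):
        # Route download failures (timeouts, dropped connections, DNS errors)
        # to on_error instead of leaving them as a bare number in the stats.
        return Request(url, callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        # The failed Request is usually reachable from the failure object;
        # log at least the error itself, plus the URL when available.
        request = getattr(failure, "request", None)
        url = request.url if request else "<unknown>"
        self.log("Request failed: %s (%s)" % (url, failure.getErrorMessage()))

    def parse(self, response):
        pass  # normal parsing goes here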


Source: (StackOverflow)

Scrapy Unit Testing

I'd like to implement some unit tests in Scrapy (a screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit-testing framework, Trial? If so, how? Otherwise I'd like to get nose working.

Update:

I've been talking on Scrapy-Users and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work though.

I can build a unit-test test class and in a test:

  • create a response object
  • try to call the parse method of my spider with the response object

However, it ends up generating this traceback. Any insight as to why?
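
A minimal sketch of the approach quoted above: construct a fake Response around a saved HTML fixture and feed it straight to the spider's parse callback. The file path, project layout and spider name are hypothetical:

import unittest

from scrapy.http import HtmlResponse, Request

from myproject.spiders.my_spider import MySpider  # hypothetical project layout

class ParseTest(unittest.TestCase):
    def test_parse_yields_results(self):
        url = "http://www.example.com/page"
        body = open("tests/fixtures/sample_page.html").read()
        response = HtmlResponse(url=url, body=body, request=Request(url=url))

        # Call the callback directly and collect whatever it yields.
        results = list(MySpider().parse(response))

        # Assert on the expected items/requests here.
        self.assertTrue(len(results) > 0)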


Source: (StackOverflow)