EzDev.org

goose

Html Content / Article Extractor in Scala - open sourced from Gravity Labs


Read article content using goose retrieving nothing

I am trying to use goose to read from .html files (a URL is specified here for the sake of convenience in the examples)[1]. But at times it doesn't show any text. Please help me out with this issue.

Goose version used: https://github.com/agolo/python-goose/. The present version gives some errors.

from goose import Goose
from requests import get

response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
print text

Source: (StackOverflow)

Python Goose - Article Extraction Unicode Errors

After an anticipated wait, I decided to give Goose a try to extract articles; however, I am getting many unicode problems with the extracted text.

from goose import Goose

g = Goose()
article = g.extract(url='http://www.forbes.com/sites/benkepes/2014/08/19/more-openstack-certifications-because-interoperability-is-key/?partner=yahootix')
articlebody = article.cleaned_text[:1300]
ex_article = articlebody.encode('utf-8')
print ex_article

The result looks as follows:

Trove is the database as a service component of OpenStack that lets administrators and DevOps operate many instances of a variety of different database management systems (DBMS) technologies using common infrastructure. Â Â Trove assures common administrative tasks including provisioning,

So far I have tried using

1) .encode('utf-8')

2) .decode('utf-8')

3) from __future__ import unicode_literals at the start of the file

4) Reloading sys to include UTF-8

How can I get this cleansed, pure text article with no unicode problems?
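
One possible cause, assumed here rather than confirmed by the post, is that the page's UTF-8 bytes are decoded as Latin-1 somewhere along the way. A non-breaking space (U+00A0) is the byte pair C2 A0 in UTF-8, and reading those bytes as Latin-1 yields "Â" followed by a non-breaking space, which would match the stray "Â" in the output above. The Python 3 sketch below reproduces the garbling on a sample string and reverses it; `repair_mojibake` is a hypothetical helper, not part of Goose.

```python
def repair_mojibake(garbled):
    """Reverse UTF-8 text that was mistakenly decoded as Latin-1.

    Returns the input unchanged when the round trip is not possible,
    which usually means the text was not garbled in this particular way.
    """
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


# A sample sentence containing a non-breaking space (U+00A0).
original = "infrastructure.\u00a0Trove assures"

# Simulate the suspected bug: UTF-8 bytes read back as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Undo the damage on the sample.
repaired = repair_mojibake(garbled)
```

If the repaired text equals the original on a real article, the fix belongs where the page is decoded, not in a later `.encode('utf-8')` call.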


Source: (StackOverflow)

ImportError: No module named goose

I'm trying to work with the Python-Goose extractor. I installed virtualenv and followed the setup instructions. When running from PyCharm, everything works great.

But when running from the Windows Command Prompt I'm getting this error:

C:\Users\tal>C:\virtual_enviroments\goose_venv\Scripts\activate
(goose_venv) C:\Users\tal>cd C:\main\prototypes\collection\goose-cli\app

(goose_venv) C:\main\prototypes\collection\goose-cli\app>extract-new-events.py
Traceback (most recent call last):
  File "C:\main\prototypes\collection\goose-cli\app\extract-new-events.py", line 1, in <module>
    from goose import Goose
ImportError: No module named goose

What am I doing wrong here?

Here is an image of it working in PyCharm (large):

[Image: working in PyCharm]
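
A quick way to narrow this down is to print which interpreter is actually running the script and whether it can see the module. The sketch below assumes (the post does not confirm this) that launching `extract-new-events.py` directly lets the Windows file association pick an interpreter outside the virtualenv, whereas `python extract-new-events.py` would use the activated one. `check_module` is a hypothetical helper.

```python
import importlib
import sys


def check_module(module_name):
    """Return (interpreter path, module file) for the running interpreter.

    The second element is None when the module cannot be imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return (sys.executable, None)
    return (sys.executable, getattr(module, "__file__", None))


# A standard-library module is used here so the sketch runs anywhere;
# in the failing script the name to check would be "goose".
interpreter, location = check_module("json")
```

If the interpreter path printed from the Command Prompt differs from the one PyCharm reports, the two runs are using different Python installations.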


Source: (StackOverflow)

PyInstaller can't find goose file path

Question

Why does PyInstaller not work with goose files? Is it an issue with the executable creator or with my code?

Code

from goose.Goose import Goose
url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose({'debug':False,'enableImageFetching': False,'localStoragePath':'./tmp'})
article = g.extractContent(url=url)
#article.title
print article.cleanedArticleText[:150].encode("utf8","ignore")

Error Log From Pyinstaller

My program, created with pyinstaller, fails to find goose files in this path:

IOError: Couldn't open file C:\Users\user\Desktop\dist\main.exe?118272\goose/resources/text/stopwords-en.txt

This happens:

Traceback (most recent call last):
  File "<string>", line 15, in <module>
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.Goose",line 52, in extractContent
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.Goose",line 59, in sendToActor
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.Crawler", line 86, in crawl
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.extractors", line 245, in calculateBestNodeBasedOnClustering
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.text", line 97, in __init__
  File "C:\Users\user\Desktop\build\pyi.win32\main\out00-PYZ.pyz\goose.utils",line 76, in loadResourceFile
  IOError: Couldn't open file C:\Users\user\Desktop\dist\main.exe?118272\goose/resources/text/stopwords-en.txt

What's wrong?
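
The traceback suggests that the stop-word text files were not bundled alongside the frozen code. A common pattern, sketched here under the assumption of a PyInstaller release that sets `sys._MEIPASS` for the running bundle (older releases such as the one in the log may behave differently), is to resolve data files relative to that directory and to list those files as data in the build configuration. `resource_path` is a hypothetical helper, not part of Goose or PyInstaller.

```python
import os
import sys


def resource_path(relative_path):
    """Return an absolute path to a bundled data file.

    When running from a PyInstaller bundle, sys._MEIPASS points at the
    directory where bundled files are unpacked; otherwise fall back to
    the current working directory.
    """
    base_dir = getattr(sys, "_MEIPASS", os.path.abspath("."))
    return os.path.join(base_dir, relative_path)


# Path where the stop-word file from the error message would be expected.
stopwords_file = resource_path(
    os.path.join("goose", "resources", "text", "stopwords-en.txt")
)
```

Goose would still need to look for its resources at that location, which this sketch does not cover.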


Source: (StackOverflow)

Goose NoClassDefFound Error

I am trying to integrate Goose-2.1.22 into one of my applications. However, when I try to run my app with the basic code they provided, I get this error:

02-16 11:19:55.048  29391-29391/test.package.test2 E/AndroidRuntime﹕ FATAL EXCEPTION: main
java.lang.NoClassDefFoundError: com.gravity.goose.Goose
        at test.package.test2.Searching_Animation_Screen.goose_it(Searching_Animation_Screen.java:65)
        at test.package.test2.Searching_Animation_Screen.onCreate(Searching_Animation_Screen.java:59)
        at android.app.Activity.performCreate(Activity.java:5372)
        at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1104)
        at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2267)
        at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2359)
        at android.app.ActivityThread.access$700(ActivityThread.java:165)
        at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1326)

Here is the code that uses goose (method called from onCreate())

String url = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
Goose goose = new Goose(new Configuration());
Article article = goose.extractContent(url);
System.out.println(article.cleanedArticleText());
text.setText(article.cleanedArticleText().toString());

Any ideas how to fix my issue? Thanks everyone!


Source: (StackOverflow)

Problems when installing goose

I followed the exact instructions from https://github.com/grangier/python-goose when installing goose, and after I typed in "mkvirtualenv --no-site-packages goose", this is what I got:

172-27-220-167:~ yitongwang$ mkvirtualenv --no-site-packages goose
New python executable in goose/bin/python
Installing setuptools, pip...done.
Error: deactivate must be sourced. Run 'source deactivate'
instead of 'deactivate'.
Usage: source deactivate
removes the 'bin' directory of the environment activated with 'source
activate' from PATH. 
(goose)172-27-220-167:~ yitongwang$

I have installed virtualenv and virtualenvwrapper using 'sudo pip install virtualenv/virtualenvwrapper', and the weirdest thing is that I still seem to have managed to enter the goose virtual environment. After cloning the git repo and changing to the python-goose directory cloned earlier, I attempted to run 'pip install -r requirements.txt' and 'python setup.py install', and these are the errors:

In file included from _imagingft.c:31:
/Users/yitongwang/anaconda/include/ft2build.h:56:10: fatal error: 'freetype/config/ftheader.h' file not found
#include <freetype/config/ftheader.h>
         ^
1 error generated.
Building using 4 processes
gcc -bundle -undefined dynamic_lookup -L/Users/yitongwang/anaconda/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.5-x86_64-2.7/_imagingft.o -L/Users/yitongwang/.virtualenvs/goose/lib -L/usr/local/lib -L/usr/local/Cellar/freetype/2.5.5/lib -L/usr/lib -L/Users/yitongwang/anaconda/lib -lfreetype -o build/lib.macosx-10.5-x86_64-2.7/PIL/_imagingft.so
clang: error: no such file or directory: 'build/temp.macosx-10.5-x86_64-2.7/_imagingft.o'
error: command 'gcc' failed with exit status 1

----------------------------------------
Command "/Users/yitongwang/.virtualenvs/goose/bin/python -c "import setuptools, tokenize;__file__='/private/var/folders/64/dhzf31k50zg22rbgbz79c3dw0000gn/T/pip-build-nL0d0r/Pillow/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/64/dhzf31k50zg22rbgbz79c3dw0000gn/T/pip-k7HUgC-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/yitongwang/.virtualenvs/goose/include/site/python2.7" failed with error code 1 in /private/var/folders/64/dhzf31k50zg22rbgbz79c3dw0000gn/T/pip-build-nL0d0r/Pillow

    In file included from _imagingft.c:31:
/Users/yitongwang/anaconda/include/ft2build.h:56:10: fatal error: 
      'freetype/config/ftheader.h' file not found
#include <freetype/config/ftheader.h>
         ^
1 error generated.
clang: error: no such file or directory: 'build/temp.macosx-10.5-x86_64-2.7/_imagingft.o'
error: Setup script exited with error: command 'gcc' failed with exit status 1

I'm not sure what in particular is wrong, because I have tried a few times from scratch, deleting the directory 'python-goose' and './virtualenv' as well as the path from .bash_profile.

Any help would be much much appreciated!

Thanks

P.S. I'm using Anaconda with Python 2.7 in it.


Source: (StackOverflow)

How can I resolve "maximum recursion depth exceeded" (Goose-extractor)

I have a problem with goose-extractor. This is my code:

for resultado in soup.find_all('a', rel='nofollow', href=True, text=re.compile(llave)):
    url = resultado['href']
    article = g.extract(url=url)
    print article.title

And this is the error I get:

RuntimeError: maximum recursion depth exceeded

Any suggestions?

Either I am a lousy programmer, or there are hidden errors that are not visible in Python.
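
Without the full traceback the cause cannot be pinned down, but a defensive loop at least keeps one bad page from stopping the whole run. The sketch below assumes the error is raised inside the extraction call. `extract_titles` and `fake_extract` are hypothetical stand-ins (in the real script, the callable would wrap `g.extract(url=url).title`), and raising the limit with `sys.setrecursionlimit` is optional and only postpones the error if the recursion is genuinely unbounded.

```python
import sys


def extract_titles(urls, extract_title, recursion_limit=None):
    """Collect titles, recording URLs whose extraction raised RuntimeError."""
    if recursion_limit is not None:
        sys.setrecursionlimit(recursion_limit)
    titles = []
    failures = []
    for url in urls:
        try:
            titles.append(extract_title(url))
        except RuntimeError as error:
            # "maximum recursion depth exceeded" is a RuntimeError
            # (RecursionError, its Python 3 name, is a subclass).
            failures.append((url, str(error)))
    return titles, failures


def fake_extract(url):
    """Stand-in for the real extractor: fails on one specific URL."""
    if url.endswith("/bad"):
        raise RuntimeError("maximum recursion depth exceeded")
    return "Title of " + url


titles, failures = extract_titles(
    ["http://example.com/good", "http://example.com/bad"], fake_extract
)
```

The list of failures shows which pages trigger the error, which is usually the first step toward finding out why.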


Source: (StackOverflow)

Logic behind goose extractor

I have been thinking lately about using Goose Extractor for boilerplate removal. I am not sure whether blindly trusting Goose Extractor is the right thing to do, so I wanted to ask if anyone knows the logic behind it. I know that it is a sort of statistical method, but nowhere is the whole logic behind the extraction process described.

Any help will be highly appreciated.

Thank you!
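
Goose ships per-language stop-word lists, which suggests that stop-word counts feed into how candidate text blocks are scored. The toy sketch below illustrates only that general idea, namely that running prose contains many stop words while menus and footers contain few. It is a deliberately simplified assumption, not Goose's actual algorithm, and `STOP_WORDS`, `stop_word_count`, and `best_block` are hypothetical names.

```python
# A tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "that", "it", "was"}


def stop_word_count(text):
    """Count the words in `text` that appear in STOP_WORDS."""
    words = (word.strip(".,;:!?\"'") for word in text.lower().split())
    return sum(1 for word in words if word in STOP_WORDS)


def best_block(blocks):
    """Pick the block with the most stop words as the likely article body."""
    return max(blocks, key=stop_word_count)


blocks = [
    "Home | About | Contact",
    "The cat sat in the garden and it was happy that the sun was out.",
    "Copyright 2015",
]
article_body = best_block(blocks)
```

A real extractor would also have to weigh link density, neighbouring blocks, and markup, none of which this sketch attempts.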


Source: (StackOverflow)

Get rid of the backslash from a goose extracted text

I have a small regex problem with text extracted by goose.

I have extracted the clean text from an HTML page using Goose. The output that goose gives is fine, except for one small problem: I get the string below.

    My name is Sam\'s, I like to play \'football\'

The actual text looks like 

    My name is Sam's, I like to play 'football'

I am trying to get rid of the backslash. When I run the code below on the text extracted by goose, it somehow doesn't work; however, if I input the text myself, the code works perfectly.

I tried each of the following:

re.sub(r"\\","",text)
text.replace("\\","")
text.decode()

Please find the code below:

import re
from goose import Goose
url = 'http://economictimes.indiatimes.com/news/politics-and-nation/swach-bharat-drives-draws-inspiration-from-mahatma-gandhi/articleshow/49203355.cms'
g = Goose()
article = g.extract(url=url)
text=article.cleaned_text

print text
.....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....

text=re.sub(r"\\","",text)
print text
.....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....

How do I get rid of the backslash?
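
One explanation worth ruling out, assumed here rather than confirmed by the post, is that the backslash is not in the string at all and only appears in its `repr`, for example when the text is shown at the interactive prompt or inside a list. Python escapes a single quote in the `repr` of a string that contains both quote types. The sketch below shows the difference and confirms that `replace` does remove a backslash that is really present.

```python
# A string containing both quote types but no backslash character.
text = 'Sam\'s "football"'

# The repr (what the interactive prompt shows) escapes the single quote.
shown = repr(text)

has_backslash = "\\" in text          # False: no backslash in the data
repr_has_backslash = "\\" in shown    # True: backslash exists only in the repr

# When a literal backslash really is part of the data, replace removes it.
with_literal_backslash = "Sam\\'s"
cleaned = with_literal_backslash.replace("\\", "")
```

Checking `"\\" in text` on the extracted article is a direct way to tell the two situations apart.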


Source: (StackOverflow)

Java Goose not extracting content on Android

I'm trying to set up a small Android application which extracts content from a web page using the Goose library. Since the library is written in Scala, I'm using the .jar I found here. The problem is, when I try to extract content from a page, it returns nothing. I successfully create an Article object using the URL I need, but the values of the object (title, domain, topImage etc.) are all null. I tried using different urls, to see if the problem was isolated to a single website, but it doesn't appear to be so.

The code I use to set up the Goose instance is this:

gooseDir = context.getCacheDir();
Configuration config = new Configuration();
config.setLocalStoragePath(gooseDir.getAbsolutePath());
Goose goose = new Goose(config);

And then I just create the Article instance like so:

Article article = goose.extractContent(url);

Any advice?


Source: (StackOverflow)

What is the proper goose import syntax

The goose install places goose in the python-goose directory. When I try to import goose at the IDLE prompt I get:

>>> from goose import Goose

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    from goose import Goose
ImportError: No module named goose

Because goose is installed in the python-goose directory, I believe the import syntax should be: from python-goose.goose import Goose. However, when I run this I get the following syntax error message:

>>> from python-goose.goose import Goose
SyntaxError: invalid syntax

Any suggestions on how to properly import goose would be appreciated.
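
A hyphen cannot appear in a module name inside an `import` statement, so the directory name itself is never imported. What matters is that the directory containing the package is on `sys.path`. The sketch below demonstrates this with a throwaway directory; `demo_pkg` is a hypothetical package created only for the demonstration, and the usual route for Goose is to install it into the same interpreter that IDLE uses.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway checkout directory whose name contains a hyphen.
checkout = os.path.join(tempfile.mkdtemp(), "python-goose")
package_dir = os.path.join(checkout, "demo_pkg")
os.makedirs(package_dir)
with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
    handle.write("NAME = 'demo'\n")

# Put the hyphenated directory on the search path, then import the
# package inside it by its own (valid) name.
sys.path.insert(0, checkout)
demo_pkg = importlib.import_module("demo_pkg")
```

With a real checkout, the equivalent would be adding the python-goose directory to `sys.path` and then running `from goose import Goose`.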


Source: (StackOverflow)

How to extract article in chinese

from newspaper import Article
from unidecode import unidecode

def get_article_newspaper(url):
    article = Article(url, language='zh')  # Chinese
    article.download()
    article.parse()  # article.text is blank!
    print unidecode(article.text).replace('Image caption', '')

url='http://www.tyfzw.cn/?sw=774&b=177%20'
get_article_newspaper(url)

newspaper seemed the most maintained, so I tried it. I also tried goose and boilerpipe; neither works.

Later I also want to translate the text:

import goslate
def language_translate(text): #translates to language
    gs = goslate.Goslate()
    language_id = gs.detect(text)  # pass the variable, not the literal string 'text'
    if language_id != 'en':
        text=gs.translate(text, 'en')
    return text

Source: (StackOverflow)

Extending Multi-Language Version from Goose in Python

Goose is a tool which extracts sentences, photos, pictures, etc. from URLs. This tool is written in Python. All the code is at the following URL.

https://github.com/grangier/python-goose/tree/develop/goose

My main purpose is to add processing for other languages which are NOT supported in the current version.

First, I read the tutorials: Chinese, Korean, and Arabic can be handled by setting the "stop_words" parameter.

Thus, I also searched for this notion of "stop_words" in the entire package. I found the following Python classes.

class StopWords(object):
class StopWordsChinese(StopWords):
class StopWordsArabic(StopWords):
class StopWordsKorean(StopWords):

I also found text files of stop words written in various languages. These files are located in /resource/text/ under the above URL.

QUESTION 1: Are there other components in the package that need to be rewritten in order to add Japanese and all the other languages which are NOT included in the newest version?

QUESTION 2: As a first step, I want to add Japanese support. Are there any tips for web scraping from URLs in Japanese?
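
One point that likely matters for Japanese is that words are not separated by spaces, so a counter that splits text on whitespace finds no stop words. The Python 3 sketch below illustrates a per-language subclass that changes only the tokenization step. Both classes are hypothetical stand-ins written for this example and do not reproduce the method names or signatures of Goose's own StopWords classes, and the three-particle stop-word list is an assumption for illustration.

```python
class StopWordCounter(object):
    """Hypothetical whitespace-based stop-word counter."""

    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def candidate_words(self, text):
        # Works for languages that separate words with spaces.
        return text.split()

    def count(self, text):
        words = self.candidate_words(text)
        return sum(1 for word in words if word in self.stop_words)


class SubstringStopWordCounter(StopWordCounter):
    """Variant for text without spaces: find stop words by substring."""

    def candidate_words(self, text):
        found = []
        for word in self.stop_words:
            # Emit the stop word once per occurrence in the text.
            found.extend([word] * text.count(word))
        return found


japanese_stop_words = ["の", "は", "を"]
sentence = "私の猫は魚を食べる"

whitespace_count = StopWordCounter(japanese_stop_words).count(sentence)
substring_count = SubstringStopWordCounter(japanese_stop_words).count(sentence)
```

Substring matching is crude; a proper morphological tokenizer would be the more robust choice for real Japanese text.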


Source: (StackOverflow)

Cannot import python-goose (OSX 10.9)

I am trying to properly set up python-goose in a virtualenv.

Update: I nuked python and started with a clean install as outlined here.

I followed the python-goose instructions and did:

mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

pip install -r requirements.txt fails on lxml.

The error I get now is:

error: command 'cc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /Users/me/.virtualenvs/goose/bin/python -c "import setuptools, tokenize;__file__='/Users/me/.virtualenvs/goose/build/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/wg/82j6ndq50tl4m9rjkqszyx8r0000gp/T/pip-c9DtYT-record/install-record.txt --single-version-externally-managed --compile --install-headers 
/Users/me/.virtualenvs/goose/include/site/python2.7 failed with error code 1 in   
/Users/me/.virtualenvs/goose/build/lxml

Is there anything I am doing incorrectly or are there any alternative ways I can try to get this working?


Source: (StackOverflow)

Python Goose not able to extract mashable / usatoday / politicalwire articles

I am using the python goose extractor and it's failing for every article on mashable.com and usatoday.com. Can someone suggest a fix for the problem?

For usatoday.com article:

from goose import Goose

g = Goose()
article = g.extract(url='http://www.usatoday.com/story/tech/columnist/talkingtech/2014/01/25/namm-2014---ik-multimedias-rings-to-make-music/4863193/')
assert(article.cleaned_text=='')

For mashable article:

g = Goose()
article = g.extract(url='http://mashable.com/2014/01/26/square-cofounder-jim-mckelvey/')
assert(article.cleaned_text=='')

For politicalwire article:

g = Goose()
article = g.extract(url='http://politicalwire.com/archives/2014/01/27/some_republicans_go_off_script_in_sotu_response.html')
assert(article.cleaned_text=='')

I assume these are pretty important websites for text extraction. Can someone suggest a fix please? Thanks
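
When an extractor returns an empty string, one useful check is whether the downloaded HTML contains any paragraph text at all, since some pages fill in the article body with JavaScript after loading. The Python 3 sketch below measures the text inside `<p>` tags with the standard library only. `ParagraphTextCounter` and `paragraph_text_length` are hypothetical helpers, and whether the sites listed above actually behave this way is an assumption that would have to be checked against their raw HTML.

```python
from html.parser import HTMLParser


class ParagraphTextCounter(HTMLParser):
    """Count the characters of text that appear inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <p> elements are currently open
        self.chars = 0   # characters of paragraph text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chars += len(data.strip())


def paragraph_text_length(raw_html):
    """Return the amount of paragraph text found in an HTML string."""
    counter = ParagraphTextCounter()
    counter.feed(raw_html)
    counter.close()
    return counter.chars


static_page = "<html><body><p>Hello world</p></body></html>"
scripted_page = "<html><body><div id='app'></div><script>render()</script></body></html>"

static_length = paragraph_text_length(static_page)
scripted_length = paragraph_text_length(scripted_page)
```

A length near zero for the raw HTML of a page points at the download step, while a large value points at the extraction heuristics.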


Source: (StackOverflow)