#!/usr/local/bin/python
# encoding: utf-8
"""
*using the Mercury Parser API to clean up a local html file*

:Author:
    David Young

:Date Created:
    October 1, 2016
"""
################# GLOBAL IMPORTS ####################
"""
*A parser/cleaner to strip a webpage article of all cruft and neatly present it with some nice css*

**Key Arguments:**
    - ``log`` -- logger
    - ``settings`` -- the settings dictionary
    - ``url`` -- the URL to the HTML page to parse and clean
    - ``outputDirectory`` -- path to the directory to save the output html file to
    - ``title`` -- title of the document to save. If *False* will take the title of the HTML page as the filename. Default *False*.
    - ``style`` -- add polyglot's styling to the HTML document. Default *True*
    - ``metadata`` -- include metadata in generated HTML. Default *True*
    - ``h1`` -- include title as H1 at the top of the doc. Default *True*

**Usage:**
To generate the HTML page, using the title of the webpage as the filename:
.. code-block:: python

    from polyglot import htmlCleaner
    cleaner = htmlCleaner(
        log=log,
        settings=settings,
        url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
        outputDirectory="/tmp"
    )
    cleaner.clean()

Or specify the title of the document and remove styling, metadata and title:
.. code-block:: python

    from polyglot import htmlCleaner
    cleaner = htmlCleaner(
        log=log,
        settings=settings,
        url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
        outputDirectory="/tmp",
        title="my_clean_doc",
        style=False,
        metadata=False,
        h1=False
    )
    cleaner.clean()

"""
# INITIALISATION

def __init__(
        self,
        log,
        settings,
        url,
        outputDirectory=False,
        title=False,
        style=True,
        metadata=True,
        h1=True
):
# INITIAL ACTIONS
def clean(
        self):
    """*parse and clean the html document with Mercury Parser*

    **Return:**
        - ``filePath`` -- path to the cleaned HTML document

    **Usage:**

        See class usage
    """
# PARSE THE CONTENT OF THE WEBPAGE AT THE URL
return None
return None
# GRAB THE CSS USED TO STYLE THE WEBPAGE/PDF CONTENT
else:
    thisCss = ""
# CATCH ERRORS
print(url)
print(" " + article["messages"])
return None
except:
    print("Can't decode the text of %(url)s - moving on" % locals())
    return None
# COMMON FIXES TO HTML TO RENDER CORRECTLY
u'<span class="mw-editsection"><span class="mw-editsection-bracket">.*"mw-editsection-bracket">]')
u'\<sup class="noprint.*better source needed\<\/span\>\<\/a\>\<\/i\>\]\<\/sup\>', re.I)
u'\<a href="https\:\/\/en\.wikipedia\.org\/wiki\/.*(\#.*)"\>\<span class=\"tocnumber\"\>', re.I)
u'srcset=".*?">')
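# A minimal sketch of how clean-up patterns like those above might be applied
# with ``re.sub``. The helper name ``strip_cruft`` and both simplified,
# non-greedy patterns are assumptions made for illustration; they are not the
# original expressions and only cover the Wikipedia "[edit]" links and
# ``srcset`` attributes.

```python
import re

# Simplified, hypothetical variants of the clean-up patterns.
EDIT_LINK = re.compile(
    r'<span class="mw-editsection"><span class="mw-editsection-bracket">'
    r'.*?"mw-editsection-bracket">]</span></span>')
SRCSET = re.compile(r'\s*srcset=".*?"')


def strip_cruft(text):
    """Return *text* with edit links and srcset attributes removed."""
    text = EDIT_LINK.sub("", text)   # drop the whole "[edit]" span
    text = SRCSET.sub("", text)      # drop the attribute, keep the tag
    return text
```

# The lazy ``.*?`` keeps each match inside a single span or attribute, so
# neighbouring markup on the same line is left alone.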
# GRAB HTML TITLE IF NOT SET IN ARGUMENTS
# USE DATETIME IF TITLE STILL NOT SET
from datetime import datetime, date, time
now = datetime.now()
title = now.strftime("%Y%m%dt%H%M%S")
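# The title fallback above can be sketched as a small standalone function.
# ``fallback_title`` and its ``now`` parameter are hypothetical names added
# here so the timestamp can be fixed when testing; only the format string
# comes from the source.

```python
from datetime import datetime


def fallback_title(title=False, now=None):
    """Return *title* if it is set, otherwise a timestamp-based title."""
    if title:
        return title
    now = now or datetime.now()
    # Same format string as the source: date, a literal "t", then time.
    return now.strftime("%Y%m%dt%H%M%S")
```

# For example, ``fallback_title(False, datetime(2016, 10, 1, 9, 30, 0))``
# gives ``"20161001t093000"``.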
# REGENERATE THE HTML DOCUMENT WITH CUSTOM STYLE
filePath, encoding='utf-8', mode='w')
else:
    metadata = ""
else:
    h1 = ""
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
%(metadata)s

<style>
%(thisCss)s
</style>

</head>
<body>

%(h1)s
<a href="%(url)s">original source</a>
<br><br>

%(text)s
</body>
</html>""" % locals()
def _request_parsed_article_from_mercury(
        self,
        url):
    """*request parsed article from mercury*

    **Key Arguments:**
        - ``url`` -- the URL to the HTML page to parse and clean

    **Return:**
        - None

    **Usage:**

    .. todo::

        - add usage info
        - create a sublime snippet for usage
        - update package tutorial if needed

    .. code-block:: python

        usage code
"""
'starting the ``_request_parsed_article_from_mercury`` method')
url="https://mercury.postlight.com/parser",
params={
    "url": url,
},
headers={
    "x-api-key": self.settings["mercury api key"],
},
)
except requests.exceptions.RequestException:
    print('HTTP Request failed')
'completed the ``_request_parsed_article_from_mercury`` method')
# use the tab-trigger below for new method
# xt-class-method
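
# A rough, self-contained sketch of the request made in
# ``_request_parsed_article_from_mercury``. It is written for Python 3 and
# swaps ``requests`` for the standard-library ``urllib`` so it has no
# third-party dependency. The names ``build_mercury_request`` and
# ``fetch_article``, the ``timeout`` default and the assumption that the
# endpoint answers with JSON are illustrative, not taken from the source.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

MERCURY_ENDPOINT = "https://mercury.postlight.com/parser"


def build_mercury_request(url, api_key):
    """Build (but do not send) a GET request for the parser endpoint."""
    query = urlencode({"url": url})  # percent-encode the article URL
    return Request(
        MERCURY_ENDPOINT + "?" + query,
        headers={"x-api-key": api_key},
    )


def fetch_article(url, api_key, timeout=10):
    """Return the parsed-article dictionary, or None if the request fails."""
    request = build_mercury_request(url, api_key)
    try:
        with urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers a body that is not valid JSON.
        print("HTTP Request failed")
        return None
```

# Keeping the request construction separate from the network call means the
# URL encoding and the API-key header can be checked without contacting the
# service.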