Python modules Cleaning data
Modules: parsing HTML and XML pages
pattern
The `pattern` module can be used to parse HTML and XML pages. Its `pattern.web` submodule can parse the HTML Document Object Model (DOM) and strip HTML tags, among other things. To import it, use

    from pattern import web
The HTML DOM parser in `pattern.web` is easy to use: pass the text of the web page to `Element()` to parse it. For example, to collect the `script` elements from google.com, we would write

    import requests

    r = requests.get("http://www.google.com")
    dom = web.Element(r.text)   # parse the page source into a DOM tree
    dom.by_tag('script')        # list of all script elements
This returns a list of nested `Element` objects with the given tag name (e.g. `script`, `graph`, `meta`, `div`, etc.). The typical use is to loop through this list with a `for` loop, e.g.

    url = 'http://www.imdb.com/search/title'
    params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
    r = requests.get(url, params=params)
    dom = web.Element(r.text)
    for movie in dom.by_tag('td.title'):
        title = movie.by_tag('a')[0].content
        rating = movie.by_tag('span.value')[0].content
        print('%s %s' % (title, rating))
BeautifulSoup
The `BeautifulSoup` module parses HTML and XML pages in much the same way as `pattern`. To import the module, use

    from BeautifulSoup import BeautifulSoup

To extract the movie titles and ratings as in the `pattern` example above, use

    url = 'http://www.imdb.com/search/title'
    params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
    r = requests.get(url, params=params)
    bs = BeautifulSoup(r.text)   # parse the page source with BeautifulSoup
    for movie in bs.findAll('td', 'title'):
        title = movie.find('a').contents[0]
        rating = movie.find('span', 'value').contents[0]
        print('%s %s' % (title, rating))
sgmllib
The `sgmllib` module is a simple SGML (Standard Generalized Markup Language) parser that is mainly used to process HTML code. Its main class is `SGMLParser`, which breaks HTML code into 8 kinds of data and calls a separate method for each:
Name | Definition | Example | SGMLParser method |
---|---|---|---|
start tag | an HTML tag that starts a block | `<html>`, `<head>`, `<body>` | `start_tagname`, `do_tagname` |
end tag | an HTML tag that ends a block | `</html>`, `</head>`, `</body>` | `end_tagname` |
character reference | an escaped character referenced by its decimal code | `&#160;` | `handle_charref` |
entity reference | an HTML entity | `&copy;` | `handle_entityref` |
comment | an HTML comment | text enclosed in `<!-- … -->` | `handle_comment` |
processing instruction | an HTML processing instruction | instruction enclosed in `<? … >` | `handle_pi` |
declaration | an HTML declaration | DOCTYPE enclosed in `<! … >` | `handle_decl` |
text data | a block of text | anything that does not fall into the other 7 categories | `handle_data` |
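
Because `sgmllib` exists only in Python 2, the sketch below swaps in `html.parser.HTMLParser` from the Python 3 standard library, which follows the same callback design: subclass the parser and override one handler per kind of data. The class name `TagLogger` and the sample HTML string are made up for illustration, and the handler names differ slightly from those of `SGMLParser` (for instance `handle_starttag` rather than `start_tagname`).

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical parser that records each piece of data it is handed."""

    def __init__(self):
        # convert_charrefs=False keeps entity and character references
        # as separate events instead of folding them into the text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start tag', tag))

    def handle_endtag(self, tag):
        self.events.append(('end tag', tag))

    def handle_charref(self, name):
        self.events.append(('character reference', name))

    def handle_entityref(self, name):
        self.events.append(('entity reference', name))

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def handle_pi(self, data):
        self.events.append(('processing instruction', data))

    def handle_decl(self, decl):
        self.events.append(('declaration', decl))

    def handle_data(self, data):
        self.events.append(('text data', data))


# Made-up sample input covering several of the categories in the table.
sample = '<!DOCTYPE html><p>caf&#233; &copy; 2012<!-- note --></p>'
parser = TagLogger()
parser.feed(sample)
parser.close()

for kind, value in parser.events:
    print('%-22s %r' % (kind, value))
```

Each handler receives only the piece of the document that belongs to its category, so cleaning code can keep the text data and ignore everything else.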
Modules: parsing JSON objects
json
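The `json` module can be used to parse JSON (JavaScript Object Notation) objects into Python dictionaries.

    import json

The `json.loads()` function parses a single JSON string. To parse a file that holds one JSON object per line, apply it to each line:

    path = 'myjsonfile.txt'
    with open(path) as f:
        mydict = [json.loads(line) for line in f]   # list with one dict per line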
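
As a minimal sketch of line-by-line parsing that runs without a file on disk, the example below feeds `json.loads()` a list of strings. The field names and values are made up for illustration.

```python
import json

# Hypothetical sample data: one JSON object per line, as in a log file.
lines = [
    '{"title": "Casablanca", "year": 1942}',
    '{"title": "Vertigo", "year": 1958}',
]

# json.loads() turns each JSON string into a Python dict.
records = [json.loads(line) for line in lines]

for record in records:
    print(record['title'], record['year'])

# json.dumps() goes the other way, from a Python object to a JSON string.
print(json.dumps(records[0], sort_keys=True))
```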
Modules: regular expressions and string manipulation
re
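The `re` module works with regular expressions. The `sub()` function searches a string and replaces every substring that matches the regular expression. Other commonly used functions include `search()` and `compile()`.

    import re

    re.sub(regex, replacementword, stringtosearch)   # replace every match
    re.search(regex, stringtosearch)                 # find the first match, or None
    re.compile(regex)                                # build a reusable pattern object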
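
A short sketch of the three functions on a made-up sentence; the text and the phone-number pattern are assumptions chosen only for illustration.

```python
import re

text = 'Call 555-1234 or 555-9876 before 5 pm.'

# re.sub(): replace every match of the pattern with the replacement string.
masked = re.sub(r'\d{3}-\d{4}', 'XXX-XXXX', text)
print(masked)

# re.search(): return a match object for the first match, or None.
match = re.search(r'\d{3}-\d{4}', text)
if match:
    print(match.group())

# re.compile(): build a pattern object that can be reused many times.
phone = re.compile(r'\d{3}-\d{4}')
numbers = phone.findall(text)
print(numbers)
```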
Useful regular expressions in Python
Expression | Interpretation |
---|---|
^ | matches the beginning of a string |
$ | matches the end of a string |
\b | matches a word boundary |
\d | matches any numeric digit |
\D | matches any character that is not a digit |
x? | matches an optional x character |
x* | matches x zero or more times |
x+ | matches x one or more times |
x{n,m} | matches x at least n times, but not more than m times |
(a\|b\|c) | matches either a, b or c |
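
The expressions in the table can be tried out with a few one-liners. This is only a sketch; the sample strings are made up.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
starts = bool(re.search(r'^data', 'data cleaning'))
ends = bool(re.search(r'data$', 'data cleaning'))

# \b restricts the match to whole words: 'cat' but not 'concatenate'.
words = re.findall(r'\bcat\b', 'cat concatenate cat')

# \d+ collects runs of digits; \D+ collects everything in between.
digits = re.findall(r'\d+', 'room 12, floor 3')
non_digits = re.findall(r'\D+', 'room 12, floor 3')

# x? makes a character optional, so both spellings match.
spellings = re.findall(r'colou?r', 'color colour')

# x{n,m} bounds the repetition count (no space after the comma).
runs = re.findall(r'a{2,3}', 'a aa aaa aaaa')

# (a|b|c) matches any one of the alternatives.
pets = re.findall(r'(cat|dog|bird)', 'a dog chased a cat')

print(starts, ends)
print(words, digits, non_digits)
print(spellings, runs, pets)
```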
string
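`string.punctuation`: a constant (not a function) holding the ASCII punctuation characters
`str.strip()`: removes leading and trailing whitespace, or the characters passed as an argument
`str.replace()`: replaces every occurrence of one substring with another
`str.translate()`: maps or deletes characters according to a translation table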
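
A minimal sketch of how these pieces can be combined to clean a string, assuming Python 3 (where `str.maketrans()` builds the translation table). The sample sentence is made up.

```python
import string

raw = "  Hello, World! It's clean-up time...  "

# strip() removes leading and trailing whitespace.
trimmed = raw.strip()

# replace() swaps every occurrence of one substring for another.
no_dashes = trimmed.replace('-', ' ')

# translate() with a deletion table drops every character listed
# in the string.punctuation constant.
table = str.maketrans('', '', string.punctuation)
cleaned = no_dashes.translate(table)

print(cleaned)
```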
collections
The `collections` module contains container classes that extend Python's built-in data types (`list`, `tuple`, `dict`, etc.).
collections class | Interpretation |
---|---|
namedtuple() | a class similar to a tuple with named fields |
deque | a class similar to a list that can quickly append and pop (on either end) |
Counter | a subclass of dict that can be used for counting hashable objects |
OrderedDict | a subclass of dict that remembers the order entries were added |
defaultdict | a subclass of dict that supplies a default value for missing keys |
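
The sketch below exercises each class from the table once. The names `Movie`, `queue`, `groups` and the sample values are made up for illustration.

```python
from collections import Counter, OrderedDict, defaultdict, deque, namedtuple

# namedtuple: a tuple whose fields can be read by name.
Movie = namedtuple('Movie', ['title', 'year'])
film = Movie(title='Vertigo', year=1958)

# deque: fast appends and pops at both ends.
queue = deque([2, 3])
queue.appendleft(1)
queue.append(4)
first = queue.popleft()

# Counter: count hashable objects (here, the letters of a word).
counts = Counter('mississippi')

# OrderedDict: remembers the order in which keys were added.
ordered = OrderedDict()
ordered['b'] = 1
ordered['a'] = 2

# defaultdict: a missing key gets a default value from the factory (list here).
groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)

print(film.title, film.year)
print(first, list(queue))
print(counts['s'], counts['p'])
print(list(ordered.keys()))
print(dict(groups))
```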
datetime
The `datetime` module works with dates and times. The `now()` function returns the current date and time as a `datetime` object. Some of its attributes include `year`, `month`, `day`, `hour`, `minute`, and `second`.

    from datetime import datetime

    now = datetime.now()
    print('%s/%s/%s' % (now.month, now.day, now.year))
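
Beyond reading the attributes one by one, a `datetime` object can be formatted and parsed with format codes. The sketch below uses a fixed, made-up timestamp so the output is reproducible; `datetime.now()` would return the current date and time instead.

```python
from datetime import datetime

# A fixed, hypothetical timestamp standing in for datetime.now().
moment = datetime(2012, 7, 4, 14, 30, 0)

# Individual attributes.
print(moment.year, moment.month, moment.day)

# strftime() formats the timestamp with format codes.
stamp = moment.strftime('%m/%d/%Y %H:%M:%S')
print(stamp)

# strptime() parses a string back into a datetime object.
parsed = datetime.strptime('07/04/2012', '%m/%d/%Y')

# Subtracting two datetime objects gives a timedelta.
elapsed = moment - parsed
print(elapsed)
```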