Pandas

Pandas is a module in Python for working with data structures. The two main objects from Pandas are the Series and DataFrame. These object scan easily subset, aggregate and reshape the data using the array-computing features of NumPy.

Import Pandas

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

Series

The Series is a one-dimensional array-like object with associated data labels called the index. The values and index of the Series can be accessed using attributes of the object. Similar to strings and tuples, the index of a Series is immutable (same is true for a DataFrame later on). > x = Series([5, 10, 15, 20]) > x.values > x.index

The default index for a Series is the set of integers starting at 0 through the length of the data (e.g. $N - 1$). To define a new index, use index in the Series definition. These indexes can be used to select a subset of the Series. Other subsetting such as using boolean arrays also work.

x = pd.Series([5, 10, 15, 20], index = ['Holly', 'Bart', 'Josh', 'Karen'])
x[['Josh', 'Holly']]
x[x > 10 ]

Both the Series itself and its index have a name attribute.

x.name = 'age'
x.index.name = 'firstName'

The index of a Series can also re-ordered

x.index = ['Josh', 'Holly', 'Bart', 'Karen']

Series and dictionaries

This object is very similar to an ordered dict because of the mapping between the index and values. In fact you can use the in operator to check if an index exists in the object You can also directly pass a dict to a Series.

'Bart' in x
y = Series[{'Holly': 5, 'Bart': 10, 'Alan': 27, 'Beau': 5}]

Note: The resulting Series will have the dict’s keys in a sorted order.

Series and missing data

When creating a Series, you can have missing data (i.e. NaN values). to testing for missing data, use pd.isnull and pd.notnull

pd.isnull(x)
pd.notnull(x)

Arithmetic methods for a Series

One important feature of working with pandas is the flexibility of working with different objects with different indexes. When you add the objects (with possibly different sizes and shapes) together, the resulting objects will the union of all the index pairs. For example, when combining

x + y

this returns a new Series with the union of all the indexes and values in x and y.

DataFrame

The DataFrame is an extension of the Series because instead of just being one-dimensional, it organizes data into a column structure with row and column labels. This allows the user to have a collection of columns of data with different types. The DataFrame has a both row and column index. The column names can be found using the attribute columns. The values and index can be….

data = {'height' : 5.6, 7.0, 4.9, 6.7, 5.2, 5.5, 6.1, 5.4], 
       'age' : [15, 21, 15, 20, 22, 41, 18, 38]}
z = DataFrame(data)
z.columns  		# column names
z.values			# values
z.index			# index
z.ix				# indexing field

To extract a specific column, you can use [ ] (brackets) or attribute notation. If you specify a sequence of columns, the DataFrame will return the columns you ask for. If you pass a column that isn’t in your data set, then it will return NaN values.

z.height
z['height']
z = DataFrame(data, columns = ['height', 'age', 'weight'])

Altering the value of a column can be done too.

z['weight'] = 180		# assigns 180 to all the values in the `weight` column
z['footSize'] = 7			# assigning values to a column that doesn't exist creates a new column
z.index = ['Holly', 'Bart', 'Josh', 'Karen', 'Tom', 'Doug', 'Sophie'']		# assigns the index values

Both the DataFrame itself, its index and its columns have a name attribute.

z.name = 'Team1'
z.index.name = 'firstName'
z.columns.name

Ways to create a DataFrame [Mostly taken from Data analysis for Python]

Approach	Details
dict of arrays, lists, or tuples	Each group of elements becomes a column (all groups must be the same length)
dict of dicts	If you have a nested `dict` of `dict`s then when you pass it to a `DataFrame`, the outer `dict` keys will be the columns and the inner keys will be the rows.
dict of Series	Each `value` becomes a column
2D ndarray	Use the `numpy` `ndarray` with optional row and column labels
Another DataFrame	Combine `DataFrame`s using their `index`es
list of dicts or Series	Each item in the `list` becomes a row in the `DataFrame`. Union of the `dict` `keys` (or `Series` `index`es) is the column names
list of lists or tuples	Similar idea to the `ndarray`

Index methods for a DataFrame

There are a set of methods that specifically operate on the index of a DataFrame (i.e. z.index). Note: These methods do not alter the index, but rather creates a new index that has been modified using one of the following methods.

Method	Description
`unique`	Return the unique values in the index
`is_unique`	Returns a `Bool` if index has no duplicates
`insert(n, elem)`	Insert `elem` at position `n`
`delete(n)`	Delete the `value` at position `n`
`drop(elem)`	Drop the `elem` value
`union`	Return the union of indexes
`intersection`	Return the intersection of indexes
`diff`
`append`	Append a new index object

Methods for Series and DataFrame

These are methods that apply to both a Series and a DataFrame. Typically when change the index, this will only apply to the row indexes of a DataFrame, but there is usually an option to change the column index as well.

Method	Example	Description
`T`	x.T, z.T	Transposes object
`reindex`	x.reindex([new index order], columns)	Create new object conformed to a new index. Check out `fill` or `bill` for interpolation options. Use arg `columns` to reindex columns.
`ix`	z.ix(a, b)	Subset the ‘a’ rows and the ‘b’ columns. Also another form of reindexing.
`drop`	z.drop(‘Jack’)	Drop one ore more rows or columns. Default is the row-index, but use `axis = 1` to drop from columns.

z.reindex(index = , method = 'ffill', columns = ['age'])
z.ix[['Bart', Holly'], :)
z.drop('age', axis = 1)

Note: You can index and slice similar to working with an ndarray, except you can use the indexes rather than just integers. The only difference is slicing is inclusive (not exclusive which is normal Python). You can also use Bool to subset the Series or DataFrame.

z['Josh': 'Doug']
z[:3]			# rows returned: Holly, Bart and Josh
z[z['height'] > 5.5] # rows returned: all heights > 5.5

Arithmetic methods for two or more DataFrames

Similar to adding the values in two or more Series objects, you can add DataFrame objects with different indices and columns. If indices don’t overlap, then NaN are returned for those values.

a = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Holly', 'Bart', 'Jack'])
b = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Bart', 'Karen', 'Jack', 'Darren'])
a + b

If you want to fill in the value with something else beside NaN, use the method add with the fill_value argument.

a.add(b, fill_value=0)

Method	Description
`a.add(b)`	Add two DataFrames
`a.sub(b)`	Subtract two DataFrames
`a.mul(b)`	Multiple two DataFrames
`a.div(b)`	Divide two DataFrames

Arithmetic methods for Series and DataFrames

Similar to broadcasting on multiple ndarrays, arithmetic methods between a Series and a DataFrame is also common. The default is the index of the Series will be matched to the columns of the DataFrame. If an index is not found in either the Series of the columns of the DataFrame, then the objects will be re-indexed to form the union. To match on the rows of the DataFrame, use the method sub().

b = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Bart', 'Karen', 'Jack', 'Darren'])
b.ix[0] 		# first row of DataFrame
b - b.ix[0]
b.sub(series, axis = 0)

Applying NumPy functions to pandas objects

The functions in NumPy can be applied to a Series or DataFrame from pandas. For example, finding the dimension of the object using shape. Another example is the element-wise array methods (ufuncs) also work on pandas objects.

np.abs(b)
np.square(b)

lambda functions and `applymap`

In addition, we can apply lambda functions and applying user-defined functions (with multiple output)

fun = lambda x: x.max() - x.min()
b.apply(fun, axis = 1)

def f(x): 
    return Series([x.min(), x.max()], index=['min', 'max'])
b.apply(f)

Use applymap to apply element-wise python functions.

Import data from a tab-delimited file to a DataFrame

data = pd.read_csv('myfile.txt', delimiter='\t', names=headernames).dropna()
print "Number of rows: %i" % data.shape[0]
data.head()  # print the first 5 rows