
pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.


add one row in a pandas.DataFrame

I understand that pandas is designed to load a fully populated DataFrame, but I need to create an empty DataFrame and then add rows one by one. What is the best way to do this?

I successfully created an empty DataFrame with:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems very odd :-/ (and it fails when adding a string value).

How can I add a new row to my DataFrame (with columns of different types)?
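A minimal sketch of one common approach, using DataFrame.loc with the next integer label to append a row (the row values here are made up for illustration):

import pandas as pd

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))

# assigning to a new .loc label appends a row; mixed types (str, float) are fine
res.loc[len(res)] = ['foo', 10.0, 20.0]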


Source: (StackOverflow)

how to get row count of pandas dataframe?

I am trying to get the number of rows of the DataFrame df, but both code snippets give me the error: TypeError: unsupported operand type(s) for +: 'instancemethod' and 'int'

total_rows = df.count
print total_rows +1

total_rows = df['First_columnn_label'].count
print total_rows +1

I'd be grateful for any suggestions as to what I'm doing wrong.

EDIT: According to the answer given by root, the best (fastest) way to check the length of df is to call:

len(df.index)
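For context, the TypeError above comes from referencing count without calling it, so total_rows is bound to the method object itself rather than a number. A short sketch of working alternatives (note that DataFrame.count() returns per-column non-null counts, not a single number):

total_rows = len(df.index)                      # fast row count
col_count = df['First_columnn_label'].count()   # non-null count for one column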

Source: (StackOverflow)

Pandas writing dataframe to CSV file

I have a dataframe in pandas which I would like to write to a CSV file. I am doing this using:

df.to_csv('out.csv')

And getting the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in range(128)

Is there any way to get around this easily (i.e. given that I have unicode characters in my data frame)? And is there a way to write to a tab-delimited file instead of a CSV, e.g. with a 'to-tab' method (which I don't think exists)?
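A minimal sketch that addresses both points at once, using keyword arguments that to_csv accepts (encoding for the unicode error, sep for a tab-delimited file), applied to the df from the question:

df.to_csv('out.csv', sep='\t', encoding='utf-8')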


Source: (StackOverflow)

Renaming columns in pandas

I have a data table in pandas whose column labels I need to replace with edited versions.

I'd like to change the column names in a data table A where the original column names are:

['$a', '$b', '$c', '$d', '$e'] 

to

['a', 'b', 'c', 'd', 'e'].

I have the edited column names stored in a list, but I don't know how to replace the existing ones.
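A minimal sketch of two common options, using the frame A and the labels from the question (new_names is an assumed variable name for the list mentioned above):

new_names = ['a', 'b', 'c', 'd', 'e']

# Option 1: replace all labels at once
A.columns = new_names

# Option 2: rename via an explicit old-to-new mapping
A = A.rename(columns={'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'})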


Source: (StackOverflow)

Why are pandas merges in python faster than data.table merges in R?

I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
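For reference, a small sketch of the pandas counterparts of those data.table calls (the toy frames X and Y are made up here):

import pandas as pd

X = pd.DataFrame({'key': [1, 2, 3], 'x': ['a', 'b', 'c']})
Y = pd.DataFrame({'key': [2, 3, 4], 'y': ['d', 'e', 'f']})

inner = pd.merge(X, Y, on='key', how='inner')  # like merge(X, Y, all=FALSE)
outer = pd.merge(X, Y, on='key', how='outer')  # like merge(X, Y, all=TRUE)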

Comparison

Here's the R code and the Python code used to benchmark the various packages.


Source: (StackOverflow)

Adding new column to existing DataFrame in python pandas

I have a DataFrame with named columns and rows indexed with non-continuous numbers, as produced by this code:

import numpy as np
from pandas import DataFrame, Series

df1 = DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1.applymap(lambda x: x < -0.7)
df1 = df1[~mask.any(axis=1)]  # drop rows containing any value below -0.7
sLength = len(df1['a'])
e = Series(np.random.randn(sLength))

I would like to add a new column 'e' to the existing df without changing anything else in it. (The Series always has the same length as the DataFrame.) I have tried different versions of join, append, and merge, but none of them gives what I want; most raise an error.

The Series and the df are already given; the code above only illustrates the setup.

I am sure there is some easy way to do this, but I can't figure it out.
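A minimal sketch of one way to do it: align the new Series to the filtered frame's index before assigning, since plain assignment aligns on index labels and would leave NaN for the rows dropped by the mask:

df1['e'] = Series(e.values, index=df1.index)
# or, bypassing alignment entirely:
df1['e'] = e.values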


Source: (StackOverflow)

Efficiently applying a function to a grouped pandas DataFrame in parallel

I often need to apply a function to the groups of a very large DataFrame (of mixed data types) and would like to take advantage of multiple cores.

I can create an iterator from the groups and use the multiprocessing module, but it is not efficient because every group and the results of the function must be pickled for messaging between processes.

Is there any way to avoid the pickling, or even avoid copying the DataFrame completely? It looks like the shared memory functions of the multiprocessing module are limited to NumPy arrays. Are there any other options?
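For concreteness, a minimal sketch of the pickling-based approach described above (the frame and the per-group function are made up; each (key, group) pair is pickled on its way to a worker, which is exactly the overhead in question):

import multiprocessing as mp
import numpy as np
import pandas as pd

def process_group(item):
    key, group = item                  # arrives in the worker unpickled
    return key, group['value'].sum()   # stand-in for the real per-group work

if __name__ == '__main__':
    df = pd.DataFrame({'key': np.random.randint(0, 10, 100000),
                       'value': np.random.randn(100000)})
    pool = mp.Pool()
    results = pool.map(process_group, df.groupby('key'))
    pool.close()
    pool.join()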


Source: (StackOverflow)

How to change the order of DataFrame columns?

I have the following DataFrame (df):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5))

I add more column(s) by assignment:

df['mean'] = df.mean(1)

How can I move the column mean to the front, i.e. set it as the first column, leaving the order of the other columns untouched?
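A minimal sketch of one common way: index with a rearranged column list.

cols = df.columns.tolist()
cols = ['mean'] + [c for c in cols if c != 'mean']  # put 'mean' first
df = df[cols]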


Source: (StackOverflow)

How to filter the DataFrame rows of pandas by "within"/"in"?

I have a pandas DataFrame 'rpt':

rpt
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231')
Data columns:
STK_ID                    47518  non-null values
STK_Name                  47518  non-null values
RPT_Date                  47518  non-null values
sales                     47518  non-null values

I can filter the rows whose stock id is '600809' like this: rpt[rpt['STK_ID'] == '600809']

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 25 entries, ('600809', '20120331') to ('600809', '20060331')
Data columns:
STK_ID                    25  non-null values
STK_Name                  25  non-null values
RPT_Date                  25  non-null values
sales                     25  non-null values

and I want to get all the rows for a list of stocks together, such as ['600809', '600141', '600329']; that is, I want syntax like this:

stk_list = ['600809','600141','600329']

rst = rpt[rpt['STK_ID'] in stk_list] ### this does not work in pandas

Since pandas does not accept the command above, how can I achieve this?
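A minimal sketch using Series.isin, which vectorizes the membership test:

stk_list = ['600809', '600141', '600329']
rst = rpt[rpt['STK_ID'].isin(stk_list)]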


Source: (StackOverflow)

What is the most efficient way to loop through dataframes with pandas?

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....

I then do the following:

#!/usr/bin/env python
from pandas import *

# parse Date as the index so df.index[i] below yields the date
df = read_csv('table.csv', index_col='Date')

for i, row in enumerate(df.values):
    date = df.index[i]
    open_, high, low, close, volume, adjclose = row  # 'open' would shadow the builtin
    # now perform analysis on open/close based on date, etc.

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that also retrieves the index (possibly through a generator, to be memory efficient)? df.iteritems unfortunately only iterates column by column.
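For reference, a minimal sketch of the built-in row iterator that pairs each row with its index; itertuples yields one tuple per row with the index first (recent pandas versions return namedtuples, so fields are also reachable by name):

for row in df.itertuples():
    date = row[0]     # the index value
    values = row[1:]  # column values in order
    # perform analysis on the values for this date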


Source: (StackOverflow)

iterating row by row through a pandas dataframe [duplicate]

Possible Duplicate:
What is the most efficient way to loop through dataframes with pandas?

I'm looking to iterate row by row through a pandas DataFrame. The way I'm doing it so far is as follows:

for i in df.index:
    do_something(df.ix[i])

Is there a more performant and/or more idiomatic way to do this? I know about apply, but sometimes it's more convenient to use a for loop. Thanks in advance.
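A minimal sketch of the usual alternative, DataFrame.iterrows, which yields (index, row) pairs without manual indexing:

for i, row in df.iterrows():
    do_something(row)  # row is a Series holding one row's values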


Source: (StackOverflow)

Pandas: change data type of columns

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to a DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type of each one? Ideally I would like to do this dynamically, because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
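A minimal sketch of one dynamic approach, assuming a pandas version that provides pd.to_numeric: try to convert each column and leave the ones that are not fully numeric untouched.

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])  # floats/ints where possible
    except (ValueError, TypeError):
        pass                              # keep non-numeric columns as-is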


Source: (StackOverflow)

How to drop rows of Pandas dataframe whose value of certain column is NaN

I have a df:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

Then I just want the records whose EPS is not NaN; that is, df.drop(....) should return the DataFrame below:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How can I do that?
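A minimal sketch of two equivalent ways, both standard pandas calls:

df = df.dropna(subset=['EPS'])
# or, via boolean indexing:
df = df[df['EPS'].notnull()]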


Source: (StackOverflow)

use a list of values to select rows from a pandas dataframe [duplicate]

Possible Duplicate:
how to filter the dataframe rows of pandas by “within”/“in”?

Let's say I have the following pandas DataFrame:

df = DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df

     A   B
0    5   1
1    6   2
2    3   3
3    4   5

I can subset based on a specific value:

x = df[df['A'] == 3]
x

     A   B
2    3   3

But how can I subset based on a list of values? Something like this:

list_of_values = [3,6]

y = df[df['A'] in list_of_values]
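As in the linked duplicate, isin handles this; a short sketch with the frame above:

list_of_values = [3, 6]
y = df[df['A'].isin(list_of_values)]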

Source: (StackOverflow)

Python pandas, widen output display?

Is there a way to widen the display of output in either interactive or script-execution mode?

Specifically, I am using the describe() function on a pandas DataFrame. When the DataFrame is 5 columns (labels) wide, I get the descriptive statistics I want. However, if the DataFrame has any more columns, the statistics are suppressed and something like this is returned:

Index: 8 entries, count to max
Data columns:
x1          8  non-null values
x2          8  non-null values
x3          8  non-null values
x4          8  non-null values
x5          8  non-null values
x6          8  non-null values
x7          8  non-null values

The "8" value is given whether there are 6 or 7 columns. What does the "8" refer to?

I have already tried dragging the IDLE window larger, as well as increasing the "Configure IDLE" width options, to no avail.

My purpose in using pandas and describe() is to avoid using a second program like Stata for basic data manipulation and investigation.
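For reference, a sketch using the display options that later pandas versions expose through pd.set_option (pandas 0.8, as used here, had a similar set_printoptions mechanism):

import pandas as pd

pd.set_option('display.width', 200)       # allow wider rows before wrapping
pd.set_option('display.max_columns', 20)  # show more columns before truncating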

Thanks.

Python/IDLE 2.7.3
Pandas 0.8.1
Notepad++ 6.1.4 (UNICODE)
Windows Vista SP2


Source: (StackOverflow)