Python pandas

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

In 2008, developer Wes McKinney started developing Pandas when he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mainly used for data munging and preparation, and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.
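A minimal sketch of these steps, assuming a tiny made-up data set. The CSV text, the column names and the values are hypothetical and only stand in for a real file or database table; the modeling step is left out here.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file or database table.
raw_csv = io.StringIO(
    "city,year,sales\n"
    "Pune,2022,100\n"
    "Pune,2023,\n"
    "Delhi,2022,80\n"
    "Delhi,2023,120\n"
)

# Load: read the CSV text into a DataFrame.
df = pd.read_csv(raw_csv)

# Prepare: drop rows whose sales value is missing.
clean = df.dropna(subset=["sales"])

# Manipulate: derive a new column from an existing one.
clean = clean.assign(sales_thousands=clean["sales"] / 1000)

# Analyze: average sales per city.
mean_sales = clean.groupby("city")["sales"].mean()
print(mean_sales)
```

The same calls apply when the data comes from a real file; only the argument passed to `pd.read_csv` changes.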

Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

Key Features of Pandas
1-Fast and efficient DataFrame object with default and customized indexing.
2-Tools for loading data into in-memory data objects from different file formats.
3-Data alignment and integrated handling of missing data.
4-Reshaping and pivoting of data sets.
5-Label-based slicing, indexing and subsetting of large data sets.
6-Columns from a data structure can be deleted or inserted.
7-Group by data for aggregation and transformations.
8-High-performance merging and joining of data.
9-Time series functionality.
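Two of these features, data alignment with missing-data handling and label-based slicing, can be illustrated with a short sketch. The labels and values below are made up for the example.

```python
import pandas as pd

# Two Series with partly overlapping labels (illustrative values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels; "x" has no partner in b, so the result is NaN there.
total = a + b

# Missing values can be filled explicitly.
filled = total.fillna(0.0)

# Label-based slicing with .loc includes both endpoints.
subset = a.loc["x":"y"]

print(total)
print(filled)
print(subset)
```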


The standard Python distribution doesn't come bundled with the Pandas module. A lightweight way to get it is to install Pandas using the popular Python package installer, pip.

By Command
==========

pip install pandas
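One simple way to check that the installation worked is to import the library and print its version. This is only a quick sanity check; the version printed depends on your environment.

```python
import pandas as pd

# Print the installed version to confirm that the import works.
print(pd.__version__)
```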

Pandas has historically dealt with the following three data structures:

Series
====
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
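As a small illustration of label-based and integer-based indexing, here is a sketch with made-up values in which one label is deliberately repeated.

```python
import pandas as pd

# Labels are hashable values; "a" is repeated on purpose.
ser = pd.Series([10, 20, 30], index=["a", "b", "a"])

by_label = ser.loc["b"]     # label-based lookup of a unique label
by_position = ser.iloc[0]   # integer (position-based) lookup
duplicates = ser.loc["a"]   # a repeated label returns a Series

print(ser.index)
print(by_label, by_position)
print(duplicates)
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup unambiguous.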

In the real world, a Pandas Series is usually created by loading a data set from existing storage, such as a SQL database, a CSV file, or an Excel file. A Series can also be created from a list, a dictionary, a scalar value, and so on.

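import numpy as np
import pandas as pd

# Create an empty Series; an explicit dtype avoids the default-dtype warning in some pandas versions
ser = pd.Series(dtype="float64")
print(ser)

# Create a Series from a simple NumPy array of characters
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)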
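
Besides a NumPy array, a Series can be built from a list, a dictionary, or a scalar value, as mentioned earlier. A short sketch with made-up values:

```python
import pandas as pd

# From a list: a default integer index 0..n-1 is generated.
from_list = pd.Series([1, 2, 3])

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1.5, "b": 2.5})

# From a scalar: the value is repeated for every label in the given index.
from_scalar = pd.Series(7, index=["x", "y", "z"])

print(from_list)
print(from_dict)
print(from_scalar)
```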

DataFrame
========
Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
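The three components can be inspected directly on a DataFrame. The following sketch uses made-up names, values and row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [31, 45]},
    index=["r1", "r2"],
)

print(df.index)       # row labels
print(df.columns)     # column labels
print(df.to_numpy())  # the underlying data as a NumPy array
print(df.shape)       # (number of rows, number of columns)
```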

Creating a DataFrame
In the real world, a Pandas DataFrame is usually created by loading a data set from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from a list, a dictionary, a list of dictionaries, and so on.

Example:

import pandas as pd

# Calling the DataFrame constructor with no arguments creates an empty DataFrame
df = pd.DataFrame()
print(df)

# A list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# Calling the DataFrame constructor on the list creates a single column
df = pd.DataFrame(lst)
print(df)
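
A DataFrame can also be built from a dictionary or from a list of dictionaries, as noted above. A minimal sketch with made-up records:

```python
import pandas as pd

# From a dictionary of lists: keys become column names.
from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# From a list of dictionaries: each dictionary becomes a row,
# and keys missing from a row are filled with NaN.
records = [{"name": "Ann", "age": 31}, {"name": "Cid"}]
from_records = pd.DataFrame(records)

print(from_dict)
print(from_records)
```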
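
Panel
=====
Panel was a three-dimensional data container. It was deprecated and then removed in pandas 0.25, so it is not available in current versions.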


Advantages
· Fast and efficient for manipulating and analyzing data.
· Data can be loaded from different file formats.
· Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
· Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects.
· Data set merging and joining.
· Flexible reshaping and pivoting of data sets.
· Provides time-series functionality.
· Powerful group by functionality for performing split-apply-combine operations on data sets.
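
Several of these points (size mutability, merging, and split-apply-combine) can be sketched together. The store names, regions and figures below are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame(
    {"store": ["A", "A", "B"], "units": [3, 5, 4], "price": [2.0, 2.0, 3.0]}
)
stores = pd.DataFrame({"store": ["A", "B"], "region": ["North", "South"]})

# Size mutability: insert a derived column, then delete one.
sales["revenue"] = sales["units"] * sales["price"]
sales = sales.drop(columns=["price"])

# Merging: join the two tables on the shared "store" column.
merged = sales.merge(stores, on="store", how="left")

# Split-apply-combine: total revenue per region.
revenue_by_region = merged.groupby("region")["revenue"].sum()
print(revenue_by_region)
```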

