Python join dataframes


Python Pandas - Merging/Joining



Pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

Here, we have used the following parameters −

  • left − A DataFrame object.

  • right − Another DataFrame object.

  • on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.

  • left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  • right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  • left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

  • right_index − Same usage as left_index for the right DataFrame.

  • how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method has been described below.

  • sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False in current pandas releases; sorting adds cost, so leaving it off improves performance substantially in many cases.
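The less common parameters are easier to see in action. The following is a minimal sketch, using two made-up DataFrames (students and scores are illustrative names, not part of the tutorial data), of left_on/right_on and right_index:

```python
import pandas as pd

# Hypothetical frames: the key column has a different name on each side.
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "name": ["Alex", "Amy", "Allen"]})
scores = pd.DataFrame({"sid": [2, 3, 4], "score": [85, 43, 66]})

# left_on / right_on: join on columns whose names differ (inner join by default).
by_column = pd.merge(students, scores, left_on="student_id", right_on="sid")

# right_index=True: match a left column against the index of the right frame.
scores_indexed = scores.set_index("sid")
by_index = pd.merge(students, scores_indexed,
                    left_on="student_id", right_index=True)

print(by_column)
print(by_index)
```

Both calls keep only the students with a matching score; the column-based version retains both key columns, while the index-based version does not add a sid column.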

Let us now create two different DataFrames and perform the merging operations on them.

# import the pandas library
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)

Its output is as follows (column order can vary with the pandas version) −

     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5
    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5

Merge Two DataFrames on a Key

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='id'))

Its output is as follows −

   Name_x  id subject_id_x Name_y subject_id_y
0    Alex   1         sub1  Billy         sub2
1     Amy   2         sub2  Brian         sub4
2   Allen   3         sub4   Bran         sub3
3   Alice   4         sub6  Bryce         sub6
4  Ayoung   5         sub5  Betty         sub5

Merge Two DataFrames on Multiple Keys

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on=['id', 'subject_id']))

Its output is as follows −

   Name_x  id subject_id Name_y
0   Alice   4       sub6  Bryce
1  Ayoung   5       sub5  Betty

Merge Using 'how' Argument

The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right tables, the values in the joined table will be NA.

Here is a summary of the how options and their SQL equivalent names −

Merge Method    SQL Equivalent      Description
left            LEFT OUTER JOIN     Use keys from left object
right           RIGHT OUTER JOIN    Use keys from right object
outer           FULL OUTER JOIN     Use union of keys
inner           INNER JOIN          Use intersection of keys

Left Join

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='subject_id', how='left'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0    Alex     1       sub1    NaN   NaN
1     Amy     2       sub2  Billy   1.0
2   Allen     3       sub4  Brian   2.0
3   Alice     4       sub6  Bryce   4.0
4  Ayoung     5       sub5  Betty   5.0

Right Join

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='subject_id', how='right'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0     Amy   2.0       sub2  Billy     1
1   Allen   3.0       sub4  Brian     2
2   Alice   4.0       sub6  Bryce     4
3  Ayoung   5.0       sub5  Betty     5
4     NaN   NaN       sub3   Bran     3

Outer Join

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, how='outer', on='subject_id'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0    Alex   1.0       sub1    NaN   NaN
1     Amy   2.0       sub2  Billy   1.0
2   Allen   3.0       sub4  Brian   2.0
3   Alice   4.0       sub6  Bryce   4.0
4  Ayoung   5.0       sub5  Betty   5.0
5     NaN   NaN       sub3   Bran   3.0

Inner Join

An inner join keeps only the keys that appear in both DataFrames. (DataFrame.join, by contrast, joins on the index and honors the object on which it is called, so a.join(b) is not equal to b.join(a).)

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(pd.merge(left, right, on='subject_id', how='inner'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0     Amy     2       sub2  Billy     1
1   Allen     3       sub4  Brian     2
2   Alice     4       sub6  Bryce     4
3  Ayoung     5       sub5  Betty     5
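
For comparison, DataFrame.join aligns on the index and defaults to a left join, so the object it is called on decides which rows survive. A minimal sketch, with two small made-up frames (a and b are illustrative, not the tutorial data):

```python
import pandas as pd

# Hypothetical frames that share only the index label "k2".
a = pd.DataFrame({"x": [1, 2, 3]}, index=["k1", "k2", "k3"])
b = pd.DataFrame({"y": [10, 20]}, index=["k2", "k4"])

ab = a.join(b)  # left join on a's index: rows k1, k2, k3
ba = b.join(a)  # left join on b's index: rows k2, k4

print(ab)
print(ba)
```

The two results differ in both row set and column order, which is why a.join(b) and b.join(a) are not interchangeable.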
Source: https://www.tutorialspoint.com/python_pandas/python_pandas_merging_joining.htm

Pandas join: How to Join Pandas DataFrame in Python

The Pandas join() function is especially useful when one DataFrame acts as a lookup table. For example, one DataFrame contains most of the data, and additional data for the same records is present in some other DataFrame.

Pandas join

join() is a Pandas DataFrame method that combines the columns of different DataFrames. It joins columns with another DataFrame either on an index or on a key column.

To join different dataframes in Pandas based on the index or a column key, use the join() method.

In simpler words, the join() function can be defined as joining DataFrames on the fields they have in common. The columns which contain common values and are used for joining are called join keys.

To identify a join key, look for the data fields that the two DataFrames share, that is, the columns that hold the same values in both.

Multiple DataFrame objects can be joined efficiently by index at once by passing a list. Note that join() always places DataFrames side by side, adding columns; stacking DataFrames vertically, row after row, is the job of concat().

Hence it acts as a very convenient way to combine the columns of two differently indexed DataFrames into a single DataFrame based on common attributes. We can also join data by passing a list to it.
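
As a rough illustration of passing a list, the sketch below joins two extra frames onto a base frame in one call. The frames and column names are invented for the example, and all three are indexed by the same kind of label:

```python
import pandas as pd

# Hypothetical frames indexed by roll number.
base = pd.DataFrame({"maths": [98, 85, 43]}, index=["A1", "A2", "A3"])
science = pd.DataFrame({"science": [66, 74]}, index=["A1", "A2"])
art = pd.DataFrame({"art": [70, 55]}, index=["A2", "A3"])

# Passing a list joins every frame on the index in a single call.
combined = base.join([science, art])
print(combined)
```

Rows without a match in one of the listed frames get NaN in that frame's columns, since the default join type is a left join.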

Syntax

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Parameters

Pandas join() function contains six parameters.

other:

It is the DataFrame, Series, or list of DataFrames being passed. Its index should be similar to one of the columns in the calling DataFrame. If a Series is passed, its name attribute must be set; it is used as the column name in the resulting DataFrame.

on:

It is an optional parameter that takes a str, a list of str, or an array-like. It names the column or index level in the calling DataFrame to join against the index of other. If it is omitted, the join is index on index.

One important condition is that if multiple values are given, the other DataFrame must have a MultiIndex.

how:

It specifies how to handle the operation on the two objects. The default value is 'left'. The possible values are 'left', 'right', 'outer' and 'inner'.

  1. left: Uses the calling frame’s index (or column, if on is specified).
  2. right: Uses the other frame’s index.
  3. outer: Forms the union of the calling frame’s index (or column, if on is specified) with the other frame’s index, and sorts it lexicographically.
  4. inner: Forms the intersection of the calling frame’s index (or column, if on is specified) with the other frame’s index, preserving the order of the calling frame.

lsuffix:

It is a string with an empty default value. It is the suffix added to the left frame’s overlapping column names.

rsuffix:

It is a string with an empty default value. It is the suffix added to the right frame’s overlapping column names.

sort:

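It is a boolean value, False by default. When True, the resulting DataFrame is sorted lexicographically by the join key; when False, the order of the join key depends on the join type (how).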
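
The suffix parameters matter as soon as both frames have a column with the same name. A small sketch with invented frames (left and right both carry a score column):

```python
import pandas as pd

# Hypothetical frames with an overlapping column name.
left = pd.DataFrame({"score": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"score": [10, 20]}, index=["a", "b"])

# With the default empty suffixes, overlapping names make join() raise ValueError.
try:
    left.join(right)
    overlap_error = False
except ValueError:
    overlap_error = True

# Supplying suffixes disambiguates the two columns.
joined = left.join(right, lsuffix="_left", rsuffix="_right")
print(joined)
```

Supplying only one of the two suffixes is also enough to avoid the error, since only one of the clashing names has to change.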

Example program on Pandas DataFrame join()

Write a program to show the working of join().

# app.py
import pandas as pd

data_set1 = pd.DataFrame({'Roll_no': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
                          'Marks_in_Maths': ['98', '85', '43', '66', '47', '74']})
print(data_set1)

data_set2 = pd.DataFrame({'Roll_no': ['A1', 'A2', 'A3'],
                          'Marks_in_Science': ['66', '74', '83']})
print(data_set2)

joined_data = data_set1.join(data_set2.set_index('Roll_no'), on="Roll_no")
print(joined_data)

Output

python3 app.py
  Roll_no Marks_in_Maths
0      A1             98
1      A2             85
2      A3             43
3      A4             66
4      A5             47
5      A6             74
  Roll_no Marks_in_Science
0      A1               66
1      A2               74
2      A3               83
  Roll_no Marks_in_Maths Marks_in_Science
0      A1             98               66
1      A2             85               74
2      A3             43               83
3      A4             66              NaN
4      A5             47              NaN
5      A6             74              NaN

Here we can see that we have created two DataFrames, the first holding six roll numbers and the marks in maths for all six students.

The second DataFrame holds the marks in science for the students with roll numbers A1 to A3. Hence the resultant DataFrame contains the joined values of both DataFrames, with the values that were not supplied set to NaN (marks in science for roll numbers A4 to A6).

That’s it for the Pandas join() method.


Source: https://appdividend.com/2020/05/07/pandas-dataframe-join-example-in-python/

How to combine two dataframe in Python – Pandas?

Prerequisites: Pandas

In many real-life situations, the data that we want to use comes in multiple files. We often need to combine these files into a single DataFrame to analyze the data. Pandas provides facilities for easily combining Series or DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations. In addition, pandas also provides utilities to compare two Series or DataFrame objects and summarize their differences.


Concatenating DataFrames 

The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
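
A minimal sketch of both directions, using two small made-up DataFrames (df1 and df2 are illustrative names):

```python
import pandas as pd

# Hypothetical frames with identical columns.
df1 = pd.DataFrame({"id": [1, 2], "name": ["Alex", "Amy"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Allen", "Alice"]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1.
rows = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index.
cols = pd.concat([df1, df2], axis=1)

print(rows)
print(cols)
```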


Joining DataFrames 

When we concatenated our DataFrames we simply added them to each other i.e. stacked them either vertically or side by side. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

Note: This process of joining tables is similar to what we do with tables in an SQL database.

When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This can be done in the following two ways :

  • Take the union of them all, join=’outer’. This is the default option as it results in zero information loss.
  • Take the intersection, join=’inner’.
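
The two options can be compared directly. This sketch assumes two invented frames that share only the id column:

```python
import pandas as pd

# Hypothetical frames: only "id" is common to both.
df1 = pd.DataFrame({"id": [1, 2], "name": ["Alex", "Amy"]})
df2 = pd.DataFrame({"id": [3, 4], "city": ["Oslo", "Lima"]})

# join='outer' keeps the union of the columns and fills the gaps with NaN.
outer = pd.concat([df1, df2], join="outer", ignore_index=True)

# join='inner' keeps only the columns present in every frame.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

print(outer)
print(inner)
```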


Concatenating using append

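A useful shortcut to concat() was the append() instance method on Series and DataFrame. These methods actually predated concat. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so current code should call concat() instead.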


Note: append() could take multiple objects to concatenate; with concat(), pass them all in one list.
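
Because append() is no longer available in recent pandas releases, the sketch below shows the concat() call that covers the same use case. The three frames are made up for the example:

```python
import pandas as pd

# Hypothetical frames to be stacked.
df1 = pd.DataFrame({"id": [1, 2]})
df2 = pd.DataFrame({"id": [3]})
df3 = pd.DataFrame({"id": [4, 5]})

# Stands in for the older df1.append([df2, df3], ignore_index=True).
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(combined)
```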

Source: https://www.geeksforgeeks.org/how-to-combine-two-dataframe-in-python-pandas/

In any real world data science situation with Python, you’ll be about 10 minutes in when you’ll need to merge or join Pandas Dataframes together to form your analysis dataset. Merging and joining dataframes is a core process that any aspiring data analyst will need to master. This blog post addresses the process of merging datasets, that is, joining two datasets together based on common columns between them.

If you’d like to work through the tutorial yourself, I’m using a Jupyter notebook setup with Python 3.5.2 from Anaconda, and I’ve posted the code on GitHub here. I’ve included the sample datasets in the GitHub repository.


Example data

For this post, I have taken some real data from the KillBiller application and some downloaded data, contained in three CSV files:

  • user_usage.csv – A first dataset containing users’ monthly mobile usage statistics
  • user_device.csv – A second dataset containing details of an individual “use” of the system, with dates and device information.
  • android_devices.csv – A third dataset with device and manufacturer data, which lists all Android devices and their model code, obtained from Google here.

We can load these CSV files as Pandas DataFrames using the read_csv command, and examine the contents using the DataFrame head() command.

There are linking attributes between the sample datasets that are important to note – “use_id” is shared between the user_usage and user_device, and the “device” column of user_device and “Model” column of the devices dataset contain common codes.

Sample problem

We would like to determine if the usage patterns for users differ between different devices. For example, do users using Samsung devices use more call minutes than those using LG devices? This is a toy problem given the small sample size in these datasets, but is a perfect example of where merges are required.

We want to form a single dataframe with columns for user usage figures (calls per month, sms per month etc) and also columns with device information (model, manufacturer, etc). We will need to “merge” (or “join”) our sample datasets together into one single dataset for analysis.

Merging DataFrames

“Merging” two datasets is the process of bringing two datasets together into one, and aligning the rows from each based on common attributes or columns.

The words “merge” and “join” are used relatively interchangeably in Pandas and other languages, namely SQL and R. In Pandas, there are separate “merge” and “join” functions, both of which do similar things.

In this example scenario, we will need to perform three steps:

  1. For each row in the user_usage dataset – make a new column that contains the “device” code from the user_device dataframe. i.e. for the first row, the use_id is 22787, so we go to the user_device dataset, find the use_id 22787, and copy the value from the “device” column across.
  2. After this is complete, we take the new device columns, and we find the corresponding “Retail Branding” and “Model” from the devices dataset.
  3. Finally, we can look at different statistics for usage splitting and grouping data by the device manufacturers used.

Can I use a for loop?

Yes. You could write for loops for this task. The first would loop through the use_id in the user_usage dataset, and then find the right element in user_device. The second for loop will repeat this process for the devices.


However, using for loops will be much slower and more verbose than using Pandas merge functionality. So,  if you come across this situation – don’t use for loops.
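
To make the comparison concrete, here is a sketch of the first lookup done both ways. The two frames are tiny stand-ins with invented values, not the real CSV data:

```python
import pandas as pd

# Stand-in frames with made-up values; the real data comes from the CSV files.
user_usage = pd.DataFrame({"use_id": [1, 2, 3],
                           "monthly_mb": [500.0, 1200.0, 300.0]})
user_device = pd.DataFrame({"use_id": [2, 3, 4],
                            "device": ["GT-I9505", "SM-G930F", "D2303"]})

# Loop version: look up each use_id by hand.
devices = []
for use_id in user_usage["use_id"]:
    match = user_device.loc[user_device["use_id"] == use_id, "device"]
    devices.append(match.iloc[0] if not match.empty else None)
looped = user_usage.assign(device=devices)

# Merge version: one call that performs the same lookup.
merged = pd.merge(user_usage, user_device, on="use_id", how="left")

print(looped)
print(merged)
```

Both versions attach the same device codes; the merge is shorter and avoids a Python-level scan of user_device for every row.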

Merging user_usage with user_device

Let’s see how we can correctly add the “device” and “platform” columns to the user_usage dataframe using the Pandas merge command.

result = pd.merge(user_usage,
                  user_device[['use_id', 'platform', 'device']],
                  on='use_id')
result.head()

So that works, and very easily! Now – how did that work? What was the pd.merge command doing?

Pandas merging explained with a breakdown of the command parameters.

The merge command is the key learning objective of this post. The merging operation at its simplest takes a left dataframe (the first argument), a right dataframe (the second argument), and then a merge column name, or a column to merge “on”. In the output/result, rows from the left and right dataframes are matched up where there are common values of the merge column specified by “on”.

With this result, we can now move on to get the manufacturer and model number from the “devices” dataset. However, first we need to understand a little more about merge types and the sizes of the output dataframe.

Inner, Left, and right merge types

In our example above, we merged user_usage with user_device. The head() preview of the result looks great, but there’s more to this than meets the eye. First, let’s look at the sizes or shapes of our inputs and outputs to the merge command:

Why is the result a different size to both the original dataframes?

By default, the Pandas merge operation acts with an “inner” merge. An inner merge, (or inner join) keeps only the common values in both the left and right dataframes for the result. In our example above, only the rows that contain use_id values that are common between user_usage and user_device remain in the result dataset. We can validate this by looking at how many values are common:

Merging by default in Python Pandas results in an inner merge

There are 159 values of use_id in the user_usage table that appear in user_device. These are the same values that also appear in the final result dataframe (159 rows).
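
The same two checks can be sketched on a small scale. The frames below are made-up stand-ins for the CSV data, so the numbers differ from the 159 rows quoted above:

```python
import pandas as pd

# Stand-in frames with invented values.
user_usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                           "monthly_mb": [500, 1200, 300, 50]})
user_device = pd.DataFrame({"use_id": [3, 4, 5],
                            "device": ["A", "B", "C"]})

result = pd.merge(user_usage, user_device, on="use_id")

# Compare the shapes of the inputs and the output.
print(user_usage.shape, user_device.shape, result.shape)

# Count how many use_id values in user_usage also appear in user_device.
common = user_usage["use_id"].isin(user_device["use_id"]).sum()
print(common)
```

With unique keys on both sides, the number of common values equals the number of rows in the inner merge result.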

Other Merge Types

There are four different types of merges available in Pandas. These merge types are common across most database and data-orientated languages (SQL, R, SAS) and are typically referred to as “joins”. If you don’t know them, learn them now.

  1. Inner Merge / Inner join – The default Pandas behaviour, only keep rows where the merge “on” value exists in both the left and right dataframes.
  2. Left Merge / Left outer join – (aka left merge or left join) Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.
  3. Right Merge / Right outer join – (aka right merge or right join) Keep every row in the right dataframe. Where there are missing values of the “on” variable in the left dataframe, add empty / NaN values in the result.
  4. Outer Merge / Full outer join – A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere.

The merge type to use is specified using the “how” parameter in the merge command, taking values “left”, “right”, “inner” (default), or “outer”.

Venn diagrams are commonly used to exemplify the different merge and join types. See this example from Stack overflow:

Merges and joins are used to bring datasets together based on common values.

If this is new to you, or you are looking at the above with a frown, take the time to watch this video on “merging dataframes” from Coursera for another explanation that might help. We’ll now look at each merge type in more detail, and work through examples of each.

Example of left merge / left join

Let’s repeat our merge operation, but this time perform a “left merge” in Pandas.

  • Originally, the result dataframe had 159 rows, because there were 159 values of “use_id” common between our left and right dataframes and an “inner” merge was used by default.
  • For our left merge, we expect the result to have the same number of rows as our left dataframe “user_usage” (240), with missing values for all but 159 of the merged “platform” and “device” columns (81 rows).
  • We expect the result to have the same number of rows as the left dataframe because each use_id in user_usage appears only once in user_device. A one-to-one mapping is not always the case. In merge operations where a single row in the left dataframe is matched by multiple rows in the right dataframe, multiple result rows will be generated. i.e. if a use_id value in user_usage appears twice in the user_device dataframe, there will be two rows for that use_id in the join result.
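
The one-to-many case is easy to reproduce. In this sketch the frames are invented, and the validate argument of merge is used as one possible guard against unexpected duplicates:

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical frames: use_id 1 appears twice on the right.
left = pd.DataFrame({"use_id": [1, 2], "monthly_mb": [500, 1200]})
right = pd.DataFrame({"use_id": [1, 1, 2], "device": ["A", "B", "C"]})

# The left merge produces one row per matching pair, so use_id 1 is duplicated.
result = pd.merge(left, right, on="use_id", how="left")
print(len(left), len(result))

# validate="one_to_one" makes pandas raise instead of silently adding rows.
try:
    pd.merge(left, right, on="use_id", how="left", validate="one_to_one")
    raised = False
except MergeError:
    raised = True
print(raised)
```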

You can change the merge to a left merge by adding the “how” parameter to your merge command. The top of the result dataframe contains the successfully matched items, and the bottom contains the rows in user_usage that didn’t have a corresponding use_id in user_device.

result = pd.merge(user_usage,
                  user_device[['use_id', 'platform', 'device']],
                  on='use_id',
                  how='left')

Left joining or left merging is used to find corresponding values in the right dataframe, while keeping all rows from the left.

Example of right merge / right join

For example’s sake, we can repeat this process with a right join / right merge, simply by replacing how='left' with how='right' in the Pandas merge command.

result = pd.merge(user_usage, user_device[['use_id', 'platform', 'device']], on='use_id', how='right')

The result expected will have the same number of rows as the right dataframe, user_device, but have several empty, or NaN values in the columns originating in the left dataframe, user_usage (namely “outgoing_mins_per_month”, “outgoing_sms_per_month”, and “monthly_mb”). Conversely, we expect no missing values in the columns originating in the right dataframe, “user_device”.

Right merge in pandas keeps all rows from the second, "right" dataframe.

Example of outer merge / full outer join

Finally, we will perform an outer merge using Pandas, also referred to as a “full outer join” or just “outer join”. An outer join can be seen as a combination of left and right joins, or the opposite of an inner join. In outer joins, every row from the left and right dataframes is retained in the result, with NaNs where there are no matched join variables.

As such, we would expect the results to have the same number of rows as there are distinct values of “use_id” between user_device and user_usage, i.e. every join value from the left dataframe will be in the result along with every value from the right dataframe, and they’ll be linked where possible.

Pandas outer merge result retains all rows.

In the diagram below, example rows from the outer merge result are shown, the first two are examples where the “use_id” was common between the dataframes, the second two originated only from the left dataframe, and the final two originated only from the right dataframe.

Using merge indicator to track merges

To assist with the identification of where rows originate from, Pandas provides an “indicator” parameter that can be used with the merge function which creates an additional column called “_merge” in the output that labels the original source for each row.

result = pd.merge(user_usage,
                  user_device[['use_id', 'platform', 'device']],
                  on='use_id',
                  how='outer',
                  indicator=True)

Outer joins or merges in pandas result in one row for each unique value of the join variable, provided the join values are unique within each dataframe.
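
Once the “_merge” column exists, counting its values gives a quick summary of where the rows came from. A sketch on invented stand-in frames:

```python
import pandas as pd

# Stand-in frames with made-up values.
user_usage = pd.DataFrame({"use_id": [1, 2, 3],
                           "monthly_mb": [500, 1200, 300]})
user_device = pd.DataFrame({"use_id": [2, 3, 4],
                            "device": ["A", "B", "C"]})

result = pd.merge(user_usage, user_device, on="use_id",
                  how="outer", indicator=True)

# _merge takes the values "left_only", "right_only" or "both".
counts = result["_merge"].value_counts()
print(counts)
```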

Final Merge – Joining device details to result

Coming back to our original problem, we have already merged user_usage with user_device, so we have the platform and device for each user. Originally, we used an “inner merge” as the default in Pandas, and as such, we only have entries for users where there is also device information. We’ll redo this merge using a left join to keep all users, and then use a second left merge to finally get the device manufacturers in the same dataframe.

# First, add the platform and device to the user usage - use a left join this time.
result = pd.merge(user_usage,
                  user_device[['use_id', 'platform', 'device']],
                  on='use_id',
                  how='left')

# At this point, the platform and device columns are included
# in the result along with all columns from user_usage.

# Now, based on the "device" column in result, match the "Model" column in devices.
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)
result = pd.merge(result,
                  devices[['manufacturer', 'Model']],
                  left_on='device',
                  right_on='Model',
                  how='left')
print(result.head())

Using left_on and right_on to merge with different column names

The columns used in a merge operator do not need to be named the same in both the left and right dataframe. In the second merge above, note that the device ID is called “device” in the left dataframe, and called “Model” in the right dataframe.

Different column names are specified for merges in Pandas using the “left_on” and “right_on” parameters, instead of using only the “on” parameter.

You can merge data sets with different join variable names in each.

Calculating statistics based on device

With our merges complete, we can use the data aggregation functionality of Pandas to quickly work out the mean usage for users based on device manufacturer. Note that the small sample size creates even smaller groups, so I wouldn’t attribute any statistical significance to these particular results!

result.groupby("manufacturer").agg({
    "outgoing_mins_per_month": "mean",
    "outgoing_sms_per_month": "mean",
    "monthly_mb": "mean",
    "use_id": "count"
})

Groupby statistics can be calculated using the groupby and agg Pandas functions.

Becoming a master of merging – Part 2

That completes the first part of this merging tutorial. You should now have conquered the basics of merging, and be able to tackle your own merging and joining problems with the information above. Part 2 of this blog post addresses the following more advanced topics:

  • How do you merge dataframes using multiple join /common columns?
  • How do you merge dataframes based on the index of the dataframe?
  • What is the difference between the merge and join functions in Pandas?
  • How fast are merges in Python Pandas?


Source: https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/


Joining Pandas Dataframes

Overview

Teaching: 25 min
Exercises: 10 min

Questions
  • How can I join two Dataframes with a common key?

Objectives
  • Understand why we would want to join Dataframes

  • Know what is needed for a join to be possible

  • Understand the different types of joins

  • Understand what the joined results tell us about our data

Joining Dataframes

Why do we want to do this

There are many occasions when we have related data spread across multiple files.

The data can be related to each other in different ways. How they are related and how completely we can join the data from the datasets will vary.

In this episode we will consider different scenarios and show how we might join the data. We will use csv files and in all cases the first step will be to read the datasets into a pandas Dataframe from where we will do the joining. The csv files we are using are cut down versions of the SN7577 dataset to make the displays more manageable.

First, let’s download the datafiles. They are listed in the setup page for the lesson. Alternatively, you can download the GitHub repository for this lesson. The data files are in the data directory. If you’re using Jupyter, make sure to place these files in the same directory where your notebook file is.

Scenario 1 - Two data sets containing the same columns but different rows of data

Here we want to add the rows from one Dataframe to the rows of the other Dataframe. In order to do this we can use the pd.concat() function.

Have a quick look at what these Dataframes look like by printing each of them.

The concat() function appends the rows from the two Dataframes to create the df_all_rows Dataframe. When you list this out you can see that all of the data rows are there, however, there is a problem with the index.

We didn’t explicitly set an index for any of the Dataframes we have used. For both input Dataframes, default indexes would have been created by pandas. When we concatenated the Dataframes the indexes were also concatenated resulting in duplicate entries.

This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.
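
The code from the lesson is not reproduced here, so the following is only a sketch of two common fixes. df_a and df_b are invented stand-ins for the lesson's Dataframes; df_all_rows is the name used in the text:

```python
import pandas as pd

# Stand-in frames with made-up values, each with a default 0..n-1 index.
df_a = pd.DataFrame({"Id": [1, 2], "Q1": [3, 1]})
df_b = pd.DataFrame({"Id": [3, 4], "Q1": [2, 4]})

# Plain concatenation keeps both original indexes, so labels repeat.
duplicated_index = pd.concat([df_a, df_b]).index.tolist()

# Option 1: ask concat to build a fresh index.
df_all_rows = pd.concat([df_a, df_b], ignore_index=True)

# Option 2: rebuild the index afterwards and drop the old one.
df_all_rows_alt = pd.concat([df_a, df_b]).reset_index(drop=True)

print(duplicated_index)
print(df_all_rows)
```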

What if the columns in the Dataframes are not the same?

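In this case one Dataframe has no Q4 column and the other is missing a different column. When they are concatenated, the resulting Dataframe has a column for each of them. For the rows that came from the Dataframe without Q4, the values in the Q4 column are missing and denoted by NaN. The same applies to the other column for the rows that came from the second Dataframe.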
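
A small sketch of this behaviour, with invented frames in which Q4 comes from the text and Q1 is an assumed name for the other column:

```python
import pandas as pd

# Stand-in frames: df_aa lacks Q4, df_bb lacks Q1 (Q1 is an assumed name).
df_aa = pd.DataFrame({"Id": [1, 2], "Q1": [3, 1]})
df_bb = pd.DataFrame({"Id": [3, 4], "Q4": [2, 4]})

both = pd.concat([df_aa, df_bb], ignore_index=True)
print(both)

# Missing cells are filled with NaN, which isna() reports as True.
print(both.isna())
```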

Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe

We use the axis=1 parameter to indicate that it is the columns that need to be joined together. Notice that the column shared by both datasets appears twice, because it was a column in each dataset. This is not particularly desirable, but also not necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
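
The duplicated column is easy to see in a sketch. The frames and the Id, Q1 and Q2 column names below are assumptions made for illustration:

```python
import pandas as pd

# Stand-in frames that both carry an "Id" column.
df_c = pd.DataFrame({"Id": [1, 2], "Q1": [3, 1]})
df_d = pd.DataFrame({"Id": [1, 2], "Q2": [5, 6]})

# axis=1 glues the columns together, so "Id" shows up once per input frame.
side_by_side = pd.concat([df_c, df_d], axis=1)
print(side_by_side)
```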

Scenario 3 - Using merge to join columns

We can join columns from two Dataframes using the pd.merge() function. This is similar to the SQL ‘join’ functionality.

A detailed discussion of different join types is given in the SQL lesson.

You specify the type of join you want using the how parameter. The default is the inner join which returns the columns from both tables where the key or common column values match in both Dataframes.

The possible values of the how parameter are shown in a picture taken from the Pandas documentation.

[Figure: pandas join types]

The different join types behave in the same way as they do in SQL. In Python/pandas, any missing values are shown as NaN.

In order to merge the Dataframes we need to identify a column common to both of them.

In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to join on.

Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using the on parameter.

In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you can use the left_on and right_on parameters to specify them separately.
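
As a rough sketch of the difference between relying on the default and naming the column, consider two invented frames whose only shared column is Id (an assumed name):

```python
import pandas as pd

# Stand-in frames that share a single column name, "Id".
df_c = pd.DataFrame({"Id": [1, 2, 3], "Q1": [3, 1, 2]})
df_d = pd.DataFrame({"Id": [2, 3, 4], "Q2": [5, 6, 7]})

# Implicit: pandas picks the common column and performs an inner join.
implicit = pd.merge(df_c, df_d)

# Explicit: the same join with the join type and column spelled out.
explicit = pd.merge(df_c, df_d, how="inner", on="Id")

print(explicit)
```

The two results are identical here, but the explicit form keeps working as intended if another shared column is added to the data later.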

Practice with data

  1. Examine the contents of the two csv files for this exercise using Excel or equivalent.
  2. Using those two csv files, create a Dataframe which is the result of an outer join using their common column to join on.
  3. What do you notice about the column names in the new Dataframe?
  4. Using Shift+Tab in Jupyter examine the possible parameters for the merge() function.
  5. Re-write the code so that the column names which are common to both files have suffixes indicating the filename from which they come.
  6. If you add the parameter indicator=True, what additional information is provided in the resulting Dataframe?

Solution

Key Points

  • You can join pandas Dataframes in much the same way as you join tables in SQL

  • The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other.

  • concat() can also combine Dataframes by columns but the merge() function is the preferred way

  • The merge() function is equivalent to the SQL JOIN clause. ‘left’, ‘right’ and ‘inner’ joins are all possible.

Source: https://datacarpentry.org/python-socialsci/11-joins/index.html