Column Value Not in Index in Pandas DataFrame
Problem Statement
When creating a new column in a pandas DataFrame using regular expressions and named capturing groups, users may encounter an error when trying to access the newly created column. In this article, we will explore the issue and provide a solution.
Introduction
The str.extract() method extracts regular-expression groups from each string in a pandas Series. When the pattern uses named capturing groups, the method returns a new DataFrame whose columns are named after those groups. It does not modify the original DataFrame, so the result must be attached to the original DataFrame before the new columns can be accessed through it.
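As a minimal sketch of this behaviour, the snippet below uses a made-up Series and a simplified pattern (both are illustrative assumptions, not taken from the question). It shows that str.extract() returns a separate DataFrame whose columns carry the group names.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(["12 High Street, Springfield", "7 Oak Avenue, Shelbyville"])

# Simplified pattern: everything before the comma, then everything after it
pattern = r"(?P<Street>[^,]+),\s*(?P<Town>.+)"

# One column per named group; the Series s itself is not modified
parts = s.str.extract(pattern)

print(list(parts.columns))
print(parts)
```

The names Street and Town exist only on parts. Nothing else gains these columns until parts is explicitly combined with another DataFrame.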
The Issue
In the original Stack Overflow question, the user attempts to extract an address and a suburb from a string using a regular expression with named capturing groups. The code is:
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = pd.concat([df, ret], axis=1)
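The issue arises when trying to access the “Suburb” column on a DataFrame that never received the extracted columns, for example when the result of pd.concat() is not assigned back to df. The error message is KeyError: 'Suburb' not in index, indicating that the column does not exist in the DataFrame.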
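One way to reproduce a missing-column error of this kind is sketched below. It assumes the concatenated result is discarded rather than assigned back; this is an assumption about the cause, not something the question confirms. The one-row DataFrame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Address Column": ["1 Neal court, Altona North"]})
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())

# pd.concat returns a new DataFrame; without an assignment, df is unchanged
pd.concat([df, ret], axis=1)

try:
    df["Suburb"]
    missing = False
except KeyError:
    # df never received the extracted columns
    missing = True

print(missing)
print(list(df.columns))
```

The extracted values are available in ret the whole time. Only df lacks them, because nothing was assigned to it.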
Solution
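To fix this issue, attach the extraction result to the original DataFrame and assign the combined result back to df before accessing the new columns. Both str.extract() and pd.concat() return new objects rather than modifying df in place.
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
# Extract both named groups, then title-case each resulting column
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
# Reassign df so that it contains the new "Address" and "Suburb" columns
df = pd.concat([df, ret], axis=1)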
Alternatively, you can run the extraction once and use the assign() method to add each extracted column:
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
extracted = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = df.assign(
    Address=extracted["Address"],
    Suburb=extracted["Suburb"],
)
Explanation
The assign() method returns a new DataFrame with the given columns added, so its result is assigned back to df. In this example, we first extract the address and suburb using a regular expression with named capturing groups. Then, we assign each extracted column to a new column: “Address” and “Suburb”.
Each named group becomes a column of the DataFrame returned by str.extract(), so the suburb is selected by name with extracted["Suburb"]. Indexing with .str[-1] would not work here: the .str accessor is defined on Series rather than DataFrame, and on a Series .str[-1] returns the last character of each string.
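The sketch below runs the assign() approach on two of the sample addresses and selects each extracted column by name. For comparison it also uses DataFrame.join, which is not part of the original question and is shown here only as an assumed alternative that adds every extracted column in one call.

```python
import pandas as pd

df = pd.DataFrame({"Address Column": ["4 Vermilion Drive, Greenvale",
                                      "159 Bonds lane, Greenvale"]})
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"

# Run the extraction once and title-case both resulting columns
extracted = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())

# assign() returns a new DataFrame; df itself is left untouched
with_assign = df.assign(Address=extracted["Address"],
                        Suburb=extracted["Suburb"])

# join() aligns on the index and adds all extracted columns at once
with_join = df.join(extracted)

print(with_assign.equals(with_join))
print(list(df.columns))
```

Either result has to be assigned to a name, such as df itself, before the new columns can be accessed through it.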
Example Use Case
Let’s create a sample DataFrame:
import pandas as pd
s = pd.Series(["4a Mcarthurs Road, Altona north",
               "1 Neal court, Altona North",
               "4 Vermilion Drive, Greenvale",
               "Lot 307 Bonds Lane, Greenvale",
               "430 Blackshaws rd, Altona North",
               "159 Bonds lane, Greenvale"])
df = pd.DataFrame({"Address Column": s})
Now, let’s extract the address and suburb using regular expressions with named capturing groups:
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
# Run the extraction once; the result has "Address" and "Suburb" columns
extracted = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df["Address"] = extracted["Address"]
df["Suburb"] = extracted["Suburb"]
The resulting DataFrame should look like this:
Address Column                   Address             Suburb
4a Mcarthurs Road, Altona north  4A Mcarthurs Road   Altona North
1 Neal court, Altona North       1 Neal Court        Altona North
4 Vermilion Drive, Greenvale     4 Vermilion Drive   Greenvale
Lot 307 Bonds Lane, Greenvale    Lot 307 Bonds Lane  Greenvale
430 Blackshaws rd, Altona North  430 Blackshaws Rd   Altona North
159 Bonds lane, Greenvale        159 Bonds Lane      Greenvale
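To check the table, the whole example can be run as one script. The sketch below uses the sample data and pattern from this article and attaches the extracted columns with pd.concat, as in the Solution section.

```python
import pandas as pd

s = pd.Series(["4a Mcarthurs Road, Altona north",
               "1 Neal court, Altona North",
               "4 Vermilion Drive, Greenvale",
               "Lot 307 Bonds Lane, Greenvale",
               "430 Blackshaws rd, Altona North",
               "159 Bonds lane, Greenvale"])
df = pd.DataFrame({"Address Column": s})

r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"

# Extract both named groups and title-case each resulting column
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())

# Assign the concatenated result back so df gains the new columns
df = pd.concat([df, ret], axis=1)

print(df["Suburb"].tolist())
print(df)
```

Because df is reassigned, df["Suburb"] now resolves without a KeyError.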
This solution demonstrates how to create new columns in a pandas DataFrame using regular expressions with named capturing groups.
Last modified on 2023-12-08