Column Value Not in Index in Pandas DataFrame

Problem Statement

When creating a new column in a pandas DataFrame using regular expressions and named capturing groups, users may encounter an error when trying to access the newly created column. In this article, we will explore the issue and provide a solution.

Introduction

The Series.str.extract() method extracts the capture groups of a regular expression from each string in a pandas Series. When the pattern uses named capturing groups, the method returns a new DataFrame whose columns are named after the groups. The original DataFrame is not modified, so it’s essential to join or assign the result back to it before trying to access the newly created columns there.
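As a minimal sketch of this behaviour, the snippet below runs str.extract() on a small Series. The sample strings, the simplified pattern, and the group names Street and Town are hypothetical and chosen only for illustration; they are not taken from the original question.

```python
import pandas as pd

# Hypothetical sample data and pattern, used only for illustration.
s = pd.Series(["12 High Street, Springfield", "7 Mill Lane, Shelbyville"])
pattern = r"(?P<Street>[^,]+),\s*(?P<Town>.+)"

extracted = s.str.extract(pattern)

# The result is a separate DataFrame; its columns take the group names.
print(type(extracted).__name__)
print(list(extracted.columns))
print(extracted.loc[0, "Town"])
```

The Series s itself is left unchanged; the named columns exist only on the returned object.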

The Issue

In the original Stack Overflow question, the user is attempting to extract an address and a suburb from a string using a regular expression with named capturing groups. The relevant code is:

r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = pd.concat([df, ret], axis=1)

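The error occurs when the “Suburb” column is accessed on df before the extracted columns have been joined back, that is, before the final pd.concat() line has run. str.extract() returns a new DataFrame (ret) and leaves df unchanged, so pandas raises KeyError: 'Suburb' not in index, indicating that the column does not exist in df.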
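The failure can be reproduced with a short sketch. It assumes a two-row DataFrame built from the sample addresses used later in this article, and it stops before the extracted columns are joined back; the variable raised is a hypothetical helper used only to record the outcome.

```python
import pandas as pd

df = pd.DataFrame({"Address Column": ["1 Neal court, Altona North",
                                      "4 Vermilion Drive, Greenvale"]})
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"

ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())

# ret holds the extracted columns, but df has not been touched yet.
print("Suburb" in ret.columns)
print("Suburb" in df.columns)

# Accessing the column on df at this point fails with a KeyError.
try:
    df["Suburb"]
    raised = False
except KeyError:
    raised = True
print(raised)
```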

Solution

To fix this issue, join the extracted columns back onto the original DataFrame, here with pd.concat(), and reassign the result to df before trying to access the newly created columns:

r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = pd.concat([df, ret], axis=1)
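For a self-contained check of this approach, the sketch below applies the same pattern to two of the sample addresses from this article and then inspects the combined DataFrame. It assumes df keeps its default index, which ret inherits, so that the rows line up.

```python
import pandas as pd

df = pd.DataFrame({"Address Column": ["430 Blackshaws rd, Altona North",
                                      "159 Bonds lane, Greenvale"]})
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"

ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())

# concat with axis=1 aligns rows on the index, which ret shares with df.
df = pd.concat([df, ret], axis=1)

print(list(df.columns))
print(df["Suburb"].tolist())
```

After the reassignment, both “Address” and “Suburb” are ordinary columns of df and can be accessed by name.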

Alternatively, you can use the assign() method to add the new columns:

r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
# Run the extraction once; the result has one column per named group.
extracted = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = df.assign(
    Address=extracted["Address"],
    Suburb=extracted["Suburb"],
)

Explanation

The assign() method returns a new DataFrame with the given columns added. In this example, we first extract the address and suburb using a regular expression with named capturing groups, which yields a DataFrame with one column per group. We then pass each of those columns to assign() under the names “Address” and “Suburb”.

Each keyword argument to assign() must supply a single column. Passing the whole two-column result of str.extract() under one keyword does not work, and a DataFrame has no .str accessor, so the columns are selected from the extraction result by name instead.

Example Use Case

Let’s create a sample DataFrame:

import pandas as pd

s = pd.Series(["4a Mcarthurs Road, Altona north",
               "1 Neal court, Altona North",
               "4 Vermilion Drive, Greenvale",
               "Lot 307 Bonds Lane, Greenvale",
               "430 Blackshaws rd, Altona North",
               "159 Bonds lane, Greenvale"])

df = pd.DataFrame({"Address Column": s})

Now, let’s extract the address and suburb using regular expressions with named capturing groups:

r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"

# Extract once, title-case both columns, then assign each column by name.
extracted = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df["Address"] = extracted["Address"]
df["Suburb"] = extracted["Suburb"]

The resulting DataFrame should look like this (index omitted):

                 Address Column             Address        Suburb
4a Mcarthurs Road, Altona north   4A Mcarthurs Road  Altona North
     1 Neal court, Altona North        1 Neal Court  Altona North
   4 Vermilion Drive, Greenvale   4 Vermilion Drive     Greenvale
  Lot 307 Bonds Lane, Greenvale  Lot 307 Bonds Lane     Greenvale
430 Blackshaws rd, Altona North   430 Blackshaws Rd  Altona North
      159 Bonds lane, Greenvale      159 Bonds Lane     Greenvale

This solution demonstrates how to create new columns in a pandas DataFrame using regular expressions with named capturing groups.

Last modified on 2023-12-08