
col()
The col() function is used to reference a column by name in DataFrame operations. It's a key component in DataFrame transformations, enabling flexible and readable column manipulation.
Additional methods for referencing columns, besides using the col() function, are presented towards the end of this tutorial.
Usage
- col() is used to refer to a DataFrame column by its name.
- It is often used in DataFrame transformations such as select(), filter(), and sort() when performing operations on columns.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder.appName("colExample").getOrCreate()

# Sample DataFrame
data = [("James", 34, 'US'), ("Anna", 28, 'UK'), ("Robert", 45, 'CA')]
columns = ["Name", "Age", "Country"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
| Anna| 28| UK|
|Robert| 45| CA|
+------+---+-------+
Example: Use col() in DataFrame Operations
Select a column
selected_df = df.select(col("Age"))
selected_df.show()
Output:
+---+
|Age|
+---+
| 34|
| 28|
| 45|
+---+
Select multiple columns
df.select(col("Age"), col('Name')).show()
Output:
+---+------+
|Age| Name|
+---+------+
| 34| James|
| 28| Anna|
| 45|Robert|
+---+------+
Example: Other ways of referencing columns besides col()
Dot notation .
df.Name and df.Country are used to refer to the two columns.
df.select(df.Name, df.Country).show()
Output:
+------+-------+
| Name|Country|
+------+-------+
| James| US|
| Anna| UK|
|Robert| CA|
+------+-------+
Square bracket notation []
df['Name'] and df['Country'] are used to refer to the two columns.
df.select(df['Name'], df['Country']).show()
Output:
+------+-------+
| Name|Country|
+------+-------+
| James| US|
| Anna| UK|
|Robert| CA|
+------+-------+
Column name
You can directly specify column names for column selection.
Name and Country are used to refer to the two columns.
df.select("Name", "Country").show()
Output:
+------+-------+
| Name|Country|
+------+-------+
| James| US|
| Anna| UK|
|Robert| CA|
+------+-------+
You can use any of these methods to reference columns in most operations interchangeably. However, when it comes to conditional functions, using just the column name won't work. For example, if you want to filter data where "Age" is greater than 30 using the filter function, you can't use the column name alone in the filter condition. Here's an example to illustrate:
df.filter("Age">30).show()
Output:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [37], in <cell line: 1>()
----> 1 df.filter("Age">30).show()
TypeError: '>' not supported between instances of 'str' and 'int'
The error occurs because Python doesn't recognize "Age" as referring to a column; it treats it as a literal string, and a string can't be compared to an integer with >. To perform conditional operations on columns, reference them using col(), dot notation ., or bracket notation []. For instance:
- with col()
df.filter(col("Age")>30).show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
|Robert| 45| CA|
+------+---+-------+
- with dot notation .
df.filter(df.Age>30).show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
|Robert| 45| CA|
+------+---+-------+
- with square bracket []
df.filter(df["Age"]>30).show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
|Robert| 45| CA|
+------+---+-------+
# Stop the Spark Session
spark.stop()