
col()
The col() function is used to reference a column by name in DataFrame operations. It's a key component in DataFrame transformations, enabling flexible and readable column manipulation.
Additional methods for referencing columns, besides using the col() function, are presented towards the end of this tutorial.
Usage
- col() is used to refer to a DataFrame column by its name.
- It is often used in DataFrame transformations such as select(), filter(), and sort() when performing operations on columns.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder.appName("colExample").getOrCreate()

# Sample DataFrame
data = [("James", 34, 'US'), ("Anna", 28, 'UK'), ("Robert", 45, 'CA')]
columns = ["Name", "Age", "Country"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
| Anna| 28| UK|
|Robert| 45| CA|
+------+---+-------+
Example: Use col() in DataFrame Operations
Select a column
selected_df = df.select(col("Age"))
selected_df.show()
Output:
+---+
|Age|
+---+
| 34|
| 28|
| 45|
+---+
Select multiple columns
df.select(col("Age"), col('Name')).show()
Output:
+---+------+
|Age| Name|
+---+------+
| 34| James|
| 28| Anna|
| 45|Robert|
+---+------+
Example: Other ways of referencing columns besides col()
Dot notation .
df.Name and df.Country are used to refer to the two columns.
df.select(df.Name, df.Country).show()
Output:
+------+-------+
| Name|Country|
+------+-------+
| James| US|
| Anna| UK|
|Robert| CA|
+------+-------+
Square bracket notation []
df['Name'] and df['Country'] are used to refer to the two columns.
df.select(df['Name'], df['Country']).show()
Output:
+------+-------+
| Name|Country|
+------+-------+
| James| US|
| Anna| UK|
|Robert| CA|
+------+-------+
Column name
You can directly specify column names for column selection.
Name and Country are used to refer to the two columns.
df.select("Name", "Country").show()
Output:
+------+-------+
| Name|Country|
+------+-------+
| James| US|
| Anna| UK|
|Robert| CA|
+------+-------+
You can use any of these methods to reference columns in most operations interchangeably. However, when it comes to conditional functions, using just the column name won't work. For example, if you want to filter data where "Age" is greater than 30 using the filter function, you can't use the column name alone in the filter condition. Here's an example to illustrate:
df.filter("Age">30).show()
Output:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [37], in <cell line: 1>()
----> 1 df.filter("Age">30).show()
TypeError: '>' not supported between instances of 'str' and 'int'
The error occurs because Python doesn't recognize "Age" as referring to a column; it treats it as a literal string, and a string can't be compared to an integer with >. To perform conditional operations on columns, reference them using col(), dot notation ., or bracket notation []. For instance:
- with col()
df.filter(col("Age")>30).show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
|Robert| 45| CA|
+------+---+-------+
- with dot notation .
df.filter(df.Age>30).show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
|Robert| 45| CA|
+------+---+-------+
- with square bracket []
df.filter(df["Age"]>30).show()
Output:
+------+---+-------+
| Name|Age|Country|
+------+---+-------+
| James| 34| US|
|Robert| 45| CA|
+------+---+-------+
# Stop the Spark Session
spark.stop()