
limit()
The limit() function constrains the number of rows in a DataFrame by returning a new DataFrame that contains only the first N rows.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("limitExample").getOrCreate()

# Sample data
data = [
    (1, 20, 15),
    (2, 22, 16),
    (3, 19, 14),
    (4, 23, 17),
    (5, 18, 15)
]

# Column names
columns = ["ID", "Weight", "Length"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+------+------+
| ID|Weight|Length|
+---+------+------+
|  1|    20|    15|
|  2|    22|    16|
|  3|    19|    14|
|  4|    23|    17|
|  5|    18|    15|
+---+------+------+
Example: Show the first N rows
df.limit(4): This returns a new DataFrame containing the first 4 rows of df. Keep in mind that without an explicit orderBy(), the rows returned by limit() may not be consistent across different runs or environments; a deterministic variant is sketched after the output below.
# Limiting the number of rows
df.limit(4).show()
Output:
+---+------+------+
| ID|Weight|Length|
+---+------+------+
|  1|    20|    15|
|  2|    22|    16|
|  3|    19|    14|
|  4|    23|    17|
+---+------+------+
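Because Spark does not guarantee row order, limit() is often combined with orderBy() when a deterministic result is needed. The following minimal sketch reuses the df created above; the sort column (Weight) and the limit of 3 are chosen purely for illustration.
# Sort by Weight first, then take the first 3 rows;
# with an explicit orderBy() the result is deterministic
# (rows with ID 5, 3 and 1 for the sample data above)
df.orderBy("Weight").limit(3).show()

# limit() is a transformation that returns a new DataFrame,
# so it can be chained with further operations
df.orderBy("Weight").limit(3).select("ID", "Length").show()
In contrast, take(n) and head(n) are actions that return a list of Row objects to the driver rather than a DataFrame.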
End the Spark Session
spark.stop()