
limit()
The limit() function constrains the number of rows in a DataFrame by returning a new DataFrame that contains only the first N rows.
Create Spark Session and sample DataFrame
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("limitExample").getOrCreate()

# Sample data
data = [
    (1, 20, 15),
    (2, 22, 16),
    (3, 19, 14),
    (4, 23, 17),
    (5, 18, 15)
]

# Column names
columns = ["ID", "Weight", "Length"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
Output:
+---+------+------+
| ID|Weight|Length|
+---+------+------+
|  1|    20|    15|
|  2|    22|    16|
|  3|    19|    14|
|  4|    23|    17|
|  5|    18|    15|
+---+------+------+
Example: Show the first N rows
df.limit(4): This returns a new DataFrame containing the first 4 rows of df. Keep in mind that without an explicit orderBy(), the rows returned by limit() may not be consistent across different runs or environments; a deterministic variant is sketched after the output below.
# Limiting the number of rows
df.limit(4).show()
Output:
+---+------+------+
| ID|Weight|Length|
+---+------+------+
|  1|    20|    15|
|  2|    22|    16|
|  3|    19|    14|
|  4|    23|    17|
+---+------+------+
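Because Spark does not guarantee row order, limit() is often combined with orderBy() when a deterministic result is needed. The following minimal sketch reuses the df created above; the sort column (Weight) and the limit of 3 are chosen purely for illustration.
# Sort by Weight first, then take the first 3 rows;
# with an explicit orderBy() the result is deterministic
# (rows with ID 5, 3 and 1 for the sample data above)
df.orderBy("Weight").limit(3).show()

# limit() is a transformation that returns a new DataFrame,
# so it can be chained with further operations
df.orderBy("Weight").limit(3).select("ID", "Length").show()
In contrast, take(n) and head(n) are actions that return a list of Row objects to the driver rather than a DataFrame.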
End the Spark Session
spark.stop()