mmastrac · 4 years ago
You should never design an API that uses offsets for pagination unless you're dealing with small amounts of data (<10000 rows). Cursors give you far more flexibility and if you want to be lazy, you can just hide an offset in an opaque cursor blob and upgrade later on.
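A rough sketch of the opaque-offset trick (Python; all names are made up):

    import base64, json

    def encode_cursor(state: dict) -> str:
        # Today the state is just an offset; later it can carry a keyset position
        # without clients ever noticing the change.
        return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

    def decode_cursor(token: str) -> dict:
        return json.loads(base64.urlsafe_b64decode(token))

    next_cursor = encode_cursor({"offset": 150})  # clients only ever see the opaque token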

I don't think I've used offsets in APIs for at least 10 years. Lightly-obfuscated cursor tokens are one of the first things I build in web projects and that's usually less than an hour's work.

If you _really_ need the ability to drop the needle in your dataset with pagination, design your system to use pseudo-pagination where you approximate page-to-record mappings and generate cursors to continue forward or backward from that point.

xdfgh1112 · 4 years ago
IIRC the DigitalOcean API uses offsets. We had to write special code to handle the possibility of the list changing while it was being queried.
giaour · 4 years ago
When I worked for a news site, we moved all listing endpoints from offset to timestamp based pagination because offset based pagination assumes the data being queried will remain stable between requests. This was, of course, very rarely true during business hours.

Duplicates or missed entries in the result set are the most likely outcome of offset-based pagination.

JamesSwift · 4 years ago
Do any of these avoid that scenario? Unless you are paginating on `updated_at ASC`, there is an edge case of missing new data on previously requested pages.
gopher_space · 4 years ago
Something always feels off about repopulating lists that people are working on. Like that concept needs an entirely different display method.
xthrowawayxx · 4 years ago
how do you jump to arbitrary pages with cursor pagination?
singron · 4 years ago
You don't, but if your cursor is just a position in some well-defined order, you can sometimes craft a cursor out of a known position. Continuing the book metaphor, instead of skipping to page 100, which has no semantic meaning, you could skip to the beginning of the "M" section of the dictionary, or see the first 10 words after "merchant".
tshaddox · 4 years ago
You generally don’t, unless you want to linearly scan through n pages. If your API is using offset or page numbers and the underlying collection is also having items added or removed, the behavior is undefined or straight up wrong. I think it’s okay to use offsets or page numbers in cases where the collection is static or where it’s acceptable to occasionally skip over an item or see a duplicate item while paginating.
lazide · 4 years ago
Depending on what you mean by ‘page’, you probably want a query? (Show me all results with values between 1 and 10)

If you want to jump ahead in the results, that is fundamentally unstable though.

donw · 4 years ago
I see three options if this is a necessary use-case against a shared-changeable dataset:

1. Accept that the results will be unstable as the underlying set changes. Pagination may either miss or double-include items unpredictably.

2. Store an intermediate result set guaranteed to remain stable for the necessary duration. This will provide stable pagination, at the cost of solving cache-expiry problems.

3. Use or build a version-controlled data store. I don't know of anything in common use, but there is likely something available. This is similar to #2, but moves the work from the application into the data-storage layer. You then paginate against a set version of the data. Imagine something similar to Immutable, but with expiry for unreachable nodes.

Unsure what #3 looks like at scale.

rhacker · 4 years ago
Or... search, not page, which is typically way more useful in reality.
MAGZine · 4 years ago
I consider offset-based pagination to be a mark of the novice.
mcphage · 4 years ago
Or the mark of someone who asked the users what they wanted.
steve_adams_86 · 4 years ago
edit: I just saw another comment of yours, so I see this was meant more contextually than I realized.

Can you explain why offsets would never be a suitable solution? Is there a clear explanation as to why?

I understand how cursors are superior in some cases. But what if you have a site which paginates static data? Offsets would allow you to cache the results easily, overcoming any performance concerns (which would be irrelevant if the dataset was small anyway), providing a better user experience (and developer experience due to simpler implementation details) overall.

I can see that it would be a novice move if you’ve got staggering amounts of records that take a while to query. But that’s actually pretty rare in my experience.

stevebmark · 4 years ago
I'd also point out that the GraphQL/Relay spec for pagination enforces cursor based pagination: https://relay.dev/graphql/connections.htm. It's also a really nice pagination spec overall, edges/node/pageInfo are well thought out and flexible.
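Roughly what a connection query looks like (the `products` field and its subfields are made up; `first`/`after`, `edges`, `node` and `pageInfo` come from the spec):

    # A Relay-style connection query, held as a plain Python string.
    PRODUCTS_QUERY = """
    query ($first: Int!, $after: String) {
      products(first: $first, after: $after) {
        edges {
          cursor
          node { id title }
        }
        pageInfo { hasNextPage endCursor }
      }
    }
    """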

In my professional experience, lots of devs will implement fetching APIs and not consider pagination, and then when the responses start growing (testing data, use, whatever), user experience sucks. Pagination should be a default API design for anything returning an unbounded list of items.

Ambolia · 4 years ago
How do you deal in GraphQL when you have a query hitting multiple nodes, which may return multiple cursors for the different nodes? Seems like a nightmare if you have to check and retry manually for each node; maybe there are libraries that take care of it transparently?
stevebmark · 4 years ago
How is this different than hitting multiple REST endpoints for different resources with their own individual pagination? If the client needs to glue together paginated resources from two places, it's a hairy problem no matter what the protocol. A backend for frontend that does this under the hood is one solution to both, or a specialized endpoint that does it under the hood.
grahamm · 4 years ago
Agreed. So many times I've raised paging with an internal API provider, only to be told they can retrofit it later if needed. Then you start to use production data and bam, the one call doesn't work.
User23 · 4 years ago
I disagree. I think pagination is a clumsy error prone kludge. That it performs better than the awful naive approach by additionally burdening the caller doesn’t remotely make it good. The correct approach is a buffered stream with proper flow control. Incidentally that’s what Transmission Control Protocol was invented for.

Probably the nicest way to expose this that fits contemporary sensibilities would be as a lazy sequence that does sensible prefetching and asks the server for more data as it’s iterated through.

giaour · 4 years ago
> a lazy sequence that does sensible prefetching and asks the server for more data as it’s iterated through

Isn't this an accurate description of cursor-based pagination? You can have the server hold the cursor and associate a request for the next page based on the client ID (like in the PostgreSQL wire protocol), or you can make the flow stateless (at the network level) by handing the client a ticket for the next page.

jayd16 · 4 years ago
>Probably the nicest way to expose this that fits contemporary sensibilities would be as a lazy sequence that does sensible prefetching and asks the server for more data as it’s iterated through.

gRPC's streaming API is a good example. It has backpressure built in and you just need to configure the buffer sizes. Clients just consume items off the stream and async writes will suspend as needed.
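For example, with the Python client, consuming a server-streaming RPC looks roughly like this (the service and messages are hypothetical, generated from a .proto):

    import grpc
    # Hypothetical modules generated by protoc from a service with
    #   rpc ListItems(ListItemsRequest) returns (stream Item);
    from items_pb2 import ListItemsRequest
    from items_pb2_grpc import ItemsStub

    with grpc.insecure_channel("localhost:50051") as channel:
        stub = ItemsStub(channel)
        for item in stub.ListItems(ListItemsRequest()):
            print(item)  # items arrive lazily as the server streams them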

stevebmark · 4 years ago
When dealing with the most common case (I think) for pagination, clients (web/native) requesting data, are you suggesting keeping a long-running websocket only to fetch data in response to new page requests? What benefit would that afford, given that the requests are pretty far apart in time?

Or are you focusing more on the backend machine to machine case? In that case, some kind of automatic prefetching, lazy loading/yield setup sounds pretty nice if that's abstracted away when you want to iterate over a large dataset from a separate resource server. It's not a pattern I've used before, mainly because it's not often that one server wants to know everything that another server has.

freedomben · 4 years ago
I find pagination really annoying on the client, but the majority of the time it does seem like the client doesn't actually need the whole thing so it's a good (and often very necessary) performance optimization. For endpoints where that isn't the case, it's (usually) not too hard to allow a sentinel value for page size that means infinity, so the client really can get it all in one swoop.
robertlagrant · 4 years ago
What are you saying that would be different in practice? What advantages would it have?
therusskiy · 4 years ago
How do you determine on the backend whether a page is the last one? The trick I used to avoid extra SQL queries: if a user asks for 10 records, I load up to 11 and then return 10. If there were 11, I return a cursor for the next page; otherwise the next page is null. Is there a better way?
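Roughly what I mean (Python sketch with sqlite3-style placeholders; table/column names made up):

    def fetch_page(conn, after_id: int, page_size: int = 10):
        rows = conn.execute(
            "SELECT id, title FROM articles WHERE id > ? ORDER BY id LIMIT ?",
            (after_id, page_size + 1),  # ask for one extra row
        ).fetchall()
        has_next = len(rows) > page_size
        page = rows[:page_size]
        next_cursor = page[-1][0] if has_next else None  # id of the last returned row
        return page, next_cursor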
whenlambo · 4 years ago
Moreover, you can query, like, +3 more, and if the extra is fewer than 3, return those extra items (short tail) as well. This way, you won’t have a last page with only 1-2 items.
omegalulw · 4 years ago
How do you deal with cases where results must be ordered by some field and the data changes between page 1 request and page 2 request?
drbojingle · 4 years ago
You could ask for a count of records and divide by the page size.
neoglow · 4 years ago
Counts become incredibly slow on some databases like Postgres. It needs to do a sequential scan, which results in slow(er) performance on even tens of thousands of rows. When operating at scale, this is very, very slow.
spurgu · 4 years ago
This is what I do, since counts should cause minimal overhead (I've never had to deal with huge databases). But neither does the N+1 records trick which OP describes, that's clever. :)

ignaciochiazzo · 4 years ago
Query limit+1 and then check for the presence of the extra (limit+1) item. Never use `count`; counts are usually slow, and you might need to add more filters in the future. I will add this to the article.
noneeeed · 4 years ago
That’s what we do. I’m not sure there is another way, other than making a request for the next page.
yashap · 4 years ago
Just let people query for an empty page every now and then, no big deal.
SPBS · 4 years ago
So cursor pagination is just keyset pagination, except instead of using a column value directly you use an opaque string that gets translated back to a column value. So you can modify how the cursor gets translated back to a column value anytime without breaking API compatibility.

That said, I'm not sure everyone agrees on that definition; from what I know, people do consider keyset pagination and cursor pagination to mean basically the same thing.

noneeeed · 4 years ago
I think that’s what the article is saying. I found the descriptions didn’t really explain the differences very well.

We use a cursor that combines the sort key and normally the ID as a secondary sort and pointer value, otherwise you can’t paginate through data where the sort column has duplicates. We don’t really do much to obfuscate these, but we don’t document the structure and people who try to tweak the values won’t get far.

giaour · 4 years ago
One advantage of opaque cursors is that they can include information that's not part of your public API. For instance, if you allow soft deletes and do not return deleted items in the list, you can end up with a massive stretch of deleted items separating two live results. You can leave a note in the cursor indicating what record was last scanned, and that's not possible in keyset pagination.
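For example, the decoded cursor might carry both positions (field names made up):

    cursor_state = {
        "last_returned_id": 1042,   # what a plain keyset cursor would also know
        "last_scanned_id": 98771,   # internal detail: skip the soft-deleted stretch next time
    }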
richbell · 4 years ago
> So cursor pagination is just keyset pagination, except instead of using a column value directly you use an opaque string that gets translated back to a column value.

HashIds is a popular solution if those columns are numerical or can be represented numerically (e.g. timestamp).

https://hashids.org/
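e.g. with the Python package (salt and values are illustrative):

    from hashids import Hashids

    hashids = Hashids(salt="pagination-cursor", min_length=8)
    token = hashids.encode(12345)        # opaque-looking string for the client
    (last_id,) = hashids.decode(token)   # back to the numeric key on the server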

ignaciochiazzo · 4 years ago
Hey thanks for the feedback. I will update the article.
terandle · 4 years ago
Wish this article had gone into detail about how cursor-based pagination is actually implemented, like it did for the other options by showing SQL. Is this something the database engine has to provide support for?
gabetax · 4 years ago
Check https://use-the-index-luke.com/no-offset as a better reference.

But in most SQL databases, cursors are something you implement and parse at the application layer and translate to a WHERE clause on the (hopefully indexed) column you're ordering on. That turns the O(n) OFFSET scan into an O(log n) index seek.
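Roughly the difference (illustrative table/column names):

    # O(n): the database still has to walk past the skipped rows
    OFFSET_QUERY = "SELECT * FROM articles ORDER BY id LIMIT 20 OFFSET 100000"

    # O(log n): an index seek straight to the cursor position
    KEYSET_QUERY = "SELECT * FROM articles WHERE id > :last_seen_id ORDER BY id LIMIT 20"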

lazide · 4 years ago
Many databases have actual cursor implementations which are transactional. That means you’ll get a consistent view at the point in time you created the cursor.

That said, they tend to live only as long as the db connection (with some exceptions), so yeah you need some application work to make it sane.

nafey · 4 years ago
How is a cursor different from keyset in that case?
andrewmackrodt · 4 years ago
You can do something like the following in application code.

Imagine a table with the following schema/data:

    id=1, created_at='2022-05-28T18:00Z'
    id=2, created_at='2022-05-28T18:00Z'
    id=3, created_at='2022-05-28T18:00Z'

To retrieve articles in order of descending creation time, our sort order would be: created_at DESC, id DESC

The last item in the ORDER BY query should be a unique key to ensure we have consistent ordering across multiple queries. We must also ensure all supported sort columns have an INDEX for performance reasons.

Assume our application returns one article at a time and we have already retrieved the first article in the result set. Our cursor must include the last record's ORDER BY values, e.g. serialize('2022-05-28T18:00Z,3'). For example purposes I will use base64, so our cursor is MjAyMi0wNS0yOFQxODowMFosMw==.

When the user requests the next set of results, they will supply the cursor, and it will be used to construct a WHERE (AND) expression: created_at < '2022-05-28T18:00Z' OR (created_at = '2022-05-28T18:00Z' AND id < 3)

So our query for the next set of results would be:

    SELECT * FROM articles
    WHERE (created_at < '2022-05-28T18:00Z' OR (created_at = '2022-05-28T18:00Z' AND id < 3))
    ORDER BY created_at DESC, id DESC
    LIMIT 1

For queries with more than 2 sort columns the WHERE expression increases slightly in complexity, but it need only be implemented once. If the sort order direction is flipped, make sure to adjust the comparator appropriately, e.g. for DESC use '<' and for ASC use '>'.
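A minimal Python sketch of the same idea (same table/column names as above; not production code):

    import base64

    def encode_cursor(created_at: str, last_id: int) -> str:
        return base64.urlsafe_b64encode(f"{created_at},{last_id}".encode()).decode()

    def decode_cursor(cursor: str):
        created_at, last_id = base64.urlsafe_b64decode(cursor).decode().rsplit(",", 1)
        return created_at, int(last_id)

    def next_page_query(cursor: str, limit: int = 1):
        created_at, last_id = decode_cursor(cursor)
        sql = (
            "SELECT * FROM articles "
            "WHERE created_at < ? OR (created_at = ? AND id < ?) "
            "ORDER BY created_at DESC, id DESC LIMIT ?"
        )
        return sql, (created_at, created_at, last_id, limit)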

delusional · 4 years ago
Isn't that basically just a trivial extension of what the article calls "keyset based pagination"?

I'm not complaining since this is also what I usually do; I'm just wondering if there's more to it.

catlifeonmars · 4 years ago
Makes total sense. It also seems trivially automatable on the surface. Echoing GP: Is this an existing DBMS feature? If not, why not?
TravelPiglet · 4 years ago
How do you handle removal?
sgtfrankieboy · 4 years ago
Some database engines will support cursors like this, but from what I can gather, cursor-based navigation is usually a wrapper around keyset or page navigation with base64-encoded parameters.

This makes the implementation details of pagination irrelevant to the consumer of the API.

qsort · 4 years ago
No need for anything special, you would just query with something like "SELECT whatever FROM Table WHERE cursor>value ORDER BY cursor LIMIT n". Obviously there needs to be an index over the columns of the cursor, but any reasonable engine should allow you to add that effortlessly.
mbStavola · 4 years ago
The complication comes from having a cursor over results that are sorted on multiple columns. It's not incredibly difficult to implement, but it can certainly be annoying to do, especially if you want to support before+after cursor as well.
sa46 · 4 years ago
The API Improvement Proposals (AIP; yes, it's a terrible acronym) are a rich source of API design wisdom. They assume protobufs, but the majority of proposals would work for any API.

The relevant one is https://google.aip.dev/158, which recommends cursor-based pagination and covers different pagination use-cases, including:

- support for different sort orders - mentioned in the article as a weakness of keyset pagination

- support for skipping results - mentioned as a weakness of cursor-based pagination

- why to use opaque page tokens

grogers · 4 years ago
This is a good reference, but IMO page tokens need to expire, even if you aren't storing any data for them. Expiring tokens make it possible to change pagination formats over time (you only need to wait the expiry time while supporting both formats), eg to improve efficiency.

Also I'll add that if you support a pagination API that accepts args (e.g. to filter on different dimensions) you should include a hash of the args in the opaque next page token, and fail the call if the args change. This prevents some nonsense scenarios where you started paging by scanning index A but the user switched filters such that you'd instead scan index B, but you don't have the key for that column.
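One way to sketch that (Python; field names are mine):

    import base64, hashlib, json

    def make_page_token(position: dict, filter_args: dict) -> str:
        payload = {
            "pos": position,
            "args_hash": hashlib.sha256(
                json.dumps(filter_args, sort_keys=True).encode()
            ).hexdigest(),
        }
        return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

    def read_page_token(token: str, filter_args: dict) -> dict:
        payload = json.loads(base64.urlsafe_b64decode(token))
        expected = hashlib.sha256(
            json.dumps(filter_args, sort_keys=True).encode()
        ).hexdigest()
        if payload["args_hash"] != expected:
            raise ValueError("filters changed between pages")  # map to INVALID_ARGUMENT
        return payload["pos"]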

dodobirdlord · 4 years ago
Your second point is noted in the https://google.aip.dev/158 guidance.

> Request messages for collections should define a string page_token field, allowing users to advance to the next page in the collection.

> * If the user changes the page_size in a request for subsequent pages, the service must honor the new page size.

> * The user is expected to keep all other arguments to the RPC the same; if any arguments are different, the API should send an INVALID_ARGUMENT error.

sa46 · 4 years ago
Expiring tokens have a section in that link:

> Many APIs store page tokens in a database internally. In this situation, APIs may expire page tokens a reasonable time after they have been sent, in order not to needlessly store large amounts of data that is unlikely to be used. It is not necessary to document this behavior.

gregw2 · 4 years ago
For large datasets, consider avoiding pagination complexity by having your API return an (async) job ID and dumping the whole result set to a file, which can then be fetched (over HTTPS) via the job ID.

I have done this with both Postgres/S3 and Redshift/S3 backends and presigned URLs, and it looks like I could do it with Snowflake too.
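The last step with boto3 looks roughly like this (bucket/key names are illustrative; the async job plumbing is elided):

    import boto3

    s3 = boto3.client("s3")

    # Assume an async worker has already written the full result set to this key.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-exports", "Key": "jobs/1234/results.csv"},
        ExpiresIn=3600,  # the link the API returns is valid for one hour
    )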

ignaciochiazzo · 4 years ago
Hey, I'm the author of the post. Thanks for the feedback! I will update the article with some of the insights from here.

I wrote the post when I was investigating cursor-based pagination, which I had to implement at Shopify. We decided to use Cursor-based pagination since offset performance wasn't good enough. The larger the offset, the slower the query.

We decided to add the cursors to the Link header of each response. The header contains two URLs: the previous page and the next page.
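Roughly like this (illustrative URLs):

    Link: <https://example.com/products?cursor=abc123>; rel="previous",
          <https://example.com/products?cursor=def456>; rel="next"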

We saw a tremendous impact! In some cases, iterating over the pages sequentially using cursors was much faster than sending parallel requests using page-based pagination.

I built a library where you can compare the performance of Cursor vs Offset for a Shopify Store iterating on specific resources. This helped validate the approach at Shopify and prove to partners that it was worth migrating to Cursors!

More information here https://twitter.com/IgnacioChiazzo/status/153071174122657382...