What is a rate limiter
Rate limiting prevents the frequency of an operation from exceeding a defined limit.
It is commonly used to protect underlying services and resources from unintended or malicious overuse.
Why rate limit
- Preventing resource starvation: for example, a Denial of Service (DoS) attack, where one user bombards the resource and starves everyone else
- Security: rate limiting slows down brute-force attacks on security-sensitive functionality such as user login
- Controlling operational costs: without rate limiting, auto-scaling resources scale up unnecessarily, inflating the infrastructure bill
Rate limiting strategies
We can apply rate limiting based on a number of factors:
- User: limit the number of requests from a user in a given timeframe
- Concurrency: limit the number of parallel sessions from a user at any given time
- Demographics: limit the number of requests from out-of-region users to preserve availability in the target regions
- Server: limit requests to a specific server or resource, such as database connections
Rate limiting algorithms
Leaky bucket
Incoming requests are added to a bucket of fixed capacity that leaks (is processed) at a constant rate; requests that arrive when the bucket is full are rejected, which smooths bursts into a steady outflow.
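A minimal in-memory sketch of the meter variant, assuming a single process (the class and parameter names are illustrative):

```python
import time

class LeakyBucket:
    """Leaky bucket meter: fills on each request, drains at a constant rate."""

    def __init__(self, capacity: float, leak_rate: float):
        self.capacity = capacity    # how much the bucket holds before overflowing
        self.leak_rate = leak_rate  # units drained per second
        self.water = 0.0
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain whatever has leaked out since the last call.
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False  # bucket is full: reject (or queue) the request
```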
Token bucket
Tokens are added to a bucket at a fixed rate up to a maximum capacity; each request consumes one token and is rejected when the bucket is empty, which allows short bursts while enforcing an average rate.
The problem here is that it can cause a race condition in a distributed environment, when two requests try to fetch a token for the same user on different application servers.
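Setting the distributed case aside for a moment, a minimal single-process sketch looks like this (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket: tokens accrue at refill_rate, each request spends one."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens (the allowed burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Credit tokens accrued since the last request, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```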
Fixed window counter
Time is divided into fixed windows (for example, one minute each) and a counter tracks requests per user per window; the counter resets at each window boundary, so a burst straddling a boundary can briefly let through up to twice the limit.
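A minimal single-process sketch (names are illustrative; a production version would also evict counters for old windows):

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Counts requests per (user, window); resets when the next window starts."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (user, window index) -> request count

    def allow(self, user: str) -> bool:
        window_index = int(time.time() // self.window)
        key = (user, window_index)
        if self.counts[key] < self.limit:
            self.counts[key] += 1
            return True
        return False
```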
Sliding log
A timestamp is stored for every accepted request, and a new request is allowed only if the number of timestamps inside the trailing window is below the limit; this is exact at window boundaries but costs memory proportional to the request rate.
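A minimal single-process sketch (names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingLog:
    """Keeps one timestamp per accepted request and counts the trailing window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # user -> timestamps of recent requests

    def allow(self, user: str) -> bool:
        now = time.monotonic()
        log = self.logs[user]
        # Evict timestamps that have slid out of the trailing window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```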
Sliding window
- This is the best of both the sliding log above and the fixed window counter approach.
- For example, suppose we have a limit of 10 requests per 1-minute window. Whenever a request is received (say at time t), we count the requests made between t-1 minute and t. If the count has already reached the limit, the request is rejected; otherwise it is processed.
- This combines the low processing cost of the fixed window with the improved boundary behavior of the sliding log; a common weighted-counter realization is sketched below.
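A minimal single-process sketch of the sliding window counter variant, which approximates the trailing-window count by weighting the previous fixed window's count by how much of it still overlaps the window (names are illustrative):

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Approximates a trailing window from two adjacent fixed-window counts."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (user, window index) -> request count

    def allow(self, user: str) -> bool:
        now = time.time()
        index = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction of current window elapsed
        # Weight the previous window by how much of it the trailing window still covers.
        estimated = (self.counts[(user, index - 1)] * (1 - elapsed)
                     + self.counts[(user, index)])
        if estimated < self.limit:
            self.counts[(user, index)] += 1
            return True
        return False
```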
Challenges in distributed systems
These algorithms work well on a single server. In complex systems with multiple app servers distributed across regions, each running its own rate limiter, we need to enforce a global rate limit. A consumer can stay under every node's local limit yet still surpass the global limit by spreading a burst of requests across nodes in a small time frame; the greater the number of nodes, the more likely the user is to exceed the global limit.
- Inconsistency: a user might not cross the rate limit on any single server in a region yet still exceed the global limit. To address this, we can use:
  - Sticky sessions: route every request from a user to a dedicated target server; this has its own downsides, such as reduced fault tolerance
  - Centralized data stores: such as Redis; this introduces extra latency on every request but is still a better way of handling the inconsistency (see the sketch after this list)
- Race condition: multiple requests may read the request count concurrently, especially when updating it is a get-then-set sequence rather than an atomic operation. We can use synchronization or atomic operations to address this, though either may carry a performance penalty.
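A minimal sketch of the centralized approach, assuming a shared Redis instance reachable on localhost and the redis-py client (the key format, limit, and window are illustrative). Redis executes INCR atomically on the server, which sidesteps the get-then-set race across application nodes:

```python
import time

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # assumed shared Redis instance

LIMIT = 10   # requests allowed per window
WINDOW = 60  # window length in seconds

def allow(user: str) -> bool:
    # One key per user per fixed window; INCR is atomic on the Redis server,
    # so concurrent nodes cannot race on a get-then-set of the count.
    key = f"ratelimit:{user}:{int(time.time() // WINDOW)}"
    count = r.incr(key)
    if count == 1:
        # First request in this window: make the key expire with the window.
        # (In production, INCR + EXPIRE can be wrapped in a Lua script so the
        # pair stays atomic even if this process crashes in between.)
        r.expire(key, WINDOW)
    return count <= LIMIT
```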