IP-based blocking gets complicated once you are big enough
It’s literally as simple as importing an ipset into iptables and refreshing it from time to time. There are even predefined tools for that.
While AI crawlers are a problem, I’m also kind of astonished that so many projects don’t use tools like rate limiters or IP blocklists. These are pretty simple to set up, cause little or no additional load, and don’t cause collateral damage for legitimate users who just happen to use a different browser.
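To make the idea concrete, here is a rough sketch of both techniques in one place: a blocklist check followed by a per-IP token bucket. Everything in it (the `BLOCKED_NETS` range, the `TokenBucket` class, the `allow_request` helper, the rate and burst values) is made up for illustration; in practice you would more likely use an existing tool in front of the application than hand-roll this.

```python
import ipaddress
import time
from typing import Dict, Optional

# Hypothetical blocklist; a real one would be loaded from a file or feed.
BLOCKED_NETS = [ipaddress.ip_network("203.0.113.0/24")]


class TokenBucket:
    """Allows `burst` requests at once, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets: Dict[str, TokenBucket] = {}


def allow_request(ip: str, now: Optional[float] = None) -> bool:
    """Return True if the request from `ip` should be served."""
    now = time.monotonic() if now is None else now
    addr = ipaddress.ip_address(ip)
    # Blocklisted ranges are rejected before any rate accounting.
    if any(addr in net for net in BLOCKED_NETS):
        return False
    # Assumed limits: 1 request per second, bursts of up to 5.
    bucket = buckets.setdefault(ip, TokenBucket(rate=1.0, burst=5, now=now))
    return bucket.allow(now)
```

The `now` parameter only exists so the behaviour can be checked without sleeping; normal callers would leave it out.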
Well from my personal PoV there are a few problems with that
I also personally ask myself how a PyPI Admin & Director of Infrastructure can miss out on so many basic coding and security relevant aspects:
On the other hand what went well:
Just for further clarification, the API works like this:
time is the local (client) time (in this case UTC-7)
servertimezone is the time zone where the server is located
timezoneoffset is the offset of the local time relative to the servertimezone (offset from the server’s PoV)
To get the UTC date you have to do something like this:
time.minusHours(timezoneoffset).atZone(servertimezone).toUTC()
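The pseudocode above could look roughly like this in Python. This is only a sketch: the `to_utc` helper and its parameter names are invented here, the `"%Y-%m-%d %H:%M"` input format is an assumption, and the example values describe a client at UTC-7 talking to a server in Europe/Vienna.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def to_utc(time_str: str, server_tz: str, offset_hours: int) -> datetime:
    """Convert the client-local wall time to an aware UTC datetime."""
    # Parse the naive client-local wall time.
    local = datetime.strptime(time_str, "%Y-%m-%d %H:%M")
    # Undo the client offset to get the wall time at the server.
    server_wall = local - timedelta(hours=offset_hours)
    # Interpret that wall time in the server's zone, then convert to UTC.
    return server_wall.replace(tzinfo=ZoneInfo(server_tz)).astimezone(timezone.utc)


print(to_utc("2007-12-24 18:12", "Europe/Vienna", -8).isoformat())
# 2007-12-25T01:12:00+00:00
```

Three fields and a tz database lookup, just to recover a single instant.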
Well, if it’s a signed 32-bit timestamp you’re screwed after 19 January 2038 (at 03:14:07 UTC)
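For anyone who wants to check that date, a quick sketch, assuming the timestamp is a signed 32-bit count of seconds since the Unix epoch (the names `INT32_MAX`, `last_valid` and `after_overflow` are just for this example):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1

# Largest instant a signed 32-bit seconds counter can represent.
last_valid = EPOCH + timedelta(seconds=INT32_MAX)
print(last_valid.isoformat())  # 2038-01-19T03:14:07+00:00

# One second later the counter wraps around to the most negative value.
wrapped = (INT32_MAX + 1 + 2**31) % 2**32 - 2**31
after_overflow = EPOCH + timedelta(seconds=wrapped)
print(after_overflow.isoformat())  # 1901-12-13T20:45:52+00:00
```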
So just for additional context:
This meme was brought to you by the following API response scheme:
{
"time": "2007-12-24 18:12",
"servertimezone": "Europe/Vienna",
"timezoneoffset": -8
}
when it could have just been
{
"date": "2007-12-24T18:12:00-07:00"
}
If you use UTC here and a time zone definition changes, you’re boned
I’m pretty sure that things like the tz database exist exactly for such a case.
As far as I can tell it’s the other way around: IPv4 is getting more costly
Example: AWS started to charge for IPv4 addresses a few months ago - an IPv4 address now costs around $3.60 per month
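Back-of-the-envelope version of that number, assuming the AWS list price of $0.005 per public IPv4 address per hour and a 30-day month (the `ipv4_cost` helper is just for illustration):

```python
from decimal import Decimal

# Assumed AWS price per public IPv4 address per hour, in USD.
HOURLY_RATE_USD = Decimal("0.005")


def ipv4_cost(addresses: int, days: int = 30) -> Decimal:
    """Cost in USD of holding `addresses` public IPv4 addresses for `days` days."""
    return HOURLY_RATE_USD * 24 * days * addresses


print(f"{ipv4_cost(1):.2f}")            # one address, 30 days
print(f"{ipv4_cost(1, days=365):.2f}")  # one address, a full year
```

Small per address, but it adds up quickly once you have a fleet of instances and load balancers that each hold one.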
Can’t wait for all the other horror stories getting posted here :D