• Zak@lemmy.world
    7 months ago

    If I’m reading this comment right, it relies on a mistaken understanding of robots.txt. robots.txt is not an instruction telling the server hosting it not to serve certain robots; it’s a request to any robot crawling the site to limit its own behavior. Compliance is 100% voluntary on the part of the robot.
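
    For illustration, a minimal robots.txt only expresses that request; the crawler name below is made up, and nothing stops it from ignoring the file entirely:

        User-agent: ExampleBot
        Disallow: /

        User-agent: *
        Disallow: /private/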

    The ability to deny certain requests from servers that self-report running a version of their software with known vulnerabilities would be useful.
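
    A rough sketch of what that could look like, assuming the remote server self-reports its software and version in the User-Agent header (formats vary between Mastodon, Lemmy, etc., and the version list below is made up for illustration):

        import re

        # Illustrative list only; a real deployment would track actual security advisories.
        VULNERABLE_VERSIONS = {
            "mastodon": {"4.1.0", "4.1.1"},
            "lemmy": {"0.18.0"},
        }

        # Many Fediverse servers send something roughly like
        # "http.rb/5.1.1 (Mastodon/4.2.0; +https://example.social/)"; this regex is only a sketch.
        SOFTWARE_RE = re.compile(r"(mastodon|lemmy|pleroma|misskey)/(\d+(?:\.\d+)*)", re.IGNORECASE)

        def should_reject(user_agent: str) -> bool:
            """Return True if the User-Agent claims a software version with known vulnerabilities."""
            match = SOFTWARE_RE.search(user_agent)
            if match is None:
                return False  # nothing self-reported; leave the decision to other rules
            software, version = match.group(1).lower(), match.group(2)
            return version in VULNERABLE_VERSIONS.get(software, set())

        print(should_reject("http.rb/5.1.1 (Mastodon/4.1.0; +https://example.social/)"))  # True

    It would only help against servers that are honest about what they run, of course; a malicious instance can put anything it likes in that header.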

    • Skull giver@popplesburger.hilciferous.nl
      7 months ago

      Of course robots.txt is voluntary, but the scraper that started this round of drama did actually follow robots.txt, so in this instance the problem would be solved.

      For malicious actors, there is no solution, except for whitelisted federation (with authorised fetch and a few other settings) or encryption (e.g. Circles, the social network built on Matrix). Anyone can pretend to be a well-behaved Mastodon server and secretly scrape data. There’s little difference between someone’s web browser looking through comments on a profile and a bot collecting information. Pay a few dollars and those “browsers” will come from residential ISPs as well. Even Cloudflare won’t block scrapers anymore if you pay the right service enough money.
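
      For reference, on Mastodon that combination is, as far as I know, mostly two settings in .env.production plus an explicit domain allowlist in the admin interface (double-check the variable names against the docs for your version):

          AUTHORIZED_FETCH=true
          LIMITED_FEDERATION_MODE=true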

      I’ve considered writing my own “scraper” to generate statistics about Lemmy/Mastodon servers (most active users, voting rings, etc.), but ActivityPub is annoying enough to work with that I haven’t made time for it.
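
      For instance-level numbers you don’t even need ActivityPub: the NodeInfo endpoint most Fediverse software exposes is plain JSON over HTTPS. A rough sketch (the hosts are just examples, and per-user statistics like voting rings would still need the real APIs):

          import json
          import urllib.request

          def fetch_json(url: str) -> dict:
              with urllib.request.urlopen(url, timeout=10) as response:
                  return json.load(response)

          def nodeinfo(host: str) -> dict:
              # NodeInfo discovery: the well-known document links to the actual schema document.
              discovery = fetch_json(f"https://{host}/.well-known/nodeinfo")
              return fetch_json(discovery["links"][-1]["href"])

          for host in ["lemmy.world", "mastodon.social"]:  # example hosts only
              try:
                  info = nodeinfo(host)
                  software = info["software"]
                  users = info.get("usage", {}).get("users", {}).get("total", "?")
                  print(f'{host}: {software["name"]} {software["version"]}, ~{users} users')
              except Exception as error:  # a survey script should keep going on failures
                  print(f"{host}: failed ({error})")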

      As for the “firewall”, I’m thinking more broadly here; for example, I’d also include things like DNSBL support, authorised fetch for anything claiming to be Mastodon, origin detection to bypass activitypub-proxy, a WAF for detecting and reporting exploitation attempts, possibly something like SpamAssassin integration to reject certain messages, and maybe even a “Wireshark mode” for debugging Fediverse applications. I think a well-placed, well-optimised middlebox could help reduce the load on smaller instances, or even larger ones that see a lot of bot traffic, especially during spam waves like the one those Japanese kids caused last week.
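
      None of that exists as one project as far as I know, but the request path is simple enough to sketch; everything below is hypothetical naming, not an existing tool:

          from dataclasses import dataclass

          @dataclass
          class InboundRequest:
              source_ip: str
              path: str
              headers: dict[str, str]
              body: bytes = b""

          # Each check returns None to let the request through, or a human-readable rejection reason.
          def dnsbl_check(request: InboundRequest) -> str | None:
              blocked = {"203.0.113.7"}  # stand-in for a real DNSBL lookup
              return "source IP is on a blocklist" if request.source_ip in blocked else None

          def authorized_fetch_check(request: InboundRequest) -> str | None:
              # Stand-in for requiring signed fetches on actor/object endpoints.
              if request.path.startswith("/users/") and "Signature" not in request.headers:
                  return "unsigned fetch of an actor or object"
              return None

          def waf_check(request: InboundRequest) -> str | None:
              return "possible exploitation attempt" if b"<script>" in request.body else None

          PIPELINE = [dnsbl_check, authorized_fetch_check, waf_check]

          def handle(request: InboundRequest) -> str:
              for check in PIPELINE:
                  reason = check(request)
                  if reason is not None:
                      # Reporting, rate limiting, or a "wireshark mode" capture would hook in here.
                      return f"rejected: {reason}"
              return "forwarded to the instance"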