lemm.ee plans for mitigating image upload abuse

@sunaurus@lemm.ee · edit-2 1 year ago

@ButtDrugs@lemm.ee · 1 year ago

For step 6 - are you aware of the tooling the admin at dbzero has built to automate the scanning of images in Lemmy instances? It looks pretty promising.

@sunaurus@lemm.ee · 1 year ago

Yep, I’ve already tested it and it’s one of the options I am considering implementing for lemm.ee as well.

@PriorProject@lemmy.world · edit-2 1 year ago

It’s worth considering some commercially developed options as well: https://prostasia.org/blog/csam-filtering-options-compared/

The Cloudflare tool in particular is freely and widely available: https://blog.cloudflare.com/the-csam-scanning-tool/

I am no expert, but I’m quite skeptical of db0’s tool:

It repurposes a library designed for preventing the creation of synthetic CSAM using Stable Diffusion. This library is typically used in conjunction with prompt scanning and other inputs into the generation process. When run outside its normal context on non-AI images, it will lack all this input context, which I speculate reduces its effectiveness relative to the conditions under which it was tested and developed.
AI techniques live and die by the quality of the dataset used to train them. There is not and cannot be an open-source test dataset of CSAM upon which to train such a tool. One can attempt workarounds, like training separate classifiers and looking for coexisting features related to youth (trained from dataset A using non-sexualized images including children) and sexuality (trained separately from dataset B using images containing only adult performers)… but the efficacy of open source solutions is going to be hamstrung by the inability to train, test, and assess the effectiveness of the open tools. Developers of major commercial CSAM scanners are better able to partner with NCMEC and other groups fighting CSAM to assess the effectiveness of their tools.

I’m no expert, but my belief is that open tools are likely to be hamstrung permanently compared to the tools developed by big companies and the most effective solutions for Lemmy must integrate big company tools (or gov/nonprofit tools if they exist).

PS: Really impressed by your response plan. I hope the Lemmy world admins are watching this post; I know you all communicate and collaborate. Disabling image uploads is, I think, a very effective temporary response until detection and response tooling can be improved.

iquanyin · 1 year ago

you make some good points. this gave rise to a thought: seems like law enforcement would have such a data set and seems they should of course allow tools to be trained on it. seems but who knows? might be worth finding out.

@Cubes@lemm.ee · 1 year ago

Tbh I’m kind of surprised no government has set up a service themselves to deal with situations like this since law enforcement is always dealing with CSAM, and it seems like it’d make their job easier.

Plus with the flurry of hugely privacy-invading or anti-encryption legislation that shows up every few months under the guise of “protecting the children online”, it seems like that should be a top priority for them, right?! Right…?

@PriorProject@lemmy.world · edit-2 1 year ago

I replied to the parent comment here to say that governments HAVE set up CSAM detection services. I linked a review of them in my original comment.

They’ve set them up through commercial partnerships with technology companies… but that’s no accident. CSAM fighting orgs don’t have the tech reach of a major tech company so they ask for help there.
Those partnerships are limited to major/successful orgs… which makes it hard to participate as an OSS dev. But again, that’s on purpose, as the same access that would empower OSS devs to improve detection would enable CSAM producers to improve evasion. Secrecy is useful in this race, even if it has a high cost.

Plus with the flurry of hugely privacy-invading or anti-encryption legislation that shows up every few months under the guise of “protecting the children online”, it seems like that should be a top priority for them, right?! Right…?

This seems like inflammatory bait but I’ll bite once.

Improving CSAM detection is absolutely a top priority of these orgs, and in the last 10y the detection tools they’ve created with partners have expanded from scanning zero images to scanning hundreds of millions or billions of images annually. It’s a fairly massive success story even if it’s nowhere near perfect.
Building global internet infrastructure to scan all/most images posted to the internet is itself hugely privacy invading, even if it’s for a good cause. Nothing prevents law-makers from co-opting such infrastructure for less noble goals once it’s been created. Lemmy is in desperate need of help here, and CSAM detection tools are necessary in some form, but they are also very much scary privacy-invading tools that are subject to “think of the children” abuse.

@Cubes@lemm.ee · 1 year ago

Good info! Fwiw, I wasn’t intending for it to be “inflammatory bait”, but a jab at the congresspeople who use “for the children” as a way to sneak in bad legislation instead of actually doing things that could protect children.

@barsoap@lemm.ee · edit-2 1 year ago

If you have publicly available detection tools you can train models based on how well stuff they generate triggers those models, i.e. train an AI to generate CSAM (distillation in AI lingo). It also allows training of adversarial models which can imperceptibly change images to foil the detection tools. There’s no way to isolate knowledge and understanding so none of it is public and if you see public APIs they’re behind appropriate rate-limiting etc. so that you can’t use them for that purpose.
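To illustrate the rate-limiting point above, here is a minimal token-bucket sketch in Python. The class name, capacity, and refill rate are assumptions made up for illustration; nothing in the thread says how any real scanning API actually limits its callers. The idea is only that a caller who wants to probe a detector thousands of times gets cut off after a small burst.

```python
import time
from typing import Optional


class TokenBucket:
    """Hypothetical per-client limiter: bursts of up to `capacity` requests,
    refilled at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Spend one token and return True if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The optional `now` argument exists only so the behaviour can be checked with fixed timestamps; in normal use `allow()` reads the monotonic clock itself.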

@PriorProject@lemmy.world · 1 year ago

I’m not sure I follow the suggestion.

NCMEC, the US-based organization tasked with fighting CSAM, has already partnered with a list of groups to develop CSAM detection tools. I’ve already linked to an overview of the resulting toolsets in my original comment.
The datasets used to develop these tools are private, but that’s not an oversight. The datasets are… well… full of CSAM. Distributing them openly and without restriction would be contrary to NCMEC’s mission and to US law, so they limit the downside by partnering only with serious/capable partners who are able to commit significant resources to developing and maintaining detection tools over the long term, and who can sign onerous legal paperwork promising to appropriately handle the access they must be given to otherwise illegal material to do so.
CSAM detection tools are necessarily a cat and mouse game of CSAM producers attempting to evade detection vs detection experts trying to improve detection. In such a race, secrecy is a useful… if costly… tool. But as a result, NCMEC requires a certain amount of secrecy from their partners about how the detection tools work and who can run them in what circumstances. The goal of this secrecy is to prevent CSAM producers from developing test suites that allow them to repeatedly test image manipulation strategies that retain visual fidelity but thwart detection techniques.

All of which is to say…

… seems like law enforcement would have such a data set and seems they should of course allow tools to be trained on it. seems but who knows? might be worth finding out.

Law enforcement DOES have datasets, and DOES allow tools to be trained on them… I’ve linked the resulting tools. They do NOT allow randos direct access to the data or tools, which is a necessary precaution to prevent attackers from winning the circumvention race. A Red Hat or Mozilla scale organization might be able to partner with NCMEC or another organization to become a detection tooling partner, but db0, sunaurus, or the Lemmy devs likely cannot without the support of a large technology org with a proven track record of delivering and maintaining successful/impactful technology products. This has the big downside of making a true open-source detection tool more or less impossible… but that’s a well-understood tradeoff that CSAM-fighting orgs are not likely to change, as the same access that would empower OSS devs would empower CSAM producers. I’m not sure there’s anything more to find out in this regard.

iquanyin · 1 year ago

gtk. thanks.

@barsoap@lemm.ee · 1 year ago

The neat thing is that it’s all much easier as lemm.ee doesn’t allow porn: The filter can just nuke nudity with extreme prejudice, adult or not.

Franzia · 1 year ago

It seems promising but also incomplete for US hosts, as our laws do not allow deletion of CSAM; rather, it must be saved, preserved, and sent to a central authority, and not deleted until they give the okay. Rofl.

I also wonder if this solution will use pHash or other hashing to filter out known and unaltered CSAM images (by comparing hashes rather than the images themselves).
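For readers unfamiliar with the hashing idea raised above, here is a rough Python sketch of hash-based matching. It uses a simple average hash as a stand-in, not pHash proper and not whatever commercial scanners use; the function names, the `max_distance` default, and the assumption that images arrive as small pre-downscaled grayscale grids are all illustrative. The point is that only compact fingerprints are compared against a blocklist, never the images themselves.

```python
from typing import List, Sequence

# Grayscale pixels (0-255), assumed already downscaled to a tiny grid, e.g. 8x8.
Grid = Sequence[Sequence[int]]


def average_hash(pixels: Grid) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def matches_known(h: int, known_hashes: List[int], max_distance: int = 5) -> bool:
    """True if `h` is within `max_distance` bits of any hash on the blocklist."""
    return any(hamming(h, k) <= max_distance for k in known_hashes)
```

A uniformly brightened copy of an image produces the same average hash, which is why a small Hamming-distance tolerance can catch lightly altered copies while an exact cryptographic hash would not.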

iquanyin · 1 year ago

i didn’t quite know that and yet, it doesn’t surprise either.

@ApeNo1@lemm.ee · 1 year ago

I blocked botart from their instance as some pretty disturbing stuff was added in the last few days.

lemm.ee plans for mitigating image upload abuse

Hey folks!

What’s the problem?

What’s the solution?

For the immediate future, I am taking the following steps:

1) Image uploads are completely disabled for all users

2) All images which have federated in from other instances will be deleted from our servers, without any exception

3) I will apply a small patch to the Lemmy backend running on lemm.ee to prevent images from other instances from being downloaded to our servers
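As a rough illustration of what step 3 amounts to, here is a minimal Python sketch of a host check that refuses to cache images served from other instances. This is not the actual patch (Lemmy's backend is written in Rust, and the post does not show the patch); the function name, the default host, and the example URLs are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def should_store_locally(image_url: str, local_host: str = "lemm.ee") -> bool:
    """Hypothetical check: cache an image only if it is hosted on our own domain."""
    host = urlsplit(image_url).hostname  # None when the URL has no host part
    return host is not None and host.lower() == local_host.lower()
```

Comparing the parsed hostname exactly, rather than checking whether the URL string merely contains the local domain, avoids being fooled by look-alike hosts that embed the local name as a prefix.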

For the longer term, I have some further ideas:

4) Invite-based registrations

5) Account requirements for specific activities

6) Automated ML based NSFW scanning for all uploaded images
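A minimal sketch of how step 6 might gate uploads, assuming some classifier is available that returns an NSFW probability for an image. The `Scorer` type, the `review_upload` name, and the 0.5 threshold are hypothetical; the post does not say which model or cutoff would be used.

```python
from typing import Callable

# Hypothetical classifier interface: image bytes in, NSFW probability in [0, 1] out.
Scorer = Callable[[bytes], float]


def review_upload(image: bytes, score_nsfw: Scorer, reject_at: float = 0.5) -> str:
    """Return "reject" or "accept" for an uploaded image based on its score."""
    score = score_nsfw(image)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"scorer returned out-of-range value: {score}")
    return "reject" if score >= reject_at else "accept"
```

Keeping the classifier behind a plain callable means the model can be swapped out later without touching the upload path, and a lower `reject_at` trades more false positives for fewer misses.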