We're building a search engine to compete with DuckDuckGo. No JS, no WASM, no spying. Just a statically generated results page.

UnHidden@lemmy.world · 4 months ago

ExtremeDullard@lemmy.sdf.org · edit-2 4 months ago

I applaud your efforts and I admire your idealism.

Unfortunately, the minute you get the bill from your internet provider, you’ll need to find a way to pay for it, and your good intentions will instantly dissolve in the murky realities of modern corporate surveillance capitalism.

But at least while you haven’t gotten your first bill, it’s refreshing to watch your enthusiasm.

sugar_in_your_tea@sh.itjust.works · 4 months ago

pay for it

I wonder what a distributed search engine would look like. Basically, the index would be sharded across user computers, and queries would hit some representative sample of that index. This means:

hosting costs are very low - just need a way to proxy requests to the network
search times should improve as more people use the service
no risk of the service logging anything - individual nodes don’t need to know who requested the data, just who to send the response to
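To make the idea above a bit more concrete, here is a rough scatter-gather sketch: each node holds a slice of an inverted index, answers with only its local top results, and the proxy reduces those into one list. Everything in it (`ShardNode`, `scatter_gather`, the score values) is made up for illustration, and it ignores networking, trust, and ranking quality entirely.

```python
import heapq
from collections import defaultdict


class ShardNode:
    """One peer holding a slice of the inverted index (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {url: score}

    def add(self, term, url, score):
        self.postings[term][url] = score

    def search(self, terms, k):
        # Sum per-term scores for every URL this shard knows about.
        totals = defaultdict(float)
        for term in terms:
            for url, score in self.postings.get(term, {}).items():
                totals[url] += score
        # Return only the local top k, so little data crosses the network.
        return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


def scatter_gather(nodes, query, k=10):
    """Ask every sampled node for its top k, then reduce to a global top k."""
    terms = query.lower().split()
    merged = defaultdict(float)
    for node in nodes:
        for url, score in node.search(terms, k):
            # A URL may be indexed on several shards; keep its best score.
            merged[url] = max(merged[url], score)
    return heapq.nlargest(k, merged.items(), key=lambda item: item[1])
```

Note that the node only ever sees the query terms and a place to send results, which is the property the third point relies on. Whether a sample of nodes gives good enough recall is an open question this sketch does not answer.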

My biggest concern is how to build the index, but if OP is willing to share that, I might start hacking on a distributed version.

octopus_ink@lemmy.ml · 4 months ago

I wonder what a distributed search engine would look like.

Isn’t that what Searx is/can be?

https://en.wikipedia.org/wiki/Searx#Instances

I admit it’s not something I’ve looked closely at.

grue@lemmy.world · 4 months ago

No, Searx is a metasearch engine that queries and aggregates results from multiple normal search engines (Google, Bing, etc.)
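As a rough illustration of what "aggregates" can mean, here is a small sketch that merges ranked URL lists from several engines using reciprocal rank fusion. This is not Searx's actual ranking code; reciprocal rank fusion is just a common stand-in, and `fuse_rankings` and the constant `k=60` are assumptions for the example.

```python
from collections import defaultdict


def fuse_rankings(result_lists, k=60):
    """Merge several ranked URL lists into one using reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # A URL earns more the higher it ranks and the more engines list it.
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

The point is that a metasearch engine never needs its own crawler or index; it only reorders what the upstream engines return.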

A distributed search engine would be more like YaCy, which does its own crawling and stores the index as a distributed hash table shared across all instances.
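For anyone wondering what "index as a distributed hash table" looks like in miniature, here is a sketch that places search terms on a hash ring of peers. It is a generic consistent-hashing example, not YaCy's actual scheme; `TermRing`, the peer names, and the replica count are all hypothetical.

```python
import bisect
import hashlib


def _position(key):
    # Map any string to a point on a 2**32 ring.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:4], "big")


class TermRing:
    """Assigns each search term to the next `replicas` peers on a hash ring."""

    def __init__(self, peers, replicas=2):
        self.replicas = replicas
        self.ring = sorted((_position(peer), peer) for peer in peers)

    def peers_for(self, term):
        """Return the peers responsible for storing postings for `term`."""
        if not self.ring:
            return []
        # Find the first peer at or after the term's position, wrapping around.
        start = bisect.bisect_left(self.ring, (_position(term), ""))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]
```

The useful property is that a peer joining or leaving only moves the terms next to it on the ring, so the rest of the index stays where it is.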

sqw@lemmy.sdf.org · 4 months ago

i feel that decentralized search is an extremely valuable thing to start thinking about. but the devil is in practically every one of the details.

sugar_in_your_tea@sh.itjust.works · 4 months ago

Yup. Even if you trust all your peers (which isn’t reasonable), there are still a ton of practical issues that need to be resolved:

pagination with a different set of peers
moderation of CSAM and whatnot
outdated peers and stale data
how much data each peer returns, and where the results get reduced

It’s a really complex problem without getting p2p involved, and p2p just adds a ton of other problems.

So I’m probably going to stick with building my Reddit clone, which I think is simpler (search doesn’t need to happen at the start).

grue@lemmy.world · 4 months ago

Don’t start new; contribute to what already exists: https://en.wikipedia.org/wiki/YaCy

Waraugh@lemmy.dbzer0.com · 4 months ago

This is really neat and I’m just hearing about it after over twenty years of development. I need to try it out, thank you. How do you stay in the know about this kind of stuff? I’m curious about all the cool stuff out there I wouldn’t even know I’m curious to find.

grue@lemmy.world · 4 months ago

How do you stay in the know about this kind of stuff?

By being terminally online, I guess?

More concretely, I’ve spent (probably too much) time on Slashdot, Reddit and now Lemmy over the years (subscribed to Free Software and privacy-related communities in particular). Also, looking through sites like https://awesome-selfhosted.net/ and https://www.privacytools.io/, wiki-walking through articles about Free Software projects on Wikipedia, browsing the Debian repositories, etc.

I’m sure there are plenty of things I haven’t heard of either, though.

ElectroVagrant@lemmy.world · edit-2 4 months ago

How do you stay in the know about this kind of stuff? I’m curious about all the cool stuff out there I wouldn’t even know I’m curious to find.

I was going to mention YaCy as well if nobody else was, so I can chip in to this somewhat. My method is to keep wondering and researching. In this case it was a matter of being interested in alternative search engines and different applications of peer to peer/decentralized technologies that led me to finding this.

So from this you might go: take something you’re even passingly interested in, try to find more information about it, and follow whatever tangential trails it leads to. With rare exceptions, there are good chances someone out there on the internet will also have had some interest in whatever it is, asked about it, and written about it.

Also be willing to make throwaway accounts to get into the walled gardens for whatever info might be buried away there and, if you think others may be interested, share it outside of those spaces.

sugar_in_your_tea@sh.itjust.works · 4 months ago

Awesome! That’s pretty much exactly what I’m looking for, though I’m interested to see how easy it is to limit certain peers to certain functions. Not everyone has the resources to crawl and index pages, but a lot of people can store the index.

I’m interested in having client-side web storage, so you can participate in the network by just having the search page open (opt-in of course).

I’m honestly not actively working on it, but if OP provides the database and/or crawler, I’ll do some research on feasibility.

UnHidden@lemmy.world · 4 months ago

For now we’re going to host on residential connections, and if any ISPs ban us, we’ll just find other ISPs

fishos@lemmy.world · 4 months ago

Yeah, when you say stuff like this, it shows how woefully unprepared you are for the realities of this. You can’t scale, can’t self host for long, don’t see a way to pay for this… When I can already pay Kagi for a fully working, excellent service, why would I choose you? This is guaranteed to crash and burn the moment your ISP tells you you can’t run a commercial grade server through your residential connection. They’ll either cap your bandwidth to unusable levels or disconnect you entirely. If you’re lucky you’ll have 1 or 2 other options to choose from, who will blacklist you shortly after. Then, after you’ve burnt through all the “easy” ways to host, all you’ll be left with is professional grade services that you admit you can’t afford.

Also, you make zero mention of user privacy. So what happens when you get your first subpoena? Or before that, why should I trust you with my data in general? What policies do you have in place to ensure my legal rights are protected? Do you even know what the legal rights are per state/country and how the location of where someone connects from impacts you? How are you gonna handle visitors from the EU with GDPR?

Nifty idea, but way too much “I’m gonna single handedly reinvent the wheel” vibes.

pixelscript@lemmy.ml · 4 months ago

My thoughts exactly when reading this.

I believe people when they claim to develop free software. Often because it’s software the dev wants for themselves anyway and they’ve merely elected to share it rather than sell it. The only major cost is time to develop, which is “paid” for by the creation of the product itself.

You (OP) are proposing a service. Services have ongoing fees to run and maintain, and the value they create goes to your users, not you. These are by definition cost centers. You will need a stable source of funding to run this. That does not in any way mix with “free”. Not unless you’re some gajillionaire who pivoted to philanthropy after a life of robber baroning, or you’re relying on a fickle stream of donations and grants.

You indicate in other comments you will not open the source of your backend because you don’t want someone scooping it and stealing your future revenue. That’s fine, but what revenue? I thought this was free? What’s your business model?

It sounds like what you want to do here is have a free tier anyone can use, supported by a paid tier that offers extended features. That’s fine, I guess. But if you want to “compete with DuckDuckGo”, you are going to need to generate enough revenue to support the volume of freeloaders that DDG does. If your paid tier base doesn’t cover the bill, you will need to start finding new and exciting ways to passively monetize those non-revenue-generating users. That usually means one or more of taking features away and putting them behind the paywall to drive more subscriptions, increasingly invasive ads on the platform, or data-harvesting dark patterns.

Essentially what I’m saying here is that, as proposed, the eventual failure and/or enshittification of your service seems inevitable. Which makes it no better than DDG long term.

It is, at any rate, a very intriguing project.