Currently, the JSON index of our archive repo is 22M+ and growing. Parsing it with the php-fdroid library requires increasing the per-process memory_limit to 256M (with the default 128M it raises a “memory exhaustion” exception). While the archive holds many useful packages, many others could be considered obsolete – interesting only “for historical reasons”.
Hence I’d suggest a “cleanup” here, moving those “historical” parts to a more-or-less static place. I’m aware this isn’t a “quick task” but will rather be a process which, once established, would take place e.g. once per year. But let me give some details (“brain-storming”):
- move: NoSourceSince apps which are no longer working – e.g. because the service they were intended for no longer exists
- move: older APKs which no longer work for similar reasons (e.g. API changes of the given services; this would e.g. apply to apps like NewPipe)
- keep: the latest build per minSDK which is still working. These might be of interest for users stuck on older Android versions. Move the others when older than, say, 2 years (this could also depend on the number of builds existing for a given app)
- keep: maybe the latest N builds per app (similar to
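The “latest build per minSDK” rule above could be sketched roughly like this in Python. The field names `minSdkVersion` and `versionCode` follow the index entries, but treat them as assumptions to be checked against the real schema:

```python
def latest_per_min_sdk(builds):
    """Keep only the newest build for each minSdkVersion.

    Sketch only: assumes each build dict carries 'minSdkVersion' and
    'versionCode' (field names are assumptions; verify against the
    actual index format)."""
    best = {}
    for b in builds:
        sdk = b["minSdkVersion"]
        if sdk not in best or b["versionCode"] > best[sdk]["versionCode"]:
            best[sdk] = b
    # everything NOT selected here would be a candidate for moving out
    return sorted(best.values(), key=lambda b: b["versionCode"])

builds = [
    {"versionCode": 10, "minSdkVersion": 14},
    {"versionCode": 12, "minSdkVersion": 14},  # newer build for minSDK 14
    {"versionCode": 20, "minSdkVersion": 21},
]
kept = latest_per_min_sdk(builds)  # keeps versionCodes 12 and 20
```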
What do I hope to gain by this?
- minor: faster index builds. They don’t run that often, hence minor.
- major: decreasing update times on the clients, as well as the subsequent package parsing, including less resource consumption. Especially low-end devices will profit from this – as will users on bad connections or with limited data plans.
- medium: reduced load on the server due to the decreased download size of the archive index
Those “old APKs” could be moved to a separate repo we set up. Where entire (retired) apps are moved, all their metadata could move along with them. Where apps remain in the current repo/archive, their metadata could be synced (one-way, which can be automated).
The most difficult and labor-intensive task here will be identifying such APKs. It won’t be doable “just over a weekend”, so this would rather be a continuous process. Users could help with identification by reporting such apps when they encounter them (not only in the archive, by the way).
Ideas for the name of the repo-to-be-created: “F-Droid Graveyard” / “F-Droid Retired” / “F-Droid RestingPlace” / “F-Droid Historical”
Why do archived apps that have not worked for years need to be copied to a graveyard instead of just being deleted? I mean, if the only way to make them work again is to host a new server and rewrite parts of the app, one has to clone the code anyway – no need to keep binaries.
A justified question. That would of course be an option too, if not opposed. Some may cite “transparency”, “history” etc. – but I agree; personally, I see no big value in keeping those either. So whether to “graveyard” them or to “delete” them, I’m open to both approaches – as long as they are removed from the archive.
For history and transparency, if one insists on those terms in this context, we could also archive the corresponding build recipes (or simply mark them “disabled: no longer working”). If an entire app is no longer working (because its service died), it could be dealt with similarly (marking the entire app “disabled: no longer working because…” instead of marking each single build) – or we could even remove the corresponding *.yml altogether (it’s still retained in git history).
I don’t think I have a strong opinion here either way. There are good arguments for keeping everything ever distributed by F-Droid, for example for malware-research purposes, but on the other hand it gets really unwieldy, and it’s a burden to maintain it in addition to the repo. Then there’s a subset of apps that are interesting for users on old devices, because those can only run old versions.
The proposal sounds pretty solid, but as you said, sorting through that will be a huge effort and it’s not something I’d personally spend time on.
What kinds of archived apps are useful? I suggest moving those back to the main repo and no longer parsing the archive index.
Agree with Bubu’s points.
Well, I thought that goes along the lines of that MR you recently and magically merged without pushing the merge button… Sure, I don’t expect any of us to do this in a “concerted action”, spending several weeks on it in one block. Rather when stumbling upon such a candidate, or when a user reports one. The archive is coming close to a point where it gets hard to handle. So the core of my question is rather: what shall we keep? Things we don’t need to keep can either be simply removed when encountered – or moved to a graveyard (I’m fine with both).
As for automation: a bunch of candidates could easily be detected by counting their builds, and then counting again which of those are older than X. Even detecting the minSDK can be automated – and that automatism could then even suggest what can be “done away with”, possibly in the form of a (re)move script.
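A minimal sketch of such a candidate detector, assuming the v1-style index layout where `packages` maps an app ID to a list of APK entries with `versionCode` and a millisecond `added` timestamp (both field names are assumptions to verify against the real index):

```python
import time

TWO_YEARS = 2 * 365 * 24 * 3600  # age threshold in seconds

def archive_candidates(index, now=None, max_age=TWO_YEARS, keep_latest=3):
    """Suggest APKs that could be moved out of the archive.

    Illustrative sketch: the 'packages', 'added' (ms epoch) and
    'versionCode' fields mirror the v1 index layout but should be
    checked against the actual schema."""
    now = now or time.time()
    candidates = {}
    for app_id, builds in index["packages"].items():
        # newest first; always keep the latest few builds unconditionally
        builds = sorted(builds, key=lambda b: b["versionCode"], reverse=True)
        old = [b for b in builds[keep_latest:]
               if now - b["added"] / 1000 > max_age]
        if old:
            candidates[app_id] = [b["versionCode"] for b in old]
    return candidates

sample = {"packages": {"org.example.app": [
    {"versionCode": 1, "added": 1_500_000_000_000},
    {"versionCode": 2, "added": 1_550_000_000_000},
    {"versionCode": 3, "added": 1_690_000_000_000},
    {"versionCode": 4, "added": 1_695_000_000_000},
    {"versionCode": 5, "added": 1_699_000_000_000},
]}}
cands = archive_candidates(sample, now=1_700_000_000)
# → {"org.example.app": [2, 1]}  (old builds beyond the kept latest 3)
```

Such a script could then print its suggestions for human review rather than move anything automatically.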
Yes, and I’ve mentioned those – they should definitely be kept.
The ones just mentioned: versions for older devices which are still working (erm, well: both the versions and the devices).
E.g. apps that moved to higher minSDKs, or Fennec after the “OMG I don’t wanna Fenix” storm.
I think having a complete archive is important, so that’s something I’d like to keep. But of course, it has to be usable to be useful. Debian has https://snapshot.debian.org/, which holds every package ever published by Debian.
I think this is a good reason to move towards making the archive less of a full repo and purely an index of APKs. This could happen as part of an index-v2 effort: the archive’s index would contain only the bare minimum info on the APKs. Then there would be a separate file with descriptive metadata for the main repo, and another for apps in the archive that are not in the repo. This makes the index granular, so clients can parse only the chunks they need.
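To illustrate the split, here is a toy sketch that separates a monolithic index into a minimal APK-only file and a descriptive-metadata file. The file names and the `apps`/`packages` fields are purely illustrative, not the actual index-v2 format:

```python
def split_index(index):
    """Split a monolithic index into granular chunks (hypothetical
    layout): a bare APK list clients always need, and descriptive
    metadata they can skip."""
    minimal = {
        app: [{"versionCode": b["versionCode"], "hash": b.get("hash")}
              for b in builds]
        for app, builds in index["packages"].items()
    }
    descriptive = {app["packageName"]: app for app in index.get("apps", [])}
    return {"index-min.json": minimal, "metadata.json": descriptive}

sample = {
    "apps": [{"packageName": "org.example.app", "name": "Example",
              "summary": "Demo app"}],
    "packages": {"org.example.app": [{"versionCode": 7, "hash": "abc123"}]},
}
files = split_index(sample)
```

A client on a slow connection would then fetch and parse only `index-min.json` unless it needs descriptions.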
Also, parsing large JSON files with limited RAM is possible by doing streamed parsing, at least in Java and Python. Which reminds me: another problem with the large JSON files is that when old devices use the archive, they also get out-of-memory crashes.
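For the Python case, a minimal sketch of incremental decoding using only the stdlib `json.JSONDecoder.raw_decode`, which parses one array element at a time instead of materializing the whole tree (a real streaming parser such as ijson would additionally read the file in chunks rather than holding the text in one string):

```python
import json

def iter_packages(text):
    """Yield objects from a JSON array one at a time, without building
    the full parsed structure in memory (illustrative sketch)."""
    decoder = json.JSONDecoder()
    idx = text.index('[') + 1  # step inside the array
    while True:
        # skip whitespace and commas between array elements
        while idx < len(text) and text[idx] in ' \t\r\n,':
            idx += 1
        if idx >= len(text) or text[idx] == ']':
            return
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

# toy index; the "packages" array stands in for the large APK list
index = '{"packages": [{"name": "app.one"}, {"name": "app.two"}]}'
names = [p["name"] for p in iter_packages(index[index.index('"packages"'):])]
# → ["app.one", "app.two"]
```

The win here is that each decoded object can be processed and discarded before the next one is parsed, keeping peak memory roughly at one entry instead of the whole index.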
That sounds like a good plan. And as for “streamed parsing”, I’ll need to check if/how that would be possible in PHP as well, and maybe adjust the library. Hints are of course welcome.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.