app_match.R - find similar apps on f-droid by similarity of their descriptions


#1

Hi,

I’m currently working on a script that uses the descriptions available in https://gitlab.com/fdroid/fdroiddata/tree/master/metadata
to find the “closest” matches for any given app on F-Droid (I have made some progress since the thread “Does F-Droid provide any sort of stats?”).

The ‘main deliverable’ of the script would be output like this:

org.zeroxlab.zeroxbenchmark       "0xBenchmark"     2746 2127 733 1907 2418 2291 1218 875 942 2379 902 2588 1129 2676 2266 1053 1525 634 2889 2368 304 1064 878 36 2740 2337 2494 2681 1586 2311 46 649
io.github.lonamiwebs.klooni       "1010! Klooni"    1817 404 766 310 1050 2736 1073 976 2015 2115 2522 2761 825 2816 1345 2350 506 2856 2239 1847 2458 424 1845 1980 1310 982 1216 1552 2598 665 313 196
com.lucasdnd.bitclock16           "16-bit Clock Widget"  816 603 1633 817 1312 328 1995 157 297 1798 707 2347 383 1147 61 1538 1192 496 2964 144 1701 2047 405 818 2348 218 354 2823 1336 454 2352 2908
com.github.wakhub.tinyclock       "1x1 clock"       603 328 157 1995 1633 1798 383 707 1312 297 816 817 818 1147 1192 496 2964 1336 2047 274 218 405 1701 354 2078 2352 2348 473 1520 974 483 2347
com.uberspot.a2048                "2048"            1237 2356 1756 2856 1644 1602 2190 313 1057 2458 877 1362 196 1216 882 1983 1980 2349 615 665 1646 976 271 2003 2350 2115 14 2618 1345 1643 506 1307
com.traffar.a24game               "24game"          1218 1214 681 2616 2015 2368 2746 1630 285 2608 1643 1645 2831 550 1386 1652 2790 1647 394 196 950 828 2945 941 431 2598 1101 442 395 179 1151 798
info.staticfree.android.twentyfourhour  "24h Analog Clock"  1798 1381 157 603 328 1995 274 383 1312 1633 144 297 817 707 1147 816 818 1192 611 496 2347 1208 677 1336 2078 375 2352 121 405 354 1701 2348
com.twobuntu.twobuntu             "2buntu"          1231 759 1445 2444 372 981 2097 1677 1635 1963 1551 123 1943 2279 651 1836 1430 1076 1248 353 2035 830 1093 30 158 1582 1204 461 912 726 996 82
nerd.tuxmobil.fahrplan.camp       "30C3 Schedule (Camp)"  1993 1994 2136 1794 2421 2428 2036 2799 1333 919 1399 2582 2329 187 106 2521 1865 282 302 746 2802 607 2236 2684 2449 42 66 962 2650 2400 2320 1321
de.audioattack.yacy31c3search     "31c3 Search"     1375 1376 146 953 1395 1432 2846 304 2420 2511 1553 147 657 2917 176 1773 756 475 913 2883 1484 2689 2109 2677 981 1689 1994 223 739 2424 2589 2949

...

at.huber.sampleDownload           "YouTube Download"   "YouTube Download" "YouTube Cacher" "WebTube" "dentex.youtube.downloader.txt" "NetMBuddy" "MinTube" "YouTube Stream" "MusicPiped" "NewPipe" "SkyTube"
com.drodin.zxdroid                "ZXdroid"          "ZXdroid" "Oscilloscope" "J2ME Loader" "Tuner" "L9Droid" "Hexiano" "yaft" "Heading Calculator" "Chip8" "RF Analyzer"
com.googlecode.tcime              "注音倉頡輸入法"   "注音倉頡輸入法" "Visualizer" "Dodge" "Simple Keyboard" "Tibetan Keyboard" "WiFiKeyboard" "Scandinavian keyboard" "TV-Browser" "TaigIME" "BeHe Keyboard"
com.android.inputmethod.pinyin    "谷歌拼音输入法"   "谷歌拼音输入法" "TaigIME" "Sophia keyboard" "OpenWnn" "Presentation" "Changjie Input Method" "OpenWnn Legacy" "四次元" "Bluez IME" "Puff"

This might look huge, but it could be boiled down to 32 indexes of 12 bits each (48 bytes) per app. So a table with 32 suggestions per app could be stored on the smartphone and used while scrolling through the available apps.
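The arithmetic (32 × 12 bits = 384 bits = 48 bytes) can be checked with a small sketch. The packing layout below (two indexes per three bytes) and the function names are assumptions chosen for illustration; the post does not say how the table would actually be stored.

```python
def pack_indexes(indexes):
    """Pack 12-bit integers (0..4095) into bytes, two indexes per three bytes."""
    if len(indexes) % 2:
        raise ValueError("need an even number of indexes")
    out = bytearray()
    for a, b in zip(indexes[0::2], indexes[1::2]):
        if not (0 <= a < 4096 and 0 <= b < 4096):
            raise ValueError("index does not fit into 12 bits")
        # byte 1: high 8 bits of a; byte 2: low 4 bits of a + high 4 bits of b;
        # byte 3: low 8 bits of b
        out += bytes([a >> 4, ((a & 0x0F) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)


def unpack_indexes(blob):
    """Inverse of pack_indexes."""
    indexes = []
    for i in range(0, len(blob), 3):
        x, y, z = blob[i:i + 3]
        indexes.append((x << 4) | (y >> 4))
        indexes.append(((y & 0x0F) << 8) | z)
    return indexes


# The 32 suggestions listed for "0xBenchmark" in the sample output above
row = [2746, 2127, 733, 1907, 2418, 2291, 1218, 875, 942, 2379, 902, 2588,
       1129, 2676, 2266, 1053, 1525, 634, 2889, 2368, 304, 1064, 878, 36,
       2740, 2337, 2494, 2681, 1586, 2311, 46, 649]
packed = pack_indexes(row)
print(len(packed))  # size in bytes of one packed row
```

Twelve bits cover index values up to 4095, which is enough for the 2965 apps mentioned further down.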

While calculating the best matches, the script can also generate diagrams of the keywords (much more shiny :)),
so one can actually get an idea of the grounds on which the algorithm suggests and ranks the apps.

Attached is the heatmap (as PNG; SVG was not allowed) for one of the apps (with a good description) listed in https://gitlab.com/fdroid/fdroiddata/blob/master/stats/latestapps.txt

In the context of this app, the term “password” is probably most interesting to the user.
In this case the algorithm works particularly well: only one app has a blank (white) cell in the corresponding column (it scores on “generat” and “uniqu” instead).

The app description in fdroiddata is:

"Summary: Generate unique passwords for your accounts based on a master password" "Description: LessPass is a stateless password manager. It derives a site, a login and a master password to generate a unique password. You don't need to sync your password vault across every device."
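To give a rough idea of how descriptions can be ranked by shared terms, here is a minimal sketch of bag-of-words cosine similarity. It is not the algorithm of app_match.R (which is written in R and whose details are not given in this thread); the stop-word list, the function names and the two comparison texts are made up for illustration, and stemming (which the truncated terms “generat” and “uniqu” hint at) is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the one the script uses)
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "on", "that", "the", "to", "your"}


def term_counts(text):
    """Lower-case the text, split it into words and count the non-stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 .. 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Summary and first description sentence of LessPass, as quoted above
lesspass = term_counts(
    "Generate unique passwords for your accounts based on a master password. "
    "LessPass is a stateless password manager.")
# The next two texts are invented for this example
vault = term_counts("Stores passwords in an encrypted vault protected by a master password")
clock = term_counts("A simple clock widget for the home screen")

print(cosine_similarity(lesspass, vault))  # shares "password", "passwords", "master"
print(cosine_similarity(lesspass, clock))  # no shared terms
```

Sorting all other apps by this score, highest first, would give a suggestion list of the kind shown at the top of this post.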

If there’s interest (website, TWIF, f-droid.apk, g-droid.apk, other), I’ll happily publish the script on GitLab.

The script currently uses data of 2965 apps. Is there an easily accessible list with apps that are stale or where the current versions are not built?

Greetings,


#2

Maybe you could parse index.xml/json instead?


#3

good idea - will try:)


#4

Sounds interesting. You should use the descriptions from the index file;
you’ll find a lot more there, and they are easier to consume since they
are all in a single JSON file inside of a signed JAR:

https://f-droid.org/repo/index-v1.jar

from fdroidserver import index  # the index submodule has to be imported explicitly

data = index.download_repo_index(
     'https://f-droid.org/repo/index-v1.jar')
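For readers without fdroidserver installed: since a JAR is a ZIP archive, the JSON can also be read with the Python standard library alone. This is only a sketch; the function name is made up, and unlike a proper client it does NOT verify the JAR signature.

```python
import io
import json
import zipfile


def read_index_from_jar(jar_bytes):
    """Return the parsed index-v1.json contained in the bytes of an index-v1.jar.

    NOTE: the JAR signature is not checked here.
    """
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        with jar.open("index-v1.json") as f:
            return json.load(f)
```

The bytes could be fetched with, for example, `urllib.request.urlopen(url).read()`, or read from a previously downloaded copy of the file.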

#5

The script is available now at:

@hans Currently index-v1.jar is only used to select the apps within fdroiddata. While reading the JSON file itself was fine, I got stuck converting it to the data format that the tm package uses, so I published the script on GitLab as is.

Setting up might be a bit tedious, but after that the script needs only about 5 minutes for the ~1750 diagrams (100 MByte) and the sorted list.

I’d be happy if someone gives the script a run:)


#6

That script will be needed in G-Droid. G-Droid already has the infrastructure for it and shows similar apps in the client. Currently there is only a sloppy Python/bash mix that runs Levenshtein on the app summaries, which you can see here: https://gitlab.com/gdroid/gdroiddata/blob/master/env/process_meta_metric.py . Each app comes with a small list of neighbours, which is stored in a JSON file.
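For comparison, this is roughly what a Levenshtein-based neighbour search on summaries looks like. It is a generic textbook sketch, not the code from process_meta_metric.py; the function names and the `summaries` mapping are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def nearest(app_id, summaries, n=3):
    """Return the n app ids whose summaries have the smallest edit distance."""
    target = summaries[app_id]
    others = [other for other in summaries if other != app_id]
    return sorted(others, key=lambda other: levenshtein(target, summaries[other]))[:n]
```

Edit distance compares strings character by character, so two summaries that use the same keywords in a different order can still end up far apart; a term-based comparison of the full descriptions does not have that problem.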

Can you make the script run in a normal Linux environment? E.g. can the R packages be installed automatically via CLI? Your script could be very valuable. Not the PNGs, only the part that produces the text file.

Can you add the currently resulting text file to the Git repo as well?


#7

I ran the script on my VM. It works nicely. It throws quite a few errors about UTF-8 and Russian names. And I was wondering: why does it need the metadata repo? All the info is in the JSON file, right?


#8

Should be possible, yes.

Yes, the PNGs are (and will continue to be) useless on the phone. And on the website they won’t be overly useful because they would not contain embedded links. (I like them nevertheless).

Done. And while at it, I added PNGs for Oeffi and 34c3 Fahrplan (CCC congress schedule).

Great! (The font errors show up here too.)

Longer response: when I started, I considered the metadata repo the primary source (and getting it via git seemed nice too). The plan is to switch to the JSON file.
But I couldn’t get it to work with the tm package, and my tries with another package didn’t work either (probably related to “description” being available in different places: at “apps”[{“description”}] for non-localized apps, and two levels deeper in the hierarchy for localized apps: “apps”[{“localized”{“en”{“description”}}}]).
When I got stuck on that, and seeing that I would not have much time over the next few days, I decided to open-source the script while it still uses fdroiddata as input.
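The two locations described above can be handled with a small fallback lookup. This is a sketch in Python rather than R; the function name, the locale order and the two sample entries are assumptions made for illustration, and only the key names “description”, “summary” and “localized” are taken from the thread.

```python
def english_text(app, field="description", locales=("en-US", "en")):
    """Return app[field], falling back to the localized entries if it is missing."""
    value = app.get(field)
    if value:
        return value
    localized = app.get("localized", {})
    for locale in locales:
        value = localized.get(locale, {}).get(field)
        if value:
            return value
    return None


# Two made-up entries shaped like the elements of the "apps" list
plain = {"packageName": "org.example.plain", "description": "Top-level text"}
nested = {"packageName": "org.example.nested",
          "localized": {"en-US": {"description": "Localized text"}}}

print(english_text(plain))   # taken from the top level
print(english_text(nested))  # taken from localized -> en-US
```

Flattening every app to a single string this way first should make the later conversion into a corpus independent of where the text was stored.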


#9

I have just downloaded the example result from https://gitlab.com/frief/app_match.r/raw/master/sample-out/app_match.txt and looked at those results that relate to apps that deal with photo/jpg:

My first impression: you often get word similarities between apps from the same author. E.g.
“Gallery” (aka com.simplemobiletools.gallery) is similar to “Contacts”, “Calendar”, “Notes” and “File Manager”: all apps from the same author, but from completely different problem domains.

My own photo gallery app “A Photo Manager” (aka de.k3b.android.androFotoFinder) is not similar enough to the other jpg/photo/gallery apps “LeafPic”, “Gallery”, “us.koller.cameraroll.txt”, “Secure photo viewer” and “Hubble Gallery”.

Maybe the similarity algorithm should have an exclusion rule for apps from the same Java namespace (i.e.
de.k3b.android.androFotoFinder [A Photo Manager] is not similar to de.k3b.android.toGoZip [ToGoZip], because both app namespaces start with the same “de.k3b.android”).
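Such an exclusion rule could look like the following sketch. The function name and the default of three leading segments are assumptions; the thread does not specify how long the common prefix has to be.

```python
def same_namespace(app_id_a, app_id_b, segments=3):
    """True if both application ids start with the same `segments` dotted parts."""
    a = app_id_a.split(".")[:segments]
    b = app_id_b.split(".")[:segments]
    return len(a) == segments and a == b


print(same_namespace("de.k3b.android.androFotoFinder", "de.k3b.android.toGoZip"))
print(same_namespace("de.k3b.android.intentintercept",
                     "uk.co.ashtonbrsc.android.intentintercept"))
```

The prefix length is a trade-off: the com.simplemobiletools apps mentioned above only share two segments, but with `segments=2` unrelated apps under a common hosting prefix such as com.github would be treated as the same author too.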


#10

app_match.R uses the data available in


to do the matching (the description and the summary fields).
Unfortunately those fields are not present in the file (so the algorithm will match more or less on random things).

This will be fixed when either

  1. the (good!) “en-US” description(*) which is available in the json file finds its way into metadata/de.k3b.android.androFotoFinder.txt (this is beyond my scope) or
  2. app_match.R uses the json file instead of reading the metadata directory (this is part of the plan but not done yet).

Thanks for pointing to this.

(*) 10 localizations available, wow!


#11

Thank you for taking the time to analyse this further.

I was not aware that your algorithm collects descriptions from different possible sources itself
(https://gitlab.com/fdroid/fdroiddata/blob/master/metadata/xxx.txt, https://gitlab.com/fdroid/fdroiddata/blob/master/metadata/xxx.yml, …/fastlane/…/description.txt, …)

Once all summary/description data comes from the JSON inside https://f-droid.org/repo/index-v1.jar,
the quality should improve for apps that have their texts outside the metadata.

To solve the same-author/different-domain issue, maybe only apps that are in the same F-Droid category should be compared:

  • so that de.k3b.android.androFotoFinder (Categories:Multimedia,Graphics) is not similar to de.k3b.android.toGoZip (Categories:System)
  • so that de.k3b.android.intentintercept (Categories:Development) and uk.co.ashtonbrsc.android.intentintercept (Categories:Development) are similar
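The category rule from the two bullets above can be sketched as a weight applied to whatever text similarity score is in use. The function names and the `penalty` parameter are assumptions: `penalty=0.0` gives the strict rule proposed here, while a value between 0 and 1 only ranks cross-category matches lower.

```python
def category_weight(cats_a, cats_b, penalty=0.0):
    """1.0 if the apps share at least one category, otherwise `penalty`."""
    return 1.0 if set(cats_a) & set(cats_b) else penalty


def weighted_score(text_similarity, cats_a, cats_b, penalty=0.0):
    """Scale a text similarity score by the category weight."""
    return text_similarity * category_weight(cats_a, cats_b, penalty)


# Categories as listed in the two bullets above; the 0.8 score is made up
print(weighted_score(0.8, ["Multimedia", "Graphics"], ["System"]))  # excluded
print(weighted_score(0.8, ["Development"], ["Development"]))        # unchanged
```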

#12

app_match.r has been forked

The output is now in YAML format and contains the ids of the apps. Also, a GitLab runner has been set up to run the thing nightly and commit the new output to the repo.

I have a local feature branch to add it to the G-Droid data, but it is not perfect yet. Issues I still see:

Some apps don’t have descriptions, because they rely on fastlane. Therefore it is still necessary to fetch the descriptions from the JSON file.

And then there are some minor issues: some apps are only in French, while the stop-words being used are English. Also, French apps are now most similar to other French apps. There are not many of them, so it is not a big issue for now.

In order to fix the metadata problem, I’d suggest writing a Python script that puts all localized (English) descriptions into the correct JSON description field. That way the author of app_match.R can pick it up correctly.


#13

Hi thingy,

compliments, great fork! Having the gitlab runner is huge!

I’ll sync to your changes (omitting the commits of the G-Droid Bot) and build on that.

No need to (next version of the script will be fine with the JSON as is), but probably good nevertheless (too much choice in metadata).


#14

Hi k3b,

app_match.R now solely uses the data from index-v1.json in index-v1.jar.

With that data the matching for de.k3b.android.androFotoFinder yields better results:


(App categories are used by the matching algorithm: apps with different categories will match with a lower rating. There is no hard exclude; after all, apps of the same author are similar in some sense :)

and for the same app a diagram with fewer terms and graphics and more text:

First matches for uk.co.ashtonbrsc.android.intentintercept now are:
de.k3b.android.intentintercept
org.smblott.intentradio
org.schabi.openhitboxstreams
org.billthefarmer.shorty
org.surrel.messengerbypasser
com.manichord.mgit
com.android.adbkeyboard
com.serwylo.lexica
de.k3b.android.contentproviderhelper


#15

heads up

app_match.R is being used in the latest release of G-Droid (version 0.6.2). You’ll see it as soon as it is picked up by the F-Droid update procedure in a few days.

stay tuned.