
A glimpse of the demo app using glove-android. The first and third images (from left to right) depict the ‘compare words’ feature, which computes the cosine similarity between two words. The second image shows embedding generation in action.

 

glove-android is an Android library that provides a clean interface to GloVe word embeddings, which have been quite popular in NLP applications. Word embeddings can be used to measure the semantic similarity between two words, as similar words would have embeddings (high-dimensional vectors) closer to each other.

Currently, the only supported embeddings are 50D GloVe vectors trained on the Wikipedia corpus. This story explains how developers can add glove-android to their Android projects, how the library works internally, and what its limitations are. Here’s the GitHub repo ->

GitHub – shubham0204/glove-android: Power of GloVe word embeddings in Android

What are GloVe word embeddings?

Word embeddings are high-dimensional vectors (lists of numbers) generated for each word present in a huge text corpus. These vectors are produced such that the vectors of two words with high semantic similarity lie close to each other in the embedding space.

To train the GloVe model, a word-word co-occurrence matrix is built from the corpus. Its (i, j)-th entry counts how often the i-th and j-th words occur together within a context window.

An illustration of word embeddings in the embedding space. The words ‘king’ and ‘queen’ are related contextually and hence point (nearly) in the same direction, establishing high semantic similarity. ‘Ice’ is an unrelated word and does not lie in the proximity of the other two vectors.

The GloVe model is trained in such a way that similar words, i.e. words with high co-occurrence, lie near each other. We can calculate the cosine of the angle between two embeddings: the closer the value is to 1, the more semantically related the words are. A value close to -1 means the vectors point in opposite directions, indicating that the words are highly dissimilar.
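
To make the cosine measure concrete, here is a minimal Python sketch. The three 3-dimensional vectors are made-up toy values, not real GloVe embeddings, and `cosine_similarity` is an illustrative helper rather than a part of glove-android.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector has no direction
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values chosen for illustration only)
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
ice = [-0.7, 0.1, 0.9]

king_queen = cosine_similarity(king, queen)  # close to 1: similar direction
king_ice = cosine_similarity(king, ice)      # negative: pointing away from each other
print(king_queen, king_ice)
```

With these toy values, the first score comes out close to 1 and the second is negative, matching the picture described above.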

Adding glove-android to an existing project

Developers can use the AAR of the library, found in the Releases section of the repository. Download the AAR from the latest release and place it in the app/libs folder of the app.

Next, we need to inform Gradle about this AAR as it has to be included in the build. In the module-level build.gradle file, specifically, in the dependencies block, add,

dependencies {
    ...
    implementation files('libs/glove-android.aar')
    ...
}

Sync the Gradle files and build the project. You should be ready to use glove-android in your project now. If you’re facing any issues with the installation, do open an issue on the repository.

Using glove-android with Kotlin

The word embeddings are loaded from a file packaged within the library, so no network calls are needed to fetch them. They are stored in an H5 file, which takes some time to load because of its size (~40 MB). To load the embeddings into memory, we use the GloVe.loadEmbeddings method, which is a suspend function and hence needs a CoroutineScope for execution.

The method takes a callback of type (GloVeEmbeddings) -> Unit, which receives an object of class GloVeEmbeddings through which developers can access the word embeddings synchronously.

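import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
// Also import the GloVe class from the glove-android AAR

class MainActivity : ComponentActivity() {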

    private var gloveEmbeddings : GloVe.GloVeEmbeddings? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        setContent {
            // Activity UI here
        }

        // GloVe.loadEmbeddings is a suspendable function
        // We need a coroutine scope to handle its execution
        // off the main thread
        CoroutineScope( Dispatchers.IO ).launch {
            GloVe.loadEmbeddings { it ->
                gloveEmbeddings = it
            }
        }

    }

}

Next, we can use the gloveEmbeddings object to retrieve embeddings for any word,

val embedding1 = gloveEmbeddings!!.getEmbedding( "king" )
val embedding2 = gloveEmbeddings!!.getEmbedding( "queen" )
if( embedding1.isNotEmpty() && embedding2.isNotEmpty()) {
    result = GloVe.compare( embedding1 , embedding2 ).toString()
}

If an embedding isn’t found, the getEmbedding method returns an empty float array, hence we check isNotEmpty() on both embeddings before comparing them.

GloVe.compare takes two embeddings of type FloatArray and returns their cosine similarity.
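
For two embeddings $a$ and $b$ with $n$ components each, the standard definition of cosine similarity is shown below. The symbols $a$, $b$, $n$ and $\theta$ are notation introduced here for illustration.

```latex
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
             = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \; \sqrt{\sum_{i=1}^{n} b_i^2}}
```

The result always lies between -1 and 1, where $\theta$ is the angle between the two vectors.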

Limitation — Increase in app’s package size

A limitation of glove-android is that it increases the host app’s package size considerably. The 50D GloVe embeddings are packaged into the library and therefore become part of the app’s package. glove-android also reads the H5 file through Chaquopy, which is bundled as a dependency and adds further to the app’s size.

How does glove-android work internally?

A glance at the official GloVe website, where the embeddings are available for download as text files, shows how large those files are. The embeddings used by glove-android are 50D vectors (the smallest dimension available) trained on the Wikipedia 2014 dataset containing 6 billion tokens, and the text file is 167 MB, which is too large to add as-is to the app’s assets. Apart from file compression, fast retrieval is also needed, as scanning the entire vocabulary for every word lookup would take a lot of time. To solve these problems, glove-android uses the following techniques,

  • Storing the embeddings in H5 format as multi-dimensional arrays
  • Reduction of floating point precision: from 32-bit precision to 16-bit precision
  • Storing the word-index mapping as a hash-table for near-constant time retrieval. Here ‘index’ refers to the position of the embedding in the multi-dimensional array.

The H5 format is a highly efficient file format for storing multi-dimensional arrays. Further, the precision of the embeddings is reduced to float16, which halves the storage needed per value compared to float32. This might affect the accuracy of the similarity scores slightly, as the precision is reduced.
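
A rough back-of-the-envelope check of the saving can be sketched in Python. It assumes the 400,000-word vocabulary of the glove.6B release and ignores any H5 container overhead. The round trip through the half-precision format of `struct` (format code `'e'`) only illustrates the rounding involved, and the sample value is arbitrary.

```python
import struct

VOCAB_SIZE = 400_000   # vocabulary size of the glove.6B release (assumption of this sketch)
DIMENSIONS = 50        # 50D vectors, as used by glove-android

def raw_size_mb(bytes_per_value):
    """Size of the bare embedding matrix in megabytes (10^6 bytes), without file overhead."""
    return VOCAB_SIZE * DIMENSIONS * bytes_per_value / 1_000_000

size_float32 = raw_size_mb(4)   # 32-bit floats take 4 bytes each
size_float16 = raw_size_mb(2)   # 16-bit floats take 2 bytes each
print(size_float32, size_float16)

# Round-trip an arbitrary value through IEEE 754 half precision to see the rounding
original = 0.418
as_float16 = struct.unpack("<e", struct.pack("<e", original))[0]
rounding_error = abs(original - as_float16)
print(as_float16, rounding_error)
```

Under these assumptions, the float16 matrix needs about 40 MB against about 80 MB for float32, which lines up with the ~40 MB file size mentioned earlier.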

The word embeddings are stored in the H5 format, but how do we know that the embedding for a particular word lies at a specific index? We need to maintain a word-index mapping, which is stored as a dict in Python. Given a word, which is the ‘key’, we look up the corresponding ‘value’, which is the index of the embedding in the 2D array stored in the H5 file. This technique provides efficient storage and near-constant time retrieval.
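
The lookup can be sketched in a few lines of plain Python. The tiny vocabulary, the 3-dimensional rows and the helper name `get_embedding` are hypothetical stand-ins for the pickled dict and the 2D array in the H5 file. Returning an empty list for unknown words mirrors the behaviour of getEmbedding described earlier.

```python
# Hypothetical miniature of the two structures: word -> row index, and rows of vectors
word_to_index = {"king": 0, "queen": 1, "ice": 2}
vectors = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [-0.7, 0.1, 0.9],
]

def get_embedding(word):
    """Return the vector for `word`, or an empty list if the word is not in the vocabulary."""
    index = word_to_index.get(word)   # hash lookup, constant time on average
    if index is None:
        return []
    return vectors[index]             # direct row access by position

print(get_embedding("queen"))
print(get_embedding("android"))
```

Neither step depends on the size of the vocabulary, which is why retrieval stays near-constant time even with hundreds of thousands of words.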

import pickle

import h5py
import numpy as np

# A raw string keeps the backslashes in the Windows path literal
glove_path = r"glove.6B\glove.6B\glove.6B.50d.txt"

words = {}       # word -> row index in the embeddings array
embeddings = []  # one list of 50 floats per word

with open( glove_path , "r" , encoding="utf-8" ) as glove_file:
    for count , line in enumerate( glove_file ):
        parts = line.strip().split()
        word = parts[0]
        embedding = [ float( value ) for value in parts[ 1 : 51 ] ]
        words[ word ] = count
        embeddings.append( embedding )

print( "Words processed" , len( words ) )

# Store the vectors as a 2D float16 array in the H5 file
embeddings = np.array( embeddings ).astype( 'float16' )
with h5py.File( "glove_vectors_50d.h5" , "w" ) as hf:
    hf.create_dataset( "glove_vectors" , data=embeddings )

# Persist the word -> index mapping alongside the H5 file
with open( "glove_words_50d.pkl" , "wb" ) as file:
    pickle.dump( words , file )

There’s another Python script which reads the H5 file and the pickled dict and is executed in the Android app using Chaquopy.
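
That reader script is not shown here. As a rough idea of what such a script could look like, the sketch below reads the two files written above with h5py and pickle. The function names `load_glove` and `get_embedding` are assumptions made for illustration, not the library's actual code, and the sketch needs h5py and numpy to be installed.

```python
import pickle

import h5py
import numpy as np

def load_glove(h5_path, pkl_path):
    """Load the float16 embedding matrix and the word -> index dict from disk."""
    with h5py.File(h5_path, "r") as hf:
        vectors = hf["glove_vectors"][:]   # read the full 2D array into memory
    with open(pkl_path, "rb") as file:
        word_to_index = pickle.load(file)
    return vectors, word_to_index

def get_embedding(vectors, word_to_index, word):
    """Return the vector for `word` as float32, or an empty array for unknown words."""
    index = word_to_index.get(word)
    if index is None:
        return np.array([], dtype="float32")
    return vectors[index].astype("float32")

# Usage, assuming the files produced by the script above exist:
# vectors, word_to_index = load_glove("glove_vectors_50d.h5", "glove_words_50d.pkl")
# king = get_embedding(vectors, word_to_index, "king")
```

Reading the whole array once keeps later lookups fast, at the cost of holding the full matrix in memory.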

Chaquopy is an Android library which is used to run Python scripts in Android apps. Here’s a blog, if you wish to learn more,

Chaquopy: Using Python In Android Apps

Hope you’ll try glove-android

glove-android is a tiny component which can add a great feature to Android apps. I hope you’ll try it in your projects and share your feedback on the Issues or Discussions page on GitHub. Thanks for reading, and have a nice day ahead!

This article was previously published on proandroiddev.com
