Memri Person-Deduplication Plugin

A plugin that compares existing Person items in a Memri PoD and merges, or offers to merge, the items whose similarity score exceeds a threshold.

About

This plugin is to be used by a Memri PoD. For more information, see Memri.

It merges existing Person items, or suggests merging them, by comparing them with a similarity score function.

Similarity score

The similarity score measures how alike two Person items are, and it is the main parameter that drives this plugin. It is calculated by the merger.similarity.calculate_pair_score method.
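As an illustration of what a weighted pair score can look like, here is a minimal sketch. It is not the code of merger.similarity.calculate_pair_score: the field names, the weights, and the pair_score function are all hypothetical, and items are represented as plain dicts rather than pymemri objects.

```python
from typing import Dict

# Hypothetical field weights; the weights used by the plugin may differ.
WEIGHTS: Dict[str, float] = {"email": 0.5, "phone": 0.3, "name": 0.2}


def pair_score(a: dict, b: dict, weights: Dict[str, float] = WEIGHTS) -> float:
    """Sum the weights of the fields where both items hold the same non-empty value."""
    score = 0.0
    for field, weight in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        # Compare case-insensitively and ignore surrounding whitespace.
        if value_a and value_b and str(value_a).strip().lower() == str(value_b).strip().lower():
            score += weight
    return score


alice_1 = {"name": "Alice Smith", "email": "alice@example.com"}
alice_2 = {"name": "alice smith", "email": "alice@example.com", "phone": "555-0100"}
print(pair_score(alice_1, alice_2))
```

Here the two items agree on email and name, so only those two weights contribute; the phone field is skipped because one item lacks it.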

Person pair merging strategy

Steps:

  • Get all combinations of existing items
  • Apply a basic LSH (locality-sensitive hashing) step over the pairs to get candidate pairs (this matters as the data size grows). It simply looks for at least one intersection over the significant fields.
  • Calculate a score table of the form {'{item_1.id}': [{'id': '{item_2.id}', 'score': {item_1_2_score}}]} by using weights
  • Extract merge and suggestion (more-certain/less-certain) candidates using thresholds
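The first three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the plugin's implementation: items are plain dicts, SIGNIFICANT_FIELDS is a hypothetical field list, the candidate filter is a plain "at least one shared value" check rather than a hashing scheme, and shared_fields is a stand-in score function.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical list of fields considered significant for matching.
SIGNIFICANT_FIELDS = ("email", "phone", "name")


def candidate_pairs(items):
    """Yield the pairs that share at least one non-empty value on a significant field."""
    for a, b in combinations(items, 2):
        if any(a.get(f) and a.get(f) == b.get(f) for f in SIGNIFICANT_FIELDS):
            yield a, b


def score_table(items, score_fn):
    """Build {item_1_id: [{'id': item_2_id, 'score': score}, ...]} for candidate pairs."""
    table = defaultdict(list)
    for a, b in candidate_pairs(items):
        table[a["id"]].append({"id": b["id"], "score": score_fn(a, b)})
    return dict(table)


def shared_fields(a, b):
    """Stand-in score: the number of significant fields with equal values."""
    return float(sum(1 for f in SIGNIFICANT_FIELDS if a.get(f) and a.get(f) == b.get(f)))


people = [
    {"id": "1", "name": "Alice", "email": "a@example.com"},
    {"id": "2", "name": "Alice", "email": "a@example.com"},
    {"id": "3", "name": "Bob", "email": "b@example.com"},
]
table = score_table(people, shared_fields)
print(table)
```

Because candidate_pairs is a generator, pairs are scored one at a time instead of being held in memory together, which helps as the number of items grows.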

Pairs with a similarity score above 1.0 are merged automatically. The newly created merged item is connected to the pair by mergedFrom edges, while the predecessors are marked as deleted.

Pairs with a score below 1.0 and above 0.8 are suggested to the user for merging by creating an item named SuggestedMergePerson.
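A minimal sketch of how the two thresholds could split a score table into merge and suggestion candidates. The classify function and the constant names are hypothetical, and the treatment of a score of exactly 1.0 (here: suggested rather than merged) is an assumption, since the description only covers scores above or below it.

```python
MERGE_THRESHOLD = 1.0    # scores above this are merged automatically
SUGGEST_THRESHOLD = 0.8  # scores above this (but not merged) become suggestions


def classify(table, merge_threshold=MERGE_THRESHOLD, suggest_threshold=SUGGEST_THRESHOLD):
    """Split a score table into (merges, suggestions) lists of (id_1, id_2, score)."""
    merges, suggestions = [], []
    for first_id, matches in table.items():
        for match in matches:
            pair = (first_id, match["id"], match["score"])
            if match["score"] > merge_threshold:
                merges.append(pair)
            elif match["score"] > suggest_threshold:
                # Assumption: a score of exactly merge_threshold lands here.
                suggestions.append(pair)
    return merges, suggestions


example_table = {
    "1": [
        {"id": "2", "score": 1.2},
        {"id": "3", "score": 0.9},
        {"id": "4", "score": 0.5},
    ]
}
merges, suggestions = classify(example_table)
print(merges, suggestions)
```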

Installation

git clone --single-branch --branch dev https://gitlab.memri.io/alpdeniz/pymemri
git clone https://gitlab.memri.io/plugins/person-deduplication
cd person-deduplication

Local

pipenv install

Docker

docker build -t person_deduplication .

Usage

Plugins are run after registering them into a Memri PoD.

To run the indexer manually:

python main.py {pod_url} {database_key} {owner_key}

With docker:

docker run person_deduplication:latest {pod_url} {database_key} {owner_key}

Development

pipenv install --dev

To test:

pytest ./tests/*

Schema

This plugin uses the types:

  • Person
  • SuggestedMerge

And the edges:

  • mergeFrom
  • mergedFrom

If the similarity score of two or more Person items passes the given merge threshold:

  • The items are merged by concatenating the edges and overwriting the properties.
  • Merged items are added as 'mergedFrom' edges to the new Person item.
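The merge rule above can be sketched on plain dicts. This is only an illustration: the dict layout, the edge names in the example, and the merge_persons function are hypothetical and do not use the pymemri API. "Overwriting" is read here as "later items win when the same property appears more than once", which is an assumption.

```python
def merge_persons(items):
    """Merge Person-like dicts: overwrite properties, concatenate edges, link sources."""
    merged = {"type": "Person", "properties": {}, "edges": []}
    for item in items:
        # Later items overwrite properties set by earlier ones.
        merged["properties"].update(item.get("properties", {}))
        # Edges from every source item are kept.
        merged["edges"].extend(item.get("edges", []))
    # Keep a 'mergedFrom' edge to each source so the merge can be traced back.
    merged["edges"].extend({"name": "mergedFrom", "target": item["id"]} for item in items)
    return merged


person_a = {
    "id": "p1",
    "properties": {"firstName": "Alice", "city": "Berlin"},
    "edges": [{"name": "hasPhoneNumber", "target": "n1"}],
}
person_b = {
    "id": "p2",
    "properties": {"firstName": "Alice", "lastName": "Smith"},
    "edges": [{"name": "address", "target": "x1"}],
}
merged = merge_persons([person_a, person_b])
print(merged["properties"])
```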

If the similarity score of two or more Person items does not pass the merge threshold but passes the given suggestion threshold:

  • A new 'SuggestedMerge' item is created with the calculated score and a task property that defaults to 'idle'.
  • Items suggested for merging are added to its 'mergeFrom' edge.
  • Suggestions can be displayed to the user to be either rejected or approved.
  • Rejected suggestions are deleted, whereas approved suggestions are marked task='done'.
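The suggestion lifecycle described above can be sketched with plain dicts and a list standing in for the PoD. The helper names (make_suggestion, review, pending) and the dict layout are hypothetical, and no pymemri calls are used.

```python
def make_suggestion(person_ids, score):
    """Create a SuggestedMerge-like dict with the default task state 'idle'."""
    return {
        "type": "SuggestedMerge",
        "score": score,
        "task": "idle",
        "mergeFrom": list(person_ids),
    }


def review(suggestions, suggestion, approved):
    """Mark an approved suggestion as done; delete a rejected one."""
    if approved:
        suggestion["task"] = "done"
    else:
        suggestions.remove(suggestion)


def pending(suggestions):
    """Return the suggestions that still await a user decision."""
    return [s for s in suggestions if s["task"] == "idle"]


store = [
    make_suggestion(["p1", "p2"], 0.95),
    make_suggestion(["p3", "p4"], 0.85),
    make_suggestion(["p5", "p6"], 0.90),
]
review(store, store[0], approved=True)   # stays in the store with task='done'
review(store, store[1], approved=False)  # removed from the store
print(pending(store))
```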

This structure enables:

  • Reverting an incorrect automatic merge by following the 'mergedFrom' edges.
  • Displaying possible duplicates to the user for a user-approved merge, by fetching 'SuggestedMerge' items filtered by task = 'idle'.

Notes

  • Very large datasets must be taken into consideration. Generators are recommended, although they may not fit a non-paginated HTTP API query to the PoD.
  • Academic and industry papers on Person similarity calculations still need to be reviewed. The edges of a Person should have a significant effect on matching.
  • Documents:

TODOS

  • Add GitLab CI
  • Create a more suitable dataset with clear expected results
  • Research using python dedup module